Is there a !GOOD! program or method to get email addresses from multiple word docs?

albanga Member Posts: 164
Hi All,

I have an important situation at work where we need to obtain email addresses from hundreds of Word documents. We have acquired another company whose database consisted of Word docs and spreadsheets. We are hoping to find an application or some other method that scans all these documents and writes every email address it finds to a separate file, to be used in a mass mail-out.

To go through each document one by one would be an admin nightmare and we just want to avoid it.

I know this can be done with an application because I have trialled several, but all of them failed to do the job properly. I did find one that did a good job against around 3 documents, but when I pointed it at a directory of 15 Word documents it crashed.

So I was hoping someone on here might have dealt with this before and has a good solution. I think by the time I try every dodgy piece of software, they could have hired someone to sift through every doc! :)

Any help would be much appreciated!

Comments

  • astorrs Member Posts: 3,139
    Something like this from Linux (or cygwin) should work:

    grep -aEihor '\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b' 'C:\FolderToSearch' | sort | uniq > emails.txt
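
One caveat with the one-liner above: `-i` makes the match case-insensitive, but `sort | uniq` still treats `Bob@Example.com` and `bob@example.com` as different lines. A minimal sketch of a normalising filter follows, assuming GNU or BSD grep. The function name `extract_emails` is hypothetical, the `\b` anchors are dropped because they are not part of POSIX ERE, and the TLD length is widened from `{2,4}` to `{2,}` as an assumption so that longer TLDs are not missed.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin, print unique lowercased addresses.
extract_emails() {
    # -a: treat binary input (such as .doc files) as text
    # -E: extended regex, -i: ignore case, -o: print only the matched part
    grep -aEio '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' |
    # Fold to lowercase so case variants collapse into one entry
    tr '[:upper:]' '[:lower:]' |
    # Sort and drop duplicates in one step
    sort -u
}
```

It reads from standard input, so it could be used as `extract_emails < some-file.doc > emails.txt`.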
  • astorrs Member Posts: 3,139
    Okay I decided to figure out how to translate it into PowerShell 2.0, here you go:
    $searchDirectory = "C:\Users\Andrew\Documents"
    $searchExtensions = "*.doc", "*.docx", "*.xls", "*.xlsx"
    $outputFile = "C:\Users\Andrew\Desktop\emails.txt"
    
    Get-ChildItem * -Include $searchExtensions -Path $searchDirectory -Recurse | `
    Select-String -Pattern "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[a-zA-Z]{2,4}\b" `
    -AllMatches | Select-Object -ExpandProperty Matches | Select-Object `
    -ExpandProperty Value | ForEach-Object { $_.ToString().ToLower() } | `
    Sort-Object | Get-Unique | Out-File $outputFile
    

    This will scan all files in $searchDirectory (and any subdirectories) with one of the extensions listed in $searchExtensions and save a list of all the unique email addresses it finds (no duplicates) in a file called $outputFile.

    Hopefully this will work for you. :D
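
One thing worth checking with either approach: `.docx` and `.xlsx` files are ZIP archives with compressed XML inside, so a regex run over the raw bytes may find nothing in them. The sketch below is a shell take on the same scan that unpacks those two formats with `unzip -p` first. The helper names `dump_office_file` and `scan_for_emails` are hypothetical, `unzip` is assumed to be installed, and addresses that Word splits across formatting runs, or that a legacy `.doc` stores as 16-bit text, would still be missed.

```shell
#!/bin/sh
# Hypothetical helper: write the raw contents of one Office file to stdout.
dump_office_file() {
    case "$1" in
        *.docx|*.xlsx|*.DOCX|*.XLSX)
            # ZIP-based formats: -p streams every archive member to stdout
            unzip -p "$1" 2>/dev/null ;;
        *)
            # Legacy .doc/.xls: pass the bytes through unchanged
            cat "$1" ;;
    esac
}

# Hypothetical helper: scan a directory tree, print unique lowercased addresses.
scan_for_emails() {
    find "$1" -type f \( -iname '*.doc' -o -iname '*.docx' \
        -o -iname '*.xls' -o -iname '*.xlsx' \) |
    while IFS= read -r file; do
        dump_office_file "$file"
        # Newline between files so text from two files cannot run together
        echo
    done |
    grep -aEio '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' |
    tr '[:upper:]' '[:lower:]' |
    sort -u
}
```

A call such as `scan_for_emails /path/to/documents > emails.txt` would mirror what the PowerShell version writes to `$outputFile`.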
  • carboncopy Member Posts: 259
    I don't think I will ever use that but that is pretty cool. :)
  • albanga Member Posts: 164
    Thanks astorrs, I handed it to my developer in the end. Your script didn't work as well as he had hoped, so he had to redo it. I will post the end result later, as he is currently away sick.

    Thanks for the reply though :D
  • Hyper-Me Banned Posts: 2,059
    That sounds like a typical developer response, lol
  • astorrs Member Posts: 3,139
    albanga wrote: »
    Thanks astorrs, I handed it to my developer in the end. Your script didn't work as well as he had hoped, so he had to redo it. I will post the end result later, as he is currently away sick.
    I didn't expect much, given that I couldn't test against your files, etc. I'd love to see what changes he made to it to meet your needs, for future reference.

    Either way, glad you got what you needed.
  • Ahriakin Member Posts: 1,799
    You can get eGrep for Windows (and SED etc.) so you can get some of that nice Linux CLI text manipulation natively.