find codes in text

I need to extract the codes that look like “8776_16_088_076” (they always have the XXXX_XX_XXX_XXX format…).
I just can’t work out how to do it. Help, please.

set theText to "8776_01_006_301 8776_01_002_047 8776_01_004_159
8776_03_018_093
8776_14_076_081

8776_15_084_021
C.
8776_15_086_042
B.
8776_15_087_025
8776_15_085_006
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi imperdiet consequat risus vel vehicula. Sed fringilla, risus vitae dignissim pharetra, quam elit mattis turpis, a consectetur nisl ligula nec orci. Quisque eu quam odio, non rutrum tortor. Morbi sed facilisis elit. Vivamus egestas eleifend gravida. Aliquam non turpis eget dolor eleifend porttitor. Donec id massa eros.

Sed eget eleifend arcu. Cras id mauris neque, nec malesuada metus. Duis suscipit congue enim ac luctus. Aenean tempus auctor dolor ut lacinia. Fusce convallis egestas ipsum non interdum. Donec a mauris nec lacus mattis tempor. Maecenas massa nulla, pretium vitae ultricies facilisis, dignissim vitae lorem. Nulla facilisi.

Phasellus eu erat in eros gravida laoreet. Nunc ac eros odio. Fusce in est sit amet dolor sagittis posuere eu fringilla ipsum. Maecenas orci lectus, porta nec malesuada id, condimentum sed ipsum. Duis quis urna enim, id porttitor lectus. Donec lacus purus, condimentum sit amet convallis eget, malesuada faucibus diam. Quisque vulputate feugiat dapibus. Praesent molestie elementum leo. Sed nulla dui, sagittis ac adipiscing vel, interdum at nisl. Phasellus quis malesuada risus. Maecenas quis diam diam, eu molestie elit.
A.
GRABAGIFT,
FOR A FUN
NIGHT'S SLEEP
8776_16_089_053
8776_16_088_076
RECEIVE A BONUS
SCARF
Conditions apply. See page 3 for details.
C.
8776_16_090_013
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi imperdiet consequat risus vel vehicula. Sed fringilla,"

every paragraph of (do shell script "egrep -o '[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}' <<< " & quoted form of theText)

Nice way to crush a guy, man :o Thank goodness you don’t know how many hours I’ve been trying to do this with delimiters.

Thank you so MUCH.

You’re welcome. However when I think about it, it’s not flawless. You need:

every paragraph of (do shell script "egrep -o '\\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\\b' <<< " & quoted form of theText)

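The \b marks a word boundary (it is written \\b inside the AppleScript string because the backslash has to be escaped). The code now has to be a whole word, so the characters immediately before and after it are not allowed to be word characters (letters, digits or underscores).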
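A small sketch of what the \b boundaries change, run straight from the shell. The sample text and the variable names are made up for illustration; the middle token hides a code-shaped string inside a longer run of word characters.

```shell
# Hypothetical sample: the middle token embeds a code-shaped substring.
text='8776_16_088_076 x98776_01_006_3015 8776_15_084_021'
pattern='[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}'

# Without \b, the substring inside the middle token is matched as well.
loose=$(printf '%s\n' "$text" | grep -Eo "$pattern")

# With \b on both sides, only codes that stand alone as whole words match.
strict=$(printf '%s\n' "$text" | grep -Eo "\\b${pattern}\\b")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern reports three codes, the strict one only the two that stand on their own.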

How would I grep a txt file?

Hello.

You would grep a textfile like this.

grep [options] regexp aFile

Try opening a Terminal window and typing man grep to see what options you have. A good man page for learning regexps is man re_format.
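As a rough illustration of the grep [options] regexp aFile form, here is a sketch that searches a file for the codes from this thread. The temporary file and its contents are hypothetical stand-ins for your own .txt file.

```shell
# Hypothetical sample file standing in for your own .txt file.
sample=$(mktemp "${TMPDIR:-/tmp}/codes.XXXXXX")
printf '%s\n' 'A. 8776_16_089_053' 'RECEIVE A BONUS' '8776_16_088_076 SCARF' > "$sample"

pattern='\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\b'

# -E: extended regular expressions; -o: print only the matched part of each line.
codes=$(grep -Eo "$pattern" "$sample")

# -c: count the lines that contain at least one match.
matching_lines=$(grep -Ec "$pattern" "$sample")

printf '%s\n' "$codes"
printf 'lines with a code: %s\n' "$matching_lines"

rm -f "$sample"
```

grep -E is the same thing as egrep; man grep lists the remaining options.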

Maybe you should google for a bash tutorial. :slight_smile:

easier said than done… my head hurts

Hi rapdigital,

I sympathize with you. :slight_smile:

There are good tutorials here:

http://www.grymoire.com/Unix/

I haven’t read all of them lately, but if you have the time they’re easy to follow.

gl,

I’m not surprised! I know a bit about regex and couldn’t follow “re_format” at all! There’s a kinder (though still occasionally head-hurting) tutorial here. Unix’s egrep and sed programs don’t necessarily implement all the features covered, but it only takes a little trial and error to find out what works in either.

There are probably other good tutorials out there too. I see kel1’s just mentioned one. I found grymoire’s sed tutorial easy to follow, so the regular expressions one is probably equally digestible.

Hello.

I agree with Nigel: the learning curve is somewhat steep to begin with, and Nigel’s tutorial is very nice. Let me add something to it. :slight_smile:

First of all, a regexp can be viewed as a language that describes sets, like the sets and Venn diagrams you were taught in school.

Come to think about it: there should be a regexp tutorial included with TextWrangler, since TextWrangler’s find command supports grep. That is an easy-going tutorial for starters. I also think this Wikipedia page treats the subject well, as an overview.

Then I’d fill in with Nigel’s page.

By the way, I have made a script for looking up words in the dictionary by entering regular expressions. You can use it to play with as you read about regular expressions, so you can try out what you have learned. It can be found in post #9 in this thread.

But I must say, I think the best place to play with regexps is the command line. Webster’s dictionary can be found on your computer at /usr/share/dict/web2. :slight_smile:

I also think that the book “The UNIX Programming Environment” treats regexps and basic shell scripting both gently and thoroughly. Although it was written for the Bourne shell, all the examples work with bash.
Try to get hold of that book, and read chapters 1, 2 and 3.

Every tutorial on this grymoire page may help you, but skip the C shell tutorials, since we use bash, which is a descendant of the Bourne shell. In fact, start with only those concerning the shell and regular expressions. The basics you need are regexps, quoting, and an understanding of how entering commands works: how commands are looked up via the $PATH variable, and the simplest forms of input and output redirection.
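To make the "regexps describe sets" idea concrete, here is a minimal sketch to play with from the command line. It uses a tiny made-up word list so that it runs anywhere; on a Mac you could point the hypothetical $words variable at /usr/share/dict/web2 instead.

```shell
# Hypothetical miniature word list (swap in /usr/share/dict/web2 on a Mac).
words=$(mktemp "${TMPDIR:-/tmp}/words.XXXXXX")
printf '%s\n' apple banana cherry grape kiwi > "$words"

# The set of words that start with a vowel.
vowel_start=$(grep -E '^[aeiou]' "$words")

# The set of words that are exactly five lowercase letters long.
five_letters=$(grep -E '^[a-z]{5}$' "$words")

# The intersection of the two sets, written as a single pattern.
both=$(grep -E '^[aeiou][a-z]{4}$' "$words")

printf 'vowel start: %s\n' "$vowel_start"
printf 'five letters:\n%s\n' "$five_letters"
printf 'both: %s\n' "$both"

rm -f "$words"
```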

Edit

The place I have learned the most about regexps is this AppleScript forum, by following the threads and playing with the samples until I understood them. But that comes later; learning to use regexps is a bit like growing potatoes, it needs some time.

If I were you, I’d start out by learning the basic regexp syntax and the grep command, either from Terminal or through my script. You can edit the regexps in both places, but you can only play with grep’s options from the Terminal.

There are some commands that make your life in the terminal easier.

pbcopy <afile puts the contents of a file on the clipboard. pbpaste >afile puts the contents of the clipboard into afile.

sed -n l afile shows tabs and such as \t, and is a great help when you want to understand what the pattern needs to look like. od -cb afile is a variation on that theme, but more detailed.
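A quick sketch of how sed's l command (run as sed -n l) and od expose invisible characters. The sample line, with a tab in the middle, is made up for illustration.

```shell
# A hypothetical line containing a tab between the two fields.
visible=$(printf 'code\t8776_01_006_301\n' | sed -n l)

# sed's l command prints the tab as \t and marks the end of the line with $.
printf '%s\n' "$visible"

# od -c dumps the same bytes one character at a time, with byte offsets.
printf 'code\t8776_01_006_301\n' | od -c
```

Seeing the \t in the sed output tells you the separator is a tab rather than spaces, which is what the pattern then has to allow for.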

Thanks guys
I definitely want to look at this in greater depth when I have some more time.

For now I’m trying to get this to work.

I have this, but the returned text is garbled:

on run
	set theF to (choose file) -- a .txt file
	set theFile to read theF as Unicode text
	my processpdf(theFile, theF)
end run

on processpdf(theFile, theF)
	set searchtxt to every paragraph of (do shell script "egrep  -o  '\\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\\b' <<< " & quoted form of theFile)
	set theList to searchtxt as text
	set the clipboard to theList
	set f to (open for access theF with write permission)
	set eof of f to 0 --> empty file contents if needed
	write (theList as Unicode text) to f --as text
	close access f
end processpdf



Anyone know why the text is messed up?

Hello.

Try changing Unicode text to text. What I think happens is that reading as Unicode text treats the file as UTF-16, whereas reading as text treats it as UTF-8. At least it seems like that in Script Debugger, as I don’t have to read Chinese when using the latter. :slight_smile:

But then the shell script fails?

It really shouldn’t fail, but if you have overwritten the original file with the file as Unicode text, then it would fail of course.
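One possible way to see why the encoding matters to egrep, sketched under the assumption that iconv is installed: the same line of text is searched once as UTF-8 and once as UTF-16. In UTF-16 every digit is accompanied by a zero byte, so the digits are no longer adjacent and the pattern cannot match.

```shell
pattern='[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}'

# The line as plain UTF-8 bytes: grep counts one matching line.
utf8_hits=$(printf '8776_01_006_301\n' | grep -Ec "$pattern" || true)

# The same line as UTF-16 (big-endian, no BOM): the interleaved zero bytes break the match.
utf16_hits=$(printf '8776_01_006_301\n' | iconv -f UTF-8 -t UTF-16BE | grep -Ec "$pattern" || true)

# Converting back to UTF-8 before searching restores the match.
roundtrip_hits=$(printf '8776_01_006_301\n' | iconv -f UTF-8 -t UTF-16BE | iconv -f UTF-16BE -t UTF-8 | grep -Ec "$pattern" || true)

printf 'utf-8: %s  utf-16: %s  converted back: %s\n' "$utf8_hits" "$utf16_hits" "$roundtrip_hits"
```

This is only an illustration of the byte-level effect, not a diagnosis of the particular file in this thread.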

I tried your example text and script, and from the command line I tried DJ Bazzie Wazzie’s second egrep expression, and that worked for me.

Found this http://macscripter.net/viewtopic.php?id=24535 by julifos

and added his

--> now write the flag
write ((ASCII character 254) & (ASCII character 255)) to f --> not as Unicode text

that fixed it

Thanks All

Hello.

I am glad you solved it. It is, however, much better to do without a BOM if you can.

Here is a slight modification of your code. I have set AppleScript's text item delimiters to linefeed, so that the codes are joined one per line as they were read in, and it doesn’t use a BOM: I simply write the file out as text.

on run
    set theF to (choose file) -- a .txt file
    set theFile to read theF as text
    my processpdf(theFile, theF)
end run

on processpdf(theFile, theF)
    set searchtxt to every paragraph of (do shell script "egrep -o '\\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\\b' <<< " & quoted form of theFile)
    set {tids, AppleScript's text item delimiters} to {AppleScript's text item delimiters, linefeed}
    set theList to searchtxt as text
    set AppleScript's text item delimiters to tids
    set the clipboard to theList
    set f to (open for access theF with write permission)
    set eof of f to 0 --> empty file contents if needed
    write (theList as text) to f --as text
    close access f
end processpdf

Here is the result:

8776_01_006_301
8776_01_002_047
8776_01_004_159
8776_03_018_093
8776_14_076_081
8776_15_084_021
8776_15_086_042
8776_15_087_025
8776_15_085_006
8776_16_089_053
8776_16_088_076
8776_16_090_013
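For comparison, the same "replace the file's contents with just the codes" step can be sketched in plain shell. The file names are hypothetical, and the sample file stands in for the one picked with choose file.

```shell
# Hypothetical source file standing in for the chosen .txt file.
src=$(mktemp "${TMPDIR:-/tmp}/codes.XXXXXX")
printf '%s\n' '8776_01_006_301 8776_01_002_047' 'C.' '8776_15_086_042' 'Lorem ipsum' > "$src"

# Write the matches to a second file first, then move it over the original.
# Redirecting grep's output straight into "$src" would empty the file before grep reads it.
out="$src.codes"
grep -Eo '\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\b' "$src" > "$out" && mv "$out" "$src"

# The file now holds one code per line, linefeed delimited, with no BOM.
result=$(cat "$src")
printf '%s\n' "$result"

rm -f "$src"
```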

Awesome thank you

This seems to work


on run
	set theF to (choose file) -- a .txt file
	set theFile to read theF as Unicode text
	my processpdf(theFile, theF)
end run

on processpdf(theFile, theF)
	set searchtxt to every paragraph of (do shell script "egrep  -o  '\\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\\b' <<< " & quoted form of theFile)
	set {tids, AppleScript's text item delimiters} to {AppleScript's text item delimiters, linefeed}
	set theList to searchtxt as Unicode text
	--set the clipboard to theList
	set f to (open for access theF with write permission)
	set eof of f to 0 --> empty file contents if needed
	write ((ASCII character 254) & (ASCII character 255)) to f --> not as Unicode text
	write theList to f as Unicode text
	close access f
end processpdf

You guys are really amazing

Hello

You should really set back the text item delimiters again to their previous value. And don’t coerce to Unicode text, text will suffice. :slight_smile:

Thanks, good catch.

Now you’re undoing what I just did :smiley: : converting linefeed-delimited data into a list. The data returned from the shell is already linefeed delimited, so you only need to remove the coercion. This can keep your code simpler, like this:

set theList to do shell script "egrep  -o  '\\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\\b' <<< " & quoted form of theFile without altering line endings as unicode text

I added ‘without altering line endings’ because with this option the do shell script command doesn’t convert line endings into returns.