find codes in text


I just kept it as simple as possible, not introducing any new stuff, just making it work for the OP.

Let me add to this while I am at it: if you have diacriticals in your text and they come out badly, then writing the BOM may help with that, so don’t toss the code away.

Generally speaking, if you are going to write the text back out into a file again, it is best to do so within the same shell script; you can use the tee command to get the text back out. This technique matters when it comes to diacriticals, as it is not guaranteed that diacriticals come out right again once the text has been transformed from UTF-8 into UTF-16 and back.

Should this be a problem, then the iconv utility may be handy. :slight_smile:

Here is an example that won’t transform any diacriticals, using the tee command to pass a copy of the output into a file.

on run
	set theF to (choose file) -- a .txt file
	my processpdf(theF)
end run

on processpdf(theF)
	set infile to POSIX path of theF
	set outfile to infile & ".out"
	set searchtxt to every paragraph of (do shell script "egrep -o '\\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\\b' <" & quoted form of infile & " |tee  " & quoted form of outfile)
	set the clipboard to searchtxt
end processpdf

The UTF-16 to UTF-8 issue is not with AppleScript and bash; both use composed character sets, and so does iconv. However, you could write the data immediately to the file as UTF-16 if you want, and you can use xxd to write a BOM. There is no reason for concern about uncomposed UTF-8 output; otherwise iconv should be made aware of it (when no input character set is given).

Now using the character conversion with iconv.

set fp to quoted form of POSIX path of ((choose file) as string)
do shell script "DATA=$(cat " & fp & ");xxd -r <<< 0xFEFF > " & fp & " && egrep -o '\\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\\b' <<< $DATA | iconv -t UTF-16BE >> " & fp

That’s right, I mixed things up, and was going to go back and edit my post.

The issue I wanted to address was this: when the application that will open the file defaults to reading, say, UTF-16LE, and you have written the file out as UTF-8, then having a BOM saying that the file is indeed UTF-8 helps.

I am reasonably sure that AppleScript will convert characters back and forth between UTF-16 and UTF-8 correctly by itself.

There is one caveat, however: when you use a do shell script to read in a UTF-16 file, things will break, as the do shell script may misinterpret UTF-16 characters with diacriticals, and from there on you are working with characters that are wrong.


The fix for this is to use iconv to prepare an intermediate file first, which you then pass through the filter chain (command string).

That’s right too, but then you would need to read the file with iconv first. For instance, when the given file is UTF-16 as well, you need iconv instead of the cat command in my example above:

--if the given file is utf-16
set fp to quoted form of POSIX path of ((choose file) as string)
do shell script "DATA=$(iconv -f UTF-16 " & fp & ");xxd -r <<< 0xFEFF > " & fp & " && egrep -o '\\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\\b' <<< $DATA | iconv -t UTF-16BE >> " & fp

Now the processing of the actual data still remains UTF-8. This is needed because (e)grep won’t work with multibyte characters; it can only work with single bytes.


While we are on this subject :slight_smile: I find it worth mentioning that if you have diacriticals in characters that you are to feed into a regexp interactively through a do shell script command, it is a good idea to remove the diacriticals first, as many shell tools may only deal with the 7-bit ASCII character set [a-z][A-Z].

You can do that by using a considering diacriticals clause, and then building up new search words by looking up the characters in uppercase and lowercase “alphabet strings”.

I should have provided some code for this, but it should be fairly easy to implement, and I have to be productive. :slight_smile:


Grep, awk and sed seem to work fine with UTF-8 diacriticals.

You need a BOM when writing Unicode text to a file destined to be read by some application because for historical reasons, ‘write’ writes UTF-16 in big-endian form, whereas on a machine with an Intel processor, the application’s likely to assume the text is in native little-endian form unless told otherwise.

The big-endian UTF-16 BOM is FEFF in hexadecimal notation. You’ve written this as one byte of ‘ASCII character 254’ (FE) and one of ‘ASCII character 255’ (FF). ‘ASCII character’ is now deprecated and shouldn’t be used. It would be better from all angles to write a data object instead:

set f to (open for access theF with write permission)
try
	set eof f to 0 --> Empty the file.
	write «data rdatFEFF» to f --> Big-endian BOM.
	write theList to f as Unicode text
end try
close access f
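For comparison, here is a shell-side sketch of the same idea (my own example, with a hypothetical scratch file name): write the FE FF bytes first, then the big-endian text after them.

```shell
# Sketch: build a UTF-16BE file by hand, BOM first, then converted text.
# "bom_demo.bin" is a hypothetical scratch file.
printf '\376\377' > bom_demo.bin                       # FE FF, the big-endian BOM
printf 'Hi' | iconv -f UTF-8 -t UTF-16BE >> bom_demo.bin
od -An -tx1 bom_demo.bin                               # fe ff 00 48 00 69
```

Note that iconv’s plain UTF-16BE target writes no BOM of its own, which is why it is prepended by hand here.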

Not entirely; most shell tools work with bytes, not encodings. The important thing is that the arguments and the data from the file or stdin are both in the same encoding. In earlier versions of Mac OS X, bash was MacRoman encoded. When a character é was sent as an argument in a regular expression, grep was looking for byte value 142 in files. When the file was, for instance, Windows Latin 1 (CP1252) encoded, byte 233 was used for the character é.

Nowadays we’re having more problems with regular expressions because of UTF-8. With regular expressions you can point to a character or a character range. The character range is where it goes wrong in UTF-8, because a multibyte character is interpreted as multiple characters in the range. This has nothing to do with grep, sed, awk or anything else, but with the regex library these tools use.

There are wide-character regex libraries today (which use UTF-32), but they still aren’t used in Mountain Lion.

It’s not that you’re wrong about diacriticals, but it’s not the 7-bit ASCII character issue you describe. You can solve this by sending the parameters and data in an 8-bit extended ASCII character set like MacRoman or CP1252.


Now, just having a regexp library that is native to BSD, and works!, is a huge step in the right direction. We have had the ICU library since Leopard, I think (this also interprets UTF-16). UTF-32 is very nice to work with, at least going to and from UTF-8! :slight_smile: (We’ve also had a BSD regexp library, but one which didn’t work with wide chars, for instance on Snow Leopard; it really cheered me up to see that the guys at Apple are improving stuff (Darwin Opensource).)

Another framework to be considered is CoreFoundation. I have never used regexps from CoreFoundation, but if that option exists, I believe it would be the easiest one, although not as multi-platform as the two other alternatives.

Of course I agree fully on the byte stream, and converting to Latin 1 or MacRoman is a nice trick! :slight_smile:


By the way: the tools in the Heirloom project, open sourced by Caldera (formerly the Santa Cruz Operation), at least support multi-byte characters. It wouldn’t surprise me if the tools support UTF-8 regexps. Full source code is included.

One more thing: I have downloaded and compiled GNU coreutils privately; I trust them completely to deal with UTF-8 in regular expressions. Though I will test them first, should I need to use them for something. :slight_smile:

I installed them under /opt/bin/libexec so as not to interfere with the regular tools, since replacing those can break functionality in other apps.

Well, I was referring to the programming perspective, because UTF-32 uses unsigned integers, which are equally fast to process as normal 8-bit byte code (single-byte characters), making it the fastest Unicode encoding to work with. A variable-length encoding needs a huge character set library and is considerably slower. I wasn’t speaking of UTF-32 at the user level, for data that goes in or out of the process; that can be any encoding, of course.


Looking at the world and its character sets, I think we both agree that, taken as a whole, we would be best served with UTF-16, but UTF-8 is indeed much better for us in the western hemisphere.

I got what you meant, and wide chars do take space, don’t they. But it is much easier to convert from UTF-8 to UTF-32 than from UTF-8 to UTF-16. I also think addressing individual UTF-32 chars should be faster than UTF-8 bytes, since you don’t have to do extra work for alignment. For all I know, both representations take up the same space on a CPU that uses 64 bits for addressing, at least if the compiler optimizes for speed, or so I believe. UTF-8 is overall the easiest when you can just operate on a stream of bytes, both as a user and as a programmer; UTF-32 when you can’t, as a programmer, at least from a BSD perspective (tool-making). I am sure that view will change once you deal with NSText and CFStrings (writing software that integrates with the GUI of the OS X platform, or with XNU at the kernel level). :slight_smile:
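The fixed-width property that makes UTF-32 easy to address is quick to see in a shell (a sketch of my own using iconv and od; the sample character is made up):

```shell
# One Unicode code point ('é', U+00E9) becomes exactly one 4-byte unit in UTF-32BE,
# whereas in UTF-8 it occupies two bytes.
printf 'é' | iconv -f UTF-8 -t UTF-32BE | od -An -tx1
```

Every code point lands at a predictable 4-byte offset, so indexing the n-th character is plain pointer arithmetic, which is the point being made above.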


Just for the record: ICU is so much more than just reg-exp’s, it is really “Localization R Us”. :slight_smile:

And maybe UTF-32 would be best for representing Unicode after all, setting aside the deliberate waste of space, at least in userland; that move would eliminate a lot of problems with regard to encoding and text transformations.

Can you guys explain how the following works?


So after a bit more time on Google, I’m beginning to love regexps. i.e.

This searches for a number pattern matching xxxx_xxxx_xxxx_xxx; the last group can be between 3 and 6 characters.
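The pattern itself wasn’t quoted above, so here is a sketch of my own reconstruction matching that description (three groups of four digits, then a final group of 3 to 6; the sample input is made up):

```shell
# Hypothetical reconstruction of the described pattern, tried on a made-up line.
printf 'id 1234_5678_9012_34567 x\n' \
  | grep -Eo '\b[0-9]{4}_[0-9]{4}_[0-9]{4}_[0-9]{3,6}\b'
```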

to include letters in the pattern add

This post is mainly for my benefit. But I like to think I’m contributing :rolleyes:


To search for ranges, say you were to find every year between 1970 and 2036 in your text, you would use the grouping operator of egrep.

The search can be expressed as 1970…2036 in human form, (a range).

egrep -o \\b([1][9][7,8,9][0-9]|[2][0][0-2][0-9]|[2][0][3][0-6])\\b

As a regexp it would look like the above, where I had to make three alternatives separated by ‘|’ to cover the full range.

Very cool but wouldn’t

egrep -o \\b([1][9][7,9][0-9]|[2][0][0-2][0-9]|[2][0][3][0-6])\\b

work in the same way? I know it’s only one extra character in this instance but…


A comma within a character class points to an alternate, specific value, while a dash is used to specify a range, so if you wanted to write [7,8,9] more concisely, you would express that as [7-9] and not [7,9].

I advise you to have a look into the various sources regarding regular expressions we have mentioned to you earlier.

The syntax you are using is the extended regexp syntax, not the basic regexp syntax, when you read documents about regexps. There is a lot more escaping in basic regexps; a “count group” {3}, for instance, would look like \{3\}, a group ( .*) would look like \( .*\), and “|” doesn’t work at all, so you would have to go for a whole other scheme.

I wonder if Nigel’s page is best after all when it comes to creating concrete regular expressions, or the Grymoire regexp page.

A comma within a character class means that a comma is one of the acceptable values for that character: i.e. [7,9] means “7” or “,” or “9”.
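This is easy to check directly (the sample lines are made up):

```shell
# [7,9] with grep -x (whole-line match): it accepts "7", "," and "9", but not "8",
# because the comma is a literal member of the class, not a range separator.
printf '7\n8\n9\n,\n' | grep -x '[7,9]'
```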

Except in certain useful cases, a character class containing only one character is the same as that character anyway, so:

"egrep -o '\\b(19[7-9][0-9]|20[0-2][0-9]|203[0-6])\\b'"

Or of course, if you’re feeling really anal:

"egrep -o '\\b(19[7-9][0-9]|20([0-2][0-9]|3[0-6]))\\b'"
-- Or:
"egrep -o '\\b((19[7-9]|20[0-2])[0-9]|203[0-6])\\b'"
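As a quick sanity check of the tightened range pattern, here it is run over a few made-up sample years just inside and outside the 1970–2036 range:

```shell
# Only the four in-range years should come out.
printf '1969\n1970\n1999\n2029\n2036\n2037\n' \
  | grep -Eo '\b(19[7-9][0-9]|20[0-2][0-9]|203[0-6])\b'
```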

Hello Nigel.

You are right, I think I have the comma from some ancient syntax somewhere, maybe globs? I’ll have to figure out where I got it from, or whether I dreamt it up. :slight_smile:


You don’t use “,” to separate characters in a character class in shell-glob syntax either, so I dreamt it up. :slight_smile:

Well, I can explain how regular expressions work, which makes the code above more understandable. A regular expression is nothing more than a character sequence/string prediction. With repetitions and boundaries you describe what a string should look like. It’s an advanced search utility.

When you search for an exact match, regular expressions are simple as well. If you search for ‘Hello world’, the regular expression will only do an exact match, like any other basic search. Because the match is done at byte level and not character level, the matches are case sensitive. That means that if the given string contains ‘hello world’, there is no match. Some tools, like grep, have an option to do a case-insensitive match.

When a string starts at the beginning of a sentence, the first character is capped. To get those strings as well, we can define character ranges. In our previous example we talked about an exact match, but basically every single character is a character range with only one character in it. Character ranges are defined within brackets ([ and ]). So ‘Hello world’ is exactly the same as ‘[H][e][l][l][o][ ][w][o][r][l][d]’, but that is almost unreadable; single characters shouldn’t be surrounded by brackets. To make a regular expression respond to multiple first characters we could write ‘[Hh]ello world’. Now, whether this string is at the beginning of a sentence or not, it will match, because we have said that the first character of the match can be H or h.
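A minimal check of the [Hh] range (the input lines are made up):

```shell
# Both capitalisations match; 'Jello' does not, so grep -c counts 2 matching lines.
printf 'Hello world\nhello world\nJello world\n' | grep -c '[Hh]ello world'
```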

[0-9], which is the same as [0123456789], defines a character range where the byte values (according to the ASCII table) follow each other. The hyphen means that it will look for the byte value left of the hyphen, the byte value right of the hyphen, and all byte values in between. Also, [a-z] is the same as [abcdefghijklmnopqrstuvwxyz]; no commas, I assume McUsr used pseudo code.

Then there are some macros that you can use. For instance, [[:digit:]] is the same as [0-9], which is the same as [0123456789]. There is also [[:alnum:]], which is the same as [0-9a-zA-Z]. There are also shorter macros, such as \b. This short macro means word boundary: only non-word characters are allowed next to the match. It is roughly the same as [^0-9a-zA-Z], where the caret means negation. Another much-used macro is the period; it means any character is allowed. When using the macro \b with our example, we’re saying that we only want it as a complete word. Right now, ‘[Hh]ello world’ would also match within a string like ‘hello worlds’; when we wrap our expression in word boundaries we avoid this. So our expression should look something like ‘\b[Hh]ello world\b’ (NOTE: in an AppleScript string we need to write \\b because \ has a special meaning there, not because of the regular expression).
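The effect of the boundaries can be seen directly (the sample strings are mine):

```shell
# The first line matches; in the second, 'world' is followed by the word
# character 's', so the trailing \b fails and nothing is printed for it.
printf 'Hello world!\nhello worlds here\n' | grep -Eo '\b[Hh]ello world\b'
```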

Finally we have to define how many times our ranges match. By default the match is exactly once; that is how we can do exact matches. But we can also define optional matches, or matches with static or variable lengths (repetitions). There are *, ?, +, {n}, {n,} and {n,m}, which mean (copied from the grep man page):

?      The preceding item is optional and matched at most once.
*      The preceding item will be matched zero or more times.
+      The preceding item will be matched one or more times.
{n}    The preceding item is matched exactly n times.
{n,}   The preceding item is matched n or more times.
{n,m}  The preceding item is matched at least n times, but not more than m times.

It’s a bit of overkill, but our hello world expression has a repetition in it: the double l. So to make sure there are two l’s in hello, we could write ‘\b[Hh]el{2}o world\b’. Because l is a single character, we don’t need brackets around it. Defining repetitions sometimes makes your expression more readable. When I want numbers from a text that are 8 digits long, I could write an expression like ‘[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]’, or I could write ‘[0-9]{8}’, which is obviously better to read.
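The 8-digit example can be tried directly (the input line is made up):

```shell
# {8} requires exactly eight consecutive digits, so the 3-digit run is ignored.
printf 'order 12345678 and 123\n' | grep -Eo '[0-9]{8}'
```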

Basically there are two things to remember: 1) every character, if not surrounded by brackets, should be considered a character range of only one character; everything is a character range. 2) Define the number of repetitions if it’s not 1.

An expression like ‘rapdigital’ is just the same as ‘[r]{1}[a]{1}[p]{1}[d]{1}[i]{1}[g]{1}[i]{1}[t]{1}[a]{1}[l]{1}’.

Now back to your question, now that we understand the basics of regular expressions:


Pseudo code:

\\b (\b) : word boundary
[0-9]    : only characters 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are allowed
{4}      : repeat the preceding match 4 times
_        : equal to [_]; only underscore allowed, no repetition (1 match)
[0-9]    : only characters 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are allowed
{2}      : repeat the preceding match 2 times
_        : equal to [_]; only underscore allowed, no repetition (1 match)
[0-9]    : only characters 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are allowed
{3}      : repeat the preceding match 3 times
_        : equal to [_]; only underscore allowed, no repetition (1 match)
[0-9]    : only characters 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are allowed
{3}      : repeat the preceding match 3 times
\\b (\b) : word boundary
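Putting the pieces together, the whole pattern can be tried in a shell (the sample line is made up):

```shell
# The full code pattern from the thread: 4_2_3_3 digit groups between word boundaries.
printf 'see 1234_56_789_012 but not 12_3\n' \
  | grep -Eo '\b[0-9]{4}_[0-9]{2}_[0-9]{3}_[0-9]{3}\b'
```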


That was a really great post DJ! :slight_smile:

I’ll just add a couple of details, that may or may not be of interest:

The empty set is a subset of every set. This implies that if something may turn out not to be part of the match at all, i.e. it has a * as a quantity specifier, or {0,1} or ?, then we are still dealing with a match even if that part isn’t fulfilled. This may seem obvious when you read it like this, but it isn’t necessarily so when you construct regular expressions.


A good example of the empty match is DJ Bazzie Wazzie’s usage of \b to denote a word boundary in the regexp in the post above this one. Had he not used it, the pattern could also have matched inside longer strings, since at both ends of the regular expression nothing would have constrained what comes before or after the match.

Regular expressions are greedy by nature, i.e. they are parsed from left to right, and the leftmost expression will match as much as it can, without any regard to what comes later in the regular expression, and this may lead to a NOMATCH of the regular expression as a whole (provided we haven’t used any quantifier that specifies an exact number).

We often make up for the greediness by inverting a character class at some point, to be sure that the regexp stops, so that we can get a match on the next part of it.
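Greediness is easy to demonstrate (the input string is made up):

```shell
# .* is greedy: the match extends to the LAST X on the line, not the first.
printf 'aXbXc\n' | grep -o '.*X'
```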

Say you wanted a regular expression that gave you a POSIX filename in return without its path; such a regular expression could look like:
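The expression itself went missing above; one sketch of my own that fits the description uses an inverted class anchored at the end of the line (the sample path is made up):

```shell
# [^/]*$ : the longest run of non-slash characters ending at end of line,
# i.e. the filename part of a POSIX path.
printf '/Users/me/Documents/report.txt\n' | grep -o '[^/]*$'
```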


^ and $ are anchors, denoting the start and end of a line; basic and extended regular expressions are line-oriented by nature. There are also anchors for the start and end of words; look them up, should you need them. :slight_smile:

We can often also make regular expressions easier by using inverted character classes. Say you want a regular expression that prints out all lines containing non-printable characters; then an inverted class is easier than enumerating the characters that aren’t printable.
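A sketch of that idea using an inverted POSIX class (the sample lines, with an embedded tab, are made up):

```shell
# [^[:print:]] matches any character outside the printable class; the tab in the
# second line is a control character, so only that line is printed.
printf 'clean line\nbad\tline\n' | grep '[^[:print:]]'
```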


What I am trying to say is that it is often easier to express what we are not looking for. And then we use that, of course.

Thanks and great input too…:cool:

Awesome guys thanks so much.