Problem with GREP

CalvinFold · May 1, 2009, 2:45pm

Not understanding GREP well enough to fix a problem.

I have a GREP search routine that has worked fine for a multitude of purposes, but I think I’ve got an escape character issue with a specific search I need.

I am searching for the string “P.r.o.g.r.e.s.s.i.v.e” in the header of a JPEG file in order to convert it to a standard-encoded JPEG.

I have a GREP handler, but the problem is it seems to be searching just for the “P”:

on grepForString(path_to_grep, search_list)
	repeat with current_grep_item in search_list
		try --known bug between AppleScript and GREP where if GREP finds nothing, AppleScript errors-out
			do shell script "/usr/bin/grep --count " & quoted form of current_grep_item & " " & quoted form of POSIX path of path_to_grep
			set grep_result to result
			exit repeat
		on error error_message number error_number
			if error_message is "0" then -- grep didn't find anything
				set grep_result to 0
			else
				-- pass on the error
				error error_message number error_number
			end if
		end try
	end repeat
	return {grep_result, contents of current_grep_item}
end grepForString

And I’m simply trying to use it like this:

on illegalJPEG(some_file)
	set header_grep to grepForString(some_file, "P.r.o.g.r.e.s.s.i.v.e")
	
	if item 1 of header_grep = 0 then --grepForString returns a list: {count, search string}
		return false
	else
		return true
	end if
end illegalJPEG

Problem is, grepForString is returning 35 instances of “P” and not 1 instance of “P.r.o.g.r.e.s.s.i.v.e” on my test file.

Mark67 · May 1, 2009, 3:26pm

Your full stop is a metacharacter. Have you tried “P[.]r[.]o[.]g[.]r[.]e[.]s[.]s[.]i[.]v[.]e” to return it to its literal meaning? Untested.

CalvinFold · May 1, 2009, 3:44pm

Still returning 35 instances of “P”. :-/

I was wondering if there was some way to clarify I wanted the whole string, literally, and your idea was along those lines. I don’t mind manually “escaping” the errant characters if need be, this is a one-off use. (Though points to anyone who knows how to fix the problem with the grepForString handler itself.)

Mark67 · May 1, 2009, 3:49pm

Calvin, give this one a try. I’ve only ever tried grep using “-E” (extended) for pattern matches; I mostly use Satimage myself.

set x to choose file

try
	do shell script "/usr/bin/grep -E " & quoted form of "P[.]r[.]o[.]g[.]r[.]e[.]s[.]s[.]i[.]v[.]e" & " " & quoted form of POSIX path of x
end try

returns

“Binary file /Users/marklarsen/Desktop/P4200091.jpg matches”

CalvinFold · May 1, 2009, 4:18pm

Not sure how that helps, since I’m searching the “innards” or contents of a JPEG file, not the filename. Maybe I’m misunderstanding, or I myself am being unclear. Here’s the project so far, with your changes (still returning 35 instances of “P”):

(still prototyping, haven’t implemented the full feature set…trying to get the detection of the JPEG type working before adding the “fix it” code)

--
-- SplashPhoto Normalizer
-- by Kevin Quosig, May 2009
--
-- Used to drag-n-drop files to convert any progressive JPEGs to standard.
--

--
-- DECLARE PROPERTIES
--
-- debugging on?
property g_debug : true

--basic file path and names
property g_home_folder_path : path to home folder
property g_log_file_name : "SplashPhoto Normalizer.txt"


--
-- UTILITY HANDLERS
--

-- Log Entry Generation
-- With help from StephanK of MacScripter
-- http://bbs.applescript.net/viewtopic.php?pid=76607#p76607
--
on logMe(log_string, indent_level)
	if g_debug is true then --allows turning the debugger on and off so my logMe's can be left in final version
		set log_target to (g_home_folder_path & "Library:Logs:" & g_log_file_name) as text
		try
			set log_file_ref to open for access file log_target with write permission
			repeat indent_level times
				write tab to log_file_ref starting at eof
			end repeat
			write ((log_string as text) & return) to log_file_ref starting at eof
			close access log_file_ref
			return true
		on error
			try
				close access file log_target
			end try
			return false
		end try
	end if
end logMe

-- revised GREP routine courtesy of
-- Bruce Phillips of MacScripter
-- http://bbs.applescript.net/viewtopic.php?pid=83871#p83871
--
on grepForString(path_to_grep, search_list)
	my logMe("BEGIN grepForString Handler", 1)
	
	repeat with current_grep_item in search_list
		try --known bug between AppleScript and GREP where if GREP finds nothing, AppleScript errors-out
			do shell script "/usr/bin/grep -E --count" & quoted form of current_grep_item & " " & quoted form of POSIX path of path_to_grep
			set grep_result to result
			exit repeat
		on error error_message number error_number
			if error_message is "0" then -- grep didn't find anything
				set grep_result to 0
			else
				-- pass on the error
				error error_message number error_number
			end if
		end try
	end repeat
	
	my logMe("¢ RETURN: " & grep_result & "x | " & current_grep_item, 1)
	my logMe("END grepForString Handler", 1)
	
	return {grep_result, contents of current_grep_item}
end grepForString

-- See if the file has been saved as a Photoshop file
--
on photoshopFile(some_file)
	if file creator of (info for some_file) is "8BIM" then
		return true
	else
		return false
	end if
end photoshopFile

-- What type of JPEG is it?
--
-- Optimized or Progressive = Progressive, which we don't want
--
on illegalJPEG(some_file)
	set header_grep to grepForString(some_file, "P[.]r[.]o[.]g[.]r[.]e[.]s[.]s[.]i[.]v[.]e")
	
	if item 1 of header_grep = 0 then --grepForString returns a list: {count, search string}
		return false
	else
		return true
	end if
end illegalJPEG

--
-- MAIN SCRIPT
--

on open actionItems
	--start log
	my logMe("--------------------------------------------------", 0)
	my logMe("SplashPhoto Normalizer Started", 0)
	my logMe("Debugging Mode = " & g_debug, 0)
	my logMe("--------------------------------------------------", 0)
	
	repeat with current_actionItem in actionItems
		--log basic information (testing)
		my logMe("FILE: " & (current_actionItem as text), 1)
		my logMe("Need fixing: " & illegalJPEG(current_actionItem), 1)
		my logMe(return, 0)
	end repeat
	
	--end log
	my logMe("--------------------------------------------------", 0)
	my logMe("SplashPhoto Normalizer Finished", 0)
	my logMe("--------------------------------------------------", 0)
end open

mark_hunte · May 1, 2009, 5:51pm

Hi Calvin,

Would it not be simpler to only test whether the file is a JPEG or not, and run the conversion on any JPEG found?
My reasoning is that doing that would take the same amount of time as the grep and do no harm to files that are already baseline.

Also, are you trying to distinguish between JPEG and PSD?
8BIM can be either the creator or the type in a Photoshop file.

You can use the Sips command sips -g typeIdentifier /Users/USERNAME/FOO…JPG, to get the type.
–>typeIdentifier: public.jpeg

sips -g format /Users/USERNAME/FOO.jpg
–>format: jpeg
–>format: psd (if file had wrong file extension)

I prefer to use the ExifTool command-line tool to check info on images. It’s quick and can get lots of info.
exiftool -s -s -s -EncodingProcess /Users/USERNAME/FOO…JPG
–>Baseline DCT, Huffman coding
or
–>Progressive DCT, Huffman coding

By the way are you looking at Splash Photo News Agency images…

CalvinFold · May 1, 2009, 6:07pm

The script needs to discern between saved methods of JPEG. In this case, SplashPhoto for the PalmOS does not support Progressive JPEGs (Optimized or Progressive in the JPEG save dialog), so I need to convert thousands of images that use Progressive to “Standard” JPEGs.

Yeah, I just prefer not to rely on anything the OS doesn’t come with so I don’t have to remember to install all these extras every time my work or home machine is re-imaged, or in order to pass a solution around…it’s a real PITA. So I used my handy-dandy GREP handler to parse the file innards to get the information I want. I’ve gotten a lot of mileage out of that handler and a hexdump handler that lets me look for these unique strings to begin with.

mark_hunte · May 1, 2009, 6:55pm

I meant in this part of your script…

“-- See if the file has been saved as a Photoshop file
–”

Rather than your end goal, which was clear to me.

I understand that all too much

CalvinFold · May 1, 2009, 7:08pm

Part two of the script is that the JPEGs that are “fixed” are only ones that have already been, in the past, saved by Photoshop. Any JPEG that is any other creator is left alone because it will need additional processing (i.e. is raw from a non-Mac, non-Photoshop source). So “8BIM” is an easy way to tell if the JPEG has been previously saved as a Photoshop file.

Sorry, I hadn’t intended for anyone to go into the process/methodology of the script, just my specific mechanical question…why the GREP routine is getting confused and how to fix it.

chrys · May 1, 2009, 8:17pm

CalvinFold:

Not understanding GREP well enough to fix a problem.

I have a GREP search routine that has worked fine for a multitude of purposes, but I think I’ve got an escape character issue with a specific search I need.

I am searching for the string “P.r.o.g.r.e.s.s.i.v.e” in the header of a JPEG file in order to convert it to a standard-encoded JPEG.

I have a GREP handler, but the problem is it seems to be searching just for the “P”:
on grepForString(path_to_grep, search_list)
	repeat with current_grep_item in search_list
		.
	end repeat
	.
end grepForString
And I’m simply trying to use it like this:
on illegalJPEG(some_file)
	set header_grep to grepForString(some_file, "P.r.o.g.r.e.s.s.i.v.e")
	.
Problem is, grepForString is returning 35 instances of “P” and not 1 instance of “P.r.o.g.r.e.s.s.i.v.e” on my test file.

The second parameter to grepForString looks like it is supposed to be a list, but you are passing a string/text. When search_list is a string, repeat with current_grep_item in search_list does not give an error, but instead loops with current_grep_item being each character in turn. Since the handler exits the loop after the first successful grep, only the first character (“P”) is ever searched for.

Send a list as the second parameter:

set header_grep to grepForString(some_file, {"P.r.o.g.r.e.s.s.i.v.e"})

Or, have the handler do some coercion of its own before the loop:

on grepForString(path_to_grep, search_list)
	if class of search_list is not list then set search_list to {search_list}
	repeat with current_grep_item in search_list
	.
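To see where a count for plain “P” comes from, here is a rough shell sketch of what the first loop pass effectively hands to grep when a bare string is passed in, compared with the whole pattern as one item. The sample file and its contents are made up for illustration; they are not from the JPEG in question.

```shell
# Hypothetical sample data standing in for the file being searched.
sample=$(mktemp)
printf 'Photoshop\nP.r.o.g.r.e.s.s.i.v.e\nPixel\nnothing here\n' > "$sample"

# String argument: the loop's first item is just the first character.
count_first_char=$(grep --count 'P' "$sample")

# List argument: the whole pattern is a single search item.
count_whole=$(grep --count 'P.r.o.g.r.e.s.s.i.v.e' "$sample")

echo "first character only: $count_first_char"
echo "whole pattern: $count_whole"
rm -f "$sample"
```

Note that --count reports matching lines, not individual occurrences, so the first number is the number of lines containing a “P”.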

Also, Mark67’s point still holds. The period “.” is a metacharacter for normal grep. It is a single-character “wildcard” (it will match any character). So as a regular expression, “P.r.o.g.r.e.s.s.i.v.e” will match “Parmoegrraetsoshiavee” (for example). If you want to find the literal string “P.r.o.g.r.e.s.s.i.v.e”, you need to search for “P\.r\.o\.g\.r\.e\.s\.s\.i\.v\.e” (er, that is the actual string; the AppleScript representation would need each backslash doubled so that it parses correctly). Alternatively, if you will always be using grepForString to search for exact matches, then add “-F” (fixed string) to the grep command line and leave the pattern as it currently is.

My guess, though, is that those dots are there to match against the NUL “characters” that are present “between” “characters” when you look at UTF-16 encoded text as a byte stream (which is more or less how grep “thinks” of files). In that case, you can probably proceed with what you have, but keep in mind that your liberal regular expression also matches other things, like the word I made up above.
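A quick way to compare the three variants from the shell. This is only a sketch: the temporary file holds two made-up lines (the literal string and the nonsense word from above), not real JPEG data.

```shell
# Hypothetical test file: one literal match, one "wildcard-only" match.
sample=$(mktemp)
printf 'P.r.o.g.r.e.s.s.i.v.e\nParmoegrraetsoshiavee\n' > "$sample"

# Unescaped dots: each "." matches any character, so both lines count.
regex_count=$(grep --count 'P.r.o.g.r.e.s.s.i.v.e' "$sample")

# Escaped dots: only the line with literal periods counts.
escaped_count=$(grep --count 'P\.r\.o\.g\.r\.e\.s\.s\.i\.v\.e' "$sample")

# -F: the pattern is taken as a fixed string, no escaping needed.
fixed_count=$(grep --count -F 'P.r.o.g.r.e.s.s.i.v.e' "$sample")

echo "regex: $regex_count  escaped: $escaped_count  fixed: $fixed_count"
rm -f "$sample"
```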

mark_hunte · May 1, 2009, 10:50pm

Hi Calvin,

What is your hexdump handler?

I have been playing (for the first time) with hexdump and AppleScript to convert from and to hex <-> ASCII <-> strings. Although I can find matching hex or strings with my conversions,
I find nothing that comes close to being the word “Progressive”.

Progressive <-> 50 72 6F 67 72 65 73 73 69 76 65
P.r.o.g.r.e.s.s.i.v.e <-> 50 2E 72 2E 6F 2E 67 2E 72 2E 65 2E 73 2E 73 2E 69 2E 76 2E 65

CalvinFold · May 4, 2009, 2:35pm

Aside from my GREP handler, this is one of my favorite handlers. I’ve had to resort to prying into the contents of files a lot lately to get at information the OS and apps are stubborn about giving.

Save as an application, requires drag-n-drop.

--
-- Get Hexdump Info v4
-- by Kevin Quosig, 3/28/07
--
-- Used to drag-n-drop files to examine their contents/headers.
--
-- Most code segments courtesy of James Nierodzik of MacScripter
-- http://bbs.applescript.net/profile.php?id=8727
--


--
-- UTILITY HANDLER
--

-- Search and Replace routine using AppleScript Text Item Delimiters "trick"
--
on searchNreplace(parse_me, find_me, replace_with_me)
	
	--save incoming TID state, set new TIDs
	set {ATID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, find_me}
	
	--using the specified character as a break point to strip the delimiter out and break the string into items
	set being_parsed to text items of parse_me
	
	--switch the TIDs again (replace string)
	set AppleScript's text item delimiters to {replace_with_me}
	
	--coerce it back to a string with new delimiters
	set parse_me to being_parsed as string
	
	--restore incoming TID state
	set AppleScript's text item delimiters to ATID
	
	--return results
	return parse_me
	
end searchNreplace


--
-- MAIN HANDLER
--
on open fileList
	
	-- parse through files dropped onto droplet
	repeat with i from 1 to number of items in fileList
		
		set AppleScript's text item delimiters to {""} --reset delimiters
		set this_item to item i of fileList as string ---pick item to work with
		set this_item_posix to quoted form of POSIX path of this_item --need POSIX path for shell scripts
		set doc_name to name of (info for alias this_item) --used for renaming the TextEdit window
		
		--Improved hexdump script line by TheMouthofSauron at MacScripter
		--http://bbs.applescript.net/viewtopic.php?pid=77811#p77811
		--
		--hexdump with the -C parameter formats the hexdump as columns of hex pairs
		--and then a column with a human-readable "ASCII translation" delimited by a pipe
		--character at the beginning and end of the ASCII column
		--
		--"awk" takes the entire -C formatted hexdump line ($0 = the whole input line)
		--and filters out the hex pairs and the delimiting pipe characters
		--(returns only the 16 characters starting at position 62)
		--
		set hex_dump to (do shell script "hexdump -C " & this_item_posix & " | awk '{print(substr($0,62,16))}'")
		
		--remove carriage returns so output is one giant paragraph
		--(allows for TextEdit searching for strings and manual scanning)
		set hex_dump to searchNreplace(hex_dump, return, "")
		
		--write to TextEdit window and rename window to file name to keep things straight
		tell application "TextEdit"
			make new document
			set text of front document to hex_dump
			set name of front window to doc_name
		end tell
	end repeat
end open

CalvinFold · May 4, 2009, 2:42pm

Ah…[slaps forehead]…duh good catch, needed a second set of eyeballs there, nice!

Yeah, it will always be exact string matches, including case sensitivity…it’s the whole point if I’m resorting to this sort of method (which by my own admission is rather on the brute-force side of things). So I will add the -F as a script improvement…thanks!

The output I am comparing against is from the hexdump handler shown in the previous post in this thread. They are literal periods, near as I can tell.

THANKS chrys, as always!

CalvinFold · May 4, 2009, 3:00pm

Okay, here’s where my Unix knowledge fails me…trying to add the “-F”

Tried…

do shell script "/usr/bin/grep -E -F --count" & quoted form of current_grep_item & " " & quoted form of POSIX path of path_to_grep

…and…

do shell script "/usr/bin/grep -EF --count" & quoted form of current_grep_item & " " & quoted form of POSIX path of path_to_grep

The first just confuses GREP altogether, the second says I’ve got a parameter conflict.

chrys · May 4, 2009, 6:16pm

CalvinFold:

Okay, here’s where my Unix knowledge fails me…trying to add the “-F”

Tried…
do shell script "/usr/bin/grep -E -F --count" & quoted form of current_grep_item & " " & quoted form of POSIX path of path_to_grep
…and…
do shell script "/usr/bin/grep -EF --count" & quoted form of current_grep_item & " " & quoted form of POSIX path of path_to_grep
The first just confuses GREP altogether, the second says I’ve got a parameter conflict.

Hmm, on my Tiger machine, I get “grep: conflicting matchers specified” errors with both variations. Maybe the missing space after “--count” is complicating the issue.

As the “conflict” error message implies, “-E” and “-F” are incompatible.
The “-E” option tells grep to interpret the pattern argument as an Extended regular expression (a particular variation on regexp syntax).
The “-F” option tells grep to treat the pattern argument as a literal string (no regexp interpretation at all).

Also, you will need to reinstate the space after “--count”.
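Put together, the command line the handler builds would look roughly like the sketch below: one matcher only (-F here), and a space between “--count” and the quoted pattern. The sample file and variable names are made up for illustration. It also shows the no-match case, where grep prints 0 and exits with status 1; that non-zero status is what makes do shell script raise the error the handler’s try block catches.

```shell
# Hypothetical sample file standing in for the real one.
sample=$(mktemp)
printf 'P.r.o.g.r.e.s.s.i.v.e\n' > "$sample"
pattern='P.r.o.g.r.e.s.s.i.v.e'

# One matcher (-F), and a space after --count before the pattern.
hits=$(grep -F --count "$pattern" "$sample")

# No match: grep still prints a count of 0, but exits with status 1.
miss_status=0
misses=$(grep -F --count 'no such text' "$sample") || miss_status=$?

echo "hits=$hits misses=$misses miss_status=$miss_status"
rm -f "$sample"
```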

If all you want to do is “hide” any non-printable characters, you can do that more easily than hexdump, awk, and searchNreplace:

do shell script "tr -c '[:print:]' . < " & quoted form of somefile

It will replace all non-printable characters with a period. One feature missing as compared to hexdump is that this variation will not suppress repeated “lines” like hexdump does.

$ perl -e 'print chr(0), chr(1), (chr(2).chr(3)) x (7+3*8), chr(4)' | hexdump -C
00000000  00 01 02 03 02 03 02 03  02 03 02 03 02 03 02 03  |................|
00000010  02 03 02 03 02 03 02 03  02 03 02 03 02 03 02 03  |................|
*
00000040  04                                                |.|
00000041

That asterisk on the third line of output means that the bytes at “00000020” (offset 32) and “00000030” (offset 48) are identical to the bytes shown in the second line.

The tr invocation I gave will not do this for you. It also does not leave an extra vertical bar (pipe character) at the end of the output for input that is not a multiple of 16 bytes.
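For a feel of what the tr approach produces, here is a small sketch run on a few made-up bytes rather than a real file: letters separated by NULs (as in UTF-16 text viewed as bytes) plus one other control byte. LC_ALL=C is my addition so that “printable” means plain ASCII regardless of the locale in use.

```shell
# Made-up input: "P", NUL, "r", NUL, "o", NUL, "g", then byte 0x01.
masked=$(printf 'P\000r\000o\000g\001' | LC_ALL=C tr -c '[:print:]' '.')

# Every non-printable byte has been replaced by a period.
echo "$masked"
```

This is the same “dotted” look that the ASCII column of hexdump -C gives, which would explain why the string looks like it contains periods even when the underlying bytes are something else.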

CalvinFold · May 4, 2009, 7:45pm

BAH!

In the end it didn’t matter. Got the GREP part working, but it looks like Adobe stores historical information in files, like the “original” file name, and so I was GREPing for information that wasn’t always there. Booo! Hissss!

Guess I will have to resort to EXIF, will start another thread on that…