Not understanding GREP well enough to fix a problem.
I have a GREP search routine that has worked fine for a multitude of purposes, but I think I’ve got an escape character issue with a specific search I need.
I am searching for the string “P.r.o.g.r.e.s.s.i.v.e” in the header of a JPEG file in order to convert it to a standard-encoded JPEG.
I have a GREP handler, but the problem is it seems to be searching just for the “P”:
on grepForString(path_to_grep, search_list)
repeat with current_grep_item in search_list
try --known bug between AppleScript and GREP where if GREP finds nothing, AppleScript errors-out
do shell script "/usr/bin/grep --count " & quoted form of current_grep_item & " " & quoted form of POSIX path of path_to_grep
set grep_result to result
exit repeat
on error error_message number error_number
if error_message is "0" then -- grep didn't find anything
set grep_result to 0
else
-- pass on the error
error error_message number error_number
end if
end try
end repeat
return {grep_result, contents of current_grep_item}
end grepForString
And I’m simply trying to use it like this:
on illegalJPEG(some_file)
set header_grep to grepForString(some_file, "P.r.o.g.r.e.s.s.i.v.e")
if ((item 1 of header_grep) as integer) = 0 then -- grepForString returns {count, pattern}
return false
else
return true
end if
end illegalJPEG
Problem is, grepForString is returning 35 instances of “P” and not 1 instance of “P.r.o.g.r.e.s.s.i.v.e” on my test file.
I was wondering if there was some way to clarify I wanted the whole string, literally, and your idea was along those lines. I don’t mind manually “escaping” the errant characters if need be; this is a one-off use. (Though points to anyone who knows how to fix the problem with the grepForString handler itself.)
Calvin, give this one a try. I’ve only ever tried grep using “-E” (extended) for pattern matches; I mostly use Satimage myself.
set x to choose file
try
do shell script "/usr/bin/grep -E " & quoted form of "P[.]r[.]o[.]g[.]r[.]e[.]s[.]s[.]i[.]v[.]e" & " " & quoted form of POSIX path of x
end try
Not sure how that helps, since I’m searching the “innards” or contents of a JPEG file, not the filename. Maybe I’m misunderstanding, or I myself am being unclear. Here’s the project so far, with your changes (still returning 35 instances of “P”):
(Still prototyping, haven’t implemented the full feature set…trying to get the detection of the file type and JPEG type working before adding the “fix it” code.)
--
-- SplashPhoto Normalizer
-- by Kevin Quosig, May 2009
--
-- Used to drag-n-drop files to convert any progressive JPEGs to standard.
--
--
-- DECLARE PROPERTIES
--
-- debugging on?
property g_debug : true
--basic file path and names
property g_home_folder_path : path to home folder
property g_log_file_name : "SplashPhoto Normalizer.txt"
--
-- UTILITY HANDLERS
--
-- Log Entry Generation
-- With help from StephanK of MacScripter
-- http://bbs.applescript.net/viewtopic.php?pid=76607#p76607
--
on logMe(log_string, indent_level)
if g_debug is true then --allows turning the debugger on and off so my logMe's can be left in final version
set log_target to (g_home_folder_path & "Library:Logs:" & g_log_file_name) as text
try
set log_file_ref to open for access file log_target with write permission
repeat indent_level times
write tab to log_file_ref starting at eof
end repeat
write ((log_string as text) & return) to log_file_ref starting at eof
close access log_file_ref
return true
on error
try
close access file log_target
end try
return false
end try
end if
end logMe
-- revised GREP routine courtesy of
-- Bruce Phillips of MacScripter
-- http://bbs.applescript.net/viewtopic.php?pid=83871#p83871
--
on grepForString(path_to_grep, search_list)
my logMe("BEGIN grepForString Handler", 1)
repeat with current_grep_item in search_list
try --known bug between AppleScript and GREP where if GREP finds nothing, AppleScript errors-out
do shell script "/usr/bin/grep -E --count" & quoted form of current_grep_item & " " & quoted form of POSIX path of path_to_grep
set grep_result to result
exit repeat
on error error_message number error_number
if error_message is "0" then -- grep didn't find anything
set grep_result to 0
else
-- pass on the error
error error_message number error_number
end if
end try
end repeat
my logMe("¢ RETURN: " & grep_result & "x | " & current_grep_item, 1)
my logMe("END grepForString Handler", 1)
return {grep_result, contents of current_grep_item}
end grepForString
-- See if the file has been saved as a Photoshop file
--
on photoshopFile(some_file)
if file creator of (info for some_file) is "8BIM" then -- read the creator code from the file's info record
return true
else
return false
end if
end photoshopFile
-- What type of JPEG is it?
--
-- Optimized or Progressive = Progressive, which we don't want
--
on illegalJPEG(some_file)
set header_grep to grepForString(some_file, "P[.]r[.]o[.]g[.]r[.]e[.]s[.]s[.]i[.]v[.]e")
if ((item 1 of header_grep) as integer) = 0 then -- grepForString returns {count, pattern}
return false
else
return true
end if
end illegalJPEG
--
-- MAIN SCRIPT
--
on open actionItems
--start log
my logMe("--------------------------------------------------", 0)
my logMe("SplashPhoto Normalizer Started", 0)
my logMe("Debugging Mode = " & g_debug, 0)
my logMe("--------------------------------------------------", 0)
repeat with current_actionItem in actionItems
--log basic information (testing)
my logMe("FILE: " & (current_actionItem as text), 1)
my logMe("Need fixing: " & illegalJPEG(current_actionItem), 1)
my logMe(return, 0)
end repeat
--end log
my logMe("--------------------------------------------------", 0)
my logMe("SplashPhoto Normalizer Finished", 0)
my logMe("--------------------------------------------------", 0)
end open
Would it not be simpler to test only whether the file is a JPEG, and run the conversion on any JPEG found?
My reasoning is that doing that would take the same amount of time as the grep and do no harm to files that are already baseline.
Also, are you trying to distinguish between JPEG and PSD?
8BIM can be either the creator or the type in a Photoshop file.
You can use the sips command, sips -g typeIdentifier /Users/USERNAME/FOO…JPG, to get the type. --> typeIdentifier: public.jpeg
I prefer to use the ExifTool command-line tool to check info on images. It’s quick and can get lots of info. exiftool -s -s -s -EncodingProcess /Users/USERNAME/FOO…JPG --> Baseline DCT, Huffman coding
or --> Progressive DCT, Huffman coding
By the way are you looking at Splash Photo News Agency images…
The script needs to discern between saved methods of JPEG. In this case, SplashPhoto for the PalmOS does not support Progressive JPEGs (Optimized or Progressive in the JPEG save dialog), so I need to convert thousands of images that use Progressive to “Standard” JPEGs.
Yeah, I just prefer not to rely on anything the OS doesn’t come with so I don’t have to remember to install all these extras every time my work or home machine is re-imaged, or in order to pass a solution around…it’s a real PITA. So I used my handy-dandy GREP handler to parse the file innards to get the information I want. I’ve gotten a lot of mileage out of that handler and a hexdump handler that lets me look for these unique strings to begin with.
Part two of the script is that the JPEGs that are “fixed” are only ones that have already been, in the past, saved by Photoshop. Any JPEG that is any other creator is left alone because it will need additional processing (i.e. is raw from a non-Mac, non-Photoshop source). So “8BIM” is an easy way to tell if the JPEG has been previously saved as a Photoshop file.
Sorry, I hadn’t intended for anyone to go into the process/methodology of the script, just my specific mechanical question…why the GREP routine is getting confused and how to fix it.
The second parameter to grepForString looks like it is supposed to be a list, but you are passing a string/text. When search_list is a string, repeat with current_grep_item in search_list does not give an error, but instead loops with current_grep_item being each character in turn. Since the handler exits the loop after the first successful grep, it only ever searches for “P”.
Send a list as the second parameter:
set header_grep to grepForString(some_file, {"P.r.o.g.r.e.s.s.i.v.e"})
Or, have the handler do some coercion of its own before the loop:
on grepForString(path_to_grep, search_list)
if class of search_list is not list then set search_list to {search_list}
repeat with current_grep_item in search_list
.
Also, Mark67’s point still holds. The period “.” is a metacharacter for normal grep. It is a single-character “wildcard” (it will match any character). So as a regular expression, “P.r.o.g.r.e.s.s.i.v.e” will match “Parmoegrraetsoshiavee” (for example). If you want to find the literal string “P.r.o.g.r.e.s.s.i.v.e”, you need to search for “P\.r\.o\.g\.r\.e\.s\.s\.i\.v\.e” (er, that is the actual string; the AppleScript representation would need each backslash doubled so that it parses correctly). Alternatively, if you will always be using grepForString to search for exact matches, then add “-F” (fixed string) to the grep command line and leave the pattern as it currently is.
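To make the wildcard point concrete, here is a minimal shell sketch. The sample file and its two lines are made up for illustration (they are not from a real JPEG header), and -c is the short form of --count.

```shell
#!/bin/sh
# Hypothetical sample data: one line with the literal dotted string,
# one line with the made-up word from the post above.
sample=$(mktemp)
printf '%s\n' 'P.r.o.g.r.e.s.s.i.v.e' 'Parmoegrraetsoshiavee' > "$sample"

# Unescaped dots are wildcards, so both lines match.
regex_count=$(grep -c 'P.r.o.g.r.e.s.s.i.v.e' "$sample")

# Escaped dots match only literal periods.
escaped_count=$(grep -c 'P\.r\.o\.g\.r\.e\.s\.s\.i\.v\.e' "$sample")

# -F treats the whole pattern as a fixed string, no escaping needed.
fixed_count=$(grep -F -c 'P.r.o.g.r.e.s.s.i.v.e' "$sample")

echo "regex=$regex_count escaped=$escaped_count fixed=$fixed_count"
rm -f "$sample"
```

Run as written, this should report two matching lines for the plain regex and one each for the escaped and -F forms.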
My guess, though, is that those dots are there to match against the NUL “characters” that are present “between” “characters” when you look at UTF-16 encoded text as a byte stream (which is more or less how grep “thinks” of files). In that case, you can probably proceed with what you have, but keep in mind that your liberal regular expression also matches other things, like the word I made up above.
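A small sketch of what that UTF-16 guess would look like at the byte level. The bytes for “Pro” are written out by hand as little-endian UTF-16, which is an assumption for illustration; the actual file may store the text differently.

```shell
#!/bin/sh
# "Pro" hand-written as UTF-16LE: each ASCII byte is followed by a NUL (00).
# This byte layout is assumed for illustration only.
utf16_hex=$(printf 'P\000r\000o\000' | od -An -tx1 | tr -d ' \n')

echo "UTF-16LE bytes for Pro: $utf16_hex"
```

If the header really is UTF-16, the bytes “between” the letters would be 00 rather than 2e (the period), and a regex dot would match either one.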
I have been playing (for the first time) with hexdump and AppleScript to convert between hex, ASCII, and strings. Although I can find matching hex or strings with my conversions,
I find nothing that comes close to being the word “Progressive”:
Progressive <-> 50 72 6F 67 72 65 73 73 69 76 65
P.r.o.g.r.e.s.s.i.v.e <-> 50 2E 72 2E 6F 2E 67 2E 72 2E 65 2E 73 2E 73 2E 69 2E 76 2E 65
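For anyone wanting to double-check hex values like these without a custom converter, a quick sketch using od (a standard tool; the variable names are just for illustration):

```shell
#!/bin/sh
# Hex bytes of the plain word, joined into one string (od prints lowercase).
plain_hex=$(printf '%s' 'Progressive' | od -An -tx1 | tr -d ' \n')

# Hex bytes of the version with literal periods (2e) between the letters.
dotted_hex=$(printf '%s' 'P.r.o.g.r.e.s.s.i.v.e' | od -An -tx1 | tr -d ' \n')

echo "plain:  $plain_hex"
echo "dotted: $dotted_hex"
```

Note the doubled 73 for the two “s” characters in both outputs.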
Aside from my GREP handler, this is one of my favorite handlers. I’ve had to resort to prying into the contents of files a lot lately to get at information the OS and apps are stubborn about giving.
Save as an application, requires drag-n-drop.
--
-- Get Hexdump Info v4
-- by Kevin Quosig, 3/28/07
--
-- Used to drag-n-drop files to examine their contents/headers.
--
-- Most code segments courtesy of James Nierodzik of MacScripter
-- http://bbs.applescript.net/profile.php?id=8727
--
--
-- UTILITY HANDLER
--
-- Search and Replace routine using AppleScript Text Item Delimiters "trick"
--
on searchNreplace(parse_me, find_me, replace_with_me)
--save incoming TID state, set new TIDs
set {ATID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, find_me}
--using the specified character as a break point to strip the delimiter out and break the string into items
set being_parsed to text items of parse_me
--switch the TIDs again (replace string)
set AppleScript's text item delimiters to {replace_with_me}
--coerce it back to a string with new delimiters
set parse_me to being_parsed as string
--restore incoming TID state
set AppleScript's text item delimiters to ATID
--return results
return parse_me
end searchNreplace
--
-- MAIN HANDLER
--
on open fileList
-- parse through files dropped onto droplet
repeat with i from 1 to number of items in fileList
set AppleScript's text item delimiters to {""} --reset delimiters
set this_item to item i of fileList as string ---pick item to work with
set this_item_posix to quoted form of POSIX path of this_item --need POSIX path for shell scripts
set doc_name to name of (info for alias this_item) --used for renaming the TextEdit window
--Improved hexdump script line by TheMouthofSauron at MacScripter
--http://bbs.applescript.net/viewtopic.php?pid=77811#p77811
--
--hexdump with the -C parameter formats the hexdump as columns of hex pairs
--and then a column with a human-readable "ASCII translation" delimited by a pipe
--character at the beginning and end of the ASCII column
--
--"awk" takes the entire -C formatted hexdump line ($0 = the whole input line)
--and filters-out the hex pairs and the delimiting of pipe characters
--(return only 16 characters starting at position 62)
--
set hex_dump to (do shell script "hexdump -C " & this_item_posix & " | awk '{print(substr($0,62,16))}'")
--remove carriage returns so output is one giant paragraph
--(allows for TextEdit searching for strings and manual scanning)
set hex_dump to searchNreplace(hex_dump, return, "")
--write to TextEdit window and rename window to file name to keep things straight
tell application "TextEdit"
make new document
set text of front document to hex_dump
set name of front window to doc_name
end tell
end repeat
end open
Ah…[slaps forehead]…duh good catch, needed a second set of eyeballs there, nice!
Yeah, it will always be exact string matches, including case sensitivity…it’s the whole point if I’m resorting to this sort of method (which by my own admission is rather on the brute-force side of things). So I will add the -F as a script improvement…thanks!
The output I am comparing against is from the hexdump handler shown in the previous post in this thread. They are literal periods, near as I can tell.
Hmm, on my Tiger machine, I get “grep: conflicting matchers specified” errors with both variations. Maybe the missing space after “--count” is complicating the issue.
As the “conflict” error message implies, “-E” and “-F” are incompatible.
The “-E” option tells grep to interpret the pattern argument as an Extended regular expression (a particular variation on regexp syntax).
The “-F” option tells grep to treat the pattern argument as a literal string (no regexp interpretation at all).
Also, you will need to reinstate the space after “--count”.
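Here is a rough sketch of the command shape with -F alone (no -E) and a space between each option and the pattern. It also shows why the handler needs its try block: when nothing matches, grep still prints a count of 0 but exits with status 1, and do shell script reports any non-zero exit as an error. The sample file and the second pattern are made up for illustration; -c is the short form of --count.

```shell
#!/bin/sh
# Hypothetical sample file with one line containing the dotted string.
sample=$(mktemp)
printf '%s\n' 'header: P.r.o.g.r.e.s.s.i.v.e' 'another line' > "$sample"

# Match found: count is 1 and grep exits with status 0.
hit_count=$(grep -F -c 'P.r.o.g.r.e.s.s.i.v.e' "$sample") && hit_status=0 || hit_status=$?

# No match: count is 0, but grep exits with status 1.
miss_count=$(grep -F -c 'B.a.s.e.l.i.n.e' "$sample") && miss_status=0 || miss_status=$?

echo "hit:  count=$hit_count status=$hit_status"
echo "miss: count=$miss_count status=$miss_status"
rm -f "$sample"
```

So the “0” the handler checks for in its error branch is grep’s normal output for “no matching lines”, delivered alongside a non-zero exit status.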
If all you want to do is “hide” any non-printable characters, you can do that more easily than hexdump, awk, and searchNreplace:
do shell script "tr -c '[:print:]' . < " & quoted form of somefile
It will replace all non-printable characters with a period. One feature missing as compared to hexdump is that this variation will not suppress repeated “lines” like hexdump does.
$ perl -e 'print chr(0), chr(1), (chr(2).chr(3)) x (7+3*8), chr(4)' | hexdump -C
00000000 00 01 02 03 02 03 02 03 02 03 02 03 02 03 02 03 |................|
00000010 02 03 02 03 02 03 02 03 02 03 02 03 02 03 02 03 |................|
*
00000040 04 |.|
00000041
That asterisk on the third line of output means that the bytes at “00000020” (offset 32) and “00000030” (offset 48) are identical to the bytes shown in the second line.
The tr invocation I gave will not do this for you. It also does not leave an extra vertical bar (pipe character) at the end of the output for input that is not a multiple of 16 bytes.
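One caveat worth sketching: a display that turns every non-printable byte into a period cannot tell a NUL from a real period. The two inputs below are made up for illustration, and LC_ALL=C is set so tr works byte by byte.

```shell
#!/bin/sh
# Input 1: literal periods (0x2e) between the letters.
with_periods=$(printf 'P.r.o.g' | LC_ALL=C tr -c '[:print:]' '.')

# Input 2: NUL bytes (0x00) in the same positions.
with_nuls=$(printf 'P\000r\000o\000g' | LC_ALL=C tr -c '[:print:]' '.')

echo "periods: $with_periods"
echo "NULs:    $with_nuls"
```

Both lines come out the same. The ASCII column of hexdump -C makes the same substitution, so the hex pairs (2e versus 00) are the place to look when it matters which one is really in the file.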
In the end it didn’t matter. Got the GREP part working, but it looks like Adobe stores historical information in files, like the “original” file name, and so I was GREPing for information that wasn’t always there. Booo! Hissss!
Guess I will have to resort to EXIF; will start another thread on that…