Search string in ISO 8859-1 encoded file

Hello MacScripter Community,

I wrote a script to search for a string in a text file which is encoded in ISO 8859-1.
This text contains special characters like ä, ö, ü but also French accents like é or ç.
The search string is generated from the title of a file like this:
Köln_Dom.jpg or Berlin_Rotes,Rathaus.jpg.

I split this filename on the “_” and use the first part for searching → Köln.

Example from textfile:

  +-- Städte
       +-- Algerien: #7710
       +-- Annaba (Bône): #18982
       +-- Batna: #18983
       +-- Bechar (Colomb Béchar): #18984
        +-- Chlef (Orléansville): #18988
  +--Scottsdale: #18456
 +-- Los Angeles: #20453
+-- Köln: #33333

I first read all lines into a list and then search through this list. (This takes quite a while with the whole file, which is about 10,000 lines long.)

A search for strings without special characters works.

For instance, these calls work:
getDCCategory("Los,Angeles")
getDCCategory("Scottsdale")

This one does not work:
getDCCategory("Köln")

Here I get the error "dc_category is not defined".

Here is my script so far:



--read textfile with the categories and extract the category 
on getDCCategory(searchstring)
	try
		-- if searchstring contains spaces placeholder
		if searchstring contains ",," then
			set searchstring to replaceString(searchstring, ",,", " ")
		end if
		
		set categoriesfile to (((path to documents folder) as string) & "MyFiles:TEST.txt")
		
		--read all lines from categoriesfile in a list 
		set allcategories to (every paragraph of (read (categoriesfile as alias)))
		-- filtering out the category
		repeat with i from 2 to number of items in allcategories
			set this_item to item i of allcategories
			if this_item contains searchstring then
				set split to explode("#", this_item)
				set dc_category to item 2 of split
			end if
		end repeat
		
		--this seems not to work
		if dc_category is equal to "" then
			--temp
			return "Nothing found!"
			--should be later
			-- if you dont find a category then move the file to folder 2CHECK on desktop (which has to be created) for manual checking
		else
			return dc_category
		end if
		
	on error eMsg number eNum
		error "Can't getDCCategory: " & eMsg number eNum
	end try
end getDCCategory



(*
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
--c-                                                                                           STRING LIBRARY
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

*)

--c--   explode(delimiter, input)
--d--   Split a string on a specific delimiter.
--a--   delimiter : string -- the delimiter used to split the string
--a--   input : string -- the string to split
--o--   list
--x--   explode("-", "a-b-c") --> {"a", "b", "c"}
--u--   ljr (http://applescript.bratis-lover.net/library/string/),
--u--   modified from 'Applescript.net' (http://bbs.applescript.net/viewtopic.php?id=18377)
on explode(delimiter, input)
	local delimiter, input, ASTID
	set ASTID to AppleScript's text item delimiters
	try
		set AppleScript's text item delimiters to delimiter
		set input to text items of input
		set AppleScript's text item delimiters to ASTID
		return input --> list
	on error eMsg number eNum
		set AppleScript's text item delimiters to ASTID
		error "Can't explode: " & eMsg number eNum
	end try
end explode

--c--   replaceString(theText, oldString, newString)
--d--   Case-sensitive find and replace of all occurrences.
--a--   theText : string -- the string to search
--a--   oldString : string -- the find string
--a--   newString : string -- the replacement string
--r--   string
--x--   replaceString("Hello hello", "hello", "Bye") --> "Hello Bye"
--u--   ljr (http://applescript.bratis-lover.net/library/string/)
on replaceString(theText, oldString, newString)
	local ASTID, theText, oldString, newString, lst
	set ASTID to AppleScript's text item delimiters
	try
		considering case
			set AppleScript's text item delimiters to oldString
			set lst to every text item of theText
			set AppleScript's text item delimiters to newString
			set theText to lst as string
		end considering
		set AppleScript's text item delimiters to ASTID
		return theText
	on error eMsg number eNum
		set AppleScript's text item delimiters to ASTID
		error "Can't replaceString: " & eMsg number eNum
	end try
end replaceString


I am still a newbie in AppleScript so please be patient.
Thanks a lot
Marc

Hi,

You could perform the whole task with the shell (converting to UTF-8, searching and extracting the number).
The script returns the category number or missing value.

For multiple searches I'd recommend reading the text file (and converting it to UTF-8) only once.


set searchString to "Los Angeles"
set category to getDCCategory(searchString)

on getDCCategory(searchString)
	set categoriesfile to POSIX path of (path to documents folder) & "MyFiles/TEST.txt"
	try
		set allcategories to paragraphs of (do shell script "textutil -stdout -convert txt -inputencoding iso-8859-1 -encoding UTF-8 " & quoted form of categoriesfile & " | grep " & quoted form of searchString & " | cut -d '#' -f 2")
		if (count allcategories) > 0 then
			return item 1 of allcategories
		end if
	end try
	return missing value
end getDCCategory
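
For repeated lookups, the "convert once, search many times" idea can be sketched in plain shell. This is only a minimal illustration under stated assumptions: it swaps in iconv for textutil, builds a two-line sample file instead of using TEST.txt, and the names latin1_file, utf8_file and lookup_category are made up for the example.

```shell
#!/bin/sh
# Sketch: convert the ISO 8859-1 file to UTF-8 once, then search the copy repeatedly.
# The sample data, the temp files and lookup_category are illustrative assumptions.

latin1_file=$(mktemp)
utf8_file=$(mktemp)
trap 'rm -f "$latin1_file" "$utf8_file"' EXIT

# Sample category lines; \366 is the single ISO 8859-1 byte for "ö".
printf '+-- Los Angeles: #20453\n+-- K\366ln: #33333\n' > "$latin1_file"

# One-time conversion to UTF-8.
iconv -f ISO-8859-1 -t UTF-8 "$latin1_file" > "$utf8_file"

# Print the text after "#" for the first line containing the search string.
# grep -F treats the search string literally instead of as a regular expression.
lookup_category() {
    grep -F -- "$1" "$utf8_file" | head -n 1 | cut -d '#' -f 2
}

lookup_category 'Köln'          # prints 33333
lookup_category 'Los Angeles'   # prints 20453
```

A search string with no match prints nothing, which a calling AppleScript could map to missing value as in the handler above.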

The easiest way would be to convert the file to Unicode text before processing; special characters are then supported as well.

change:

set allcategories to (every paragraph of (read (categoriesfile as alias)))

to:

set allcategories to every paragraph of (do shell script "iconv -f ISO-8859-1 -t UTF-8 /path/to/textfile.txt")
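
To see why the conversion matters, here is a small byte-level sketch, assuming a shell with iconv and od available. It only illustrates how the two encodings store "Köln" differently; it does not describe how AppleScript's read command decodes the file. The variable names are made up for the example.

```shell
#!/bin/sh
# "Köln" as stored in an ISO 8859-1 file: ö is the single byte 0xF6 (octal 366).
latin1_hex=$(printf 'K\366ln' | od -An -tx1 | tr -d ' \n')

# After conversion, ö becomes the two-byte UTF-8 sequence 0xC3 0xB6.
utf8_hex=$(printf 'K\366ln' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "ISO 8859-1: $latin1_hex"   # prints ISO 8859-1: 4bf66c6e
echo "UTF-8:      $utf8_hex"     # prints UTF-8:      4bc3b66c6e
```

Because the byte sequences differ, a search string in one encoding cannot match file contents that are still in the other, while plain ASCII names such as Scottsdale are identical in both.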

Hi Stefan,

Just tried it and it works :D. And it is also waaaayyy faster than my script.
Seems I should spend some time to learn more about shell scripting.

@DJ Bazzie Wazzie
Thanks for your code as well. I will check it out too, but I have to leave now for work.

This forum is such a great help and source for learning more about scripting.
Thanks a lot

Marc