Selecting Text within Text with AppleScript

bowjest · October 28, 2012, 4:25pm

Hello to all,

I’m trying to write a script that will allow me to select a certain section of text from the results of a web page, but with no luck so far.

The easiest way to illustrate what I’m trying to do would be using a page such as this one:

http://mymemory.translated.net/api/get?q=Hello%20World!&langpair=en|it

The page returns one long string, with the result (in this case a translation) sitting between either:

{"responseData":{"translatedText":"

and

"},"responseDetails"

or, further along in the string, between:

"translation":"

and

","quality":

I thought I was onto a possible solution after finding the following:

http://macscripter.net/viewtopic.php?id=15880

but have not been able to rework this to capture the data I need (namely everything between the two above-mentioned start and end points).

This seems like it should be such an easy thing to do, but I can’t seem to get my head around it.

I suppose for this particular page, curl might be the better alternative, but I’m not able to get that to work so far either. I suspect I don’t understand the requirements for the syntax:


do shell script "curl http://mymemory.translated.net/api/get?q=Hello World!&langpair=en|it"

Is it just the quotes I need to fix or do I need to “escape” any of the other things like the forward slashes and question mark?

Can anyone offer any helpful suggestions?

Many thanks,

Bowjest

Yvan_Koenig · October 28, 2012, 10:06pm

Try with:


do shell script "curl " & quoted form of "http://mymemory.translated.net/api/get?q=Hello World!&langpair=en|it"

Running it, I got:

"{\"responseData\":{\"translatedText\":\"INVALID LANGUAGE PAIR SPECIFIED. EXAMPLE: LANGPAIR=EN|IT USING 2 LETTER ISO OR RFC3066 LIKE ZH-CN. ALMOST ALL LANGUAGES SUPPORTED BUT SOME MAY HAVE NO CONTENT\"},\"responseDetails\":\"INVALID LANGUAGE PAIR SPECIFIED. EXAMPLE: LANGPAIR=EN|IT USING 2 LETTER ISO OR RFC3066 LIKE ZH-CN. ALMOST ALL LANGUAGES SUPPORTED BUT SOME MAY HAVE NO CONTENT\",\"responseStatus\":403,\"matches\":\"\"}"

It seems that the syntax describing the language pair is wrong.

With this one:


do shell script "curl " & quoted form of "http://mymemory.translated.net/api/get?q=Hello%20World!&langpair=en|it"

I got:

"{\"responseData\":{\"translatedText\":\"Ciao Mondo\"},\"responseDetails\":\"\",\"responseStatus\":200,\"matches\":[{\"id\":\"0\",\"segment\":\"Hello World\",\"translation\":\"Ciao Mondo\",\"quality\":\"70\",\"reference\":\"Machine Translation provided by Google, Microsoft, Worldlingo or MyMemory customized engine.\",\"usage-count\":1,\"subject\":\"All\",\"created-by\":\"MT!\",\"last-updated-by\":null,\"create-date\":\"2012-10-28\",\"last-update-date\":\"2012-10-28\",\"match\":0.85}]}"

Which is in fact:

{"responseData":{"translatedText":"Ciao Mondo"},"responseDetails":"","responseStatus":200,"matches":[{"id":"0","segment":"Hello World","translation":"Ciao Mondo","quality":"70","reference":"Machine Translation provided by Google, Microsoft, Worldlingo or MyMemory customized engine.","usage-count":1,"subject":"All","created-by":"MT!","last-updated-by":null,"create-date":"2012-10-28","last-update-date":"2012-10-28","match":0.85}]}

Yvan KOENIG (VALLAURIS, France) Sunday, 28 October 2012 23:06:48

regulus6633 · October 28, 2012, 10:11pm

To get the text between 2 delimiters try this…

set t to "{\"responseData\":{\"translatedText\":\"Ciao Mondo\"},\"responseDetails\":\"\",\"responseStatus\":200,\"matches\":[{\"id\":\"0\",\"segment\":\"Hello World\",\"translation\":\"Ciao Mondo\",\"quality\":\"70\",\"reference\":\"Machine Translation provided by Google, Microsoft, Worldlingo or MyMemory customized engine.\",\"usage-count\":1,\"subject\":\"All\",\"created-by\":\"MT!\",\"last-updated-by\":null,\"create-date\":\"2012-10-28\",\"last-update-date\":\"2012-10-28\",\"match\":0.85}]}"

getTextBetweenDelimiters(t, "{\"responseData\":{\"translatedText\":\"", "\"},\"responseDetails\"")

on getTextBetweenDelimiters(theText, firstDelim, secondDelim)
	try
		set {tids, text item delimiters} to {text item delimiters, firstDelim}
		set a to second text item of theText
		set text item delimiters to secondDelim
		set b to first text item of a
		set text item delimiters to tids
		return b
	on error
		return theText
	end try
end getTextBetweenDelimiters
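
The same handler should also work for the second delimiter pair from the original question ("translation":" and ","quality":). Here is a minimal sketch, assuming the response keeps the shape shown above. The shortened sample string t2 is made up for illustration.

```applescript
-- Shortened sample response (hypothetical, same shape as the real one)
set t2 to "{\"matches\":[{\"segment\":\"Hello World\",\"translation\":\"Ciao Mondo\",\"quality\":\"70\"}]}"

-- Everything between  "translation":"  and  ","quality":
set theMatch to getTextBetweenDelimiters(t2, "\"translation\":\"", "\",\"quality\":")
--> "Ciao Mondo"

on getTextBetweenDelimiters(theText, firstDelim, secondDelim)
	try
		-- Save the current delimiters, then split on the opening delimiter
		set {tids, text item delimiters} to {text item delimiters, firstDelim}
		set a to second text item of theText
		-- Split the remainder on the closing delimiter and keep the first piece
		set text item delimiters to secondDelim
		set b to first text item of a
		set text item delimiters to tids
		return b
	on error
		-- The opening delimiter was not found: hand back the input unchanged
		return theText
	end try
end getTextBetweenDelimiters
```

Note that if the opening delimiter is missing, the handler returns the whole input, and if only the closing delimiter is missing, it returns everything after the opening one. It may be worth checking the result before using it.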

StefanK · October 28, 2012, 10:18pm

The problem is the vertical bar, which the shell uses as the pipe operator.
With quoting, the shell ignores the character, but unfortunately curl seems to misinterpret it as well.

Yvan_Koenig · October 28, 2012, 10:28pm

Hello Stefan

As I added, this one does the job:


do shell script "curl " & quoted form of "http://mymemory.translated.net/api/get?q=Hello%20World!&langpair=en|it"

I just escaped the space character between Hello and World

Yvan KOENIG (VALLAURIS, France) Sunday, 28 October 2012 23:28:23

StefanK · October 28, 2012, 10:33pm

Mon dieu, I tried dozens of different forms to no avail, but not this one.

McUsr · October 28, 2012, 11:28pm

Hello!

This may come in handy!

There are some encoding routines floating around here for use when assembling the text. I guess you have to encode characters outside the normal ASCII range as well. Those by DJ Bazzie Wazzie are the most up-to-date ones.

I just put them here for easy access!


on rawURLEncode(str)
	return do shell script "/bin/echo -n " & quoted form of str & " | php -r ' echo rawurlencode(fgets(STDIN)); '"
end rawURLEncode

on rawURLDecode(str)
	return do shell script "/bin/echo -n " & quoted form of str & " | php -r ' echo rawurldecode(fgets(STDIN)); '"
end rawURLDecode

bowjest · October 29, 2012, 8:18am

Thanks, everyone, for your replies. It seems this is far more complex than I originally thought.

I’ve tried each of the examples provided and I do get the results outlined, but I’m unsure how to combine these so that either by using curl or by calling the webpage I can capture only the bits of text between the delimiters.

For instance, Regulus6633’s following suggestion worked fine for the given example:

set t to "{\"responseData\":{\"translatedText\":\"Ciao Mondo\"},\"responseDetails\":\"\",\"responseStatus\":200,\"matches\":[{\"id\":\"0\",\"segment\":\"Hello World\",\"translation\":\"Ciao Mondo\",\"quality\":\"70\",\"reference\":\"Machine Translation provided by Google, Microsoft, Worldlingo or MyMemory customized engine.\",\"usage-count\":1,\"subject\":\"All\",\"created-by\":\"MT!\",\"last-updated-by\":null,\"create-date\":\"2012-10-28\",\"last-update-date\":\"2012-10-28\",\"match\":0.85}]}"

getTextBetweenDelimiters(t, "{\"responseData\":{\"translatedText\":\"", "\"},\"responseDetails\"")

on getTextBetweenDelimiters(theText, firstDelim, secondDelim)
   try
       set {tids, text item delimiters} to {text item delimiters, firstDelim}
       set a to second text item of theText
       set text item delimiters to secondDelim
       set b to first text item of a
       set text item delimiters to tids
       return b
   on error
       return theText
   end try
end getTextBetweenDelimiters

I then tried to combine this with Yvan’s script to get:

set t to do shell script "curl " & quoted form of "http://mymemory.translated.net/api/get?q=Hello%20World!&langpair=en|it"

getTextBetweenDelimiters(t, "{\"responseData\":{\"translatedText\":\"", "\"},\"responseDetails\"")

on getTextBetweenDelimiters(theText, firstDelim, secondDelim)
	try
		set {tids, text item delimiters} to {text item delimiters, firstDelim}
		set a to second text item of theText
		set text item delimiters to secondDelim
		set b to first text item of a
		set text item delimiters to tids
		return b
	on error
		return theText
	end try
end getTextBetweenDelimiters

The problem being, of course, how to deal with spaces and punctuation (i.e. if I put in some other, longer text and don’t insert %20 where there are spaces, it will fail. I assume the same will happen if I input any punctuation as well).

Would I just be better off having the script open a webpage and work with it from there? That seems to work very well:


set t to the clipboard

getTextBetweenDelimiters(t, "{\"responseData\":{\"translatedText\":\"", "\"},\"responseDetails\"")

on getTextBetweenDelimiters(theText, firstDelim, secondDelim)
	try
		set {tids, text item delimiters} to {text item delimiters, firstDelim}
		set a to second text item of theText
		set text item delimiters to secondDelim
		set b to first text item of a
		set text item delimiters to tids
		return b
	on error
		return theText
	end try
end getTextBetweenDelimiters

But I’m then unable to copy the final output to the clipboard so that it can be pasted into a given file. I tried using:

set clipboard to getTextBetweenDelimiters

This didn’t, however, work.

Can anyone advise what the best course of action would be?

Thanks,

Bowjest

McUsr · October 29, 2012, 8:44am

That is what the handler I posted is for. After you have encoded the text, you concatenate it into the full URL string and then call curl with it.

IMHO curl is a much better way to do it than using the web page: load times vary, and there is so much more involved that may go wrong.


set txtToTrans to "There exists already a file named ledger.dat, do you really want to overwrite?"

set txtToTrans to rawURLEncode(txtToTrans)

--> "There%20exists%20already%20a%20file%20named%20ledger.dat%2C%20do%20you%20really%20want%20to%20overwrite%3F"

on rawURLEncode(str)
	return do shell script "/bin/echo -n " & quoted form of str & " | php -r ' echo rawurlencode(fgets(STDIN)); '"
end rawURLEncode

bowjest · October 29, 2012, 11:33am

Hi McUsr,

I should have replied to you before. Please excuse my oversight.

I wasn’t really sure how to use your suggestion, but your example now makes sense. Thanks!

The biggest problem now for me is how to substitute all the variables to get through the various parts of the script.

This is how I would use this:

Highlight and copy text to the clipboard
Invoke the final script, which would curl the results
Output of the script is now saved to the clipboard so that it can be pasted into a given file or files

I thought I was part of the way there:

set txtToTrans to the clipboard

set txtToTrans to rawURLEncode(txtToTrans)

on rawURLEncode(str)
	return do shell script "/bin/echo -n " & quoted form of str & " | php -r ' echo rawurlencode(fgets(STDIN)); '"
end rawURLEncode

set rawURLEncodeToTrans to do shell script "curl " & quoted form of "http://mymemory.translated.net/api/get?q=" & "rawURLEncode" & "&langpair=en|it"

getTextBetweenDelimiters(rawURLEncodeToTrans, "{\"responseData\":{\"translatedText\":\"", "\"},\"responseDetails\"")

on getTextBetweenDelimiters(theText, firstDelim, secondDelim)
	try
		set {tids, text item delimiters} to {text item delimiters, firstDelim}
		set a to second text item of theText
		set text item delimiters to secondDelim
		set b to first text item of a
		set text item delimiters to tids
		return b
	on error
		return theText
	end try
end getTextBetweenDelimiters

But this fails with the following line being highlighted:

set rawURLEncodeToTrans to do shell script "curl " & quoted form of "http://mymemory.translated.net/api/get?q=" & "rawURLEncode" & "&langpair=en|it"

The error quoted is:

error "sh: it: command not found"

Is there a syntax or variable error that I’m not seeing?

Many thanks,

bowjest

StefanK · October 29, 2012, 11:41am

You have to use the encoded string txtToTrans and quote the whole URL


set rawURLEncodeToTrans to do shell script "curl " & quoted form of ("http://mymemory.translated.net/api/get?q=" & txtToTrans & "&langpair=en|it")
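
The parentheses matter because quoted form of binds more tightly than the & concatenation operator. Here is a small sketch of the difference. The variable names baseURL, langPart, partlyQuoted and fullyQuoted are made up for illustration, and nothing is sent to the shell.

```applescript
set baseURL to "http://mymemory.translated.net/api/get?q="
set txtToTrans to "Hello%20World"
set langPart to "&langpair=en|it"

-- Without parentheses only the first string is quoted; the rest is appended bare
set partlyQuoted to quoted form of baseURL & txtToTrans & langPart
--> "'http://mymemory.translated.net/api/get?q='Hello%20World&langpair=en|it"

-- With parentheses the whole URL is assembled first and then quoted as one unit
set fullyQuoted to quoted form of (baseURL & txtToTrans & langPart)
--> "'http://mymemory.translated.net/api/get?q=Hello%20World&langpair=en|it'"
```

In the first form the shell sees a bare & and a bare |, so it treats what follows the | as a separate command named it, which would explain the "it: command not found" error.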

bowjest · October 29, 2012, 12:01pm

Thanks, Stefan. That’s fixed it.

If I wanted to have this output saved/copied to the clipboard, how would I need to work that with regard to the output of getTextBetweenDelimiters?

I tried using:

set clipboard to getTextBetweenDelimiters

This, however, failed with the message:

error "Can’t set clipboard to «handler getTextBetweenDelimiters»." number -10006 from clipboard

Do I need to set the clipboard to the entire routine in some way?

Many thanks,

bowjest

StefanK · October 29, 2012, 12:07pm

You have to copy the result of the handler to the clipboard, not the handler itself.
For example:

set theFilteredText to getTextBetweenDelimiters(rawURLEncodeToTrans, "{\"responseData\":{\"translatedText\":\"", "\"},\"responseDetails\"")
set the clipboard to theFilteredText
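
Putting the pieces from this thread together, the whole clipboard-to-clipboard workflow might look roughly like the sketch below. The wrapper handler translateClipboard is a made-up name, and the call is left commented out because it needs network access and a php binary on the PATH (recent macOS releases no longer bundle PHP). It also assumes the service still answers at this URL.

```applescript
-- translateClipboard() -- uncomment to run; needs network access and php

on translateClipboard()
	-- 1. Take the copied text and percent-encode it
	set txtToTrans to rawURLEncode(the clipboard as text)
	-- 2. Build the full URL first, then quote it as a whole for the shell
	set theURL to "http://mymemory.translated.net/api/get?q=" & txtToTrans & "&langpair=en|it"
	set theResponse to do shell script "curl -s " & quoted form of theURL
	-- 3. Extract the translation and put it back on the clipboard
	set theFilteredText to getTextBetweenDelimiters(theResponse, "{\"responseData\":{\"translatedText\":\"", "\"},\"responseDetails\"")
	set the clipboard to theFilteredText
	return theFilteredText
end translateClipboard

on rawURLEncode(str)
	return do shell script "/bin/echo -n " & quoted form of str & " | php -r ' echo rawurlencode(fgets(STDIN)); '"
end rawURLEncode

on getTextBetweenDelimiters(theText, firstDelim, secondDelim)
	try
		set {tids, text item delimiters} to {text item delimiters, firstDelim}
		set a to second text item of theText
		set text item delimiters to secondDelim
		set b to first text item of a
		set text item delimiters to tids
		return b
	on error
		-- Opening delimiter not found: return the input unchanged
		return theText
	end try
end getTextBetweenDelimiters
```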

bowjest · October 29, 2012, 12:15pm

Thanks, Stefan.

That did the trick.

All the best and thanks to everyone for your help.

Regards,

bowjest

regulus6633 · October 29, 2012, 1:12pm

All you need to do is replace spaces in your search phrase with “%20”. So a simple find/replace subroutine is all you need. You just input the search phrase and the rest should work. So we can use dialogs to make it adjustable…

set appTitle to "English to Italian Translation"
set tempVar to display dialog "Enter your English search word(s)." default answer "Hello World!" with icon note buttons {"Cancel", "OK"} default button "OK" with title appTitle
set searchPhrase to text returned of tempVar

if searchPhrase is not "" then
	set fixedSearchPhrase to findReplace(searchPhrase, space, "%20")
	set t to do shell script "curl " & quoted form of ("http://mymemory.translated.net/api/get?q=" & fixedSearchPhrase & "&langpair=en|it")
	
	set theTranslation to getTextBetweenDelimiters(t, "{\"responseData\":{\"translatedText\":\"", "\"},\"responseDetails\"")
	display dialog "The Italian translation of \"" & searchPhrase & "\" is:" & return & return & theTranslation with icon note buttons {"OK"} default button "OK" with title appTitle
else
	display dialog "You did not enter any English text! Please try again." with icon 0 buttons {"OK"} default button 1 with title appTitle
end if

(*************** SUBROUTINES *****************)
on findReplace(theString, search_string, replacement_string)
	if theString contains search_string then
		set AppleScript's text item delimiters to search_string
		set text_item_list to text items of theString
		set AppleScript's text item delimiters to replacement_string
		set theString to text_item_list as text
		set AppleScript's text item delimiters to ""
	end if
	return theString
end findReplace

on getTextBetweenDelimiters(theText, firstDelim, secondDelim)
	try
		set {tids, text item delimiters} to {text item delimiters, firstDelim}
		set a to second text item of theText
		set text item delimiters to secondDelim
		set b to first text item of a
		set text item delimiters to tids
		return b
	on error
		return theText
	end try
end getTextBetweenDelimiters
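
Replacing only the spaces leaves characters such as &, ? or # unencoded, and those would change the meaning of the URL. One possible alternative is to let curl build and encode the query string itself, assuming the installed curl supports the -G and --data-urlencode options. The handler name buildTranslateCommand is made up for this sketch, which only assembles the command line and sends nothing over the network. Whether the service accepts the encoded form of the | in langpair is also an assumption.

```applescript
set searchPhrase to "Tom & Jerry: who's faster?"
set cmd to buildTranslateCommand(searchPhrase, "en|it")
-- To actually run it: set t to do shell script cmd

on buildTranslateCommand(thePhrase, langPair)
	-- -s hides the progress meter; -G sends the encoded data as a query string
	set cmd to "curl -s -G"
	-- Each name=value pair is quoted for the shell; curl encodes the value part
	set cmd to cmd & " --data-urlencode " & quoted form of ("q=" & thePhrase)
	set cmd to cmd & " --data-urlencode " & quoted form of ("langpair=" & langPair)
	-- The base URL carries no query string; curl appends one
	set cmd to cmd & " " & quoted form of "http://mymemory.translated.net/api/get"
	return cmd
end buildTranslateCommand
```

The response can then be passed to getTextBetweenDelimiters exactly as before.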