Copying text from Safari - Hard spaces?!

I have a problem… If I manually select and copy text from a Safari webpage to the clipboard, I can parse the text with a script I wrote by setting the text item delimiters (TIDs) to {return} and {tab}. Viewed in TextWrangler with invisibles shown, the text contains ordinary spaces, tabs and line returns.
If, however, I use this method:

tell application "Safari"
	set clipboardText2 to text of document 1
end tell

Or this method:

tell application "Safari" to set clipboardText2 to (do JavaScript "\"\"+window.getSelection();" in document 1)

The text now includes HARD (non-breaking) spaces and my script no longer functions properly. I’m guessing this is a function of the HTML code. Without subjecting everyone to my long-winded script, am I missing something fundamental here? I am trying to avoid GUI scripting; however, this little wrinkle is doing my head in!!

This is the section of my script where it now fumbles:

set the text item delimiters to {return}
	set memoContentsPt2 to the text items of memoContentsPt1
	set the text item delimiters to {tab}
	--display dialog memoContentsPt2
	repeat with loopCounter from 1 to count of items in memoContentsPt2
		if the number of words in text item 1 of item loopCounter of memoContentsPt2 > 1 then
			copy words 1 thru -2 of text item 1 of item loopCounter of memoContentsPt2 to surname
		else
			copy text item 1 of item loopCounter of memoContentsPt2 to surname
		end if
		
		copy text item 2 of item loopCounter of memoContentsPt2 to firstname
		copy text item 3 of item loopCounter of memoContentsPt2 to rank
		copy text item 4 of item loopCounter of memoContentsPt2 to myduty
		
		set item loopCounter of memoContentsPt2 to firstname & " " & surname & " " & rank & " " & myduty
		set loopCounter to loopCounter + 1
	end repeat
	
	set the text item delimiters to {return}
	set memoContents to memoContentsPt2 as text
	set crewNames to memoContentsPt2 as list
	
	set text item delimiters to {""}

I can share the whole script and the clipboard text that is causing the issues if I haven’t been clear enough in my request…

Cheers,

Kev

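You can use paragraphs, which splits the text at every return, linefeed or return/linefeed pair without touching the text item delimiters: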

tell application "Safari"
	set clipboardText2 to text of document 1
	set xxx to every paragraph of clipboardText2
end tell
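
As a rough sketch of how paragraphs could replace the {return} delimiter in the parsing loop above: each paragraph is split on tab and rebuilt as "firstname surname rank duty". The sample text and the variable names (sampleText, fields, crewNames) are invented for illustration, and the field order is only assumed from the script in the first post.

```applescript
-- Invented sample data; real input would come from Safari.
set sampleText to "Smith" & tab & "John" & tab & "Capt" & tab & "PIC" & linefeed & "Van Dyke" & tab & "Ann" & tab & "FO" & tab & "SIC"

set oldDelims to AppleScript's text item delimiters
set crewNames to {}
repeat with aLine in paragraphs of sampleText
	-- Split the current line into its tab-separated fields.
	set AppleScript's text item delimiters to {tab}
	set fields to text items of (contents of aLine)
	-- Skip lines that do not have the four assumed fields.
	if (count of fields) >= 4 then
		-- Join the reordered fields with ordinary spaces.
		set AppleScript's text item delimiters to {space}
		set end of crewNames to ({item 2 of fields, item 1 of fields, item 3 of fields, item 4 of fields} as text)
	end if
end repeat
set AppleScript's text item delimiters to oldDelims
get crewNames
```

Because the surname is never split into words here, a hard space inside a name does not break anything; only the tabs between fields matter.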

Hello!

I haven’t come around to it. It is ASCII character 202 (the no-break space in Mac OS Roman, character id 160 in Unicode). I got one into a bash script! :slight_smile:

It is a nasty invisible bugger!

(And hearing you, now I know I can’t fake straight margins in an AppleScript). :smiley:

The bash incident made me seriously ponder converting everything I paste into scripts into 7-bit ASCII … no kidding!

As it seems like one character after the other becomes troublesome!

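The cure is of course to split the text on the hard space and join it again with an ordinary space:

set dirtyText to the clipboard as text
set oldDelims to AppleScript's text item delimiters
-- Split on the hard (no-break) space: ASCII character 202 in Mac OS Roman
set AppleScript's text item delimiters to {ASCII character 202}
set textItems to every text item of dirtyText
-- Join with an ordinary space so neighbouring words do not run together
set AppleScript's text item delimiters to {space}
set cleanText to textItems as text
set AppleScript's text item delimiters to oldDelims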

Thank you McUsr,

I’m using your fix for some of my script; however, it is still not recognizing the {tab}s. This is making the parsing of my web page extremely difficult. I am almost at the stage of using GUI scripting. Argghh!

I have investigated using curl; however, the web page sits behind a password and a curl request forces the page to redirect. I’ve tried --location-trusted and various form entries for username and password to no avail.

I’ll keep ‘code-mining’ and try to find a solution.

Thanks again.

Kev

You just need to figure out the character ids of the different things on a page. Once you know them, you can convert those characters into return or tab characters whenever you need to parse something. Run something like this on some selected text to learn which character ids you will have to deal with. Then a single “text item delimiters” statement can convert all of them for you.

tell application "Safari" to set theText to (do JavaScript "\"\"+window.getSelection();" in document 1)

set chars to characters of theText
set charIDs to {}
repeat with i from 1 to count of chars
	set thisChar to item i of chars
	-- Record each character next to its Unicode id, e.g. a hard space shows up as 160
	set end of charIDs to thisChar & " = " & (id of thisChar)
end repeat
return charIDs
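
Once the ids are known, the conversion step might look something like the sketch below. replaceChars is a hypothetical helper, the sample text is invented, and which ids should become tabs or returns is purely an assumption here; it depends on what the listing above reports for your page. Passing several delimiters at once needs AppleScript 2.1 (Mac OS X 10.6) or later.

```applescript
-- replaceChars is a hypothetical helper, not part of the original script.
on replaceChars(sourceText, searchList, replacement)
	set oldDelims to AppleScript's text item delimiters
	-- A list of several delimiters needs AppleScript 2.1 (Mac OS X 10.6) or later.
	set AppleScript's text item delimiters to searchList
	set parts to text items of sourceText
	set AppleScript's text item delimiters to {replacement}
	set newText to parts as text
	set AppleScript's text item delimiters to oldDelims
	return newText
end replaceChars

-- Invented sample containing no-break spaces (id 160) and a linefeed (id 10).
set theText to "Smith" & (character id 160) & "John" & (character id 10) & "Jones" & (character id 160) & "Ann"
-- Assumption: ids 160 and 8194 act as field separators, id 10 as a line break.
set theText to replaceChars(theText, {(character id 160), (character id 8194)}, tab)
set theText to replaceChars(theText, {(character id 10)}, return)
```

After this, the text can be parsed with the usual {return} and {tab} delimiters, as with text copied by hand.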