Getting text of Safari documents yields strange paragraphs?

Hi!

I’m having a strange issue. When I ask Safari for the text of some web pages, I get text from which I can extract a specific paragraph; basically, every new line is a new paragraph.

On some other pages, like this one:
http://www.csaffluents.qc.ca/ecoles/fiches/006.html

I get two paragraphs, which seem to be delimited by how tables are set in the page. But I cannot get to paragraphs within each block of text, even if within the blocks, I can see there are line feeds or carriage returns…

Tried to convert line feeds or returns using text item delimiters, but they do not seem to be normal line breaks at all…

Any ideas how I can convert the text of such a page to a normal piece of text that responds normally to the “word” or “paragraph” string elements?

Thanks for your ideas…

btw, tried saving this to a file, but it does not save plain text… also tried to get the ktxt class from there, but either I am not doing that right, or it simply doesn’t work.

Post what you are using now. Hard to tell what to fix without that. :)

Sorry about that…

This is what I’m using:

tell application "Safari"
	set URL of document 1 to "http://www.csaffluents.qc.ca/ecoles/fiches/006.html"
	--set URL of document 1 to "http://www.csmv.qc.ca/8repertoire/vieuxlongueuil/001.html"
	set theText to text of document 1
	every paragraph of theText
end tell

If you use the second URL instead of the first, you’ll see that the returned paragraphs are split quite differently, and it is impossible to get a specific piece of information out of that one using (paragraph X of theText).

That’s because on the one site the address is embedded in a

tag, and AppleScript/Safari interprets this as one paragraph.

The line feed is a strange one. Checking the id of that Unicode character, it is 8232, which is U+2028, the Unicode LINE SEPARATOR. AppleScript does not treat it as a paragraph break. I found it like this…

tell application "Safari"
	set URL of document 1 to "http://www.csaffluents.qc.ca/ecoles/fiches/006.html"
	--set URL of document 1 to "http://www.csmv.qc.ca/8repertoire/vieuxlongueuil/001.html"
	set theText to text of document 1
end tell
set paras to every paragraph of theText -- this should not be in the safari tell block because safari is not needed

set a to item 3 of paras
get id of (character 1 of a)
--> 8232
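
If the odd character is not the first one in the paragraph, asking for the id of a whole piece of text shows every code point at once. This is only a small sketch: sampleText is a made-up string standing in for the Safari text, not the real page content.

```applescript
-- sampleText is a hypothetical stand-in for a chunk of the Safari text
set sampleText to "A" & (character id 8232) & "B"

-- For multi-character text, id returns a list with one code point per character
set codePoints to id of sampleText
--> {65, 8232, 66}
```

Anything in that list other than 10 (linefeed) or 13 (return) that appears where you expected a line break is a candidate delimiter.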

So you can use that with text item delimiters to break that down…

tell application "Safari"
	set URL of document 1 to "http://www.csaffluents.qc.ca/ecoles/fiches/006.html"
	--set URL of document 1 to "http://www.csmv.qc.ca/8repertoire/vieuxlongueuil/001.html"
	set theText to text of document 1
end tell
set paras to every paragraph of theText -- this should not be in the safari tell block because safari is not needed

set a to item 3 of paras
set text item delimiters to character id 8232
set b to text items of a
set text item delimiters to ""
return b
--> {"", "", "", "École primaire", "Amédée-Marsan (institutionnelle)", "Édifice Amédée-Marsan", "", "", "Adresse:"}
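
Another option, if you would rather have the “paragraph” element work directly, is to swap every character id 8232 for a linefeed first. The following is a sketch under that assumption; the handler names replaceText and normalizeLineBreaks and the sampleText string are made up for illustration and are not part of Safari or AppleScript.

```applescript
-- Replace every occurrence of searchText in sourceText with replacementText.
on replaceText(sourceText, searchText, replacementText)
	-- Save the current delimiters so the caller's settings are not disturbed
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to searchText
	set textParts to text items of sourceText
	set AppleScript's text item delimiters to replacementText
	set resultText to textParts as text
	set AppleScript's text item delimiters to savedDelimiters
	return resultText
end replaceText

-- Turn Unicode line separators (character id 8232) into ordinary linefeeds.
on normalizeLineBreaks(sourceText)
	return replaceText(sourceText, character id 8232, linefeed)
end normalizeLineBreaks

-- sampleText is a hypothetical stand-in for "text of document 1"
set sampleText to "Adresse:" & (character id 8232) & "Line two" & (character id 8232) & "Line three"
set cleanText to normalizeLineBreaks(sampleText)
set paras to paragraphs of cleanText
--> {"Adresse:", "Line two", "Line three"}
```

In the real script you would pass theText (fetched inside the Safari tell block) to normalizeLineBreaks and then use paragraph X of the result as usual. Saving and restoring the delimiters keeps the handler from affecting the rest of the script.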

Wonderful! Thanks! This will work perfectly.

Nice job Hank!