Extract text inside tags of a string

graemeaustin · May 11, 2006, 9:04pm

Hi, I’m an AppleScript newbie who has searched through this forum and cannot figure out the answer to my problem, so any help will be greatly appreciated.

I have a series of paragraphs from an HTML file and I want to extract the text within the tags, which may or may not have attributes, e.g.

<p>text here</p>

or

<p class="under_score" onclick="void:">text here</p>

My guess is that I need to use a shell call to perl but I cannot figure out what perl to use (haven’t coded in it for about 7 years) nor how to get the results into applescript. So I’m pretty stuck. Naturally I am not wedded to perl. I am doing the coding on 10.4

I would love to be able to show you my script so far but none of the code I have written does much good. Here goes (it’s only a test for proof of concept, ha ha):

set test_result to do shell script "perl << EOF
local $w = 'try';
print 'hi';
EOF"
--set test_result to text of return
display alert test_result & " there"

If I comment out the local line the script runs but with the local line I get a compile error, so I can’t even run a simple piece of perl let alone do the text search.

Many thanks in advance

Graeme

Model: 17" PowerBook G4 100Gb
Browser: Safari 417.9.2
Operating System: Mac OS X (10.4)

Bruce_Phillips · May 11, 2006, 9:18pm

Hopefully this will get you started.

get "<p class=\"under_score\" onclick=\"void:\">text here</p>"
do shell script "echo " & quoted form of result & " | ruby -n -e 'puts $_.gsub(/\\<[A-Za-z0-9_:\\ \\\"\\/=]+\\>/, \"\")'"

Any characters that can appear in the attributes need to be added to the character class in the ruby gsub pattern. (I just made a few changes to my BBCode removal script, which works great. However, BBCode doesn’t allow as many characters as HTML.)

Edit: Actually, this should work better (unless one of the attributes contains a '>' character).

get "<p class=\"under_score\" onclick=\"void:\">text here</p>"
do shell script "echo " & quoted form of result & " | ruby -n -e 'puts $_.gsub(/<.+?>/, \"\")'"

(I had to remember how to stop the expression from being greedy.)

graemeaustin · May 11, 2006, 9:36pm

Bruce

Thanks for that!

I’ve just run it on my laptop and it works.

It’s past my bedtime now but I will start to incorporate it in my code tomorrow (the manipulation of the html file works fine at least for the moment).

Cheers

Graeme


julifos · May 12, 2006, 8:18am

Just for the record, here is an alternate version:

stripHTMLTags("html text")

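to stripHTMLTags(t)
	-- holding the list in a script object property speeds up item access
	script a
		property o : {}
	end script
	-- remember the current delimiters so they can be restored afterwards
	set q to AppleScript's text item delimiters
	-- split the text at every "<"
	set AppleScript's text item delimiters to "<"
	set a's o to t's text items
	-- within each piece, keep only what follows the first ">"
	set AppleScript's text item delimiters to ">"
	repeat with i from 1 to count a's o
		try
			set a's o's item i to a's o's item i's text item 2
		end try
	end repeat
	-- restore the delimiters and join the pieces back together
	set AppleScript's text item delimiters to q
	a's o as string
end stripHTMLTags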

The faster option, though, would be to hand the regexp to the Satimage osax, in case you need some fairly intensive batch processing.
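
For comparison, the same split-on-delimiters idea can be sketched outside AppleScript. This is a rough shell and awk translation of the handler above, not julifos's code: each piece between `<` characters is split on `>` and only the second field is kept. The function name `strip_by_delimiters` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: strip tags by splitting on "<" and then on ">".
strip_by_delimiters() {
  printf '%s' "$1" | awk '
    BEGIN { RS = "<"; ORS = "" }   # one record per piece between "<" characters
    {
      n = split($0, parts, ">")    # parts[1] is the tag, parts[2] the text after it
      if (n >= 2) print parts[2]   # keep only the text that followed the tag
      else        print $0         # no ">" in this piece: keep it unchanged
    }'
}
```

For example, `strip_by_delimiters 'abc <b>bold</b> def'` prints `abc bold def`. Like the handler, it keeps only the second field of each piece, so text containing a stray `>` would be cut short.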

graemeaustin · May 12, 2006, 8:35am

jj: Thanks for that too!

While I have about 40 folders of HTML that need to be translated, I only have to do this as a one-off exercise, so time is not entirely of the essence. My biggest problem is getting something to work right now!

I’ll post a follow-up if I have anything useful to offer the community by way of conclusions.

Cheers

Graeme
