Extract text inside tags of a string

Hi, I’m an AppleScript newbie who has searched through this forum and cannot figure out the answer to my problem, so any and all help will be greatly appreciated.

I have a series of paragraphs from an HTML file and I want to extract the text within the tags, which may or may not have attributes, e.g.

<p>text here</p>

or

<p class="under_score" onclick="void:">text here</p>

My guess is that I need to use a shell call to Perl, but I cannot figure out what Perl to use (I haven’t coded in it for about 7 years) nor how to get the results into AppleScript. So I’m pretty stuck. Naturally, I am not wedded to Perl. I am doing the coding on Mac OS X 10.4.

I would love to be able to show you my script so far but none of the code I have written does much good. Here goes (it’s only a test for proof of concept, ha ha):

set test_result to do shell script "perl << EOF
local $w = 'try';
print 'hi';
EOF"
--set test_result to text of return
display alert test_result & " there"

If I comment out the local line, the script runs, but with the local line I get a compile error, so I can’t even run a simple piece of Perl, let alone do the text search.
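
A likely cause, assuming do shell script hands the whole string to /bin/sh: with an unquoted EOF delimiter, the shell expands $w inside the here-document before perl ever sees it, so perl receives "local  = 'try';" and rejects it. Quoting the delimiter (<< 'EOF') turns that expansion off. The sketch below uses cat in place of perl, purely to show the text an interpreter would receive; the function names are made up for illustration.

```shell
#!/bin/sh
# Unquoted delimiter: the shell substitutes $w before the here-document
# reaches the program reading it. w is empty here, as it would be when
# no shell variable named w has been set.
unquoted_heredoc() {
    w=''
    cat << EOF
local $w = 'try';
EOF
}

# Quoted delimiter: the body is passed through untouched.
quoted_heredoc() {
    cat << 'EOF'
local $w = 'try';
EOF
}

unquoted_heredoc   # prints: local  = 'try';
quoted_heredoc     # prints: local $w = 'try';
```

In the AppleScript above, that would mean starting the string with perl << 'EOF' instead of perl << EOF.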

Many thanks in advance

Graeme

Model: 17" PowerBook G4 100Gb
Browser: Safari 417.9.2
Operating System: Mac OS X (10.4)

Hopefully this will get you started.

get "<p class=\"under_score\" onclick=\"void:\">text here</p>"
do shell script "echo " & quoted form of result & " | ruby -n -e 'puts $_.gsub(/\\<[A-Za-z0-9_:\\ \\\"\\/=]+\\>/, \"\")'"

Any other characters that can appear in the attributes need to be added to the character class in the gsub pattern. (I just made a few changes to my BBCode removal script, which works great. However, BBCode doesn’t allow as many characters as HTML.)

Edit: Actually, this should work better (unless one of the attributes contains a '>' character).

get "<p class=\"under_score\" onclick=\"void:\">text here</p>"
do shell script "echo " & quoted form of result & " | ruby -n -e 'puts $_.gsub(/<.+?>/, \"\")'"

(I had to remember how to stop the expression from being greedy.)
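
The same idea can be sketched without Ruby, assuming a POSIX sed is available to do shell script. sed has no lazy quantifier, so this swaps in a negated character class, <[^>]*>, which also stops at the first '>'. It has the same weakness when an attribute contains a '>' character. The function name strip_tags is made up for illustration.

```shell
#!/bin/sh
# Remove everything from each "<" up to the next ">".
# [^>]* cannot run past a ">", so two tags on one line are removed
# separately and the text between them survives.
strip_tags() {
    printf '%s\n' "$1" | sed -e 's/<[^>]*>//g'
}

strip_tags '<p class="under_score" onclick="void:">text here</p>'
# prints: text here
```

From AppleScript, the pipe would end in sed -e 's/<[^>]*>//g' in place of the ruby command; nothing in that pattern needs extra backslashes inside the AppleScript string.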

Bruce

Thanks for that!

I’ve just run it on my laptop and it works.

It’s past my bedtime now but I will start to incorporate it in my code tomorrow (the manipulation of the html file works fine at least for the moment).

Cheers

Graeme


Just for the record, here is an alternative version:

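stripHTMLTags("html text")

to stripHTMLTags(t)
	-- Keep the list in a script property; items are accessed faster that way
	script a
		property o : {}
	end script
	set q to AppleScript's text item delimiters
	-- Split at "<": every piece after the first begins with a tag's contents
	set AppleScript's text item delimiters to "<"
	set a's o to t's text items
	-- Split each piece at ">" and keep only the text that follows the tag
	set AppleScript's text item delimiters to ">"
	repeat with i from 1 to count a's o
		try
			set a's o's item i to a's o's item i's text item 2
		end try -- a piece with no ">" has no text item 2 and is left as it is
	end repeat
	-- Join with an empty delimiter, then restore the caller's delimiters
	set AppleScript's text item delimiters to ""
	set r to a's o as string
	set AppleScript's text item delimiters to q
	r
end stripHTMLTags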

The faster option, though, would be to hand the regexp to the Satimage osax, in case you need some really intensive batch processing.
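
For a one-off batch run, a plain shell loop may also be enough. This is a minimal sketch, assuming the files end in .html and that writing a .txt file next to each one is acceptable; strip_tree and the folder path in the comment are hypothetical names, and sed stands in for the regexp step.

```shell
#!/bin/sh
# Strip tags from every .html file under the given folder, writing the
# result to a .txt file beside it. sed works line by line, so a tag that
# is split across two lines is not removed.
strip_tree() {
    find "$1" -type f -name '*.html' | while IFS= read -r file; do
        sed -e 's/<[^>]*>//g' "$file" > "${file%.html}.txt"
    done
}

# Hypothetical usage: strip_tree "$HOME/Documents/html_folders"
```

The original files are left untouched, so the run can be repeated if the pattern needs adjusting.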

jj: Thanks for that too!

While I have about 40 folders of HTML that need to be translated, I only have to do this as a one-off exercise, so time is not entirely of the essence. My biggest problem is getting something to work right now!

I’ll post a follow-up if I have anything useful to offer the community by way of conclusions.

Cheers

Graeme
