Help removing HTML from text

I’ve been trying to figure out how to remove HTML code from text. This script I found works pretty well on every example of text I plug into it, except for the one example I need it to work with.

set theText to "Rio Yamasaki stars in this new release. Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>."
set {od, AppleScript's text item delimiters} to ¬
	{AppleScript's text item delimiters, "<"}
set theText to text items of theText
set newText to ""
set AppleScript's text item delimiters to ">"
repeat with anItem in theText
	set newList to text items of anItem
	if (count newList) > 1 then
		set newText to newText & text item 2 of newList
	end if
end repeat
set AppleScript's text item delimiters to od
newText

This script should output “Rio Yamasaki stars in this new release. Great for fans of meganekko.” but instead it just gives me “meganekko.” Can anyone help me tweak it? Every time I change something I break it. :) Why does my script only pick up the text that comes after the first a href link?

Hi,

there’s a very smart way to convert HTML to text using the shell textutil command.
The only disadvantage is that, used this way, textutil reads its input from a file on disk, so a temporary file has to be created.

The script takes your sample line, writes it to disk, converts the HTML code to text and removes the temp file.
You can also test it on HTML read from a web site: just set the sampleText property to false.


property sampleText : true -- set to false to read an example from epguides.com

set temp to ((path to temporary items as Unicode text) & "html_test.txt")
set Ptemp to quoted form of POSIX path of temp

if sampleText then
	-- either create a sample text file
	set theText to "Rio Yamasaki stars in this new release. Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>."
	set ff to open for access file temp with write permission
	set eof of ff to 0 -- truncate first, in case a stale file is left over from an earlier run
	write theText to ff
	close access ff
else
	-- or read a web page with curl
	do shell script "curl http://epguides.com/ATeam/ -o " & Ptemp
end if
-- convert html to txt
do shell script "textutil -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16 " & Ptemp
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile

theText
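
For what it’s worth, the textutil man page also lists -stdin and -stdout options. Assuming they are available in the textutil on your system (I have not checked which OS release introduced them), a sketch like the one below skips the temporary file. The handler name htmlToText is made up for illustration and is not part of the script above.

```applescript
-- Sketch only: assumes this system's textutil supports -stdin and -stdout.
-- htmlToText is a hypothetical helper name.
on htmlToText(theHTML)
	-- printf '%s' passes the text through unchanged; quoted form protects it from the shell.
	set cmd to "printf '%s' " & quoted form of theHTML
	-- Pipe the HTML into textutil and ask for plain text back on stdout.
	set cmd to cmd & " | textutil -stdin -stdout -format html -inputencoding UTF-8 -convert txt -encoding UTF-8"
	-- do shell script returns the command's output as text.
	return do shell script cmd
end htmlToText

htmlToText("Rio Yamasaki stars in this new release. Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>.")
```

Because the text never touches the disk, there is nothing to open, write, or clean up afterwards. The trade-off is that the whole string travels on the command line, so this is better suited to short snippets than to entire web pages.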


Oh, excellent. Thank you very much, this works great!

Hmm, small issue. I’m trying to use this from inside FileMaker Pro 9 Advanced, and it gives me errors when I run the script or try to compile it after pasting it into a “Perform AppleScript” script step. The exact script I’m using is below, and it works perfectly when run from Script Editor. However, it refuses to compile in FileMaker Pro, showing “Expected end of line, found identifier” at the word “theText” in “write theText to ff”. If I comment this line out, it then complains about the word “file” in “set theText to paragraphs of (read file temp as Unicode text)”. Is there anything about the script that would make FileMaker refuse to work with it even though it runs fine elsewhere?

property sampleText : true -- set to false to read an example from epguides.com
set temp to ((path to temporary items as Unicode text) & "html_test.txt")
set Ptemp to quoted form of POSIX path of temp
if sampleText then
	-- either create a sample text file
	set theText to "Got some great new manga for you today, with the gorgeous<a href='/search/mogudan'>Brave Soul</a> artist Mogudan. "
	set ff to open for access file temp with write permission
	write theText to ff
	close access ff
else
	-- or read a web page with curl
	do shell script "curl http://epguides.com/ATeam/ -o " & Ptemp
end if
-- convert html to txt
do shell script "textutil -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16 " & Ptemp
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile
try
	set the clipboard to (theText) as string
on error errMsg
	display dialog errMsg
end try

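You were on the right track. All you were missing was the else clause added below. Without it, any chunk that contains no “>” (such as the text before the first tag) yields only one text item, so it is skipped and never added to newText. This implementation assumes that you don’t care if individual less-than and greater-than signs (not part of HTML tags) are stripped out as well; note that a stray “>” also discards the text in front of it within its chunk.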

set theText to "Rio Yamasaki stars in this new release. Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>."
my stripHTML(theText)

on stripHTML(theText)
	set newText to ""
	set {oldDelim, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "<"}
	set theText to text items of theText
	set AppleScript's text item delimiters to ">"
	repeat with anItem in theText
		set newList to text items of anItem
		if (count newList) > 1 then
			set newText to newText & text item 2 of newList
		else
			set newText to newText & anItem
		end if
	end repeat
	set AppleScript's text item delimiters to oldDelim
	newText
end stripHTML
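
If stray signs do matter, here is one possible variation, offered as a sketch and not a full HTML parser. It assumes that only the text between a “<” and the next “>” is a tag, and it keeps everything else, including any later “>” in the same chunk. The name stripTags is hypothetical, chosen just to keep it apart from stripHTML above.

```applescript
-- Sketch only: stripTags is a hypothetical variant of stripHTML.
on stripTags(theText)
	set oldDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "<"
	set chunks to text items of theText
	-- The first chunk comes before any "<", so it is always plain text.
	set newText to item 1 of chunks
	set AppleScript's text item delimiters to ">"
	repeat with i from 2 to (count chunks)
		set parts to text items of (item i of chunks)
		if (count parts) > 1 then
			-- Part 1 is the tag body; rejoin the rest with ">" so later signs survive.
			set newText to newText & ((items 2 thru -1 of parts) as text)
		else
			-- No closing ">" follows, so treat the "<" as a literal character.
			set newText to newText & "<" & (item i of chunks)
		end if
	end repeat
	set AppleScript's text item delimiters to oldDelims
	return newText
end stripTags

stripTags("Rio Yamasaki stars in this new release. Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>.")
-- → "Rio Yamasaki stars in this new release. Great for fans of meganekko."
```

It still cannot tell a real tag from something like “a < b > c”, which it would reduce to “a  c”, so treat it as a small improvement and not a guarantee.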

Excellent, thanks very much for the tip! Works great!