I’ve been trying to figure out how to remove HTML code from text. This script I found works pretty well on every example of text I plug into it, except for the one example I need it to work with.
set theText to "Rio Yamasaki stars in this new release. Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>."
set {od, AppleScript's text item delimiters} to ¬
{AppleScript's text item delimiters, "<"}
set theText to text items of theText
set newText to ""
set AppleScript's text item delimiters to ">"
repeat with anItem in theText
    set newList to text items of anItem
    if (count newList) > 1 then
        set newText to newText & text item 2 of newList
    end if
end repeat
set AppleScript's text item delimiters to od
newText
This script should output “Rio Yamasaki stars in this new release. Great for fans of meganekko.” but instead it just gives me “meganekko.” Can anyone help me tweak it? Every time I change something I break it. Why does my script only pick up the text that comes after the first a href link?
There’s a very smart way to convert HTML to text using the shell command textutil.
The only disadvantage is that textutil, as used here, reads its input from a file on disk, so a temporary file has to be created.
The script takes your sample line, writes it to disk, converts the HTML code to text, and removes the temp file.
You can also test it on some HTML code read from a web site: just set the sampleText property to false.
property sampleText : true -- set to false to read an example from epguides.com
set temp to ((path to temporary items as Unicode text) & "html_test.txt")
set Ptemp to quoted form of POSIX path of temp
if sampleText then
    -- either create a sample text file
    set theText to "Rio Yamasaki stars in this new release. Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>."
    set ff to open for access file temp with write permission
    set eof of ff to 0 -- clear any leftover contents from an earlier, failed run
    write theText to ff
    close access ff
else
    -- or read a web page with curl
    do shell script "curl http://epguides.com/ATeam/ -o " & Ptemp
end if
-- convert html to txt (the converted text replaces the contents of the temp file)
do shell script "textutil -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16 " & Ptemp
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile
theText
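A possible variation on the script above, offered only as a sketch: current textutil man pages also list -stdin and -stdout options. Assuming your version of textutil has them (check man textutil), the HTML can be piped in and the temporary file skipped. The variable names theHTML and plainText are my own, and the UTF-8 encodings are an assumption that fits the way do shell script passes text.

```applescript
-- Sketch only: assumes textutil supports -stdin and -stdout on this system.
set theHTML to "Rio Yamasaki stars in this new release. Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>."

-- printf sends the text to textutil unchanged; quoted form protects quotes and slashes.
set plainText to do shell script "printf '%s' " & quoted form of theHTML & ¬
	" | textutil -stdin -stdout -format html -inputencoding UTF-8 -convert txt -encoding UTF-8"

plainText
```

This only covers the sample-text case. For a web page, the curl output could be piped to textutil in the same way, using whatever input encoding the page really has.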
Hmm, small issue. I’m trying to use this from inside FileMaker Pro 9 Advanced, and it gives me errors when I try to run the script, or when I paste it into a “Perform AppleScript” script step and compile it. The exact script I’m using is below, and it works perfectly when run from Script Editor. However, it refuses to compile in FileMaker, showing “Expected end of line, found identifier” at the word “theText” in “write theText to ff”. If I comment this line out, it then complains about the word “file” in “set theText to paragraphs of (read file temp as Unicode text)”. Is there anything about the script that would make FileMaker refuse to work with it, even though it works fine elsewhere?
property sampleText : true -- set to false to read an example from epguides.com
set temp to ((path to temporary items as Unicode text) & "html_test.txt")
set Ptemp to quoted form of POSIX path of temp
if sampleText then
    -- either create a sample text file
    set theText to "Got some great new manga for you today, with the gorgeous<a href='/search/mogudan'>Brave Soul</a> artist Mogudan. "
    set ff to open for access file temp with write permission
    write theText to ff
    close access ff
else
    -- or read a web page with curl
    do shell script "curl http://epguides.com/ATeam/ -o " & Ptemp
end if
-- convert html to txt
do shell script "textutil -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16 " & Ptemp
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile
try
    set the clipboard to (theText) as string
on error errMsg
    display dialog errMsg
end try
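You were on the right track. All you were missing was the else clause added below. Without it, any piece of text that contains no “>” (such as everything before the first tag) is silently dropped. This implementation assumes that you don’t care if individual less-than and greater-than signs (not part of HTML tags) are stripped out as well.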
set theText to "Rio Yamasaki stars in this new release. Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>."
my stripHTML(theText)
on stripHTML(theText)
    set newText to ""
    set {oldDelim, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "<"}
    set theText to text items of theText
    set AppleScript's text item delimiters to ">"
    repeat with anItem in theText
        set newList to text items of anItem
        if (count newList) > 1 then
            set newText to newText & text item 2 of newList
        else
            set newText to newText & anItem
        end if
    end repeat
    set AppleScript's text item delimiters to oldDelim
    newText
end stripHTML
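If stray greater-than signs do matter to you, here is one possible variation, as a sketch rather than a definitive fix. Instead of keeping only the second piece of each chunk, it keeps everything after the first “>” and rejoins it, and it puts back a “<” that never gets a closing “>”. The handler name stripTags is my own, not from the scripts above. It is still not a real HTML parser: a lone “<” followed later by a lone “>” will still be treated as a tag.

```applescript
-- stripTags is a hypothetical handler name used for illustration only.
on stripTags(theText)
	set oldDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "<"
	set chunks to text items of theText
	set AppleScript's text item delimiters to ">"
	set newText to ""
	repeat with i from 1 to (count chunks)
		set chunk to item i of chunks
		set parts to text items of chunk
		if i = 1 then
			-- Text before the first "<" is never inside a tag.
			set newText to newText & chunk
		else if (count parts) < 2 then
			-- A "<" with no closing ">": keep it as literal text.
			set newText to newText & "<" & chunk
		else
			-- Drop the tag body (item 1); rejoin the rest with ">" so later ">" signs survive.
			set newText to newText & ((items 2 thru -1 of parts) as text)
		end if
	end repeat
	set AppleScript's text item delimiters to oldDelims
	return newText
end stripTags

stripTags("Great for fans of <a href=\"/search/glasss_girl\">meganekko</a>. 5 > 3")
-- result: "Great for fans of meganekko. 5 > 3"
```

The rejoin works because the delimiters are still set to “>” when the list is coerced back to text.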