webpage and text item delimiters

I have been using this script, which I found in this post, to get text from a webpage:
http://bbs.applescript.net/viewtopic.php?id=13384

set myHtml to do shell script "curl http://www.wpr.org/book/lastweek.html"

set {myDels, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"</head>"}}
set myText to text item 2 of myHtml
set AppleScript's text item delimiters to myDels
set myText to remove_markup(myText)

on remove_markup(this_text)
	set myAllowed to {"://", "@"}
	set {myTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"<"}}
	set these_items to text items of this_text
	set AppleScript's text item delimiters to {">"}
	repeat with my_item in these_items
		if length of text items of my_item is greater than 1 then
			if (text item 1 of my_item) is not in myAllowed then
				set my_item's contents to (text item 2 of my_item)
			else
				set my_item's contents to ("<" & (text items of my_item)) as text
			end if
		end if
	end repeat
	set AppleScript's text item delimiters to myTID
	set clean_text to these_items as text
	
	return clean_text
end remove_markup
set newtext to remove_markup()

The text returns a lot of &nbsp;

and the newtext command gives the error

I don’t understand why this doesn’t work. Isn’t it declared as one variable on the call statement?
I have tried using this script to remove the text
set thisText1 to these_items as text
set AppleScript's text item delimiters to " "
set thisText1 to thisText1's text items
set AppleScript's text item delimiters to ""
set thisText1 to "" & thisText1
set AppleScript's text item delimiters to {""}
return thisText1
But this doesn’t work inside the subroutine, and I can’t call it outside of the subroutine.
Why?

What is it you actually want to find? This article might help.

When you want to trace the errors in a handler as short as this one, make it inline:


set myHtml to do shell script "curl http://www.wpr.org/book/lastweek.html"

set {myDels, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"</head>"}}
set myText to text item 2 of myHtml
set AppleScript's text item delimiters to myDels
set this_text to myText
--set myText to remove_markup(myText)

--on remove_markup(this_text)
set myAllowed to {"://", "@"}
set {myTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"<"}}
set these_items to text items of this_text
set AppleScript's text item delimiters to {">"}
repeat with my_item in these_items
	if length of text items of my_item is greater than 1 then
		if (text item 1 of my_item) is not in myAllowed then
			set my_item's contents to (text item 2 of my_item)
		else
			set my_item's contents to ("<" & (text items of my_item)) as text
		end if
	end if
end repeat
set AppleScript's text item delimiters to myTID
set clean_text to these_items as text
--> "


mmLoadMenus();

  
    
    
    
    
    
    
    
  
  
     
    
        
        
        
        
      
        
        
          
            
          
          
                         
          
          
                         
          
        
        
          
          
          
        
        

      
      
        Winner of the        
64th Annual
            Peabody Award
            for Radio
    Programming!       
    
    
     
      
         
           
            Author! 
              Author: Great Writers on Great Books
              A Four Part Series from TTBOOK!
          
        
      
      
        Public Radio International
       
        
      Wisconsin Public Radio
      Sunday Mornings on 
        a Satellite Near You!
      
      more info
    
     
  
  
    
    from Wisconsin Public Radio 
     
  
  
    
     
      
         
          
             Listen! 
          
           
            AUTHOR! AUTHOR!
            Part III: Kids' Lit
           
          
        
         
          
             Listen! 
          
           
        
         
           
          LIFE 
            AND DEATH IN IRAQ 
           
        
         
           
            
          
        
        
           
            
          
           
            Next 
              Week 
            
          
           
        
      
    
  
  
     
     
     
    
        Wisconsin Public Radio is a service of the Wisconsin 
        Educational Communications Board and the University 
        of Wisconsin-Madison Extension. 
      Page design and management 
        by Jim Fleming at Wisconsin Public Radio 
        and Sarah Fleming.
      © 2007 WHA Radio 
        and the Board of Regents of the University 
        of Wisconsin System. All rights reserved. 
       
     
  


"
--return clean_text
--end remove_markup
--set newtext to remove_markup()

When I run this script I don’t get the result shown above. I get this:

Why do I get the &nbsp;?

I don’t have a clue. :rolleyes:

nbsp is a non-breaking space. It’s used in HTML for things like the leading spaces of paragraphs.
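
If the entities survive in the cleaned text literally as "&nbsp;", one possible way to swap them for ordinary spaces is another text item delimiters round trip. This is only a sketch: the handler name replace_text and the sample string are made up for illustration and do not come from the original script.

```applescript
-- Hypothetical helper: replace every occurrence of search_string in this_text
on replace_text(this_text, search_string, replacement_string)
	-- Save the current delimiters and split the text on the search string
	set {oldTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, search_string}
	set the_items to text items of this_text
	-- Join the pieces again with the replacement between them
	set AppleScript's text item delimiters to replacement_string
	set new_text to the_items as text
	-- Always restore the delimiters that were in force before
	set AppleScript's text item delimiters to oldTIDs
	return new_text
end replace_text

set cleaned to replace_text("Wisconsin&nbsp;Public&nbsp;Radio", "&nbsp;", " ")
--> "Wisconsin Public Radio"
```

The same handler could be called on the result of remove_markup, for example replace_text(myText, "&nbsp;", " ").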

gl,

But why does it show up on his machine and not on mine, given the same code?

The first line of the “handler statement” (all the text from on remove_markup(this_text) to end remove_markup) shows that the handler expects to be passed one value when it’s called.

Your “call statement” to that handler (set newtext to remove_markup()) has nothing in its parameter list, so it’s not passing any values to the handler. Hence the error.

From the fact that you’ve put the handler statement in the middle of the running code, I’d guess you haven’t appreciated that a handler is a discrete block of code that’s only run when it’s called. The variable(s) mentioned in its first line (‘this_text’ in this case) are there to receive the values passed by the call and are set when the handler’s called. They’re local to the handler and can’t be seen in the script outside. More specifically, they’re local to that particular execution of the handler. If the handler’s called again, or if it calls itself recursively, the values of the parameter variables only apply to the current call. (Each execution has its own set of variables with those names.) The same goes for any other variables used in the handler that aren’t explicitly declared globals or properties.
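
A small sketch of that scoping rule, using a made-up handler called shout (not part of the script under discussion). The top-level variable and the handler’s parameter share a name but are separate variables:

```applescript
-- Top-level variable with the same name as the handler's parameter
set this_text to "outside"
set shouted to shout("inside")

-- The handler changed only its own this_text, not the top-level one
set scope_demo to {this_text, shouted}
--> {"outside", "inside!"}

on shout(this_text)
	-- Here this_text is the parameter: local to this particular call
	set this_text to this_text & "!"
	return this_text
end shout
```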

If you want a handler to return a value, use a ‘return’ statement as you’ve done above. Otherwise, the handler will return the result of the last statement executed in it.
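
To illustrate the two ways a handler can hand back a value, here is a minimal sketch with two hypothetical handlers (explicit_sum and implicit_sum are names invented for this example):

```applescript
on explicit_sum(a, b)
	-- An explicit 'return' statement
	return a + b
end explicit_sum

on implicit_sum(a, b)
	-- No 'return': the result of the last statement executed is returned
	a + b
end implicit_sum

set return_demo to {explicit_sum(2, 3), implicit_sum(2, 3)}
--> {5, 5}
```

The explicit form is usually easier to follow, since it does not depend on which statement happens to run last.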

This is all relevant to your other thread, “placing text in a handler”.

It’s neater and less confusing not to put handler statements in the middle of the run code, but to have them all at one end of the script, either at the beginning or at the end:

set myHtml to do shell script "curl http://www.wpr.org/book/lastweek.html"

set {myDels, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"</head>"}}
set myText to text item 2 of myHtml
set AppleScript's text item delimiters to myDels
set myText to remove_markup(myText)
set newtext to remove_markup("I can't think of anything to <say>")

-- Handler statement(s) at one end of the script.
on remove_markup(this_text)
	set myAllowed to {"://", "@"}
	set {myTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, {"<"}}
	set these_items to text items of this_text
	set AppleScript's text item delimiters to {">"}
	repeat with my_item in these_items
		if length of text items of my_item is greater than 1 then
			if (text item 1 of my_item) is not in myAllowed then
				set my_item's contents to (text item 2 of my_item)
			else
				set my_item's contents to ("<" & (text items of my_item)) as text
			end if
		end if
	end repeat
	set AppleScript's text item delimiters to myTID
	set clean_text to these_items as text
	
	return clean_text
end remove_markup

Hi,

just for sport, here is a different approach to extracting the text from the website:

property a1 : «data utxtFFFC» as Unicode text
property a2 : «data utxtFFFC2028» as Unicode text
property a3 : «data utxtFFFCFFFC» as Unicode text
property a4 : «data utxt00A0» as Unicode text
property a5 : «data utxt2028» as Unicode text
property a6 : «data utxtFFFC00202028» as Unicode text
property a7 : «data utxtFFFC2028FFFC2028» as Unicode text

set temp to ((path to temporary items as Unicode text) & "wpr.txt") -- define tempfile
set Ptemp to quoted form of POSIX path of temp
do shell script "curl http://www.wpr.org/book/lastweek.html -o " & Ptemp
do shell script "textutil -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16 " & Ptemp -- convert html to txt
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile

set newText to {}
repeat with i in theText
	tell contents of i
		if {it} is in {a1, a2, a3, a4} or it is "" or it begins with a7 then
			-- do nothing
		else if it ends with a1 then
			set end of newText to text 1 thru -2 of it
		else if it begins with a6 then
			set end of newText to text 4 thru -1 of it
		else if it begins with a2 or it begins with (a1 & space) then
			set end of newText to text 3 thru -1 of it
		else if it begins with a1 or it begins with space or it begins with a5 then
			set end of newText to text 2 thru -1 of it
		else
			set end of newText to it
		end if
	end tell
end repeat
set {TID, text item delimiters} to {text item delimiters, return}
set newText to newText as Unicode text
set text item delimiters to TID
newText

StefanK,

Is there a definite reason for having a1 - a7 set as properties, as opposed to normal variables?

The reason I ask is to be able to make that code into a self-contained subroutine that I could reuse easily from one script to the next, without splitting it up by having separate properties. I tried it and it worked…but I didn’t know if there was a more long term reason that one attempt did not show.

The main reason is to save time. The values are predefined when you run the script.
But you can also use normal variables.
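
As a rough sketch of the "normal variables" route, a self-contained handler can set its special characters as locals each time it is called. The handler name trim_leading_junk is invented for this example, it only covers leading characters rather than the whole filter, and it assumes AppleScript 2.0 (Leopard) or later for the character id syntax, where the original script uses «data utxt…» properties instead:

```applescript
on trim_leading_junk(theLine)
	-- Local variables instead of properties: they are set again on every call
	set nbsp to character id 160 -- non-breaking space (U+00A0)
	set lineSep to character id 8232 -- line separator (U+2028)
	-- Drop leading non-breaking spaces, line separators and plain spaces
	repeat while (theLine begins with nbsp) or (theLine begins with lineSep) or (theLine begins with space)
		if (count theLine) is 1 then return ""
		set theLine to text 2 thru -1 of theLine
	end repeat
	return theLine
end trim_leading_junk

set trimmed to trim_leading_junk("  Listen!")
--> "Listen!"
```

The cost is that the two characters are rebuilt on each call, which should only matter if the handler runs a great many times.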
If you access sites other than www.wpr.org, you can probably skip the filter lines. For example, with apple.com:


set temp to ((path to temporary items as Unicode text) & "wpr.txt") -- define tempfile
set Ptemp to quoted form of POSIX path of temp
do shell script "curl http://www.apple.com/ -o " & Ptemp
do shell script "textutil -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16 " & Ptemp -- convert html to txt
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile
set {TID, text item delimiters} to {text item delimiters, return}
set theText to theText as Unicode text
set text item delimiters to TID
theText

Note: in Leopard you don’t even need the temp file, because textutil can convert from stdin to stdout.
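
A minimal sketch of that idea, assuming textutil’s -stdin and -stdout options as found in Leopard and later. The handler name text_of_page is made up for this example, the -inputencoding value is simply carried over from the script above, and the output is requested as UTF-8 because that is what do shell script expects to read back:

```applescript
on text_of_page(theURL)
	-- Pipe curl's output straight into textutil: no temp file needed
	set cmd to "curl -s " & quoted form of theURL & ¬
		" | textutil -stdin -stdout -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-8"
	return do shell script cmd
end text_of_page
```

Calling text_of_page("http://www.apple.com/") should then return the page text directly, without the read and rm steps.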

If you happen to own a copy of DEVONagent or DEVONthink, then the following code is a very convenient way to extract (rich) text from a given website. I use this a lot in our company when it comes to automated patent analysis:


tell application "DEVONagent"
	set htmlcode to download markup from "http://en.wikipedia.org/wiki/Steve_Wozniak"
	set sitetext to get rich text of htmlcode
end tell

If you only own DEVONthink, then just replace DEVONagent with DEVONthink in the above code example.