curl downloading icon for images

I looked at some scripts on this forum for downloading a newspaper and wrote one for myself.

tell me to activate
set thedisplay to display dialog "Enter the articles range" default answer ""
--dialog box will collect the article number to start with for each page and maximum articles to be downloaded for each page
--for example, specifying "1,8" in the dialog box will download first 8 articles on each of the 24 pages
set theresult to the text returned of thedisplay
set AppleScript's text item delimiters to ","
set thestarting to the first text item of theresult
set maxarticles to the second text item of theresult
property maxpages : 24
property destfolderpath : "5:ET:"
set theprefix to "http://epaper.timesofindia.com/Default/Layout/Includes/ETNEW/ArtWin.asp?From=Archive&Source=Page&Skin=ETNEW&BaseHref=ETM"
set thedate to my thedatestring()
set thesuffix1 to "&ViewMode=HTML&GZ=T&PageLabel=1&EntityId=Ar0"
-- & pagenumber & articlenumber &
set thesuffix2 to "&AppName=1"

repeat with i from thestarting to maxarticles
	set articlenumber to my createarticlenumber(i)
	repeat with j from 1 to maxpages
		set pagenumber to my createpagenumber(j)
		set theURL to theprefix & thedate & thesuffix1 & pagenumber & articlenumber & thesuffix2
		set filename to my createfilename(articlenumber, pagenumber)
		set filepath to destfolderpath & filename
		set qtdposixfilepath to quoted form of POSIX path of filepath
		set command to "curl " & quoted form of theURL & " -o " & qtdposixfilepath
		--set the end of thelist to theURL
		try
			do shell script command
		on error e
			log e
		end try
	end repeat
end repeat
--thelist

on thedatestring()
	set command to "date \"+%Y/%m/%d/\""
	set todaysdatestring to do shell script command
	set AppleScript's text item delimiters to "/"
	set theyear to first text item of todaysdatestring
	set themonth to second text item of todaysdatestring
	set theday to third text item of todaysdatestring
	set theform to "%2F"
	set thefinal to theform & theyear & theform & themonth & theform & theday & theform
	return thefinal
end thedatestring

on createpagenumber(i)
	return text -2 thru -1 of ("0" & i as text)
end createpagenumber

on createarticlenumber(i)
	return text -2 thru -1 of ("0" & i as text)
end createarticlenumber

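on createfilename(articlenum, pagenum)
	-- parameters are named in the order the caller passes them (article first, then page)
	-- the "/" in this HFS file name shows up as ":" in the POSIX path
	set command to "date \"+%d/%m\""
	set datestring to do shell script command
	set filename to datestring & "-" & pagenum & "_" & articlenum & ".txt"
	return filename
end createfilename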
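
For anyone who wants to check the URL pattern outside AppleScript, here is a minimal shell sketch of the same construction. The function names `date_fragment`, `pad2`, and `build_url` are made up for illustration, and the page/article values in the last line are arbitrary; the prefix and suffixes are copied from the script above.

```shell
#!/bin/sh
# Sketch only: rebuilds one article URL the way the AppleScript above does.
# date_fragment, pad2 and build_url are hypothetical helper names.

prefix='http://epaper.timesofindia.com/Default/Layout/Includes/ETNEW/ArtWin.asp?From=Archive&Source=Page&Skin=ETNEW&BaseHref=ETM'
suffix1='&ViewMode=HTML&GZ=T&PageLabel=1&EntityId=Ar0'
suffix2='&AppName=1'

# "%%" is a literal percent sign in a date format, so a single call
# prints today's date as %2FYYYY%2FMM%2FDD%2F.
date_fragment() {
    date '+%%2F%Y%%2F%m%%2F%d%%2F'
}

# Zero-pad a plain integer (no leading zeros in the input) to two digits.
pad2() {
    printf '%02d' "$1"
}

# Usage: build_url PAGE ARTICLE
build_url() {
    printf '%s%s%s%s%s%s\n' "$prefix" "$(date_fragment)" "$suffix1" \
        "$(pad2 "$1")" "$(pad2 "$2")" "$suffix2"
}

# Example: page 3, article 7
build_url 3 7
```

Printing the URL first and pasting it into a browser is a quick way to confirm that a given page/article pair exists before handing it to curl.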

I do not know the format or encoding of the file, though I have given it a “.txt” extension. As far as I know, curl downloads only the URL supplied to it and not the URLs referenced inside that page, so it does not download images. I do not want images either, but an icon shows up in place of each image when I open the file in TextEdit. I wrote an AppleScript to convert the files to text only (i.e. to remove the image icons) by writing:

However, when I saved the file, my original file: http://files.getdropbox.com/u/872430/01%3A08-01_01(original).txt was converted to this file: http://files.getdropbox.com/u/872430/01%3A08-01_01.txt

I really don’t know what the new file is or why it was created; my guess is that it is something related to encoding.

I would like curl to download only the text, without even the icons for images. If that is not possible, I would like to at least save the downloaded files as text only by running another AppleScript.

Thanks.

Hi

The files are in HTML format.

You can extract the text with one of these commands:

set command to "curl " & quoted form of theURL & " | textutil -stdin -convert rtf -format html -output  " & qtdposixfilepath -- RTF file

set command to "curl " & quoted form of theURL & " | textutil -stdin -convert txt -format html -output  " & qtdposixfilepath -- Plain text

Thanks again, Jacques.

If I use your script as it is, the files created look good in a text editor on Mac OS, but on the iPod “Ú Ú Ú Ú” is shown. Also, punctuation like " " " has to be removed, which happens when the files are encoded as UTF-8 with BOM.
I want to convert these files to UTF-8 with BOM encoding.

does not work. It gives a file with undesired characters like “Ú Ú Ú Ú”. TextWrangler solves my problem when I use

set properties of front document to {line breaks:Unix, encoding:"Unicode™ (UTF-8)"}

but I don’t want to open every document in TextWrangler to convert it. I tried its command-line utility as well

It also opens TextWrangler, but it does not encode as UTF-8 with BOM; the result is plain UTF-8.
Google has failed me on “encoding to UTF-8 with BOM”.
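
One possible approach, sketched here under the assumption that the text files are already valid UTF-8: a BOM is just the three bytes EF BB BF at the very start of the file, so it can be prepended from the shell without opening any editor. The helper name `add_bom` is made up for this example, and whether the iPod then displays the text correctly is untested.

```shell
#!/bin/sh
# Sketch only: prepend a UTF-8 byte order mark (EF BB BF) to a file.
# add_bom is a hypothetical helper name. Usage: add_bom FILE
# Assumes FILE is already valid UTF-8; it does not convert encodings.
add_bom() {
    file=$1

    # Read the first three bytes as hex and skip files that already have a BOM.
    first=$(head -c 3 "$file" | od -An -tx1 | tr -d ' \t\n')
    if [ "$first" = "efbbbf" ]; then
        return 0
    fi

    tmp=$(mktemp) || return 1
    # \357\273\277 are the octal escapes for the bytes EF BB BF.
    { printf '\357\273\277'; cat "$file"; } > "$tmp" || { rm -f "$tmp"; return 1; }

    # Write back in place so the original file keeps its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

From AppleScript the same idea could be run per file through `do shell script`, passing the quoted POSIX path the download loop already builds.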