Download newspaper

I prefer to listen to my newspapers rather than read them, and an electronic copy is also easier to store, so I download one every day. I would like to automate this task because it is very time-consuming: I have to click on every article manually to download it.
Here is a sample URL:

http://epaper.business-standard.com/bsepaper/pdf/2008/10/02/20081002aA001101020.pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0

The items that change every day are:

  1. “2008/10/02/20081002” changes to the current date in “yyyy/mm/dd/yyyymmdd” format

  2. a A 001 1010 20
    here, “a” never changes
    “A” changes to a random capital letter (e.g. “B”, “J” etc.)
    “001” is the page number, so if the page number is 12 then it would be “012”
    “1010” never changes
    “20” is the article number on that page, so if the article number is 6 then it is “06”
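
The naming scheme above can be sketched as a pair of small handlers. This is only an illustration of the pattern as described in this post: the handler names zeroPad and buildArticleURL are made up for the example, and the URL layout is assumed to stay exactly as in the sample URL.

```applescript
-- Hypothetical helper: left-pad a number with zeros to a fixed width.
on zeroPad(n, width)
	set s to n as text
	repeat while (count of s) < width
		set s to "0" & s
	end repeat
	return s
end zeroPad

-- Hypothetical helper: assemble the article URL from its parts.
-- dateString is expected in "yyyy/mm/dd/yyyymmdd" form.
on buildArticleURL(dateString, randomLetter, pageNum, articleNum)
	set baseURL to "http://epaper.business-standard.com/bsepaper/pdf/"
	set viewerOptions to "#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"
	-- "a" and "1010" are the parts that never change
	return baseURL & dateString & "a" & randomLetter & my zeroPad(pageNum, 3) & "1010" & my zeroPad(articleNum, 2) & ".pdf" & viewerOptions
end buildArticleURL
```

Calling buildArticleURL("2008/10/02/20081002", "A", 1, 20) should reproduce the sample URL given above.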

I tried this:

set the target_URL to "http://epaper.business-standard.com/bsepaper/pdf/2008/10/02/20081002aA001101020.pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"
set the destination_file to ((path to desktop as string) & "a.pdf")
tell application "URL Access Scripting"
	download target_URL to destination_file
end tell

and it works for a single file. (I think a login is compulsory before downloading, but once I log in from Safari I can use this script.)

Can someone please help me set up the target_URL correctly?

Thanks in advance.


Can it be done?

Yes. I have often written such scripts for my former employer to automatically download electroplating patents from the internet into a large database containing acid copper plating patent information.

For example, the first part, the date string, could be generated with code like the following:


set command to "date \"+%Y/%m/%d/%Y%m%d\""
set datestring to do shell script command
-- "2008/10/03/20081003"

The basic questions are how to find out the random letter and which article and page numbers can be used. Of course you can just try to download a PDF file at a generated URL, and if it fails (try block), then there is no such article or page. That works, but it is not efficient.
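
As a rough sketch of this trial-and-error approach: the handler below attempts one download inside a try block and reports whether it succeeded. It assumes curl is used for the download (not URL Access Scripting), and the handler name tryDownload is made up for the example. The -f option makes curl return an error for HTTP failures such as 404 instead of saving the server's error page, and -s keeps it quiet.

```applescript
-- Hypothetical helper: returns true if the file could be downloaded, false otherwise.
-- fileURL is the URL to fetch, posixDest is a POSIX path such as "/Users/lance/Desktop/a.pdf".
on tryDownload(fileURL, posixDest)
	try
		-- curl exits with a non-zero status on failure, which do shell script turns into an error
		do shell script "curl -f -s " & quoted form of fileURL & " -o " & quoted form of posixDest
		return true
	on error
		return false
	end try
end tryDownload
```

A loop over page and article numbers could call this handler and simply move on whenever it returns false.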

So it might be better to write a script that downloads the article while you are reading it in Safari. You start the script, it reads the URL in the frontmost Safari window and generates the necessary URLs from this source URL. That way the script already «knows» the article number and the random letter in use.

Other questions are how and where to save the PDF files (a single PDF file per page, or all pages of an article combined into one PDF file; which location?).
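
A minimal sketch of the parsing step, assuming the file name always follows the layout of the sample URL (8 date digits, "a", the random letter, 3 page digits, "1010", 2 article digits). The handler name parseArticleURL and the record labels are made up for this example.

```applescript
-- Hypothetical helper: split an article URL into its variable parts.
on parseArticleURL(srcURL)
	-- isolate the file name (everything after the last slash)
	set oldDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "/"
	set fileName to last text item of srcURL
	set AppleScript's text item delimiters to oldDelims
	-- positions follow the assumed layout: yyyymmdd a X ppp 1010 nn
	return {datePart:text 1 thru 8 of fileName, randomLetter:character 10 of fileName, pageNumber:text 11 thru 13 of fileName, articleNumber:text 18 thru 19 of fileName}
end parseArticleURL
```

The source URL itself could be read with something like: tell application "Safari" to set srcURL to URL of front document. That line is not part of the sketch above, so the handler can be tried without Safari.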

But yes, it can be done :smiley:



set baseURL to "http://epaper.business-standard.com/bsepaper/pdf/2008/10/03/20081003a_01210100"
set urlOptions to ".pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"
tell application "URL Access Scripting"
	repeat with x from 1 to 9
		-- build the URL and the destination inside the loop so that each article gets its own file
		set the target_URL to baseURL & x & urlOptions
		set the destination_file to ((path to desktop as string) & "b" & x & ".pdf")
		download target_URL to destination_file
	end repeat
end tell

I don't know why the random letter was just “_” for all pages today.
I set the date once manually and kept changing the page number by hand; as you can see, I have a counter for the article number.

I would not mind the failed downloads, because I wrote a script to delete such files.

set ifolder to alias "Leopard:Users:lance:Desktop"
set dfolder to alias "Leopard:Users:lance:Desktop:for deleting"

tell application "Finder"
	set allitems to every item of ifolder
	repeat with aItem in allitems
		get size of aItem
		set z to the result
		if z is less than 4096 and name extension of aItem is "pdf" then
			move aItem to dfolder
		end if
	end repeat
end tell

Please help me set up a script so that I don't have to change the page number manually. Let us say all I want is the first 12 articles of each page.

Do you mean I can use a wildcard character in place of that single random character?
Please tell me how to set a wildcard character.

Hi Lance,

I just wrote you some sample code, but I could not test it, as I don’t have access to this newspaper website. It first asks you for today's random article character and then tries to download the articles to your desktop. I have written the code in a way that is (hopefully) easy to understand AND modify. I know that it is not perfect yet, but maybe it is a shot in the right direction :smiley:


property mytitle : "Paper-O-Mat"
property maxarticles : 20
property maxpages : 20
property destfolderpath : "Leopard:Users:lance:Desktop:"

property starturl : "http://epaper.business-standard.com/bsepaper/pdf/"
property endurl : ".pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"

on run
	set todaysrandomchar to my askfortodaysrandomchar()
	if todaysrandomchar is missing value then
		return
	end if
	set todaysdatestring to my gettodaysdatestring()
	repeat with i from 1 to maxarticles
		set articlenumber to my createarticlenumber(i)
		repeat with j from 1 to maxpages
			set pagenumber to my createpagenumber(j)
			set pageurl to starturl & todaysdatestring & "a" & todaysrandomchar & pagenumber & "1010" & articlenumber & endurl
			set filename to my createfilename(articlenumber, pagenumber)
			set filepath to destfolderpath & filename
			try
				tell application "URL Access Scripting"
					download pageurl to filepath
				end tell
			end try
		end repeat
	end repeat
end run

on askfortodaysrandomchar()
	try
		tell me
			activate
			display dialog "Please enter today's random character:" default answer "" buttons {"Cancel", "Enter"} default button 2 with title mytitle
			set dlgresult to result
		end tell
		set usrinput to text returned of dlgresult
		return usrinput
	on error
		return missing value
	end try
end askfortodaysrandomchar

on gettodaysdatestring()
	set command to "date \"+%Y/%m/%d/%Y%m%d\""
	set todaysdatestring to do shell script command
	return todaysdatestring
end gettodaysdatestring

on createarticlenumber(i)
	if i < 10 then
		return ("0" & i) as Unicode text
	else
		return (i as Unicode text)
	end if
end createarticlenumber

on createpagenumber(i)
	if i < 10 then
		return ("00" & i) as Unicode text
	else if i = 10 or (i > 10 and i < 100) then
		return ("0" & i)
	else if i = 100 or i > 100 then
		return (i as Unicode text)
	end if
end createpagenumber

on createfilename(articlenum, pagenum)
	set command to "date \"+%Y%m%d\""
	set datestring to do shell script command
	set filename to datestring & "_" & articlenum & "_" & pagenum & ".pdf"
end createfilename

Hi,

A little note:

The createpagenumber handler could be replaced by:


on createpagenumber(i)
	return text -3 thru -1 of ("00" & i as text)
end createpagenumber

Even the as text coercion is not actually necessary: when a number is concatenated to a string,
AppleScript implicitly coerces the number to text.

Thanks so much for this great piece of code. So simple and elegant. Perfect!

Thank you, Martin Michel, for your efforts. You gave me what I asked for, but sorry, I did not ask for the right thing.
The images (i.e. advertisements) take up a lot of space, so my newspaper comes to some 100 MB (I downloaded just the first 5 items with your script, stopped it, and it was already 10 MB). Is there a way to get the size of a file first and then skip the download if it exceeds 300 KB?
Whenever I download manually with Safari, I see the file size before the download finishes (e.g. downloading 1 KB of 125 KB), so would that be possible with AppleScript too?
Thanks again for your efforts.
(If you want, I can send you the username and password by private message.)

Hi Lance,

I am currently in a hurry, so unfortunately I don’t have time to post code, but maybe I can give you some hints in the right direction. If you switch from downloading files with URL Access Scripting to «curl», then you could also use its arguments to control the file size:

Or

The forums contain a lot of examples of how to use curl in AppleScripts to download files.

Now heading for a beloved family reunion :lol:

I was just checking whether I could download with curl, so I modified the above script like this:

property maxarticles : 20
property maxpages : 10
.....
set articlenumber to my createarticlenumber(i)
		set errorcnt to 0
......
			try
				do shell script "curl " & quoted form of pageurl & " -o " & quoted form of destfolderpath
			on error
				set errorcnt to errorcnt + 1
			end try
......
if errorcnt > 0 then display dialog (errorcnt as text) & " errors while downloading files" buttons {"Done"} default button 1
end run

I get an error like “10 errors while downloading files” and nothing gets downloaded. Any ideas why?
Also, please help me set up the “max-filesize”.

Stop!!
I removed everything related to the “error” display to see the actual error.
The problem is with the website: it is still showing the newspaper dated 4th October instead of 5th October. I mean, it has not uploaded the 5th October newspaper yet.

Hi Lance,

If you are using the curl command to download files, then you need to use (quoted) POSIX paths instead of Mac paths:

Mac path: "Macintosh HD:Users:lance:Desktop:article.pdf"
POSIX path: "/Users/lance/Desktop/article.pdf"

You can create a POSIX path from a Mac path like this:


set macpath to "Macintosh HD:Users:lance:Desktop:article.pdf"
set posixpath to Posix path of macpath
-- quoting the path
set posixpath to quoted form of posixpath

If you want to control the maximum file size of a download, you can use the following curl syntax:


set command to "curl http://www.irs.gov/pub/irs-pdf/fw4.pdf -o /Users/lance/Desktop/test.pdf --max-filesize 20000"
try
	do shell script command
end try

Please note that the file size is given in bytes:

1 MB = 1024 KB = 1048576 bytes
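
To check the size before downloading, as asked earlier, one possible approach is to request only the HTTP headers with curl -I and read the Content-Length value. This is an untested sketch: it assumes the newspaper server actually sends a Content-Length header (not every server does), and the handler names contentLengthFromHeaders and remoteFileSize are made up for the example.

```applescript
-- Hypothetical helper: extract the Content-Length value (in bytes) from raw HTTP header text.
-- Returns missing value if no usable Content-Length line is found.
on contentLengthFromHeaders(headerText)
	repeat with aLine in paragraphs of headerText
		set lineText to contents of aLine
		-- AppleScript text comparisons ignore case by default
		if lineText starts with "Content-Length:" then
			try
				-- "Content-Length:" is 15 characters long, the value follows it
				return (word 1 of (text 16 thru -1 of lineText)) as integer
			end try
		end if
	end repeat
	return missing value
end contentLengthFromHeaders

-- Hypothetical helper: ask the server for the headers only (-I) without progress output (-s).
on remoteFileSize(fileURL)
	set headerText to do shell script "curl -s -I " & quoted form of fileURL
	return my contentLengthFromHeaders(headerText)
end remoteFileSize
```

A script could then compare the result of remoteFileSize(pageurl) with 307200 (300 KB) and skip the download when the value is larger, falling back to --max-filesize when missing value comes back.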

Hi Martin! Good to see you back so soon. I hope you had a great time!!

What about the URLs? Do I have to change them to POSIX too?
I get a weird, really long error (which I have uploaded here to save forum space: http://www.mediafire.com/?hcmmwj9htdw) when I did this:


on run
set destfolderpath to "Leopard:Users:lance:Desktop:"
set posixpath to POSIX path of destfolderpath
set posixpathq to quoted form of posixpath

set filepath to posixpathq & filename
try
do shell script "curl " & quoted form of pageurl & posixpath

Also, will this have to be changed?
set filename to datestring & "_" & articlenum & "_" & pagenum & ".pdf"

No, only the Mac paths have to be modified in order to work with curl (or any other command line tool).

I have modified the initial script to use curl instead of URL Access Scripting. Have a look at how it works.


property mytitle : "Paper-O-Mat"
-- 307200 bytes = 300 KB
property maxfilesize : "307200"
property maxarticles : 20
property maxpages : 20
property destfolderpath : "Leopard:Users:lance:Desktop:"

property starturl : "http://epaper.business-standard.com/bsepaper/pdf/"
property endurl : ".pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"

on run
	set todaysrandomchar to my askfortodaysrandomchar()
	if todaysrandomchar is missing value then
		return
	end if
	set todaysdatestring to my gettodaysdatestring()
	repeat with i from 1 to maxarticles
		set articlenumber to my createarticlenumber(i)
		repeat with j from 1 to maxpages
			set pagenumber to my createpagenumber(j)
			set pageurl to starturl & todaysdatestring & "a" & todaysrandomchar & pagenumber & "1010" & articlenumber & endurl
			set filename to my createfilename(articlenumber, pagenumber)
			set filepath to destfolderpath & filename
			set qtdposixfilepath to quoted form of POSIX path of filepath
			set command to "curl " & pageurl & " -o " & qtdposixfilepath & " --max-filesize " & maxfilesize
			return
			-- alternative command (also quoted form of pageurl):
			--set command to "curl " & quoted form of pageurl & " -o " & qtdposixfilepath & " --max-filesize 307200"
			try
				do shell script command
			end try
		end repeat
	end repeat
end run

on askfortodaysrandomchar()
	try
		tell me
			activate
			display dialog "Please enter today's random character:" default answer "" buttons {"Cancel", "Enter"} default button 2 with title mytitle
			set dlgresult to result
		end tell
		set usrinput to text returned of dlgresult
		return usrinput
	on error
		return missing value
	end try
end askfortodaysrandomchar

on gettodaysdatestring()
	set command to "date \"+%Y/%m/%d/%Y%m%d\""
	set todaysdatestring to do shell script command
	return todaysdatestring
end gettodaysdatestring

on createarticlenumber(i)
	if i < 10 then
		return ("0" & i) as Unicode text
	else
		return (i as Unicode text)
	end if
end createarticlenumber

on createpagenumber(i)
	if i < 10 then
		return ("00" & i) as Unicode text
	else if i = 10 or (i > 10 and i < 100) then
		return ("0" & i)
	else if i = 100 or i > 100 then
		return (i as Unicode text)
	end if
end createpagenumber

on createfilename(articlenum, pagenum)
	set command to "date \"+%Y%m%d\""
	set datestring to do shell script command
	set filename to datestring & "_" & articlenum & "_" & pagenum & ".pdf"
end createfilename

Thanks, Martin.
I assumed I only had to run it and not modify it.
It does nothing except ask for the random character and accept it.
No errors, nothing in the result.
Please try to run it yourself if possible.
I have sent you a PM.

Hi Lance,

Yes, the script code contained an error. But the following code worked like a charm on my Mac. I also added a «log» statement for any errors that occur, which you can see when you activate the «Event Log» in Script Editor.


property mytitle : "Paper-O-Mat"
-- 307200 bytes = 300 KB
property maxfilesize : "307200"
property maxarticles : 20
property maxpages : 20
property destfolderpath : "Leopard:Users:lance:Desktop:"

property starturl : "http://epaper.business-standard.com/bsepaper/pdf/"
property endurl : ".pdf#zoom=130&statusbar=0&messages=0&toolbars=0&navpanes=0"

on run
	set todaysrandomchar to my askfortodaysrandomchar()
	if todaysrandomchar is missing value then
		return
	end if
	set todaysdatestring to my gettodaysdatestring()
	repeat with i from 1 to maxarticles
		set articlenumber to my createarticlenumber(i)
		repeat with j from 1 to maxpages
			set pagenumber to my createpagenumber(j)
			set pageurl to starturl & todaysdatestring & "a" & todaysrandomchar & pagenumber & "1010" & articlenumber & endurl
			set filename to my createfilename(articlenumber, pagenumber)
			set filepath to destfolderpath & filename
			set qtdposixfilepath to quoted form of POSIX path of filepath
			set command to "curl " & quoted form of pageurl & " -o " & qtdposixfilepath & " --max-filesize " & maxfilesize
			try
				do shell script command
			on error e
				log e
			end try
		end repeat
	end repeat
end run

on askfortodaysrandomchar()
	try
		tell me
			activate
			display dialog "Please enter today's random character:" default answer "" buttons {"Cancel", "Enter"} default button 2 with title mytitle
			set dlgresult to result
		end tell
		set usrinput to text returned of dlgresult
		return usrinput
	on error
		return missing value
	end try
end askfortodaysrandomchar

on gettodaysdatestring()
	set command to "date \"+%Y/%m/%d/%Y%m%d\""
	set todaysdatestring to do shell script command
	return todaysdatestring
end gettodaysdatestring

on createarticlenumber(i)
	if i < 10 then
		return ("0" & i) as Unicode text
	else
		return (i as Unicode text)
	end if
end createarticlenumber

on createpagenumber(i)
	if i < 10 then
		return ("00" & i) as Unicode text
	else if i = 10 or (i > 10 and i < 100) then
		return ("0" & i)
	else if i = 100 or i > 100 then
		return (i as Unicode text)
	end if
end createpagenumber

on createfilename(articlenum, pagenum)
	set command to "date \"+%Y%m%d\""
	set datestring to do shell script command
	set filename to datestring & "_" & articlenum & "_" & pagenum & ".pdf"
end createfilename

same here.

Thanks a lot, Martin