Reading a web page

rvamerongen · June 18, 2010, 8:45pm

Hi
I am trying to read a web page that is full of text and has a .txt extension in the URL.

The page contains just text, but the Safari (5) inspector shows it wrapped in <pre … followed by the text… The View Source menu item is then not available.

Reading an HTML page works well, I have done this many times, but I don't know how to read that text page into a variable.


set theURL to baseURL & eachYear & "-" & eachMonth & ".txt"
				
log {"1 url", theURL}
try
	make new document with properties {URL:theURL}
	delay 2
	set theWholeSource to source of front document as string
	log {"2 theWholeSource", theWholeSource}
	set lengthSource to number of characters in theWholeSource
	log {"3 lengthSource", lengthSource}
on error
	display dialog "There was an Error with loading the source from the URL" with icon 0
end try

How should I do this?

Ray_Barber · June 18, 2010, 9:11pm

Using cURL you can download anything from the web.

set myURL to "http://www.example.com/text.txt"
set mySource to do shell script "curl " & quoted form of myURL

Hope it helps,
ief2
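As a rough sketch of the same idea on the shell side, the function below builds the monthly URL the way the original script does (base, year, "-", month, ".txt") and lets curl report failures through its exit status. The name fetch_month and its argument order are assumptions made for illustration; the flags are standard curl options (-f fail on HTTP errors, -s silent, -S still show errors, -L follow redirects).

```shell
#!/bin/sh
# fetch_month: hypothetical helper, not part of the original script.
# $1 = base URL, $2 = year, $3 = month
# Prints the downloaded text on stdout; returns non-zero if curl fails.
fetch_month() {
    curl -fsSL "${1}${2}-${3}.txt"
}

# Usage (the URL is a placeholder):
#   if text=$(fetch_month "http://www.example.com/" 2010 06); then
#       printf '%s\n' "$text"
#   else
#       echo "There was an error loading the URL" >&2
#   fi
```

From AppleScript this could be called through do shell script, with each argument passed through quoted form of.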

rvamerongen · June 18, 2010, 9:42pm

Thank you very much, this is good stuff. There is no load time, as there is in Safari.

But if there is a way to do this in Safari, I would still like to know, because later in my code I need the page in Safari to create other things with JavaScript.

McUsr · June 18, 2010, 9:56pm

Hello.

It should be pretty easy to remove the <pre> tags that hinder you, now that you have downloaded the source. Then save the source as HTML, open it with Safari, and process the DOM tree.

Best Regards

McUsr

rvamerongen · June 18, 2010, 10:05pm

The problem in Safari is that there is no source. The result of


set theWholeSource to source of front document as string

is empty. So even if I put it back in Safari it is still empty, or I would have to create an HTML header, body, etc…

McUsr · June 18, 2010, 10:16pm

Hello.

Yes, but if the text you downloaded is just plain text, what do you intend to do with it anyway?

Maybe you don’t need Safari for that?

Or could you specify exactly what you intend to do with the downloaded source?

Best Regards

McUsr

rvamerongen · June 19, 2010, 12:03am

Thank you for your reply.

I have to add some XML in some parts, and in other parts I have to put HTML around the text.
Both parts should be visually checked before being saved.

Before that can happen, I have to split the txt file into parts.

Where I should split depends on the following pattern.

So I have to search for the first words ‘q55’, ‘q67:’, ‘ww3:’ and ‘qw4’ on four consecutive lines and then split off that part up to the next occurrence of those four lines. Each part could be anywhere from 10 to 500 lines long before the next pattern of four lines.

Because those parts could be repeated almost 150 times, with 50 to 80 text files daily, I am searching for the quickest way to do the splitting.

Any idea how to do this split with the quickest pattern search?

After that I will do one more split for the XML, but I guess that will not be that difficult.

Thank you very much.

McUsr · June 19, 2010, 12:34am

Hello.

If I have not misunderstood you, you are looking for a run of four lines starting, in order, with ‘q55’, ‘q67:’, ‘ww3:’ and ‘qw4’. Then comes the text you want to extract. This runs until the next pattern of four lines starting with the aforementioned “codes”.

Sed is definitely the fastest pattern matcher around. But this is a very complex pattern. It might be faster to write
a snippet in C than to get the regexp right :lol:, which I can do for you for this pattern, given the sheer volume of the input.
The codes will then be considered “atomic” in that they won’t change. (Hardcoded in the C snippet.)

I foresee that the program will leave you with some files, for example with #1, #2 and so on added to the filenames
if given an initial filename. The utility could also get the input from a stream.

I’ll give sed a try first, so don’t expect anything except that you will have a working solution within 48 hours.

Please tell me more specifically what the resulting files should look like. We also need a work folder.

It would be nice if you elaborated a little bit more on the workflow.
Should, for instance, the resulting text files be deleted before the next job, or should a new working folder be created,
and so on.

Best Regards

McUsr

rvamerongen · June 19, 2010, 1:28am

Hi McUsr

Thank you for your reply.

I have the XML part ready.

I will also try to read the sed pages and figure things out.

The files should be saved with a filename taken from what comes after the ww3: up to the end of the line (max 36 characters).
The converted HTML and XML results will become just one (1) file for each split-off element/part, to keep data and metadata together. It will be read into an app with a text web view and some text fields for the metadata.
That one-file part is also just finished.
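The naming rule above could be sketched roughly like this, assuming each part has already been written to its own file. The function name part_name, the stripping of carriage returns, and the replacement of "/" with "_" are assumptions added for illustration; note that on some systems cut -c counts bytes rather than characters, which matters for non-ASCII titles.

```shell
#!/bin/sh
# part_name: hypothetical helper that prints a file name for one part.
# $1 = path to a file holding a single part
# Takes the text after the first "ww3:" up to the end of the line,
# replaces "/" (not allowed in file names) and keeps at most 36 characters.
part_name() {
    sed -n 's/^ww3:[[:space:]]*//p' "$1" |
        head -n 1 |
        tr -d '\r' |
        tr '/' '_' |
        cut -c 1-36
}

# Usage (the paths are placeholders):
#   mv "part#1.txt" "$(part_name "part#1.txt").txt"
```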

It’s very kind of you to offer to create a working solution. However, for now I am very happy just to have the basics, which will get me started and teach me a lot.

Also, with this number of files I will test the speed once I have something basic; otherwise I will have to write something in C or Obj-C.

As for checking the layout, I have to create some kind of template layer, because checking it visually would drive a person crazy.

good night

McUsr · June 19, 2010, 10:10am

Sure that is most funny.

As for figuring out the regexp with sed: Google “Sed Towers of Hanoi”. That is a (very heavy) example, but it uses the registers, should you need that.

I still don’t see exactly what you are up to, but if you are going to process the file and not split it, it can be done with sed. If you are splitting the files into parts, that is a much more difficult story.

I’d rather go directly for a solution in C (raw), since it is really an easy task. Maybe, just for the fun of it, I’d use awk as an intermediate solution, and check whether that worked (and was speedy enough) before writing the C code.
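A minimal awk sketch of such an intermediate solution might look like the following. It is not the C utility discussed above, only one possible reading of the pattern: a new part starts at every run of four consecutive lines beginning with q55, q67:, ww3: and qw4, and the four header lines are kept at the top of their part. The function name split_parts and the prefix#N.txt naming are assumptions; any text before the first header ends up in prefix#0.txt.

```shell
#!/bin/sh
# split_parts: hypothetical helper built on awk.
# $1 = input file ("-" reads standard input), $2 = output prefix
split_parts() {
    awk -v prefix="$2" '
    BEGIN {
        # The four codes are hardcoded and must appear in this order.
        code[1] = "q55"; code[2] = "q67:"; code[3] = "ww3:"; code[4] = "qw4"
        held = 0; part = 0
        out = prefix "#0.txt"   # receives any text before the first header
    }
    function emit(line) { print line > out }
    # Write out header candidates that are being held back.
    function flush_held(   i) {
        for (i = 1; i <= held; i++) emit(buf[i])
        held = 0
    }
    {
        # Does this line start with the next expected code?
        if (index($0, code[held + 1]) == 1) {
            buf[++held] = $0
            if (held == 4) {        # full header seen: start a new part
                close(out)
                part++
                out = prefix "#" part ".txt"
                flush_held()
            }
            next
        }
        # The run was broken, so the held lines are ordinary text.
        flush_held()
        # The current line may itself begin a new run.
        if (index($0, code[1]) == 1) { buf[++held] = $0; next }
        emit($0)
    }
    END { flush_held() }
    ' "$1"
}

# Usage (the paths are placeholders):
#   split_parts 2010-06.txt work/2010-06
#   curl -fsSL "$url" | split_parts - work/2010-06
```

Because the codes are matched as plain prefixes, a line such as q555 would also count as a q55 line; a stricter test would be needed if that can occur in the data.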

It would be interesting to see your final solution.

Best Regards

McUsr