Find specific tagged string in html doc and write to new txt file

Given a FOLDER named “HTML” that contains html files… and for each file in HTML folder…

Find text string between the

tags shown below and output that text as a single line to a newly created text file.
Repeat for each html file in the folder appending found text to the end of the text doc.

Here’s the tag to work on…

GRAB THIS TEXT FOUND BETWEEN THE TAGS, WRITE IT TO TEXT FILE

(I have a set of several hundred html files for which I need a text listing of article titles.)

Thanks in advance!

Not sure what it is you want exactly. Are you looking for help or an already made solution? What did you try so far?

A few pointers: grep would help tremendously and is exactly what you need in this case, but it is not easy to use from AppleScript. You need a shell script that runs each line through the grep command-line utility.

If you search for this you’ll find other scripts that will point you in the right direction.

Good luck!

Browser: Safari 6533.18.5
Operating System: Mac OS X (10.7)

Hello.

Try this. You can of course save the document containing the titles somewhere. :)

set f to choose folder with prompt "Choose folder with blog entries..."
set p to quoted form of POSIX path of f
do shell script "cd " & p & " ; (cat *.html |sed -n 's/\\(<[hH]1 [cC][lL][aA][sS][sS]=\"[bB][lL][oO][gG]-[eE][nN][tT][rR][yY][-][tT][iI][tT][lL][eE]\">\\)\\(.*\\)\\(<[/][hH]1>\\)/\\2/p' ) |open -f"
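Since the original request was for a text file rather than a TextEdit window, here is a minimal sketch of the same sed approach that appends the titles to a file instead. It assumes each title sits on a single line as <h1 class="blog-entry-title">...</h1>, which is the tag the pattern above matches; the function name extract_titles and the output file name are made up for illustration. Unlike the one-liner above, it also drops any text before or after the tag on the matching line.

```shell
#!/bin/sh
# Sketch: append the title found in each .html file to a text file,
# one title per line.
# Assumption: the title is on one line, wrapped in
# <h1 class="blog-entry-title"> ... </h1> (any letter case).

extract_titles() {
    dir=$1   # folder containing the .html files
    out=$2   # text file that receives the titles (hypothetical name)
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue   # skip if the glob matched nothing
        # Print only the text between the opening and closing tag.
        sed -n 's/.*<[hH]1 [cC][lL][aA][sS][sS]="[bB][lL][oO][gG]-[eE][nN][tT][rR][yY]-[tT][iI][tT][lL][eE]">\(.*\)<\/[hH]1>.*/\1/p' "$f" >> "$out"
    done
}

# Example call (paths are placeholders):
#   extract_titles "$HOME/HTML" "$HOME/titles.txt"
```

Because the output is appended with >>, delete or rename an old titles file before re-running, otherwise the list will contain duplicates.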

Sorry, I had thought I had explained my need but upon re-reading I see how vague my post is. Apologies.

I was hoping for a complete solution as I know just enough applescript (or shell) to be dangerous. But any help/suggestion or link to similar script is appreciated.

I have a folder with hundreds of HTML files which are blog entries. My client wants a simple text listing of the articles by title. A script would be a much better solution than typing up several hundred article names.

The “title” text that I need to “grab” in each html file is a “title” tag such as this one:


So for each html file, I need the text found in between

and the next tag which is

Once that text string (the Title) is grabbed, write it to a plain text (.txt) file, appending it to the end of the file.

The sample code provided runs but all it does for me is open the first file in a window and that’s it.

Hello.

Then you should look at the titles in the other blog entries (also for anything uppercase in the tags). When I select a folder that contains the entries, all the titles show up in the TextEdit window, at least in my case.

Understood… Thanks so much! Will investigate title chars…

Hello.

I have made the snippet in post #4 case insensitive. Please look for other discrepancies as well, with regards to spaces, spelling, quotes, etc.

ok… still stops… I’m investigating … Thank you VERY MUCH! Appreciate your time and knowledge. I will use your shell for learning too…

Hello.

It really shouldn’t stop. If it stops, then the script doesn’t have read access to your blog entry files, which may very well be the case if the blog entries are hosted by a local Apache server or something similar, that is, they are owned by a different user. Please do an ls -l to check the rights of the html files.

The files I have been testing on have rights like so: -rw-r--r--, and are owned by me, and I belong to the “staff” group.

The directory has the rights: drwxr-xr-x (ls -ld ).

Also try a whoami to see that you are the same user as the owner of the blog entries.
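The permission checks above can be rolled into one small sketch, assuming a POSIX shell. The function name list_unreadable is hypothetical; it prints every .html file in a folder that the current user cannot read, so no output means read permissions are not the problem.

```shell
#!/bin/sh
# Sketch: list the .html files in a folder that the current user
# cannot read. Prints nothing when every file is readable.

list_unreadable() {
    dir=$1   # folder containing the .html files
    for f in "$dir"/*.html; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        [ -r "$f" ] || printf '%s\n' "$f"
    done
}

# Example call (path is a placeholder):
#   list_unreadable "$HOME/HTML"
```

Note that when run as root this reports nothing, since root can read every file.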

If the users differ, we’ll do it all by shell, so we don’t interfere with your client’s blog system.

ok… I’m red faced… I maintain the web site so the html files are local on my Mac… not on the server.

I try not to do stuff like this on a live site…

I’m sorry… Since I was in an AppleScript forum I guess I assumed you’d know the files were local, and it was a VERY POOR assumption on my part.

I’m really embarrassed! I’ve run php, cgi, etc… how would I run this against a shared server?

Hello.

No reason to be red-faced. :)

Run it locally then, and perform the tests I described, so that we can pinpoint the problem.

The version below runs the shell command as root, so there should be no problems regarding privileges. But if the blog is hosted by a web server, I urge you to stop the server while the script is executing.

set f to choose folder with prompt "Choose folder with blog entries..."
set p to quoted form of POSIX path of f
do shell script "cd " & p & " ; (cat *.html |sed -n 's/\\(<[hH]1 [cC][lL][aA][sS][sS]=\"[bB][lL][oO][gG]-[eE][nN][tT][rR][yY][-][tT][iI][tT][lL][eE]\">\\)\\(.*\\)\\(<[/][hH]1>\\)/\\2/p' ) |open -f" with administrator privileges

And, if this doesn’t work, then either the tag is missing from the rest of the files, or written in another way. The files may of course also be damaged in some sort of way.
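To find out which files lack the tag or write it differently, something like the following sketch might help. It assumes the same <h1 class="blog-entry-title">...</h1> pattern as the script above, on a single line; the function name list_missing_title is made up for illustration. It prints the path of every .html file in which that pattern does not match.

```shell
#!/bin/sh
# Sketch: print every .html file that has no line matching
# <h1 class="blog-entry-title"> ... </h1> (case-insensitive).
# Files listed here are the ones the sed pattern would skip.

list_missing_title() {
    dir=$1   # folder containing the .html files
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue   # skip if the glob matched nothing
        grep -qi '<h1 class="blog-entry-title">.*</h1>' "$f" || printf '%s\n' "$f"
    done
}

# Example call (path is a placeholder):
#   list_missing_title "$HOME/HTML"
```

Opening a couple of the listed files in a text editor should show how their title tag differs, for example single quotes, extra attributes, or a title that wraps onto a second line.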

You can also cd into the directory containing the html files and run: for i in *.html; do file "$i" ; done
to check that they are indeed not binary files but are listed as some kind of ASCII text.

Error:
“The document xxxx could not be opened. You don’t have permission.”

This happened after it asked for my admin password and I provided it. As info, I am the only user of this Mac and use that admin password all the time… (so yes, it’s the correct password) - Folder with files is at root (user) level.

Hello

OK, this might be a little bit complex. First of all, is your account an administrator account?

If you have a separate administrator account, then that is the username and password to use.

Yes… admin… I use it almost daily for the usual software updates, installs, etc, etc.

Get info shows “ME” with Read and Write priv…

Hello, if you used that with the script I just wrote in post #11 above, then I think you need to enable the root user.

Please follow the instructions in This Apple Technical Note that applies for your version of OS X.

Then retry the script in post #11. Make sure that you enabled the root user for the account that has administrator privileges, and use that username and password when you run the script.

Followed instructions to the letter in tech note and set a root password… no help… same error msg. I set “admin” permissions of Read and Write for the folder and “all files enclosed” - no help… still same error. Restarted Mac… Re-logged in… no help.

perhaps I’ll just move the extra file set to the server and use perl…

Hello.

Just for the hell of it, stand in the folder with your html files in a Terminal window, and issue this command:

sudo (cat *.html |sed -n 's/\(<[hH]1 [cC][lL][aA][sS][sS]="[bB][lL][oO][gG]-[eE][nN][tT][rR][yY][-][tT][iI][tT][lL][eE]">\)\(.*\)\(<[/][hH]1>\)/\2/p' ) |open -f

Does that work?

Edit
And you can execute that in the directory with the blog entries that serves your blog as well (server-side), while you are logged on as an Administrator with root privileges.

syntax error as shown below

Hello. I am sorry about that.

Save this as a shell script somewhere. It should be saved with Unix line feeds, as UTF-8 without a BOM. Then reveal the saved file in a Finder window.

#!/bin/bash
(cat *.html |sed -n 's/\(<[hH]1 [cC][lL][aA][sS][sS]="[bB][lL][oO][gG]-[eE][nN][tT][rR][yY][-][tT][iI][tT][lL][eE]">\)\(.*\)\(<[/][hH]1>\)/\2/p' ) |open -f
Type chmod gu+x followed by a space in a Terminal window, drag the shell script onto the window, then hit Enter.

Then type sudo and drag the shell script over once again, before you hit enter. (You should of course be standing in the directory where your html files are located.)