filtering story from news web pages

ephramz · April 10, 2005, 12:53am

Does anyone have a script or an app that will pull out only the main text of a news story web page, based on the HTML? This is the overall scenario: I have a scriptable RSS newsreader, like NetNewsWire, which will direct me to the web pages of all the news stories in the feeds I like. I want those web pages (not just the headlines and first paragraph in the feed) converted to MP3s through text-to-speech synthesis, so I can listen to them on my MP3 player whenever I have time away from the computer.

The only problem is that there’s so much extraneous text on a news web page that I don’t want spoken, so I need a script to strip it out. On some sites I can look for the “print” or “printer friendly version” link (e.g. on the New York Times website), but not all of them have one. An alternative approach might be to look for the largest block of text on the page, although this might not always work, since some articles are split into many parts with a pull quote in between, or are broken across many short lines, such as:
http://www.nature.com/news/2005/050404/full/050404-10.html
Pages like that are very hard to deal with.
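
To make the text-block idea concrete, here is a rough sketch in Python using only the standard library's html.parser. It is not a tested solution for any particular site: the tag lists, the BlockCollector class, the extract_story function and the min_words threshold are all illustrative assumptions. Instead of keeping only the single largest block, it keeps every block above a minimum word count, so a story split by a short pull quote still comes through in order, while the pull quote and the navigation links are dropped.

```python
from html.parser import HTMLParser

# Tags whose contents are never story text (assumption: tune per site).
SKIP_TAGS = {"script", "style", "noscript"}
# Tags treated as block boundaries (assumption: a deliberately small set).
BLOCK_TAGS = {"p", "div", "td", "li", "blockquote", "h1", "h2", "h3", "br"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks, in document order
        self._current = []    # text fragments of the block being built
        self._skip_depth = 0  # > 0 while inside a SKIP_TAGS element

    def _flush(self):
        # Collapse runs of whitespace and store the block if it is non-empty.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any text left over at the end of the page


def extract_story(html, min_words=20):
    """Return the text blocks that have at least min_words words."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if len(b.split()) >= min_words]
```

Joining the returned list with blank lines would give the text to hand to the speech synthesizer. Taking max(parser.blocks, key=len) instead would give the plain "largest block" variant. The word-count threshold is a guess: a long pull quote or a wordy footer will still get through, and a story written in very short lines will be lost, so pages like the Nature example would likely need site-specific handling.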

Anyone have any tips, ideas, or previous attempts on this?

Thanks