get list of words in online dictionary

moosmahna · August 22, 2015, 9:09am

Hi
Is it possible to get a list of all words in an online dictionary that start with “y”?
I want to use this dictionary: www.dude.de
I searched a lot but could not find a way to get this idea working.

Thanks a lot!!!
Kind Regards

moosmahna · August 22, 2015, 9:14am

Sorry: www.duden.de

moosmahna · August 22, 2015, 9:37am

I found this:

curl -s -S http://www.duden.de/suchen/dudenonline/ä*/[1-814] | grep ' - Artikel anzeigen"' | sed -r -e 's/.*Artikel anzeigen">//' -e 's/<.*//' | grep ^[äÄ]

This searches for all words starting with “ä” or “Ä”.
But when I try to run this in the terminal I get this message:

sed: illegal option -- r
usage: sed script [-Ealn] [-i extension] [file ...]
       sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]

Nigel_Garvey · August 22, 2015, 6:25pm

Hi.

The immediate cause of the error is that Mac OS’s ‘sed’ implementation doesn’t have a -r option. On other systems, this option’s something to do with extended regex, so maybe you need -E instead, although I don’t see anything in the code that requires it.
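A minimal sketch of the `-E` substitution, runnable in Terminal. It assumes the two substitutions in the posted one-liner were meant to be `s/.*Artikel anzeigen">//` and `s/<.*//`. The sample line and the `strip_markup` name are made up for illustration and do not come from duden.de.

```shell
# Hypothetical sample line imitating the older page markup the one-liner targeted.
sample='<a href="/x" title="Yacht - Artikel anzeigen">Yacht</a>'

# Same two substitutions as the posted command, with -E in place of -r.
strip_markup() {
  sed -E -e 's/.*Artikel anzeigen">//' -e 's/<.*//'
}

result=$(printf '%s\n' "$sample" | strip_markup)
printf '%s\n' "$result"
```

Neither expression uses extended-regex syntax, so dropping the option altogether should behave the same way here.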

It also appears that the page at the URL in your ‘curl’ command doesn’t contain the string “Artikel anzeigen”, so the first ‘grep’ command isn’t passing anything on to ‘sed’ anyway.

I’m afraid I can’t be much more help than this as I don’t understand what you’re trying to get off the page.

moosmahna · August 24, 2015, 7:01am

Hi Nigel
Thanks for your answer. I’ll try to explain my idea.
On www.duden.de (Online-Dictionary) you can search for words and the meaning of the words. But I want to have a list of all words that are found. E.g. if I search for “A” (http://www.duden.de/suchen/dudenonline/A) I get the results. But I need the words only: A-Dur, A-Jugend, A-Jugendliche, A-Jugendlicher, A-Kohle…

I hope you can help me
Thanks a lot again

Nigel_Garvey · August 24, 2015, 2:56pm

OK. This shell script works in AppleScript:

Edit: These scripts are now superseded by the one in post #10 below.

do shell script "curl http://www.duden.de/suchen/dudenonline/A | sed -En '
/<h2>/ {
	s||\\'$'\\n''|g
	s|^[^[:cntrl:]]+\\n||
	s|</h2>[^[:cntrl:]]+||g
	s|<[^>]+>||g
	p
}'"

As does the straight-line version:

do shell script "curl http://www.duden.de/suchen/dudenonline/A | sed -En '/<h2>/ { s||\\'$'\\n''|g ; s|^[^[:cntrl:]]+\\n||; s|</h2>[^[:cntrl:]]+||g ; s|<[^>]+>||g ; p ; }'"

However, the equivalent code copied into Terminal produces the wrong parts of the dictionary entries, for reasons I don’t understand as I never use Terminal myself.

In the HTML for Duden’s Web pages, the word heading each entry is between <h2> and </h2> tags. The ‘sed’ code above seeks out each HTML line containing “<h2>”, replaces every instance of this tag with a linefeed, deletes everything up to and including the first of the linefeeds, deletes everything from each “</h2>” tag up to the linefeed following it (or to the end of the original line), deletes all remaining tags, and outputs whatever’s left!
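The steps just described can be tried in isolation on a single line of HTML. This is only a sketch: the sample line and the `extract_headwords` name are invented for illustration, and the newline is written as a backslash followed by a literal line break rather than the `$'\n'` trick used inside `do shell script`, so that it does not depend on which shell runs it.

```shell
# Hypothetical one-line HTML sample with two entries (not real duden.de markup).
sample='<div><h2>A-Dur</h2><p>Substantiv</p><h2>A-Jugend</h2><p>Substantiv</p></div>'

extract_headwords() {
  sed -En '/<h2>/ {
    # 1. Turn every <h2> into a linefeed (backslash + literal newline).
    s|<h2>|\
|g
    # 2. Delete everything up to and including the first linefeed.
    s|^[^[:cntrl:]]+\n||
    # 3. Delete from each </h2> up to the next linefeed or the end of the text.
    s|</h2>[^[:cntrl:]]+||g
    # 4. Delete any remaining tags, then print what is left.
    s|<[^>]+>||g
    p
  }'
}

words=$(printf '%s\n' "$sample" | extract_headwords)
printf '%s\n' "$words"
```

`[^[:cntrl:]]` stands for "anything except a control character", which is what stops each deletion at the inserted linefeeds.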

The code you originally posted seems to be for an earlier version of the Web site. Obviously the code here could break too if the site design changes again.

Outstanding issues are:

There are several pages returned for each search. Do you want the words from all of them?
Many of the words returned by the search don’t actually begin with the search letter or even contain it! Do you just want words beginning with the string you specify?

moosmahna · August 24, 2015, 4:50pm

Hi Nigel
Thanks a lot for your help!!!

About the Outstanding issues:

Yes, I want to have the complete words from all pages, if possible.
Do you just want words beginning with the string you specify? YES

Thanks a lot again
Kind Regards

Nigel_Garvey · August 24, 2015, 6:11pm

Hi moosmahna.

Glad you like it so far. I’ve found out why I was getting different results in Script Editor and Terminal. The HTML text actually contains soft-hyphen characters, which were visible in Terminal, but not in Script Editor’s result pane. I’m doing more work on the script at the moment. It now deletes the soft hyphens and words which don’t begin with the search character and I’m just about to tackle the multi-page aspect. I’ll post the new version when it’s ready…
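A rough sketch of those two clean-up steps as a Terminal pipeline, not the finished script. It assumes the pages are UTF-8, where the soft hyphen (U+00AD) is the byte pair C2 AD. The `filter_words` name and the sample words are invented for illustration.

```shell
# UTF-8 bytes of U+00AD SOFT HYPHEN, built with octal escapes.
shy=$(printf '\302\255')

# $1 is the search prefix. Strip soft hyphens first, then keep only the
# lines that start with the prefix, ignoring case.
filter_words() {
  sed "s/$shy//g" | grep -i "^$1"
}

# Hypothetical input: one word with an embedded soft hyphen, one non-match.
words=$(printf 'Ab\302\255bau\nDur\nA-Dur\n' | filter_words A)
printf '%s\n' "$words"
```

Removing the soft hyphens before filtering matters, because an invisible character inside the prefix would otherwise stop a word from matching.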

moosmahna · August 24, 2015, 6:24pm

Thank You very, very much for your time to help me!!!

Nigel_Garvey · August 25, 2015, 1:02pm

Hi moosmahna.

Sorry this has taken so long. My attempts to do all the parsing with sed kept getting thwarted by incompatibilities between sed’s limitations and the site’s vagaries. The script now uses sed to extract and clean up the words from the HTML, but grep to decide which words actually begin with the search string. When run on my set-up at home, it takes nearly 18 minutes to get 1341 “words” beginning with “A” from 1030 Duden pages.

on getWords(searchString)
	-- Fetch the first page, extract the number of the last page from it, and derive the ending for a curl-format multi-URL. Alternatively, if there are no more pages, return "". Or if the search string wasn't matched, "keine Treffer gefunden" (hopefully).
	set baseURL to "http://www.duden.de/suchen/dudenonline/" & searchString
	set infoFromFirstPage to (do shell script "curl " & baseURL & " | egrep -o '[0-9]+\">letzte |keine Treffer gefunden' | sed -E 's/([0-9]+).+/{,?page=[1-\\1]}/'")
	
	if (infoFromFirstPage is "keine Treffer gefunden") then
		-- If no matches, that's the result.
		set wordList to "No matches for \"" & searchString & "\"."
	else
		-- Otherwise stick the multi-URL together. The result's the original URL if "" was returned above.
		set multiURL to baseURL & infoFromFirstPage
		-- Fetch the pages with curl, extract the words with sed, pick out the right ones with grep.
		set wordList to (do shell script "curl " & multiURL & " | sed -En '
# If a line in the HTML contains one or more instances of \"<h2>\" followed by a link tag:
/<h2><a href=[^>]+>/ {
	# Replace every instance of the two tags with a linefeed.
	s//\\'$'\\n''/g
	# Delete the first of the new lines thus formed.
	s/^[^[:cntrl:]]+\\n//
	# Delete from \"</a>\" onward in the remaining lines.
	s/<[/]a>[^[:cntrl:]]+//g
	# Delete any remaining tags and soft hyphens.
	s/<[^>]+>|" & (character id 173) & "//g
	# \"Print\" the returned words.
	p
}' |
# Case-insensitively pick out the returned words which actually start with the search string.
grep -i '^" & searchString & "'")
	end if
	
	return wordList
end getWords

on getWordsFromDuden()
	set searchString to text returned of (display dialog "Search for words beginning with:" default answer "" with title "Get words from duden.de" with icon note)
	if (searchString is "") then return
	
	set wordList to getWords(searchString)
	
	tell application "TextEdit"
		activate
		make new document with properties {text:wordList}
	end tell
end getWordsFromDuden

getWordsFromDuden()
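The first `do shell script` line in getWords turns the "last page" link into a suffix for curl's URL globbing, where `{a,b}` lists alternatives and `[1-N]` is a numeric range. A minimal sketch of just that step follows, assuming the last-page link ends in the page number followed by `">letzte `. The sample markup and the `page_suffix` name are invented for illustration.

```shell
# Hypothetical pager link; the real page may be laid out differently.
sample='<a href="/suchen/dudenonline/A?page=12">letzte &raquo;</a>'

# Same grep and sed stages as in getWords, minus the AppleScript escaping.
page_suffix() {
  grep -Eo '[0-9]+">letzte |keine Treffer gefunden' |
    sed -E 's/([0-9]+).+/{,?page=[1-\1]}/'
}

suffix=$(printf '%s\n' "$sample" | page_suffix)
printf '%s\n' "$suffix"
```

Appended to the base URL, a suffix like this asks for the plain URL plus one `?page=` URL for each number in the range. If the page has no such link, the pipeline prints nothing and the base URL is used on its own.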

moosmahna · August 25, 2015, 1:17pm

Hi Nigel
You are GREAT. I’ll try it as soon as possible. One question: if I search for words etc., do I have to change (searchString) to e.g. (Auto)?
Thanks again for your answer.
P.S. You are faster than the speed of light!!!

Nigel_Garvey · August 25, 2015, 3:14pm

Hi moosmahna.

Just run the script and enter the search string into the dialog that appears. When the script’s gathered all the words (which may take from a few seconds to several minutes), it should activate TextEdit and open a text window with all the words in it.

moosmahna · August 25, 2015, 5:20pm

Hi Nigel

THANK YOU VERY MUCH!!! You are awesome!!!
Kind Regards from Austria

moosmahna · August 26, 2015, 10:55am

Hi Nigel
I think the script doesn’t show all results. When I search on duden.de I get more results. Can you please look at the script?

THANK YOU!!!

moosmahna · August 26, 2015, 11:34am

It’s me again :o
Is it possible to get all words from the Yosemite dictionary?

Thanks again…

Nigel_Garvey · August 26, 2015, 3:33pm

Hi moosmahna.

The script is supposed to leave out results which don’t begin with the search string. But if you can give me a few examples where it leaves out words which do begin with the string and which show up in your browser with the same search, I’ll look into it.

As for the Yosemite dictionary, that’s a separate query (to which I don’t know the answer). You should start a new topic for it.

moosmahna · August 26, 2015, 3:40pm

Hi Nigel
Sorry, it was my mistake. I was looking at the wrong number of results.
Sorry, sorry!!!

Everything works great!!