My job is writing tools for text processing and translation as well. Once the database grows beyond 125,000 records and you still want to translate 10,000 words in under a second, you have to look for alternative approaches. I’m working with databases of over 500,000 records, and they’re still growing.
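At that scale, a linear scan over the records is far too slow; some kind of index is the usual alternative. Here is a minimal sketch of a hashed word index in C, assuming an in-memory table with linear probing — the table size, names, and FNV-1a hash are my own illustrative choices, not the author’s actual design:

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE (1u << 20)   /* power of two, room for ~1M records */

struct entry { const char *src; const char *dst; };
static struct entry table[TABLE_SIZE];   /* zero-initialized: empty slots */

/* FNV-1a: cheap and reasonably well distributed for short dictionary keys */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

static void put(const char *src, const char *dst) {
    uint32_t i = fnv1a(src) & (TABLE_SIZE - 1);
    while (table[i].src && strcmp(table[i].src, src) != 0)
        i = (i + 1) & (TABLE_SIZE - 1);          /* linear probing */
    table[i].src = src;
    table[i].dst = dst;
}

static const char *get(const char *src) {
    uint32_t i = fnv1a(src) & (TABLE_SIZE - 1);
    while (table[i].src) {
        if (strcmp(table[i].src, src) == 0)
            return table[i].dst;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;   /* not in the database */
}
```

With a load factor kept well below 1, each lookup touches only a handful of slots, so translating 10,000 words stays comfortably under a second.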
Translation software is more than just replacing one binary stream with another. There are grammatical rules and spelling rules. For instance, Dutch, German and English are all West Germanic languages, so not only does the spelling look alike, but so does the grammar. Of these, Dutch is the most “free” language, while German is the most strict. The upshot is that it’s very easy to write software that will automatically translate from German to Dutch; translating from English to German, however, is much harder. For instance, German has schöne, schönen, schönem, schöner and schönes, which all mean beautiful.
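One common way to cope with inflection like this is to strip the ending and look the stem up instead. A minimal sketch in C, assuming a fixed list of German adjective endings — real morphology needs a rule table per word class, so this is illustrative only:

```c
#include <string.h>

/* Common German adjective endings, longest first so the first hit wins */
static const char *endings[] = { "en", "em", "er", "es", "e", NULL };

/* Copy word into out and strip one adjective ending, e.g.
   "schönen" -> "schön". Purely illustrative stemming. */
static void stem_adjective(const char *word, char *out, size_t outsz) {
    size_t len = strlen(word);
    if (len >= outsz) len = outsz - 1;   /* truncate defensively */
    memcpy(out, word, len);
    out[len] = '\0';
    for (int i = 0; endings[i]; i++) {
        size_t el = strlen(endings[i]);
        if (len > el && strcmp(out + len - el, endings[i]) == 0) {
            out[len - el] = '\0';
            break;
        }
    }
}
```

All five forms above then collapse to the single stem “schön”, so the database only needs one record instead of five.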
That said, starting translation software with matching on the length of the match is indeed the first step in all translation software. Two things are still important here. The first is your “record delimiters”: most beginner translation software chooses not to use boundary words like with, of, in, or, etc. That way, when the software starts reading from the beginning of the string, you have a minimum of overlap between records. Others choose to read the string from the end back to the beginning while still matching each record from its beginning to its end. That means a lot more movement of the string pointers (sorry, I’ve written everything in C and assembly for performance reasons), but the “perfect” match is much better, and overlap between records is taken into account in finding that perfect match. When the database grows above 50,000 records a funny thing happens: most of its translations are still mostly single words (when not using “boundaries”).
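The length-of-match idea sketched above can be shown with a greedy longest-match lookup. This is my own minimal illustration in C, assuming a tiny phrase table ordered longest entry first (the real system would use the indexed 500,000-record database, not a linear array):

```c
#include <string.h>
#include <stddef.h>

/* Tiny phrase table, longest entries first, so the first prefix hit
   is automatically the longest match. Entries are illustrative. */
static const char *phrases[][2] = {
    { "rode wijn", "red wine" },   /* two-word record beats "rode" alone */
    { "rode",      "red"      },
    { "wijn",      "wine"     },
    { NULL, NULL }
};

/* Greedy longest match anchored at the start of text. On success,
   *consumed is how many bytes of the source the match covers. */
static const char *match_longest(const char *text, size_t *consumed) {
    for (int i = 0; phrases[i][0]; i++) {
        size_t len = strlen(phrases[i][0]);
        if (strncmp(text, phrases[i][0], len) == 0 &&
            (text[len] == ' ' || text[len] == '\0')) {
            *consumed = len;
            return phrases[i][1];
        }
    }
    *consumed = 0;
    return NULL;
}
```

A full translator would repeat this at each position, advancing the pointer by `consumed` (plus the following space) after every match — which is exactly where the pointer movement mentioned above comes from.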
Looking at the example text, the markup is familiar; I’m guessing this is for InDesign or QuarkXPress, for creating big catalogs. Big catalogs have different categories, which means you have translations that are client-specific or category/context-specific: the same word, or combination of words, is used in different product specifications, while the target language uses two different translations for them. How are you going to catch this?
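One way to catch it is to key the records on the category as well as the term, falling back to a category-independent record when there is no specific one. A minimal sketch, with made-up entries (in a real system this composite key would go through the same index as everything else):

```c
#include <string.h>

/* Context-specific records: the same English term translates
   differently into Dutch depending on the catalog category. */
struct ctx_entry { const char *category; const char *term; const char *dst; };

static struct ctx_entry ctx_db[] = {
    { "plumbing", "tap", "kraan"   },   /* tap = faucet   */
    { "dance",    "tap", "tapdans" },   /* tap = tap dance */
    { NULL, NULL, NULL }
};

static const char *lookup_ctx(const char *category, const char *term) {
    for (int i = 0; ctx_db[i].term; i++)
        if (strcmp(ctx_db[i].category, category) == 0 &&
            strcmp(ctx_db[i].term, term) == 0)
            return ctx_db[i].dst;
    return NULL;   /* fall back to a category-independent record here */
}
```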
Then last, but not least, there is the grammar of each language. What we’ve done is set all the grammar aside and create a hybrid of the most popular languages. This way we were able to create translation software that goes further, much further, than Google Translate, Babel Fish or Trados (SDL). However, our databases are purely focused on catalogs as well, which makes them not useful for general commercial use — well, maybe in the future.
The biggest advantage of Google Translate’s principle (read: its concept) is that it’s always up to date, but it’s purely based on statistics, not on correct spelling or grammar. A translation based on a database, like all other translation software, always needs maintenance: the translation for a certain word can change by tomorrow.
Because it seems like you’re just starting with this project, I wanted to give you a heads-up about the problems I already faced 15 years ago.
I’ve also read some old books and papers on concepts for translating text from one language to another. They can point out issues you may not have considered yet but that will show up in the future.