My job is writing tools for text processing and translation as well. Once the database grows beyond 125,000 records and you still want to translate 10,000 words in under a second, you have to look for alternative approaches. I’m working with databases of over 500,000 records, and they’re still growing.
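At that scale, a linear scan over the records is far too slow; some kind of index is the usual alternative. Here is a minimal sketch of a hashed word index in C, assuming an in-memory table with linear probing — the table size, names, and FNV-1a hash are my own illustrative choices, not the author’s actual design:

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE (1u << 20)   /* power of two, room for ~1M records */

struct entry { const char *src; const char *dst; };
static struct entry table[TABLE_SIZE];   /* zero-initialized: empty slots */

/* FNV-1a: cheap and reasonably well distributed for short dictionary keys */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

static void put(const char *src, const char *dst) {
    uint32_t i = fnv1a(src) & (TABLE_SIZE - 1);
    while (table[i].src && strcmp(table[i].src, src) != 0)
        i = (i + 1) & (TABLE_SIZE - 1);          /* linear probing */
    table[i].src = src;
    table[i].dst = dst;
}

static const char *get(const char *src) {
    uint32_t i = fnv1a(src) & (TABLE_SIZE - 1);
    while (table[i].src) {
        if (strcmp(table[i].src, src) == 0)
            return table[i].dst;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;   /* not in the database */
}
```

With a load factor kept well below 1, each lookup touches only a handful of slots, so translating 10,000 words stays comfortably under a second.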
Translation software is more than just replacing one binary stream with another. There are grammatical rules and spelling rules. For instance, Dutch, German and English are all West Germanic languages, so not only does the spelling look alike, but so does the grammar. Of these, Dutch is the most “free” language, while German is the most strict. The upshot is that it’s very easy to write software that will automatically translate from German to Dutch; translating from English to German, however, is much harder. For instance, German has schöne, schönen, schönem, schöner and schönes, which all mean beautiful.
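One common way to cope with inflection like this is to strip the ending and look the stem up instead. A minimal sketch in C, assuming a fixed list of German adjective endings — real morphology needs a rule table per word class, so this is illustrative only:

```c
#include <string.h>

/* Common German adjective endings, longest first so the first hit wins */
static const char *endings[] = { "en", "em", "er", "es", "e", NULL };

/* Copy word into out and strip one adjective ending, e.g.
   "schönen" -> "schön". Purely illustrative stemming. */
static void stem_adjective(const char *word, char *out, size_t outsz) {
    size_t len = strlen(word);
    if (len >= outsz) len = outsz - 1;   /* truncate defensively */
    memcpy(out, word, len);
    out[len] = '\0';
    for (int i = 0; endings[i]; i++) {
        size_t el = strlen(endings[i]);
        if (len > el && strcmp(out + len - el, endings[i]) == 0) {
            out[len - el] = '\0';
            break;
        }
    }
}
```

All five forms above then collapse to the single stem “schön”, so the database only needs one record instead of five.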
That said, starting translation software with matching on the length of the match is indeed the first step in all translation software. Two things are still important here. The first is your “record delimiters”: most beginner translation software chooses not to use boundary words like with, of, in, or, etc. That way, when the software starts reading from the beginning of the string, you have a minimum of overlap between records. Others choose to read the string from the end back to the beginning while still matching each record from its beginning to its end. That means a lot more movement of the string pointers (sorry, I’ve written everything in C and assembly for performance reasons), but the “perfect” match is much better, and overlap between records is taken into account in finding that perfect match. When the database grows above 50,000 records a funny thing happens: most of its translations are still mostly single words (when not using “boundaries”).
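The length-of-match idea sketched above can be shown with a greedy longest-match lookup. This is my own minimal illustration in C, assuming a tiny phrase table ordered longest entry first (the real system would use the indexed 500,000-record database, not a linear array):

```c
#include <string.h>
#include <stddef.h>

/* Tiny phrase table, longest entries first, so the first prefix hit
   is automatically the longest match. Entries are illustrative. */
static const char *phrases[][2] = {
    { "rode wijn", "red wine" },   /* two-word record beats "rode" alone */
    { "rode",      "red"      },
    { "wijn",      "wine"     },
    { NULL, NULL }
};

/* Greedy longest match anchored at the start of text. On success,
   *consumed is how many bytes of the source the match covers. */
static const char *match_longest(const char *text, size_t *consumed) {
    for (int i = 0; phrases[i][0]; i++) {
        size_t len = strlen(phrases[i][0]);
        if (strncmp(text, phrases[i][0], len) == 0 &&
            (text[len] == ' ' || text[len] == '\0')) {
            *consumed = len;
            return phrases[i][1];
        }
    }
    *consumed = 0;
    return NULL;
}
```

A full translator would repeat this at each position, advancing the pointer by `consumed` (plus the following space) after every match — which is exactly where the pointer movement mentioned above comes from.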
Looking at the example text, the markup is familiar; I’m guessing this is for InDesign or QuarkXPress, for creating big catalogs. Big catalogs have different categories, which means you have translations that are client-specific or category/context-specific: the same word, or combination of words, is used in different product specifications, while the target language uses two different translations for them. How are you going to catch this?
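One way to catch it is to key the records on the category as well as the term, falling back to a category-independent record when there is no specific one. A minimal sketch, with made-up entries (in a real system this composite key would go through the same index as everything else):

```c
#include <string.h>

/* Context-specific records: the same English term translates
   differently into Dutch depending on the catalog category. */
struct ctx_entry { const char *category; const char *term; const char *dst; };

static struct ctx_entry ctx_db[] = {
    { "plumbing", "tap", "kraan"   },   /* tap = faucet   */
    { "dance",    "tap", "tapdans" },   /* tap = tap dance */
    { NULL, NULL, NULL }
};

static const char *lookup_ctx(const char *category, const char *term) {
    for (int i = 0; ctx_db[i].term; i++)
        if (strcmp(ctx_db[i].category, category) == 0 &&
            strcmp(ctx_db[i].term, term) == 0)
            return ctx_db[i].dst;
    return NULL;   /* fall back to a category-independent record here */
}
```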
Then last, but not least, there is the grammar of each language. What we’ve done is set all the grammar aside and create a hybrid of the most popular languages. This way we were able to create translation software that goes further, much further, than Google Translate, Babel Fish or Trados (SDL). However, our databases are purely focused on catalogs as well, which makes them not useful for general commercial use — well, maybe in the future.
The biggest advantage of Google Translate’s principle (read: its concept) is that it’s always up to date, but it’s purely based on statistics, not on correct spelling or grammar. A translation based on a database, like all other translation software, always needs maintenance: the translation for a certain word can change by tomorrow.
Because it seems like you’re just starting with this project, I wanted to give you a heads-up about the problems I already faced 15 years ago.
I’ve also read some old books and papers on concepts for translating text from one language to another. They can point out issues you may not have considered yet but that will show up in the future.