Mess with special characters: umlauts and more

Joy · November 6, 2014, 1:40pm

Hi,
These days I have to copy text from PDF files, German text to be exact. Special characters like “ä, ö, ü, ß” and “«, »” get copied in different ways (even for the same umlaut), and it drives me crazy.
Does anybody know a workaround to get the text strings right?

DJ_Bazzie_Wazzie · November 6, 2014, 2:40pm

PDF files are in effect two documents on top of each other. There is an image (glyph) document, and on top of it an invisible but selectable text document. This makes it feel as if you are copying the text you actually see. In the PostScript drawing itself there is no text, only glyphs that carry no meaning, so when you select and copy text it is the invisible text that gets highlighted and copied. Wrong characters mean that, when the document was created, the font used to draw the glyphs and build the invisible text layer lied about its true character values. This is not reversible unless you rescan the document with OCR using the right font settings, or re-create the PDF with less compression or with an actual PDF distiller. It is more common with free (lower-quality) PDF writers, such as Apple’s PDF writer or web services that write PDFs on the fly, than with Adobe’s own Acrobat Professional.
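
One way to check what the invisible text layer really handed over is to list the code points of the copied string. The following is a minimal Python sketch; Python and the helper name `describe` are assumptions for illustration, not something used in this thread. It shows that two strings which both display as "ä" can be stored differently, which would explain the same umlaut arriving "in different fashions":

```python
import unicodedata

def describe(text):
    """Return one (character, code point, Unicode name) tuple per character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Two strings that render the same but are stored differently.
precomposed = "\u00e4"   # a single code point for a-umlaut
decomposed = "a\u0308"   # plain "a" followed by a combining diaeresis

for label, sample in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, describe(sample))
```

Pasting a copied passage into `describe` instead of the two samples should reveal which variants a given PDF produces.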

Joy · November 6, 2014, 8:15pm

Thanks, DJ Bazzie Wazzie.
So, as I understand you, there is no way to correct the wrong characters?

For now I wrote a script, and at first the result was quite convincing.
However, after 5 pages I saw that my solution only worked partially, and I don’t know how many more variations might keep my script from working correctly.
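
The script itself is not shown in the thread, so the following is only a rough Python sketch of how such a cleanup could be structured, not a reconstruction of it. The `REPLACEMENTS` table and the names `clean` and `leftovers` are hypothetical; the table entries are examples of variants that might occur, not an observed list. The idea is to fix known bad sequences, let Unicode NFC normalization merge letter-plus-combining-mark pairs, and then report any unexpected characters so new variations are noticed rather than silently missed:

```python
import unicodedata

# Hypothetical table: bad sequence -> intended character.
# Extend it whenever leftovers() reports a new variant.
REPLACEMENTS = {
    "a\u00a8": "\u00e4",  # "a" + spacing diaeresis -> a-umlaut
    "o\u00a8": "\u00f6",  # "o" + spacing diaeresis -> o-umlaut
    "u\u00a8": "\u00fc",  # "u" + spacing diaeresis -> u-umlaut
    "A\u00a8": "\u00c4",
    "O\u00a8": "\u00d6",
    "U\u00a8": "\u00dc",
}

# Non-ASCII characters expected in the cleaned German text (assumption).
ALLOWED = set("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df\u00ab\u00bb")

def clean(text):
    """Apply the replacement table, then compose letter + combining mark pairs."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # NFC turns e.g. "a" + U+0308 (combining diaeresis) into a single a-umlaut.
    return unicodedata.normalize("NFC", text)

def leftovers(text):
    """Return the non-ASCII characters that are not in ALLOWED, sorted."""
    return sorted({ch for ch in text if ord(ch) > 127 and ch not in ALLOWED})

sample = "Ma\u0308dchen, scho\u00a8n"
fixed = clean(sample)
print(fixed)
print(leftovers(fixed))
```

Running `leftovers` on each cleaned page gives a quick list of characters the table does not yet cover.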

DJ_Bazzie_Wazzie · November 7, 2014, 1:57am

Maybe re-exporting them with Acrobat will help, but it’s a long shot. I have no source at hand, but I read somewhere on the Adobe forums that this creates a new text table. I haven’t tested it myself, but I saw some forum users reply that it worked for them (they had damaged text tables).