Yet *another* delimiters question (replace only if)

I hope I’m not repeating another post. My search returned a ton of topics that seemed to be identical (not to my question, but to themselves), so if you know for certain this is a repeat then just point me in the right direction, otherwise:

All my teachers use PowerPoint, and I find it useful to turn the slides into PDFs (I annotate with Skim, then export as a text outline). My issue is that for certain letters, or letter combinations, the exported text is replaced with a variety of different characters. This only happens with a select few PDFs; most are exported correctly.

Here are some examples: Propriocep4ve, Posi+on, Contraindica%ons, a`achments, LeZ, sta-cally, Lingual$nerve, vestibule, supraglo.c, respiratory)center

As you can see, the overwhelming concern here is the substitution of “ti”, with the substitution of the space a close second. (I gave more examples of “ti”, but the space is also a concern.) These are the only two cases I want to handle, as they are the most frequent.

I was thinking that I could assign variables: theLetters (upper- & lowercase), theNumbers, thePunctuation

Then do something similar to the below. However, I know this is incorrect: 1. it fails miserably, 2. it doesn’t automatically choose the erroneous character in the handler, and 3. it doesn’t take into account the situation where both the “ti” and a " " are misrepresented. Any pointers are most appreciated.

set theFile to choose file --a .txt file
set theText to read theFile
set theTextList to paragraphs of theText


set theLetters to {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"}
set theNumbersAndPunctuation to {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "!", "#", "%", "'", "+", "-", "."}
--set thePunctuation to {"!", "#", "%", "'", "+", "-", "."}
set theWrongSpaces to {"$", "(", ")"}


if characters of theTextList contains (items of theLetters & items of theNumbersAndPunctuation & items of theLetters) then
	
	set newText to my replaceChar(theTextList)
	
else if items of theTextList contains (items of theLetters & items of theWrongSpaces & items of theLetters) then
	
	set newText to my replaceSpace(theTextList)
	
end if


on replaceChar(noteText)
	set tid to text item delimiters
	set text item delimiters to "4" -- here I want the script to automatically choose the character in error
	set newText to every text item in noteText
	set text item delimiters to "ti"
	set finalText to every item in newText as text
	set text item delimiters to tid
	return finalText
end replaceChar

on replaceSpace(noteText)
	set tid to text item delimiters
	set text item delimiters to "4" -- here I want the script to automatically choose the character in error
	set newText to every text item in noteText
	set text item delimiters to " "
	set finalText to every item in newText as text
	set text item delimiters to tid
	return finalText
end replaceSpace
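
For reference, the two handlers above differ only in the replacement string, so the usual text item delimiters idiom could be folded into one handler. The following is a minimal sketch, not the poster's method: the handler name replaceText and its parameters are made up for illustration, and it assumes Mac OS X 10.6 or later, where text item delimiters may be given as a list.

```applescript
-- Hypothetical helper: replace every string in searchList with replacement.
on replaceText(sourceText, searchList, replacement)
	-- Remember the current delimiters so they can be restored afterwards
	set savedDelimiters to AppleScript's text item delimiters
	-- Split the text at every occurrence of any of the search strings
	set AppleScript's text item delimiters to searchList
	set pieces to text items of sourceText
	-- Join the pieces again, inserting the replacement between them
	set AppleScript's text item delimiters to replacement
	set resultText to pieces as text
	set AppleScript's text item delimiters to savedDelimiters
	return resultText
end replaceText

set sample to "Propriocep4ve, Posi+on, Lingual$nerve"
set sample to replaceText(sample, {"4", "+", "%"}, "ti")
set sample to replaceText(sample, {"$", ")"}, " ")
--> "Proprioceptive, Position, Lingual nerve"
```

Note that this replaces every listed character wherever it appears, so a legitimate "4" or "+" elsewhere in the notes would be changed as well.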

Hi. Someone may come up with a better way, but here is my first crack at it before going to bed. I have no idea what “LeZ” or “supraglo.c” should evaluate to. Considering how common the period and Z characters are, I’m not even sure that TIDs (text item delimiters) are a workable method; you may need GREP.

strip(split("Propriocep4ve, Posi+on, Contraindica%ons, a`achments, LeZ, sta-cally, Lingual$nerve, vestibule, supraglo.c, respiratory)center", {"ti", "4", "+", "%", "-"}), {space, ""})
strip(split(result, {space, ")", "$"}), {space, ""})
strip(split(result, {"tt", "`"}), {space, ""})



on strip(someText, TID)
	set AppleScript's text item delimiters to TID
	set someText to someText's text items --delete delimiter(s) (and coerce to list form)
	
	set AppleScript's text item delimiters to return
	set someText to (someText as text)'s paragraphs --add delimiter (and coerce to string)
	
	# destroy empty paragraphs created by coercion in conjunction with the primary TID
	set AppleScript's text item delimiters to ""
	set someText to (someText as text) --add delimiter (and coerce to string)
end strip

on split(someText, TID)
	set AppleScript's text item delimiters to TID
	set someText to someText's text items
	set AppleScript's text item delimiters to TID & return --split after
	set someText to (someText as text)'s paragraphs
	
	set AppleScript's text item delimiters to " "
	set someText to someText as text
end split

yields:

--> "Proprioceptive, Position, Contraindications, attachments, LeZ, statically, Lingual nerve, vestibule, supraglo.c, respiratory center"

Hello. The “LeZ” is supposed to be Left, and the “supraglo.c” should be supraglottic, but those are uncommon as far as misrepresented characters go. I’m more concerned with the “ti” and the " " issues.

What I was trying to do is set up a condition where the script recognizes when there is a number or punctuation mark between two letters, and then uses that number or punctuation mark as the delimiter. Of course the punctuation should be limited to a few types, as punctuation is often normal, but things like supraglo.c (among others) should be recognized.
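
A condition like that can be sketched in plain AppleScript by walking the text one character at a time and checking the neighbours. This is only an illustration under stated assumptions: the handler name fixBetweenLetters and the list of suspect characters are hypothetical, and building the result by repeated concatenation will be slow on very long texts.

```applescript
-- Hypothetical helper: replace a suspect character only when it sits
-- directly between two letters.
on fixBetweenLetters(sourceText, suspectChars, replacement)
	set letters to "abcdefghijklmnopqrstuvwxyz" -- string comparisons ignore case by default
	set charCount to length of sourceText
	if charCount < 3 then return sourceText
	-- The first and last characters can never be "between" two letters
	set output to character 1 of sourceText
	repeat with i from 2 to charCount - 1
		set thisChar to character i of sourceText
		if thisChar is in suspectChars and (character (i - 1) of sourceText) is in letters and (character (i + 1) of sourceText) is in letters then
			set output to output & replacement
		else
			set output to output & thisChar
		end if
	end repeat
	return output & character charCount of sourceText
end fixBetweenLetters

fixBetweenLetters("Posi+on, page 4", {"4", "+", "%", "-"}, "ti")
--> "Position, page 4"
```

The neighbour check leaves the standalone 4 alone, but it cannot tell a damaged "sta-cally" from a genuine hyphenated word such as "well-known", so the hyphen is a risky entry in the suspect list.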

What is GREP?

Is this the correct syntax for GREP? [a-zA-Z][0-9\+%][a-zA-Z] (Should I leave the \ before the + when inside the brackets?)
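
On the bracket question: inside a bracket expression the + and % are literal, so no backslash is needed, and in POSIX tools such as sed a backslash inside brackets is itself taken literally. A pattern of that shape could be tried from AppleScript roughly as follows. This is a sketch under assumptions: it shells out to the sed that ships with macOS, the variable names are invented for illustration, and capture groups are used to put the surrounding letters back.

```applescript
set sampleText to "Propriocep4ve, Posi+on, Contraindica%ons"
-- \1 and \2 restore the letters on either side of the bad character
set sedProgram to "s/([A-Za-z])[0-9+%]([A-Za-z])/\\1ti\\2/g"
set fixedText to do shell script "printf '%s' " & quoted form of sampleText & " | sed -E " & quoted form of sedProgram
--> "Proprioceptive, Position, Contraindications"
```

Because each match consumes the letters on both sides, two bad characters separated by a single letter would not both be fixed in one pass. For multi-line text, adding "without altering line endings" to do shell script keeps the line feeds from being converted to returns.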

Hi DrLulz,

Instead of fixing the problem by developing a script, why don’t you investigate why the text exported from the PDF contains errors?

Stefano - Ame

Probably because it’s done by an OCR (internally)

I could be wrong, but I think the issue has to do with the text encoding of a few of my professors’ .pptx files. Like I said, most export without a hitch. I could export the ppt as a txt or rtf, but that would defeat the purpose, as the ppts often have additional images, and I want an outline based on my annotations, not the entire transcription.

EDIT: I’ve tried printing with CUPS to generate the pdf (instead of exporting from PP), however, the text displays on the screen, but when I copy/paste to test nothing is pasted. weird…

Hi DrLulz,

My opinion is that no OCR is used: if you or something else typed the text in, there is no need to OCR it.
It also should not be a text encoding problem. Text encoding problems usually appear when you use accented characters or Unicode characters that are not in the ASCII table. “LeZ is supposed to be Left” cannot be a text encoding problem.

Stefano - Ame

What could it be then?

But as an alternative, what would the GREP be to find instances of LETTERSNUMBERLETTERS or LETTERSPUNCTUATIONLETTERS and replace the number or punctuation with “ti”?

Hi,

It would be interesting to investigate the source PDF and also the Skim app that produces the export (you could also ask the developer of Skim). Do you export using Acrobat Reader, Acrobat Pro or Skim?
I know very little about GREP. Most probably there are GREP gurus on this forum…

Stefano - Ame

1) Skim is OCR software. 2) An imaged PDF doesn’t contain text; everything is an image.

It’s not an encoding issue, because then the same characters would always look strange. Also, both the correct and the incorrect characters are part of the 7-bit ASCII table (read: characters without diacritics). Character encodings don’t support half alphabets.

After using PowerPoint to export a PDF, I can open it in anything that reads a PDF (Acrobat, Preview, Skim), select text, and see the error after pasting. That is to say, Skim has nothing to do with it; it just exports what is already there.

When in PP do you use the save as… and select the PDF file format? Or do you print the page and print it to a PDF file?

I’ve tried:

  1. Save as PDF

result: cartilage = car5lage

  2. Print to PDF using OSX’s inherent ability

result: cartilage = car5lage

  3. Print to PDF using CUPS

result: cartilage = (nothing)

It sounds like you really need to use something other than Skim.

Why do you think it’s Skim if the problem occurs regardless of its use?

I missed your earlier post about it also happening in Acrobat.

I think, though, that you should still try to find the problem if possible – cleaning up after it is going to be a messy business. You say it only happens sometimes: is there something about the documents where it happens?

I wonder if sometimes the PP files are using graphics containing text – that would explain the OCR-like behavior.

Here is a GREP alternative with which you can tinker. Unfortunately, I had to create a dependency with TextWrangler because I’m not certain how to do lookahead/behind using the shell and it would take more time than I have to figure out. You can directly search an external file, but I tested a document containing your supplied string, and all were corrected.

tell application "TextWrangler"
	replace "(\\d+|%|(\\+)|(\\-))((?<=\\w)|(?=\\w))" using "ti" searching in document 1 options {search mode:grep, starting at top:true, returning results:1, showing results:0} saving no
	replace "(\\))|\\$((?<=\\w)|(?=\\w))" using " " searching in document 1 options {search mode:grep, starting at top:true, returning results:1, showing results:0} saving no
	replace "(Z)((?<=\\w)|(?=\\w))" using "ft" searching in document 1 options {search mode:grep, starting at top:true, returning results:1, showing results:0} saving no
	replace "(\\.)((?<=\\w)|(?=\\w))" using "tti" searching in document 1 options {search mode:grep, starting at top:true, returning results:1, showing results:0} saving no
	replace "(\\`)((?<=\\w)|(?=\\w))" using "tt" searching in document 1 options {search mode:grep, starting at top:true, returning results:1, showing results:0} saving no
end tell
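
If the TextWrangler dependency is unwanted, lookbehind and lookahead are also available from the shell through Perl. The following is a hedged sketch rather than a tested replacement for the block above: it assumes the perl that ships with macOS is present, it covers only the “ti” and space cases, and the variable names are illustrative.

```applescript
set sampleText to "Propriocep4ve, sta-cally, Lingual$nerve, respiratory)center"
-- (?<=...) and (?=...) require a letter on each side without consuming it,
-- so only the bad character itself is replaced
set perlProgram to "s/(?<=[A-Za-z])[0-9+%-](?=[A-Za-z])/ti/g; s/(?<=[A-Za-z])[\\$\\)](?=[A-Za-z])/ /g"
set fixedText to do shell script "printf '%s' " & quoted form of sampleText & " | perl -pe " & quoted form of perlProgram
--> "Proprioceptive, statically, Lingual nerve, respiratory center"
```

Requiring a letter on both sides leaves standalone numbers and end-of-sentence punctuation alone, although a genuine hyphenated word would still be altered by the first substitution.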

I’m a bit puzzled.

I may imagine that Propriocep4ve must be replaced by Proprioceptive
Posi+on may become Position
Contraindica%ons may become Contraindications
sta-cally may become statically
supraglo.c may become supraglotic
Lingual$nerve may become Lingual nerve
respiratory)center may become respiratory center
vestibule would remain as is
LeZ would remain as is

but with the given rules,
a`achments would become atiachments which, as far as I know, doesn’t exist.

Am I misunderstanding the rules, or is it an oddity in them?

Yvan KOENIG (VALLAURIS, France) Saturday 14 June 2014 20:27:38

Edit

If I understand correctly, the list theLetters doesn’t need to be so long.
As you can make comparisons ignoring case, you may reduce it to {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"} or to {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"}
I guess that it would shorten the process.

Thank you sir that does exactly what I need, and it also gives me a starting point from which to learn. Much obliged.

This site is so awesome. Some other sites make you feel dumb when you’re a beginner, but people here are very kind.

I stated before that the majority of my issue centered around “ti” being incorrect, with the space issue being second. So, while LeZ may remain untouched, and a`achments becomes atiachments (and you are correct, that is not a word), these occur less frequently than all the others. That is to say, for every 50 Propriocep4ve there may be 1 LeZ.