IDCS3 Detecting double hyphenation (and other bad auto-hyphenation)

owebro · May 1, 2008, 10:09pm

Hey folks,

So I’ve been trying to figure out how to find instances of double hyphenations (e.g., double-hyph-enation) and other bad auto-hyphens (e.g., double/hyph-enation, double–hyph-enation, double—hyph-enation).

I pulled a script from this forum (http://bbs.macscripter.net/viewtopic.php?id=21293) and massaged it down a little bit to find the first two examples above and color them green. My code is:


tell application "Adobe InDesign CS3"
	activate
	set storynumber to count document 1 each story
	set lineNumber to {}
	repeat with m from 1 to storynumber
		set end of lineNumber to count story m of document 1 each line
	end repeat
	tell document 1
		set mySwatch to swatch "Green"
		repeat with n from 1 to storynumber
			tell story n
				repeat with i from 1 to item n of lineNumber
					if length of line i is greater than 1 then
						tell line i
							set lastword to last word
							set firstword to first word
							set lastChar to last character
							set firstChar to first character
							set verifBaseline to ((baseline of first character of last word) is not (baseline of last character of last word))
							set verifHyphen to (characters of lastword contains "-") and (lastChar is not "-")
							set verifSlash to (characters of lastword contains "/") and (lastChar is not "/")
							if (verifHyphen and verifBaseline) or (verifSlash and verifBaseline) then
								set selection of application "Adobe InDesign CS3" to last word
								select (last word)
								tell (last word) to set fill color to mySwatch
							end if
						end tell
					end if
				end repeat
			end tell
		end repeat
	end tell
end tell

Now, I can’t figure out how to make it detect the other two examples above (the en and em dash instances). The problem is that the dashes aren’t considered part of the word, so the [characters of lastword] lines don’t work on them. Is there any way to include them somehow? Basically I need to tell it to (A) find the words that start on one line and end on another, then (B) determine if the previous or next character is an en or em dash, and then color it. But I just can’t figure out the syntax to make it do this.

Any ideas? Thanks so much! Oh and by the way, I’m pretty much nubcake at Applescript, so forgive me if this is a stupidly easy (or stupidly impossible O_O) request.

Marc_Anthony · May 4, 2008, 11:31pm

Hi Owebro,

Regarding your em & en dash issue, my solution is to work around the fact that they aren’t part of the word: find and replace those with other text characters that are part of the word but are also rare/strange enough to not be in your story. When done adjusting the offending word, you can find and replace them back to ems & ens.

regulus6633 · May 5, 2008, 1:52am

I don’t have InDesign, so I can’t give you an example script using it, but maybe these will help. Basically I want you to see that using the term “words” does in fact strip hyphens and other special characters automatically from the words in a sentence. But you don’t have to use “words”. Here are two other ways to parse the text and find what you want. I’d bet they would work when parsing text with InDesign.

set theLines to "this is a test of hyph-
ens. Can they be detected?"

-- first example
set theWords to words of theLines
--> result: {"this", "is", "a", "test", "of", "hyph", "ens", "Can", "they", "be", "detected"}
-- notice how we get the lines in an 11 item list
-- with this list we have no way of knowing when one line ended and the next line begins, so this is no good
-- also notice how using the term "words" removes the hyphen, again this is no good

set theLines to "this is a test of hyph-
ens. Can they be detected?"

-- second example
set theParagraphs to paragraphs of theLines
--> result: {"this is a test of hyph-", "ens. Can they be detected?"}
-- notice how we get the lines in a 2 item list
-- notice how we know where one line ends and the next begins, this is good
-- notice how the hyphen stayed with us, this is good
if item -1 of (item 1 of theParagraphs) is "-" then
	set theResult to "We found the hyphen"
else
	set theResult to "We didn't find the hyphen"
end if

set theLines to "this is a test of hyph-
ens. Can they be detected?"

-- third example
set text item delimiters to "-"
set tidLines to text items of theLines
set text item delimiters to ""
tidLines
--> result: {"this is a test of hyph", "
-- ens. Can they be detected?"}
-- notice how we get a 2 item list and the items of the list are divided at the hyphens, we can work with this

owebro · May 5, 2008, 5:38pm

Marc–thanks for the suggestion. That’s an excellent idea that could be implemented quite easily, but the only problem is that replacing ens and ems would cause the text to reflow the vast majority of the time, and therefore interfere with the detection process. BTW I’m working with articles that are 2-col justified.

regulus6633–Thanks also for the suggestions, but I’m able to detect when a line is broken using the script I posted. What it does is find each line that ends in a word whose first-character baseline does not equal its last-character baseline. Upon finding such a word, it determines whether the word contains a hyphen (an actual hyphen, not an InDesign line-break hyphen) or a slash.

So the trouble remains with the en and em dashes. Say I have something like:

With a 5-point scale (1 for completely unrelated, and 5 for highly related), the target–unre-
lated preview pairs were rated to be unrelated.

The script would detect the break in “target–unrelated” (baseline of “u” ≠ baseline of “d”) and it would check if it contains a hyphen or a slash, which it doesn’t, so on it goes to the next line. I could easily do it if “target–unrelated” were considered one word, but unfortunately it seems “target,” “–,” and “unrelated” are all considered individual words.

So if I had it my way (:P), I would insert a verifEndash variable to throw into my if statement as follows:

 set verifEndash to (word before firstChar is En dash) or (word after lastChar is En dash)

However, this returns a “Can’t get word before xxx” error, so I don’t think the “before” and “after” commands work like I think they do (but maybe they kinda do?).

So. This isn’t really a huge deal, but it’d be sweet to be able to get it to detect every possible bad auto-hyphenation I can think of.

regulus6633 · May 5, 2008, 6:15pm

Look at this. The result is a count of the characters you want to detect for every word in the phrase. So notice how you can tell when a word has one or more of those characters. This was done by not using the term “word” but using text item delimiters and paragraphs to get the words/paragraphs of the lines.

set theLines to "With a 5-point scale (1 for completely unrelated, and 5 for highly related), the target–unre-
lated preview pairs were rated to be unrelate"

set charsToDetect to {"–", "-", "/"}

set theParagraphs to paragraphs of theLines

set everyWordCount to {}
repeat with aPara in theParagraphs
	set text item delimiters to " "
	set parasWords to text items of aPara
	set text item delimiters to ""
	
	set paraCount to {}
	repeat with aWord in parasWords
		set wordCount to 0
		repeat with j from 1 to count of aWord
			if character j of aWord is in charsToDetect then set wordCount to wordCount + 1
		end repeat
		set end of paraCount to wordCount
	end repeat
	set end of everyWordCount to paraCount
end repeat
everyWordCount

script results → {{0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2}, {0, 0, 0, 0, 0, 0, 0, 0}}

Notice how the 3rd word of the first paragraph has 1 occurrence of the charsToDetect and the 15th word of the first paragraph has 2 of them. Each paragraph is a different list item so you even know which paragraph you are in. None of the words in the second paragraph have any of the characters to detect.
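
Building on the per-word counts above, here is a minimal sketch that looks only at the last space-delimited chunk of a line, since that is where a bad break would show up. It is plain AppleScript working on a string, not InDesign code. The handler name `lastChunkHasBadBreak` and the sample strings are made up for illustration, and `character id` assumes AppleScript 2.0 or later. It also only sees characters that are actually in the text, so InDesign's automatic hyphens would still need the baseline test from the original script.

```applescript
-- Hypothetical helper: true when the last space-delimited chunk of lineText
-- contains one of flaggedChars anywhere before its final character.
on lastChunkHasBadBreak(lineText, flaggedChars)
	if lineText is "" then return false
	set savedDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to " "
	set lastChunk to last text item of lineText
	set AppleScript's text item delimiters to savedDelims
	
	-- Skip the final character so a chunk that simply ends in "-" is not flagged
	repeat with j from 1 to (count of lastChunk) - 1
		if character j of lastChunk is in flaggedChars then return true
	end repeat
	return false
end lastChunkHasBadBreak

set enDash to character id 8211
set emDash to character id 8212
set flagged to {"-", "/", enDash, emDash}

set dashBreak to lastChunkHasBadBreak("the target" & enDash & "unre-", flagged)
set plainBreak to lastChunkHasBadBreak("this is a test of hyph-", flagged)
{dashBreak, plainBreak}
--> {true, false}
```

The final character is ignored on purpose: a line that merely ends in a typed hyphen is an ordinary break, while a flagged character earlier in the same chunk means the word was already joined by a hyphen, slash, or dash before it broke.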

Marc_Anthony · May 5, 2008, 7:01pm

regulus,
Using text item delimiters won’t work here. InDesign has multiple types of hyphens, and some of them aren’t “real”; they are automatically generated, non-text items, and the only way to detect them is by their effect: a single word that has a line break but no responsible hyphen.

Owebro,
Just change the words back before you start operating on them. Insert the object reference of each instance of the em/en words into a list. The character range will still be the same once they’ve been returned to their original state.
Another idea is to edit the glyphs. Maybe you can create something that technically isn’t an em or en, but looks just like them?

owebro · May 5, 2008, 7:55pm

All right Marc, I’m not sure I fully understand. Are you saying to change the en and em dashes BEFORE the script does its thing, then change them back afterward? If that’s correct, then it won’t detect the instances unless the break remains intact. If I change it to, say, an ampersand, there’s a good chance the word wouldn’t break the same (or it probably wouldn’t break at all).

So maybe this is totally off-base and you’re talking about something different, but it does seem like a double hyphen (i.e., “--”) keeps the break the same. Well, in the two examples I just tried it on, anyway. So I guess I could:

Find/Replace en dash with --, em dash with --- at the beginning of the script.
Do the line break check (no alteration necessary, actually, because it checks for hyphens, which the word will have, and thusly will turn green).
Change them back to en/em dashes at the end of the script.
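
The swap-and-restore part of those three steps can be sketched on an ordinary string. This is a rough plain-AppleScript illustration, not InDesign code: in a real document the replacing would go through InDesign's own find/change, which is not shown here. The handler name `replaceText` and the sample text are made up, and `character id` assumes AppleScript 2.0 or later.

```applescript
-- Hypothetical helper: replace every occurrence of findText in sourceText
on replaceText(sourceText, findText, replacementText)
	set savedDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to findText
	set parts to text items of sourceText
	set AppleScript's text item delimiters to replacementText
	set resultText to parts as text
	set AppleScript's text item delimiters to savedDelims
	return resultText
end replaceText

set enDash to character id 8211
set emDash to character id 8212
set original to "target" & enDash & "unrelated pairs" & emDash & "rated"

-- Step 1: swap the dashes for runs of hyphens
set swapped to replaceText(replaceText(original, emDash, "---"), enDash, "--")

-- (Step 2, the line-break check, would run here)

-- Step 3: swap back, longest run first so "---" is not read as "--" plus "-"
set restored to replaceText(replaceText(swapped, "---", emDash), "--", enDash)
restored is original
--> true
```

One caveat with this scheme: any "--" or "---" that was already typed in the text before step 1 would come back as a dash in step 3, so it is worth searching for those first.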

Is that basically what you’re talking about? If I use hyphens I guess I won’t need to designate any kind of special characters because the word will contain them. I think this should work, but worst-case is that it won’t always work, but it shouldn’t leave any --'s or ---'s floating around. But that’s what proofreaders are for.

By the way, thanks for your help, guys. I really appreciate it.

Marc_Anthony · May 6, 2008, 2:46am

Sorry if I lost you, but the find/change & break check may or may not need to run in separate passes. A suggested order of operations:

1.) Find and replace. If you’ve already gotten multiple hyphens to work, then go with that as your replacement symbol, but I’ve found a couple characters that are close lengthwise matches to the dashes; change every instance of “^=” to “¬”, and every “^_” to “°”. If this flows and breaks okay, run the remaining code, then change these symbols back, and stop here.
2.) Didn’t flow right? Test whether each of the story’s words contains your replacement symbols, and put the object references of the ones that do into a list. This will give you their character range in the story, and allow you to track the “word” even after you change it back.
3.) Find and replace the symbols back to the original dashes. This puts you back to where you started, breaks intact.
4.) You can now iterate through your list of object references (from step 2), painting them green or some other color to differentiate that they once harbored dashes.
5.) Now run the code looking for weird line breaks.
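
Steps 2 through 4 rest on one idea: if each placeholder is a single character, swapping it back does not shift any character positions, so positions recorded beforehand still point at the dashes afterward. Here is a minimal plain-AppleScript sketch of that idea on an ordinary string. It is not InDesign code, the sample text and variable names are made up, and `character id` assumes AppleScript 2.0 or later. In InDesign you would store object references rather than integer offsets.

```applescript
set enDash to character id 8211
set placeholder to character id 172 -- the not-sign stand-in from step 1
set storyText to "the target" & placeholder & "unrelated pairs and prime" & placeholder & "target pairs"

-- Step 2: remember where the placeholders sit before changing anything back
set dashOffsets to {}
repeat with i from 1 to count of storyText
	if character i of storyText is placeholder then set end of dashOffsets to i
end repeat

-- Step 3: change the placeholders back to en dashes (one character for one character)
set savedDelims to AppleScript's text item delimiters
set AppleScript's text item delimiters to placeholder
set parts to text items of storyText
set AppleScript's text item delimiters to enDash
set restoredText to parts as text
set AppleScript's text item delimiters to savedDelims

-- Step 4: the recorded offsets still point at the dashes
set stillAligned to true
repeat with anOffset in dashOffsets
	if character (contents of anOffset) of restoredText is not enDash then set stillAligned to false
end repeat
{dashOffsets, stillAligned}
--> {{11, 37}, true}
```

This only holds for one-for-one replacements. A multi-character stand-in such as "--" changes the length of the text, so offsets recorded before the swap back would no longer line up.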

I’d help you more with some code samples, but I only have access to InDesign while I’m at work, and I was too busy to devote much time to this today. : (

owebro · May 6, 2008, 10:03pm

OK, so unfortunately the replacement characters you suggested are causing the text to reflow. InDesign’s seemingly random hyphenation algorithm is frustrating sometimes.

I’m pretty sure I understand what you’re talking about with the object references in the list and keeping track of them by their range. That’s neat. Now I need to figure out how to set up the iteration to determine whether the first or second word of each item in the list breaks. That should be possible, yeah? I’ll tackle that one in the morning…time to go home now.

Thanks very much for your help, Marc. Hope the weather’s OK over in Dallas. We’ve been getting soaked in Austin the past few days.

owebro · May 7, 2008, 4:52pm

Wowza, thanks for taking the time to work out that entire thing. Unfortunately I can’t follow the logic of it. My grasp of Applescript and how its code flows is pretty limited.

I do know that when I open it up and try to run it I get an error on the third line:

tell story 1 to set (the end of line 1 of (every word of every paragraph whose baseline of character -1 > baseline of character 1)) to discretionary hyphen

It says “InDesign can’t set end of line 1 of every word of every . . . to discretionary hyphen.”

I think I’ll just stick with the original with the double- and triple-hyphen as en/em replacements, which seem to be working OK for now. They seem to be maintaining the breaks. I guess because they’re hyphens? It’s funny that they apparently work better to that end than other characters of equal length. You’d think InDesign would try to avoid hyphenating words that are already hyphenated. Whatever.

Anyhow, thanks very much for your input. I wish I could understand your code, because it looks complicated and powerful, but alas, it’s Applegreek to me.