Unicode text stripping problem

I have InDesign documents with XML-tagged text inside text frames. An invisible Unicode character delimits the tags on tagged words and phrases. If I get all the text from a text frame, this character can come through in the string, and I need to strip it out. Example: “This sentence [has tagged] text.”, where the brackets stand in for the invisible Unicode character.
In ASOC I have been doing it like this:

set unicodeTagChar to «data utxtFEFFFEFF»
set unicodeTagChar to unicodeTagChar as Unicode text

set finalXMLString to replaceText_oldChr_newChr_(finalXMLString, unicodeTagChar, "")

using this AS code it works:

on replaceText_oldChr_newChr_(theText, theBadChar, theReplaceChar)
		set AppleScript's text item delimiters to theBadChar
		set newText to text items of theText
		set AppleScript's text item delimiters to theReplaceChar
		set newText to newText as string
		set AppleScript's text item delimiters to "" --reset to "normal" TID's
		return newText
end replaceText_oldChr_newChr_

What I really want to do is replace the AS code with an Objective-C method. I’m building a class to hold a lot of “helper functions”. I’ve got this code, which works OK on normal characters:


+(NSString *)replaceText:(NSString *)theText oldChr:(NSString *)theBadChar newChr:(NSString *)theReplaceChar
{
	//replaces unwanted chars with a specific new char
	NSMutableString *mutText = [NSMutableString stringWithString:theText];
	[mutText replaceOccurrencesOfString:theBadChar
							 withString:theReplaceChar 
								options:NSCaseInsensitiveSearch
								  range:NSMakeRange(0, [mutText length])];
	return (NSString *)mutText;
}

but when I call it like this in my script:

set finalXMLString to CocoaHelpers's replaceText_oldChr_newChr_(finalXMLString, unicodeTagChar, "")
tell me to log "STRING= " & (finalXMLString as string)

RESULT: This sentence ?has tagged? text.

Encoding confuses me and I don’t know what to do for the ObjC code (could I use 0xFEFF or U+FEFF or U0999 format? how do I even find the equivalent to «data utxtFEFFFEFF»?). I am also hesitant to just strip out a question mark from the final string, in case my text would contain a real question.

I think your Objective-C code can be simplified to one line:

+(NSString *)replaceText:(NSString *)theText oldChr:(NSString *)theBadChar newChr:(NSString *)theReplaceChar
{
	return [theText stringByReplacingOccurrencesOfString:theBadChar withString:theReplaceChar];
}

When I called that, using this example, it worked fine:

script CharStrippingAppDelegate
	property parent : class "NSObject"
	
	on applicationWillFinishLaunching_(aNotification)
		set finalXMLString to "This sentence [has tagged text."
		set unicodeTagChar to «data utxt005B» as Unicode text
		set finalXMLString to current application's CocoaHelpers's replaceText_oldChr_newChr_(finalXMLString, unicodeTagChar, "")
		log "STRING= " & (finalXMLString as string)
	end applicationWillFinishLaunching_
	
end script

I used «data utxt005B», which is the code for a left bracket, and that got stripped from my sentence. I got that code from a UTF-8 code chart on the web. I don’t know what sentence you used to get the result you posted, so it’s hard to tell what’s not working in your code.

Ric

Thanks Ric,
I think FEFF is a very unusual character. I’m finding some stuff about it: “zero width no-break space”, byte order mark. Hm. http://unicode.org/charts/

My code had been this and works:
set unicodeTagChar to «data utxtFEFFFEFF»
set unicodeTagChar to unicodeTagChar as Unicode text
tell me to log (count characters of unicodeTagChar)
--> 1

tried this just for fun:
set unicodeTagChar to «data utxtFEFF»
set unicodeTagChar to unicodeTagChar as Unicode text
tell me to log (count characters of unicodeTagChar)
--> 0

although no difference in what happens with the script. :confused:
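
One possible explanation for the two counts above, offered as a guess rather than anything confirmed in this thread: utxt data is UTF-16, and a decoder conventionally reads a leading FEFF as a byte order mark (BOM) and discards it. That would make «data utxtFEFFFEFF» a BOM followed by one real U+FEFF character, and «data utxtFEFF» a BOM with nothing after it. The sketch below feeds the same byte sequences to Foundation so the lengths can be compared. DecodedLength is a made-up helper name, and the sketch assumes NSUTF16StringEncoding consumes a leading BOM the way AppleScript appears to.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread): decode raw bytes as UTF-16
// and report how many UTF-16 units the resulting NSString holds.
// Returns -1 if Foundation refuses to decode the bytes.
// Assumes ARC or garbage collection; add a release under manual memory management.
static NSInteger DecodedLength(const unsigned char *bytes, NSUInteger byteCount) {
    NSData *data = [NSData dataWithBytes:bytes length:byteCount];
    NSString *decoded = [[NSString alloc] initWithData:data
                                              encoding:NSUTF16StringEncoding];
    if (decoded == nil) {
        return -1;
    }
    return (NSInteger)[decoded length];
}

// Example calls, mirroring the two AppleScript tests:
//   const unsigned char twice[] = {0xFE, 0xFF, 0xFE, 0xFF};
//   const unsigned char once[]  = {0xFE, 0xFF};
//   NSLog(@"%ld %ld", (long)DecodedLength(twice, 4), (long)DecodedLength(once, 2));
```

If the two logged lengths differ in the same way as the AppleScript counts, the first FEFF is being treated as a BOM rather than as text.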

As another test:

set testUniChar to «data utxtFEFFFEFF»
set testUniChar to testUniChar as Unicode text
set aTestPhrase to "Mary had a little _" & testUniChar & "_lamb."
set aTest to CocoaHelpers's replaceText_oldChr_newChr_(aTestPhrase, testUniChar, "*")
log aTest -->Mary had a little __lamb.
log (aTest as string) -->Mary had a little _?_lamb.

I’m not sure how this will come across on the web, but when I copy the log output for the version with the underscores together and paste it into TextEdit, I have to press the arrow key twice to move the cursor between the underscores. So there is an invisible character there. That is exactly how my text from InDesign behaves: I have to step across an invisible character around the tagged text. The Cocoa method won’t strip it out, and I’m not sure how this odd character gets from AS to ObjC.

It is an odd character. What are you doing with the extracted text, and how? It could be that you’re trying to solve a problem that shouldn’t really be a problem…

Shane may be right about whether there is a problem or not, but here is a way to strip that character from your text that seems to work (testing it the same way you did by pasting into TextEdit and moving the cursor through the text):

on applicationWillFinishLaunching_(aNotification)
	set testUniChar to «data utxtFEFFFEFF»
	set testUniChar to testUniChar as Unicode text
	set aTestPhrase to "Mary had a little_" & testUniChar & "_lamb."
	set aTest to current application's CocoaHelpers's stripText_(aTestPhrase)
	log aTestPhrase
	log aTest
end applicationWillFinishLaunching_

Using this objective-C file:

@implementation CocoaHelpers

+(NSString *)stripText:(NSString *)theText {
    UniChar chars[] = {0xFEFF};
    NSString *string = [NSString stringWithCharacters:chars length:1];
    return [theText stringByReplacingOccurrencesOfString:string withString:@""];
}
@end
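
A variation on the method above, as a sketch only: Objective-C string literals accept \u escapes (at least with a reasonably recent clang), so U+FEFF can also be written as @"\uFEFF" without a UniChar array. If more than one invisible character ever needs to go, an NSCharacterSet can hold them all. StripCharactersInString and StripTagMarkers are hypothetical helper names, not something from this thread or from Foundation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: remove every character listed in unwantedChars from theText.
static NSString *StripCharactersInString(NSString *theText, NSString *unwantedChars) {
    NSCharacterSet *unwanted =
        [NSCharacterSet characterSetWithCharactersInString:unwantedChars];
    // Split wherever an unwanted character occurs, then join the pieces back together.
    NSArray *pieces = [theText componentsSeparatedByCharactersInSet:unwanted];
    return [pieces componentsJoinedByString:@""];
}

// Hypothetical convenience wrapper: U+FEFF written as a \u escape in a literal.
static NSString *StripTagMarkers(NSString *theText) {
    return StripCharactersInString(theText, @"\uFEFF");
}
```

Adding further characters to the literal (for example @"\uFEFF\u200B") would strip those in the same pass; which characters InDesign actually emits is something to check against the real text.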

Ric

After Edit: If you need to pass the old and new characters to your Objective-C method, I haven’t found a way to do that by passing the Unicode character itself, but it does work if you pass the code number, 65279. So in this modified version of your method, I’ve changed that one parameter to oldChrCode and typed it as an NSUInteger.

on applicationWillFinishLaunching_(aNotification)
	set testUniChar to character id 65279 -- decimal equivalent of FEFF
	set aTestPhrase to "Mary had a little_" & testUniChar & "_lamb."
	set aTest to current application's CocoaHelpers's replaceText_oldChrCode_newChr_(aTestPhrase, 65279, "")
	log aTestPhrase
	log aTest
end applicationWillFinishLaunching_

And this is the objective-C method:

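+(NSString *)replaceText:(NSString *)theText oldChrCode:(NSUInteger)theBadCharCode newChr:(NSString *)theReplaceChar {
    // UniChar is 16 bits wide, so theBadCharCode must be 0xFFFF (65535) or less.
    UniChar chars[] = {(UniChar)theBadCharCode};
    // Build a one-character string from the code, then replace every occurrence of it.
    NSString *string = [NSString stringWithCharacters:chars length:1];
    return [theText stringByReplacingOccurrencesOfString:string withString:theReplaceChar];
}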
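
One caveat about passing a code number, with a hedged sketch: UniChar is a 16-bit type, so a code only fits in a single UniChar if it is 65535 (0xFFFF) or less. 65279 is fine, but a character outside that range is stored in an NSString as a surrogate pair of two UniChars. StringFromCodePoint and ReplaceCodePoint below are hypothetical helpers, not part of Foundation or of the code in this thread, that cover both cases.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: build an NSString holding one Unicode code point.
// No validation is done; codePoint is assumed to be a valid value up to 0x10FFFF.
static NSString *StringFromCodePoint(uint32_t codePoint) {
    if (codePoint <= 0xFFFF) {
        // Fits in a single UTF-16 unit (this is the U+FEFF case).
        unichar unit = (unichar)codePoint;
        return [NSString stringWithCharacters:&unit length:1];
    }
    // Above 0xFFFF: split into a high and low surrogate.
    uint32_t offset = codePoint - 0x10000;
    unichar pair[2] = {
        (unichar)(0xD800 + (offset >> 10)),
        (unichar)(0xDC00 + (offset & 0x3FF))
    };
    return [NSString stringWithCharacters:pair length:2];
}

// Hypothetical helper: replace every occurrence of one code point in theText.
static NSString *ReplaceCodePoint(NSString *theText, uint32_t badCodePoint,
                                  NSString *replacement) {
    return [theText stringByReplacingOccurrencesOfString:StringFromCodePoint(badCodePoint)
                                              withString:replacement];
}
```

For the case in this thread, ReplaceCodePoint(text, 0xFEFF, @"") would behave like the method above; the surrogate branch only matters if a code above 0xFFFF is ever passed in.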

Hi Shane,
Well, I am getting the text from a frame that has the unusual tagging character. When that text is returned from InDesign I coerce it to a string. I want to write the text as ASCII or UTF-8. What I’ve found so far is that the character does get returned in my string, and then the text looks like: blah blah ?some tagged text? and blah
The text ultimately ends up in an XML file that can be UTF-8, but I still need it to be clean of oddball characters because it’ll end up online later.

I’ve been trying to slim down my app’s main class file, it’s about 2500 lines and seems sluggish. So I’m breaking out some simple functions to another ASOC helper class, and for even more fun/speed, doing some functions as ObjC.

Ric, thanks. I will try that out soon to see how it works for me.

Ric-
thanks that code works, I appreciate it. I’ll have to understand the UniChar a little better now. :slight_smile: