Problems adding Unicode character to string variable

Simon_Knight · August 26, 2018, 1:43pm

Hi,
I am attempting to replace the ASCII 10 line feed characters in a string variable with the Unicode line separator character (code point). This code point has a UTF-16 value of 2028 hex, and in a hex editor it appears as E2 80 A8 (hex).

I have written the following function but I am unable to specify the correct code point.
on CleanText(pText)
set tFind to character id 10
set tReplace to character id 2028 --8232 does not work

-- save the existing delimiters
set prevTIDs to text item delimiters of AppleScript

-- find the newline hex 0A characters
set text item delimiters of AppleScript to tFind

set tText to text items of pText

-- add the replacement character / string
set text item delimiters of AppleScript to tReplace
set tText to "" & tText

-- reset the delimiters back to how they were
set text item delimiters of AppleScript to prevTIDs

return tText

end CleanText

I believe that \u2028 is a hex value which in UTF16 is 8232 decimal but this does not work. I believe that I need to specify the value in hex E2 80 A8 but the decimal value of 14844072 fails.

Any ideas?
best wishes
Simon K.

StefanK · August 26, 2018, 2:21pm

You are mixing up a few things.

\u2028 is UTF8 not UTF16; hex 2028 is decimal 8232, and character id expects decimal values.

Your handler works if the text is UTF8 encoded. There is a constant linefeed representing character id 10, and of AppleScript is redundant in this case:

on CleanText(pText)
	set tFind to linefeed
	set tReplace to character id 8232
	
	-- save the existing delimiters
	set prevTIDs to text item delimiters
	
	-- find the newline hex 0A characters
	set text item delimiters to tFind
	set tText to text items of pText
	
	-- add the replacement character / string
	set text item delimiters to tReplace
	set tText to tText as text
	
	-- reset the delimiters back to how they were
	set text item delimiters to prevTIDs
	
	return tText
end CleanText
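
A minimal sketch of the numbers discussed above, written in Python rather than AppleScript purely to check the arithmetic (the variable names are illustrative, not from the thread). It converts hex 2028 to the decimal value that character id expects, and shows that the UTF-8 bytes E2 80 A8 read as a single integer give a number far above the highest Unicode code point (U+10FFFF), which is presumably why that value is rejected.

```python
# U+2028 LINE SEPARATOR: the code point written in hex, converted to decimal.
LINE_SEPARATOR_HEX = "2028"
code_point = int(LINE_SEPARATOR_HEX, 16)
line_separator = chr(code_point)

# The UTF-8 bytes E2 80 A8 read as one big integer (not a code point).
utf8_as_integer = int("E280A8", 16)

# The highest valid Unicode code point is U+10FFFF.
MAX_CODE_POINT = 0x10FFFF

print(code_point)                                        # decimal form of hex 2028
print(line_separator == "\u2028")                        # same character either way
print(utf8_as_integer, utf8_as_integer > MAX_CODE_POINT)
```

The decimal code point (8232) is the value to pass to character id; the three-byte sequence is only how that character is stored when the text is encoded as UTF-8.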

Simon_Knight · August 26, 2018, 3:05pm

Hi Stefan,

Thanks for your reply - however the code does not work.

While I don’t really know what I am talking about, I think the problem is with the code point ID. This and other sites on the web describe u2028 as UTF-16: https://www.fileformat.info/info/unicode/char/2028/index.htm

When run as posted above the 0A characters remain as 0A characters when viewed in a hex editor.

The decimal value of hex E280A8 is 14844072 but this value generates an error : “Can’t get character id 14814072”

If I search and replace 0A with E280A8 using the hexeditor then the text is parsed correctly when it is pasted into an database type application.
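
A small sketch, in Python rather than AppleScript and using a made-up sample string, of why the hex editor replacement and a code point replacement should give the same result. Replacing the line feed with U+2028 at the string level and then encoding as UTF-8 produces the same bytes as replacing 0A with E2 80 A8 directly in the UTF-8 data.

```python
# Hypothetical sample text containing one line feed (hex 0A).
sample = "one\ntwo"

# String-level replacement: swap the line feed for U+2028, then encode.
string_level = sample.replace("\n", "\u2028").encode("utf-8")

# Byte-level replacement, as done in a hex editor: 0A -> E2 80 A8.
byte_level = sample.encode("utf-8").replace(b"\x0a", bytes.fromhex("E280A8"))

print(string_level.hex())
print(string_level == byte_level)
```

If the 0A bytes are still present in the output, the replacement itself is not necessarily at fault; the text may have been converted again somewhere between the script and the hex editor.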

best wishes
Simon

StefanK · August 26, 2018, 3:35pm

Once again, the code is supposed to work if the source text is UTF8 encoded.

I tested the code successfully with a UTF8 encoded text file.

Simon_Knight · August 26, 2018, 4:43pm

Hi Stefan,

On reflection during a dog walk, I suspect that it’s the way I’m using the output that is causing the problem:

I’m using the command set the clipboard to tText, and I suspect that it’s converting the UTF8 back to ASCII.

Thanks for your confirmation that the code is working.

Simon

Simon_Knight · August 26, 2018, 10:18pm

Hi,

I finally worked it out. I had to ensure that UTF8 was written to the clipboard, otherwise AppleScript changes the text to ASCII, which means the LS character is replaced with LF.

e.g. set the clipboard to tExport as «class utf8»

Thanks for the assistance.

Simon

Shane_Stanley · August 27, 2018, 12:18am

You shouldn’t. AppleScript will write it correctly:

on CleanText(pText)
	set tFind to linefeed
	set tReplace to character id 8232
	
	-- save the existing delimiters
	set prevTIDs to text item delimiters
	
	-- find the newline hex 0A characters
	set text item delimiters to tFind
	set tText to text items of pText
	
	-- add the replacement character / string
	set text item delimiters to tReplace
	set tText to tText as text
	
	-- reset the delimiters back to how they were
	set text item delimiters to prevTIDs
	
	return tText
end CleanText

set x to "one" & linefeed & "two"
set x to my CleanText(x)
set the clipboard to x
set y to the clipboard
id of character 4 of y
--> 8232

Simon_Knight · August 27, 2018, 6:32am

Shane,

You are quite correct, and I’m not sure how I arrived at my line of code, only that I was having problems with it. It made sense yesterday!

By “shouldn’t” do you mean “I should not have to” or “I should never”?

best wishes

Simon

Shane_Stanley · August 27, 2018, 7:24am

I meant the former at the time, but on reflection the latter is probably better. The general principle with the clipboard is to pass the “richest” version of what you have. The clipboard will then be able to offer it where requested, as well as simpler versions for clients that can’t cope with it (unlikely to be any these days anyway).

DJ_Bazzie_Wazzie · August 27, 2018, 9:59am

Code points (slash + ‘u’ notation) are Unicode character values, not byte code. Therefore it’s neither UTF8 nor UTF16. It may seem like UTF16 because a large portion of all valid Unicode code points are identical to their UTF16 values, but characters that require two 16-bit integers will differ.
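
To illustrate the distinction, here is a minimal sketch in Python rather than AppleScript. The helper name utf16_units and the choice of U+1F600 as a second character are assumptions made for the example. It lists the 16-bit units of each character's UTF-16 encoding: U+2028 is a single unit equal to its code point, while a character above U+FFFF needs two units (a surrogate pair) that no longer match the code point.

```python
def utf16_units(character: str) -> list[int]:
    """Return the 16-bit code units of the character's UTF-16 encoding."""
    data = character.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

line_separator = "\u2028"      # inside the Basic Multilingual Plane
emoji = "\U0001F600"           # outside it, so UTF-16 needs a surrogate pair

for ch in (line_separator, emoji):
    units = " ".join(f"{unit:04X}" for unit in utf16_units(ch))
    print(f"U+{ord(ch):04X} -> UTF-16 units: {units}")
```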