Unicode text to ASCII

The easiest way would be to use the HTML entity decoder in PHP.

set theString to "Power #2 &#8211; Lawyerin&#8217;"
do shell script "php -r '$stdin = file_get_contents(\"php://stdin\"); echo html_entity_decode($stdin);' <<<" & quoted form of theString

Edit: I removed the AppleScript tag because the HTML entities prevent the script from being opened.

Thanks for this, DJ. I think it’s going to simplify a couple of my own scripts at home. :slight_smile:

While reading the PHP man page this morning and fooling around with some code, I’ve come up with a simpler version of your shell script:

set theString to "Power #2 &" & "#8211; Lawyerin&" & "#8217;" -- HTML split for posting.
do shell script "php -R 'echo html_entity_decode($argn).\"\\n\";' <<<" & quoted form of theString

Does it have any advantages or disadvantages compared with yours? (The “."\n"” is just to restore any linefeeds ripped out by the preceding code, a necessity for my purposes. Your version preserves linefeeds anyway.)

Not that I’m aware of. I wrote mine from a server-side perspective; I only knew the server-side variables, not the command-line ones (such as $argn) available in the script. I like yours much better :slight_smile:

Update: From what I see, there is a difference. Yours runs the code once for each line of input, while mine processes the entire input as one string. For large files (like entire HTML files), yours will have a much smaller memory footprint, while mine will probably be faster. But when it comes to handling arbitrary data, I prefer a smaller memory footprint over performance, because yours is more secure as well. So I still like yours better :slight_smile:
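To make the per-line versus whole-input trade-off concrete, here is a rough sketch in Python (a stand-in chosen for illustration, not the PHP or AppleScript used in this thread). The function names decode_whole and decode_by_line are made up for this example; html.unescape is Python's standard entity decoder and is only assumed to behave comparably to html_entity_decode for this sample.

```python
import html
import io
from typing import Iterator, TextIO


def decode_whole(stream: TextIO) -> str:
    """Read the entire input into memory, then decode it in one call."""
    return html.unescape(stream.read())


def decode_by_line(stream: TextIO) -> Iterator[str]:
    """Decode one line at a time; only the current line is held in memory."""
    for line in stream:
        yield html.unescape(line)


# Hypothetical sample input: two lines containing entities.
sample = "Power #2 &#8211; Lawyerin&#8217;\nsecond &amp; last line\n"

whole = decode_whole(io.StringIO(sample))
by_line = "".join(decode_by_line(io.StringIO(sample)))

# Both approaches give the same text; they differ only in how much
# of the input has to be in memory at any one time.
print(whole == by_line)
```

For a short string like the one in this thread the difference is negligible; it would only start to matter with input the size of whole HTML files.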

As I am unable to understand your PHP “incantations”, I tested some code using Shane Stanley’s BridgePlus, and it worked well.

use scripting additions
use framework "Foundation"
use script "BridgePlus"
load framework


set theString to "Power #2 &" & "#8211; Lawyerin&" & "#8217;" -- HTML split for posting.
# decode XML encoded ampersands
set theString to (current application's SMSForder's unencodedForXMLFrom:theString) as string
# decode the Html decimal encodings
set theString to (current application's SMSForder's decodedDecimalFrom:theString) as string
--> "Power #2 – Lawyerin’"

Yvan KOENIG running El Capitan 10.11.3 in French (VALLAURIS, France) Wednesday 9 March 2016 15:19:31

Thanks guys! It works great. I don’t understand the PHP method either, but prefer it since I don’t have to install additional software.

FWIW, if you are running 10.11 this is probably the quickest:

use framework "Foundation"

set theString to "Power #2 &" & "#8211; Lawyerin&" & "#8217;" -- HTML split for posting.
set theString to current application's NSString's stringWithString:theString
set theString to (theString's stringByApplyingTransform:"Any-Hex/XML10" |reverse|:true) as string

There isn’t much to understand. If your data is HTML (I assumed so because you mentioned ‘form’), you should stick with PHP (Nigel’s method). If your data is XML, you can use Shane’s, Yvan’s or Nigel’s method, whichever you like best.

Or even if it’s not, really:

use framework "Foundation"

set theString to "Power #2 &" & "#8211; Lawyerin&" & "#8217;" -- HTML split for posting.
set theString to current application's NSString's stringWithString:theString
set theString to (theString's stringByApplyingTransform:"Hex-Any" |reverse|:false) as string

It doesn’t support named entities, which are standardized in HTML. In XML they are custom for each DTD, and therefore ICU only supports XML entities, not HTML entities. The reason ‘Any-Hex’ works is that the string is scanned, the &#n; format is recognized, and Any-Hex/XML is applied. If you use ICU’s real default for Any-Hex, which is Any-Hex/Java, you’ll notice it won’t work. Both of your examples use ‘Any-Hex/XML’, which is useful for XML entities but not for (all) HTML entities.
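The numeric-versus-named distinction can be sketched as follows, again in Python as an illustrative stand-in rather than ICU or AppleScriptObjC. decode_numeric_only is a hypothetical helper that handles only the &#n; and &#xh; forms, roughly the subset a hex/XML transform covers; html.unescape is Python's standard decoder, which also knows the named HTML entities.

```python
import html
import re

# Matches decimal (&#8211;) and hexadecimal (&#x2013;) character references only.
_NUMERIC = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_only(text: str) -> str:
    """Replace numeric references; leave named entities such as &ndash; untouched."""
    def repl(match: "re.Match[str]") -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave out-of-range values as they were instead of raising an error.
        return chr(code) if code <= 0x10FFFF else match.group(0)
    return _NUMERIC.sub(repl, text)


# Hypothetical sample mixing numeric references and named entities.
sample = "Power #2 &#8211; Lawyerin&#x2019; &ndash; caf&eacute;"

numeric_only = decode_numeric_only(sample)  # named entities survive as-is
full = html.unescape(sample)                # named entities are decoded too

print(numeric_only)
print(full)
```

With data that only ever contains numeric references, as in the string from the original question, both approaches give the same result; the gap shows up once something like &ndash; or &eacute; appears.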

Update: The proper way in AppleScriptObjC is to use NSAttributedString, via the AppKit extension that adds the initWithHTML: method.


property NSString : class "NSString"
property NSAttributedString : class "NSAttributedString"
property NSNumber : class "NSNumber"
property NSDictionary : class "NSDictionary"

set HTMLString to "Power #2 &" & "#8211; Lawyerin&" & "#8217;" -- HTML split for posting.
set theString to NSString's stringWithString:HTMLString
set dataStr to theString's dataUsingEncoding:current application's NSUTF8StringEncoding
set encodingOpt to NSNumber's numberWithInt:current application's NSUTF8StringEncoding
        
set options to NSDictionary's alloc()'s initWithObjectsAndKeys_(encodingOpt, current application's NSCharacterEncodingDocumentOption, missing value)
set attStr to NSAttributedString's alloc's initWithHTML:dataStr options:options documentAttributes:missing value
set outputStr to attStr's |string|()
return outputStr

I’ve made a note of all the methods and comments. Thanks everyone. :slight_smile:

For what you have taught us over the years, you’re more than welcome :cool:

I’ve updated my post above with a proper, non-third-party AppleScriptObjC solution. I’m running 10.9, so I’m not sure whether use framework “AppKit” is required for initWithHTML: to work.

I’m not sure of the reference to 10.9, but it should be there (along with a Foundation use statement).

And it needs a few more ‘current application’s’es and a coercion of the final result to text.

Edit: It does appear to work without framework “AppKit”. Linefeeds are not preserved.

With a bit of tidying up:

use framework "Foundation"
use framework "AppKit"

-- classes, constants, and enums used
property NSUTF8StringEncoding : a reference to 4
property NSAttributedString : a reference to current application's NSAttributedString
property NSCharacterEncodingDocumentOption : a reference to current application's NSCharacterEncodingDocumentOption
property NSDictionary : a reference to current application's NSDictionary
property NSString : a reference to current application's NSString

set HTMLString to "Power #2 &" & "#8211; Lawyerin&" & "#8217;" -- HTML split for posting.
set theString to NSString's stringWithString:HTMLString
set dataStr to theString's dataUsingEncoding:NSUTF8StringEncoding
set options to NSDictionary's dictionaryWithObject:NSUTF8StringEncoding forKey:(NSCharacterEncodingDocumentOption)
set attStr to NSAttributedString's alloc()'s initWithHTML:dataStr options:options documentAttributes:(missing value)
set outputStr to attStr's |string|()
return outputStr as text

Linefeeds are ignored in HTML unless they are inside a preformatted (<pre>) element; line breaks are written as tags, such as <br> in HTML or <br /> in XHTML.

Because this is what my 10.9 script library handler looked like when I wrote the example (no use statements at all):

property NSString : class "NSString"
property NSAttributedString : class "NSAttributedString"
property NSNumber : class "NSNumber"
property NSDictionary : class "NSDictionary"

on HTMLDecode(HTMLString)
	set theString to NSString's stringWithString:HTMLString
	set dataStr to theString's dataUsingEncoding:(current application's NSUTF8StringEncoding)
	set encodingOpt to NSNumber's numberWithInt:(current application's NSUTF8StringEncoding)
	set options to NSDictionary's alloc()'s initWithObjectsAndKeys_(encodingOpt, current application's NSCharacterEncodingDocumentOption, missing value)
	set attStr to NSAttributedString's alloc's initWithHTML:dataStr options:options documentAttributes:(missing value)
	set outputStr to attStr's |string|()
	return outputStr as string
end HTMLDecode

Well. Adding <pre> and </pre> tags works. :wink: But the tags themselves don’t appear in the result, which betrays the fact that this particular ASObjC code interprets tags too, not just entities.

Correct, it’s not the same as the PHP solution, but there is no equivalent for it in AppleScriptObjC (with CoreFoundation you can, so an Objective-C wrapper class could work). If you want to ignore HTML elements, the easiest way is to make it believe the string is the contents of a multi-line text input: in this case, wrap the string in a textarea tag. Entities are replaced, but tags are not; the <br> tag isn’t either.
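The difference between "decode entities only" and "interpret the markup" can be shown with a small sketch, once more in Python as a stand-in and not a description of what initWithHTML: does internally. TextOnly and strip_tags_and_decode are hypothetical names; html.unescape plays the role of an entity-only decoder, and the HTMLParser subclass plays the role of something that parses tags as well.

```python
import html
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text content and silently drops every tag."""

    def __init__(self) -> None:
        # convert_charrefs=True makes the parser decode entities in text data.
        super().__init__(convert_charrefs=True)
        self.parts: list = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags_and_decode(markup: str) -> str:
    """Parse the markup, keep only the decoded text between tags."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Hypothetical sample: the thread's string with a <br> tag added.
sample = "Power #2 &#8211;<br>Lawyerin&#8217;"

entities_only = html.unescape(sample)        # the <br> tag stays in the text
tags_parsed = strip_tags_and_decode(sample)  # the <br> tag disappears

print(entities_only)
print(tags_parsed)
```

Which behaviour you want depends on whether the input is a full HTML fragment or plain text that merely contains entities.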

I believe the official stance is that you should include use statements for all required frameworks. It mostly works without them, but there’s no guarantee, and things may change. I suspect it has the potential to cause problems with enums and constants, which are lazy-loaded from the .bridgesupport files, apart from anything else.

Interestingly, initWithHTML actually loads WebKit to do the actual conversion, and WebKit is not thread-safe. So to be absolutely bombproof, there should possibly be a thread check in there. Not that I’m aware of anything other than editors that run scripts on background threads.

I was aware of that, but the class extension in AppKit does this for us. You can even make it thread-safe by running initWithHTML: on another thread. Here is what Apple has to say about it: