The Text Encoding Quagmire | Rixstep Developers Workshop

Why leave it simple when you can successfully complicate it? And how complicated is it really?

Text encodings aren't easy to understand. Even for developers. The bytes on disk never change by themselves. Text files don't have embedded encoding instructions. Save for Unicode files which are completely different. At least some of the time.

Text files have only text. The encoding helps I/O map bytes to glyphs and vice versa.

What's a text encoding? It's a specification for interpreting bytes in a file on disk for purposes of displaying glyphs on screen; alternately a specification for 'encoding' glyphs on screen back to a file on disk.

Open the same single byte text file using different encodings and you'll get completely different results on screen.
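A minimal sketch of that effect, using Python only for illustration. The codec names ('mac_roman', 'latin_1', 'cp1252') are Python's own labels and the two sample bytes are arbitrary choices, not taken from any real file.

```python
# Two bytes with the high bit set, so they fall outside 7-bit ASCII.
raw = bytes([0x8A, 0xE9])

# The bytes never change; only the table used to interpret them does.
decoded = {name: raw.decode(name) for name in ("mac_roman", "latin_1", "cp1252")}

for name, text in decoded.items():
    print(f"{name:10} {text!r}")
```

Each codec turns the same two bytes into a different pair of characters.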

Anyone who's grappled with Microsoft's 'embrace extend and exterminate' in this regard knows how terrible it can get. And web pages written in 'foreign' languages with 'foreign' character sets need their HTML header specs so browsers can correctly render them.

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" /> <meta name="keywords" content="新闻中心,时政,人事任免,国际,地方,香港,台湾,澳门,华人,军事,图片,财经,政权,股票,房产,汽车,体育,奥运,法治,廉政,社会,科技,互联网,教育,文娱,电视剧,电影,视频,访谈,直播,专题" />

Not all text files today are single byte. Among the Unicode encodings UTF-8 alone is built from single bytes. UTF-8 is the invention of Unix creator Ken Thompson, together with Rob Pike. It uses byte values beyond the reach of 7-bit ASCII (0x80 and up) to build multi-byte sequences for characters with values greater than 0x7f.
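A small illustration, assuming Python's built-in 'utf-8' codec; the four sample characters are arbitrary. ASCII stays one byte, and everything else becomes a sequence of bytes that all have the high bit set.

```python
samples = ("A", "é", "€", "新")

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, how many bytes it needs, and the bytes themselves.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```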

Unicode (UTF-16) files are 16-bit (two byte) and have a two byte prefix: either 0xfffe or 0xfeff to denote byte order. After that it's two bytes at a time to read in the text (four for the rarer characters that need a surrogate pair). [Empty Unicode files using an encoding with a prefix are always nonzero in size because the prefix remains.]
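The prefix is easy to see for yourself. A sketch assuming Python's 'utf-16' codec, which writes the prefix in the machine's own byte order, and 'utf-16-be', which writes none:

```python
import codecs

text = "Hi"
with_bom = text.encode("utf-16")        # prefix first, then two bytes per character
without_bom = text.encode("utf-16-be")  # explicit byte order, no prefix

print(with_bom.hex(" "))
print(without_bom.hex(" "))

# An empty text still leaves the two byte prefix behind.
empty = "".encode("utf-16")
print(len(empty))
```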

Single byte encodings are arbitrary. Single byte text files don't have encodings: they're read as having encodings. This might be the single most difficult thing to grasp.

Single byte text files don't normally have encodings - they're opened and saved presuming they have one.

Things get very hairy when you read in files as having one encoding, write them out as having another encoding, read them in again with a third encoding, and so forth. There's no telling after a while what you had in the beginning.
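A sketch of how quickly it goes wrong, again in Python and with an arbitrary sample word. The file is saved as UTF-8, misread as Mac OS Roman, saved again, then misread as ISO Latin-1.

```python
original = "café"
on_disk = original.encode("utf-8")         # what was actually saved

misread = on_disk.decode("mac_roman")      # opened with the wrong encoding
resaved = misread.encode("utf-8")          # written back out: the damage is now on disk
misread_again = resaved.decode("latin_1")  # and opened with yet another encoding

for stage in (original, misread, misread_again):
    print(repr(stage))
print(len(on_disk), "bytes became", len(resaved))
```

After the second save the original bytes are gone. Getting back to the original text means knowing every wrong guess made along the way and undoing them in reverse order.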

NeXTSTEP had a brilliant way of dealing with this. Files were either Unicode or they were not. If they were Unicode - if they had the prefix - they were read in as such. If they didn't have the prefix they were read in as single byte files.
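The rule is simple enough to sketch in a few lines. This is an illustration of the rule as described, not NeXTSTEP's actual code; the function name and the 'mac_roman' default are assumptions made for the example.

```python
import codecs

def read_text_legacy_style(data: bytes, native: str = "mac_roman") -> str:
    """Hypothetical helper: prefix means UTF-16, no prefix means native single byte."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Python's utf-16 codec reads the prefix to pick the byte order, then drops it.
        return data.decode("utf-16")
    return data.decode(native)

print(read_text_legacy_style("héllo".encode("utf-16")))
print(read_text_legacy_style("héllo".encode("mac_roman")))
```

The caller never names an encoding: both calls return the same text.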

Files you save with the old system - the one that's been in use for 20 years - can be Unicode. UTF-16. No worries. They're read in and written out correctly. The system figures out what to do all by itself.

This worked great for years. Either it was native (Mac OS Roman on OS X) or it had to be UTF-16 - period. But things change. The OS X file system itself uses UTF-8 throughout - so why not? All fine and good and knock on wood.

The New APIs

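Now Apple want to change the APIs. They want methods that take explicit 'encoding:' and 'error:' arguments, such as 'initWithContentsOfFile:encoding:error:' and 'writeToFile:atomically:encoding:error:'. Not 'in addition' but 'instead'. This began with 10.4 Tiger on 29 April 2005 and continues with 10.5 Leopard and will continue with Leopard's successor.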

As most of your system is UTF-8 you'd think it was OK. But it's not OK if you have UTF-16 files lying around. Previously - for the past twenty years - the system was able to figure out encodings by itself. Now it can't anymore.

There may of course be reasons for change. There are a number of new encodings introduced with 10.4 Tiger that provide a portent of what's to come. The ACP's Lightman reports the following Cocoa encodings under the new system. The new 32-bit Unicode encodings have the highest values and are at the end of the list.

The New UTFs

If all the system's dealing with are single byte text files and double byte text files with an identifying prefix it's pretty easy. Either your files have a prefix of 0xfeff or 0xfffe or you just assume they're in the native (Mac OS Roman) single byte text format. And damn the consequences.

But now there are 32-bit UTF formats - and not all of them have identifying prefixes.

Here's a file saved with the different UTF encodings. The files all contain the ASCII character set from 0x20 to 0x7f inclusive. There's no difference between Mac OS Roman and UTF-8 in this case as there are no non-ASCII characters to escape; UTF-16 has the 0xfffe prefix; UTF-16BE and UTF-16LE do not; UTF-32 is 32-bit but it has the prefix; UTF-32BE and UTF-32LE are also 32-bit but they do not have the prefix. Those with no prefix can't be automatically recognised.
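The same comparison can be approximated in Python. This is a sketch, not the tool used to produce the files above: the codec names are Python's, and Python's 'utf-16' and 'utf-32' codecs write the prefix in the machine's own byte order.

```python
# The ASCII range 0x20 to 0x7f inclusive: 96 characters.
text = "".join(chr(c) for c in range(0x20, 0x80))

encodings = ("mac_roman", "utf-8", "utf-16", "utf-16-be", "utf-16-le",
             "utf-32", "utf-32-be", "utf-32-le")

sizes = {}
for name in encodings:
    data = text.encode(name)
    sizes[name] = len(data)
    # Print the size and the first four bytes, where any prefix would show up.
    print(f"{name:10} {len(data):4} bytes, starts with {data[:4].hex(' ')}")
```

Only 'utf-16' and 'utf-32' come out two and four bytes larger than their BE and LE siblings. That difference is the prefix.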

Now if the new encodings were conspicuous by their total absence and the old APIs were left in place all would be fine. It would also be fine - or beyond reproach - if the old APIs were left in place anyway and their use left to the discretion of the developer. But that would be too smooth. Apple have instead officially deprecated the old APIs. All ISV code sooner or later has to stop using them as a version of OS X can at any time come along and no longer support them.

The API changes mentioned above took place with Tiger on 29 April 2005: that's when things started to spin out of control. Let's review.

The old way you simply use the path to the file. That's it. The old 'writeToFile:' has a second parameter but that's got nothing to do with text encoding. It's about whether you want files saved 'atomically' - the system first saves to a neutral location and then 'moves' the file into place.
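For anyone unfamiliar with the 'atomically' idea, here's a rough sketch of the general technique in Python. It is not how Cocoa implements it; the function name is hypothetical and the temporary file is simply placed in the same directory as the target.

```python
import os
import tempfile

def write_atomically(path: str, data: bytes) -> None:
    """Hypothetical helper: save to a temporary file, then move it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # The move either fully succeeds or leaves the old file untouched.
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed before the move completed: clean up the temporary file.
        os.unlink(tmp_path)
        raise
```

A reader of the target path sees either the old contents or the new ones, never a half-written file.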

Now you have to specify an encoding both going in and coming out. You have a chance to specify a pointer to an NSError variable you set up for 'error:' in case something doesn't go according to expectations and you want to issue a diagnostic to the program user.

The 'localized' methods access values in the _userInfo dictionary if available; otherwise things are constructed on the fly given _code and _domain.

So there's a lot of stuff there if you want to get it on with your end user. But unfortunately there are many situations where it simply isn't going to work. When you're working through the Cocoa document controller you're being called by the system and asked to read in and write out files - and that good old document controller is really interested to know how things turn out - so much so in fact it's going to issue its own error dialog if you tell it something went wrong.

All methods that interact with NSDocumentController have to indicate success or failure in some way. Either they tell the document controller if things turned out OK or they're being asked to return a pointer to data the controller is going to write to disk - in which case a zero pointer tells the controller something didn't work out.

So the system borks and the user gets two message boxes one after the other. Suck it up.

Thankfully you can specify '0' for the argument to 'error:' so the system ignores your 'pointer' and won't try to save anything to it. So essentially you have a parameter you can't always realistically use but can still fortunately work around.

Now the encoding. In the old days you let the system decide what to do. If your file was Mac OS Roman it was read in as such. Actually that should read 'if your file could be read as Mac OS Roman'.

Now it's perhaps interesting to note that UTF-16 and ordinary single byte text files are mutually exclusive. There are no 100% guarantees in theory but as good as in daily use. UTF-16 files start with the weird two byte prefix and normally contain a lot of zero bytes. Mac OS Roman and all other single byte encodings can't tolerate the zero bytes.
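The zero bytes are easy to demonstrate. A tiny sketch in Python with an arbitrary sample string:

```python
ascii_text = "plain text"

as_utf16 = ascii_text.encode("utf-16")      # prefix plus two bytes per character
as_single = ascii_text.encode("mac_roman")  # one byte per character

# Count the zero bytes in each form of the same text.
print(as_utf16.count(0), "zero bytes in the UTF-16 form")
print(as_single.count(0), "zero bytes in the single byte form")
```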

And same way back out again. You don't specify the encoding because you don't need to.

The old way you get two types of files but you never need to care. The system took care of it all for you. And it never used Unicode (double byte) to store a file if it didn't need to - even if you'd read it in that way. If your file started as a double byte UTF-16 but you removed the non-native part the system would save as single byte. Automatically.
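That saving behaviour can be sketched as follows. Again this illustrates the behaviour described, not the system's actual code; the function name and the 'mac_roman' default are assumptions.

```python
def encode_for_saving(text: str, native: str = "mac_roman") -> bytes:
    """Hypothetical helper: single byte if the text fits, UTF-16 only if it must be."""
    try:
        return text.encode(native)
    except UnicodeEncodeError:
        # At least one character has no slot in the native table.
        return text.encode("utf-16")

print(len(encode_for_saving("plain text")))   # fits the native encoding
print(len(encode_for_saving("plain 新闻")))    # needs UTF-16
```

Remove the two Chinese characters from the second string and it goes back to being saved as single byte.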

The system doesn't care. The applications don't care. Look at NeXTSTEP code going all the way back. As found today in Etoile, GNUstep, et al: they have these APIs - they literally don't care. It just works. Or: it used to work but doesn't want to anymore.

Cut to 2004. A lot of things happen as we all know today and guess what? This is another of them.

Now the old APIs still exist - but they're deprecated. This means they can officially disappear at any time without further notice. Notice has already been given - get it?

A Way Around Everything?

With 'initWithContentsOfFile:usedEncoding:error:' you don't specify an encoding: you let the system figure it out like before but you can see what encoding was used. For 'usedEncoding:' you provide a pointer to an 'NSStringEncoding' variable (unsigned integer). You read the unsigned integer back afterwards to see what encoding the system used to read the file.

Out with the Old?

The old APIs don't really conflict with anything. They don't really need to be removed. The system can welcome new APIs without removing the old ones. The issue is someone thinks the new APIs supersede the old APIs - which as has been demonstrated they categorically do not.

Playing guessing games with file systems about how to read and write files is going to wear out users and programmers alike. 10.5 Leopard's begun to use 'text encoding' extended attributes but that's hardly a cross platform solution. If text is to remain the 'lingua franca' of the Internet as Doug McIlroy envisioned then portability must be maintained - or when 8 bits or even 16 bits no longer can manage it new easily identifiable encodings must be used. And systems should continue to recognise these text encodings intuitively and by themselves without endless user interaction.

If you think this whole thing stinks then write to Apple. But don't expect things to change. The only good your writing does is give you a chance to vent a bit of anger and frustration. Things are more complex today but there's no reason things should be more complicated to use.

Postscript: usedEncoding

It turns out the 'new' method with 'usedEncoding:' is even more worthless than expected.

Target is Mac OS Roman.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable. The file may have been saved using a different text encoding, or it may not be a text file.

Target is UTF-8.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable. The file may have been saved using a different text encoding, or it may not be a text file.

Target is UTF-16.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: Used encoding: 0000000A.

Target is UTF-16BE.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable. The file may have been saved using a different text encoding, or it may not be a text file.

Target is UTF-16LE.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable. The file may have been saved using a different text encoding, or it may not be a text file.

Target is UTF-32.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable. The file may have been saved using a different text encoding, or it may not be a text file.

Target is UTF-32BE.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable. The file may have been saved using a different text encoding, or it may not be a text file.

Target is UTF-32LE.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable. The file may have been saved using a different text encoding, or it may not be a text file.

The only one it picks up is UTF-16. It can't even pick up UTF-32 despite there being a signature prefix.

http://www.omnigroup.com/mailman/archive/macosx-dev/2005-June/056865.html
You've no doubt discovered that encoding-sniffing is new with Tiger. Before this, people muddled through as best they could. You can load an NSData with the contents of the file and sniff the first four bytes of the file for Unicode byte-order markers (a file can still be UTF without a BOM). You can sniff the whole contents for bit 7 being set, and if it never is, pick ASCII. After that, you guess based on your market. Mac Roman encoding is often the safest 8-bit encoding, though UTF-8 is taking over. ISO Latin-1, if you're dealing with Windows-origin text that isn't Unicode. When you're ready to throw the dice, use -[NSString initWithData:encoding:] and watch the fun.

http://www.omnigroup.com/mailman/archive/macosx-dev/2005-June/056893.html
In my tests, the only encoding this method could sniff was UTF-16. Not UTF-8, or ASCII, or Windows Cyrillic.

http://lists.apple.com/archives/cocoa-dev/2006/Apr/msg00747.html
I'm trying to initialize a string using NSString - initWithContentsOfFile:usedEncoding:error:. I'm getting an error (#261) that says the file can't be opened using the specified text encoding. This happens whether I specify an encoding or not. According to the docs, this method is supposed to try to determine the encoding used and return it by reference. On the other hand, if I use -initWithContentsOfFile:encoding:error: and specify NSUTF8StringEncoding, it works fine. If I could count on the files I'm opening always being in UTF8 encoding, that would be great. But I'm not sure that's realistic given that TextEdit's default encoding for plain text is Mac OS Roman. I suppose the other solution is to iterate over every possible encoding until I don't get an error. So am I using the first method the wrong way, or is not working?

http://lists.apple.com/archives/cocoa-dev/2006/Apr/msg00766.html
It does try, but alas it is not particularly good trying. IIRC, it would correctly recognize UTF-16 and -32 if they have their prefixes, and that's about all (perhaps plain ASCII, too :)) If you need to determine the encoding from the data, it's best to DIY at any level you need (from just trying which encoding can interpret the data through a frequency analysis of characters to a full-blown analysis which may include spellchecking the text in the target language -- if known -- and selecting the encoding which yields the least number of misspelled words). Whatever you do though, don't forget to allow the user to override the encoding, for just *any* heuristic is bound to fail sometimes.
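The do-it-yourself sniffing these posts describe can be sketched roughly as below. This is an illustration in Python, not Cocoa code: the function name, the table and the fallback order ('utf-8' then 'mac_roman') are all assumptions, and as the last post warns any such heuristic is bound to fail sometimes.

```python
import codecs

# Longer prefixes first: the UTF-32 LE prefix begins with the UTF-16 LE prefix.
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(data: bytes, fallbacks=("utf-8", "mac_roman")):
    """Hypothetical helper: return a Python codec name, or None if nothing fits."""
    # Step 1: look at the first bytes for a byte-order marker.
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    # Step 2: if bit 7 is never set, it's plain ASCII.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Step 3: throw the dice, taking the first candidate that decodes cleanly.
    for name in fallbacks:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

for sample in ("héllo".encode("utf-16"), b"plain", "héllo".encode("mac_roman")):
    print(sniff_encoding(sample))
```

A single byte fallback such as Mac OS Roman accepts any bytes at all, so it only makes sense as the last candidate. A real application should still let the user override the guess.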

All of which is more or less true. But everyone's missing an important distinction.

The old deprecated method always opened a target file; the new one doesn't open shit.