The Text Encoding Quagmire

Why leave it simple when you can successfully complicate it? And how complicated is it really?


Text encodings aren't easy to understand. Even for developers. The bytes on disk never change by themselves. Text files don't have embedded encoding instructions. Save for Unicode files which are completely different. At least some of the time.

Text files have only text. The encoding helps I/O map bytes to glyphs and vice versa.

What's a text encoding? It's a specification for interpreting bytes in a file on disk for purposes of displaying glyphs on screen; alternately a specification for 'encoding' glyphs on screen back to a file on disk.

Open the same single byte text file using different encodings and you'll get completely different results on screen.
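A minimal sketch of that point, in Python purely for illustration. The four byte values are made up for the example; the codec names are the ones Python's standard library uses for Mac OS Roman, Windows Latin 1 and ISO Latin 1.

```python
# Four arbitrary bytes, all above the 7-bit ASCII range.
raw = bytes([0x8A, 0x8E, 0x9F, 0xC4])

# The bytes never change; only the table used to read them does.
as_mac_roman = raw.decode("mac_roman")
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

for name, text in [("Mac OS Roman", as_mac_roman),
                   ("Windows Latin 1", as_cp1252),
                   ("ISO Latin 1", as_latin1)]:
    print(f"{name:16} {text!r}")
```

Same file, three different sets of glyphs. Nothing in the bytes says which reading is the right one.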

Anyone who's grappled with Microsoft's 'embrace extend and exterminate' in this regard knows how terrible it can get. And web pages written in 'foreign' languages with 'foreign' character sets need their HTML header specs so browsers can correctly render them.

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<meta name="keywords" content="新闻中心,时政,人事任免,国际,地方,香港,台湾,澳门,华人,军事,图片,财经,政权,股票,房产,汽车,体育,奥运,法治,廉政,社会,科技,互联网,教育,文娱,电视剧,电影,视频,访谈,直播,专题" />

Not all text files today are single byte. Among the Unicode encodings only UTF-8 is read a byte at a time, and it's alone in this regard. UTF-8 is the invention of Unix creator Ken Thompson, working with Rob Pike. It's a way of using byte values beyond the reach of 7-bit ASCII to build multi-byte 'escape sequences' for values greater than 0x7f.
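A short sketch of how that plays out, assuming nothing beyond Python's built-in UTF-8 codec. The sample characters are chosen only for illustration.

```python
samples = ["A", "é", "€", "新"]

for ch in samples:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({len(encoded)} byte(s))")

# Number of bytes each sample needs in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
```

Plain ASCII stays one byte per character. Everything else becomes a run of two or more bytes, each of them 0x80 or higher, so no byte of a multi-byte sequence can be mistaken for ASCII.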

Unicode (UTF-16) files are 16-bit (two byte) and have a two byte prefix. Either 0xfffe or 0xfeff to denote byte order. After that it's two bytes at a time to read in the text. [Empty Unicode files using an encoding with a prefix are always nonzero in size because the prefix remains.]
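The prefix and the byte order can be seen in a few lines. This is only a sketch using the byte order constants from Python's codecs module; the sample text is arbitrary.

```python
import codecs

text = "Hi"

# The same two characters in both byte orders, each with its prefix.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(big_endian.hex())     # begins with feff
print(little_endian.hex())  # begins with fffe

# An "empty" document saved with a prefix still occupies two bytes.
empty_with_prefix = "".encode("utf-16")
print(len(empty_with_prefix))
```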

Single byte encodings are arbitrary. Single byte text files don't have encodings: they're read as having encodings. This might be the single most difficult thing to grasp.

Take a step back from the TextEdit preferences dialog. All the clues are there.

Single byte text files don't normally have encodings - they're opened and saved presuming they have one.

Things get very hairy when you read in files as having one encoding, write them out as having another encoding, read them in again with a third encoding, and so forth. There's no telling after a while what you had in the beginning.
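A small sketch of how quickly it goes wrong. The word and the chain of encodings are invented for the example.

```python
original = "café"

# Written out as UTF-8, read back in as Mac OS Roman.
step1 = original.encode("utf-8").decode("mac_roman")

# Written out as UTF-8 again, read back in as Windows Latin 1.
step2 = step1.encode("utf-8").decode("cp1252")

print(original, "->", step1, "->", step2)

# The damage can only be undone by someone who knows the exact chain.
recovered = step1.encode("mac_roman").decode("utf-8")
print(recovered)
```

Each wrong guess makes the text longer and stranger, and the bytes on disk carry no record of which guesses were made.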

NeXTSTEP had a brilliant way of dealing with this. Files were either Unicode or they were not. If they were Unicode - if they had the prefix - they were read in as such. If they didn't have the prefix they were read in as single byte files.

The NeXTSTEP NSString class took care of the above.

+[NSString stringWithContentsOfFile:]; // read a file
-[NSString writeToFile:atomically:]; // write a file
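The rule is simple enough to sketch. This is not NeXTSTEP's or Apple's code: read_old_style is a hypothetical name, Python is used only for illustration, and Mac OS Roman as the native encoding is an assumption taken from the OS X case.

```python
import codecs

def read_old_style(data: bytes, native: str = "mac_roman") -> str:
    """Decode as UTF-16 if a byte order prefix is present, else as native."""
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # The "utf-16" codec consumes the prefix and picks the byte order.
        return data.decode("utf-16")
    return data.decode(native)

print(read_old_style(b"\xff\xfeH\x00i\x00"))  # prefixed, little endian
print(read_old_style(b"\xfe\xff\x00H\x00i"))  # prefixed, big endian
print(read_old_style(b"caf\x8e"))             # no prefix: native single byte
```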

Files you save with the old system - the one that's been in use for 20 years - can be Unicode. UTF-16. No worries. They're read in and written out correctly. The system figures out what to do all by itself.

This worked great for years. Either it was native (Mac OS Roman on OS X) or it had to be UTF-16 - period. But things change. The OS X file system itself uses UTF-8 throughout - so why not? All fine and good and knock on wood.

But can't you hear Mark Pilgrim moaning?

The New APIs

Now Apple want to change the APIs. They want the following instead. Not 'in addition' but 'instead'. This began with 10.4 Tiger 29 April 2005 and continues with 10.5 Leopard and will continue with Leopard's successor.

The new APIs want you to specify an encoding on the way in and going out.

+[NSString stringWithContentsOfFile:encoding:error:];
-[NSString writeToFile:atomically:encoding:error:];

As most of your system is UTF-8 you'd think it was OK. But it's not OK if you have UTF-16 files lying around. Previously - for the past twenty years - the system was able to figure out encodings by itself. Now it can't anymore.

Or more correctly: someone in Cupertino doesn't want it to try anymore.

There may of course be reasons for change. There are a number of new encodings introduced with 10.4 Tiger that provide a portent of what's to come. The ACP's Lightman reports the following Cocoa encodings under the new system. The new Unicode encodings - UTF-32 and the explicit byte order variants of UTF-16 and UTF-32 - have the highest values and are at the end of the list.

Codes - Encodings
-----------------
(
    {"#" = 00000001; Encoding = "Western (ASCII)"; }, 
    {"#" = 00000002; Encoding = "Western (NextStep)"; }, 
    {"#" = 00000003; Encoding = "Japanese (EUC)"; }, 
    {"#" = 00000004; Encoding = "Unicode (UTF-8)"; }, 
    {"#" = 00000005; Encoding = "Western (ISO Latin 1)"; }, 
    {"#" = 00000006; Encoding = "Symbol (Mac OS)"; }, 
    {"#" = 00000007; Encoding = "Non-lossy ASCII"; }, 
    {"#" = 00000008; Encoding = "Japanese (Windows, DOS)"; }, 
    {"#" = 00000009; Encoding = "Central European (ISO Latin 2)"; }, 
    {"#" = 0000000A; Encoding = "Unicode (UTF-16)"; }, 
    {"#" = 0000000B; Encoding = "Cyrillic (Windows)"; }, 
    {"#" = 0000000C; Encoding = "Western (Windows Latin 1)"; }, 
    {"#" = 0000000D; Encoding = "Greek (Windows)"; }, 
    {"#" = 0000000E; Encoding = "Turkish (Windows Latin 5)"; }, 
    {"#" = 0000000F; Encoding = "Central European (Windows Latin 2)"; }, 
    {"#" = 00000015; Encoding = "Japanese (ISO 2022-JP)"; }, 
    {"#" = 0000001E; Encoding = "Western (Mac OS Roman)"; }, 
    {"#" = 80000001; Encoding = "Japanese (Mac OS)"; }, 
    {"#" = 80000002; Encoding = "Traditional Chinese (Mac OS)"; }, 
    {"#" = 80000003; Encoding = "Korean (Mac OS)"; }, 
    {"#" = 80000004; Encoding = "Arabic (Mac OS)"; }, 
    {"#" = 80000005; Encoding = "Hebrew (Mac OS)"; }, 
    {"#" = 80000006; Encoding = "Greek (Mac OS)"; }, 
    {"#" = 80000007; Encoding = "Cyrillic (Mac OS)"; }, 
    {"#" = 80000009; Encoding = "Devanagari (Mac OS)"; }, 
    {"#" = 8000000A; Encoding = "Gurmukhi (Mac OS)"; }, 
    {"#" = 8000000B; Encoding = "Gujarati (Mac OS)"; }, 
    {"#" = 80000015; Encoding = "Thai (Mac OS)"; }, 
    {"#" = 80000019; Encoding = "Simplified Chinese (Mac OS)"; }, 
    {"#" = 8000001A; Encoding = "Tibetan (Mac OS)"; }, 
    {"#" = 8000001D; Encoding = "Central European (Mac OS)"; }, 
    {"#" = 80000022; Encoding = "Dingbats (Mac OS)"; }, 
    {"#" = 80000023; Encoding = "Turkish (Mac OS)"; }, 
    {"#" = 80000024; Encoding = "Croatian (Mac OS)"; }, 
    {"#" = 80000025; Encoding = "Icelandic (Mac OS)"; }, 
    {"#" = 80000026; Encoding = "Romanian (Mac OS)"; }, 
    {"#" = 80000027; Encoding = "Celtic (Mac OS)"; }, 
    {"#" = 80000028; Encoding = "Gaelic (Mac OS)"; }, 
    {"#" = 80000029; Encoding = "Keyboard Symbols (Mac OS)"; }, 
    {"#" = 8000008C; Encoding = "Farsi (Mac OS)"; }, 
    {"#" = 80000098; Encoding = "Cyrillic (Mac OS Ukrainian)"; }, 
    {"#" = 800000EC; Encoding = "Inuit (Mac OS)"; }, 
    {"#" = 80000203; Encoding = "Western (ISO Latin 3)"; }, 
    {"#" = 80000204; Encoding = "Central European (ISO Latin 4)"; }, 
    {"#" = 80000205; Encoding = "Cyrillic (ISO 8859-5)"; }, 
    {"#" = 80000206; Encoding = "Arabic (ISO 8859-6)"; }, 
    {"#" = 80000207; Encoding = "Greek (ISO 8859-7)"; }, 
    {"#" = 80000208; Encoding = "Hebrew (ISO 8859-8)"; }, 
    {"#" = 80000209; Encoding = "Turkish (ISO Latin 5)"; }, 
    {"#" = 8000020A; Encoding = "Nordic (ISO Latin 6)"; }, 
    {"#" = 8000020B; Encoding = "Thai (ISO 8859-11)"; }, 
    {"#" = 8000020D; Encoding = "Baltic Rim (ISO Latin 7)"; }, 
    {"#" = 8000020E; Encoding = "Celtic (ISO Latin 8)"; }, 
    {"#" = 8000020F; Encoding = "Western (ISO Latin 9)"; }, 
    {"#" = 80000210; Encoding = "Romanian (ISO Latin 10)"; }, 
    {"#" = 80000400; Encoding = "Latin-US (DOS)"; }, 
    {"#" = 80000405; Encoding = "Greek (DOS)"; }, 
    {"#" = 80000406; Encoding = "Baltic Rim (DOS)"; }, 
    {"#" = 80000410; Encoding = "Western (DOS Latin 1)"; }, 
    {"#" = 80000411; Encoding = "Greek (DOS Greek 1)"; }, 
    {"#" = 80000412; Encoding = "Central European (DOS Latin 2)"; }, 
    {"#" = 80000413; Encoding = "Cyrillic (DOS)"; }, 
    {"#" = 80000414; Encoding = "Turkish (DOS)"; }, 
    {"#" = 80000415; Encoding = "Portuguese (DOS)"; }, 
    {"#" = 80000416; Encoding = "Icelandic (DOS)"; }, 
    {"#" = 80000417; Encoding = "Hebrew (DOS)"; }, 
    {"#" = 80000418; Encoding = "Canadian French (DOS)"; }, 
    {"#" = 80000419; Encoding = "Arabic (DOS)"; }, 
    {"#" = 8000041A; Encoding = "Nordic (DOS)"; }, 
    {"#" = 8000041B; Encoding = "Cyrillic (DOS)"; }, 
    {"#" = 8000041C; Encoding = "Greek (DOS Greek 2)"; }, 
    {"#" = 8000041D; Encoding = "Thai (Windows, DOS)"; }, 
    {"#" = 80000421; Encoding = "Simplified Chinese (Windows, DOS)"; }, 
    {"#" = 80000422; Encoding = "Korean (Windows, DOS)"; }, 
    {"#" = 80000423; Encoding = "Traditional Chinese (Windows, DOS)"; }, 
    {"#" = 80000505; Encoding = "Hebrew (Windows)"; }, 
    {"#" = 80000506; Encoding = "Arabic (Windows)"; }, 
    {"#" = 80000507; Encoding = "Baltic Rim (Windows)"; }, 
    {"#" = 80000508; Encoding = "Vietnamese (Windows)"; }, 
    {"#" = 80000628; Encoding = "Japanese (Shift JIS X0213)"; }, 
    {"#" = 80000631; Encoding = "Chinese (GBK)"; }, 
    {"#" = 80000632; Encoding = "Chinese (GB 18030)"; }, 
    {"#" = 80000840; Encoding = "Korean (ISO 2022-KR)"; }, 
    {"#" = 80000930; Encoding = "Simplified Chinese (EUC)"; }, 
    {"#" = 80000931; Encoding = "Traditional Chinese (EUC)"; }, 
    {"#" = 80000940; Encoding = "Korean (EUC)"; }, 
    {"#" = 80000A01; Encoding = "Japanese (Shift JIS)"; }, 
    {"#" = 80000A02; Encoding = "Cyrillic (KOI8-R)"; }, 
    {"#" = 80000A03; Encoding = "Traditional Chinese (Big 5)"; }, 
    {"#" = 80000A04; Encoding = "Western (Mac Mail)"; }, 
    {"#" = 80000A05; Encoding = "Simplified Chinese (HZ GB 2312)"; }, 
    {"#" = 80000A06; Encoding = "Traditional Chinese (Big 5 HKSCS)"; }, 
    {"#" = 80000A08; Encoding = "Ukrainian (KOI8-U)"; }, 
    {"#" = 80000A09; Encoding = "Traditional Chinese (Big 5-E)"; }, 
    {"#" = 80000C02; Encoding = "Western (EBCDIC Latin 1)"; }, 
    {"#" = 8C000100; Encoding = "Unicode (UTF-32)"; }, 
    {"#" = 90000100; Encoding = "Unicode (UTF-16BE)"; }, 
    {"#" = 94000100; Encoding = "Unicode (UTF-16LE)"; }, 
    {"#" = 98000100; Encoding = "Unicode (UTF-32BE)"; }, 
    {"#" = 9C000100; Encoding = "Unicode (UTF-32LE)"; }
)

The above is reflected somewhat in NSString.h.

enum {
   NSASCIIStringEncoding = 1,
   NSNEXTSTEPStringEncoding = 2,
   NSJapaneseEUCStringEncoding = 3,
   NSUTF8StringEncoding = 4,
   NSISOLatin1StringEncoding = 5,
   NSSymbolStringEncoding = 6,
   NSNonLossyASCIIStringEncoding = 7,
   NSShiftJISStringEncoding = 8,
   NSISOLatin2StringEncoding = 9,
   NSUnicodeStringEncoding = 10,
   NSWindowsCP1251StringEncoding = 11,
   NSWindowsCP1252StringEncoding = 12,
   NSWindowsCP1253StringEncoding = 13,
   NSWindowsCP1254StringEncoding = 14,
   NSWindowsCP1250StringEncoding = 15,
   NSISO2022JPStringEncoding = 21,
   NSMacOSRomanStringEncoding = 30,
   NSUTF32StringEncoding = 0x8c000100,
   NSUTF16BigEndianStringEncoding = 0x90000100,
   NSUTF16LittleEndianStringEncoding = 0x94000100,
   NSUTF32BigEndianStringEncoding = 0x98000100,
   NSUTF32LittleEndianStringEncoding = 0x9c000100,
   NSProprietaryStringEncoding = 65536
};

The New UTFs

If all the system's dealing with are single byte text files and double byte text files with an identifying prefix it's pretty easy. Either your files have a prefix of 0xfeff or 0xfffe or you just assume they're in the native (Mac OS Roman) single byte text format. And damn the consequences.

But now there are 32-bit UTF formats - and not all of them have identifying prefixes.

Here's a file saved with the different UTF encodings. The files all contain the ASCII character set from 0x20 to 0x7f inclusive. There's no difference between Mac OS Roman and UTF-8 in this case as there are no non-ASCII characters to escape; UTF-16 has the 0xfffe prefix; UTF-16BE and UTF-16LE do not; UTF-32 is 32-bit but it has the prefix; UTF-32BE and UTF-32LE are also 32-bit but they do not have the prefix. Those with no prefix can't be automatically recognised.

Mac OS Roman
------------
00000000  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f  | !"#$%&'()*+,-./|
00000010  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  |0123456789:;<=>?|
00000020  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  |@ABCDEFGHIJKLMNO|
00000030  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  |PQRSTUVWXYZ[\]^_|
00000040  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  |`abcdefghijklmno|
00000050  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  |pqrstuvwxyz{|}~.|

UTF-8
-----
00000000  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f  | !"#$%&'()*+,-./|
00000010  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f  |0123456789:;<=>?|
00000020  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f  |@ABCDEFGHIJKLMNO|
00000030  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f  |PQRSTUVWXYZ[\]^_|
00000040  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  |`abcdefghijklmno|
00000050  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f  |pqrstuvwxyz{|}~.|

UTF-16
------
00000000  ff fe 20 00 21 00 22 00  23 00 24 00 25 00 26 00  |.. .!.".#.$.%.&.|
00000010  27 00 28 00 29 00 2a 00  2b 00 2c 00 2d 00 2e 00  |'.(.).*.+.,.-...|
00000020  2f 00 30 00 31 00 32 00  33 00 34 00 35 00 36 00  |/.0.1.2.3.4.5.6.|
00000030  37 00 38 00 39 00 3a 00  3b 00 3c 00 3d 00 3e 00  |7.8.9.:.;.<.=.>.|
00000040  3f 00 40 00 41 00 42 00  43 00 44 00 45 00 46 00  |?.@.A.B.C.D.E.F.|
00000050  47 00 48 00 49 00 4a 00  4b 00 4c 00 4d 00 4e 00  |G.H.I.J.K.L.M.N.|
00000060  4f 00 50 00 51 00 52 00  53 00 54 00 55 00 56 00  |O.P.Q.R.S.T.U.V.|
00000070  57 00 58 00 59 00 5a 00  5b 00 5c 00 5d 00 5e 00  |W.X.Y.Z.[.\.].^.|
00000080  5f 00 60 00 61 00 62 00  63 00 64 00 65 00 66 00  |_.`.a.b.c.d.e.f.|
00000090  67 00 68 00 69 00 6a 00  6b 00 6c 00 6d 00 6e 00  |g.h.i.j.k.l.m.n.|
000000a0  6f 00 70 00 71 00 72 00  73 00 74 00 75 00 76 00  |o.p.q.r.s.t.u.v.|
000000b0  77 00 78 00 79 00 7a 00  7b 00 7c 00 7d 00 7e 00  |w.x.y.z.{.|.}.~.|
000000c0  7f 00                                             |..|

UTF-16BE
--------
00000000  00 20 00 21 00 22 00 23  00 24 00 25 00 26 00 27  |. .!.".#.$.%.&.'|
00000010  00 28 00 29 00 2a 00 2b  00 2c 00 2d 00 2e 00 2f  |.(.).*.+.,.-.../|
00000020  00 30 00 31 00 32 00 33  00 34 00 35 00 36 00 37  |.0.1.2.3.4.5.6.7|
00000030  00 38 00 39 00 3a 00 3b  00 3c 00 3d 00 3e 00 3f  |.8.9.:.;.<.=.>.?|
00000040  00 40 00 41 00 42 00 43  00 44 00 45 00 46 00 47  |.@.A.B.C.D.E.F.G|
00000050  00 48 00 49 00 4a 00 4b  00 4c 00 4d 00 4e 00 4f  |.H.I.J.K.L.M.N.O|
00000060  00 50 00 51 00 52 00 53  00 54 00 55 00 56 00 57  |.P.Q.R.S.T.U.V.W|
00000070  00 58 00 59 00 5a 00 5b  00 5c 00 5d 00 5e 00 5f  |.X.Y.Z.[.\.].^._|
00000080  00 60 00 61 00 62 00 63  00 64 00 65 00 66 00 67  |.`.a.b.c.d.e.f.g|
00000090  00 68 00 69 00 6a 00 6b  00 6c 00 6d 00 6e 00 6f  |.h.i.j.k.l.m.n.o|
000000a0  00 70 00 71 00 72 00 73  00 74 00 75 00 76 00 77  |.p.q.r.s.t.u.v.w|
000000b0  00 78 00 79 00 7a 00 7b  00 7c 00 7d 00 7e 00 7f  |.x.y.z.{.|.}.~..|

UTF-16LE
--------
00000000  20 00 21 00 22 00 23 00  24 00 25 00 26 00 27 00  | .!.".#.$.%.&.'.|
00000010  28 00 29 00 2a 00 2b 00  2c 00 2d 00 2e 00 2f 00  |(.).*.+.,.-.../.|
00000020  30 00 31 00 32 00 33 00  34 00 35 00 36 00 37 00  |0.1.2.3.4.5.6.7.|
00000030  38 00 39 00 3a 00 3b 00  3c 00 3d 00 3e 00 3f 00  |8.9.:.;.<.=.>.?.|
00000040  40 00 41 00 42 00 43 00  44 00 45 00 46 00 47 00  |@.A.B.C.D.E.F.G.|
00000050  48 00 49 00 4a 00 4b 00  4c 00 4d 00 4e 00 4f 00  |H.I.J.K.L.M.N.O.|
00000060  50 00 51 00 52 00 53 00  54 00 55 00 56 00 57 00  |P.Q.R.S.T.U.V.W.|
00000070  58 00 59 00 5a 00 5b 00  5c 00 5d 00 5e 00 5f 00  |X.Y.Z.[.\.].^._.|
00000080  60 00 61 00 62 00 63 00  64 00 65 00 66 00 67 00  |`.a.b.c.d.e.f.g.|
00000090  68 00 69 00 6a 00 6b 00  6c 00 6d 00 6e 00 6f 00  |h.i.j.k.l.m.n.o.|
000000a0  70 00 71 00 72 00 73 00  74 00 75 00 76 00 77 00  |p.q.r.s.t.u.v.w.|
000000b0  78 00 79 00 7a 00 7b 00  7c 00 7d 00 7e 00 7f 00  |x.y.z.{.|.}.~...|

UTF-32
------
00000000  00 00 fe ff 20 00 00 00  21 00 00 00 22 00 00 00  |.... ...!..."...|
00000010  23 00 00 00 24 00 00 00  25 00 00 00 26 00 00 00  |#...$...%...&...|
00000020  27 00 00 00 28 00 00 00  29 00 00 00 2a 00 00 00  |'...(...)...*...|
00000030  2b 00 00 00 2c 00 00 00  2d 00 00 00 2e 00 00 00  |+...,...-.......|
00000040  2f 00 00 00 30 00 00 00  31 00 00 00 32 00 00 00  |/...0...1...2...|
00000050  33 00 00 00 34 00 00 00  35 00 00 00 36 00 00 00  |3...4...5...6...|
00000060  37 00 00 00 38 00 00 00  39 00 00 00 3a 00 00 00  |7...8...9...:...|
00000070  3b 00 00 00 3c 00 00 00  3d 00 00 00 3e 00 00 00  |;...<...=...>...|
00000080  3f 00 00 00 40 00 00 00  41 00 00 00 42 00 00 00  |?...@...A...B...|
00000090  43 00 00 00 44 00 00 00  45 00 00 00 46 00 00 00  |C...D...E...F...|
000000a0  47 00 00 00 48 00 00 00  49 00 00 00 4a 00 00 00  |G...H...I...J...|
000000b0  4b 00 00 00 4c 00 00 00  4d 00 00 00 4e 00 00 00  |K...L...M...N...|
000000c0  4f 00 00 00 50 00 00 00  51 00 00 00 52 00 00 00  |O...P...Q...R...|
000000d0  53 00 00 00 54 00 00 00  55 00 00 00 56 00 00 00  |S...T...U...V...|
000000e0  57 00 00 00 58 00 00 00  59 00 00 00 5a 00 00 00  |W...X...Y...Z...|
000000f0  5b 00 00 00 5c 00 00 00  5d 00 00 00 5e 00 00 00  |[...\...]...^...|
00000100  5f 00 00 00 60 00 00 00  61 00 00 00 62 00 00 00  |_...`...a...b...|
00000110  63 00 00 00 64 00 00 00  65 00 00 00 66 00 00 00  |c...d...e...f...|
00000120  67 00 00 00 68 00 00 00  69 00 00 00 6a 00 00 00  |g...h...i...j...|
00000130  6b 00 00 00 6c 00 00 00  6d 00 00 00 6e 00 00 00  |k...l...m...n...|
00000140  6f 00 00 00 70 00 00 00  71 00 00 00 72 00 00 00  |o...p...q...r...|
00000150  73 00 00 00 74 00 00 00  75 00 00 00 76 00 00 00  |s...t...u...v...|
00000160  77 00 00 00 78 00 00 00  79 00 00 00 7a 00 00 00  |w...x...y...z...|
00000170  7b 00 00 00 7c 00 00 00  7d 00 00 00 7e 00 00 00  |{...|...}...~...|
00000180  7f 00 00 00                                       |....|

UTF-32BE
--------
00000000  00 00 00 20 00 00 00 21  00 00 00 22 00 00 00 23  |... ...!..."...#|
00000010  00 00 00 24 00 00 00 25  00 00 00 26 00 00 00 27  |...$...%...&...'|
00000020  00 00 00 28 00 00 00 29  00 00 00 2a 00 00 00 2b  |...(...)...*...+|
00000030  00 00 00 2c 00 00 00 2d  00 00 00 2e 00 00 00 2f  |...,...-......./|
00000040  00 00 00 30 00 00 00 31  00 00 00 32 00 00 00 33  |...0...1...2...3|
00000050  00 00 00 34 00 00 00 35  00 00 00 36 00 00 00 37  |...4...5...6...7|
00000060  00 00 00 38 00 00 00 39  00 00 00 3a 00 00 00 3b  |...8...9...:...;|
00000070  00 00 00 3c 00 00 00 3d  00 00 00 3e 00 00 00 3f  |...<...=...>...?|
00000080  00 00 00 40 00 00 00 41  00 00 00 42 00 00 00 43  |...@...A...B...C|
00000090  00 00 00 44 00 00 00 45  00 00 00 46 00 00 00 47  |...D...E...F...G|
000000a0  00 00 00 48 00 00 00 49  00 00 00 4a 00 00 00 4b  |...H...I...J...K|
000000b0  00 00 00 4c 00 00 00 4d  00 00 00 4e 00 00 00 4f  |...L...M...N...O|
000000c0  00 00 00 50 00 00 00 51  00 00 00 52 00 00 00 53  |...P...Q...R...S|
000000d0  00 00 00 54 00 00 00 55  00 00 00 56 00 00 00 57  |...T...U...V...W|
000000e0  00 00 00 58 00 00 00 59  00 00 00 5a 00 00 00 5b  |...X...Y...Z...[|
000000f0  00 00 00 5c 00 00 00 5d  00 00 00 5e 00 00 00 5f  |...\...]...^..._|
00000100  00 00 00 60 00 00 00 61  00 00 00 62 00 00 00 63  |...`...a...b...c|
00000110  00 00 00 64 00 00 00 65  00 00 00 66 00 00 00 67  |...d...e...f...g|
00000120  00 00 00 68 00 00 00 69  00 00 00 6a 00 00 00 6b  |...h...i...j...k|
00000130  00 00 00 6c 00 00 00 6d  00 00 00 6e 00 00 00 6f  |...l...m...n...o|
00000140  00 00 00 70 00 00 00 71  00 00 00 72 00 00 00 73  |...p...q...r...s|
00000150  00 00 00 74 00 00 00 75  00 00 00 76 00 00 00 77  |...t...u...v...w|
00000160  00 00 00 78 00 00 00 79  00 00 00 7a 00 00 00 7b  |...x...y...z...{|
00000170  00 00 00 7c 00 00 00 7d  00 00 00 7e 00 00 00 7f  |...|...}...~....|

UTF-32LE
--------
00000000  20 00 00 00 21 00 00 00  22 00 00 00 23 00 00 00  | ...!..."...#...|
00000010  24 00 00 00 25 00 00 00  26 00 00 00 27 00 00 00  |$...%...&...'...|
00000020  28 00 00 00 29 00 00 00  2a 00 00 00 2b 00 00 00  |(...)...*...+...|
00000030  2c 00 00 00 2d 00 00 00  2e 00 00 00 2f 00 00 00  |,...-......./...|
00000040  30 00 00 00 31 00 00 00  32 00 00 00 33 00 00 00  |0...1...2...3...|
00000050  34 00 00 00 35 00 00 00  36 00 00 00 37 00 00 00  |4...5...6...7...|
00000060  38 00 00 00 39 00 00 00  3a 00 00 00 3b 00 00 00  |8...9...:...;...|
00000070  3c 00 00 00 3d 00 00 00  3e 00 00 00 3f 00 00 00  |<...=...>...?...|
00000080  40 00 00 00 41 00 00 00  42 00 00 00 43 00 00 00  |@...A...B...C...|
00000090  44 00 00 00 45 00 00 00  46 00 00 00 47 00 00 00  |D...E...F...G...|
000000a0  48 00 00 00 49 00 00 00  4a 00 00 00 4b 00 00 00  |H...I...J...K...|
000000b0  4c 00 00 00 4d 00 00 00  4e 00 00 00 4f 00 00 00  |L...M...N...O...|
000000c0  50 00 00 00 51 00 00 00  52 00 00 00 53 00 00 00  |P...Q...R...S...|
000000d0  54 00 00 00 55 00 00 00  56 00 00 00 57 00 00 00  |T...U...V...W...|
000000e0  58 00 00 00 59 00 00 00  5a 00 00 00 5b 00 00 00  |X...Y...Z...[...|
000000f0  5c 00 00 00 5d 00 00 00  5e 00 00 00 5f 00 00 00  |\...]...^..._...|
00000100  60 00 00 00 61 00 00 00  62 00 00 00 63 00 00 00  |`...a...b...c...|
00000110  64 00 00 00 65 00 00 00  66 00 00 00 67 00 00 00  |d...e...f...g...|
00000120  68 00 00 00 69 00 00 00  6a 00 00 00 6b 00 00 00  |h...i...j...k...|
00000130  6c 00 00 00 6d 00 00 00  6e 00 00 00 6f 00 00 00  |l...m...n...o...|
00000140  70 00 00 00 71 00 00 00  72 00 00 00 73 00 00 00  |p...q...r...s...|
00000150  74 00 00 00 75 00 00 00  76 00 00 00 77 00 00 00  |t...u...v...w...|
00000160  78 00 00 00 79 00 00 00  7a 00 00 00 7b 00 00 00  |x...y...z...{...|
00000170  7c 00 00 00 7d 00 00 00  7e 00 00 00 7f 00 00 00  ||...}...~.......|
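The sizes and leading bytes in the dumps can be cross-checked with a short sketch. Python's codecs stand in for the Cocoa encodings here, which is an assumption. The prefixed variants below use the little-endian prefixes from Python's codecs module, so the UTF-32 prefix will not match the first four bytes of the UTF-32 dump shown earlier in the article.

```python
import codecs

# The ASCII range used in the dumps: 0x20 through 0x7f inclusive.
text = "".join(chr(c) for c in range(0x20, 0x80))

variants = {
    "Mac OS Roman": text.encode("mac_roman"),
    "UTF-8": text.encode("utf-8"),
    "UTF-16 (with prefix)": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "UTF-16BE": text.encode("utf-16-be"),
    "UTF-16LE": text.encode("utf-16-le"),
    "UTF-32 (with prefix)": codecs.BOM_UTF32_LE + text.encode("utf-32-le"),
    "UTF-32BE": text.encode("utf-32-be"),
    "UTF-32LE": text.encode("utf-32-le"),
}

for name, data in variants.items():
    print(f"{name:22} {len(data):4} bytes, starts {data[:4].hex()}")
```

The unprefixed 16-bit and 32-bit files are exactly two and four times the size of the single byte file. The prefixed ones add two and four bytes on top of that.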

Now if the new encodings were conspicuous in their total absence and the old APIs were left in place all would be fine. It would also be fine - or beyond reproach - if the old APIs were left in place anyway and their use left up to the discretion of the user. But that would be too smooth. Apple have instead officially deprecated the old APIs. All ISV code sooner or later has to stop using them as a version of OS X can at any time come along and no longer support them.

The API changes mentioned above took place for Tiger 29 April 2005: that's when things started to spin out of control. Let's review.

Old Way

+[NSString stringWithContentsOfFile:];
-[NSString writeToFile:atomically:];

New Way

+[NSString stringWithContentsOfFile:encoding:error:];
-[NSString writeToFile:atomically:encoding:error:];

The old way you simply use the path to the file. That's it. The old 'writeToFile:' has a second parameter but that's got nothing to do with text encoding. It's about whether you want files saved 'atomically' - the system first saves to a neutral location and then 'moves' the file into place.

Now you have to specify an encoding both going in and coming out. You have a chance to specify a pointer to an NSError variable you set up for 'error:' in case something doesn't go according to expectations and you want to issue a diagnostic to the program user.
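A rough analogue of that calling pattern, sketched in Python rather than Cocoa. read_with_encoding is a hypothetical helper and the returned (text, error) pair only imitates the 'error:' out-parameter; it is not how NSString is implemented.

```python
import os
import tempfile

def read_with_encoding(path, encoding):
    """Return (text, error); exactly one of the two is None."""
    try:
        with open(path, "r", encoding=encoding, newline="") as f:
            return f.read(), None
    except (OSError, UnicodeError) as exc:
        return None, exc

# Demo: a small UTF-16 file with the ff fe prefix.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "wb") as f:
        f.write(b"\xff\xfeH\x00i\x00")
    wrong_text, wrong_error = read_with_encoding(path, "utf-8")
    right_text, right_error = read_with_encoding(path, "utf-16")

print(wrong_text, type(wrong_error).__name__)
print(right_text, right_error)
```

The caller has to know the answer before asking the question. Guess UTF-8 for a UTF-16 file and all that comes back is the error.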

What's NSError? This is what.

@interface NSError : NSObject <NSCopying, NSCoding> {
    void *_reserved;

    int _code;
    NSString *_domain;
    NSDictionary *_userInfo;
}
// ... method declarations omitted ...
@end

You can ask an NSError lots of things.

           -(int)[NSError code];
    -(NSString *)[NSError domain];
    -(NSString *)[NSError localizedDescription];
    -(NSString *)[NSError localizedFailureReason];
     -(NSArray *)[NSError localizedRecoveryOptions];
    -(NSString *)[NSError localizedRecoverySuggestion];
            -(id)[NSError recoveryAttempter];
-(NSDictionary *)[NSError userInfo];

The 'localized' methods access values in the _userInfo dictionary if available; otherwise things are constructed on the fly given _code and _domain.

Method                       _userInfo Key
---------------------------  -------------------------------------
localizedDescription         NSLocalizedDescriptionKey
localizedFailureReason       NSLocalizedFailureReasonErrorKey
localizedRecoveryOptions     NSLocalizedRecoveryOptionsErrorKey
localizedRecoverySuggestion  NSLocalizedRecoverySuggestionErrorKey
recoveryAttempter            NSRecoveryAttempterErrorKey

So there's a lot of stuff there if you want to get it on with your end user. But unfortunately there are many situations where it simply isn't going to work. When you're working through the Cocoa document controller you're being called by the system and asked to read in and write out files - and that good old document controller is really interested to know how things turn out - so much so in fact it's going to issue its own error dialog if you tell it something went wrong.

All methods that interact with NSDocumentController have to indicate success or failure in some way. Either they tell the document controller if things turned out OK or they're being asked to return a pointer to data the controller is going to write to disk - in which case a zero pointer tells the controller something didn't work out.

// Document controller wants app to read
-(BOOL)[NSDocument loadDataRepresentation:ofType:]; (deprecated)
-(BOOL)[NSDocument readFromData:ofType:error:];

// Document controller wants data to write
-(NSData *)[NSDocument dataRepresentationOfType:]; (deprecated)
-(NSData *)[NSDocument dataOfType:error:];

So the system borks and the user gets two message boxes one after the other. Suck it up.

Thankfully you can specify '0' for the argument to 'error:' so the system ignores your 'pointer' and won't try to save anything to it. So essentially you have a parameter you can't always realistically use but can still fortunately work around.

Now the encoding. In the old days you let the system decide what to do. If your file was Mac OS Roman it was read in as such. Actually that should read 'if your file could be read as Mac OS Roman'.

Now it's perhaps interesting to note that UTF-16 and ordinary single byte text files are mutually exclusive. There are no 100% guarantees in theory but as good as in daily use. UTF-16 files start with the weird two byte prefix and normally contain a lot of zero bytes. Mac OS Roman and all other single byte encodings can't tolerate the zero bytes.
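The zero byte observation is easy to check. The following is only a sketch: looks_double_byte is a hypothetical helper, the sentence is arbitrary, and the test is a rough heuristic, not a guarantee.

```python
def looks_double_byte(data: bytes) -> bool:
    """Rough guess: a byte order prefix or any zero byte suggests UTF-16."""
    has_prefix = data[:2] in (b"\xff\xfe", b"\xfe\xff")
    return has_prefix or b"\x00" in data

sentence = "Plain text, nothing fancy."
single = sentence.encode("mac_roman")
double = sentence.encode("utf-16-le")  # no prefix at all

# Count the zero bytes in each rendering of the same sentence.
print(single.count(0), double.count(0))
print(looks_double_byte(single), looks_double_byte(double))
```

Every ASCII character in the 16-bit file drags a zero byte along with it. The 'no 100% guarantees' caveat still holds: a prefix-free UTF-16 file made up entirely of characters such as 新 (bytes b0 65 in little-endian order) contains no zero bytes at all.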

And same way back out again. You don't specify the encoding because you don't need to.

The old way you get two types of files but you never need to care.

  1. Ordinary single byte 'native' (Mac OS Roman) files.
  2. Unicode double byte files capable of storing anything.

The system took care of it all for you. And it never used Unicode (double byte) to store a file if it didn't need to - even if you'd read it in that way. If your file started as a double byte UTF-16 but you removed the non-native part the system would save as single byte. Automatically.
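The saving side of the old scheme can be sketched the same way. encode_old_style is a hypothetical name, and both the Mac OS Roman default and the big-endian byte order of the fallback are assumptions made for the example; this is not the system's actual code.

```python
import codecs

def encode_old_style(text: str, native: str = "mac_roman") -> bytes:
    """Single byte if every character fits the native table, else UTF-16."""
    try:
        return text.encode(native)
    except UnicodeEncodeError:
        # Something doesn't fit: fall back to prefixed double byte Unicode.
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

mixed = "新闻 news"
print(encode_old_style(mixed).hex())

# The user deletes the non-native part; the next save is single byte again.
edited = mixed.replace("新闻 ", "")
print(encode_old_style(edited))
```

The decision is made from the text as it is at save time, not from whatever encoding the file happened to have when it was opened.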

The system doesn't care. The applications don't care. Look at NeXTSTEP code going all the way back. As found today in Etoile, GNUstep, et al: they have these APIs - they literally don't care. It just works. Or: it used to work but doesn't want to anymore.

Cut to 2004. A lot of things happen as we all know today and guess what? This is another of them.

Now the old APIs still exist - but they're deprecated. This means they can officially disappear at any time without further notice. Notice has already been given - get it?

That's the situation.

A Way Around Everything?

There's another API for reading in files of course.

+[NSString stringWithContentsOfFile:usedEncoding:error:];

You don't specify an encoding: you let the system figure it out like before but you can see what encoding was used. For 'usedEncoding:' you provide a pointer to an 'NSStringEncoding' variable (unsigned integer). You read the unsigned integer back afterwards to see what encoding the system used to read the file.

But what the bloody good is that? None at all.

  • When you read in a file your user - hold on for dear life - can edit the file and change its contents. And the old encoding used to read in the file may not work to save it.

Oh whoa that's so heavy you might need to take a moment. Go ahead.

OK. Next point.

  • UTF-8 is the only way you can possibly save unless you go back to the old scheme again with the helpful crutches the new team put on it. Those crutches and their booster rockets. [Or so they think.]

  • If you save files as UTF-8 you've saved as single byte and the system can't see what encoding you used. There is a suggested prefix for UTF-8 files but not only does it interfere with the 'text only' aspect of text but also it's not even a de facto standard much less an official one. So reading the file again won't reveal a thing. And if you saved as UTF-8 because you needed to your file will look like gibberish when it gets read again into your program.

  • If you let the system decide what encoding to use and it chooses Unicode that's no guarantee you need Unicode to save. The user might have removed all the 'non-ASCII' stuff that's been in the file. In the old days the system would figure this stuff out by itself; now with the 'new' APIs you can't do that anymore - you'll get double byte Unicode even if you don't need it.

  • This means that even if you once opened a file as Unicode and since then removed the Unicode characters your file will still be saved as Unicode forever after - as the system now being promoted can't have a clue what's going on.

  • NSString has always had methods to heuristically determine optimal encodings for storing files - methods such as fastestEncoding and smallestEncoding - and of the latter the documentation says 'this method may take some time to execute' - but this was previously done automatically by the system on file writes and the new encodings don't make this any more difficult. So why the deprecation?
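Picking the most compact encoding at save time doesn't take much. The sketch below is a brute-force stand-in and not how NSString's smallestEncoding works internally; the function name and the candidate list are assumptions for the example.

```python
def smallest_encoding(text, candidates=("ascii", "mac_roman", "utf-8", "utf-16")):
    """Return (name, size) of the shortest candidate that can hold the text."""
    best = None
    for name in candidates:
        try:
            size = len(text.encode(name))
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the text
        if best is None or size < best[1]:
            best = (name, size)  # ties keep the earlier candidate
    return best

for sample in ["hello", "café", "新闻中心"]:
    print(sample, smallest_encoding(sample))
```

Encoding the whole text once per candidate is the slow way to do it, which fits the documentation's warning that the method may take some time to execute.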

Out with the Old?

Now the old methods haven't been removed. Yet. But they might be at any time.

They don't really conflict with anything. They don't really need to be removed. The system can welcome new APIs without removing the old ones. The issue is someone thinks the new APIs supersede the old APIs - which as has been demonstrated they categorically do not.

Playing guessing games with file systems about how to read and write files is going to wear out users and programmers alike. 10.5 Leopard's begun to use 'text encoding' extended attributes but that's hardly a cross platform solution. If text is to remain the 'lingua franca' of the Internet as Doug McIlroy envisioned then portability must be maintained - or when 8 bits or even 16 bits no longer can manage it new easily identifiable encodings must be used. And systems should continue to recognise these text encodings intuitively and by themselves without endless user interaction.

If you think this whole thing stinks then write to Apple. But don't expect things to change. The only good your writing does is give you a chance to vent a bit of anger and frustration. Things are more complex today but there's no reason things should be more complicated to use.

Postscript: usedEncoding

It turns out the 'new' method with 'usedEncoding:' is even more worthless than expected.

Target is Mac OS Roman.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.


Target is UTF-8.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.


Target is UTF-16.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: Used encoding: 0000000A.


Target is UTF-16BE.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.


Target is UTF-16LE.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.


Target is UTF-32.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.


Target is UTF-32BE.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.


Target is UTF-32LE.

$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.

The only one it picks up is UTF-16: the 0000000A it reports is 10, the value of NSUnicodeStringEncoding. It can't even pick up UTF-32 despite there being a signature prefix.

Talk about lame.

http://www.omnigroup.com/mailman/archive/macosx-dev/2005-June/056865.html

You've no doubt discovered that encoding-sniffing is new with Tiger.

Before this, people muddled through as best they could. You can load an NSData with the contents of the file and sniff the first four bytes of the file for Unicode byte-order markers (a file can still be UTF without a BOM). You can sniff the whole contents for bit 7 being set, and if it never is, pick ASCII.
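The two checks described above are simple enough to sketch. What follows is an illustration in Python rather than Cocoa, and the names `sniff_bom` and `is_seven_bit` are made up for the example. It assumes the whole file has already been read into memory as bytes. Note the ambiguity it can't resolve: a UTF-16LE file whose first character after the prefix is U+0000 looks exactly like UTF-32LE.

```python
import codecs

# Longest signatures first: the UTF-32LE prefix (ff fe 00 00)
# begins with the UTF-16LE prefix (ff fe).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (encoding, prefix_length) if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


def is_seven_bit(data):
    """True if bit 7 is never set, so the bytes can safely be read as ASCII."""
    return all(byte < 0x80 for byte in data)
```

A file with no prefix and no high bits is ASCII. Anything else without a prefix is still a guess.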

After that, you guess based on your market. Mac Roman encoding is often the safest 8-bit encoding, though UTF-8 is taking over. ISO Latin-1, if you're dealing with Windows-origin text that isn't Unicode.

When you're ready to throw the dice, use -[NSString initWithData:encoding:] and watch the fun.

http://www.omnigroup.com/mailman/archive/macosx-dev/2005-June/056893.html

In my tests, the only encoding this method could sniff was UTF-16. Not UTF-8, or ASCII, or Windows Cyrillic.

http://lists.apple.com/archives/cocoa-dev/2006/Apr/msg00747.html

I'm trying to initialize a string using NSString - initWithContentsOfFile:usedEncoding:error:. I'm getting an error (#261) that says the file can't be opened using the specified text encoding. This happens whether I specify an encoding or not. According to the docs, this method is supposed to try to determine the encoding used and return it by reference.

On the other hand, if I use -initWithContentsOfFile:encoding:error: and specify NSUTF8StringEncoding, it works fine. If I could count on the files I'm opening always being in UTF8 encoding, that would be great. But I'm not sure that's realistic given that TextEdit's default encoding for plain text is Mac OS Roman. I suppose the other solution is to iterate over every possible encoding until I don't get an error.

So am I using the first method the wrong way, or is it not working?
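The 'iterate until there's no error' idea from the post above looks roughly like this. Again a sketch in Python, not the Cocoa API, with a hypothetical function name and a candidate list chosen only for illustration. Order matters: a single byte encoding such as Mac OS Roman accepts practically any byte sequence, so it has to come last or the stricter encodings never get a chance.

```python
def first_encoding_that_decodes(data, candidates=("utf-8", "mac_roman")):
    """Try each candidate in order; return (name, text) for the first that decodes.

    Returns None if every candidate rejects the bytes.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next
    return None
```

Decoding without an error only proves the bytes are legal in that encoding. It doesn't prove the result is what the author of the file saw on screen.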

http://lists.apple.com/archives/cocoa-dev/2006/Apr/msg00766.html

It does try, but alas it is not particularly good at it. IIRC, it would correctly recognize UTF-16 and -32 if they have their prefixes, and that's about all (perhaps plain ASCII, too :))

If you need to determine the encoding from the data, it's best to DIY at any level you need (from just trying which encoding can interpret the data through a frequency analysis of characters to a full-blown analysis which may include spellchecking the text in the target language -- if known -- and selecting the encoding which yields the least number of misspelled words).

Whatever you do though, don't forget to allow the user to override the encoding, for just *any* heuristic is bound to fail sometimes.
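A very crude stand-in for the heuristics suggested above, with the user override built in. This is only a sketch in Python under loud assumptions: the name `guess_encoding` is hypothetical, and counting 'suspicious' characters (anything that isn't a letter, digit, whitespace or ASCII punctuation) is a cheap substitute for real frequency analysis or spellchecking, not an equivalent.

```python
import string


def guess_encoding(data, candidates, override=None):
    """Return the user's override if given, else the candidate whose decoding
    contains the fewest suspicious characters. Returns None if nothing decodes."""
    if override is not None:
        return override  # the user always wins
    best_name, best_score = None, None
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # not even legal in this encoding
        # Characters unlikely to appear in ordinary running text.
        suspicious = sum(
            1
            for ch in text
            if not (ch.isalnum() or ch.isspace() or ch in string.punctuation)
        )
        if best_score is None or suspicious < best_score:
            best_name, best_score = name, suspicious
    return best_name
```

On short files the scores will often tie, in which case the first candidate in the list wins. Which is exactly why the override has to be there.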

All of which is more or less true. But everyone's missing an important distinction.

The old deprecated method always opened a target file; the new one doesn't open shit.

Write to Apple. Ask for Avie.

Copyright © Rixstep. All rights reserved.