The Text Encoding Quagmire
Why leave it simple when you can successfully complicate it? And how complicated is it really?
Text encodings aren't easy to understand. Even for developers. The bytes on disk never change by themselves. Text files don't have embedded encoding instructions. Save for Unicode files which are completely different. At least some of the time.
Text files have only text. The encoding helps I/O map bytes to glyphs and vice versa.
What's a text encoding? It's a specification for interpreting the bytes in a file on disk as characters, for purposes of displaying glyphs on screen; alternatively, a specification for 'encoding' the characters behind the glyphs on screen back to bytes in a file on disk.
Open the same single byte text file using different encodings and you'll get completely different results on screen.
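To make that concrete, here's a minimal sketch using Python's built-in codecs. The three encodings are picked for illustration only, and the output prints code points rather than glyphs so it survives any terminal.

```python
# One byte, three interpretations. The byte on disk never changes;
# only the table used to read it does.
raw = bytes([0xE9])

as_mac_roman = raw.decode("mac_roman")   # E with grave accent
as_latin1 = raw.decode("latin_1")        # e with acute accent
as_cyrillic = raw.decode("cp1251")       # Cyrillic short i

for name, text in [("Mac OS Roman", as_mac_roman),
                   ("ISO Latin 1", as_latin1),
                   ("Cyrillic (Windows)", as_cyrillic)]:
    print(f"{name:20} 0x{raw[0]:02x} -> U+{ord(text):04X}")
```

Same file, same byte, three different characters on screen.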
Anyone who's grappled with Microsoft's 'embrace extend and exterminate' in this regard knows how terrible it can get. And web pages written in 'foreign' languages with 'foreign' character sets need their HTML header specs so browsers can correctly render them.
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<meta name="keywords" content="新闻中心,时政,人事任免,国际,地方,香港,台湾,澳门,华人,军事,图片,财经,政权,股票,房产,汽车,体育,奥运,法治,廉政,社会,科技,互联网,教育,文娱,电视剧,电影,视频,访谈,直播,专题" />
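Not all text files today are single byte. Unicode UTF-8 is byte-oriented - its code unit is a single byte - and among the Unicode encodings it's alone in this regard. But one character can take anywhere from one to four of those bytes. UTF-8 is the invention of Unix co-creator Ken Thompson, working with Rob Pike. It's a way of using byte values beyond the reach of 7-bit ASCII to build multi-byte 'escape sequences' for code points greater than 0x7f.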
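A rough sketch of how those sequences are put together, packing one code point by hand. The function name `utf8_encode` is made up for this example, and it deliberately skips validation (surrogates, out-of-range values) that a real encoder would need.

```python
def utf8_encode(code_point: int) -> bytes:
    """Pack one Unicode code point into UTF-8 bytes by hand (no validation)."""
    if code_point < 0x80:            # 7-bit ASCII: stored as itself
        return bytes([code_point])
    if code_point < 0x800:           # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

print(utf8_encode(0x41).hex())    # plain ASCII stays one byte
print(utf8_encode(0xE9).hex())    # e-acute needs two
```

Every byte of a multi-byte sequence has its high bit set, which is why pure ASCII text is byte-for-byte identical in UTF-8.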
Unicode (UTF-16) files are 16-bit (two byte) and have a two byte prefix: the byte order mark U+FEFF, which turns up on disk as either 0xfffe (little endian) or 0xfeff (big endian). After that it's two bytes at a time to read in the text (four for the rarer characters that need a surrogate pair). [Empty Unicode files using an encoding with a prefix are always nonzero in size because the prefix remains.]
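A small sketch of the prefix in practice. The helper `utf16_file_bytes` is a hypothetical name for this example; the two constants come from Python's standard `codecs` module.

```python
import codecs

BOM_BE = codecs.BOM_UTF16_BE   # bytes fe ff
BOM_LE = codecs.BOM_UTF16_LE   # bytes ff fe

def utf16_file_bytes(text: str, little_endian: bool = True) -> bytes:
    """Return the bytes of a prefixed UTF-16 file holding `text`."""
    if little_endian:
        return BOM_LE + text.encode("utf-16-le")
    return BOM_BE + text.encode("utf-16-be")

empty = utf16_file_bytes("")       # nothing but the prefix
one_char = utf16_file_bytes("A")   # prefix plus one 16-bit unit

print(len(empty), empty.hex())
print(len(one_char), one_char.hex())
```

The 'empty' file is two bytes long: the prefix and nothing else.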
Single byte encodings are arbitrary. Single byte text files don't have encodings: they're read as having encodings. This might be the single most difficult thing to grasp.
Take a step back from the TextEdit preferences dialog. All the clues are there.
Single byte text files don't normally have encodings - they're opened and saved presuming they have one.
Things get very hairy when you read in files as having one encoding, write them out as having another encoding, read them in again with a third encoding, and so forth. There's no telling after a while what you had in the beginning.
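Here's an illustrative round trip, a sketch only, with the sequence of encodings chosen arbitrarily to show the effect. One accented letter goes in; five characters of noise come out.

```python
original = "\u00e9"                        # e-acute, one character

on_disk = original.encode("utf-8")         # bytes c3 a9
first_read = on_disk.decode("mac_roman")   # misread: two characters now
rewritten = first_read.encode("utf-8")     # saved again: five bytes
second_read = rewritten.decode("latin_1")  # misread again: five characters

print(len(original), len(first_read), len(second_read))
```

Nothing in the final file says which wrong turns were taken along the way, so there's no general way to walk it back.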
NeXTSTEP had a brilliant way of dealing with this. Files were either Unicode or they were not. If they were Unicode - if they had the prefix - they were read in as such. If they didn't have the prefix they were read in as single byte files.
The NeXTSTEP NSString class took care of the above.
+[NSString stringWithContentsOfFile:]; // read a file (class method)
-[NSString writeToFile:atomically:]; // write a file
Files you save with the old system - the one that's been in use for 20 years - can be Unicode. UTF-16. No worries. They're read in and written out correctly. The system figures out what to do all by itself.
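The read-side rule is simple enough to sketch in a few lines. This is an approximation of the behaviour described here, not the actual Foundation implementation; `read_old_style` and its `native` parameter are names invented for the example.

```python
import codecs

def read_old_style(data: bytes, native: str = "mac_roman") -> str:
    """Prefix present: read as UTF-16. No prefix: read as the native single byte encoding."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Python's "utf-16" codec consumes the prefix and honours its byte order.
        return data.decode("utf-16")
    return data.decode(native)

print(read_old_style(b"\xff\xfeh\x00i\x00"))   # prefixed UTF-16
print(read_old_style(b"hi"))                   # plain single byte
```

One check on the first two bytes and the caller never has to name an encoding.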
This worked great for years. Either it was native (Mac OS Roman on OS X) or it had to be UTF-16 - period. But things change. The OS X file system itself uses UTF-8 throughout - so why not? All fine and good and knock on wood.
But can't you hear Mark Pilgrim moaning?
The New APIs
Now Apple want to change the APIs. They want the following instead. Not 'in addition' but 'instead'. This began with 10.4 Tiger on 29 April 2005, continues with 10.5 Leopard, and will continue with Leopard's successor.
The new APIs want you to specify an encoding on the way in and going out.
+[NSString stringWithContentsOfFile:encoding:error:];
-[NSString writeToFile:atomically:encoding:error:];
As most of your system is UTF-8 you'd think it was OK. But it's not OK if you have UTF-16 files lying around. Previously - for the past twenty years - the system was able to figure out encodings by itself. Now it can't anymore.
Or more correctly: someone in Cupertino doesn't want it to try anymore.
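A sketch of what 'specify the encoding yourself' means for a UTF-16 file lying around. The function `read_with_encoding` is hypothetical and takes bytes rather than a path to keep the example self-contained; it mimics the shape of the new calls (a result plus an error) and nothing more.

```python
from typing import Optional, Tuple

def read_with_encoding(data: bytes, encoding: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, f"cannot read as {encoding}: {exc.reason}"

# A prefixed UTF-16 file, read by a caller who assumed UTF-8.
utf16_file = b"\xff\xfe" + "h\u00e9".encode("utf-16-le")
text, error = read_with_encoding(utf16_file, "utf-8")
print(text, "|", error)
```

Guess wrong and you get nothing back but an error, even though the file announces its own encoding in the first two bytes.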
There may of course be reasons for change. There are a number of new encodings introduced with 10.4 Tiger that provide a portent of what's to come. The ACP's Lightman reports the following Cocoa encodings under the new system. The new 32-bit Unicode encodings have the highest values and are at the end of the list.
Codes - Encodings
-----------------
(
{"#" = 00000001; Encoding = "Western (ASCII)"; },
{"#" = 00000002; Encoding = "Western (NextStep)"; },
{"#" = 00000003; Encoding = "Japanese (EUC)"; },
{"#" = 00000004; Encoding = "Unicode (UTF-8)"; },
{"#" = 00000005; Encoding = "Western (ISO Latin 1)"; },
{"#" = 00000006; Encoding = "Symbol (Mac OS)"; },
{"#" = 00000007; Encoding = "Non-lossy ASCII"; },
{"#" = 00000008; Encoding = "Japanese (Windows, DOS)"; },
{"#" = 00000009; Encoding = "Central European (ISO Latin 2)"; },
{"#" = 0000000A; Encoding = "Unicode (UTF-16)"; },
{"#" = 0000000B; Encoding = "Cyrillic (Windows)"; },
{"#" = 0000000C; Encoding = "Western (Windows Latin 1)"; },
{"#" = 0000000D; Encoding = "Greek (Windows)"; },
{"#" = 0000000E; Encoding = "Turkish (Windows Latin 5)"; },
{"#" = 0000000F; Encoding = "Central European (Windows Latin 2)"; },
{"#" = 00000015; Encoding = "Japanese (ISO 2022-JP)"; },
{"#" = 0000001E; Encoding = "Western (Mac OS Roman)"; },
{"#" = 80000001; Encoding = "Japanese (Mac OS)"; },
{"#" = 80000002; Encoding = "Traditional Chinese (Mac OS)"; },
{"#" = 80000003; Encoding = "Korean (Mac OS)"; },
{"#" = 80000004; Encoding = "Arabic (Mac OS)"; },
{"#" = 80000005; Encoding = "Hebrew (Mac OS)"; },
{"#" = 80000006; Encoding = "Greek (Mac OS)"; },
{"#" = 80000007; Encoding = "Cyrillic (Mac OS)"; },
{"#" = 80000009; Encoding = "Devanagari (Mac OS)"; },
{"#" = 8000000A; Encoding = "Gurmukhi (Mac OS)"; },
{"#" = 8000000B; Encoding = "Gujarati (Mac OS)"; },
{"#" = 80000015; Encoding = "Thai (Mac OS)"; },
{"#" = 80000019; Encoding = "Simplified Chinese (Mac OS)"; },
{"#" = 8000001A; Encoding = "Tibetan (Mac OS)"; },
{"#" = 8000001D; Encoding = "Central European (Mac OS)"; },
{"#" = 80000022; Encoding = "Dingbats (Mac OS)"; },
{"#" = 80000023; Encoding = "Turkish (Mac OS)"; },
{"#" = 80000024; Encoding = "Croatian (Mac OS)"; },
{"#" = 80000025; Encoding = "Icelandic (Mac OS)"; },
{"#" = 80000026; Encoding = "Romanian (Mac OS)"; },
{"#" = 80000027; Encoding = "Celtic (Mac OS)"; },
{"#" = 80000028; Encoding = "Gaelic (Mac OS)"; },
{"#" = 80000029; Encoding = "Keyboard Symbols (Mac OS)"; },
{"#" = 8000008C; Encoding = "Farsi (Mac OS)"; },
{"#" = 80000098; Encoding = "Cyrillic (Mac OS Ukrainian)"; },
{"#" = 800000EC; Encoding = "Inuit (Mac OS)"; },
{"#" = 80000203; Encoding = "Western (ISO Latin 3)"; },
{"#" = 80000204; Encoding = "Central European (ISO Latin 4)"; },
{"#" = 80000205; Encoding = "Cyrillic (ISO 8859-5)"; },
{"#" = 80000206; Encoding = "Arabic (ISO 8859-6)"; },
{"#" = 80000207; Encoding = "Greek (ISO 8859-7)"; },
{"#" = 80000208; Encoding = "Hebrew (ISO 8859-8)"; },
{"#" = 80000209; Encoding = "Turkish (ISO Latin 5)"; },
{"#" = 8000020A; Encoding = "Nordic (ISO Latin 6)"; },
{"#" = 8000020B; Encoding = "Thai (ISO 8859-11)"; },
{"#" = 8000020D; Encoding = "Baltic Rim (ISO Latin 7)"; },
{"#" = 8000020E; Encoding = "Celtic (ISO Latin 8)"; },
{"#" = 8000020F; Encoding = "Western (ISO Latin 9)"; },
{"#" = 80000210; Encoding = "Romanian (ISO Latin 10)"; },
{"#" = 80000400; Encoding = "Latin-US (DOS)"; },
{"#" = 80000405; Encoding = "Greek (DOS)"; },
{"#" = 80000406; Encoding = "Baltic Rim (DOS)"; },
{"#" = 80000410; Encoding = "Western (DOS Latin 1)"; },
{"#" = 80000411; Encoding = "Greek (DOS Greek 1)"; },
{"#" = 80000412; Encoding = "Central European (DOS Latin 2)"; },
{"#" = 80000413; Encoding = "Cyrillic (DOS)"; },
{"#" = 80000414; Encoding = "Turkish (DOS)"; },
{"#" = 80000415; Encoding = "Portuguese (DOS)"; },
{"#" = 80000416; Encoding = "Icelandic (DOS)"; },
{"#" = 80000417; Encoding = "Hebrew (DOS)"; },
{"#" = 80000418; Encoding = "Canadian French (DOS)"; },
{"#" = 80000419; Encoding = "Arabic (DOS)"; },
{"#" = 8000041A; Encoding = "Nordic (DOS)"; },
{"#" = 8000041B; Encoding = "Cyrillic (DOS)"; },
{"#" = 8000041C; Encoding = "Greek (DOS Greek 2)"; },
{"#" = 8000041D; Encoding = "Thai (Windows, DOS)"; },
{"#" = 80000421; Encoding = "Simplified Chinese (Windows, DOS)"; },
{"#" = 80000422; Encoding = "Korean (Windows, DOS)"; },
{"#" = 80000423; Encoding = "Traditional Chinese (Windows, DOS)"; },
{"#" = 80000505; Encoding = "Hebrew (Windows)"; },
{"#" = 80000506; Encoding = "Arabic (Windows)"; },
{"#" = 80000507; Encoding = "Baltic Rim (Windows)"; },
{"#" = 80000508; Encoding = "Vietnamese (Windows)"; },
{"#" = 80000628; Encoding = "Japanese (Shift JIS X0213)"; },
{"#" = 80000631; Encoding = "Chinese (GBK)"; },
{"#" = 80000632; Encoding = "Chinese (GB 18030)"; },
{"#" = 80000840; Encoding = "Korean (ISO 2022-KR)"; },
{"#" = 80000930; Encoding = "Simplified Chinese (EUC)"; },
{"#" = 80000931; Encoding = "Traditional Chinese (EUC)"; },
{"#" = 80000940; Encoding = "Korean (EUC)"; },
{"#" = 80000A01; Encoding = "Japanese (Shift JIS)"; },
{"#" = 80000A02; Encoding = "Cyrillic (KOI8-R)"; },
{"#" = 80000A03; Encoding = "Traditional Chinese (Big 5)"; },
{"#" = 80000A04; Encoding = "Western (Mac Mail)"; },
{"#" = 80000A05; Encoding = "Simplified Chinese (HZ GB 2312)"; },
{"#" = 80000A06; Encoding = "Traditional Chinese (Big 5 HKSCS)"; },
{"#" = 80000A08; Encoding = "Ukrainian (KOI8-U)"; },
{"#" = 80000A09; Encoding = "Traditional Chinese (Big 5-E)"; },
{"#" = 80000C02; Encoding = "Western (EBCDIC Latin 1)"; },
{"#" = 8C000100; Encoding = "Unicode (UTF-32)"; },
{"#" = 90000100; Encoding = "Unicode (UTF-16BE)"; },
{"#" = 94000100; Encoding = "Unicode (UTF-16LE)"; },
{"#" = 98000100; Encoding = "Unicode (UTF-32BE)"; },
{"#" = 9C000100; Encoding = "Unicode (UTF-32LE)"; }
)
The above is reflected somewhat in NSString.h.
enum {
    NSASCIIStringEncoding = 1,
    NSNEXTSTEPStringEncoding = 2,
    NSJapaneseEUCStringEncoding = 3,
    NSUTF8StringEncoding = 4,
    NSISOLatin1StringEncoding = 5,
    NSSymbolStringEncoding = 6,
    NSNonLossyASCIIStringEncoding = 7,
    NSShiftJISStringEncoding = 8,
    NSISOLatin2StringEncoding = 9,
    NSUnicodeStringEncoding = 10,
    NSWindowsCP1251StringEncoding = 11,
    NSWindowsCP1252StringEncoding = 12,
    NSWindowsCP1253StringEncoding = 13,
    NSWindowsCP1254StringEncoding = 14,
    NSWindowsCP1250StringEncoding = 15,
    NSISO2022JPStringEncoding = 21,
    NSMacOSRomanStringEncoding = 30,
    NSUTF32StringEncoding = 0x8c000100,
    NSUTF16BigEndianStringEncoding = 0x90000100,
    NSUTF16LittleEndianStringEncoding = 0x94000100,
    NSUTF32BigEndianStringEncoding = 0x98000100,
    NSUTF32LittleEndianStringEncoding = 0x9c000100,
    NSProprietaryStringEncoding = 65536
};
The New UTFs
If all the system's dealing with are single byte text files and double byte text files with an identifying prefix it's pretty easy. Either your files have a prefix of 0xfeff or 0xfffe or you just assume they're in the native (Mac OS Roman) single byte text format. And damn the consequences.
But now there are 32-bit UTF formats - and not all of them have identifying prefixes.
Here's a file saved with the different UTF encodings. The files all contain the ASCII character set from 0x20 to 0x7f inclusive. There's no difference between Mac OS Roman and UTF-8 in this case as there are no non-ASCII characters to escape; UTF-16 has the 0xfffe prefix; UTF-16BE and UTF-16LE do not; UTF-32 is 32-bit but it has the prefix; UTF-32BE and UTF-32LE are also 32-bit but they do not have the prefix. Those with no prefix can't be automatically recognised.
Mac OS Roman
------------
00000000 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f | !"#$%&'()*+,-./|
00000010 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f |0123456789:;<=>?|
00000020 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f |@ABCDEFGHIJKLMNO|
00000030 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f |PQRSTUVWXYZ[\]^_|
00000040 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f |`abcdefghijklmno|
00000050 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f |pqrstuvwxyz{|}~.|
UTF-8
-----
00000000 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f | !"#$%&'()*+,-./|
00000010 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f |0123456789:;<=>?|
00000020 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f |@ABCDEFGHIJKLMNO|
00000030 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f |PQRSTUVWXYZ[\]^_|
00000040 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f |`abcdefghijklmno|
00000050 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f |pqrstuvwxyz{|}~.|
UTF-16
------
00000000 ff fe 20 00 21 00 22 00 23 00 24 00 25 00 26 00 |.. .!.".#.$.%.&.|
00000010 27 00 28 00 29 00 2a 00 2b 00 2c 00 2d 00 2e 00 |'.(.).*.+.,.-...|
00000020 2f 00 30 00 31 00 32 00 33 00 34 00 35 00 36 00 |/.0.1.2.3.4.5.6.|
00000030 37 00 38 00 39 00 3a 00 3b 00 3c 00 3d 00 3e 00 |7.8.9.:.;.<.=.>.|
00000040 3f 00 40 00 41 00 42 00 43 00 44 00 45 00 46 00 |?.@.A.B.C.D.E.F.|
00000050 47 00 48 00 49 00 4a 00 4b 00 4c 00 4d 00 4e 00 |G.H.I.J.K.L.M.N.|
00000060 4f 00 50 00 51 00 52 00 53 00 54 00 55 00 56 00 |O.P.Q.R.S.T.U.V.|
00000070 57 00 58 00 59 00 5a 00 5b 00 5c 00 5d 00 5e 00 |W.X.Y.Z.[.\.].^.|
00000080 5f 00 60 00 61 00 62 00 63 00 64 00 65 00 66 00 |_.`.a.b.c.d.e.f.|
00000090 67 00 68 00 69 00 6a 00 6b 00 6c 00 6d 00 6e 00 |g.h.i.j.k.l.m.n.|
000000a0 6f 00 70 00 71 00 72 00 73 00 74 00 75 00 76 00 |o.p.q.r.s.t.u.v.|
000000b0 77 00 78 00 79 00 7a 00 7b 00 7c 00 7d 00 7e 00 |w.x.y.z.{.|.}.~.|
000000c0 7f 00 |..|
UTF-16BE
--------
00000000 00 20 00 21 00 22 00 23 00 24 00 25 00 26 00 27 |. .!.".#.$.%.&.'|
00000010 00 28 00 29 00 2a 00 2b 00 2c 00 2d 00 2e 00 2f |.(.).*.+.,.-.../|
00000020 00 30 00 31 00 32 00 33 00 34 00 35 00 36 00 37 |.0.1.2.3.4.5.6.7|
00000030 00 38 00 39 00 3a 00 3b 00 3c 00 3d 00 3e 00 3f |.8.9.:.;.<.=.>.?|
00000040 00 40 00 41 00 42 00 43 00 44 00 45 00 46 00 47 |.@.A.B.C.D.E.F.G|
00000050 00 48 00 49 00 4a 00 4b 00 4c 00 4d 00 4e 00 4f |.H.I.J.K.L.M.N.O|
00000060 00 50 00 51 00 52 00 53 00 54 00 55 00 56 00 57 |.P.Q.R.S.T.U.V.W|
00000070 00 58 00 59 00 5a 00 5b 00 5c 00 5d 00 5e 00 5f |.X.Y.Z.[.\.].^._|
00000080 00 60 00 61 00 62 00 63 00 64 00 65 00 66 00 67 |.`.a.b.c.d.e.f.g|
00000090 00 68 00 69 00 6a 00 6b 00 6c 00 6d 00 6e 00 6f |.h.i.j.k.l.m.n.o|
000000a0 00 70 00 71 00 72 00 73 00 74 00 75 00 76 00 77 |.p.q.r.s.t.u.v.w|
000000b0 00 78 00 79 00 7a 00 7b 00 7c 00 7d 00 7e 00 7f |.x.y.z.{.|.}.~..|
UTF-16LE
--------
00000000 20 00 21 00 22 00 23 00 24 00 25 00 26 00 27 00 | .!.".#.$.%.&.'.|
00000010 28 00 29 00 2a 00 2b 00 2c 00 2d 00 2e 00 2f 00 |(.).*.+.,.-.../.|
00000020 30 00 31 00 32 00 33 00 34 00 35 00 36 00 37 00 |0.1.2.3.4.5.6.7.|
00000030 38 00 39 00 3a 00 3b 00 3c 00 3d 00 3e 00 3f 00 |8.9.:.;.<.=.>.?.|
00000040 40 00 41 00 42 00 43 00 44 00 45 00 46 00 47 00 |@.A.B.C.D.E.F.G.|
00000050 48 00 49 00 4a 00 4b 00 4c 00 4d 00 4e 00 4f 00 |H.I.J.K.L.M.N.O.|
00000060 50 00 51 00 52 00 53 00 54 00 55 00 56 00 57 00 |P.Q.R.S.T.U.V.W.|
00000070 58 00 59 00 5a 00 5b 00 5c 00 5d 00 5e 00 5f 00 |X.Y.Z.[.\.].^._.|
00000080 60 00 61 00 62 00 63 00 64 00 65 00 66 00 67 00 |`.a.b.c.d.e.f.g.|
00000090 68 00 69 00 6a 00 6b 00 6c 00 6d 00 6e 00 6f 00 |h.i.j.k.l.m.n.o.|
000000a0 70 00 71 00 72 00 73 00 74 00 75 00 76 00 77 00 |p.q.r.s.t.u.v.w.|
000000b0 78 00 79 00 7a 00 7b 00 7c 00 7d 00 7e 00 7f 00 |x.y.z.{.|.}.~...|
UTF-32
------
00000000 00 00 fe ff 20 00 00 00 21 00 00 00 22 00 00 00 |.... ...!..."...|
00000010 23 00 00 00 24 00 00 00 25 00 00 00 26 00 00 00 |#...$...%...&...|
00000020 27 00 00 00 28 00 00 00 29 00 00 00 2a 00 00 00 |'...(...)...*...|
00000030 2b 00 00 00 2c 00 00 00 2d 00 00 00 2e 00 00 00 |+...,...-.......|
00000040 2f 00 00 00 30 00 00 00 31 00 00 00 32 00 00 00 |/...0...1...2...|
00000050 33 00 00 00 34 00 00 00 35 00 00 00 36 00 00 00 |3...4...5...6...|
00000060 37 00 00 00 38 00 00 00 39 00 00 00 3a 00 00 00 |7...8...9...:...|
00000070 3b 00 00 00 3c 00 00 00 3d 00 00 00 3e 00 00 00 |;...<...=...>...|
00000080 3f 00 00 00 40 00 00 00 41 00 00 00 42 00 00 00 |?...@...A...B...|
00000090 43 00 00 00 44 00 00 00 45 00 00 00 46 00 00 00 |C...D...E...F...|
000000a0 47 00 00 00 48 00 00 00 49 00 00 00 4a 00 00 00 |G...H...I...J...|
000000b0 4b 00 00 00 4c 00 00 00 4d 00 00 00 4e 00 00 00 |K...L...M...N...|
000000c0 4f 00 00 00 50 00 00 00 51 00 00 00 52 00 00 00 |O...P...Q...R...|
000000d0 53 00 00 00 54 00 00 00 55 00 00 00 56 00 00 00 |S...T...U...V...|
000000e0 57 00 00 00 58 00 00 00 59 00 00 00 5a 00 00 00 |W...X...Y...Z...|
000000f0 5b 00 00 00 5c 00 00 00 5d 00 00 00 5e 00 00 00 |[...\...]...^...|
00000100 5f 00 00 00 60 00 00 00 61 00 00 00 62 00 00 00 |_...`...a...b...|
00000110 63 00 00 00 64 00 00 00 65 00 00 00 66 00 00 00 |c...d...e...f...|
00000120 67 00 00 00 68 00 00 00 69 00 00 00 6a 00 00 00 |g...h...i...j...|
00000130 6b 00 00 00 6c 00 00 00 6d 00 00 00 6e 00 00 00 |k...l...m...n...|
00000140 6f 00 00 00 70 00 00 00 71 00 00 00 72 00 00 00 |o...p...q...r...|
00000150 73 00 00 00 74 00 00 00 75 00 00 00 76 00 00 00 |s...t...u...v...|
00000160 77 00 00 00 78 00 00 00 79 00 00 00 7a 00 00 00 |w...x...y...z...|
00000170 7b 00 00 00 7c 00 00 00 7d 00 00 00 7e 00 00 00 |{...|...}...~...|
00000180 7f 00 00 00 |....|
UTF-32BE
--------
00000000 00 00 00 20 00 00 00 21 00 00 00 22 00 00 00 23 |... ...!..."...#|
00000010 00 00 00 24 00 00 00 25 00 00 00 26 00 00 00 27 |...$...%...&...'|
00000020 00 00 00 28 00 00 00 29 00 00 00 2a 00 00 00 2b |...(...)...*...+|
00000030 00 00 00 2c 00 00 00 2d 00 00 00 2e 00 00 00 2f |...,...-......./|
00000040 00 00 00 30 00 00 00 31 00 00 00 32 00 00 00 33 |...0...1...2...3|
00000050 00 00 00 34 00 00 00 35 00 00 00 36 00 00 00 37 |...4...5...6...7|
00000060 00 00 00 38 00 00 00 39 00 00 00 3a 00 00 00 3b |...8...9...:...;|
00000070 00 00 00 3c 00 00 00 3d 00 00 00 3e 00 00 00 3f |...<...=...>...?|
00000080 00 00 00 40 00 00 00 41 00 00 00 42 00 00 00 43 |...@...A...B...C|
00000090 00 00 00 44 00 00 00 45 00 00 00 46 00 00 00 47 |...D...E...F...G|
000000a0 00 00 00 48 00 00 00 49 00 00 00 4a 00 00 00 4b |...H...I...J...K|
000000b0 00 00 00 4c 00 00 00 4d 00 00 00 4e 00 00 00 4f |...L...M...N...O|
000000c0 00 00 00 50 00 00 00 51 00 00 00 52 00 00 00 53 |...P...Q...R...S|
000000d0 00 00 00 54 00 00 00 55 00 00 00 56 00 00 00 57 |...T...U...V...W|
000000e0 00 00 00 58 00 00 00 59 00 00 00 5a 00 00 00 5b |...X...Y...Z...[|
000000f0 00 00 00 5c 00 00 00 5d 00 00 00 5e 00 00 00 5f |...\...]...^..._|
00000100 00 00 00 60 00 00 00 61 00 00 00 62 00 00 00 63 |...`...a...b...c|
00000110 00 00 00 64 00 00 00 65 00 00 00 66 00 00 00 67 |...d...e...f...g|
00000120 00 00 00 68 00 00 00 69 00 00 00 6a 00 00 00 6b |...h...i...j...k|
00000130 00 00 00 6c 00 00 00 6d 00 00 00 6e 00 00 00 6f |...l...m...n...o|
00000140 00 00 00 70 00 00 00 71 00 00 00 72 00 00 00 73 |...p...q...r...s|
00000150 00 00 00 74 00 00 00 75 00 00 00 76 00 00 00 77 |...t...u...v...w|
00000160 00 00 00 78 00 00 00 79 00 00 00 7a 00 00 00 7b |...x...y...z...{|
00000170 00 00 00 7c 00 00 00 7d 00 00 00 7e 00 00 00 7f |...|...}...~....|
UTF-32LE
--------
00000000 20 00 00 00 21 00 00 00 22 00 00 00 23 00 00 00 | ...!..."...#...|
00000010 24 00 00 00 25 00 00 00 26 00 00 00 27 00 00 00 |$...%...&...'...|
00000020 28 00 00 00 29 00 00 00 2a 00 00 00 2b 00 00 00 |(...)...*...+...|
00000030 2c 00 00 00 2d 00 00 00 2e 00 00 00 2f 00 00 00 |,...-......./...|
00000040 30 00 00 00 31 00 00 00 32 00 00 00 33 00 00 00 |0...1...2...3...|
00000050 34 00 00 00 35 00 00 00 36 00 00 00 37 00 00 00 |4...5...6...7...|
00000060 38 00 00 00 39 00 00 00 3a 00 00 00 3b 00 00 00 |8...9...:...;...|
00000070 3c 00 00 00 3d 00 00 00 3e 00 00 00 3f 00 00 00 |<...=...>...?...|
00000080 40 00 00 00 41 00 00 00 42 00 00 00 43 00 00 00 |@...A...B...C...|
00000090 44 00 00 00 45 00 00 00 46 00 00 00 47 00 00 00 |D...E...F...G...|
000000a0 48 00 00 00 49 00 00 00 4a 00 00 00 4b 00 00 00 |H...I...J...K...|
000000b0 4c 00 00 00 4d 00 00 00 4e 00 00 00 4f 00 00 00 |L...M...N...O...|
000000c0 50 00 00 00 51 00 00 00 52 00 00 00 53 00 00 00 |P...Q...R...S...|
000000d0 54 00 00 00 55 00 00 00 56 00 00 00 57 00 00 00 |T...U...V...W...|
000000e0 58 00 00 00 59 00 00 00 5a 00 00 00 5b 00 00 00 |X...Y...Z...[...|
000000f0 5c 00 00 00 5d 00 00 00 5e 00 00 00 5f 00 00 00 |\...]...^..._...|
00000100 60 00 00 00 61 00 00 00 62 00 00 00 63 00 00 00 |`...a...b...c...|
00000110 64 00 00 00 65 00 00 00 66 00 00 00 67 00 00 00 |d...e...f...g...|
00000120 68 00 00 00 69 00 00 00 6a 00 00 00 6b 00 00 00 |h...i...j...k...|
00000130 6c 00 00 00 6d 00 00 00 6e 00 00 00 6f 00 00 00 |l...m...n...o...|
00000140 70 00 00 00 71 00 00 00 72 00 00 00 73 00 00 00 |p...q...r...s...|
00000150 74 00 00 00 75 00 00 00 76 00 00 00 77 00 00 00 |t...u...v...w...|
00000160 78 00 00 00 79 00 00 00 7a 00 00 00 7b 00 00 00 |x...y...z...{...|
00000170 7c 00 00 00 7d 00 00 00 7e 00 00 00 7f 00 00 00 ||...}...~.......|
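A minimal sketch of prefix sniffing across the UTF family, assuming the standard byte order marks from Python's `codecs` module. `sniff_bom` and `BOMS` are names made up for this example. Note the ordering: the UTF-32LE mark starts with the same two bytes as the UTF-16LE mark, so the longer prefixes have to be tested first.

```python
import codecs
from typing import Optional

# Longest prefixes first: the UTF-32LE mark begins with the UTF-16LE mark.
# (A UTF-16LE file whose first character is U+0000 would fool this check.)
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 fe ff
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # ff fe 00 00
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # fe ff
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # ff fe
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding a prefix announces, or None if there is no prefix."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

# The same character range as the dumps: 0x20 to 0x7f inclusive.
sample = "".join(chr(c) for c in range(0x20, 0x80))

print(sniff_bom(codecs.BOM_UTF16_LE + sample.encode("utf-16-le")))
print(sniff_bom(sample.encode("utf-16-le")))   # no prefix, no answer
```

The variants saved without a prefix come back as `None`: the bytes alone don't announce anything.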
Now if the new encodings were conspicuous in their total absence and the old APIs were left in place all would be fine. It would also be fine - or beyond reproach - if the old APIs were left in place anyway and their use left to the discretion of the user. But that would be too smooth. Apple have instead officially deprecated the old APIs. All ISV code sooner or later has to stop using them as a version of OS X can at any time come along and no longer support them.
The API changes mentioned above took place with Tiger on 29 April 2005: that's when things started to spin out of control. Let's review.
Old Way
-------
+[NSString stringWithContentsOfFile:];
-[NSString writeToFile:atomically:];
New Way
-------
+[NSString stringWithContentsOfFile:encoding:error:];
-[NSString writeToFile:atomically:encoding:error:];
The old way you simply use the path to the file. That's it. The old 'writeToFile:' has a second parameter but that's got nothing to do with text encoding. It's about whether you want files saved 'atomically' - the system first saves to a neutral location and then 'moves' the file into place.
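The 'atomically' idea, save somewhere else first and then move into place, can be sketched like this. It's an illustration of the general technique, not of what Cocoa does internally; `write_atomically` is a hypothetical helper, and putting the temporary file in the destination's own directory is an assumption made so the final move stays on one file system.

```python
import os
import tempfile

def write_atomically(path: str, data: bytes) -> None:
    """Write to a temporary file, then move it over the destination in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(temp_path, path)   # the 'move' into place
    except BaseException:
        # Something failed: leave no stray temporary file behind.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

If the write dies halfway the old file is still intact, because the destination is only touched by the final move.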
Now you have to specify an encoding both going in and coming out. You have a chance to specify a pointer to an NSError variable you set up for 'error:' in case something doesn't go according to expectations and you want to issue a diagnostic to the program user.
What's NSError? This is what.
@interface NSError : NSObject <NSCopying, NSCoding> {
    void *_reserved;
    int _code;
    NSString *_domain;
    NSDictionary *_userInfo;
}
You can ask an NSError lots of things.
-(int)[NSError code];
-(NSString *)[NSError domain];
-(NSString *)[NSError localizedDescription];
-(NSString *)[NSError localizedFailureReason];
-(NSArray *)[NSError localizedRecoveryOptions];
-(NSString *)[NSError localizedRecoverySuggestion];
-(id)[NSError recoveryAttempter];
-(NSDictionary *)[NSError userInfo];
The 'localized' methods access values in the _userInfo dictionary if available; otherwise things are constructed on the fly given _code and _domain.
Method | _userInfo Key
localizedDescription | NSLocalizedDescriptionKey
localizedFailureReason | NSLocalizedFailureReasonErrorKey
localizedRecoveryOptions | NSLocalizedRecoveryOptionsErrorKey
localizedRecoverySuggestion | NSLocalizedRecoverySuggestionErrorKey
recoveryAttempter | NSRecoveryAttempterErrorKey
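The lookup rule (dictionary first, otherwise build something from code and domain) might be sketched as follows. `ErrorSketch` is a hypothetical stand-in, not the real NSError: the dictionary key is simply the constant's name used as a plain string, and the fallback wording is invented for the example.

```python
class ErrorSketch:
    """Hypothetical stand-in for the lookup rule, not the real NSError."""

    def __init__(self, domain, code, user_info=None):
        self.domain = domain
        self.code = code
        self.user_info = dict(user_info or {})

    def localized_description(self):
        # Use the supplied text if the dictionary has one...
        supplied = self.user_info.get("NSLocalizedDescriptionKey")
        if supplied is not None:
            return supplied
        # ...otherwise construct something on the fly (wording is made up).
        return f"Operation could not be completed ({self.domain} error {self.code})"

bare = ErrorSketch("ExampleDomain", 261)
rich = ErrorSketch("ExampleDomain", 261,
                   {"NSLocalizedDescriptionKey": "Bad encoding"})
print(bare.localized_description())
print(rich.localized_description())
```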
So there's a lot of stuff there if you want to get it on with your end user. But unfortunately there are many situations where it simply isn't going to work. When you're working through the Cocoa document controller you're being called by the system and asked to read in and write out files - and that good old document controller is really interested to know how things turn out - so much so in fact it's going to issue its own error dialog if you tell it something went wrong.
All methods that interact with NSDocumentController have to indicate success or failure in some way. Either they tell the document controller if things turned out OK or they're being asked to return a pointer to data the controller is going to write to disk - in which case a zero pointer tells the controller something didn't work out.
// Document controller wants app to read
-(BOOL)[NSDocument loadDataRepresentation:ofType:]; (deprecated)
-(BOOL)[NSDocument readFromData:ofType:error:];
// Document controller wants data to write
-(NSData *)[NSDocument dataRepresentationOfType:]; (deprecated)
-(NSData *)[NSDocument dataOfType:error:];
So the system borks and the user gets two message boxes one after the other. Suck it up.
Thankfully you can specify '0' for the argument to 'error:' so the system ignores your 'pointer' and won't try to save anything to it. So essentially you have a parameter you can't always realistically use but can still fortunately work around.
Now the encoding. In the old days you let the system decide what to do. If your file was Mac OS Roman it was read in as such. Actually that should read 'if your file could be read as Mac OS Roman'.
Now it's perhaps interesting to note that UTF-16 and ordinary single byte text files are mutually exclusive. There are no 100% guarantees in theory but as good as in daily use. UTF-16 files start with the weird two byte prefix and normally contain a lot of zero bytes. Mac OS Roman and all other single byte encodings can't tolerate the zero bytes.
And same way back out again. You don't specify the encoding because you don't need to.
The old way you get two types of files but you never need to care. The system took care of it all for you. And it never used Unicode (double byte) to store a file if it didn't need to - even if you'd read it in that way. If your file started as a double byte UTF-16 but you removed the non-native part the system would save as single byte. Automatically.
- Ordinary single byte 'native' (Mac OS Roman) files.
- Unicode double byte files capable of storing anything.
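The save-side behaviour described above could be approximated like so. This is a sketch of the described rule rather than the actual Foundation code; `encode_old_style` is a made-up name, and big-endian order for the UTF-16 fallback is an arbitrary choice for the example.

```python
import codecs

def encode_old_style(text: str, native: str = "mac_roman") -> bytes:
    """Single byte native encoding if the text fits; prefixed UTF-16 only if it must be."""
    try:
        return text.encode(native)
    except UnicodeEncodeError:
        # Assumption: big-endian order, prefix included so the file can be recognised later.
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(encode_old_style("caf\u00e9").hex())   # fits in Mac OS Roman: four bytes
print(encode_old_style("\u65b0").hex())      # doesn't fit: prefix plus two bytes
```

Remove the non-native characters from the text and the next save drops back to single byte with no one having to ask.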
The system doesn't care. The applications don't care. Look at NeXTSTEP code going all the way back. As found today in Etoile, GNUstep, et al: they have these APIs - they literally don't care. It just works. Or: it used to work but doesn't want to anymore.
Cut to 2004. A lot of things happen as we all know today and guess what? This is another of them.
Now the old APIs still exist - but they're deprecated. This means they can officially disappear at any time without further notice. Notice has already been given - get it?
That's the situation.
A Way Around Everything?
There's another API for reading in files of course.
+[NSString stringWithContentsOfFile:usedEncoding:error:];
You don't specify an encoding: you let the system figure it out like before but you can see what encoding was used. For 'usedEncoding:' you provide a pointer to an 'NSStringEncoding' variable (unsigned integer). You read the unsigned integer back afterwards to see what encoding the system used to read the file.
But what the bloody good is that? None at all.
- When you read in a file your user - hold on for dear life - can edit the file and change its contents. And the old encoding used to read in the file may not work to save it.
Oh whoa that's so heavy you might need to take a moment. Go ahead.
OK. Next point.
- UTF-8 is the only way you can possibly save unless you go back to the old scheme again with the helpful crutches the new team put on it. Those crutches and their booster rockets. [Or so they think.]
- If you save files as UTF-8 you've saved as single byte and the system can't see what encoding you used. There is a suggested prefix for UTF-8 files but not only does it interfere with the 'text only' aspect of text but also it's not even a de facto standard much less an official one. So reading the file again won't reveal a thing. And if you saved as UTF-8 because you needed to your file will look like gibberish when it gets read again into your program.
- If you let the system decide what encoding to use and it chooses Unicode that's no guarantee you need Unicode to save. The user might have removed all the 'non-ASCII' stuff that's been in the file. In the old days the system would figure this stuff out by itself; now with the 'new' APIs you can't do that anymore - you'll get double byte Unicode even if you don't need it.
- This means that even if you once opened a file as Unicode and since then removed the Unicode characters your file will still be saved as Unicode forever after - as the system now being promoted can't have a clue what's going on.
- NSString has always had methods to heuristically determine optimal encodings for storing files - methods such as fastestEncoding and smallestEncoding - and of the latter the documentation says 'this method may take some time to execute' - but this was previously done automatically by the system on file writes and the new encodings don't make this any more difficult. So why the deprecation?
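For a feel of what a 'smallest encoding' heuristic involves, here's a brute-force sketch: try each candidate and keep whichever yields the fewest bytes. It is not how NSString's smallestEncoding is implemented; the function name and the candidate list are assumptions for illustration.

```python
def smallest_encoding(text, candidates=("ascii", "mac_roman", "utf-8", "utf-16")):
    """Return (encoding name, byte count) for the most compact candidate, or None."""
    best = None
    for name in candidates:
        try:
            size = len(text.encode(name))
        except UnicodeEncodeError:
            continue                      # this encoding can't hold the text
        if best is None or size < best[1]:
            best = (name, size)           # ties keep the earlier candidate
    return best

print(smallest_encoding("plain"))
print(smallest_encoding("caf\u00e9"))
print(smallest_encoding("\u65b0\u95fb\u4e2d\u5fc3"))
```

Trying every candidate means encoding the whole text several times, which is presumably the sort of cost the 'may take some time' warning is about.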
Out with the Old?
Now the old methods haven't been removed. Yet. But they might be at any time.
They don't really conflict with anything. They don't really need to be removed. The system can welcome new APIs without removing the old ones. The issue is someone thinks the new APIs supersede the old APIs - which as has been demonstrated they categorically do not.
Playing guessing games with file systems about how to read and write files is going to wear out users and programmers alike. 10.5 Leopard's begun to use 'text encoding' extended attributes but that's hardly a cross platform solution. If text is to remain the 'lingua franca' of the Internet as Doug McIlroy envisioned then portability must be maintained - or when 8 bits or even 16 bits no longer can manage it new easily identifiable encodings must be used. And systems should continue to recognise these text encodings intuitively and by themselves without endless user interaction.
If you think this whole thing stinks then write to Apple. But don't expect things to change. The only good your writing does is give you a chance to vent a bit of anger and frustration. Things are more complex today but there's no reason things should be more complicated to use.
Postscript: usedEncoding
It turns out the 'new' method with 'usedEncoding:' is even more worthless than expected.
Target is Mac OS Roman.
$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.
Target is UTF-8.
$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.
Target is UTF-16.
$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: Used encoding: 0000000A.
Target is UTF-16BE.
$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.
Target is UTF-16LE.
$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.
Target is UTF-32.
$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.
Target is UTF-32BE.
$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.
Target is UTF-32LE.
$ ./usedEncoding usedEncodingtest.rtx
usedEncodingtest.rtx: file found.
usedEncodingtest.rtx: The specified text encoding is not applicable.
The file may have been saved using a different text encoding, or it may not be a text file.
The only one it picks up is UTF-16. It can't even pick up UTF-32 despite there being a signature prefix.
Talk about lame.
http://www.omnigroup.com/mailman/archive/macosx-dev/2005-June/056865.html
You've no doubt discovered that encoding-sniffing is new with Tiger.
Before this, people muddled through as best they could. You can load an NSData with the contents of the file and sniff the first four bytes of the file for Unicode byte-order markers (a file can still be UTF without a BOM). You can sniff the whole contents for bit 7 being set, and if it never is, pick ASCII.
After that, you guess based on your market. Mac Roman encoding is often the safest 8-bit encoding, though UTF-8 is taking over. ISO Latin-1, if you're dealing with Windows-origin text that isn't Unicode.
When you're ready to throw the dice, use -[NSString initWithData:encoding:] and watch the fun.
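The recipe in that post, minus the byte-order-mark check, might look roughly like this. It's a sketch under assumptions: `guess_encoding` is a hypothetical name, the order of the 'market' guesses is arbitrary, and any prefix sniffing is presumed to have happened before this is called.

```python
from typing import Optional

def guess_encoding(data: bytes, market_guesses=("utf-8", "mac_roman")) -> Optional[str]:
    """Pick ASCII if bit 7 is never set; otherwise take the first guess that decodes."""
    # Bit 7 never set means every byte is below 0x80.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    for name in market_guesses:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None   # nothing fit: time to ask the user

print(guess_encoding(b"hello"))
print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding(b"caf\x8e"))
```

Putting UTF-8 ahead of the 8-bit guesses matters: a single byte codec that defines all 256 values accepts anything, so it can only ever be the last resort.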
http://www.omnigroup.com/mailman/archive/macosx-dev/2005-June/056893.html
In my tests, the only encoding this method could sniff was UTF-16. Not UTF-8, or ASCII, or Windows Cyrillic.
http://lists.apple.com/archives/cocoa-dev/2006/Apr/msg00747.html
I'm trying to initialize a string using NSString - initWithContentsOfFile:usedEncoding:error:. I'm getting an error (#261) that says the file can't be opened using the specified text encoding. This happens whether I specify an encoding or not. According to the docs, this method is supposed to try to determine the encoding used and return it by reference.
On the other hand, if I use -initWithContentsOfFile:encoding:error: and specify NSUTF8StringEncoding, it works fine. If I could count on the files I'm opening always being in UTF8 encoding, that would be great. But I'm not sure that's realistic given that TextEdit's default encoding for plain text is Mac OS Roman. I suppose the other solution is to iterate over every possible encoding until I don't get an error.
So am I using the first method the wrong way, or is not working?
http://lists.apple.com/archives/cocoa-dev/2006/Apr/msg00766.html
It does try, but alas it is not particularly good trying. IIRC, it would correctly recognize UTF-16 and -32 if they have their prefixes, and that's about all (perhaps plain ASCII, too :))
If you need to determine the encoding from the data, it's best to DIY at any level you need (from just trying which encoding can interpret the data through a frequency analysis of characters to a full-blown analysis which may include spellchecking the text in the target language -- if known -- and selecting the encoding which yields the least number of misspelled words).
Whatever you do though, don't forget to allow the user to override the encoding, for just *any* heuristic is bound to fail sometimes.
All of which is more or less true. But everyone's missing an important distinction.
The old deprecated method always opened a target file; the new one doesn't open shit.
Write to Apple. Ask for Avie.