Unicodification
Work on the ACP Web Services has led to some startling discoveries. Follow along for the ride.
- Select the following line.
  http://google.com/search?q=华文仿宋
- Go to your Services menu and pick 'Open URL in Safari'.
- Watch what Google puts in the search box.
- Now compare that with Safari's URL.
  http://www.google.com/search?q=%E5%8D%8E%E6%96%87%E4%BB%BF%E5%AE%8B
  What you see in the URL are escaped UTF-8 octets: each Chinese character becomes three bytes, and each byte is written as '%' plus two hexadecimal digits. Encoding in UTF-8 is a bit of a rocket science, but it's also the way you get Chinese to Google and other places.
- Want to search for something in Thailand?
  http://google.co.th/search?q=ประเทศไทย
You get the idea.
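The escaping seen in Safari's address bar can be reproduced in a few lines. This is only a minimal sketch using Python's standard `urllib.parse` module. It illustrates the encoding itself and makes no claim about how Safari does it internally.

```python
from urllib.parse import quote, unquote

query = "华文仿宋"

# Each of these four characters takes three octets in UTF-8.
octets = query.encode("utf-8")
print(len(octets))  # 12

# quote() percent-escapes every non-ASCII octet; it uses UTF-8 by default.
escaped = quote(query)
print(escaped)  # %E5%8D%8E%E6%96%87%E4%BB%BF%E5%AE%8B

# Unescaping with UTF-8 gives back the original string.
print(unquote(escaped) == query)  # True
```

The escaped string matches the query part of the Google URL above, octet for octet.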
- But now try the IMDb:
  http://imdb.com/find?ประเทศไทย
  How did that work out?
- IMDb is owned by Amazon. Now see what Amazon does.
  http://amazon.com/exec/obidos/search-handle-url/field-keywords=ประเทศไทย
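- We'll give the IMDb an easier time of it now:
  http://imdb.com/find?Café
  And notice what the IMDb changed the URL to:
  http://imdb.com/find?Caf%E9
  (And you can see the original correct UTF-8 code before the IMDb changes it if you're fast.)
  What's %E9? That's the hexadecimal value of é as a single byte in ISO 8859-1 (Latin-1), which is also its Unicode code point, U+00E9 - but it's not UTF-8. In UTF-8, é is the two octets C3 A9. A single-byte encoding has no room for Chinese, which would explain why the IMDb can't handle Chinese.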
- But hey, that's still better than the parent company Amazon: Jeff can't even take the é:
  http://amazon.com/exec/obidos/search-handle-url/field-keywords=Café
- And notice the URL:
  http://amazon.com/exec/obidos/search-handle-url/field-keywords%3DCaf%C3%A9
  The '%C3%A9' is authentic UTF-8: it's just that Amazon can't understand it.
  Amazon looks at the octets the same way the IMDb does, but the IMDb 'catches' them and Amazon doesn't: Amazon just gets confused.
  Try Amazon UK for another original approach:
  http://amazon.co.uk/exec/obidos/search-handle-url/field-keywords=Café
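The difference between 'Caf%E9' and 'Caf%C3%A9' comes down to which encoding turns the é into bytes. The sketch below shows both forms, and what happens when UTF-8 octets are read as if they were Latin-1. That these sites work in Latin-1 is an assumption inferred from the URLs above, not something the sites document.

```python
from urllib.parse import quote, unquote

word = "Café"

# The same word escaped under two different encodings.
utf8_form = quote(word, encoding="utf-8")
latin1_form = quote(word, encoding="latin-1")
print(utf8_form)    # Caf%C3%A9
print(latin1_form)  # Caf%E9

# Reading UTF-8 octets as Latin-1 turns one character into two wrong ones.
garbled = unquote(utf8_form, encoding="latin-1")
print(garbled)  # CafÃ©

# Latin-1 simply has no slots for Chinese characters.
try:
    "华文仿宋".encode("latin-1")
    chinese_fits = True
except UnicodeEncodeError:
    chinese_fits = False
print(chinese_fits)  # False
```

A site that decodes every request as Latin-1 would see 'CafÃ©' where the visitor typed 'Café', which is one plausible way for a search to 'get confused'.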
UTF-8 makes Unicode web communication possible. Unicode is a standard maintained by the Unicode Consortium, whose members include a lot of important companies. Apple are members. Apple support for Unicode is almost universal. Everything in OS X is in Unicode. And Safari automatically transforms Unicode strings into UTF-8 format for the web.
Out on the web, support for Unicode is not as good as it could be. Many major sites have 'wing-it' CGI modules which figure out one way or another to deal with 'special characters', and it all works fine internally - but not when someone surfs in from far away.
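For comparison, here is a minimal sketch of how a server-side script might be more forgiving with incoming queries. The function name `decode_query` is hypothetical, and the fallback to Latin-1 is an assumption chosen for illustration, not a description of what any of the sites above actually do.

```python
from urllib.parse import quote, unquote_to_bytes

def decode_query(raw: str) -> str:
    """Hypothetical helper: unescape a query, preferring UTF-8."""
    data = unquote_to_bytes(raw)
    try:
        # Strict UTF-8 first: malformed input raises instead of guessing.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to Latin-1, which accepts any byte sequence.
        return data.decode("latin-1")

print(decode_query("Caf%C3%A9"))  # Café
print(decode_query("Caf%E9"))     # Café
print(decode_query(quote("ประเทศไทย")) == "ประเทศไทย")  # True
```

Trying UTF-8 first is reasonably safe because Latin-1 text containing accented characters is rarely valid UTF-8 by accident, though the fallback remains a guess.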
It's fun to search for things in Chinese - even if you haven't a clue what you're looking at. The web - and computer science both - are growing up.
We can only hope the webmeisters out there take a look-see at the UTF-8 and Unicode standards and start to get with it.