Rixstep

Unicodification

Work on the ACP Web Services has led to some startling discoveries. Follow along for the ride.

  1. Select the following line.

    http://google.com/search?q=华文仿宋

  2. Go to your Services menu and pick 'Open URL in Safari'.

  3. Watch what Google puts in the search box.

  4. Now compare that with Safari's URL.

    http://www.google.com/search?q=%E5%8D%8E%E6%96%87%E4%BB%BF%E5%AE%8B

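    What you see in the URL are escaped UTF-8 octets: each character is encoded as UTF-8, and each octet is written as '%' followed by two hexadecimal digits. Coding in UTF-8 can look like rocket science, but it's the way you get Chinese to Google and other places.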

  5. Want to search for something in Thailand?

    http://google.co.th/search?q=ประเทศไทย

    You get the idea.

  6. But now try the IMDb:

    http://imdb.com/find?ประเทศไทย

    How did that work out?

  7. IMDb is owned by Amazon. Now see what Amazon does.

    http://amazon.com/exec/obidos/search-handle-url/field-keywords=ประเทศไทย

  8. We'll give the IMDb an easier time of it now:

    http://imdb.com/find?Café

    And notice what the IMDb changed the URL to:

    http://imdb.com/find?Caf%E9

    (And you can see the original correct UTF-8 code before the IMDb changes it if you're fast.)

    What's %E9?

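    That's the hexadecimal value of é in ISO 8859-1 (Latin-1), which is also its Unicode code point, U+00E9 - but it's not UTF-8. In UTF-8, é is the two octets %C3%A9.

    Latin-1 has no Chinese characters, so the IMDb can't handle Chinese.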

  9. But hey, that's still better than the parent company Amazon: Jeff can't even take the é:

    http://amazon.com/exec/obidos/search-handle-url/field-keywords=Café

  10. And notice the URL:

    http://amazon.com/exec/obidos/search-handle-url/field-keywords%3DCaf%C3%A9

    The '%C3%A9' is authentic UTF-8: it's just that Amazon can't understand it.

    Amazon looks at the octets the same way the IMDb does, but the IMDb 'catches' them and Amazon doesn't: Amazon just gets confused.

    Try Amazon UK for another original approach:

    http://amazon.co.uk/exec/obidos/search-handle-url/field-keywords=Café
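
The difference between the escapes seen in the steps above can be reproduced with a short Python sketch. It assumes the IMDb's '%E9' comes from ISO 8859-1 (Latin-1), as the single octet suggests; the helper name `escape` is made up for this illustration and is not anything the sites themselves use.

```python
from urllib.parse import quote

def escape(text, charset="utf-8"):
    """Percent-escape every octet of `text` after encoding it in `charset`."""
    # safe="" means no reserved character is left unescaped.
    return quote(text, safe="", encoding=charset)

# UTF-8: several octets per Chinese character, two octets for é.
utf8_chinese = escape("华文仿宋")
utf8_cafe = escape("Café")

# Latin-1 (assumed to be what the IMDb rewrites to): one octet for é.
latin1_cafe = escape("Café", "latin-1")

# Latin-1 has no Chinese characters, so the same call fails outright.
try:
    escape("华文仿宋", "latin-1")
    latin1_handles_chinese = True
except UnicodeEncodeError:
    latin1_handles_chinese = False

print(utf8_chinese)            # %E5%8D%8E%E6%96%87%E4%BB%BF%E5%AE%8B
print(utf8_cafe)               # Caf%C3%A9
print(latin1_cafe)             # Caf%E9
print(latin1_handles_chinese)  # False
```

The same text gives different octets depending on the character set, which is why a site that guesses the wrong one ends up confused.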


UTF-8 makes Unicode web communication possible. Unicode is a standard maintained by the Unicode Consortium, whose members include a lot of important companies. Apple are members. Apple support for Unicode is almost universal. Everything in OS X is in Unicode. And Safari automatically transforms Unicode strings into escaped UTF-8 for the web.
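
As a rough sketch of that kind of transformation (not Safari's actual code), the following Python function takes a URL containing raw Unicode and escapes the path and query as UTF-8 octets. The name `escape_url` and the choice of 'safe' characters are assumptions made for this example; non-ASCII host names would need separate handling and are not covered.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def escape_url(url):
    """Return `url` with non-ASCII path and query characters UTF-8 escaped."""
    parts = urlsplit(url)
    # Keep '/' and existing '%' escapes in the path untouched.
    path = quote(parts.path, safe="/%")
    # Keep the usual query delimiters and existing '%' escapes untouched.
    query = quote(parts.query, safe="=&+%")
    # Scheme, host and fragment are passed through unchanged in this sketch.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(escape_url("http://google.com/search?q=华文仿宋"))
# http://google.com/search?q=%E5%8D%8E%E6%96%87%E4%BB%BF%E5%AE%8B
```

Because '%' is treated as safe, running the function on a URL that is already escaped leaves it as it is.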

Support for Unicode on the web is not as good as it could be. Many major sites have 'wing-it' CGI modules which figure out one way or another to deal with 'special characters', and it all works fine internally - but not when someone surfs in from far away.
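
A server-side module doesn't have to wing it. The Python sketch below is one possible approach, not a description of what any of the sites above do: unescape the octets, try strict UTF-8 first, and fall back to Latin-1 only if that fails. The function name `decode_query_value` is hypothetical, and the fallback is a heuristic, since Latin-1 will accept any octets at all.

```python
from urllib.parse import unquote_to_bytes

def decode_query_value(raw):
    """Decode a percent-escaped query value, preferring UTF-8 over Latin-1."""
    octets = unquote_to_bytes(raw)
    try:
        # Strict UTF-8 rejects stray octets such as a lone 0xE9.
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed legacy fallback: one octet per character.
        return octets.decode("latin-1")

print(decode_query_value("Caf%C3%A9"))  # Café
print(decode_query_value("Caf%E9"))     # Café

# Reading UTF-8 octets as Latin-1 without checking gives garbage instead:
print(unquote_to_bytes("Caf%C3%A9").decode("latin-1"))  # CafÃ©
```

The last line shows what happens when UTF-8 octets are read one octet per character: the two octets of é come out as two unrelated characters.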

It's fun to search for things in Chinese - even if you haven't a clue what you're looking at. The web and computer science are both growing up.

We can only hope the webmeisters out there take a look-see at the UTF-8 and Unicode standards and start to get with it.

Copyright © Rixstep. All rights reserved.