This was originally published in The Perl Journal. It talks about the deficiencies of gettext (whew, I’m glad someone else thinks it’s pretty useless) and the complexities of many languages. Fascinating stuff. Click the headline to read it.
I just noticed this:
http://www.mobiata.com/blog/2012/02/08/twine-string-management-ios-mac-os-x
which speaks some sense. I especially like the list of reasons why the standard l10n process is broken… these are things I’ve already written about in the past.
If you’re developing a mobile app and looking at breaking into new markets, this is a good place to start.
It’s time for another cardinal rule, and this mostly applies to building web apps.
Never, ever rely solely on information from a web request to determine locale.
You might be confused by this. I mean, how else are you supposed to determine which language to render pages in? Well, the key here is the word “solely”. What does this mean?
In most web applications, the framework (no matter how big or small it is) will almost certainly ‘set up’ some kind of context for the current web request, and part of that context will be some kind of ‘request’ object. Most web applications will then derive a language code from the request (perhaps by inspecting the language preferences sent by the browser, or something in the URL, or some information in a cookie; how it does it is not important). Your controllers or components or pages or template loaders will use that information to do some controller things, or compose themselves, or load templates. This is a common pattern, so you might think you’re doing OK if your system works this way.
There are two major problems with this approach.
- What if there is no web request? In other words, what if you want to render pages off-line (say, to send as email to users or to create reports)? Which locale setting do you use, and where do you derive it from?
- What about the case where user X, who is surfing your site in Dutch, uses your site to send a message to user Y, who signed up for your site in Norwegian? What language are you going to use to render the email boilerplate? (You know, all the stuff that says “You have a new message from X on foobar.org!” and “Click here to unsubscribe”, etc.)
I have seen these problems surface in many different projects in many different places, so bear them in mind when starting your new project. I’ll go into detail about how to avoid these troubles in future posts. For now, just be sure to provide sensible defaults for locale information, and don’t make assumptions about the current request’s locale.
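As a rough illustration of "sensible defaults", here is a minimal Python sketch of a locale lookup that treats the request as only one input among several. The function name, the parameter names and the `DEFAULT_LOCALE` constant are all made up for this example; it is one possible approach, not a recipe from any particular framework.

```python
from typing import Optional

DEFAULT_LOCALE = "en"  # hypothetical site-wide fallback


def resolve_locale(recipient_locale: Optional[str] = None,
                   request_locale: Optional[str] = None,
                   default: str = DEFAULT_LOCALE) -> str:
    """Pick a locale for rendering; the request is never the only source."""
    # 1. The stored preference of whoever will read the output wins.
    if recipient_locale:
        return recipient_locale
    # 2. Otherwise fall back to the current request, if there is one.
    if request_locale:
        return request_locale
    # 3. Off-line jobs with no known reader and no request get the default.
    return default


# User X browses in Dutch and sends a message to user Y, who signed up
# in Norwegian: the email boilerplate follows Y's stored locale.
email_locale = resolve_locale(recipient_locale="no", request_locale="nl")

# A nightly report job has no request at all.
report_locale = resolve_locale()
```

The design choice here is simply that the reader of the output decides the language, and the request is consulted only when nothing better is known.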
The old one was pretty but the github gists came through all wrong; they need to be monospaced and the old theme was messing that up. The new one is a bit more readable, anyway.
You’d be surprised at how often this one comes up. Here’s the Wikipedia entry on collation, which is the term used to describe how things are ordered. We tend to use alphabetic ordering a lot; think of choosing your country or state from a pop-up menu on a web form, for example. It seems simple enough, until you try to translate that pop-up menu.
The first tricky thing that you might run into when trying to translate a list of countries in a pop-up menu is that you probably want the value of the menu in your web form to be the same, regardless of which language the user is entering their info in. But you almost certainly want the labels that are displayed to the user to be in their native language, so you have to sort the pop-up menu by its labels, not its values. OK, that doesn’t sound too hard. But then you might find that alphabetic sort works slightly differently in different languages. For example, if you just use a plain alphabetic sort, characters with accents will probably be sorted incorrectly. To get around this problem, you generally have to base your sorting on the locale of the user; your collation rule can be derived from that, and you can apply your collation rules to your sort to get things in the right order. Here is an example of how to do it in Java.
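To make the "sort by labels, not values" point concrete, here is a small Python sketch. It uses accent folding via Unicode decomposition, which is only a crude stand-in for real locale-aware collation (a proper collator, such as the one in Java's `java.text` package or ICU, knows per-language rules that this does not). The `COUNTRY_LABELS_FR` table and the function names are invented for the example.

```python
import unicodedata

# Menu values stay the same in every language; only the labels change.
# "\u00c9gypte" is "Égypte", written as an escape so the first letter is
# unambiguously the single precomposed character.
COUNTRY_LABELS_FR = {
    "ZM": "Zambie",
    "ES": "Espagne",
    "EG": "\u00c9gypte",
    "DE": "Allemagne",
}


def fold_accents(text: str) -> str:
    """Rough sort key: decompose, drop combining marks, ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def menu_options(labels: dict) -> list:
    """Return (value, label) pairs ordered by the label the user sees."""
    return sorted(labels.items(), key=lambda item: fold_accents(item[1]))


# A plain sort compares code points, so the accented capital lands last.
naive = sorted(COUNTRY_LABELS_FR.values())
options = menu_options(COUNTRY_LABELS_FR)
```

With the folded key the French label for Egypt sorts between Allemagne and Espagne, while the plain sort pushes it after Zambie.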
But what if your terms don’t even have an alphabet? Or have a non-latin alphabet? For example, here are a few of the states in Egypt:
- Ad Daqahliyah
- Al Buhayrah
- Al Fayyum
- Al Iskandariya
- Ash Sharqiyah
Now, the weird thing about this is that they’re in an order that looks OK to us English speakers. But unfortunately it’s not that simple, for a number of reasons. To begin with, the “Ad”, “Al” and “Ash” are all just the word for “the”, which is always written in Arabic with two letters, Alif-Lam (like our “A” and “L”), even if they’re pronounced or transcribed as “ad”, “as”, “ash”, or “an”. And then, the Arabic alphabetic ordering is quite different (although not completely) from ours. So if we ignore the word “the”, and reorder the names in their Arabic ordering, we would get
- Al Iskandariya
- Al Buhayrah
- Ad Daqahliyah
- Ash Sharqiyah
- Al Fayyum
Weird, no? And Farsi and Urdu, which ostensibly use the Arabic alphabet, actually have some other letters to worry about, too.
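For the curious, the reordering above can be roughly reproduced in Python. The Arabic spellings in this sketch are my own additions (they are not part of the original list), and sorting by raw code point only approximates Arabic alphabetical order; real collation has more rules than this. The names `GOVERNORATES` and `arabic_key` are invented for the example.

```python
ARTICLE = "ال"  # Alif-Lam, the definite article

# Transliteration -> Arabic spelling (the spellings are assumptions
# added for this example).
GOVERNORATES = {
    "Ad Daqahliyah": "الدقهلية",
    "Al Buhayrah": "البحيرة",
    "Al Fayyum": "الفيوم",
    "Al Iskandariya": "الإسكندرية",
    "Ash Sharqiyah": "الشرقية",
}


def arabic_key(name: str) -> str:
    """Drop a leading Alif-Lam so names sort by their first real letter."""
    if name.startswith(ARTICLE):
        return name[len(ARTICLE):]
    return name


# Sort the transliterations by the Arabic spelling, ignoring "the".
ordered = sorted(GOVERNORATES, key=lambda k: arabic_key(GOVERNORATES[k]))
```

Note that the article is stripped from the Arabic spelling, where it is always the same two letters, rather than from the transliteration, where it shows up as "Ad", "Al" or "Ash".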
It gets really weird when you deal with non-alphabetic systems like the Chinese writing system. In Chinese, one traditional ordering sorts characters by radical and then by stroke count, so words are ordered almost by their calligraphic complexity.
And, just to prove to you yet again that this is a more complicated problem than you probably first imagined, even in English, I ask you to put these in alphabetic order:
- The Empire Strikes Back
- An Honest Man
- A Midsummer Night's Dream
- 28 Days Later
- Doc Hollywood
- THX1138
- Dr. Acula
Good luck.
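Here is one possible policy, sketched in Python, and it is only a policy: it ignores a leading "The", "An" or "A" and ignores case, and it makes no attempt to decide whether "Dr." should sort as "Doctor" or whether "28" should sort as "Twenty-eight". The names `LEADING_ARTICLE` and `title_key` are invented for the example.

```python
import re

# Leading English articles to skip when sorting (an assumed policy).
LEADING_ARTICLE = re.compile(r"^(the|an|a)\s+", re.IGNORECASE)

TITLES = [
    "The Empire Strikes Back",
    "An Honest Man",
    "A Midsummer Night's Dream",
    "28 Days Later",
    "Doc Hollywood",
    "THX1138",
    "Dr. Acula",
]


def title_key(title: str) -> str:
    """Sort key: strip one leading article, then compare case-insensitively."""
    return LEADING_ARTICLE.sub("", title).casefold()


ordered_titles = sorted(TITLES, key=title_key)
```

Under this policy digits sort before letters, "Doc Hollywood" comes before "Dr. Acula", and "THX1138" ends up last. A librarian, a video shop and a music player would each quite reasonably disagree.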
Yes, in order to have an effective i18n strategy, you will need to have at least a basic grasp of syntax and morphology. You probably already know what syntax is; the word itself comes from the Greek meaning “ordering together”. Morphology, on the other hand, might not be quite so familiar. A morpheme is a unit of meaning; morphology refers to the study of morphemes. In simple terms, it refers to how the words of a language are built.
There may be some surprises waiting for you when dealing with matters of human languages. Unfortunately, since we all grow up speaking one language or other, we tend to frame our appreciation of these matters within the context of our mother tongue. This blinds us to the dazzling surprises and twists that have evolved in the structures of human language. It’s fascinating, but it makes it pretty hard to build generic i18n structures. Here’s one simple example.
In the last post we had this contrived snippet:
that is part of a page that returns search results of books from an imaginary library. We’re going to look at just the line that says
We found ${self.result_count} books.
and see if we can make it work a bit better. Now, the first, obvious thing here is that we’re going to get some slightly strange-looking output depending on the value of self.result_count:
We found 0 books
We found 1 books
We found 2 books
Well, it’s not so bad. And we’ve probably seen exactly that on various websites before. “1 books.” There’s a cheap, quick way out of the cheesiness, and I bet you’ve seen it all over the place:
We found 0 book(s)
We found 1 book(s)
We found 2 book(s)
but most of the time you’ll see something like (forgive the crappy templating language invented for the purposes of this example)
We found
<if "self.result_count == 1">
one book
<else>
${self.result_count} books
</if>
which seems a bit more reasonable, right? Well, indeed it is, and it works fine, in English. And, to be honest, the same structure will work in quite a few languages. But translate this into Arabic, and you’re in trouble. The reason is that Arabic doesn’t just have the concept of “singular” (as in “one book”) and “plural” (as in “3 books”), but also has the concept of “dual” (as in “two books”), which is often morphologically different (indeed sometimes a totally different word!).
Continuing with our example, the word in Arabic (transliterated into the latin alphabet here to make a point) for “book” is “ketaab”. The word for “two books” is “ketaabain”. The word for “books” (more than two) is “kutuub”. So in Arabic, you’d need a more complicated template to capture that nuance. To get a real picture of how complicated it can be, check out the Wikipedia article on grammatical number; it’s pretty mind-blowing.
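One way to picture the "more complicated template" is to let each language supply its own rule for turning a count into a category, and then look the word up by category. The Python sketch below does that with the three transliterated forms from this post. It is deliberately simplified: the Arabic rule here only distinguishes one, two and everything else, whereas the real rules are richer (the Unicode CLDR data lists six plural categories for Arabic). All the names are invented for the example.

```python
def english_category(n: int) -> str:
    """English: singular for exactly one, plural otherwise."""
    return "one" if n == 1 else "other"


def arabic_category_simplified(n: int) -> str:
    """Simplified Arabic: singular, dual, and a catch-all plural."""
    if n == 1:
        return "one"
    if n == 2:
        return "two"
    return "other"


# Each language chooses its own rule...
CATEGORY_RULES = {
    "en": english_category,
    "ar": arabic_category_simplified,
}

# ...and supplies one form per category that rule can return.
BOOK_FORMS = {
    "en": {"one": "book", "other": "books"},
    "ar": {"one": "ketaab", "two": "ketaabain", "other": "kutuub"},
}


def book_noun(lang: str, n: int) -> str:
    """Return the form of "book" that matches the count in that language."""
    category = CATEGORY_RULES[lang](n)
    return BOOK_FORMS[lang][category]
```

The point of the split is that the template never contains an `if count == 1` test of its own; the number of branches is a property of the language, not of the page.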
It’s easy to invent another example that breaks profoundly, even in a language as familiar to many of us as Spanish.
Let’s say we’re writing a web-based multiplayer text adventure kind of like the old Zork games, and the player has been given a new object by another player. In English, we could imagine something like this:
Gollum has given you a magic ring.
Gandalf has given you a magic cow.
which seems pretty simple. We could use some kind of string formatting, with a format string like this:
“%s has given you a %s”
and it’s not hard to see how that should suffice in English. Translating that to Spanish, we would have
“%s te ha dado un %s”
and inserting the words for “magic cow” (“vaca mágica”) and “magic ring” (“anillo mágico”) we get
Gollum te ha dado un anillo mágico.
Gandalf te ha dado un vaca mágica.
and guess what? It doesn’t work. “Vaca mágica” is feminine, so it needs a different word for “a” in front of it. See how easily these things fall apart? And here’s the best thing about this example: it doesn’t even work in English. What happens when the object you’re being given starts with a vowel?
Gollum has given you a apple-flavoured gummi bear.
“A apple-flavoured gummi bear”?
This stuff is hard.
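One common way around the article problem, sketched here in Python, is to stop gluing an article onto a bare noun and instead have the translator supply the whole noun phrase, article included, for each item. The catalogue layout, the item identifiers and the function name are all hypothetical, and this only handles the one grammatical context shown here; other sentences may need other forms of the same item.

```python
# Sentence patterns: no article baked into the format string.
MESSAGES = {
    "en": "{giver} has given you {item}.",
    "es": "{giver} te ha dado {item}.",
}

# Each item carries its own indefinite phrase, written by a translator
# who knows the gender (Spanish) or the initial sound (English).
ITEM_PHRASES = {
    "en": {
        "ring": "a magic ring",
        "cow": "a magic cow",
        "gummi_bear": "an apple-flavoured gummi bear",
    },
    "es": {
        "ring": "un anillo mágico",
        "cow": "una vaca mágica",
    },
}


def gift_message(lang: str, giver: str, item_id: str) -> str:
    """Build the message from a whole noun phrase, never article + noun."""
    return MESSAGES[lang].format(giver=giver, item=ITEM_PHRASES[lang][item_id])


english = gift_message("en", "Gollum", "gummi_bear")
spanish = gift_message("es", "Gandalf", "cow")
```

The trade-off is more work for translators per item, in exchange for code that never has to guess at gender or vowels.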
The subject of templates is always a bit complicated. Even without throwing human languages into the mix, you already have a lot of things to deal with in your templates:
- HTML (which dialect? which browser?)
- Getting your data into the template
- Javascript in your template
Well, those aren’t things I’m going to talk about (except where it affects i18n). You’re going to be solving those problems on your own.
But there are some things you can do to make i18n a lot easier in your templates too. Sometimes, you can apply the same common sense to your templates as you’re learning to do with your code, and you’ll get good results.
One, or Many?
This is one of the hard questions to answer. For a given page or component, do you have a single template that works in all languages, or one template per language? (Or, the more complicated scenario, where you have one template for some languages, and other templates for others). For example, let’s pretend that your site has a “Latest news” page that shows some news posts pulled out of a database somewhere. And let’s say you want that page as part of your English and Spanish websites. So, should you have two templates, say “news.en.html”:
and another, “news.es.html”:
or should you have a single template, “news.html”, like this:
Clearly the advantage of the second case is that there are no string literals in the template; that means every language can share a single template, and that means less maintenance, and more ground gained in the battle to get your site translated into many languages. So are there any problems with this approach? Indeed there are. The most obvious one, which usually starts to manifest itself after a few updates and revisions, is that after a while, you may not want your templates to be in sync across all languages. Read that again and think about it. Oftentimes, your primary site will grow and evolve rapidly, and you’ll be playing catchup with the other languages while your copywriters and translators try to stay in sync. You don’t want to have to delay a site update in one language while you wait for translators to finish, so sometimes it’s easier to have separate templates for different languages to allow you to keep the implementations of the pages different. In these cases, it’s sometimes beneficial to design your templating system to allow a base template like this to be overridden by templates in specific languages.
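The override idea can be sketched in a few lines of Python. This assumes the "news.es.html" naming used above and a known set of template file names; the function name and the `available` parameter are invented for the example, and a real templating system would have its own loader to hook into.

```python
import os


def resolve_template(page: str, lang: str, available: set) -> str:
    """Prefer a language-specific override, else fall back to the base."""
    stem, ext = os.path.splitext(page)   # "news.html" -> ("news", ".html")
    override = f"{stem}.{lang}{ext}"     # e.g. "news.es.html"
    if override in available:
        return override
    return page


# Spanish has drifted from the shared page, so it has its own template;
# every other language keeps sharing the base one.
templates = {"news.html", "news.es.html"}
spanish_template = resolve_template("news.html", "es", templates)
dutch_template = resolve_template("news.html", "nl", templates)
```

With this in place, a language only gets its own template once it actually needs to differ from the base.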
String Literals… Again!
Most of the time, we don’t really step outside ourselves when we’re coding up our templates. We just throw the text in there, add the dynamic bits, and presto! we’re done. Here’s an example of that:
This seems fairly straightforward, but it’s going to be tough to get that working in another language. Who can spot the problems?
Well, without spending too much time dwelling on templating style, it’s pretty obvious that you’re going to run into problems immediately with:
- building the plural of “self.type” by just adding an ‘s’ - let’s face it, that won’t even get you far in English!
- the “self.keyword” bit - what happens when there’s more than one?
- using “self.result_count” is tricky, even in English, because what if there is only one result?
If you’re astute, you’re probably thinking “those are just logic errors, not errors in the i18n” and you’d be correct. And yes, it’s a contrived example. However, the point here is to get you thinking about these things because they will impact your i18n path. And, as I hope you will see, even when you think you have it right, you still probably don’t.
Syntax and Morphology
If you want to internationalise your web pages successfully, you’re going to have to acquire a basic grasp of syntax and morphology. Stay tuned for the next post, where I’ll explain why these are important, and how they apply to the last example.
This is simple if you start off on the right foot, but if you don’t, it can be an uphill battle. Recall in the previous post that I mentioned that it’s very important to steer clear of giving any language “special treatment”. Unfortunately, I can almost guarantee that if you already have a website set up in, say, English, you are almost certainly treating English as special. In the document root of your website, or in your template hierarchies, chances are you have something like
site/
index.html
about.html
contact.html
images/
javascript/
with no mention of language anywhere. This is a problem if you want to translate your website into another language; you’re already treating those English files as “special”. So, when it comes to organising your templates, make sure you take their language into account. I’m not going to tell you how to do it; there are many ways. A common way could be:
site/
en/
index.html
about.html
...
es/
index.html
...
or you could use a naming scheme like
site/
index.en.html
index.es.html
...
although that may not scale quite as well.
Whatever you choose, just make sure you are consistent.