Starting 10 December 2009, companies and private persons based in the European
Union will be able to register
.eu Internationalised Domain Names.
The list of supported characters
is divided into several parts, called IDN scripts,
such as "Latin-1 supplement", "Greek extended", "Cyrillic" and the like.
Indeed, I might consider registering
http://www.β-ιστός-κούτσουρο.eu
Unfortunately, one cannot mix several scripts, so β-blog.eu won’t be a valid name,
since ASCII letters belong to the Latin script while β is Greek. (Well, so, I think
I’ll give up that idea.)
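To illustrate the no-mixing rule, here is a rough sketch in Python. It guesses a letter's script from the first word of its Unicode character name, which is only a heuristic of mine and does not reproduce the registry's actual IDN script tables; the helper name `scripts_in` is made up for this example.

```python
import unicodedata

def scripts_in(label: str) -> set[str]:
    """Guess the scripts in a label from the first word of each letter's Unicode name.

    Heuristic only: digits and hyphens are skipped, and the real registry
    tables may group characters differently.
    """
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

greek_only = "β-ιστός-κούτσουρο"
mixed = "\u03b2-blog"  # Greek beta followed by ASCII letters

print(sorted(scripts_in(greek_only)))  # a single script
print(sorted(scripts_in(mixed)))       # more than one script, so the name would be rejected
```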
To get serious: as a EURid registrar, it’s time for us to check out several
issues that may apply to IDN requests built from all those strange letters
Europeans may use.
For instance, note that
a1.eu
and
а1.eu
are completely different domain names. Their resemblance is just an optical
trick, since the first one starts with an ordinary ASCII "a" while the second
starts with U+0430, which is the Unicode code point of the Cyrillic small letter "a".
Indeed, when you type the second one into your browser, it will first compute
the corresponding ACE string xn--1-7sb.eu using the Punycode algorithm and will
then make the DNS request with it.
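The difference is easy to see with a few lines of Python. This is only a sketch using the standard library's `punycode` codec; the helper `to_ace` is a name I made up, and it handles a single label without any nameprep step.

```python
import unicodedata

latin = "a1"               # starts with U+0061
cyrillic = "\u0430" + "1"  # starts with U+0430

# Show what the first character of each label really is.
for label in (latin, cyrillic):
    first = label[0]
    print(f"U+{ord(first):04X}", unicodedata.name(first))

def to_ace(label: str) -> str:
    """Encode one label with raw Punycode; pure-ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

print(to_ace(latin) + ".eu")     # a1.eu
print(to_ace(cyrillic) + ".eu")  # xn--1-7sb.eu
```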
Things get more complicated when you notice that
aŀt.eu
on the one hand, and
al·t.eu
on the other hand, are in fact the same domain name.
Applying the punycode algorithm to both of them, you will get
xn--at-rqa.eu
for the first one and
xn--alt-mga.eu
for the second one, because both byte streams differ.
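As a quick check, the following sketch feeds both spellings to Python's built-in `punycode` codec without any normalization beforehand. The variable names are mine.

```python
dotted_l = "a\u0140t"     # a, U+0140 (l with middle dot), t
l_plus_dot = "al\u00b7t"  # a, l, U+00B7 (middle dot), t

for label in (dotted_l, l_plus_dot):
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in label)
    raw = label.encode("punycode").decode("ascii")
    print(code_points, "->", "xn--" + raw + ".eu")
```

The first label has three code points and the second has four, so raw Punycode cannot produce the same string for both.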
Now, something goes wrong here: when you ask a nameserver for the
IP address of the domain, you have to decide which of the two strings to ask for.
A nameserver is unlikely to answer for both of them.
Well, according to the IDNA standard as defined in RFC 3490, applications
must not only apply Punycode to IDNs, but must also apply the
nameprep algorithm first, which in turn consists of several normalization mappings
such as lower case conversion and, more interestingly, the
Unicode Normalization Form KC
(see http://unicode.org/reports/tr15/).
The latter decomposes characters by Unicode compatibility equivalence and then recomposes them canonically.
Thus, the character U+0140, the Latin small letter "l" with middle dot,
decomposes into two Unicode characters:
U+0140 => U+006C + U+00B7
That is, the middle dot is separated from the letter "l". Therefore, xn--at-rqa.eu
is an application of Punycode, but not a conversion in the sense of the IDNA standard.
Indeed, you will get different results from different so-called IDN converter libraries
for that domain, depending on whether they are just doing Punycode or applying a proper nameprep
first. A reliable reference is the Verisign conversion tool
(http://mct.verisign-grs.com/index.shtml)
and the corresponding SDK, for example (although I wasn’t able to get the Win32 version working).
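For comparison, Python's built-in `idna` codec follows RFC 3490 and runs nameprep before Punycode, so it can serve as a small cross-check. This is a sketch with variable names of my own, not a replacement for the reference tool.

```python
import unicodedata

dotted_l = "a\u0140t.eu"     # contains U+0140
l_plus_dot = "al\u00b7t.eu"  # contains U+006C + U+00B7

# NFKC splits U+0140 into "l" followed by the middle dot.
decomposed = unicodedata.normalize("NFKC", "\u0140")
print([f"U+{ord(ch):04X}" for ch in decomposed])

# With nameprep applied first, both spellings give the same ACE string.
ace_a = dotted_l.encode("idna").decode("ascii")
ace_b = l_plus_dot.encode("idna").decode("ascii")
print(ace_a)
print(ace_b)
```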
After all, the challenge for the registrar is to maintain the request database properly,
accepting IDN requests in both the normalized and any equivalent form. Moreover,
one has to check a requested name against the given character list, which contains
non-normalized letters, even if the requested name is already normalized.
Since normalization isn’t a reversible mapping, this may be complicated
in general, but it should be solvable in this case.
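One possible way to approach the check, sketched in Python under several assumptions: `ALLOWED` is a tiny made-up excerpt and not the real character list, `nfkc` is a simplified stand-in for nameprep, and compositions across character boundaries are ignored. The idea is to normalize every allowed character once and then test whether the normalized request can be split into those normalized pieces.

```python
import unicodedata

# Hypothetical excerpt of a character list (assumption, not the registry's list).
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"\u0140"}

def nfkc(text: str) -> str:
    """Lower-case and NFKC-normalize; a simplification of full nameprep."""
    return unicodedata.normalize("NFKC", text.lower())

# Normalized image of each allowed character, e.g. U+0140 -> "l" + U+00B7.
IMAGES = {nfkc(ch) for ch in ALLOWED}
MAX_LEN = max(len(image) for image in IMAGES)

def is_allowed(label: str) -> bool:
    """True if the normalized label splits into images of allowed characters."""
    target = nfkc(label)
    if not target:
        return False
    # reachable[i] is True when target[:i] can be covered by images.
    reachable = [True] + [False] * len(target)
    for i in range(len(target)):
        if not reachable[i]:
            continue
        for size in range(1, min(MAX_LEN, len(target) - i) + 1):
            if target[i:i + size] in IMAGES:
                reachable[i + size] = True
    return reachable[len(target)]

# Both spellings share one database key and both pass the check.
print(nfkc("a\u0140t") == nfkc("al\u00b7t"))
print(is_allowed("a\u0140t"), is_allowed("al\u00b7t"), is_allowed("a\u00b7t"))
```

A middle dot is accepted here only directly after an "l", because that is the only way it can arise from a character in the list.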