<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>beta BLOG dot NET - recently in .NET category</title>
  <link rel="alternate" type="text/html" href="http://beta-blog.net/net/" />
  <link rel="self" type="application/atom+xml" href="" />
  <id>tag:beta-blog.net,2009-08-27://1</id>
  <updated>2009-11-25T23:11:42Z</updated>
  
  <generator uri="http://www.sixapart.com/movabletype/">Movable Type 4.25</generator>

<entry>
  <title>understanding unicode surrogates / or: how to deal with Linear B strings in .NET</title>
  <link rel="alternate" type="text/html" href="http://beta-blog.net/2009/11/understanding-unicode-surrogates-or-how-to-deal-with-linear-b-strings-in-net" />
  <id>tag:beta-blog.net,2009://1.52383</id>

  <published>2009-11-17T20:23:58Z</published>
  <updated>2009-11-25T23:11:42Z</updated>

  <summary>Remember that a String object in .NET is a collection of Char objects, where a Char object in turn is documented as a Unicode character, encoded by a 16-bit unsigned integer. Thus, more precisely speaking, a single Char object is able to encode any code point within the Basic Multilingual Plane (BMP), i.e. between U+0000 and U+FFFF. So, where does the rest of the story go? Unicode, as a universal character set, is of course designed to support many more than 65536 characters.
</summary>
  <author>
    <name>Sebastian</name>
    <uri>http://beta-blog.net</uri>
  </author>
  
  <category term=".NET" scheme="http://www.sixapart.com/ns/types#category" />
  
  <category term="codes" label="codes" scheme="http://www.sixapart.com/ns/types#tag" />
  <category term="math" label="math" scheme="http://www.sixapart.com/ns/types#tag" />
  
  <content type="html" xml:lang="en" xml:base="http://beta-blog.net/">
  <![CDATA[<p>
Remember that a <span class="code cs1"><span class="ob">String</span></span> object
in .NET is a collection of <span class="code cs1"><span class="ob">Char</span></span>
objects, where a <span class="code cs1"><span class="ob">Char</span></span> object
in turn is documented as a
<a href="http://en.wikipedia.org/wiki/Unicode" target="_blank">Unicode character</a>,
encoded by a 16-bit unsigned integer.
Thus, more precisely speaking, a single <span class="code cs1"><span class="ob">Char</span></span>
object is able to encode any code point within the
<a href="http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes#Basic_Multilingual_Plane" target="_blank">Basic Multilingual Plane (BMP)</a>,
i.e. between <span class="code">U+0000</span> and <span class="code">U+FFFF</span>.
So, where does the rest of the story go? Unicode, as a universal character set,
is of course designed to support many more than 65536 characters.
</p>
<p>
Now, the trick is to encode code points from <span class="code">2<sup>16</sup></span> upwards
by so-called surrogates, that is, by pairs of 16-bit integers.
To see how this works, remember the well-known
<a href="http://en.wikipedia.org/wiki/Division_algorithm" target="_blank">division algorithm</a>
for integers. That is, if you fix an exponent <span class="math">M</span> and
an integer constant <span class="math">C (0 &lt; C &lt; M)</span>,
then for any integer <span class="math">N</span> within the range
<span class="math">0 &le; N &lt; 2<sup>M</sup></span>
there exists a unique pair of integers <span class="math">H,L</span>, such that
</p>
<p class="quote">
<span class="math">N = 2<sup>C</sup> * H + L,</span> where <span class="math">0 &le; L &lt; 2<sup>C</sup></span> and <span class="math">0 &le; H &lt; 2<sup>M - C</sup></span>.
</p>
<p>
That way you have simply encoded these <span class="math">2<sup>M</sup></span> numbers
<span class="math">N</span> by <span class="math">2<sup>C</sup> * 2<sup>M - C</sup></span> pairs
of numbers <span class="math">H,L</span>.
Hence <span class="math">2<sup>M</sup></span> large numbers are addressed using a set of only
<span class="math">2<sup>C</sup> + 2<sup>M-C</sup></span> small numbers; that's the trick.
</p>
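<p>
To make the splitting concrete, here is a minimal sketch of the division step. The names
<span class="code">PairCodingSketch</span>, <span class="code">Split</span> and
<span class="code">Join</span> are made up for illustration, and the values
<span class="math">M = 10, C = 4</span> in the demo are arbitrary.
</p>

```csharp
using System;

// Hypothetical helper for illustration only, not part of any library.
static class PairCodingSketch
{
    // Splits n (0 <= n, n below 2^M) into the pair (h, l) with n == 2^c * h + l,
    // where l is below 2^c and h is below 2^(M - c).
    public static void Split(int n, int c, out int h, out int l)
    {
        int blockSize = 1 << c;   // 2^c
        h = n / blockSize;        // quotient H
        l = n % blockSize;        // remainder L
    }

    // Inverse of Split: recombines the pair into the original number.
    public static int Join(int h, int l, int c)
    {
        return (h << c) + l;
    }

    static void Main()
    {
        int h, l;
        Split(1000, 4, out h, out l);   // 1000 = 16 * 62 + 8
        Console.WriteLine("H = {0}, L = {1}, N = {2}", h, l, Join(h, l, 4));
        // prints "H = 62, L = 8, N = 1000"
    }
}
```

<p>
With <span class="math">M = 10</span> and <span class="math">C = 4</span>, the 1024 possible values of
<span class="math">N</span> are covered by only 64 + 16 = 80 distinct small numbers.
</p>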

<p>
As we are interested in encoding integers from <span class="math">2<sup>16</sup></span> upwards
by pairs of 16-bit integers, we start from the assumption
</p>
<p class="quote">
<span class="math">2<sup>16</sup> &le; N' &lt; 2<sup>16</sup> + 2<sup>M</sup></span>,
</p>
<p>
and then work with <span class="code">N = N' - 2<sup>16</sup></span>.
In order to decide whether a given 16-bit number belongs to a surrogate pair,
playing either the role of <span class="code">H</span> or <span class="code">L</span>,
finally fix an adequate constant <span class="code">T</span> and set
</p>
<p class="quote">
<span class="math">H' = H + T, L' = L + T + 2<sup>M-C</sup>,</span>
</p>
<p>
thus having tagged all 16-bit integers <span class="math">I</span> satisfying
<span class="math">T &le; I &lt; T + 2<sup>C</sup> + 2<sup>M-C</sup></span>
as surrogate integers. Since <span class="math">H</span> ranges over <span class="math">2<sup>M-C</sup></span> values,
the high surrogates of type <span class="math">H'</span>
are less than <span class="math">T + 2<sup>M-C</sup></span>, and
the ones at or above that bound are the low surrogates of type <span class="math">L'</span>.
</p>

<p>
Now, the setting of Unicode is this: <span class="math">C = 10, M = 20, T = 0xD800</span>.
So, by reserving 2048 small integers as
surrogates, more than a million additional code points up to
<span class="code">U+10FFFF</span> become accessible. The resulting formulas may be found here:
<a href="http://www.unicode.org/book/ch03.pdf" target="_blank">http://www.unicode.org/book/ch03.pdf</a>.
</p>
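<p>
Plugging in the Unicode constants, the arithmetic can be sketched as follows. The class
<span class="code">SurrogateSketch</span> and its methods <span class="code">Encode</span> and
<span class="code">Decode</span> are hypothetical names chosen for this example; only the
<span class="code">char</span> methods called in <span class="code">Main</span> belong to the framework.
The sketch assumes the input lies between <span class="code">U+10000</span> and
<span class="code">U+10FFFF</span> and does no range checking.
</p>

```csharp
using System;

// Hypothetical helper for illustration only; no range checking is done.
static class SurrogateSketch
{
    const int C = 10;                     // bits carried by each surrogate
    const int T = 0xD800;                 // first reserved 16-bit value
    const int LowStart = T + (1 << C);    // 0xDC00, first low surrogate
    const int Offset = 0x10000;           // 2^16, first code point beyond the BMP

    // Encodes a code point in U+10000..U+10FFFF as a (high, low) pair.
    public static void Encode(int codePoint, out char high, out char low)
    {
        int n = codePoint - Offset;       // N = N' - 2^16
        int h = n >> C;                   // quotient H
        int l = n % (1 << C);             // remainder L
        high = (char)(T + h);             // H' = H + T
        low = (char)(LowStart + l);       // L' = L + 0xDC00
    }

    // Decodes a surrogate pair back into its code point.
    public static int Decode(char high, char low)
    {
        int h = high - T;
        int l = low - LowStart;
        return Offset + (h << C) + l;
    }

    static void Main()
    {
        char high, low;
        Encode(0x10016, out high, out low);
        Console.WriteLine("{0:X4} {1:X4}", (int)high, (int)low);   // prints "D800 DC16"
        Console.WriteLine("U+{0:X}", Decode(high, low));           // prints "U+10016"

        // Cross-check against the framework's own conversion.
        string s = char.ConvertFromUtf32(0x10016);
        Console.WriteLine(s[0] == high && s[1] == low);            // prints "True"
        Console.WriteLine(char.IsHighSurrogate(high) && char.IsLowSurrogate(low));   // prints "True"
    }
}
```

<p>
At the upper end, <span class="code">U+10FFFF</span> maps to the pair
<span class="code">0xDBFF, 0xDFFF</span>, which are exactly the last high and the last low surrogate.
</p>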

<p>
Thankfully, .NET developers don't need to do this arithmetic by hand, because it comes
ready-made.
For instance, consider the name of
<a href="http://en.wikipedia.org/wiki/Amnisos" target="_blank">Amnissos</a>,
written in <a href="http://en.wikipedia.org/wiki/Linear_B" target="_blank">Linear B</a>:
</p>
<p class="quote">
<img src="http://beta-blog.net/2009/11/18/linearb_u10000.gif" alt="U+10000" /><img src="http://beta-blog.net/2009/11/18/linearb_u10016.gif" alt="U+10016" /><img src="http://beta-blog.net/2009/11/18/linearb_u1001B.gif" alt="U+1001B" /><img src="http://beta-blog.net/2009/11/18/linearb_u10030.gif" alt="U+10030" /></p>
<p>
In C# it looks like this:
</p>
<fieldset class="collapsible"><legend><a href="javascript:void(0)" id="collapsible_lcarlzt8_1">[-] hide code</a></legend><div class="collapsible-container"><pre class="code"><code class="csharpnet"><span class="cmnt">// alternatively the Char.ConvertFromUtf32() method may be used</span>
<span class="kwd builtin">string</span> <span class="type">amnisos</span> = <span class="str">&quot;\U00010000&quot;</span> + <span class="str">&quot;\U00010016&quot;</span> + <span class="str">&quot;\U0001001B&quot;</span> + <span class="str">&quot;\U00010030&quot;</span>;</code></pre></div></fieldset><script type="text/javascript">/*<![CDATA[*/xLib.onLoad(function(){Blog.Collapsible.create('collapsible_lcarlzt8_1')})/*]]&gt;*/</script>
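<p>
The <span class="code">Char.ConvertFromUtf32()</span> alternative mentioned in the comment might look
like the sketch below. The helper <span class="code">FromCodePoints</span> and the class
<span class="code">AmnisosSketch</span> are names invented for this example.
</p>

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
static class AmnisosSketch
{
    // Builds a string from code points; char.ConvertFromUtf32 returns
    // one Char for BMP code points and a surrogate pair otherwise.
    public static string FromCodePoints(params int[] codePoints)
    {
        StringBuilder sb = new StringBuilder();
        foreach (int cp in codePoints)
        {
            sb.Append(char.ConvertFromUtf32(cp));
        }
        return sb.ToString();
    }

    static void Main()
    {
        string viaEscapes = "\U00010000\U00010016\U0001001B\U00010030";
        string viaConvert = FromCodePoints(0x10000, 0x10016, 0x1001B, 0x10030);
        Console.WriteLine(viaEscapes == viaConvert);   // prints "True"
    }
}
```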
<p>
Note that the <span class="code cs1"><span class="sym">Length</span></span> property
of the resulting string indeed has a value of 8, while it contains only 4 Unicode characters.
So the appropriate way of accessing the actual code points of an arbitrary string
should make use of
<span class="code">System.Globalization.TextElementEnumerator</span>
rather than naively accessing individual
<span class="code cs1"><span class="ob">Char</span></span> objects.
It goes like this:
</p>
<fieldset class="collapsible"><legend><a href="javascript:void(0)" id="collapsible_lcarlzt8_2">[-] hide code</a></legend><div class="collapsible-container"><pre class="code"><code class="csharpnet"><span class="cmnt">// using System.Globalization;</span>
<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator.aspx" target="_blank" rel="nofollow">TextElementEnumerator</a> <span class="type">en</span> = <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx" target="_blank" rel="nofollow">StringInfo</a>.<span class="type">GetTextElementEnumerator</span>(<span class="type">amnisos</span>);
<span class="kwd builtin">while</span> (<span class="type">en.MoveNext</span>())
{
  <span class="kwd builtin">string</span> <span class="type">current</span> = <span class="type">en.GetTextElement</span>();
  <span class="kwd builtin">if</span> (<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.char.aspx" target="_blank" rel="nofollow">Char</a>.<span class="type">IsSurrogate</span>(<span class="type">current</span>, <span class="num">0</span>))
  {
    <span class="cmnt">// a surrogate pair encoding one character, i.e. current.Length == 2</span>
    <span class="kwd builtin">int</span> <span class="type">codepoint</span> = <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.char.aspx" target="_blank" rel="nofollow">Char</a>.<span class="type">ConvertToUtf32</span>(<span class="type">current</span>[<span class="num">0</span>], <span class="type">current</span>[<span class="num">1</span>]);
    <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.console.aspx" target="_blank" rel="nofollow">Console</a>.<span class="type">WriteLine</span>(<span class="str">&quot;U+{0:X6}&quot;</span>, <span class="type">codepoint</span>);
  }
  <span class="kwd builtin">else</span>
  {
    <span class="cmnt">// characters within BMP:</span>
    <span class="cmnt">// current.Length &gt; 1 may be true in case of combining characters </span>
    <span class="cmnt">// cf. StringInfo.ParseCombiningCharacters()</span>
    <span class="kwd builtin">foreach</span> (<span class="kwd builtin">char</span> <span class="type">c</span> <span class="kwd builtin">in</span> <span class="type">current</span>)
    {
      <span class="kwd builtin">int</span> <span class="type">codepoint</span> = (<span class="kwd builtin">int</span>)<span class="type">c</span>; <span class="cmnt">// use AscW() in VB.NET</span>
      <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.console.aspx" target="_blank" rel="nofollow">Console</a>.<span class="type">WriteLine</span>(<span class="str">&quot;U+{0:X4}&quot;</span>, <span class="type">codepoint</span>);
    }
  }
}</code></pre></div></fieldset><script type="text/javascript">/*<![CDATA[*/xLib.onLoad(function(){Blog.Collapsible.create('collapsible_lcarlzt8_2')})/*]]&gt;*/</script>
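<p>
If only the code points are of interest, and not the grouping into text elements, a plain index loop
may be enough. The following is a sketch under that assumption; <span class="code">CodePointSketch</span>,
<span class="code">CountCodePoints</span> and <span class="code">ToCodePoints</span> are hypothetical names,
while <span class="code">char.IsSurrogatePair</span>, <span class="code">char.ConvertToUtf32</span> and
<span class="code">StringInfo.LengthInTextElements</span> are framework members.
</p>

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only.
static class CodePointSketch
{
    // Counts code points by stepping over each surrogate pair as one unit.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i != s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
        {
            count++;
        }
        return count;
    }

    // Returns the code points of s; char.ConvertToUtf32 throws an
    // ArgumentException if s contains an unpaired surrogate.
    public static int[] ToCodePoints(string s)
    {
        int[] result = new int[CountCodePoints(s)];
        int i = 0;
        for (int k = 0; k != result.Length; k++)
        {
            result[k] = char.ConvertToUtf32(s, i);
            i += char.IsSurrogatePair(s, i) ? 2 : 1;
        }
        return result;
    }

    static void Main()
    {
        string amnisos = "\U00010000\U00010016\U0001001B\U00010030";
        Console.WriteLine(amnisos.Length);                                 // prints 8
        Console.WriteLine(new StringInfo(amnisos).LengthInTextElements);   // prints 4
        foreach (int cp in ToCodePoints(amnisos))
        {
            Console.WriteLine("U+{0:X6}", cp);
        }
    }
}
```

<p>
Unlike the enumerator above, this loop treats a base character and its combining marks as separate
code points, so which variant fits depends on what is meant by a "character".
</p>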
<p>
Now, when will we finally be able to register Linear B domain names? ;)
</p>
]]>
  
  </content>
</entry>

</feed>
