<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>beta BLOG dot NET - recently in .NET category</title>
  <link rel="alternate" type="text/html" href="http://beta-blog.net/net/" />
  <link rel="self" type="application/atom+xml" href="" />
  <id>tag:beta-blog.net,2009-08-27://1</id>
  <updated>2010-04-02T01:17:02Z</updated>
  
  <generator uri="http://www.sixapart.com/movabletype/">Movable Type 4.25</generator>

<entry>
  <title>Is LINQ functional?</title>
  <link rel="alternate" type="text/html" href="http://beta-blog.net/2010/03/is-linq-functional" />
  <id>tag:beta-blog.net,2010://1.52390</id>

  <published>2010-03-31T20:11:19Z</published>
  <updated>2010-04-02T01:17:02Z</updated>

  <summary>With its 3.5 extensions, the .NET framework started to turn into a really cool-looking programming concept, not least thanks to the syntactic sugar of LINQ. One reason for that is surely its functional look. However, as LINQ is integrated into an imperative context, it will never be able to guarantee state-free evaluation the way a genuine functional language does. Nevertheless, it is worth discussing and playing around with a few aspects of it from a multi-paradigm point of view. </summary>
  <author>
    <name>Sebastian</name>
    <uri>http://beta-blog.net</uri>
  </author>
  
  <category term=".NET" scheme="http://www.sixapart.com/ns/types#category" />
  
  <category term="algorithms" scheme="http://www.sixapart.com/ns/types#category" />
  
  <category term="net" label=".NET" scheme="http://www.sixapart.com/ns/types#tag" />
  <category term="c" label="C#" scheme="http://www.sixapart.com/ns/types#tag" />
  <category term="math" label="math" scheme="http://www.sixapart.com/ns/types#tag" />
  
  <content type="html" xml:lang="en" xml:base="http://beta-blog.net/">
  <![CDATA[<p>
With its 3.5 extensions, the .NET framework started to turn into a really
cool-looking programming concept,
not least thanks to the syntactic sugar of
<a href="http://msdn.microsoft.com/en-us/library/bb397676.aspx" target="_blank">LINQ</a>.
One reason for that is surely its <a href="http://en.wikipedia.org/wiki/Functional_programming" target="_blank">functional</a>
look.
However, as LINQ is integrated into an imperative context, it will never be able to
guarantee state-free evaluation the way a genuine functional language does.
Nevertheless, it is worth discussing and playing around with a few aspects of it
from a multi-paradigm point of view.
</p>

<h3>Delegating definitions in C# 3.0</h3>
<p>
Firstly, the concept of
<a href="http://en.wikipedia.org/wiki/First-class_function" target="_blank">first-class functions</a>,
i.e. the invention of the function type, leads to the notion of closures.
So for instance, a constant function such as
</p>
<pre class="code"><code class="csharpnet"><span class="kwd def">Func</span>&lt;<span class="kwd builtin">int</span>&gt; <span class="type">i</span> = () =&gt; <span class="num">1</span>;
</code></pre>
<p>
defines something like a read-only variable.
You may get its value now, later or never,
but as long as <code>i</code> itself is not reassigned, you can be sure that its value will never change anywhere in your code.
Hence, you have gained a little bit of control over your program with this
weird piece of code.
That's a basic idea of functional programming.
</p>

<p>
The concept of function types leads to higher order
functions, i.e. functions mapping functions to other functions.
Thus, the <a href="http://en.wikipedia.org/wiki/Currying" target="_blank">curry functor</a>,
a key concept in the theory of functional programming,
comes into play:
</p>
<p class="quote">
<span class="math">curry: (X <span class="small">x</span> Y &rarr; Z) &rarr; (X  &rarr; Y  &rarr; Z)</span>
</p>
<p>
That is, for any function <span class="math">f(x,y)</span>, there is a curried function
<span class="math">curry(f)(x)</span>
taking <span class="math">x</span> to a function <span class="math">g(y) = f(x,y)</span>.
This is now implemented easily in C# using generic types:
</p>
<pre class="code">
<code class="csharpnet"><span class="kwd builtin">static</span> <span class="kwd def">Func</span>&lt;<span class="type">X</span>, <span class="kwd def">Func</span>&lt;<span class="type">Y</span>, <span class="type">Z</span>&gt;&gt; <span class="type">Curry</span>&lt;<span class="type">X</span>, <span class="type">Y</span>, <span class="type">Z</span>&gt;(<span class="kwd def">Func</span>&lt;<span class="type">X</span>, <span class="type">Y</span>, <span class="type">Z</span>&gt; <span class="type">f</span>)
{
  <span class="kwd builtin">return</span> <span class="type">x</span> =&gt; <span class="type">y</span> =&gt; <span class="type">f</span>(<span class="type">x</span>, <span class="type">y</span>);
}
</code></pre>
<p>
(inspired by this <a target="_blank" href="http://jacobcarpenter.wordpress.com/2008/01/02/c-abuse-of-the-day-functional-library-implemented-with-lambdas/">C# abuse of the day</a>).
Well, that's more or less of academic interest, since one would hardly ever replace
<span class="code">x++</span> by
</p>
<pre class="code">
<code class="csharpnet"><span class="type">x</span> = <span class="type">Curry</span>&lt;<span class="kwd builtin">int</span>, <span class="kwd builtin">int</span>, <span class="kwd builtin">int</span>&gt;((<span class="type">a</span>, <span class="type">b</span>) =&gt; <span class="type">a</span> + <span class="type">b</span>)(<span class="num">1</span>)(<span class="type">x</span>); <span class="cmnt">// x++ ;)</span>
</code></pre>
<p>
A slightly more interesting example is the following:
</p>
<pre class="code">
<code class="csharpnet"><span class="cmnt">// using System.Text.RegularExpressions;</span>
<span class="kwd builtin">var</span> <span class="type">grep</span> = <span class="type">Curry</span>&lt;<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx" target="_blank" rel="nofollow">Regex</a>, <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">string</span>&gt;, <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">string</span>&gt;&gt;(
  (<span class="type">regex</span>, <span class="type">list</span>) =&gt; <span class="kwd builtin">from</span> <span class="type">s</span> <span class="kwd builtin">in</span> <span class="type">list</span>
                   <span class="kwd builtin">where</span> <span class="type">regex.Match</span>(<span class="type">s</span>).<span class="type">Success</span>
                   <span class="kwd builtin">select</span> <span class="type">s</span>);
<span class="kwd builtin">var</span> <span class="type">grepFoo</span> = <span class="type">grep</span>(<span class="kwd builtin">new</span> <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx" target="_blank" rel="nofollow">Regex</a>(<span class="str">&quot;foo&quot;</span>));
</code></pre>
<p>
Thus, <span class="code">grepFoo</span> will grep all words containing
<code class="csharpnet"><span class="str">&quot;foo&quot;</span></code>
from a wordlist.
Note that after the statement
</p>
<pre class="code">
<code class="csharpnet"><span class="kwd builtin">var</span> <span class="type">fooList</span> = <span class="type">grepFoo</span>(<span class="kwd builtin">new</span> <span class="kwd builtin">string</span>[]{<span class="str">&quot;foo&quot;</span>, <span class="str">&quot;bar&quot;</span>, <span class="str">&quot;foobar&quot;</span>});
</code></pre>
<p>
no regex has been applied yet.
Indeed, <code class="csharpnet"><span class="type">fooList</span></code>
is of type
<code class="csharpnet"><a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">string</span>&gt;</code>
and not yet enumerated at this point.
So the evaluation of the expression is deferred until its result is needed by another computation,
which smells like lazy evaluation.
</p>
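<p>
The deferral can be made visible by counting how often the predicate actually runs.
The following is only a minimal sketch and not part of the original example: the names
<code>DeferredDemo</code>, <code>calls</code> and <code>isFoo</code> are made up for illustration.
</p>
<pre class="code"><code class="csharpnet">using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class DeferredDemo
{
    static void Main()
    {
        var regex = new Regex(&quot;foo&quot;);
        int calls = 0; // hypothetical counter: how often the predicate was invoked

        Func&lt;string, bool&gt; isFoo = s =&gt; { calls++; return regex.IsMatch(s); };

        IEnumerable&lt;string&gt; fooList = from s in new[] { &quot;foo&quot;, &quot;bar&quot;, &quot;foobar&quot; }
                                      where isFoo(s)
                                      select s;
        Console.WriteLine(calls); // 0: defining the query runs nothing

        List&lt;string&gt; result = fooList.ToList();
        Console.WriteLine(calls); // 3: enumeration runs the predicate once per word

        fooList.Count();
        Console.WriteLine(calls); // 6: a second enumeration runs everything again
    }
}
</code></pre>
<p>
The last line is the interesting one: a deferred query does not remember its results,
so every enumeration pays for the regex again unless the result is materialized with
something like <code>ToList()</code>.
</p>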

<h3>LINQ is not lazy!</h3>
<p>
One of the most important paradigms of functional programming is the concept of
<a href="http://en.wikipedia.org/wiki/Lazy_evaluation" target="_blank">lazy evaluation</a>.
For instance, in a functional language, such as the good old
<a href="http://haskell.org/" target="_blank">Haskell</a>,
an expression such as
</p>
<pre class="code"><code>length [1, 2, 3 `div` 0]
</code></pre>
<p>
evaluates to <span class="code">3</span>.
That is, the runtime is too lazy to fail on the division by zero,
neither at compile time nor at run time, since it doesn't need to know any element
of the list in order to calculate its length.
In <em>C#</em> (where you aren't even able to compile an expression such as <span class="code">1/0</span>),
you may let
</p>
<pre class="code"><code class="csharpnet"><span class="kwd builtin">var</span> <span class="type">q1</span> = <span class="kwd builtin">from</span> <span class="type">i</span> <span class="kwd builtin">in</span> (<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">int</span>&gt;)<span class="kwd builtin">new</span> <span class="kwd builtin">int</span>[] { <span class="num">1</span>, <span class="num">2</span>, <span class="num">3</span> }
         <span class="kwd builtin">select</span> <span class="num">1</span>/(<span class="type">i</span> - <span class="num">3</span>);
</code></pre>
<p>
without getting a run time error.
But this has nothing to do with lazy evaluation, since the query expression isn't evaluated at all at this point
(in contrast to the array definition inside the query), so the query expression is simply treated as a function definition.
However, as soon as an aggregation expression such as
</p>
<pre class="code"><code class="csharpnet"><span class="kwd builtin">int</span> <span class="type">three</span> = <span class="type">q1.Count</span>();
</code></pre>
<p>
is reached, a
<span class="code"><code class="csharpnet"><a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.dividebyzeroexception.aspx" target="_blank" rel="nofollow">DivideByZeroException</a></code></span>
will be thrown.
Thus, LINQ evaluates eagerly here, not lazily.
On the other hand,
</p>
<pre class="code"><code class="csharpnet"><span class="kwd builtin">int</span> <span class="type">two</span> = <span class="type">q1.Take</span>(<span class="num">2</span>).<span class="type">Count</span>();
</code></pre>
<p>
works fine, since the black hole stays unevaluated due to the <code>Take</code> operator.
But, having
</p>
<pre class="code"><code class="csharpnet"><span class="kwd builtin">var</span> <span class="type">q2</span> = <span class="kwd builtin">from</span> <span class="type">i</span> <span class="kwd builtin">in</span> (<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">int</span>&gt;)<span class="kwd builtin">new</span> <span class="kwd builtin">int</span>[] { <span class="num">1</span>, <span class="num">2</span>, <span class="num">3</span> }
         <span class="kwd builtin">select</span> <span class="num">1</span>/(<span class="type">i</span> - <span class="num">1</span>);
<span class="kwd builtin">int</span> <span class="type">two2</span> = <span class="type">q2.Skip</span>(<span class="num">1</span>).<span class="type">Count</span>();
</code></pre>
<p>
instead, you will (guess what!) get the exception again.
Thus, in contrast to the <span class="code"><code class="csharpnet"><span class="type">Take</span></code></span> operator,
the <span class="code"><code class="csharpnet"><span class="type">Skip</span></code></span> operator
does iterate through skipped elements and hence evaluates them
(at least in the .NET Framework implementation this post was written against; newer runtimes may optimise such chains differently).
Ok, that's no surprise, since these operators are using the
<code class="csharpnet"><a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerator.aspx" target="_blank" rel="nofollow">IEnumerator</a></code>
provided by the corresponding
<code class="csharpnet"><a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a></code>.
So, LINQ pretends to be lazy in the way that
</p>
<pre class="code"><code class="csharpnet"><span class="kwd builtin">var</span> <span class="type">p</span> = <span class="type">q2.Reverse</span>();
</code></pre>
<p>
won't be evaluated at this point and thus doesn't fail, whereas
</p>
<pre class="code"><code class="csharpnet"><span class="kwd builtin">int</span> <span class="type">two3</span> = <span class="type">p.Take</span>(<span class="num">2</span>).<span class="type">Count</span>();
</code></pre>
<p>
throws the exception again, even though the evil element shouldn't be taken here:
<code>Reverse</code> has to buffer the whole sequence before it can yield its first element.
</p>
<p>
A functional approach to force laziness would be to replace value expressions by
constant functions, but the compiler won't accept something like this:
</p>
<pre class="code"><code class="csharpnet"><span class="cmnt">// The type of the expression in the select clause is incorrect.</span>
<span class="cmnt">// Type inference failed in the call to &#039;Select&#039;.</span>
<span class="kwd builtin">var</span> <span class="type">q1_</span> = <span class="kwd builtin">from</span> <span class="type">i</span> <span class="kwd builtin">in</span> (<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">int</span>&gt;)<span class="kwd builtin">new</span> <span class="kwd builtin">int</span>[] { <span class="num">1</span>, <span class="num">2</span>, <span class="num">3</span> }
          <span class="kwd builtin">select</span> () =&gt; <span class="num">1</span> / (<span class="type">i</span> - <span class="num">3</span>);
</code></pre>
<p>
Hence, LINQ isn't lazy, but has a smart way of making function definitions
look like statement expressions.
</p>
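<p>
For what it's worth, the thunk idea can be rescued by telling the compiler which delegate type is meant.
The following is a minimal sketch under that assumption; the explicit cast to
<code>Func&lt;int&gt;</code> and the names <code>ThunkDemo</code> and <code>thunks</code> are not part of the code above.
</p>
<pre class="code"><code class="csharpnet">using System;
using System.Linq;

static class ThunkDemo
{
    static void Main()
    {
        // The cast gives the lambda a type, so type inference for Select succeeds.
        var thunks = from i in new[] { 1, 2, 3 }
                     select (Func&lt;int&gt;)(() =&gt; 1 / (i - 3));

        // Counting only creates the delegates, it never invokes them.
        Console.WriteLine(thunks.Count());   // 3

        // Invoking a harmless element: 1 / (1 - 3) is 0 in integer division.
        Console.WriteLine(thunks.First()()); // 0

        try
        {
            thunks.Last()();                 // 1 / (3 - 3) fails only when invoked
        }
        catch (DivideByZeroException)
        {
            Console.WriteLine(&quot;division by zero, but only on demand&quot;);
        }
    }
}
</code></pre>
<p>
So laziness is available, but it has to be spelled out element by element;
the query itself still knows nothing about it.
</p>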


<h3>Diving into recursion</h3>
<p>
Remember the famous
<a href="http://en.wikipedia.org/wiki/Fibonacci_number" target="_blank">Fibonacci numbers</a>:
</p>
<p class="quote">
<span class="math">fib<sub>0</sub> = 0, fib<sub>1</sub> = 1, fib<sub>n</sub> = fib<sub>n-1</sub> + fib<sub>n-2</sub>.</span>
</p>
<p>
The sequence starts with
</p>
<p class="quote">
<span class="math">fibs = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...</span>
</p>
<p>
where <span class="math">fibs<sub>100</sub></span> is already a number with 21 digits, so the sequence grows quite fast.
Although one may calculate Fibonacci numbers in constant time (given constant-time floating-point arithmetic) using
<a href="http://mathworld.wolfram.com/BinetsFibonacciNumberFormula.html" target="_blank">Binet's formula</a>,
the definition leads to interesting comparisons of different recursion strategies.
</p>

<p>
Well, let's have a
</p>
<pre class="code"><code class="csharpnet"><span class="kwd builtin">delegate</span> <span class="kwd builtin">long</span> <span class="kwd def">Fibonacci</span>(<span class="kwd builtin">int</span> <span class="type">n</span>);
</code></pre>
<p>
A direct translation of the definition into a lambda recursion looks like this:
</p>
<pre class="code"><code class="csharpnet"><span class="kwd def">Fibonacci</span> <span class="type">fib1</span> = <span class="kwd builtin">null</span>; <span class="cmnt">// pre-assigned for use within recursion</span>
<span class="type">fib1</span> = <span class="type">n</span> =&gt; <span class="type">n</span> &lt;= <span class="num">1</span> ? <span class="type">n</span> : <span class="type">fib1</span>(<span class="type">n</span> - <span class="num">1</span>) + <span class="type">fib1</span>(<span class="type">n</span> - <span class="num">2</span>);
</code></pre>
<p>
The funny thing about this implementation is that the Fibonacci function itself determines its run time:
it's <span class="math">O(fib<sub>n</sub>)</span>, i.e. lower values are
recalculated again and again in order to get a higher one, due to the lack of an aggregating strategy.
</p>

<p>
Now, in Haskell you may get around this very elegantly by defining an infinite list:
</p>
<pre class="code"><code>fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
</code></pre>
<p>
The list is initialized with two elements.
Conceptually, the <code>tail</code> function drops the first element of the <code>fibs</code> list,
while <code>zipWith (+)</code> creates a new list by adding the elements of
<code>fibs</code> and <code>(tail fibs)</code> pairwise.
In practice, Haskell is smart and lazy enough to avoid any needless recalculation
of numbers already present in the <code>fibs</code> list.
Thus, the algorithm applied here is the same one a human being would apply spontaneously using a
pencil and a piece of paper. So, it's <span class="math">O(n)</span>.
</p>

<p>
To define an infinite list in C#, one should
implement the
<code class="csharpnet"><a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a></code>
interface in the way that
the corresponding
<code class="csharpnet"><a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerator.aspx" target="_blank" rel="nofollow">IEnumerator</a></code>
expands the list on demand within its
<code class="csharpnet"><span class="type">MoveNext</span>()</code>
method.
Here, a little one-liner is enough,
taking a list and an expanding function to a
<code class="csharpnet"><span class="kwd def">Fibonacci</span></code> type:
</p>
<pre class="code"><code class="csharpnet"><span class="kwd def">Func</span>&lt;
  <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">long</span>&gt;,
  <span class="kwd def">Func</span>&lt;<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">long</span>&gt;, <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">long</span>&gt;&gt;,
  <span class="kwd def">Fibonacci</span>&gt; <span class="type">infList</span> = <span class="kwd builtin">null</span>;
<span class="type">infList</span> = (<span class="type">list</span>, <span class="type">exp</span>) =&gt; <span class="type">n</span> =&gt; <span class="type">n</span> &lt; <span class="type">list.Count</span>() ?
  <span class="type">list.Skip</span>(<span class="type">n</span>).<span class="type">First</span>() : <span class="type">infList</span>(<span class="type">exp</span>(<span class="type">list</span>), <span class="type">exp</span>)(<span class="type">n</span>);
</code></pre>
<p>
Now, .NET 4 also provides a <code>Zip</code> operator.
So, a simple syntactic translation of the Haskell list would look like this:
</p>
<pre class="code"><code class="csharpnet"><span class="kwd def">Func</span>&lt;<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">long</span>&gt;, <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">long</span>&gt;&gt; <span class="type">fibZip</span> = <span class="type">fibs</span> =&gt;
  <span class="type">fibs.Take</span>(<span class="num">2</span>).<span class="type">Concat</span>(<span class="type">fibs.Zip</span>(<span class="type">fibs.Skip</span>(<span class="num">1</span>), (<span class="type">x</span>, <span class="type">y</span>) =&gt; <span class="type">x</span> + <span class="type">y</span>));
</code></pre>
<p>
Hm, but this one is even worse than the naive recursion.
Indeed, trying
</p>
<pre class="code"><code class="csharpnet"><span class="kwd def">Fibonacci</span> <span class="type">fib2</span> = <span class="type">infList</span>(<span class="kwd builtin">new</span> <span class="kwd builtin">long</span>[] { <span class="num">0</span>, <span class="num">1</span> }, <span class="type">fibZip</span>);
</code></pre>
<p>
you will see that nothing is aggregated at all this way: the query is deferred, so every
enumeration recomputes all of its elements from scratch. The concept of enumeration is simply not functional.
We may repair <code>fibZip</code> as follows:
</p>
<pre class="code"><code class="csharpnet"><span class="kwd def">Func</span>&lt;<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">long</span>&gt;, <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">long</span>&gt;&gt; <span class="type">fibZip2</span> = <span class="type">fibs</span> =&gt;
  <span class="type">fibs.Concat</span>((<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a>&lt;<span class="kwd builtin">long</span>&gt;)(<span class="kwd builtin">new</span> <span class="kwd builtin">long</span>[] {
    <span class="type">fibs.Skip</span>(<span class="type">fibs.Count</span>() - <span class="num">2</span>).<span class="type">Sum</span>() }));
</code></pre>
<p>
This one looks a bit weird, since it's not that easy to extend an
<code class="csharpnet"><a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.collections.ienumerable.aspx" target="_blank" rel="nofollow">IEnumerable</a></code>
by one element. Anyway,
</p>
<pre class="code"><code class="csharpnet"><span class="kwd def">Fibonacci</span> <span class="type">fib3</span> = <span class="type">infList</span>(<span class="kwd builtin">new</span> <span class="kwd builtin">long</span>[] { <span class="num">0</span>, <span class="num">1</span> }, <span class="type">fibZip2</span>);
</code></pre>
<p>
indeed gets by with <span class="math">O(n)</span> additions
(although <code>Count()</code> and <code>Skip()</code> still re-enumerate the list in every step),
even though the idea of an infinite list has lost its magic this way.
</p>
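<p>
For comparison, the on-demand <code>MoveNext()</code> idea mentioned above is what a C# iterator block generates.
The following is only a sketch, not the approach taken in this post; <code>FibsDemo</code> and <code>Fibs</code> are hypothetical names.
Note that <code>long</code> holds Fibonacci numbers only up to <span class="math">fib<sub>92</sub></span>, after which the unchecked addition silently wraps around.
</p>
<pre class="code"><code class="csharpnet">using System;
using System.Collections.Generic;
using System.Linq;

static class FibsDemo
{
    // An endless sequence: each MoveNext() of the generated enumerator
    // performs exactly one addition, so the n-th element costs O(n).
    static IEnumerable&lt;long&gt; Fibs()
    {
        long a = 0, b = 1;
        while (true)
        {
            yield return a;
            long next = a + b; // wraps around beyond fib(92)
            a = b;
            b = next;
        }
    }

    static void Main()
    {
        // Only the first ten elements are ever computed.
        Console.WriteLine(string.Join(&quot;, &quot;, Fibs().Take(10)));
        // 0, 1, 1, 2, 3, 5, 8, 13, 21, 34

        Console.WriteLine(Fibs().ElementAt(50)); // 12586269025
    }
}
</code></pre>
<p>
This keeps the state in two local variables rather than in the list itself, so it is an imperative
loop in disguise and not a self-referential definition like the Haskell one.
</p>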

<h3>Conclusion</h3>
<p>
As expected, neither C# nor LINQ turns out to implement
the paradigms of a functional language.
Nonetheless, it's really fancy. 8-)
</p>
]]>
  
  </content>
</entry>

<entry>
  <title>understanding unicode surrogates / or: how to deal with Linear B strings in .NET</title>
  <link rel="alternate" type="text/html" href="http://beta-blog.net/2009/11/understanding-unicode-surrogates-or-how-to-deal-with-linear-b-strings-in-net" />
  <id>tag:beta-blog.net,2009://1.52383</id>

  <published>2009-11-17T20:23:58Z</published>
  <updated>2010-07-14T21:34:53Z</updated>

  <summary>Remember that a String object in .NET is a collection of Char objects, where a Char object in turn is documented as a Unicode character, encoded by a 16-bit unsigned integer. Thus, more precisely speaking, a single Char object is able to encode any code point within the basic multilingual plane (BMP), i.e. between U+0000 and U+FFFF. So, where does the rest of the story go? Unicode, as a universal character set, is designed to support many more than 65536 characters, of course.
</summary>
  <author>
    <name>Sebastian</name>
    <uri>http://beta-blog.net</uri>
  </author>
  
  <category term=".NET" scheme="http://www.sixapart.com/ns/types#category" />
  
  <category term="codes" label="codes" scheme="http://www.sixapart.com/ns/types#tag" />
  <category term="math" label="math" scheme="http://www.sixapart.com/ns/types#tag" />
  <category term="unicode" label="unicode" scheme="http://www.sixapart.com/ns/types#tag" />
  
  <content type="html" xml:lang="en" xml:base="http://beta-blog.net/">
  <![CDATA[<p>
Remember that a <span class="code cs1"><span class="ob">String</span></span> object
in .NET is a collection of <span class="code cs1"><span class="ob">Char</span></span>
objects, where a <span class="code cs1"><span class="ob">Char</span></span> object
in turn is documented as a
<a href="http://en.wikipedia.org/wiki/Unicode" target="_blank">Unicode character</a>,
encoded by a 16-bit unsigned integer.
Thus, more precisely speaking, a single <span class="code cs1"><span class="ob">Char</span></span>
object is able to encode any code point within the
<a href="http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes#Basic_Multilingual_Plane" target="_blank">basic multilingual plane (BMP)</a>,
i.e. between <span class="code">U+0000</span> and <span class="code">U+FFFF</span>.
So, where does the rest of the story go? Unicode, as a universal character set,
is designed to support many more than 65536 characters, of course.
</p>
<p>
Now, the trick is to encode code points from <span class="code">2<sup>16</sup></span> upwards
by so-called surrogates, that is, by pairs of 16-bit integers.
To see how this works, remember the well-known
<a href="http://en.wikipedia.org/wiki/Division_algorithm" target="_blank">division algorithm</a>
for integers. That is, if you have a bit width <span class="math">M</span> and
fix an integer constant <span class="math">C (0 &lt; C &lt; M)</span>,
then for any integer <span class="math">N</span> within the range
<span class="math">0 &le; N &lt; 2<sup>M</sup></span>,
there exists a unique pair of integers <span class="math">H,L</span>, such that
</p>
<p class="quote">
<span class="math">N = 2<sup>C</sup> * H + L,</span> where <span class="math">0 &le; L &lt; 2<sup>C</sup></span> and <span class="math">0 &le; H &lt; 2<sup>M - C</sup></span>.
</p>
<p>
That way you have simply encoded these <span class="math">2<sup>M</sup></span> numbers
<span class="math">N</span> by <span class="math">2<sup>C</sup> * 2<sup>M - C</sup></span> pairs
of numbers <span class="math">H,L</span>.
Hence <span class="math">2<sup>M</sup></span> large numbers are addressed using a set of only
<span class="math">2<sup>C</sup> + 2<sup>M-C</sup></span> small numbers; that's the trick.
</p>

<p>
As we are interested in encoding integers from <span class="math">2<sup>16</sup></span> upwards
by pairs of 16-bit integers, we start from the assumption
</p>
<p class="quote">
<span class="math">2<sup>16</sup> &le; N' &lt; 2<sup>16</sup> + 2<sup>M</sup></span>,
</p>
<p>
and work with <span class="code">N = N' - 2<sup>16</sup></span>.
In order to decide whether a given 16-bit number belongs to a surrogate pair,
playing either the role of <span class="code">H</span> or <span class="code">L</span>,
finally fix an adequate constant <span class="code">T</span> and set
</p>
<p class="quote">
<span class="math">H' = H + T, L' = L + T + 2<sup>M-C</sup>,</span>
</p>
<p>
thus having tagged all 16-bit integers <span class="math">I</span> satisfying
<span class="math">T &le; I &lt; T + 2<sup>C</sup> + 2<sup>M-C</sup></span>
as surrogate integers, where the high surrogates of type <span class="math">H'</span>
are less than <span class="math">T + 2<sup>M-C</sup></span> and
the ones from there upwards are the low surrogates of type <span class="math">L'</span>.
</p>

<p>
Now, the setting of Unicode is this: <span class="math">C = 10, M = 20, T = 0xD800</span>.
So, by reserving 2048 small integers as
surrogates, more than a million additional code points up to
<span class="code">U+10FFFF</span> become accessible. The resulting formulas may be found here:
<a href="http://www.unicode.org/book/ch03.pdf" target="_blank">http://www.unicode.org/book/ch03.pdf</a>.
</p>
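<p>
To see the numbers at work, the arithmetic can be replayed by hand and compared with what the framework produces.
This is just a sketch using the constants given above; the class name <code>SurrogateDemo</code>
and the sample code point <code>U+10016</code> are chosen for illustration only.
</p>
<pre class="code"><code class="csharpnet">using System;

static class SurrogateDemo
{
    static void Main()
    {
        const int C = 10;        // bits carried by the low surrogate
        const int T = 0xD800;    // first surrogate code unit
        int codepoint = 0x10016; // N', a code point outside the BMP

        int n = codepoint - 0x10000; // N = N' - 2^16
        int h = n &gt;&gt; C;              // quotient H of the division by 2^C
        int l = n &amp; ((1 &lt;&lt; C) - 1);  // remainder L

        char high = (char)(T + h);         // H' = H + T
        char low = (char)(T + 0x400 + l);  // L', the low range starts at 0xDC00

        Console.WriteLine(&quot;{0:X4} {1:X4}&quot;, (int)high, (int)low); // D800 DC16

        // The framework arrives at the same pair.
        string builtIn = char.ConvertFromUtf32(codepoint);
        Console.WriteLine(builtIn[0] == high); // True
        Console.WriteLine(builtIn[1] == low);  // True
    }
}
</code></pre>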

<p>
Thankfully, .NET unicoders don't need to deal with hex numbers at all, because all of this comes
ready-made.
For instance, consider the name of
<a href="http://en.wikipedia.org/wiki/Amnisos" target="_blank">Amnissos</a>,
written in <a href="http://en.wikipedia.org/wiki/Linear_B" target="_blank">Linear B</a>:
</p>
<p class="quote">
<img src="http://beta-blog.net/2009/11/18/linearb_u10000.gif" alt="U+10000" /><img src="http://beta-blog.net/2009/11/18/linearb_u10016.gif" alt="U+10016" /><img src="http://beta-blog.net/2009/11/18/linearb_u1001B.gif" alt="U+1001B" /><img src="http://beta-blog.net/2009/11/18/linearb_u10030.gif" alt="U+10030" /></p>
<p>
In C# it looks like this:
</p>
<pre class="code"><code class="csharpnet"><span class="cmnt">// alternatively the Char.ConvertFromUtf32() method may be used</span>
<span class="kwd builtin">string</span> <span class="type">amnisos</span> = <span class="str">&quot;\U00010000&quot;</span> + <span class="str">&quot;\U00010016&quot;</span> + <span class="str">&quot;\U0001001B&quot;</span> + <span class="str">&quot;\U00010030&quot;</span>;
</code></pre>
<p>
Note that the <span class="code cs1"><span class="sym">Length</span></span> property
of the resulting string indeed has a value of 8, while it contains only 4 Unicode characters.
So the appropriate way of accessing the actual code points of an arbitrary string
is to make use of
<span class="code">System.Globalization.TextElementEnumerator</span>
rather than simply accessing
<span class="code cs1"><span class="ob">Char</span></span> objects naively.
It goes like this:
</p>
<pre class="code"><code class="csharpnet"><span class="cmnt">// using System.Globalization;</span>
<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator.aspx" target="_blank" rel="nofollow">TextElementEnumerator</a> <span class="type">en</span> = <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx" target="_blank" rel="nofollow">StringInfo</a>.<span class="type">GetTextElementEnumerator</span>(<span class="type">amnisos</span>);
<span class="kwd builtin">while</span> (<span class="type">en.MoveNext</span>())
{
  <span class="kwd builtin">string</span> <span class="type">current</span> = <span class="type">en.GetTextElement</span>();
  <span class="kwd builtin">if</span> (<a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.char.aspx" target="_blank" rel="nofollow">Char</a>.<span class="type">IsSurrogate</span>(<span class="type">current</span>, <span class="num">0</span>))
  {
    <span class="cmnt">// a surrogate pair encoding one character, i.e. current.Length == 2</span>
    <span class="kwd builtin">int</span> <span class="type">codepoint</span> = <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.char.aspx" target="_blank" rel="nofollow">Char</a>.<span class="type">ConvertToUtf32</span>(<span class="type">current</span>[<span class="num">0</span>], <span class="type">current</span>[<span class="num">1</span>]);
    <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.console.aspx" target="_blank" rel="nofollow">Console</a>.<span class="type">WriteLine</span>(<span class="str">&quot;U+{0:X6}&quot;</span>, <span class="type">codepoint</span>);
  }
  <span class="kwd builtin">else</span>
  {
    <span class="cmnt">// characters within BMP:</span>
    <span class="cmnt">// current.Length &gt; 1 may be true in case of combining characters </span>
    <span class="cmnt">// cf. StringInfo.ParseCombiningCharacters()</span>
    <span class="kwd builtin">foreach</span> (<span class="kwd builtin">char</span> <span class="type">c</span> <span class="kwd builtin">in</span> <span class="type">current</span>)
    {
      <span class="kwd builtin">int</span> <span class="type">codepoint</span> = (<span class="kwd builtin">int</span>)<span class="type">c</span>; <span class="cmnt">// each char of the element in turn; use AscW() in VB.NET</span>
      <a class="kwd def" href="http://msdn.microsoft.com/en-us/library/system.console.aspx" target="_blank" rel="nofollow">Console</a>.<span class="type">WriteLine</span>(<span class="str">&quot;U+{0:X4}&quot;</span>, <span class="type">codepoint</span>);
    }
  }
}
</code></pre>
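<p>
If only the code points are of interest and combining sequences need not be grouped, a plain index loop may do as well.
The following is a minimal sketch under the assumption that the string is well-formed UTF-16
(<code>Char.ConvertToUtf32</code> throws on unpaired surrogates); <code>CodePointDemo</code> is a made-up name.
</p>
<pre class="code"><code class="csharpnet">using System;

static class CodePointDemo
{
    static void Main()
    {
        string amnisos = &quot;\U00010000\U00010016\U0001001B\U00010030&quot;;

        for (int i = 0; i &lt; amnisos.Length; i++)
        {
            // Reads one code point, combining a surrogate pair if there is one at index i.
            int codepoint = char.ConvertToUtf32(amnisos, i);

            // A pair occupies two Char objects, so step over its low half.
            if (char.IsHighSurrogate(amnisos[i]))
            {
                i++;
            }

            Console.WriteLine(&quot;U+{0:X4}&quot;, codepoint);
        }
        // prints U+10000, U+10016, U+1001B and U+10030, one per line
    }
}
</code></pre>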
<p>
Now, when will we finally be able to register Linear B domain names? ;)
</p>
]]>
  
  </content>
</entry>

</feed>
