Chars and Codes - All About Encoding

What does "little-endian" actually mean?

Updated July 02, 2014

In other articles here at About Visual Basic, I've pointed out that a String is almost the same thing as an array of Char data types. But what's a Char? Here's Microsoft's definition:

"An unsigned 16-bit (2-byte) code point ranging in value from 0 through 65535. Each code point, or character code, represents a single Unicode character."

Like wheels within wheels, each new Microsoft definition seems to simply introduce new questions. Two new terms used in this one are code point and Unicode.

A code point is just what the name suggests. Each one of those 64K possible integers is a code point. Visual Basic 4 and higher use the Unicode standard to assign each of them to a character. Computers have needed a mapping like this since they were invented. ASCII remains a very popular one. Unicode is the most "standard" one today, and it is a complete superset of ASCII as well.

Microsoft's explanation of what is in Unicode is pretty good, however:

"The first 128 code points (0–127) of Unicode correspond to the letters and symbols on a standard U.S. keyboard. These first 128 code points are the same as those the ASCII character set defines. The second 128 code points (128–255) represent special characters, such as Latin-based alphabet letters, accents, currency symbols, and fractions. Unicode uses the remaining code points (256-65535) for a wide variety of symbols, including worldwide textual characters, diacritics, and mathematical and technical symbols."
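To make those three ranges concrete, here is a minimal console sketch that sorts a few characters into them. The module name and the DescribeRange helper are my own hypothetical names, not part of VB.NET. AscW and ChrW are the standard Visual Basic functions for converting between a Char and its numeric code.

```vb
Imports System
Imports Microsoft.VisualBasic

Module CodePointRanges
    ' Hypothetical helper: names the range from the quote above that a Char falls into.
    Function DescribeRange(ByVal c As Char) As String
        Dim codePoint As Integer = AscW(c)
        If codePoint <= 127 Then
            Return "ASCII (0-127)"
        ElseIf codePoint <= 255 Then
            Return "second 128 (128-255)"
        Else
            Return "remaining (256-65535)"
        End If
    End Function

    Sub Main()
        ' "A", e with an acute accent, and the Greek capital letter Pi.
        For Each c As Char In New Char() {"A"c, ChrW(&HE9), ChrW(&H3A0)}
            Console.WriteLine("U+{0:X4} -> {1}", AscW(c), DescribeRange(c))
        Next
        ' U+0041 -> ASCII (0-127)
        ' U+00E9 -> second 128 (128-255)
        ' U+03A0 -> remaining (256-65535)
    End Sub
End Module
```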

(Back in the day, I recall having to memorize Univac Fielddata codes so that I could write Fortran programs using a Univac 1108 computer. Having one "standard" set of code points that everybody uses avoids this kind of inefficient use of brain cells.)

You may also see the term "glyph" if you're reading about this subject. Characters are different from glyphs. A glyph is the actual shape displayed or printed. So, for example, the letter "a" can be displayed in different font faces, each of which is a different glyph. A glyph can also represent more than one character. The character pairs "ae" and "fi" are often represented by a single glyph. Code heads call this a ligature. And it gets deeper. There are also "graphemes" and "contextual forms" ... this way lies madness. And since this is a programming site rather than one about linguistics, I'll refer you to www.unicode.org for the authoritative word.

Enough of this verbiage. I feel a need for some code!

Let's look at a single Unicode character in VB.NET. To make it a little different, we'll use the code point for the Greek letter Pi. This code does the job:

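' Runs inside a form that has InputChar, CodeLength and CodeDisplay text controls.
' Build a one-character string from the code point for Pi.
Dim StringToCheck As String
StringToCheck = ChrW(&H3A0)
' Show the code point as a hexadecimal string.
InputChar.Text = Hex(&H3A0)
Dim myUnicode As System.Text.Encoding = _
   System.Text.Encoding.Unicode
' Convert the string into its individual bytes.
Dim myUnicode_Bytes As Byte() = _
   myUnicode.GetBytes(StringToCheck)
CodeLength.Text = myUnicode_Bytes.Length.ToString()
CodeDisplay.Text = ""
' Append each byte as two hex digits.
For I As Integer = 0 To myUnicode_Bytes.Length - 1
   CodeDisplay.Text &= _
   System.String.Format("{0:x2} ", _
   myUnicode_Bytes(I))
Next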

Notice that we had to use the ChrW function to specify the code point for Pi. Because Chr works with single-byte character codes, only values from 0 through 255 are valid. If you use a value outside this range, you get an ArgumentException. The Hex function just gives me the string representation of the hexadecimal number to display.
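If you want to convince yourself that ChrW, AscW and Hex all agree about Pi, a small round trip does it. This is only a sketch. The module name and the CodePointHex helper are hypothetical names I made up for the illustration.

```vb
Imports System
Imports Microsoft.VisualBasic

Module PiRoundTrip
    ' Hypothetical helper: returns a Char's code point as a hex string.
    Function CodePointHex(ByVal c As Char) As String
        Return Hex(AscW(c))
    End Function

    Sub Main()
        Dim piChar As Char = ChrW(&H3A0)
        Console.WriteLine(AscW(piChar))                  ' 928, the decimal code point
        Console.WriteLine(CodePointHex(piChar))          ' 3A0, the same value in hex
        Console.WriteLine(ChrW(AscW(piChar)) = piChar)   ' True, the round trip is lossless
    End Sub
End Module
```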

The fun part begins when I declare a System.Text.Encoding variable and set it to System.Text.Encoding.Unicode. This object has a lot of methods and properties that make handling Unicode (or other encodings, more about this just a little later) powerful and flexible in VB.NET. One is GetBytes, which returns a Byte array containing the individual bytes of a string. Using the length of this array, I display the elements, one byte at a time, in the For-Next loop.

When you look at the displayed results, however, there is one very significant difference:

Notice that the bytes have been reversed from the two-byte hex number used as input. The input is 03A0, but the bytes are displayed as a0 03.

This is an example of one of the most colorfully named subjects in software, "big-endian" versus "little-endian". Again, quoting Microsoft,

"The two bytes of an encoded character are stored in either little-endian or big-endian byte order depending on the computer architecture. In big-endian architectures the most significant byte is written and read first, while in little-endian architectures the least significant byte is written and read first."

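The "least significant" byte is "A0". That's in myUnicode_Bytes(0), so it was written and read first. That makes this encoding "little-endian". In other words, the "little end" is first. (In .NET, System.Text.Encoding.Unicode always writes the bytes in little-endian order, whatever the hardware. System.Text.Encoding.BigEndianUnicode writes them the other way around.)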
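To see both byte orders side by side, here is a minimal console sketch that encodes Pi with the two UTF-16 encodings .NET provides. The module name and the ToHexString helper are hypothetical. Encoding.Unicode, Encoding.BigEndianUnicode and BitConverter.IsLittleEndian are standard framework members.

```vb
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module EndianDemo
    ' Hypothetical helper: renders bytes as lowercase two-digit hex, space separated.
    Function ToHexString(ByVal bytes As Byte()) As String
        Dim sb As New StringBuilder()
        For Each b As Byte In bytes
            If sb.Length > 0 Then sb.Append(" "c)
            sb.Append(b.ToString("x2"))
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Dim s As String = ChrW(&H3A0)
        Console.WriteLine(ToHexString(Encoding.Unicode.GetBytes(s)))           ' a0 03
        Console.WriteLine(ToHexString(Encoding.BigEndianUnicode.GetBytes(s)))  ' 03 a0
        ' Reports the byte order of the machine itself, which depends on the hardware.
        Console.WriteLine(BitConverter.IsLittleEndian)
    End Sub
End Module
```

The last line is there only for comparison. It describes the processor, while the first two lines depend only on which encoding object you ask for.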

You may also run into the term Byte Order Mark (BOM). This is just the code point U+FEFF written at the very start of the text. If the first two bytes read FE FF, the encoding is big-endian. If they read FF FE, it is little-endian.
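You can ask an encoding object for its byte order mark with the GetPreamble method. Here is a short sketch under the same assumptions as before. The module name and the PreambleHex helper are hypothetical names of mine.

```vb
Imports System
Imports System.Text

Module BomDemo
    ' Hypothetical helper: returns an encoding's byte order mark as hex, e.g. "FF-FE".
    Function PreambleHex(ByVal enc As Encoding) As String
        Return BitConverter.ToString(enc.GetPreamble())
    End Function

    Sub Main()
        Console.WriteLine(PreambleHex(Encoding.Unicode))           ' FF-FE (little-endian)
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode))  ' FE-FF (big-endian)
    End Sub
End Module
```

Note that GetBytes does not add these bytes to its result. The preamble is normally written once, at the start of a file or stream.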

All this stuff was named when nobody paid any attention to what programmers did, before the MBAs took over. Today, it would be called "PowerByte" or something like that.

On the next page, we look at some of the other VB.NET functions that you'll need to do this kind of work.


©2014 About.com. All rights reserved.