Last updated at Thu, 17 Aug 2017 23:53:42 GMT

A short, mostly-accurate history of character encodings

In the beginning, when you wanted to use a computer to store text, there were not many options - you inherited something from punchcards like EBCDIC or invented something convenient and unique to your system. Computers did not need to talk to each other, so there was not much point in standardizing between vendors. Things were pretty simple.

Then came the need for computers and vendors to interoperate and communicate. Thus, ASCII and later some of its cousins like ISCII, VISCII, and YUSCII were born. Now, large islands of computers, teletypes, modems, etc. could communicate with each other. ASCII only defined the lower 7 bits of each byte, so non-English encodings frequently used the upper half of the 8-bit range to define extra language-specific characters. Things were a little confusing on the edges, but largely compatible, as long as you stuck to the lower 7 bits.

To deal with other languages, MS-DOS, OS/2, and 16-bit Windows used a concept called code pages to define how to display characters from different languages. This was somewhat consistent as long as you stuck to Microsoft products (they were not a real standard), and as long as you did not need to look at more than one language at a time. Of course, there was still a limit on the number of characters one can represent with the one extra bit left over above ASCII: just 128 more. Various multi-byte 'encodings' like Big5 and Shift JIS were invented to represent East Asian languages like Chinese and Japanese, which can have thousands of characters. These were all largely incompatible and still did not work if one wanted to use more than one language. The term 'mojibake' describes some of the hilarious results that can occur when text in one encoding is interpreted as another. Things started getting really confusing.
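
To make mojibake concrete, here is a minimal Python sketch. The sample text and the choice of code pages (Shift JIS on the sending side, the old DOS code page 437 on the receiving side) are assumptions picked purely for illustration.

```python
# Illustrative only: the sample text and both encodings are assumptions.
original = "プロトタイプ"  # Japanese for "prototype", 6 characters

# The sender stores the text as Shift JIS: two bytes per character here.
raw = original.encode("shift_jis")

# The receiver wrongly assumes a single-byte DOS code page, so every
# byte turns into its own unrelated character.
garbled = raw.decode("cp437")

# The bytes themselves are undamaged; only the interpretation is wrong.
recovered = raw.decode("shift_jis")

print(len(original), len(raw), len(garbled))
print(garbled)
print(recovered == original)
```

Nothing is lost on the wire in this case. The damage comes entirely from guessing the wrong encoding, which is why the same file can look fine on one machine and like noise on another.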

Eventually, a new standard called Unicode was born to deal with the mess of these incompatible language encodings. It defined the concept of code points, which are numerical representations of all the characters, ligatures, symbols, and other doodads from the world's languages, along with ways to store and transmit them between computers.

Fast forward to the present: if you're an Android, iOS, Mac, Linux, BSD, or Unix user, you're probably used to Unicode working out of the box. This is thanks to a simple exchange and encoding standard called UTF-8. Being fully compatible with ASCII text, it does not necessarily require a lot of special support from existing programs in order to work. This is especially true in languages like C, where a UTF-8 string looks just like a C string. Sure, concepts like 'uppercase' and 'isalpha' take on new meaning and require thousands of lines of code to get right, but it is generally possible to write a program that knows nothing about Unicode and still handles UTF-8 strings without issues.
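
The ASCII compatibility described above can be checked directly. This is a small Python sketch, not anything taken from Meterpreter; the file name is borrowed from the listing later in this post and the variable names are made up for the example.

```python
ascii_text = "Metasploit"
mixed_text = "プロトタイプ.txt"

# Pure ASCII text produces exactly the same bytes in ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

utf8 = mixed_text.encode("utf-8")

# Every byte that belongs to a non-ASCII character has its high bit set,
# so a NUL terminator, '.', '/' or '\' can never appear inside a
# multi-byte sequence. Byte-oriented C code keeps working unchanged.
ascii_part = bytes(b for b in utf8 if b < 0x80)
high_part = bytes(b for b in utf8 if b >= 0x80)

byte_length = len(utf8)       # what a strlen()-style byte count would see
char_count = len(mixed_text)  # number of Unicode code points

print(same_bytes, ascii_part, byte_length, char_count)
```

The one thing a Unicode-unaware program gets wrong here is the length: it counts bytes, not characters. For copying, comparing, or passing file names around, that usually does not matter.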

Now, let's go back to around the time that the Unicode standard was born. Prior to UTF-8 existing, there was an encoding standard called UCS-2 that used 16-bit words (the WCHAR or wide character, if you have done any Windows programming) to represent international characters. This had benefits for efficiency: finding the number of characters in a UCS-2 string is just as easy as with a traditional C string. The designers of Microsoft Windows NT decided to adopt this fledgling standard and designed the whole system to be compatible with UCS-2. Of course, to make it compatible with all the C code in the world that did not understand wide characters and still wanted to use code pages and weird encodings, Windows was also designed with a compatibility layer that supports those too. Why 2 bytes? Well, 4 was seen as too much wasted space, and after all, 65,536 characters should be enough for anyone.

As you can probably guess, it did not take long to determine that 16 bits was not actually sufficient to encode the world's character sets. Even efforts to merge common Japanese and Chinese characters, called Han Unification, did not yield enough space to represent all of the known characters (to say nothing of Emoji!).

So the Unicode standard grew, and Windows switched to a newer encoding, UTF-16, which allows a pair of 16-bit words (a surrogate pair) to represent the extended characters. Unfortunately, UTF-16 has none of the speed benefits of UCS-2 and none of the compatibility or size benefits of UTF-8. Effectively, internationalization in Windows is based on a mixture of different encoding standards, old, new and transitional, where programs can opt in or out of seeing all characters by explicitly enabling Unicode, or only some by living in an ASCII or code-paged world.
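
For the curious, this is roughly how UTF-16 squeezes a character beyond the 16-bit range into two 16-bit words. The arithmetic is the standard surrogate-pair scheme; the function names are made up for this sketch.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000    # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)   # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go in the low surrogate
    return high, low


def utf16_code_units(text):
    """Number of 16-bit words needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


emoji = "\U0001F600"  # a grinning face, well outside the 16-bit range
high, low = to_surrogate_pair(ord(emoji))
print(hex(high), hex(low))

# One 'A' plus one emoji: two characters, but three 16-bit words.
print(len("A" + emoji), utf16_code_units("A" + emoji))
```

This is why the UCS-2 shortcut of "number of characters equals number of 16-bit words" no longer holds: counting characters in UTF-16 means scanning the string for surrogate pairs.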

And today

Where does this tie in to the Metasploit Framework? Meterpreter now has Unicode support on Windows. Why so long in coming? Effectively, Meterpreter was originally built as an ASCII-native C program that uses byte-size characters, taking advantage of the legacy support for ASCII and code pages in Windows. Flipping the switch wholesale to UTF-16 would break a number of things. The Ruby language (which the Metasploit Framework is based on) has gone through some growing pains with Unicode as well, but its Unicode support is now stable and a first-class citizen.

Thankfully, since UTF-8 support in Ruby 2 is actually very good (some might say over-zealous!), and since Meterpreter was already dealing in C strings, switching to UTF-8 was largely painless. In fact, once the Unicode filters were turned off in the Metasploit Framework, the Posix, PHP and Python meterpreters worked largely unmodified. On Windows, some string gymnastics are applied to convert from UTF-8 on the wire to UTF-16, which Windows is now based on. The Java language's support for Unicode largely mirrors Windows (it uses UTF-16 internally), but there is a Pull Request to add support there as well.
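
The string gymnastics boil down to a decode and a re-encode at the boundary. The sketch below shows the idea in Python; it is not Meterpreter's actual code, the function names are hypothetical, and the trailing NUL handling is an assumption about what a C wide-string API would expect. In native Windows code, the Win32 functions MultiByteToWideChar and WideCharToMultiByte with the CP_UTF8 code page do the same job.

```python
def utf8_wire_to_utf16le(wire_bytes):
    """Convert UTF-8 bytes from the wire into a NUL-terminated UTF-16LE buffer."""
    text = wire_bytes.decode("utf-8")  # raises UnicodeDecodeError if malformed
    # Assumption: the consumer wants a 16-bit NUL terminator, as C wide strings do.
    return text.encode("utf-16-le") + b"\x00\x00"


def utf16le_to_utf8_wire(wide_buffer):
    """Convert a UTF-16LE buffer back to UTF-8, dropping one trailing NUL if present."""
    if len(wide_buffer) % 2 != 0:
        raise ValueError("UTF-16 buffers must have an even number of bytes")
    if wide_buffer.endswith(b"\x00\x00"):
        wide_buffer = wide_buffer[:-2]
    return wide_buffer.decode("utf-16-le").encode("utf-8")


name = "プロトタイプ.txt2"
wire = name.encode("utf-8")
wide = utf8_wire_to_utf16le(wire)
print(len(wire), len(wide))
print(utf16le_to_utf8_wire(wide) == wire)
```

Keeping UTF-8 on the wire means every other Meterpreter flavor and the Ruby side can share one representation, and only the Windows side pays the conversion cost.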

Here are some examples of its usage - it is largely transparent:

meterpreter > ls
 
 
Listing: e:\metasploit-framework\mytest\test
============================================
 
 
Mode              Size  Type  Last modified              Name
----              ----  ----  -------------              ----
40777/rwxrwxrwx   102   dir   2015-03-20 15:43:52 -0500  .
40777/rwxrwxrwx   1564  dir   2015-03-23 09:50:17 -0500  ..
100666/rw-rw-rw-  15    fil   2015-03-17 02:50:27 -0500  プロトタイプ.txt2
 
 
meterpreter > mv プロトタイプ.txt2 ʇıoʃdsɐʇǝW.txt
meterpreter > ls
 
 
Listing: e:\metasploit-framework\mytest\test
============================================
 
 
Mode              Size  Type  Last modified              Name
----              ----  ----  -------------              ----
40777/rwxrwxrwx   102   dir   2015-03-24 09:42:53 -0500  .
40777/rwxrwxrwx   1564  dir   2015-03-23 09:50:17 -0500  ..
100666/rw-rw-rw-  15    fil   2015-03-17 02:50:27 -0500  ʇıoʃdsɐʇǝW.txt

As a side note, the ls command gained a few new tricks of its own, such as viewing Windows 'Short' names for compatibility with older operating systems and MS-DOS:

meterpreter > ls -x c:\
 
 
Listing: c:\
============
 
 
Mode              Size       Type  Last modified              Short Name  Name
----              ----       ----  -------------              ----------  ----
40777/rwxrwxrwx   0          dir   2015-02-19 03:31:32 -0600  $SYSRE~1    $SysReset
40777/rwxrwxrwx   0          dir   2013-08-22 03:45:52 -0500  DOCUME~1    Documents and Settings
40555/r-xr-xr-x   0          dir   2015-03-17 04:29:10 -0500  PROGRA~1    Program Files
40555/r-xr-xr-x   0          dir   2015-03-19 10:48:07 -0500  PROGRA~2    Program Files (x86)
40777/rwxrwxrwx   0          dir   2015-03-18 08:21:01 -0500  PROGRA~3    ProgramData
40777/rwxrwxrwx   0          dir   2015-03-24 06:57:11 -0500  SYSTEM~1    System Volume Information
40777/rwxrwxrwx   0          dir   2015-03-17 04:33:11 -0500  METASP~1    metasploit

It also gained new sorting options:

meterpreter > ls -h
Usage: ls [dir] [-x] [-S] [-t] [-r]
   -x Show short file names
   -S Sort by size
   -t Sort by time modified
   -r Reverse sort order

There are still other areas of Meterpreter where we are looking to improve Unicode support, such as registry access and user enumeration, but this is a good first step.

Have fun using Unicode in Meterpreter. A big thanks to OJ, HD, @zeroSteiner, and @schierlm for contributing fixes, testing time and reviews.