Unicode Utility


Problem

The Unicode standard governs how modern software handles textual data, including characters from all of the world's natural languages, as well as thousands of emojis and other pictorial symbols, over 110,000 characters in all.

The representation of these characters as sequences of bytes is governed by one of several encoding schemes, and unfortunately it's sometimes necessary to deal with the differences between them. Although databases such as MongoDB and MySQL typically store text as UTF-8, several important programming languages, including JavaScript and Java, represent strings as UTF-16.
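To make the difference concrete, here is a minimal sketch that compares the size of the same text in both encodings. The function name encodedSizes is made up for this illustration. It relies on the standard TextEncoder global (available in browsers and Node.js), which always produces UTF-8, and on the fact that a JavaScript string's length counts UTF-16 code units.

```javascript
// Hypothetical helper: report how many bytes a string needs in each encoding.
function encodedSizes(text) {
  // TextEncoder always encodes to UTF-8.
  const utf8Bytes = new TextEncoder().encode(text).length;
  // String.length counts UTF-16 code units, and each unit is 2 bytes wide.
  const utf16Bytes = text.length * 2;
  return { utf8Bytes, utf16Bytes };
}

console.log(encodedSizes("M"));         // { utf8Bytes: 1, utf16Bytes: 2 }
console.log(encodedSizes("\u20AC"));    // euro sign: { utf8Bytes: 3, utf16Bytes: 2 }
console.log(encodedSizes("\u{1F603}")); // smiley: { utf8Bytes: 4, utf16Bytes: 4 }
```

Neither encoding is uniformly smaller: ASCII text is more compact in UTF-8, while many non-Latin characters take fewer bytes in UTF-16.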

A further complication is that some emojis, such as ❤️, are composed of two adjacent Unicode code points: in this case U+2764 followed by the variation selector U+FE0F. (That's a red heart. If your browser isn't fully compliant, you might see it as a black heart, ❤.)
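A short sketch of how to see this for yourself in JavaScript. The heart is written with escape sequences so that the second code point can't be lost when copying the snippet; the variable names are illustrative only.

```javascript
// The red heart: U+2764 followed by U+FE0F.
const redHeart = "\u2764\uFE0F";

// Spreading a string iterates by code point rather than by UTF-16 code unit.
const codePoints = [...redHeart].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codePoints);      // [ 'U+2764', 'U+FE0F' ]
console.log(redHeart.length); // 2, even though only one symbol is displayed
```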

Demo

Luckily, the Unicode standard fully specifies how all of this works, and the rules are simple to implement in code.

This demo uses a small JavaScript library to exercise these rules, accepting an arbitrary string of text as input and then breaking it down into its constituent characters, showing how each of them is represented in both UTF-8 and UTF-16.
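The breakdown described above can be sketched roughly as follows. This is not the API of the library used by the demo; the names toUtf8, toUtf16 and breakDown are invented for this example, and the input is assumed to be well-formed text with no unpaired surrogates.

```javascript
// Encode one code point as UTF-8 bytes (1 to 4 bytes depending on its size).
function toUtf8(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Encode one code point as one UTF-16 code unit, or as a surrogate pair.
function toUtf16(cp) {
  if (cp < 0x10000) return [cp];
  const offset = cp - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

// Format a number as zero-padded uppercase hexadecimal.
const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");

// Break a string into code points and describe each one.
function breakDown(text) {
  return [...text].map((ch) => {
    const cp = ch.codePointAt(0);
    return {
      char: ch,
      codePoint: "U+" + hex(cp, 4),
      utf8: toUtf8(cp).map((b) => hex(b, 2)).join(" "),
      utf16: toUtf16(cp).map((u) => hex(u, 4)).join(" "),
    };
  });
}

console.log(breakDown("M\u{1F603}"));
// M:      codePoint U+004D,  utf8 "4D",          utf16 "004D"
// smiley: codePoint U+1F603, utf8 "F0 9F 98 83", utf16 "D83D DE03"
```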

You can also enter numeric values if you select the "Numeric" checkbox. The following formats are accepted (each example denotes the letter M):

  • 77
  • 0x4D
  • \u004D
  • U+004D
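One way to recognize these four formats is sketched below. The function name parseNumericToken is hypothetical, and the details (bare digits read as decimal, hexadecimal digits accepted in either case) are assumptions made for this example rather than documented behavior of the demo.

```javascript
// Parse one numeric token into a number, or return null if no format matches.
function parseNumericToken(token) {
  // \u004D, U+004D and 0x4D are all hexadecimal; only the prefix differs.
  const hexMatch = /^(?:\\u|U\+|0x)([0-9A-Fa-f]+)$/.exec(token);
  if (hexMatch) return parseInt(hexMatch[1], 16);
  // Assumption: a bare run of digits is read as decimal.
  if (/^[0-9]+$/.test(token)) return parseInt(token, 10);
  return null;
}

for (const token of ["77", "0x4D", "\\u004D", "U+004D"]) {
  // Every token resolves to 77, which is the letter M.
  console.log(token, "->", String.fromCodePoint(parseNumericToken(token)));
}
```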

These numeric values can be strung together, to form:

  • UTF-16 surrogate pairs, such as "\uD83D\uDE03" (😃)
  • Multi-codepoint sequences like ❤️ above: "U+2764U+FE0F"
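A rough sketch of how such a string of values could be split up and turned back into text. The names parseSequence and combineSurrogates are invented for this illustration, and only the \u and U+ forms shown in the two examples are handled. A high surrogate (D800 to DBFF) followed by a low surrogate (DC00 to DFFF) is merged into a single code point; everything else is kept as it is.

```javascript
// Split a string such as "\\uD83D\\uDE03" or "U+2764U+FE0F" into numbers.
function parseSequence(input) {
  const values = [];
  // \u takes exactly 4 hex digits; U+ takes 4 to 6.
  const pattern = /\\u([0-9A-Fa-f]{4})|U\+([0-9A-Fa-f]{4,6})/g;
  for (const match of input.matchAll(pattern)) {
    values.push(parseInt(match[1] || match[2], 16));
  }
  return values;
}

// Merge each high/low surrogate pair into the code point it encodes.
function combineSurrogates(values) {
  const codePoints = [];
  for (let i = 0; i < values.length; i++) {
    const hi = values[i];
    const lo = values[i + 1];
    const isPair =
      hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff;
    if (isPair) {
      codePoints.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
      i++; // the low surrogate has been consumed
    } else {
      codePoints.push(hi);
    }
  }
  return codePoints;
}

const smiley = combineSurrogates(parseSequence("\\uD83D\\uDE03"));
console.log(smiley.map((cp) => cp.toString(16))); // [ '1f603' ]

const heart = combineSurrogates(parseSequence("U+2764U+FE0F"));
console.log(heart.map((cp) => cp.toString(16))); // [ '2764', 'fe0f' ]
console.log(String.fromCodePoint(...heart));     // the red heart
```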

Source Code

unicode-utils-js on GitHub.
