If computers are just ones and zeroes, how are you reading this? Computers don't understand letters. Everything you read on these life-cursing devices has been turned into numbers.
Let's think about how this works. Imagine you want to write me a message, but we've agreed to communicate only via (hexadecimal) numbers. We decide to map letters to numbers. After much deliberation, we agree 41[1] will represent A.
In the real world, we can spit this into a .txt file and observe the results:
$ printf '\x41' > lovely-message.txt
$ cat lovely-message.txt
A%
It works! That's because most computers already interpret 0x41 as A due to Unicode, the world's de facto text encoding standard in 2025.
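If shell commands aren't your thing, a minimal Python sketch shows the same round trip (any recent Python 3 will do):

```python
# The byte 0x41, decoded as UTF-8, comes out as the letter A.
raw = bytes([0x41])
print(raw.decode("utf-8"))  # A

# chr() maps a number straight to its character, no file required.
print(chr(0x41))            # A
```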
Numbers go in, letters come out. We can also work in the other direction:
The box above is interactive, so play around with various characters and see if you can find some interesting results. Some ideas:
- English text (Hello, World!)
- Uppercase and lowercase characters (look at how HeLlO wOrLd is different to hElLo WoRlD)
- Non-English text, e.g. Japanese (こんにちは世界)
- Emoji (👋🌍)
Unicode is a successful standard because it solves the problem of everyone having to agree on what characters should become what numbers. Yes, this used to be a big problem. But it also triggers its own follow-up questions, such as: why does A get converted into a single number (0x41) but À ends up as two (0xC3 0x80)? And how come we can squidge all of Hello, World! into 13 bytes, but a single 👩🏽‍🏫 emoji takes up 15 bytes[2] just by itself?
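Those counts are easy to confirm with Python's built-in UTF-8 encoder; the why comes later:

```python
# Confirm the byte counts for each string using Python's UTF-8 encoder.
for text in ["A", "À", "Hello, World!", "👩🏽‍🏫"]:
    encoded = text.encode("utf-8")
    print(f"{text!r}: {len(encoded)} bytes -> {encoded.hex(' ')}")
# A: 1 byte. À: 2 bytes (0xC3 0x80). Hello, World!: 13. The teacher: 15.
```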
We'll get to it.
One Giant Dictionary
Let's start with what Unicode calls a code point, which at its most straightforward represents a single character in the overall Unicode space. Code points are the core building blocks of Unicode, and conceptually represent the letters, numbers and punctuation of most of the world's languages.
Unicode has been designed to be big. To feel the breadth of all this, play with this randomised display of code points from just 0.08% of the overall Unicode space:
You'll notice the characters above have all been rendered as glyphs: as in, you're looking at them. Unicode does not actually concern itself with how things are displayed. Some code points aren't even designed to be 'seen' in the way that a typical user would expect. Other things you might see on your screen come from smooshing two (or more) code points together: the Union Jack flag is actually both U+1F1EC and U+1F1E7, which will display as 🇬 and 🇧 when separate but 🇬🇧 when adjacent.
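You can see the smooshing for yourself in a couple of lines of Python (whether you get a flag or two boxed letters depends on the fonts wherever you run it):

```python
# Regional indicator symbols G (U+1F1EC) and B (U+1F1E7).
g = "\U0001F1EC"
b = "\U0001F1E7"
print(g, b)   # printed apart: 🇬 🇧
print(g + b)  # printed adjacent: most platforms render 🇬🇧
```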
The Unicode spec handily defines what you or I expect to be a single character as a user-perceived character, because the term 'character' is massively overloaded. The formal way of defining a user-perceived character in Unicode is a grapheme cluster, which is a term we'll start using. By this point you're pretty deep into the annexes of the specification: you're looking for UAX #29.
1,114,112 possible code points are available[3], of which 154,998 have been used as of Unicode version 16.0. Unicode partitions its space into seventeen planes numbered 0 through 16, with each plane supporting 65,536 code points. Plane 0 is known as the Basic Multilingual Plane (BMP), and contains the majority of our languages. Planes 1 through 16 are defined as the Supplementary Planes. 10 of the supplementary planes are currently completely unused.
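The arithmetic is pleasingly tidy: seventeen planes of 65,536 code points each gives the full code space, and a code point's plane is just its value divided by 0x10000. A quick sketch:

```python
# 17 planes x 65,536 code points per plane = the whole Unicode code space.
print(17 * 65_536)  # 1114112

# A code point's plane is its value divided by 0x10000.
for char in ["A", "é", "👩"]:
    cp = ord(char)
    print(f"{char} is U+{cp:04X}, plane {cp // 0x10000}")
# A and é sit in Plane 0 (the BMP); 👩 lives up in Plane 1.
```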
If you're reading this, it's likely most characters you'll encounter on a day-to-day basis feature in the Basic Multilingual Plane. The basic Latin character set used for the English language is nestled right near the start of the BMP. In fact, mapping 0x41 to A originally came from the formerly dominant ASCII standard, which was mostly fine if you were looking to encode English and a pain if you weren't. In order to smooth overall adoption when it came to rolling out Unicode, ASCII's encodings became a subset of Unicode: 0x41 in ASCII is U+0041 in Unicode. This was a smart design decision.
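A happy consequence of that subset relationship is that a pure-ASCII byte stream is already valid UTF-8. A small sketch, assuming nothing beyond the standard library:

```python
# The same byte means the same thing under both codecs.
data = bytes([0x41])
print(data.decode("ascii"))   # A
print(data.decode("utf-8"))   # A, identical
print(ord("A") == 0x41)       # True: ASCII's 0x41 is Unicode's U+0041
```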
The problem with ASCII, it turned out, was there was so much language outside of the limits of what it could recognise: its 7-bit space maxed out at a meagre 128 characters. Comprehending the sheer amount of work and effort to accommodate all of the planet's languages can actually be particularly tricky for native English speakers (typers?) to wrap their heads around, mostly because we've never had to actively think about it for the majority of our computing lives: ASCII was literally built just for us.
If you visualise just the basic multilingual plane you can isolate the non-control characters of the ASCII block to see just how small it is in the overall space:
So, to be specific, one or more Unicode code points defines the grapheme cluster representation of what the user perceives as a character, and its glyph is how it looks on their screen. A grapheme cluster could be A or é or even 👩‍🚀. It might seem somewhat obvious but it's worth hammering that home: the glyphs for hello can look different when rendered with a sans-serif font than when hello is rendered with a serif one.
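Counting grapheme clusters takes more than len(): Python's standard library doesn't segment them, but the third-party regex module (one assumption here; the grapheme package would also work) implements the UAX #29 rules behind its \X pattern:

```python
# pip install regex -- the stdlib "re" module doesn't understand \X.
import regex

astronaut = "\U0001F469\u200D\U0001F680"       # woman + ZWJ + rocket = 👩‍🚀
print(len(astronaut))                          # 3 code points
print(len(regex.findall(r"\X", astronaut)))    # 1 grapheme cluster
```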
Onward, to Bytes!
The next step is to take those code points and encode them into sequences of bytes, so we can do things like save them as documents or send them down internet pipes. Unicode code points can be encoded in three delicious flavours: UTF-32, UTF-16 and UTF-8. Which one should you use?
Spoiler: it's UTF-8.
UTF-8, UTF-16 and UTF-32 are named based on the minimum number of bits needed to encode a code point: 8, 16 or 32, respectively. This means UTF-32 is theoretically the simplest, as every single code point uses exactly four bytes, but also the most space inefficient. Do not be lulled into a false sense of security that UTF-32 is guaranteed to be straightforward, either, as we've already seen that code points do not have a 1:1 mapping with what's displayed on the screen due to grapheme clusters - remember 🇬🇧?
Due to their ASCII heritage[4] and position at the very beginning of the code space, Basic Latin characters don't need more than one byte to be encoded in UTF-8. Both UTF-16 and UTF-8 are variable in their length, with UTF-16 being two bytes for anything in the Basic Multilingual Plane and four bytes for everything else. UTF-16 can be the most space efficient for some Asian languages. UTF-8 goes even further, and encodes code points in anywhere between one and four bytes.
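To get a feel for that trade-off, here's a rough comparison in Python (utf-16-le is used so the byte counts don't include a byte order mark):

```python
# English text favours UTF-8; Japanese kana favour UTF-16.
for text in ["Hello", "こんにちは"]:
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))
    print(f"{text}: {utf8} bytes in UTF-8, {utf16} bytes in UTF-16")
# Hello: 5 vs 10. こんにちは: 15 vs 10.
```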
Anyway, fast forward a couple of decades and UTF-8 emerged as the clear favourite. 98% of websites now use UTF-8. Due to the success of UTF-8 and its ubiquity, most of us can go about our days without having to contemplate it in the slightest. I'm going to totally gloss over UTF-16 and UTF-32 for the rest of this article, which at least saves a discussion about endianness, and I can do this so flippantly exactly because UTF-8 is so popular.
Fun fact: JavaScript encodes strings as UTF-16. This isn't meant as some kind of 'lol JavaScript' remark, I just think it's interesting.
Onto the actual encoding itself. UTF-8 is a variable-width encoding, so the number of bytes you'll need depends on the code point you're looking to encode:
| Required Bytes | Start Code Point | End Code Point |
| --- | --- | --- |
| 1 | U+0000 | U+007F |
| 2 | U+0080 | U+07FF |
| 3 | U+0800 | U+FFFF |
| 4 | U+10000 | U+10FFFF |
You can break it down like this:
- Single byte characters cover ASCII (0-127)
- Two byte characters cover most Latin-script alphabets and some other scripts
- Three byte characters cover the rest of the Basic Multilingual Plane (BMP)
- Four byte characters cover all supplementary planes, including emoji and rare historical scripts
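In code, that table boils down to three range checks. A minimal sketch (the helper name utf8_length is my own, not anything standard):

```python
def utf8_length(code_point: int) -> int:
    """How many bytes UTF-8 needs to encode a single code point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4  # everything up to U+10FFFF

# The hand-rolled answer matches Python's real encoder for each character.
for char in ["A", "Θ", "읥", "😂"]:
    print(char, utf8_length(ord(char)), len(char.encode("utf-8")))
```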
For an example, let's start with U+1F602, more commonly identified as the Face With Tears of Joy emoji. That code point will get encoded as four bytes in UTF-8. In the buttons below you'll also see 읥, Θ and A - which are encoded as three, two and one byte, respectively.
You'll also notice that some extra bits sneak in. These clever little bits contain valuable additional information for anything decoding your UTF-8.
We start with the code point represented as binary - the 1F602 in U+1F602 - and distribute those bits across the necessary number of bytes as per the encoding rules. These are colour coded so you can see where they all end up. As for those other, weird extra bits (there's a sketch after the list below that shuffles them into place by hand):
- Two, three and four byte UTF-8 encodings start with two, three or four 1s, so that whatever ends up decoding this knows how many bytes to read and parse. A one-byte encoding starts with a 0 to ensure ASCII compatibility.
- The second, third or fourth bytes of a UTF-8 encoding start with 10, so you know that they're following on from something else. Notice how it's not possible for a valid first byte to ever start with 10, which is very clever. If you're ever parsing UTF-8 and you stumble upon an unexpected byte starting with 10, you can know straight away that something's got jumbled up.
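Here's that bit-shuffling done by hand for U+1F602 (the four-byte case only; in real life you'd let your language's encoder do this for you):

```python
# Four-byte UTF-8 pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
cp = 0x1F602  # Face With Tears of Joy

hand_rolled = bytes([
    0b11110000 | (cp >> 18),           # leading byte: four 1s, then the top 3 bits
    0b10000000 | ((cp >> 12) & 0x3F),  # continuation byte: 10 + next 6 bits
    0b10000000 | ((cp >> 6) & 0x3F),   # continuation byte: 10 + next 6 bits
    0b10000000 | (cp & 0x3F),          # continuation byte: 10 + final 6 bits
])

print(hand_rolled.hex(" "))                 # f0 9f 98 82
print(hand_rolled == "😂".encode("utf-8"))  # True
```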
Wrapping Up
Now we can come back to our question that we know doesn't have the most straightforward answer: why is 👩🏽‍🏫 15 bytes in UTF-8?
We know what we're seeing visually is a glyph, and those will look different based on platform. Microsoft's FluentUI renders it with its own distinct artwork, for example, which may or may not look the same as the glyph above - depending on where you're viewing this.
We also know that we should generally think in terms of grapheme clusters rather than individual code points, which is especially true for Woman Teacher: Medium Skin Tone: the Unicode information which represents this single 'user-perceived' character is made up of the following four Unicode code points:
- 👩 U+1F469: The base emoji character for "Woman"
- 🏽 U+1F3FD: The modifier for "Medium Skin Tone"
- U+200D: The Zero Width Joiner (ZWJ), which makes it explicit that the previous code point is connected to the next one, which might not be interpretable on its own. This one doesn't have its own glyph information associated with it on my version of macOS, either.
- 🏫 U+1F3EB: The emoji for "School"
Put that together, and the UTF-8 encoding for this one fabulous 👩🏽‍🏫 emoji will end up as 15 bytes made up of 👩 = 0xF0 0x9F 0x91 0xA9, 🏽 = 0xF0 0x9F 0x8F 0xBD, ZWJ = 0xE2 0x80 0x8D and 🏫 = 0xF0 0x9F 0x8F 0xAB.
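A few lines of Python will reproduce that tally, encoding each code point in the sequence on its own and then all together:

```python
# Build the teacher from her constituent code points and count the bytes.
parts = ["\U0001F469", "\U0001F3FD", "\u200D", "\U0001F3EB"]  # 👩 🏽 ZWJ 🏫
for cp in parts:
    print(f"U+{ord(cp):04X} -> {cp.encode('utf-8').hex(' ')}")
print(len("".join(parts).encode("utf-8")))  # 15
```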
UTF-8 is probably one of the most successful standards ever produced. This little journey has already taken us to some of the interesting spots, and hopefully made you appreciate what a fascinating accomplishment it is to have one globally-accepted standard to cover all of the world's languages.
1. That's 65 in decimal, but we'll jump straight into hexadecimal because it's so commonplace when we start to go to the lower levels. I'll also refer to hexadecimal numbers with an 0x prefix going forward, so this would be 0x41. ↩︎
2. This brings us to a classic Unicode sort-of-problem: the string length of 👩🏽‍🏫 can fluctuate across languages. Python 3.13 and Ruby 3.2 both say it's 4. Go 1.23 and Rust 1.83 both say 15. Node 20.15 says 7. There are perfectly logical and reasonable explanations for all of these results, and how each language iterates a string will probably be more relevant for your day-to-day sanity. ↩︎
3. In the spec, Unicode favours thinking of this in its hexadecimal form of 0x110000. ↩︎
4. ASCII was actually conceived in the 1960s as a 7-bit encoding, partially because the teletypes and teleprinters of the time were also 7-bit. I love this because, from a 2025 perspective, it's fascinating to think of addressing blocks of memory in anything less than a byte at a time. ↩︎