UTF-8 Explained: How Computers Store Text

Understand how UTF-8 encodes every writing system into bytes, why it became the universal standard, and how encoding bugs appear in real applications.

The Problem: Characters vs Bytes

Computers store everything as bytes. ASCII, first published in 1963, defined 128 characters for English, each fitting in a single byte. Unicode assigns a unique code point to each character across the world's writing systems: U+0041 = A, U+00E9 = é, U+1F525 = 🔥. UTF-8 is an encoding that converts those code points into bytes for storage and transmission.

How UTF-8 Works

UTF-8 is variable-width: each code point is encoded in 1 to 4 bytes, depending on its value.

ASCII characters (A-Z, 0-9, punctuation) → 1 byte (identical to ASCII)
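Latin diacritics (é, ü, ñ), Greek, Cyrillic, basic Arabic and Hebrew letters → 2 bytes
CJK characters and most other scripts in the Basic Multilingual Plane (e.g. Devanagari, Thai) → 3 bytes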
Emoji and supplementary characters → 4 bytes
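
The byte counts above follow from UTF-8's bit layout: the high bits of the first byte announce how long the sequence is, and every following byte starts with the bits 10. Here is a minimal sketch of that layout, using a hypothetical helper named encodeCodePoint. It is for illustration only; real code should normally rely on the platform's TextEncoder.

```javascript
// Hypothetical helper for illustration: encode one Unicode code point as UTF-8 bytes.
// Surrogate code points (U+D800..U+DFFF) are rejected because they are not valid in UTF-8.
function encodeCodePoint(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError("not a Unicode scalar value: " + cp);
  }
  if (cp < 0x80) {
    return [cp];                         // 0xxxxxxx
  }
  if (cp < 0x800) {
    return [
      0xc0 | (cp >> 6),                  // 110xxxxx
      0x80 | (cp & 0x3f),                // 10xxxxxx
    ];
  }
  if (cp < 0x10000) {
    return [
      0xe0 | (cp >> 12),                 // 1110xxxx
      0x80 | ((cp >> 6) & 0x3f),         // 10xxxxxx
      0x80 | (cp & 0x3f),                // 10xxxxxx
    ];
  }
  return [
    0xf0 | (cp >> 18),                   // 11110xxx
    0x80 | ((cp >> 12) & 0x3f),          // 10xxxxxx
    0x80 | ((cp >> 6) & 0x3f),           // 10xxxxxx
    0x80 | (cp & 0x3f),                  // 10xxxxxx
  ];
}

// Format a byte array as space-separated hex for display.
const hex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(hex(encodeCodePoint(0x41)));    // 41          (A)
console.log(hex(encodeCodePoint(0xe9)));    // c3 a9       (é)
console.log(hex(encodeCodePoint(0x4e2d)));  // e4 b8 ad    (中)
console.log(hex(encodeCodePoint(0x1f525))); // f0 9f 94 a5 (🔥)
```

Because the sequence length is readable from the first byte, and continuation bytes can never be mistaken for a first byte, a decoder that lands in the middle of a stream can find the next character boundary by skipping bytes that start with 10.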

// JavaScript quirk (internally UTF-16, not UTF-8):
"A".length        // → 1
"é".length        // → 1
"🔥".length       // → 2 (counts UTF-16 code units, not characters!)
[..."🔥"].length  // → 1 (spread uses Unicode-aware iteration)
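
To see where the 2 comes from, it can help to look at the individual UTF-16 code units. The sketch below uses only standard string methods; countCodePoints is a hypothetical helper name chosen for this example.

```javascript
const fire = "🔥"; // U+1F525, outside the Basic Multilingual Plane

// charCodeAt reads one UTF-16 code unit at a time, so it sees two surrogates.
const codeUnits = [];
for (let i = 0; i < fire.length; i++) {
  codeUnits.push(fire.charCodeAt(i).toString(16));
}
console.log(codeUnits); // [ 'd83d', 'dd25' ]

// codePointAt combines the surrogate pair back into the full code point.
console.log(fire.codePointAt(0).toString(16)); // 1f525

// Hypothetical helper: count code points instead of code units.
// String iteration (for...of) walks code points, so a surrogate pair counts once.
function countCodePoints(text) {
  let count = 0;
  for (const _ of text) count++;
  return count;
}

console.log(countCodePoints("A🔥é")); // 3
```

Counting code points is still not the same as counting what a reader sees as one character: a letter followed by a combining accent is two code points.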

Why UTF-8 Won

ASCII compatibility: Any ASCII document is valid UTF-8 — no migration needed
Space efficiency: English text uses 1 byte per character; UTF-32 (fixed 4 bytes per code point) would quadruple the size of the same text
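Universal standard: the HTML standard and JSON (RFC 8259) specify UTF-8, and it is the dominant encoding on the web; PostgreSQL supports it natively and commonly defaults to it, depending on the locale in effect when the database cluster is created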
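
The first two points are easy to check. This is a small sketch using the standard TextEncoder, which always produces UTF-8; utf32ByteLength is a hypothetical helper that simply multiplies the number of code points by four.

```javascript
const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

// Hypothetical helper: size of a string in UTF-32, which spends 4 bytes per code point.
function utf32ByteLength(text) {
  return [...text].length * 4;
}

const english = "Hello, world!";
const utf8Bytes = encoder.encode(english);

// For pure ASCII text, every UTF-8 byte equals the character's ASCII code.
const asciiCodes = Array.from(english, (ch) => ch.charCodeAt(0));
console.log(asciiCodes.every((code, i) => code === utf8Bytes[i])); // true

console.log(utf8Bytes.length);         // 13
console.log(utf32ByteLength(english)); // 52
```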

Common Encoding Bugs

// Mojibake: reading UTF-8 as Latin-1
// é in UTF-8 = bytes 0xC3 0xA9
// Interpreted as Latin-1: "Ã©"
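
The mix-up can be reproduced in a few lines. This sketch assumes Node.js, where Buffer is available as a global; the variable names are only for illustration.

```javascript
// Assumes Node.js, where Buffer is a global.
const original = "café";
const rawBytes = Buffer.from(original, "utf8");
console.log(rawBytes); // <Buffer 63 61 66 c3 a9>

// Wrong: decode the UTF-8 bytes as Latin-1, which maps every byte to one character.
const garbled = rawBytes.toString("latin1");
console.log(garbled); // cafÃ©

// If no bytes were lost along the way, the damage is reversible:
// re-encode the garbled text as Latin-1, then decode those bytes as UTF-8.
const repaired = Buffer.from(garbled, "latin1").toString("utf8");
console.log(repaired === original); // true
```

The repair only works while the original bytes survive. Once the garbled text has been saved through a lossy conversion, for example one that replaces unknown characters with a question mark, the original cannot be recovered this way.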

-- MySQL's "utf8" is NOT full UTF-8: it is an alias for utf8mb3, which stores at most 3 bytes per character
-- Emoji (4 bytes) are rejected with an error in strict SQL mode, or truncated with a warning otherwise
-- Fix: always use utf8mb4
ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
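
Before or during such a migration, it can be useful to know which inputs would not fit in a 3-byte column. Here is a minimal sketch of an application-side check; needsFourByteUtf8 is a hypothetical helper, not part of any MySQL driver.

```javascript
// Hypothetical helper: true if the text contains any code point above U+FFFF,
// i.e. a character that takes 4 bytes in UTF-8.
function needsFourByteUtf8(text) {
  // The u flag makes the regex match whole code points rather than UTF-16 code units.
  return /[\u{10000}-\u{10FFFF}]/u.test(text);
}

console.log(needsFourByteUtf8("naïve café")); // false (1 and 2 byte characters)
console.log(needsFourByteUtf8("日本語"));      // false (3 bytes each)
console.log(needsFourByteUtf8("on fire 🔥")); // true
```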

Practical Rules

Set Content-Type: application/json; charset=utf-8 on API responses
Use utf8mb4 in MySQL/MariaDB — never utf8
Always specify the encoding when reading files: fs.readFileSync(path, 'utf8') returns a string, while omitting the encoding returns a raw Buffer
Never assume string length equals byte length when checking database field limits
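
The last rule is the easiest to get wrong, so here is a minimal sketch of measuring and trimming text by UTF-8 byte size. Both function names are hypothetical, and whether a given limit counts bytes or characters depends on the database and column type, so check that first.

```javascript
const utf8 = new TextEncoder(); // TextEncoder always encodes to UTF-8

// Number of bytes the text occupies once encoded as UTF-8.
function utf8ByteLength(text) {
  return utf8.encode(text).length;
}

// Hypothetical helper: keep whole code points until the byte budget is used up,
// so a multi-byte character is never cut in half.
function truncateToUtf8Bytes(text, maxBytes) {
  let used = 0;
  let result = "";
  for (const ch of text) { // for...of iterates by code point
    const size = utf8.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    result += ch;
  }
  return result;
}

const title = "Café 🔥";
console.log(title.length);          // 7  (UTF-16 code units)
console.log(utf8ByteLength(title)); // 10 (UTF-8 bytes)
console.log(JSON.stringify(truncateToUtf8Bytes(title, 8))); // "Café "
```

Slicing the encoded bytes directly at the limit would be shorter to write, but it can split a multi-byte sequence and leave invalid UTF-8 at the end of the value.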