The Problem: Characters vs Bytes
Computers store everything as bytes. ASCII, first published in 1963, defined 128 characters, enough for English, each fitting in a single byte. Unicode assigns a unique code point to each character across the world's writing systems: U+0041 = A, U+00E9 = é, U+1F525 = 🔥. UTF-8 is the encoding that converts those code points into bytes for storage and transmission.
How UTF-8 Works
UTF-8 is variable-width: different characters use 1-4 bytes.
- ASCII characters (A-Z, 0-9, punctuation) → 1 byte (identical to ASCII)
- Latin diacritics (é, ü, ñ), Greek, Cyrillic, Arabic, Hebrew → 2 bytes
- Most CJK characters and the rest of the Basic Multilingual Plane → 3 bytes
- Emoji and supplementary characters → 4 bytes
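The byte counts above follow from how UTF-8 packs a code point's bits into a lead byte plus continuation bytes. Here is a minimal sketch of that packing, written by hand for illustration and checked against the built-in TextEncoder. The helper names encodeCodePoint and toHex are hypothetical, not part of any library.

```javascript
// Hypothetical helper: encode one Unicode code point into UTF-8 bytes by hand.
function encodeCodePoint(cp) {
  // Reject values outside Unicode and the UTF-16 surrogate range.
  if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError("not a Unicode scalar value: " + cp);
  }
  if (cp < 0x80) {
    return [cp]; // 0xxxxxxx
  }
  if (cp < 0x800) {
    // 110xxxxx 10xxxxxx
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  }
  if (cp < 0x10000) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Hypothetical helper: format bytes as space-separated hex.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

for (const ch of ["A", "\u00e9", "\u4e2d", "\u{1F525}"]) {
  const manual = encodeCodePoint(ch.codePointAt(0));
  const builtin = Array.from(new TextEncoder().encode(ch));
  // Prints the character, its byte count, the bytes, and whether both encoders agree.
  console.log(ch, manual.length, toHex(manual), toHex(manual) === toHex(builtin));
}
```

In real code, TextEncoder (or Buffer in Node.js) should do this work; the manual version only shows where the 1, 2, 3, and 4 byte boundaries come from.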
// JavaScript quirk (internally UTF-16, not UTF-8):
"A".length // → 1
"é".length // → 1
"🔥".length // → 2 (counts UTF-16 code units, not characters!)
[..."🔥"].length // → 1 (spread uses Unicode-aware iteration)
Why UTF-8 Won
- ASCII compatibility: Any ASCII document is valid UTF-8 — no migration needed
- Space efficiency: English text uses 1 byte per character; UTF-32 (fixed 4 bytes) would quadruple file sizes
- Universal standard: HTML5 and JSON specify UTF-8, PostgreSQL typically initializes databases with it, and the vast majority of web pages use it
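The first two points can be checked directly. The sketch below assumes a runtime with the standard TextEncoder (browsers, Node.js); the sample string and variable names are only illustrative.

```javascript
const text = "Hello, world!";
const utf8 = new TextEncoder().encode(text);

// ASCII compatibility: each UTF-8 byte equals the character's ASCII code.
const asciiCodes = Array.from(text, (ch) => ch.charCodeAt(0));
const identicalToAscii = asciiCodes.every((code, i) => code === utf8[i]);

// UTF-32 spends a fixed 4 bytes on every code point.
const codePoints = [...text].length;
const utf32Bytes = codePoints * 4;

console.log(identicalToAscii);        // true
console.log(utf8.length, utf32Bytes); // 13 52
```

For pure ASCII text the UTF-32 size is exactly four times the UTF-8 size; the gap narrows for text dominated by 3 and 4 byte characters.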
Common Encoding Bugs
// Mojibake: reading UTF-8 bytes as Latin-1
// é in UTF-8 = bytes 0xC3 0xA9
// Interpreted as Latin-1: "Ã©" (0xC3 → Ã, 0xA9 → ©)
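The mojibake above can be reproduced, and in simple cases reversed, with Node.js Buffer. This is a sketch that assumes Node.js (Buffer is not available in browsers); the repair step only works when the garbled text went through exactly one wrong decode and no bytes were lost along the way.

```javascript
const original = "caf\u00e9"; // "café"

// Encode correctly as UTF-8: the é becomes the two bytes 0xC3 0xA9.
const utf8Bytes = Buffer.from(original, "utf8");

// Bug: decode those bytes as Latin-1, one character per byte.
const garbled = utf8Bytes.toString("latin1");

// Repair: turn the wrong characters back into bytes, then decode as UTF-8.
const repaired = Buffer.from(garbled, "latin1").toString("utf8");

console.log(utf8Bytes.length); // 5
console.log(garbled);          // cafÃ©
console.log(repaired === original); // true
```

The safer fix is to decode with the right encoding in the first place; round-trip repair is a last resort for data that is already stored garbled.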
-- MySQL "utf8" (an alias for utf8mb3) is NOT full UTF-8: it stores at most 3 bytes per character
-- Emoji (4 bytes) are truncated with a warning or rejected with an error, depending on SQL mode
-- Fix: always use utf8mb4
ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
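Before migrating, it can help to know which existing strings would not fit in a 3 byte column. Characters that need 4 bytes in UTF-8 are exactly the code points above U+FFFF, so a rough application-side check might look like this. The function name needsUtf8mb4 is hypothetical.

```javascript
// Hypothetical helper: true if the string contains any code point above U+FFFF,
// i.e. any character that takes 4 bytes in UTF-8.
function needsUtf8mb4(str) {
  // The "u" flag makes the regex match whole code points, not UTF-16 code units.
  return /[\u{10000}-\u{10FFFF}]/u.test(str);
}

console.log(needsUtf8mb4("caf\u00e9 \u4e2d\u6587")); // false (2 and 3 byte characters only)
console.log(needsUtf8mb4("fire \u{1F525}"));         // true  (emoji needs 4 bytes)
```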
Practical Rules
- Set Content-Type: application/json; charset=utf-8 on API responses
- Use utf8mb4 in MySQL/MariaDB, never utf8
- Always specify encoding when reading files: fs.readFileSync(path, 'utf8')
- Never assume string length equals byte length when checking database field limits
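For the last rule, one possible approach is to measure the encoded size and truncate on code point boundaries so a multi-byte character is never cut in half. This is a minimal sketch; utf8ByteLength and truncateToBytes are hypothetical helper names, and it assumes the limit is expressed in bytes (some databases count characters instead, so check the column definition).

```javascript
const encoder = new TextEncoder();

// Hypothetical helper: number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(str) {
  return encoder.encode(str).length;
}

// Hypothetical helper: longest prefix of str that fits in maxBytes of UTF-8,
// never splitting a code point.
function truncateToBytes(str, maxBytes) {
  let out = "";
  let used = 0;
  for (const ch of str) { // for...of iterates by code point
    const size = utf8ByteLength(ch);
    if (used + size > maxBytes) break;
    out += ch;
    used += size;
  }
  return out;
}

const value = "h\u00e9llo\u{1F525}"; // "héllo🔥"
console.log(value.length);              // 7  (UTF-16 code units)
console.log(utf8ByteLength(value));     // 10 (UTF-8 bytes)
console.log(truncateToBytes(value, 8)); // héllo
```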