Zero-width steganography
Theory
Why This Matters
Zero-width characters are Unicode code points that produce no visible glyph and occupy no space in rendered text, yet are present in the underlying byte sequence. They have been used to fingerprint leaked documents (each copy is uniquely watermarked by a different zero-width pattern injected between visible characters), to exfiltrate data through channels that permit Unicode text (email, chat, web forms), and to create CTF challenges where the flag is hidden in what appears to be a blank or innocent passage of text. The 2016 discovery that zero-width characters were being used to fingerprint Twitter DM leaks brought this technique to mainstream security awareness. Any analyst who examines text solely by visual inspection will miss zero-width steganography entirely.
Core Concept
The primary zero-width Unicode code points used in steganography are: U+200B (zero-width space, ZWSP), U+200C (zero-width non-joiner, ZWNJ), U+200D (zero-width joiner, ZWJ), U+FEFF (byte order mark / zero-width no-break space, BOM), and U+2060 (word joiner). These characters are classified as format characters (Unicode category Cf) — they affect text rendering or shaping logic but are invisible in standard renderers.
Encoding schemes vary. The simplest binary scheme uses two characters: ZWSP=0 and ZWJ=1 (or any two of the above). Each byte of the hidden message is encoded as 8 bits using these two characters, inserted between the visible characters of the carrier text. A ternary scheme uses three zero-width characters to encode base-3, slightly improving density. The hidden message can be detected by file size anomaly (a paragraph containing ZWSP characters will be significantly larger than its visible character count suggests — each ZWSP is 3 bytes in UTF-8: E2 80 8B) or by hex dump inspection of the raw bytes.
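To make the binary scheme concrete, here is a minimal encoder sketch. It assumes ZWSP=0 and ZWJ=1, most-significant-bit-first byte order, and an even spread of hidden characters across the gaps in the carrier. The function name `embed_binary_zw` and the spreading strategy are illustrative assumptions, not a standard.

```python
ZWSP = "\u200b"  # assumed to encode bit 0
ZWJ = "\u200d"   # assumed to encode bit 1


def embed_binary_zw(carrier: str, secret: bytes) -> str:
    """Hide `secret` in `carrier` as zero-width characters (hypothetical helper)."""
    # Each byte becomes 8 bits, most significant bit first.
    bits = "".join(f"{byte:08b}" for byte in secret)
    hidden = [ZWSP if b == "0" else ZWJ for b in bits]
    if len(carrier) < 2:
        # No gap between visible characters, so append everything.
        return carrier + "".join(hidden)
    # Spread the hidden characters evenly over the gaps between visible characters.
    gaps = len(carrier) - 1
    per_gap, extra = divmod(len(hidden), gaps)
    out, pos = [], 0
    for i, ch in enumerate(carrier):
        out.append(ch)
        if i < gaps:
            take = per_gap + (1 if i < extra else 0)
            out.extend(hidden[pos:pos + take])
            pos += take
    return "".join(out)


stego = embed_binary_zw("hello world", b"Hi")
print(len("hello world"), len(stego), len(stego.encode("utf-8")))
```

The carrier renders identically before and after embedding, but the UTF-8 byte length grows by three bytes per hidden bit, which is the size anomaly described above.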
Technical Deep-Dive
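import unicodedata

# Zero-width character definitions, written as escapes so they survive copy-paste
ZWSP = "\u200b"  # zero-width space: E2 80 8B in UTF-8
ZWNJ = "\u200c"  # zero-width non-joiner: E2 80 8C
ZWJ = "\u200d"   # zero-width joiner: E2 80 8D
BOM = "\ufeff"   # byte order mark / zero-width no-break space: EF BB BF in UTF-8
WJ = "\u2060"    # word joiner: E2 81 A0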
def detect_zero_width(text: str) -> list:
    """Return list of (index, codepoint, name) for all format chars."""
    results = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            results.append((i, hex(ord(ch)), unicodedata.name(ch, "UNKNOWN")))
    return results


def extract_binary_zw(text: str, zero_char=ZWSP, one_char=ZWJ) -> bytes:
    """Extract hidden bytes from ZWSP=0, ZWJ=1 binary encoding."""
    bits = ""
    for ch in text:
        if ch == zero_char:
            bits += "0"
        elif ch == one_char:
            bits += "1"
    # Drop trailing bits that do not fill a whole byte
    bits = bits[:len(bits) - (len(bits) % 8)]
    return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))


def strip_zero_width(text: str) -> str:
    """Remove all format characters to reveal visible text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
# Hex dump detection: spot E2 80 8B and related sequences in raw bytes
with open("suspicious.txt", "rb") as f:
    data = f.read()

# Find all zero-width UTF-8 sequences
import re
pattern = re.compile(
    b"\xe2\x80\x8b"  # ZWSP
    b"|"
    b"\xe2\x80\x8c"  # ZWNJ
    b"|"
    b"\xe2\x80\x8d"  # ZWJ
    b"|"
    b"\xef\xbb\xbf"  # BOM
    b"|"
    b"\xe2\x81\xa0"  # WJ
)
for m in pattern.finditer(data):
    print(f"Found at byte offset {m.start()}: {data[m.start():m.start()+3].hex()}")
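# CyberChef: "Remove Zero-Width Characters" strips the invisible characters from the carrier text
# Then "From Binary" or a custom recipe decodes the extracted bits
# Quick check: byte count vs. character count (wc -m needs a UTF-8 locale)
wc -c suspicious.txt # bytes
wc -m suspicious.txt # characters (zero-width characters are counted too)
# For ASCII-only carrier text the two counts match; if bytes exceeds characters,
# multi-byte characters (possibly zero-width) are present
# xxd hex dump to visually inspect; -g 1 prints one byte per group so "e2 80" can match
# (a sequence split across two output lines is still missed)
xxd -g 1 suspicious.txt | grep "e2 80"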
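The same byte-versus-character comparison can be sketched in Python, which also separates format characters from everything else. This is a minimal illustration; the helper name `size_profile` and the returned keys are assumptions made for this example, and "visible" here simply means any character outside Unicode category Cf.

```python
import unicodedata


def size_profile(data: bytes) -> dict:
    """Compare byte length with character counts (hypothetical helper)."""
    text = data.decode("utf-8")
    # Treat every character outside category Cf as visible for this estimate.
    visible = [ch for ch in text if unicodedata.category(ch) != "Cf"]
    return {
        "bytes": len(data),
        "chars": len(text),
        "visible_chars": len(visible),
        "format_chars": len(text) - len(visible),
    }


sample = "ab\u200bc\u200d".encode("utf-8")
profile = size_profile(sample)
print(profile)
```

A non-zero `format_chars` value is the direct signal. The gap between `bytes` and `visible_chars` is only a hint, since accented letters and emoji are also multi-byte.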
Analytical Methodology
- Check file size vs. visible character count. Mostly-ASCII Latin text averages about one UTF-8 byte per visible character, while each zero-width character adds three bytes, so a byte count well above the visible character count means invisible or other multi-byte characters are present. Use `wc -c` and `wc -m` to compare; `wc -m` counts zero-width characters too, so for ASCII carrier text any gap between the two counts comes from multi-byte characters.
- Scan for format character code points. Use `unicodedata.category(ch) == "Cf"` or a hex dump to find sequences matching `E2 80 8B`, `E2 80 8C`, `E2 80 8D`, `E2 81 A0`, or `EF BB BF`.
- Determine the encoding scheme. How many distinct zero-width character types are present? Two types → binary scheme; three types → ternary; more than three → custom. Count the total number of zero-width characters to estimate message length (total ÷ 8 for binary = bytes).
- Extract and decode. Apply the binary extraction function above, assigning zero-width characters to bits based on the observed pattern. If the extracted bytes are not printable, try the inverse assignment (swap the 0 and 1 characters) or consider a different starting position.
- Use CyberChef. The "Remove Zero-Width Characters" operation strips the invisible characters from the visible text. For extraction, combine "Extract" operations with bit-to-character decoding recipes.
- Check for nested encoding. The extracted bytes may themselves be base64, hex, or another encoding. Always validate extracted data against the expected flag format before concluding the challenge.
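The scheme-identification and extraction steps can be sketched as follows. This is a minimal illustration under stated assumptions: the set `ZERO_WIDTH` and the helpers `profile_scheme` and `decode_candidates` are hypothetical names, bits are read most-significant-bit first, and only the two-character binary case is decoded.

```python
from collections import Counter

# Code points treated as zero-width for this sketch (assumption: the five named in this card).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def profile_scheme(text: str) -> Counter:
    """Count each zero-width character type present in the text."""
    return Counter(ch for ch in text if ch in ZERO_WIDTH)


def decode_candidates(text: str) -> list:
    """Try both bit assignments when exactly two zero-width types are present."""
    counts = profile_scheme(text)
    if len(counts) != 2:
        return []  # not a two-symbol scheme; needs a custom decoder
    a, b = sorted(counts)
    results = []
    for zero, one in ((a, b), (b, a)):
        bits = "".join("0" if ch == zero else "1" for ch in text if ch in counts)
        bits = bits[: len(bits) - len(bits) % 8]  # drop an incomplete trailing byte
        raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
        results.append((zero, one, raw))
    return results


# Sample: "Hi" hidden after the visible text "ok" with ZWSP=0, ZWJ=1.
sample = "ok" + "".join("\u200b" if b == "0" else "\u200d" for b in "0100100001101001")
printable = [raw for _, _, raw in decode_candidates(sample)
             if raw and all(32 <= byte < 127 for byte in raw)]
print(profile_scheme(sample), printable)
```

Filtering candidates by printability is only a heuristic. If the payload is compressed or encrypted, neither assignment will look printable, and the nested-encoding check above still applies to whatever is extracted.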
Common Analytical Errors
- Missing zero-width characters entirely in text editors. Many editors do not highlight or indicate format characters. Copy-paste into a hex editor or use `xxd` to verify raw content before concluding a text file contains nothing hidden.
- Confusing BOM with intentional steganography. A single `U+FEFF` at the start of a file is a standard UTF-8 BOM and not steganographic. Multiple BOM characters or BOM characters scattered through the text are suspicious.
- Assuming ZWSP=0, ZWJ=1 universally. The bit assignment varies by challenge. If the extracted bits produce non-printable output, try the inverted assignment before assuming a different scheme.
- Not checking for entity-encoded or URL-encoded forms. A ZWSP in a web page may have been HTML-entity-encoded (for example `&#8203;` or `&#x200B;`) or URL-encoded (`%E2%80%8B`). When examining web challenge source, check for encoded zero-width characters, which a direct Unicode scan will not find.
- Over-counting zero-width characters from copy-paste artifacts. Text copied from certain platforms (Slack, WhatsApp) can carry ZWJ characters in and around emoji, where ZWJ legitimately joins emoji sequences. These platform artifacts can corrupt steganographic extraction if not filtered out.
- Ignoring the word joiner U+2060. This character is less commonly discussed but is fully invisible and functionally equivalent to ZWSP for steganographic purposes. Exclude it from detection scans at your peril.
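For the entity-encoded and URL-encoded case, one possible approach is to decode the source before scanning. The sketch below uses the standard library's `html.unescape` and `urllib.parse.unquote`; the helper name `normalize_encoded` is an assumption for illustration, and blindly percent-decoding arbitrary text may alter literal `%` sequences, so treat the result as a scanning aid rather than a faithful copy.

```python
import html
import unicodedata
from urllib.parse import unquote


def normalize_encoded(source: str) -> str:
    """Decode HTML character references, then percent-escapes (hypothetical helper)."""
    return unquote(html.unescape(source))


def count_format_chars(text: str) -> int:
    """Count characters in Unicode category Cf."""
    return sum(1 for ch in text if unicodedata.category(ch) == "Cf")


sample = "a&#8203;b&#x200D;c%E2%80%8Bd"
decoded = normalize_encoded(sample)
print(count_format_chars(sample), count_format_chars(decoded))
```

Scanning the raw source finds no format characters, while the decoded text exposes all three, which is why the decoding step has to come before detection.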
NICE Framework Alignment
| Code | Knowledge/Skill/Task Statement | How This Card Develops It |
|---|---|---|
| K0018 | Knowledge of encryption algorithms used to protect data during transmission | Frames zero-width steganography as a covert channel technique for data exfiltration distinct from encryption |
| K0019 | Knowledge of cryptography and key management concepts | Builds understanding of steganographic concealment as a complement to cryptographic confidentiality |
| K0305 | Knowledge of encryption standards and various encryption algorithms | Positions Unicode-based covert channels within the broader landscape of data-hiding techniques |
| S0138 | Skill in using defensive coding practices | Develops robust text sanitisation and format-character filtering in input-handling code |
| T0212 | Perform penetration testing as required to evaluate information security | Trains detection of zero-width exfiltration channels in text-accepting application endpoints |
Further Reading
- Unicode Standard Annex #9 — Unicode Bidirectional Algorithm — Unicode Consortium
- Hiding in Plain Sight: A Survey of Steganographic Techniques — Johnson and Jajodia, IEEE Multimedia
- Text Watermarking Using Zero-Width Characters — Brassil et al., IEEE Journal on Selected Areas in Communications