Zero-width steganography

log_analysis_siem Difficulty 1–5 30 min certifiable

Theory

Why This Matters

Zero-width characters are Unicode code points that produce no visible glyph and occupy no space in rendered text, yet are present in the underlying byte sequence. They have been used to fingerprint leaked documents (each copy is uniquely watermarked by a different zero-width pattern injected between visible characters), to exfiltrate data through channels that permit Unicode text (email, chat, web forms), and to create CTF challenges where the flag is hidden in what appears to be a blank or innocent passage of text. The 2016 discovery that zero-width characters were being used to fingerprint Twitter DM leaks brought this technique to mainstream security awareness. Any analyst who examines text solely by visual inspection will miss zero-width steganography entirely.

Core Concept

The primary zero-width Unicode code points used in steganography are: U+200B (zero-width space, ZWSP), U+200C (zero-width non-joiner, ZWNJ), U+200D (zero-width joiner, ZWJ), U+FEFF (byte order mark / zero-width no-break space, BOM), and U+2060 (word joiner). These characters are classified as format characters (Unicode category Cf) — they affect text rendering or shaping logic but are invisible in standard renderers.
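
The properties listed above can be checked with the standard library alone. The sketch below prints each code point's general category and UTF-8 byte sequence; the `ZERO_WIDTH` table and the `describe` helper are names chosen for this illustration and are not defined elsewhere on this card.

```python
import unicodedata

# Illustrative table of the code points listed above (names chosen for this sketch).
ZERO_WIDTH = {
    "ZWSP": "\u200b",
    "ZWNJ": "\u200c",
    "ZWJ": "\u200d",
    "BOM": "\ufeff",
    "WJ": "\u2060",
}

def describe(ch: str) -> tuple:
    """Return (code point, general category, UTF-8 bytes as hex) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.encode("utf-8").hex(" "))

for label, ch in ZERO_WIDTH.items():
    print(label, *describe(ch), unicodedata.name(ch, "UNKNOWN"))
```

Each row should report category Cf and a three-byte UTF-8 sequence, which is why a single invisible character costs three bytes on disk.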

Encoding schemes vary. The simplest binary scheme uses two characters: ZWSP=0 and ZWJ=1 (or any two of the above). Each byte of the hidden message is encoded as 8 bits using these two characters, inserted between the visible characters of the carrier text. A ternary scheme uses three zero-width characters to encode base-3, slightly improving density. The hidden message can be detected by file size anomaly (a paragraph containing ZWSP characters will be significantly larger than its visible character count suggests — each ZWSP is 3 bytes in UTF-8: E2 80 8B) or by hex dump inspection of the raw bytes.
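
To make the binary scheme and the size anomaly concrete, here is a minimal encoder sketch. It assumes ZWSP=0 and ZWJ=1 with the most significant bit first; the function name `embed_binary_zw` and the even spreading of the payload across the carrier are illustrative choices, not a standard.

```python
ZWSP = "\u200b"  # encodes bit 0 in this sketch
ZWJ = "\u200d"   # encodes bit 1 in this sketch

def embed_binary_zw(carrier: str, secret: bytes) -> str:
    """Hide `secret` in `carrier`, 8 zero-width characters per byte, MSB first."""
    bits = "".join(f"{byte:08b}" for byte in secret)
    payload = [ZWSP if b == "0" else ZWJ for b in bits]
    # Number of zero-width characters placed after each visible character
    # (ceiling division, so the whole payload always fits).
    per_gap = -(-len(payload) // max(len(carrier), 1))
    out = []
    for i, ch in enumerate(carrier):
        out.append(ch)
        out.extend(payload[i * per_gap:(i + 1) * per_gap])
    if not carrier:
        out.extend(payload)  # nothing to hide behind; emit the payload alone
    return "".join(out)

stego = embed_binary_zw("hello", b"Hi")
# visible characters, total code points, UTF-8 bytes
print(len("hello"), len(stego), len(stego.encode("utf-8")))  # 5 21 53
```

Two hidden bytes cost 16 zero-width characters, so a string that renders as five letters occupies 53 bytes: 5 for the visible text plus 16 × 3 for the payload.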

Technical Deep-Dive

import unicodedata

# Zero-width character definitions
ZWSP  = "\u200b"   # E2 80 8B in UTF-8
ZWNJ  = "\u200c"   # E2 80 8C
ZWJ   = "\u200d"   # E2 80 8D
BOM   = "\ufeff"   # EF BB BF in UTF-8
WJ    = "\u2060"   # E2 81 A0

def detect_zero_width(text: str) -> list:
    """Return list of (index, codepoint, name) for all format chars."""
    results = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            results.append((i, hex(ord(ch)), unicodedata.name(ch, "UNKNOWN")))
    return results

def extract_binary_zw(text: str, zero_char=ZWSP, one_char=ZWJ) -> bytes:
    """Extract hidden bytes from ZWSP=0, ZWJ=1 binary encoding."""
    bits = ""
    for ch in text:
        if ch == zero_char:
            bits += "0"
        elif ch == one_char:
            bits += "1"
    # Truncate to a whole number of bytes (drops any trailing partial byte)
    bits = bits[:len(bits) - (len(bits) % 8)]
    return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

def strip_zero_width(text: str) -> str:
    """Remove all format characters to reveal visible text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# Hex dump detection — spot E2 80 8B sequences in raw bytes
with open("suspicious.txt", "rb") as f:
    data = f.read()

# Find all zero-width UTF-8 sequences
import re
pattern = re.compile(
    b"\xe2\x80\x8b"   # ZWSP
    b"|"
    b"\xe2\x80\x8c"   # ZWNJ
    b"|"
    b"\xe2\x80\x8d"   # ZWJ
    b"|"
    b"\xe2\x81\xa0"   # WJ
    b"|"
    b"\xef\xbb\xbf"   # BOM (a single match at offset 0 is a normal UTF-8 BOM)
)
for m in pattern.finditer(data):
    print(f"Found at byte offset {m.start()}: {m.group().hex()}")

# CyberChef: "Remove Zero-Width Characters" to strip carrier text
# Then "From Binary" or custom recipe to decode the extracted bits

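# Shell quick check: byte count vs. character count (wc -m needs a UTF-8 locale)
wc -c suspicious.txt   # bytes
wc -m suspicious.txt   # characters (code points, including invisible ones)
# For text that looks like plain ASCII the two counts should match;
# each zero-width char adds 1 to the character count but 3 to the byte count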

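# xxd hex dump to visually inspect; -g 1 prints space-separated bytes so the
# pattern can match (sequences split across two output rows are still missed)
xxd -g 1 suspicious.txt | grep "e2 80"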

Analytical Methodology

Check file size vs. character count. For ASCII carrier text, each visible character is one byte in UTF-8, while each zero-width character is three. Compare wc -c (bytes) with wc -m (code points, which include the invisible ones): if the text looks like plain ASCII but the byte count exceeds the character count, or the character count exceeds what you can see on screen, multi-byte or invisible characters are present.
Scan for format character code points. Use unicodedata.category(ch) == "Cf" or a hex dump to find sequences matching E2 80 8B, E2 80 8C, E2 80 8D, E2 81 A0, or EF BB BF.
Determine the encoding scheme. How many distinct zero-width character types are present? Two types → binary scheme; three types → ternary; more than three → custom. Count the total number of zero-width characters to estimate message length (total ÷ 8 for binary = bytes).
Extract and decode. Apply the binary extraction function above, assigning zero-width characters to bits based on observed pattern. If the extracted bytes are not printable, try the inverse assignment (swap 0 and 1 characters) or consider a different starting position.
Use CyberChef. The "Remove Zero-Width Characters" operation strips invisibles from visible text. For extraction, combine "Extract" operations with bit-to-character decoding recipes.
Check for nested encoding. The extracted bytes may themselves be base64, hex, or another encoding. Always validate extracted data against expected flag format before concluding the challenge.
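
The scheme-identification and extraction steps above can be sketched as follows. This is a minimal illustration, assuming a two-symbol scheme with MSB-first bytes; `CANDIDATES`, `zero_width_stream`, `try_binary_decodings`, and `looks_printable` are hypothetical names introduced here, and the printable-ASCII test is only a crude plausibility filter.

```python
from collections import Counter
from itertools import permutations

# Code points treated as zero-width in this sketch.
CANDIDATES = "\u200b\u200c\u200d\u2060\ufeff"

def zero_width_stream(text: str) -> str:
    """Return the candidate characters in order, ignoring a BOM at index 0."""
    return "".join(ch for i, ch in enumerate(text)
                   if ch in CANDIDATES and not (i == 0 and ch == "\ufeff"))

def try_binary_decodings(text: str) -> dict:
    """Decode under both bit assignments when exactly two symbol types occur."""
    stream = zero_width_stream(text)
    symbols = sorted(Counter(stream))
    if len(symbols) != 2:
        return {}  # not a two-symbol scheme; ternary or custom handling needed
    results = {}
    for zero, one in permutations(symbols):
        bits = "".join("0" if ch == zero else "1" for ch in stream)
        bits = bits[: len(bits) - len(bits) % 8]  # drop a trailing partial byte
        decoded = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
        results[(f"U+{ord(zero):04X}", f"U+{ord(one):04X}")] = decoded
    return results

def looks_printable(data: bytes) -> bool:
    """Crude plausibility filter: non-empty and entirely printable ASCII."""
    return bool(data) and all(32 <= b < 127 for b in data)

# Eight hidden symbols spelling 01001000 under ZWSP=0, ZWJ=1
sample = "a" + "\u200b\u200d\u200b\u200b\u200d\u200b\u200b\u200b" + "b"
for mapping, decoded in try_binary_decodings(sample).items():
    print(mapping, decoded, looks_printable(decoded))
```

Only one of the two assignments yields printable output for the sample, which is the signal to keep that mapping and move on to checking for nested encodings.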

Common Analytical Errors

Missing zero-width characters entirely in text editors. Many editors do not highlight or indicate format characters. Copy-paste into a hex editor or use xxd to verify raw content before concluding a text file contains nothing hidden.
Confusing BOM with intentional steganography. A single U+FEFF at the start of a file is a standard UTF-8 BOM and not steganographic. Multiple BOM characters or BOM characters scattered through the text are suspicious.
Assuming ZWSP=0, ZWJ=1 universally. The bit assignment varies by challenge. If the extracted bits produce non-printable output, try the inverted assignment before assuming a different scheme.
Not checking for entity- or URL-encoded forms. A ZWSP in a web page may have been HTML-entity-encoded (&#8203; or &#x200B;) or URL-encoded (%E2%80%8B). When examining web challenge source, check for these encoded forms, which direct Unicode scanning will not find.
Over-counting zero-width characters from copy-paste artifacts. ZWJ (U+200D) is a legitimate part of many multi-part emoji sequences, so text copied from chat platforms (Slack, WhatsApp) often carries ZWJ characters that have nothing to do with a hidden message. These artifacts can corrupt steganographic extraction if not filtered out.
Ignoring the word joiner U+2060. This character is less commonly discussed but is fully invisible and functionally equivalent to ZWSP for steganographic purposes. Exclude it from detection scans at your peril.
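
Two of the errors above (encoded forms in web source, and a legitimate leading BOM) can be handled with a small pre-processing pass before scanning. The sketch below uses only the standard library; the helper names are hypothetical, and applying `unquote` to arbitrary text is a blunt instrument that can also alter literal percent signs followed by hex digits.

```python
import html
from urllib.parse import unquote

def normalise_encoded_zero_width(source: str) -> str:
    """Turn percent-escapes and HTML entities into real code points before scanning."""
    return html.unescape(unquote(source))

def strip_leading_bom(text: str) -> str:
    """Drop a single U+FEFF at position 0; later occurrences are kept as suspicious."""
    return text[1:] if text.startswith("\ufeff") else text

page = "fl&#8203;ag%E2%80%8D"
clean = normalise_encoded_zero_width(page)
print([f"U+{ord(ch):04X}" for ch in clean if ch in "\u200b\u200c\u200d\u2060\ufeff"])
```

After this pass, the detection and extraction functions in the Technical Deep-Dive see the zero-width characters as ordinary code points.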

NICE Framework Alignment

Code	Knowledge/Skill/Task Statement	How This Card Develops It
K0018	Knowledge of encryption algorithms used to protect data during transmission	Frames zero-width steganography as a covert channel technique for data exfiltration distinct from encryption
K0019	Knowledge of cryptography and key management concepts	Builds understanding of steganographic concealment as a complement to cryptographic confidentiality
K0305	Knowledge of encryption standards and various encryption algorithms	Positions Unicode-based covert channels within the broader landscape of data-hiding techniques
S0138	Skill in using defensive coding practices	Develops robust text sanitisation and format-character filtering in input-handling code
T0212	Perform penetration testing as required to evaluate information security	Trains detection of zero-width exfiltration channels in text-accepting application endpoints