Custom encoding

encoding_crypto_classical Difficulty 1–5 30 min certifiable

Theory

Why This Matters

The hardest encoding challenges in CTF competitions — and in real malware analysis — involve custom encoding schemes invented specifically for the challenge or the malware sample. No reference table exists; no standard decoder applies. Success depends on the analyst's ability to extract structural patterns from the encoded data itself, use known plaintext anchors (the flag prefix, if the flag format is known), and form and test hypotheses about the encoding mechanism systematically. This card develops the general-purpose pattern analysis methodology that applies whenever a known encoding cannot be identified. The same skill set applies to proprietary binary formats, custom serialisation protocols, and undocumented file formats encountered during reverse engineering.

Core Concept

A custom encoding maps a source alphabet to an output alphabet using some rule that may or may not be key-dependent. Analysis proceeds in four phases:

Structural analysis: determine the output alphabet (what characters appear?), look for fixed-length groups (suggesting a base conversion), identify separators, and measure entropy.
Known-plaintext anchoring: if the flag format is CTF{...}, those characters are the first few bytes of plaintext. The corresponding output characters reveal partial mapping information.
Frequency analysis: a monoalphabetic substitution preserves the plaintext's symbol frequencies. E is the most frequent letter in English text, so the most frequent output symbol is a strong candidate for E; match the remaining symbols by frequency rank and refine from there.
Differential analysis: if multiple encoded samples are available, XOR or difference operations between them can cancel out key material and reveal structural properties.
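
The structural-analysis phase calls for measuring entropy. The following is a minimal sketch, assuming per-symbol Shannon entropy is the measure intended; the function names shannon_entropy and alphabet_ceiling are illustrative and do not come from this card.

```python
import math
from collections import Counter

def shannon_entropy(data: str) -> float:
    """Shannon entropy of `data` in bits per symbol."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def alphabet_ceiling(data: str) -> float:
    """Highest entropy reachable with the number of distinct symbols seen."""
    distinct = len(set(data))
    return math.log2(distinct) if distinct > 1 else 0.0

if __name__ == "__main__":
    sample = "abab"
    print(shannon_entropy(sample), alphabet_ceiling(sample))
```

A monoalphabetic substitution only relabels symbols, so its entropy equals that of the plaintext. A base-N encoding of random or compressed data tends toward the ceiling of log2(N) bits per symbol. A value well below the ceiling therefore hints at substitution over natural-language text rather than a base conversion of binary data.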

Base variants are a common category of custom encoding: base58 (Bitcoin address alphabet), base62 ([A-Za-z0-9], with no single standard symbol order), and base91 (the 91-symbol encoding by Joachim Henke) all resemble base64 but use different alphabets or group sizes. A string that looks like base64 but does not decode to anything sensible may be one of these variants.
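
One way to narrow down the variant is to test which known alphabets contain every symbol of the sample. The sketch below is an assumption-laden illustration: KNOWN_ALPHABETS and candidate_bases are hypothetical names, the base58 entry is the Bitcoin alphabet quoted in this card, and the other entries are the conventional character sets for each base.

```python
import string

# Candidate alphabets (assumed conventional sets; extend as needed).
KNOWN_ALPHABETS = {
    "base16": set(string.hexdigits),
    "base32": set("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"),
    "base58": set("123456789ABCDEFGHJKLMNPQRSTUVWXYZ"
                  "abcdefghijkmnopqrstuvwxyz"),
    "base62": set(string.ascii_letters + string.digits),
    "base64": set(string.ascii_letters + string.digits + "+/"),
    "base64url": set(string.ascii_letters + string.digits + "-_"),
}

def candidate_bases(encoded: str, padding: str = "=") -> list:
    """Names of alphabets containing every symbol in `encoded`, smallest first."""
    symbols = set(encoded.rstrip(padding))
    fits = [(len(alpha), name)
            for name, alpha in KNOWN_ALPHABETS.items()
            if symbols <= alpha]
    return [name for _, name in sorted(fits)]

if __name__ == "__main__":
    print(candidate_bases("JBSWY3DPEB3W64TMMQ======"))
```

The smallest alphabet that fits is usually the best first guess, but a short sample can fit several alphabets at once, so treat the result as a ranked list of hypotheses rather than an identification.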

Technical Deep-Dive

from collections import Counter
import string

def analyse_charset(encoded: str) -> dict:
    """Profile the character set of an encoded string."""
    chars = set(encoded)
    freq  = Counter(encoded)
    return {
        "unique_chars": len(chars),
        "charset": "".join(sorted(chars)),
        "length": len(encoded),
        "most_common": freq.most_common(5),
        "group_size_candidates": [
            g for g in range(2, 9) if len(encoded) % g == 0
        ],
    }

def find_known_plaintext_map(encoded: str, known_plain: str) -> dict:
    """Build partial map from known plaintext prefix."""
    return {encoded[i]: known_plain[i]
            for i in range(min(len(encoded), len(known_plain)))}

# Base58 decode (Bitcoin alphabet)
BASE58_ALPHA = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
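def decode_base58(s: str) -> bytes:
    """Decode a Bitcoin-alphabet base58 string; raises ValueError on a foreign symbol."""
    # Treat the whole string as one big-endian base-58 integer
    n = sum(BASE58_ALPHA.index(c) * (58 ** i)
            for i, c in enumerate(reversed(s)))
    # Peel off bytes least-significant first
    result = []
    while n > 0:
        result.append(n & 0xFF)
        n >>= 8
    # Each leading '1' in the input stands for one leading zero byte
    result += [0] * (len(s) - len(s.lstrip("1")))
    return bytes(reversed(result))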

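# Base62 decode. Symbol order here is 0-9, A-Z, a-z; base62 has no single
# standard ordering, so try other orderings if the output is unreadable.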
BASE62_ALPHA = string.digits + string.ascii_uppercase + string.ascii_lowercase
def decode_base62(s: str) -> bytes:
    n = sum(BASE62_ALPHA.index(c) * (62 ** i)
            for i, c in enumerate(reversed(s)))
    result = []
    while n:
        result.append(n & 0xFF)
        n >>= 8
    return bytes(reversed(result))

# Frequency analysis for monoalphabetic substitution
ENGLISH_FREQ = "ETAOINSHRDLCUMWFGYPBVKJXQZ"
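def frequency_attack(encoded: str) -> str:
    """First-pass guess: map symbols to ENGLISH_FREQ by frequency rank.

    The input is uppercased and non-letters pass through unchanged. Expect
    to correct the mapping by hand, especially on short ciphertexts.
    """
    encoded_upper = encoded.upper()
    freq = Counter(c for c in encoded_upper if c.isalpha())
    by_freq = [c for c, _ in freq.most_common()]
    mapping = dict(zip(by_freq, ENGLISH_FREQ))
    return "".join(mapping.get(c, c) for c in encoded_upper)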

# Differential analysis: XOR two encoded samples to cancel key
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# If sample1 = E(plain1) and sample2 = E(plain2) and E is XOR with key:
# xor_bytes(sample1, sample2) == xor_bytes(plain1, plain2) — key cancels
# Known-plaintext: if plain1 starts with "CTF{", we know first 4 bytes
# XOR those with sample1[:4] to recover the first 4 key bytes

def recover_xor_key_prefix(sample: bytes, known_plain: bytes) -> bytes:
    return xor_bytes(sample[:len(known_plain)], known_plain)

# CyberChef charcode analysis: "To Charcode" + histogram
# dcode.fr: "Cipher Identifier" for automated recognition of ~200 classical ciphers
# quipqiup.com: automated frequency analysis for substitution ciphers
# python3 -c "from collections import Counter; print(Counter(open('enc.txt').read()))"
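
The key-cancellation and known-plaintext comments above can be checked end to end on synthetic data. This is a sketch under stated assumptions: the scheme is a repeating-key XOR, the key (a made-up value, b"K3y!") is no longer than the known prefix, and xor_with_key is a hypothetical helper that is not part of the card's code.

```python
from itertools import cycle

def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR `data` with `key`, repeating the key as needed."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

# Synthetic samples; in a real challenge only sample1 and sample2 are known.
key = b"K3y!"                      # hypothetical 4-byte key
plain1 = b"CTF{first_sample}"
plain2 = b"CTF{other_sample}"
sample1 = xor_with_key(plain1, key)
sample2 = xor_with_key(plain2, key)

# Differential step: XOR of the two ciphertexts equals XOR of the plaintexts.
diff = bytes(a ^ b for a, b in zip(sample1, sample2))

# Known-plaintext step: the flag prefix exposes the first key bytes.
recovered = bytes(c ^ p for c, p in zip(sample1, b"CTF{"))

if __name__ == "__main__":
    print(diff[:4], recovered, xor_with_key(sample1, recovered))
```

Because both plaintexts start with the same prefix, the first bytes of diff are zero, which is itself a useful signature of a shared key. When the key is longer than the known prefix, only a key prefix is recovered and the rest has to come from further known or guessed plaintext.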

Analytical Methodology

Profile the character set. Call analyse_charset on the encoded string. Note the number of unique characters (26 → likely alphabetic substitution; 58 → likely base58; 91 → base91; 64 → base64 variant), the length, and which group sizes evenly divide the length. A short sample rarely uses every symbol of its alphabet, so treat the observed count as a lower bound on the alphabet size.
Compare with known base variants. If the character set is [A-Za-z0-9] (62 chars), try base62. If it matches the Bitcoin base58 alphabet exactly, try base58. For any base-N variant, implement the positional decode: sum(alphabet.index(c) * N**i for i, c in enumerate(reversed(s))).
Anchor with known plaintext. If the flag format is known (e.g., CTF{), the first several encoded characters correspond to the first plaintext characters. Call find_known_plaintext_map to build a partial substitution table. Extend it using frequency analysis.
Apply frequency analysis. For monoalphabetic substitution, map the most frequent encoded symbols to the most frequent English letters (ETAOIN…). Iteratively adjust mappings by checking which substitutions produce readable word fragments.
Test differential analysis. If multiple encoded samples share a key (or the same encoding rule), XOR pairs of encoded values to cancel the key material. Recognisable plaintext structure in XOR differences confirms a key-dependent encoding and may reveal the key directly.
Validate hypotheses programmatically. Build a decode function for your hypothesised scheme and test it against all available encoded samples. A correct hypothesis produces consistent, readable plaintext across all samples.
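
The positional decode from the base-variant step and the validation step can be combined in one short sketch. The names decode_base_n and validate_hypothesis are illustrative. Two assumptions are built in: each leading "zero" symbol maps to one zero byte (the Bitcoin base58 convention, which other schemes may not share), and a correct decode is printable ASCII starting with the known flag prefix.

```python
from typing import Optional

def decode_base_n(s: str, alphabet: str) -> bytes:
    """Positional decode of `s`, treating `alphabet` as digits 0..N-1."""
    index = {c: i for i, c in enumerate(alphabet)}
    base = len(alphabet)
    n = 0
    for c in s:
        if c not in index:
            raise ValueError(f"symbol {c!r} is not in the alphabet")
        n = n * base + index[c]
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    # Assumption: one zero byte per leading zero-valued symbol.
    zeros = len(s) - len(s.lstrip(alphabet[0]))
    return b"\x00" * zeros + body

def validate_hypothesis(samples: list, alphabet: str,
                        prefix: bytes = b"CTF{") -> Optional[list]:
    """Return all decoded samples, or None if any sample contradicts the hypothesis."""
    decoded = []
    for s in samples:
        try:
            plain = decode_base_n(s, alphabet)
        except ValueError:
            return None
        printable = all(32 <= b < 127 for b in plain)
        if not printable or not plain.startswith(prefix):
            return None
        decoded.append(plain)
    return decoded

if __name__ == "__main__":
    hex_alphabet = "0123456789abcdef"
    print(validate_hypothesis(["4354467b617d", "4354467b627d"], hex_alphabet))
```

Rejecting the hypothesis as soon as one sample fails keeps the test strict: a scheme that decodes only some samples is more likely a coincidence than a partial success.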

Common Analytical Errors

Assuming a custom alphabet is actually base64. A string drawing on 64 unique characters in groups of 4 may use a custom alphabet instead of the standard A-Za-z0-9+/. Try standard base64 first; if the result is unreadable, or the string contains symbols outside the standard set, check whether the character set is a permutation of the standard base64 alphabet.
Premature frequency analysis. Frequency analysis only works for monoalphabetic substitution over plaintext with known statistical properties. In a base-N encoding each output symbol carries bits that span several plaintext characters, so output symbol frequencies do not track letter frequencies, and applying frequency analysis produces meaningless results.
Ignoring separator characters. Custom encodings sometimes use fixed delimiters between code groups (commas, dashes, colons). Failing to split on the correct delimiter produces groups of the wrong size and breaks all subsequent analysis.
Anchoring on too few known-plaintext bytes. Three known characters yield at most three substitution mappings, which is too few to confirm a hypothesis. Require at least 5–6 consistent mappings before proceeding under the assumption that the scheme is a simple substitution.
Forgetting that custom encodings may be key-dependent. A custom encoding where each character is XORed with a rolling key cannot be broken by frequency analysis alone without knowing or recovering the key. Check for repeating patterns in the XOR difference stream (period = key length).
Not enumerating base variants systematically. Analysts often try base64 and immediately declare "unknown custom encoding." Before concluding custom, test base32, base58, base62, base85, and base91 — all are in common CTF use and each has a distinct alphabet signature.
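
For the permuted-alphabet case in the first error above, a candidate alphabet can be tested by translating the sample back to the standard alphabet and decoding normally. This is a minimal sketch, assuming the scheme is standard base64 with only the symbol table changed; decode_custom_base64 is a hypothetical helper name, and the case-swapped alphabet in the demo is made up for illustration.

```python
import base64

STANDARD_B64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz"
                "0123456789+/")

def decode_custom_base64(encoded: str, custom_alphabet: str) -> bytes:
    """Decode base64 text that uses a permuted 64-symbol alphabet."""
    if len(custom_alphabet) != 64 or len(set(custom_alphabet)) != 64:
        raise ValueError("custom alphabet must contain 64 distinct symbols")
    # Map each custom symbol to the standard symbol at the same position.
    table = str.maketrans(custom_alphabet, STANDARD_B64)
    standard = encoded.translate(table)
    # Restore any stripped '=' padding so the length is a multiple of 4.
    standard += "=" * (-len(standard) % 4)
    return base64.b64decode(standard)

if __name__ == "__main__":
    # Made-up alphabet: lowercase and uppercase blocks swapped.
    swapped = ("abcdefghijklmnopqrstuvwxyz"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "0123456789+/")
    sample = base64.b64encode(b"CTF{demo}").decode()
    sample = sample.translate(str.maketrans(STANDARD_B64, swapped))
    print(decode_custom_base64(sample, swapped))
```

In practice the custom alphabet is often recoverable from the challenge binary or script as a 64-character string constant, so searching for such a constant is usually quicker than guessing permutations.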

NICE Framework Alignment

Code	Knowledge/Skill/Task Statement	How This Card Develops It
K0018	Knowledge of encryption algorithms used to protect data during transmission	Develops analytical intuition about the full range of encoding and encryption schemes, including novel constructions
K0019	Knowledge of cryptography and key management concepts	Builds skills in cryptanalysis — recovering plaintext and key structure from encoded samples without a known algorithm
K0305	Knowledge of encryption standards and various encryption algorithms	Expands knowledge to non-standard base variants and custom alphabets beyond the canonical encoding standards
S0138	Skill in using defensive coding practices	Develops hypothesis-driven, test-backed decoder implementation practices applicable to unknown binary formats
T0212	Perform penetration testing as required to evaluate information security	Trains the pattern-recognition and differential-analysis skills used in reverse engineering proprietary protocols and obfuscated payloads