Vigenère Cipher Cryptanalysis: Kasiski Examination and Index of Coincidence Attack

cloud_container_security Difficulty 1–5 30 min certifiable

Theory

Why This Matters

The Vigenère cipher was called "le chiffre indéchiffrable" — the unbreakable cipher — for three centuries until Charles Babbage and Friedrich Kasiski independently broke it in the mid-1800s. Its legacy in CTF competitions is substantial: Vigenère and its close variants appear in dozens of cryptography challenges every year, and automated decryption tools such as dCode and quipqiup routinely solve them in seconds once key length is known. Beyond CTFs, understanding Vigenère cryptanalysis is the conceptual bridge to understanding stream-cipher attacks: recovering a repeating keystream is structurally identical to recovering a Vigenère key, and the statistical techniques — IoC, Kasiski, frequency analysis on keystream bytes — transfer directly to XOR-based stream ciphers encountered in binary exploitation and malware reverse engineering.

Core Concept

The Vigenère cipher is a polyalphabetic substitution cipher: each plaintext character Pᵢ is shifted by the numeric value of the corresponding key character Kᵢ mod len(key), cycling the key as needed. Formally: Cᵢ = (Pᵢ + K[i mod L]) mod 26, where L is the key length and all values are letter indices (A=0 … Z=25).

The cipher violates perfect secrecy because the key is shorter than the message and repeats. This creates a structural weakness: plaintext characters at positions that are multiples of L apart are encrypted with the same key byte. The entire ciphertext is therefore a superposition of L independent Caesar ciphers. Recovering L breaks the polyalphabetic property and reduces the problem to L independent frequency analyses.

Kasiski examination exploits the fact that identical plaintext trigrams (common English sequences like "THE" or "AND") that happen to be aligned with the same part of the key will produce identical ciphertext trigrams. The distances between these repeated ciphertext trigrams are multiples of L. Taking the GCD of several such distances gives a strong candidate for L (or a small multiple of it).

The Friedman test uses the Index of Coincidence to estimate key length directly. For a ciphertext of length n enciphered with a key of length L, the IoC of the full ciphertext satisfies: IoC ≈ 0.065/L + 0.038(1 − 1/L). Solving for L: L ≈ (0.065 − 0.038) / (IoC − 0.038). This gives a real-valued estimate that should be rounded to the nearest integer and cross-checked with Kasiski results.

Once L is confirmed, the ciphertext is split into L cosets — coset j contains characters at positions j, j+L, j+2L, … — each of which is a pure Caesar ciphertext encrypted with key[j]. Each coset is attacked independently by frequency analysis (map most-frequent coset character to E), yielding each key letter and thus the full key.

The cryptanalytic precondition is ciphertext-only for standard English Vigenère. The attack requires sufficient ciphertext length: Kasiski needs at least two independent trigram repetitions (practical lower bound ~80 characters); frequency analysis per coset needs at least 20–30 characters per coset to be reliable.

Technical Deep-Dive

from math import gcd
from functools import reduce
from collections import Counter
import string

ENGLISH_FREQ = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def kasiski_key_length(ct: str, min_len: int = 2) -> list[int]:
    """Find repeated trigrams and return GCD of their spacings."""
    ct = ct.upper().replace(" ", "")
    spacings = []
    for i in range(len(ct) - 2):
        trigram = ct[i:i+3]
        for j in range(i+3, len(ct) - 2):
            if ct[j:j+3] == trigram:
                spacings.append(j - i)
    if not spacings:
        return []
    common = reduce(gcd, spacings)
    # Return all divisors of common as candidate key lengths
    return sorted({d for d in range(min_len, common+1) if common % d == 0})

def ioc(text: str) -> float:
    n = len(text)
    counts = Counter(text)
    return sum(f*(f-1) for f in counts.values()) / (n*(n-1)) if n > 1 else 0

def friedman_estimate(ct: str) -> float:
    """Estimate key length from global IoC."""
    ic = ioc(ct.upper().replace(" ", ""))
    if abs(ic - 0.038) < 1e-6:
        return float("inf")
    return (0.065 - 0.038) / (ic - 0.038)

def crack_vigenere(ct: str, key_len: int) -> str:
    """Given confirmed key length, recover key via per-coset frequency analysis."""
    ct = ct.upper().replace(" ", "")
    key = []
    for j in range(key_len):
        coset = ct[j::key_len]
        most_common = Counter(coset).most_common(1)[0][0]
        shift = (ord(most_common) - ord('E')) % 26
        key.append(chr(ord('A') + shift))
    key_str = "".join(key)
    pt = "".join(
        chr((ord(c) - ord('A') - ord(key_str[i % key_len]) + ord('A') + 26) % 26 + ord('A'))
        for i, c in enumerate(ct)
    )
    return pt, key_str

Cryptanalysis Methodology

Strip whitespace and punctuation — Normalise the ciphertext to uppercase A–Z only. Record original positions if you need to restore formatting.
Compute global IoC — Use the Python ioc() function. IoC < 0.055 suggests polyalphabetic; IoC ≥ 0.060 suggests monoalphabetic or very short key. Use the Friedman formula to obtain a preliminary L estimate.
Run Kasiski examination — Find repeated trigrams with the kasiski_key_length() function. Collect the GCD of spacings and enumerate its divisors as key-length candidates.
Cross-check candidates — For each candidate L in {2, 3, 4, 5, 6, 7, 8, 9, 10}, split the ciphertext into L cosets and compute the IoC of each coset. The correct L produces per-coset IoC values all near 0.065 (English). Wrong L values produce cosets with mixed frequencies and lower IoC.
Recover the key — For the confirmed L, apply per-coset frequency analysis with crack_vigenere(). Optionally use dCode.fr Vigenère tool or quipqiup for automated solving.
Verify and adjust — Inspect the decrypted plaintext. If 1–2 letters are wrong in the key, fix them manually by reading the garbled plaintext positions and correcting the corresponding key letter.
Handle non-standard alphabets — If the cipher uses digits or punctuation, extend the alphabet to match and adjust all modular arithmetic accordingly.

Secure Implementation Note — Vigenère provides no computational security. Any symmetric encryption requirement should use AES-256-GCM or ChaCha20-Poly1305 with a cryptographically random key and nonce. There is no legitimate use case for Vigenère in production systems.

Common Cryptanalysis Errors

Using IoC on a ciphertext shorter than 50 characters — IoC estimates are statistically unreliable on short samples. For short ciphertexts, brute-force key lengths 1–10 directly by checking per-coset IoC.
Accepting the first Kasiski GCD without validating coset IoC — The GCD may be a multiple of the true key length. Always verify each divisor of the GCD via coset IoC before committing to L.
Applying Caesar analysis to each coset before confirming L — If L is wrong, frequency analysis on each coset produces garbage and the analyst concludes the cipher is not Vigenère. Always confirm L first.
Forgetting to strip non-alpha characters — Spaces, punctuation, and digits in the ciphertext inflate apparent key length estimates and corrupt coset splitting. Strip all non-alphabet characters before analysis.
Missing autokey and running-key variants — A Beaufort cipher or autokey variant uses a different recurrence relation. If Vigenère analysis produces a garbled key, consider whether the cipher is a Beaufort (C = (K − P) mod 26) or autokey (key is primed with a word then extended with plaintext).
Trusting quipqiup blindly on short keys — Automated solvers use dictionary attacks and may produce plausible-looking English with the wrong key if the ciphertext is short. Always check that the recovered key is consistent across all cosets.

NICE Framework Alignment

Code	Knowledge/Skill/Task Statement	How This Card Develops It
K0018	Knowledge of encryption algorithms	Establishes Vigenère as a polyalphabetic cipher and maps its structure to modern stream cipher concepts
K0019	Knowledge of cryptography and cryptographic key management concepts	Shows that key repetition is the core weakness and motivates nonce/IV uniqueness requirements in modern ciphers
K0305	Knowledge of data concealment	Distinguishes polyalphabetic substitution from secure encryption and explains why IoC defeats both
S0138	Skill in using public-key encryption and PKI	Anchors the learner's intuition about key length and key reuse in a mathematically tractable classical setting