MD5 Collision Generation and Exploitation: Crafting Identical-Hash Inputs for Integrity Bypass

encoding_crypto_classical Difficulty 1–5 30 min certifiable

Theory

Cryptanalysis Methodology

A collision attack against a hash function finds two distinct inputs M1 and M2 such that H(M1) = H(M2). MD5 was broken in 2004 when Xiaoyun Wang and Hongbo Yu published a collision-finding algorithm running in 2^39 operations — well within reach of commodity hardware. By 2008, Marc Stevens and colleagues demonstrated the chosen-prefix collision: given any two chosen prefixes P1 and P2, find suffixes S1 and S2 such that MD5(P1 || S1) = MD5(P2 || S2). The Flame malware (2012) exploited an MD5 chosen-prefix collision to forge a Microsoft code-signing certificate, enabling it to distribute itself as a legitimate Windows Update.

Identical-prefix collision. The simplest form: both inputs share the same prefix; only the collision blocks differ. Tools: fastcoll (Marc Stevens, 2007) generates a collision pair in under 1 second on modern hardware. The two files differ in exactly 128 bytes at a fixed offset but hash to the same MD5.

Chosen-prefix collision. More powerful: the two inputs can begin with arbitrary chosen content. Tools: hashclash (Marc Stevens). Slower (~hours on a GPU cluster) but enables forging meaningful content like certificates or executables.

CTF applications: 1. File hash check bypass: a challenge verifies md5(submitted_file) == expected_hash where expected_hash was computed from a benign file. Submit a file that collides with the benign file but contains different content (e.g., the flag is printed only if a specific byte in the file is set). 2. Same hash, different content: the challenge compares two file hashes and requires them to be equal but the file contents to differ. Generate a collision pair with fastcoll and submit both files. 3. Certificate forgery simulation: reconstruct the scenario where two X.509 TBS (to-be-signed) certificate structures collide in MD5, demonstrating why MD5 was removed from certificate chains.

MD5 vs SHA-1 vs SHA-2. MD5 collisions are practical in seconds; SHA-1 collisions were demonstrated by Google in 2017 (SHAttered, ~6500 GPU-years equivalent), now achievable in hours; SHA-256 has no known practical collision attack.

Technical Deep-Dive

# Generate an MD5 collision pair with fastcoll
# Install: apt install fastcoll  OR  build from https://github.com/jvdsn/fastcoll
fastcoll -o file1.bin file2.bin
# Outputs two 128-byte files with identical MD5 hashes

# Verify the collision
md5sum file1.bin file2.bin
# Both lines should show the same hash

# Show where the files differ (always exactly 128 bytes at offset 0)
cmp -l file1.bin file2.bin | head

# Build collision files with a meaningful prefix (e.g., prepend a valid PNG header)
# 1. Create prefix file (must be a multiple of 64 bytes for MD5 block alignment)
python3 -c "
prefix = b'%PDF-1.4
'          # chosen prefix, any content
pad = (-len(prefix)) % 64        # pad to 64-byte boundary
open('prefix.bin', 'wb').write(prefix + b'x00' * pad)
"
fastcoll -p prefix.bin -o col1.bin col2.bin
md5sum col1.bin col2.bin         # should match

# Python: verify an MD5 collision pair programmatically
import hashlib

def verify_md5_collision(f1: str, f2: str) -> bool:
    d1 = open(f1, "rb").read()
    d2 = open(f2, "rb").read()
    if d1 == d2:
        return False   # identical files are not a collision
    return hashlib.md5(d1).hexdigest() == hashlib.md5(d2).hexdigest()

# CTF: find which of two given collision files matches a target hash
def find_matching_file(target_hash: str, files: list[str]) -> str | None:
    for path in files:
        if hashlib.md5(open(path,"rb").read()).hexdigest() == target_hash:
            return path
    return None

# Demonstrate that appending data after collision blocks preserves the collision:
# MD5(col1 || suffix) == MD5(col2 || suffix) for any suffix
def test_collision_extension(f1: str, f2: str, suffix: bytes) -> bool:
    d1 = open(f1, "rb").read() + suffix
    d2 = open(f2, "rb").read() + suffix
    return hashlib.md5(d1).hexdigest() == hashlib.md5(d2).hexdigest()

Common Cryptanalysis Errors

1. Confusing collision with preimage attack. A collision finds M1 ≠ M2 with the same hash. A preimage attack finds M such that H(M) = target_hash for a given target. MD5 preimage attacks are not practically feasible; MD5 collisions are. CTF challenges that appear to require a preimage are usually exploiting a collision construction instead.

2. Not accounting for prefix alignment. fastcoll requires the prefix to be a multiple of 64 bytes (one MD5 compression block). If your prefix is not aligned, the collision blocks will be appended at the wrong boundary. Always pad prefixes to a 64-byte multiple with null bytes before passing to fastcoll.

3. Assuming the suffix also differs. In a fastcoll identical-prefix collision, the two files share the same prefix and differ only in the 128-byte collision block. Any data appended after the collision block is identical in both files, and the collision property is preserved: MD5(col1 || suffix) = MD5(col2 || suffix). Use this to embed useful data after the collision block.

4. Using MD5 output directly as a security-sensitive check. If a CTF challenge uses if md5(input) == md5(stored): where both sides are MD5 of user-controlled content, submit a collision pair. But if the challenge compares md5(input) against a fixed constant, that is a preimage requirement, not a collision.

5. Forgetting that collision = broken integrity, not broken confidentiality. MD5 collisions mean you cannot trust MD5 for integrity checks, certificates, or code signing. They do not break MD5-HMAC (which is still considered secure because the attacker does not control the key and cannot extend the hash). Do not conflate hash collision resistance with HMAC security.

Challenge Lab

Reinforce your learning with a hands-on generated challenge based on this card's competency.