Corrupted archive

forensic_file_artifacts Difficulty 1–5 30 min certifiable

Theory

Why This Matters

Corrupted or deliberately modified ZIP archives are a recurring pattern in both incident response and CTF challenges. Malware authors have used intentionally malformed ZIP headers to evade antivirus scanning — an AV engine that cannot parse the archive gives up and passes the file, while a custom decompressor built into the malware correctly handles the non-standard format. In digital forensics, partially overwritten or truncated archives recovered from damaged storage media must be reconstructed before their contents can be examined. Understanding the ZIP format at the binary level is therefore a practical skill for any analyst who works with compressed evidence or inspects suspicious archives.

Core Concept

A ZIP archive consists of three primary regions. First, a sequence of Local File Header (LFH) records, each followed immediately by the compressed file data. Each LFH begins with the magic signature PK\x03\x04 (hex 50 4B 03 04) and encodes: the version needed to extract (2 bytes), a general-purpose bit flag (2 bytes), the compression method (2 bytes; 0 = stored, 8 = deflate), last modified time and date (4 bytes), CRC-32 checksum of the uncompressed data (4 bytes), compressed size (4 bytes), uncompressed size (4 bytes), filename length (2 bytes), extra field length (2 bytes), followed by the filename and extra field bytes. Second, the Central Directory (CD), a list of Central Directory File Headers (PK\x01\x02, hex 50 4B 01 02), each containing a superset of the LFH fields plus a relative offset back to the corresponding LFH. Third, the End of Central Directory (EOCD) record (PK\x05\x06), which gives the offset and size of the entire central directory.
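As a minimal sketch of how the three regions link together, the following Python locates the EOCD, follows its offset to the central directory, and reads each entry's filename and LFH offset. The helper names find_eocd and list_cd_entries are illustrative, the demo archive is built in memory, and the code assumes a single-disk archive without ZIP64 extensions and without an archive comment that happens to contain the EOCD signature.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
CDFH_SIG = b"PK\x01\x02"

def find_eocd(data):
    """Return (cd_offset, cd_size, total_entries) from the last EOCD record, or None."""
    pos = data.rfind(EOCD_SIG)
    if pos < 0 or pos + 22 > len(data):
        return None
    # EOCD: sig(4) disk(2) cd_disk(2) entries_on_disk(2) total_entries(2)
    #       cd_size(4) cd_offset(4) comment_len(2) = 22 bytes
    (_sig, _disk, _cd_disk, _n_disk, total,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4sHHHHIIH", data, pos)
    return cd_offset, cd_size, total

def list_cd_entries(data):
    """Yield (filename, lfh_offset) for each central directory file header."""
    eocd = find_eocd(data)
    if eocd is None:
        return
    pos, _cd_size, total = eocd
    for _ in range(total):
        if data[pos:pos + 4] != CDFH_SIG:
            break  # CD is damaged from this point on
        fname_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        (lfh_offset,) = struct.unpack_from("<I", data, pos + 42)
        name = data[pos + 46:pos + 46 + fname_len].decode(errors="replace")
        yield name, lfh_offset
        pos += 46 + fname_len + extra_len + comment_len

# Hypothetical demo archive, built in memory so the sketch is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
    z.writestr("a.txt", b"hello")
    z.writestr("b.txt", b"world!")
demo = buf.getvalue()

for name, offset in list_cd_entries(demo):
    print(name, offset, demo[offset:offset + 4])
```

Each offset printed should point at a PK\x03\x04 signature; an offset that does not is a strong hint that either the CD or the LFH region has been altered.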

The key forensic insight is that ZIP stores the CRC-32 and size fields twice, once in the LFH and once in the CD, and many utilities never cross-check the two copies. Some tools trust the LFH values, others trust the CD values, and still others check neither as long as the data decompresses without error. For stored (uncompressed) files, one CRC-32 bypass is to set the CRC field to 00 00 00 00 and patch the compressed size to 0; this only works against tools that skip validation when the sizes are zero, so treat it as tool-dependent. When the plaintext is known, the correct CRC can be recomputed regardless of compression method, because the CRC-32 covers the uncompressed data: compute zlib.crc32(plaintext_bytes), patch the CRC field to match, and make sure the size fields are internally consistent.
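One way to make the LFH-versus-CD comparison concrete is sketched below. It lets Python's zipfile read the central directory, then reads the same fields from the raw LFH bytes and reports any disagreement. The function name lfh_cd_mismatches and the sample content are hypothetical, and the sketch assumes the central directory itself is still readable.

```python
import io
import struct
import zipfile

def lfh_cd_mismatches(data):
    """Compare the CRC and size fields of each LFH with the central directory copy."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        for info in z.infolist():
            off = info.header_offset
            sig, flags = struct.unpack_from("<4s2xH", data, off)
            if sig != b"PK\x03\x04":
                problems.append((info.filename, "signature", sig.hex(), "504b0304"))
                continue
            if flags & 0x08:
                # Data descriptor in use: zero CRC/sizes in the LFH are expected.
                continue
            crc, comp, uncomp = struct.unpack_from("<III", data, off + 14)
            for field, lfh_val, cd_val in (
                ("crc32", crc, info.CRC),
                ("comp_size", comp, info.compress_size),
                ("uncomp_size", uncomp, info.file_size),
            ):
                if lfh_val != cd_val:
                    problems.append((info.filename, field, lfh_val, cd_val))
    return problems

# Hypothetical demo: build a clean archive, then zero only the LFH copy of the CRC.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
    z.writestr("flag.txt", b"CTF{example}")
good = buf.getvalue()

bad = bytearray(good)
struct.pack_into("<I", bad, 14, 0)  # LFH starts at offset 0, CRC field at +14

print(lfh_cd_mismatches(good))
print(lfh_cd_mismatches(bytes(bad)))
```

A tool that trusts the CD would still extract the tampered archive cleanly, while one that trusts the LFH would report a CRC error, which is exactly the divergence described above.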

The zip -FF command (Info-ZIP's --fixfix mode) scans the archive for LFH records and attempts to rebuild the central directory from them, which is valuable when the CD region is corrupted or missing (e.g., the file was truncated). The Python zipfile module is a useful second parser, but it locates members through the EOCD and central directory, so it helps when those records are readable and another tool rejects the archive, not when the CD is gone entirely.
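The idea of recovering entries from LFH records alone can be sketched in a few lines of Python. This is not Info-ZIP's actual algorithm, only an illustration of the principle: scan for PK\x03\x04, decode the fixed 30-byte header, and pull out the payload. The name carve_entries is hypothetical, and the sketch handles only stored and deflate entries; stored entries that rely on a data descriptor are listed without a payload.

```python
import io
import struct
import zipfile
import zlib

LFH_SIG = b"PK\x03\x04"

def carve_entries(data):
    """Return [(name, payload_or_None)] using LFH records only, ignoring any CD."""
    entries = []
    pos = data.find(LFH_SIG)
    while pos != -1 and pos + 30 <= len(data):
        (flags, method, _time, _date, _crc, comp_sz,
         _uncomp_sz, name_len, extra_len) = struct.unpack_from("<HHHHIIIHH", data, pos + 6)
        name = data[pos + 30:pos + 30 + name_len].decode(errors="replace")
        start = pos + 30 + name_len + extra_len
        payload, end = None, start
        try:
            if method == 8:
                # Raw deflate streams mark their own end, so sizes are not needed.
                inflater = zlib.decompressobj(-15)
                payload = inflater.decompress(data[start:])
                end = len(data) - len(inflater.unused_data)
            elif method == 0 and not flags & 0x08:
                payload = data[start:start + comp_sz]
                end = start + comp_sz
        except zlib.error:
            payload = None  # compressed data is damaged; keep the name only
        entries.append((name, payload))
        pos = data.find(LFH_SIG, max(end, pos + 4))
    return entries

# Hypothetical demo: build an archive in memory, then cut off the CD and EOCD.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", b"stored bytes", compress_type=zipfile.ZIP_STORED)
    z.writestr("b.txt", b"deflated " * 20, compress_type=zipfile.ZIP_DEFLATED)
full = buf.getvalue()
truncated = full[:full.index(b"PK\x01\x02")]

for name, payload in carve_entries(truncated):
    print(name, None if payload is None else len(payload))
```

Opening the truncated bytes with zipfile.ZipFile raises BadZipFile because the EOCD is gone, while the LFH scan still recovers both members.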

Technical Deep-Dive

import zipfile, struct, zlib

# --- Inspect raw LFH fields from bytes ---
def parse_lfh(data, offset=0):
    (sig, ver_need, flags, method, mod_time, mod_date,
     crc32, comp_sz, uncomp_sz, fname_len, extra_len) = struct.unpack_from(
        "<4sHHHHHIIIHH", data, offset)
    fname = data[offset+30 : offset+30+fname_len]
    return {
        "signature":   sig.hex(),       # should be "504b0304"
        "method":      method,          # 0=stored, 8=deflate
        "crc32":       f"{crc32:#010x}",
        "comp_sz":     comp_sz,
        "uncomp_sz":   uncomp_sz,
        "filename":    fname.decode(errors="replace"),
    }

with open("challenge.zip", "rb") as f:
    data = f.read()

print(parse_lfh(data, 0))   # parse first LFH at offset 0

# --- Patch a CRC-32 mismatch (stored file) ---
# If decompressed content is known, compute correct CRC and patch it:
known_content = b"CTF{...}"    # hypothetical plaintext for testing
correct_crc = zlib.crc32(known_content) & 0xFFFFFFFF
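# The CRC field occupies bytes 14-17 of the LFH. The offset 14 used below
# assumes the LFH starts at file offset 0; the central directory copy of the
# CRC (offset 16 within its header) must be patched separately.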
patched = bytearray(data)
struct.pack_into("<I", patched, 14, correct_crc)
with open("patched.zip", "wb") as f:
    f.write(patched)

# Repair with zip -FF (rebuilds central directory from LFH records)
zip -FF challenge.zip --out repaired.zip
unzip repaired.zip

# zipinfo -v prints the central directory fields; compare them with the LFH values from parse_lfh()
zipinfo -v challenge.zip 2>&1 | grep -E "(method|CRC|size|length|offset)"

# Python zipfile as a second parser (it still needs a readable central directory)
python3 -c "
import zipfile
with zipfile.ZipFile('challenge.zip') as z:
    for name in z.namelist():
        print(name, z.getinfo(name).file_size)
    z.extractall('output/')
"

Analytical Methodology

Initial parse attempt: run unzip -t challenge.zip to test integrity. Note the specific error (bad CRC, invalid compressed data, missing central directory, or unexpected end of file). Different errors point to different corruption types.
zipinfo verbose dump: zipinfo -v challenge.zip prints the central directory fields for every entry. Compare the CRC-32, compressed size, and uncompressed size it reports with the LFH values read from the raw bytes. Discrepancies indicate targeted corruption.
Hex inspection of headers: run xxd challenge.zip | head -20 and verify 50 4B 03 04 at offset 0. If the magic bytes are partially overwritten, determine which bytes were changed and patch them back.
Attempt repair: run zip -FF challenge.zip --out repaired.zip. If the CD is missing but the LFHs are intact, this reconstructs a valid archive.
Python zipfile fallback: if zip -FF fails, try the Python zipfile module, which has its own parser and sometimes succeeds where command-line tools fail.
CRC patch for stored files: if the file is stored uncompressed (method = 0) and only the CRC is wrong, the content bytes are directly accessible. Extract them with dd (offset = LFH start + 30 + fname_len + extra_len) and verify the content manually.
Distinguish intentional from accidental: accidental corruption (damaged media) typically affects a contiguous byte range, corrupting multiple sequential fields. Intentional corruption (CTF/forensics) often changes exactly one field (just the magic bytes, just the CRC, or just a size field) while leaving everything else valid.
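The offset arithmetic for stored files can also be done in Python instead of dd. The sketch below applies the same formula (LFH start + 30 + fname_len + extra_len), slices out the payload, and reports both the CRC recorded in the LFH and the CRC of the bytes actually present. The function name stored_payload, the sample content, and the bogus CRC value are hypothetical.

```python
import io
import struct
import zipfile
import zlib

def stored_payload(data, lfh_offset=0):
    """Return (payload, lfh_crc, actual_crc) for a stored (method 0) entry."""
    (method,) = struct.unpack_from("<H", data, lfh_offset + 8)
    if method != 0:
        raise ValueError("entry is not stored; the raw bytes are compressed")
    lfh_crc, comp_sz = struct.unpack_from("<II", data, lfh_offset + 14)
    fname_len, extra_len = struct.unpack_from("<HH", data, lfh_offset + 26)
    start = lfh_offset + 30 + fname_len + extra_len
    payload = data[start:start + comp_sz]
    return payload, lfh_crc, zlib.crc32(payload)

# Hypothetical demo: a stored entry whose LFH CRC has been overwritten.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
    z.writestr("flag.txt", b"CTF{example}")
damaged = bytearray(buf.getvalue())
struct.pack_into("<I", damaged, 14, 0x12345678)  # bogus CRC in the LFH

payload, lfh_crc, actual_crc = stored_payload(bytes(damaged))
print(payload, f"{lfh_crc:#010x}", f"{actual_crc:#010x}")
```

If the two CRC values differ but the payload looks sensible, the corruption is most likely confined to the header field rather than the data.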

Common Analytical Errors

Using only unzip: unzip and 7z have slightly different parsers and different tolerances for malformed headers. If one rejects the archive, try the other before concluding the file is unrecoverable.
Patching the CD but not the LFH: the LFH and central directory both store CRC-32 and size fields. Patching only one location may fix the tool that reads that location but break tools that read the other.
Ignoring the bit 3 flag: general-purpose bit flag bit 3 (data descriptor present) means the CRC and sizes are written in a data descriptor record after the compressed data, and the corresponding LFH fields are set to zero. Tools that do not handle this flag will mis-parse the sizes.
Wrong endianness when patching: ZIP uses little-endian byte order for all multi-byte integer fields. A CRC-32 of 0xDEADBEEF must be written as EF BE AD DE in the file.
Missing the password-protected case: a "corrupted" ZIP that has encryption flags set is not corrupted; it is encrypted. Bit 0 of the general-purpose flag being set means the entry is encrypted, and the CRC check happens after decryption.
Truncated file vs. corrupted CD: if the file was truncated (missing bytes at the end), the EOCD is absent and zip -FF may produce incorrect offsets. In this case, parse the LFH records manually to determine what content survives.
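Two of these errors, endianness and the flag bits, are easy to check with a few lines of Python before patching anything. The sketch below is illustrative only; the helper name describe_flags is hypothetical and it decodes just the two bits discussed here.

```python
import struct

def describe_flags(flags):
    """Decode the two general-purpose flag bits discussed in this card."""
    return {
        "encrypted": bool(flags & 0x0001),        # bit 0
        "data_descriptor": bool(flags & 0x0008),  # bit 3
    }

# Little-endian packing: the least significant byte is written first.
crc_bytes = struct.pack("<I", 0xDEADBEEF)
print(crc_bytes.hex(" "))  # ef be ad de

# Reading the value back from file bytes uses the same "<I" format.
(value,) = struct.unpack("<I", bytes.fromhex("efbeadde"))
print(f"{value:#010x}")    # 0xdeadbeef

print(describe_flags(0x0009))
```

Checking the flags first avoids wasting time on CRC patching when the entry is simply encrypted or uses a data descriptor.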

NICE Framework Alignment

Code	Knowledge/Skill/Task Statement	How This Card Develops It
K0118	Knowledge of file format structures and data encoding standards	Provides byte-level understanding of ZIP LFH and central directory record formats
S0065	Skill in identifying and extracting data of forensic interest from file artifacts	Practises extracting content from corrupted archives through patching, repair tools, and manual carving
S0068	Skill in using binary analysis tools to examine file content	Develops proficiency with zipinfo, xxd, zip -FF, and Python struct for binary analysis
T0075	Task: Analyse forensic images to recover data	Exercises data recovery methodology on intentionally or accidentally corrupted archive files