ZIP archive forensics

logic_reasoning Difficulty 1–5 30 min certifiable

Theory

Why This Matters

ZIP archives are pervasive in software distribution, document packaging, and evidence collection — virtually every OOXML office document (.docx, .xlsx, .pptx) is a ZIP archive internally. Forensic analysts have recovered incriminating metadata from ZIP files that witnesses believed were "just compressed files": creation timestamps revealing time-zone offsets that contradicted alibis, comment fields containing author notes, and password-protected entries that were enumerated (even when their content remained encrypted) to establish the file inventory a suspect had prepared. In malware analysis, ZIP metadata has been used to fingerprint malware-building toolchains, since automated packers leave characteristic extra-field signatures.

Core Concept

The ZIP format stores metadata redundantly across two structures: the Local File Header (LFH) and the Central Directory File Header (CDFH), both described in Card 4. Beyond the fields common to both, the CDFH includes an external file attributes field (4 bytes) encoding Unix or MS-DOS permissions, a version made by field identifying the OS and tool that created the archive, and, crucially, a variable-length comment field per entry. A separate archive-level comment is stored at the end of the EOCD record (up to 65,535 bytes) and is readable with unzip -z or zipinfo -z. Entry-level comments are stored within each CDFH.
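The EOCD layout described above can be checked by hand. The following is a minimal sketch, assuming a single-disk, non-ZIP64 archive whose bytes are already in memory; read_eocd is a hypothetical helper name for illustration, not part of any library.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # End of Central Directory signature
EOCD_FIXED = 22           # bytes in the fixed part, before the comment

def read_eocd(data: bytes) -> dict:
    """Decode the last EOCD record found in raw archive bytes."""
    # The EOCD is the final record, so search backwards. A comment that
    # itself contains the signature would fool this simple search.
    pos = data.rfind(EOCD_SIG)
    if pos < 0 or pos + EOCD_FIXED > len(data):
        raise ValueError("no complete EOCD record found")
    (_sig, _disk, _cd_disk, _entries_on_disk, total_entries,
     cd_size, cd_offset, comment_len) = struct.unpack_from("<4sHHHHIIH", data, pos)
    start = pos + EOCD_FIXED
    return {
        "total_entries": total_entries,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment": data[start:start + comment_len],
    }

if __name__ == "__main__":
    # Build a throwaway archive in memory so the sketch runs without input files.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("a.txt", b"hello")
        z.comment = b"demo archive comment"
    print(read_eocd(buf.getvalue()))
```

Reading the record directly is useful when a high-level library refuses to open a damaged archive but the trailing comment is still intact.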

The extra field in both LFH and CDFH can contain arbitrary sub-records in a TLV (Tag-Length-Value) format, each sub-record prefixed by a 2-byte tag and 2-byte length. Common extra-field tags include 0x5455 (extended timestamp), 0x7875 (Unix UID/GID), and 0x0001 (ZIP64 extended information). A challenge designer can store flag data in any of these sub-records; most tools display only well-known tags and silently skip unknown ones.
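As an illustration of the TLV layout, here is a small sketch that walks an extra field and separates well-known tags from everything else. The KNOWN_TAGS table only lists the three tags named above and is an assumption for the example; a real audit would use a fuller registry. The function names are hypothetical.

```python
import struct

# Only the tags mentioned in the text; deliberately incomplete.
KNOWN_TAGS = {
    0x0001: "ZIP64 extended information",
    0x5455: "extended timestamp",
    0x7875: "Unix UID/GID",
}

def iter_extra_subrecords(extra: bytes):
    """Yield (tag, payload) for each TLV sub-record in an extra field."""
    idx = 0
    while idx + 4 <= len(extra):
        tag, length = struct.unpack_from("<HH", extra, idx)
        payload = extra[idx + 4 : idx + 4 + length]
        if len(payload) < length:
            break  # declared length runs past the end: stop at truncated data
        yield tag, payload
        idx += 4 + length

def unknown_subrecords(extra: bytes):
    """Return the sub-records whose tag is not in KNOWN_TAGS."""
    return [(tag, payload) for tag, payload in iter_extra_subrecords(extra)
            if tag not in KNOWN_TAGS]

if __name__ == "__main__":
    # One extended-timestamp record followed by a made-up private tag 0xCAFE.
    sample = struct.pack("<HHBI", 0x5455, 5, 1, 0) + struct.pack("<HH", 0xCAFE, 4) + b"flag"
    for tag, payload in unknown_subrecords(sample):
        print(f"{tag:#06x} {payload!r}")
```

For a real archive, the same functions can be applied to info.extra for each entry returned by ZipFile.infolist().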

A lesser-known attack surface is "ghost" entries: central directory entries whose filename is zero-length or consists only of null bytes. Some ZIP parsers (including some versions of the Python zipfile module) silently skip entries with empty names, while others extract them to oddly named files. Similarly, entries whose filenames contain a null byte, which some tools treat as a terminator, may be listed differently by different tools, concealing the true entry count.
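A quick way to surface such entries is to inspect names programmatically instead of reading a listing. The sketch below relies on orig_filename, an undocumented CPython attribute of ZipInfo that holds the name before zipfile truncates it at the first null byte; treat that as an implementation detail that may not exist elsewhere. find_ghost_entries is a hypothetical helper name.

```python
import zipfile

def find_ghost_entries(infos):
    """Return (index, reason) pairs for entries whose names look suspicious."""
    findings = []
    for i, info in enumerate(infos):
        # Fall back to the public filename if orig_filename is unavailable.
        raw_name = getattr(info, "orig_filename", info.filename)
        if raw_name == "":
            findings.append((i, "empty name"))
        elif "\x00" in raw_name:
            findings.append((i, "embedded null byte"))
    return findings

if __name__ == "__main__":
    # ZipInfo objects built in memory stand in for a real archive's entries.
    demo = [
        zipfile.ZipInfo("a.txt"),
        zipfile.ZipInfo(""),
        zipfile.ZipInfo("b.txt\x00.hidden"),
    ]
    print(find_ghost_entries(demo))
```

For a real archive, pass z.infolist() from an open ZipFile and compare the number of findings with what other tools list.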

Password-protected entries in a ZIP can be enumerated (listed) even without the password — the central directory reveals the filename, size, and compression method of every entry regardless of encryption status. Only the content bytes are protected.
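The inventory step can be sketched in a few lines. Because the zipfile module cannot write encrypted entries, the demo below simulates one by setting the encryption bit in the central directory of an in-memory archive; the content is not actually encrypted, and encrypted_inventory is a hypothetical helper name.

```python
import io
import zipfile

def encrypted_inventory(zf: zipfile.ZipFile):
    """List (name, uncompressed size, method) for entries flagged as encrypted."""
    return [
        (info.filename, info.file_size, info.compress_type)
        for info in zf.infolist()
        if info.flag_bits & 0x1  # bit 0 of the general-purpose flags
    ]

if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("secret.txt", b"top secret")
    data = bytearray(buf.getvalue())
    # Flags sit 8 bytes into the central directory header (after the
    # signature, version made by, and version needed fields).
    cd = data.find(b"PK\x01\x02")
    data[cd + 8] |= 0x01
    with zipfile.ZipFile(io.BytesIO(bytes(data))) as z:
        print(encrypted_inventory(z))
```

Listing works without a password because only the central directory is read; attempting z.read() on a genuinely encrypted entry would still require one.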

Technical Deep-Dive

import zipfile, struct

def full_zip_audit(path):
    with zipfile.ZipFile(path) as z:
        print(f"Archive comment: {z.comment!r}")
        print(f"Total entries: {len(z.infolist())}")
        for info in z.infolist():
            print(f"\n--- {info.filename!r} ---")
            print(f"  Comment:       {info.comment!r}")
            print(f"  Created by:    {info.create_system} v{info.create_version}")
            print(f"  Compressed:    {info.compress_size} bytes (method {info.compress_type})")
            print(f"  Uncompressed:  {info.file_size} bytes")
            print(f"  CRC-32:        {info.CRC:#010x}")
            print(f"  Date/time:     {info.date_time}")
            print(f"  Flag bits:     {info.flag_bits:#06x}")
            print(f"  Encrypted:     {bool(info.flag_bits & 1)}")
            # Parse extra field manually for unknown sub-records
            extra = info.extra
            idx = 0
            while idx + 4 <= len(extra):
                tag, length = struct.unpack_from("<HH", extra, idx)
                data = extra[idx+4 : idx+4+length]
                print(f"  Extra[{tag:#06x}] len={length}: {data.hex()}")
                idx += 4 + length

full_zip_audit("challenge.zip")

# Command-line metadata inspection
zipinfo -v challenge.zip         # verbose: all header fields per entry
unzip -v challenge.zip           # list with CRC, method, ratio
unzip -z challenge.zip           # print archive comment
unzip -p challenge.zip .zipcomment 2>/dev/null  # some tools name it

# List ALL entries including potentially hidden ones
python3 -c "
import zipfile
with zipfile.ZipFile('challenge.zip') as z:
    for i, info in enumerate(z.infolist()):
        print(i, repr(info.filename), info.file_size, bool(info.flag_bits & 1))
"

# Enumerate entries even if archive comment or CD is unusual
binwalk -e challenge.zip   # extracts all detected sub-files

Analytical Methodology

Archive-level comment — unzip -z challenge.zip or python -c "import zipfile; print(zipfile.ZipFile('f.zip').comment)". Archive comments are frequently overlooked and commonly used as flag slots.
Entry enumeration with Python — Iterate ZipFile.infolist() and print every entry's filename, comment, and extra bytes. Count entries; compare with what unzip -l displays (some tools hide entries with empty or null names).
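Extra-field parsing — Print the raw hex of each entry's extra field and parse TLV sub-records. PKWARE reserves header IDs 0x0000 through 0x001F; any other ID that does not match a registered third-party tag (such as 0x5455 or 0x7875) is private or application-specific and may encode data.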
Encrypted entry inventory — Note which entries are encrypted (flag_bits & 1). The filenames are still readable. Try empty passphrase and obvious candidates before bruteforcing.
Timestamp analysis — ZIP stores modification time in MS-DOS format (local time, 2-second resolution) and optionally a Unix timestamp (UTC) in the extra field. A difference that matches a plausible time-zone offset is expected; any other discrepancy between the two timestamps, or a future date, suggests tampering or a misconfigured clock.
Version made by field — The high byte encodes the creating OS (0 = MS-DOS/FAT, 3 = Unix, 10 = Windows NTFS). The low byte encodes the ZIP specification version supported by the creating tool (e.g., 20 = 2.0). Unexpected values (e.g., a claimed Unix archive with Windows paths) are anomalies worth investigating.
Diff between tools — Compare zipinfo -v output with python zipfile output. Any entry or field that appears in one but not the other is a candidate for the hidden data.
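The timestamp comparison in the steps above can be sketched as follows. This assumes the entry carries a 0x5455 extended-timestamp sub-record whose first flag bit marks a 4-byte modification time; the function names are hypothetical, and the result is a raw gap in seconds that the analyst still has to interpret.

```python
import struct
from datetime import datetime, timezone

def extended_mtime(extra: bytes):
    """Return the UTC mtime from a 0x5455 sub-record, or None if absent."""
    idx = 0
    while idx + 4 <= len(extra):
        tag, length = struct.unpack_from("<HH", extra, idx)
        payload = extra[idx + 4 : idx + 4 + length]
        # Flag bit 0 means a modification time follows the flags byte.
        if tag == 0x5455 and len(payload) >= 5 and payload[0] & 1:
            (secs,) = struct.unpack_from("<i", payload, 1)
            return datetime.fromtimestamp(secs, tz=timezone.utc)
        idx += 4 + length
    return None

def timestamp_gap(date_time, extra: bytes):
    """Seconds by which the DOS (local) time is ahead of the Unix (UTC) mtime."""
    unix_mtime = extended_mtime(extra)
    if unix_mtime is None:
        return None
    dos_naive = datetime(*date_time)  # ZipInfo.date_time is a 6-tuple
    return (dos_naive - unix_mtime.replace(tzinfo=None)).total_seconds()

if __name__ == "__main__":
    sample_extra = struct.pack("<HHBi", 0x5455, 5, 1, 1700000000)
    print(timestamp_gap((2023, 11, 15, 0, 13, 20), sample_extra))
```

A gap that is a whole number of quarter hours within roughly plus or minus 14 hours is consistent with a time-zone offset (allowing a second or two for DOS rounding); other values deserve a closer look.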

Common Analytical Errors

Stopping at unzip -l — The standard listing shows only each entry's length, date, time, and name. It does not show CRCs, extra fields, encryption flags, or version metadata. Use zipinfo -v and Python for a complete picture.
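Missing the archive comment — Info-ZIP's unzip -l prints the archive comment above the file table, where it is easy to skim past, and iterating infolist() in Python never shows it. Check it explicitly with unzip -z or Python's .comment attribute. This is one of the most common flag locations in ZIP metadata challenges.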
Not checking extra fields — Tools print human-readable interpretations of well-known extra-field sub-records but skip unknown ones. Raw hex inspection with Python is required to catch custom sub-records.
Overlooking zero-length entries — An entry with a zero-byte file size and a meaningful comment or extra field is easily missed when visually scanning a large listing. Iterate programmatically.
Wrong encoding for comment text — ZIP does not mandate an encoding for comment strings. A comment that appears as garbage in UTF-8 may be Latin-1, UTF-16LE, or may be Base64 or hex that needs decoding.
Assuming the listing is exhaustive — The central directory can contain entries that overlap in the LFH address space, or entries that point beyond the end of file. A forensically complete audit reads the CD byte-by-byte, not through a high-level API.
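To illustrate what reading the central directory directly looks like, here is a minimal sketch that walks CDFH records from raw bytes and checks that each local header offset actually points at an LFH signature. It assumes a single-disk, non-ZIP64 archive and trusts the EOCD for the directory location; walk_central_directory is a hypothetical name.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
CDFH_SIG = b"PK\x01\x02"
LFH_SIG = b"PK\x03\x04"
CDFH_FMT = "<4s6H3I5H2I"               # 46-byte fixed part of a CDFH
CDFH_SIZE = struct.calcsize(CDFH_FMT)

def walk_central_directory(data: bytes):
    """Yield one dict per CDFH record, read directly from raw bytes."""
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0 or eocd + 22 > len(data):
        raise ValueError("no complete EOCD record found")
    # Central directory size and offset sit 12 bytes into the EOCD.
    cd_size, cd_offset = struct.unpack_from("<II", data, eocd + 12)
    pos, end = cd_offset, min(cd_offset + cd_size, len(data))
    while pos + CDFH_SIZE <= end:
        (sig, _made_by, _needed, flags, _method, _mtime, _mdate, _crc,
         _csize, _usize, name_len, extra_len, comment_len,
         _disk, _int_attr, _ext_attr, lfh_offset) = struct.unpack_from(CDFH_FMT, data, pos)
        if sig != CDFH_SIG:
            break  # not a CDFH: stop rather than guess
        name_start = pos + CDFH_SIZE
        extra_start = name_start + name_len
        comment_start = extra_start + extra_len
        yield {
            "name": data[name_start:extra_start],  # raw bytes, nulls preserved
            "extra": data[extra_start:comment_start],
            "comment": data[comment_start:comment_start + comment_len],
            "flags": flags,
            "lfh_offset": lfh_offset,
            "lfh_ok": data[lfh_offset:lfh_offset + 4] == LFH_SIG,
        }
        pos = comment_start + comment_len

if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("a.txt", b"alpha")
    for record in walk_central_directory(buf.getvalue()):
        print(record["name"], record["lfh_offset"], record["lfh_ok"])
```

Comparing the records yielded here with ZipFile.infolist() is one way to spot entries that a high-level API drops, and duplicate lfh_offset values point to overlapping entries.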

NICE Framework Alignment

Code	Knowledge/Skill/Task Statement	How This Card Develops It
K0082	Knowledge of file format standards and embedded metadata techniques	Teaches ZIP central directory structure, extra-field TLV encoding, and archive/entry comment fields
K0118	Knowledge of file format structures and forensic artefacts	Connects ZIP internal structures to forensic observables: timestamps, version fields, encryption flags
S0065	Skill in identifying and extracting data of forensic interest from file artifacts	Practises systematic enumeration of all metadata layers in a ZIP archive using multiple tools
T0048	Task: Perform file system forensic analysis	Applies forensic audit methodology to ZIP archives to uncover concealed entries and metadata