ZIP archive forensics
Theory
Why This Matters
ZIP archives are pervasive in software distribution, document packaging, and evidence collection — virtually every OOXML office document (.docx, .xlsx, .pptx) is a ZIP archive internally. Forensic analysts have recovered incriminating metadata from ZIP files that witnesses believed were "just compressed files": creation timestamps revealing time-zone offsets that contradicted alibis, comment fields containing author notes, and password-protected entries that were enumerated (even when their content remained encrypted) to establish the file inventory a suspect had prepared. In malware analysis, ZIP metadata has been used to fingerprint malware-building toolchains, since automated packers leave characteristic extra-field signatures.
Core Concept
The ZIP format stores metadata redundantly across two structures: the Local File Header (LFH) and the Central Directory File Header (CDFH), both described in Card 4. Beyond the fields common to both, the CDFH includes an external file attributes field (4 bytes) encoding Unix or MS-DOS permissions, a version made by field identifying the host system and the ZIP specification version supported by the tool that created the archive, and, crucially, a variable-length comment field per entry. A separate archive-level comment is stored at the end of the End of Central Directory (EOCD) record (up to 65,535 bytes) and is readable with unzip -z or zipinfo -z.
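To make the EOCD layout concrete, here is a minimal sketch that locates the record by its signature and reads the archive comment without going through a high-level API. The helper name `read_eocd` is hypothetical, and the sketch assumes a single-disk, non-ZIP64 archive whose comment does not itself contain the EOCD signature.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # End of Central Directory signature

def read_eocd(data: bytes) -> dict:
    """Locate the last EOCD record and return its key fields plus the comment."""
    pos = data.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("no EOCD signature found")
    # Fixed fields after the 4-byte signature: disk numbers, entry counts,
    # central directory size and offset, archive comment length (18 bytes).
    (_disk, _cd_disk, _n_disk, n_total,
     cd_size, cd_offset, comment_len) = struct.unpack_from("<HHHHIIH", data, pos + 4)
    comment = data[pos + 22 : pos + 22 + comment_len]
    return {"entries": n_total, "cd_size": cd_size,
            "cd_offset": cd_offset, "comment": comment}

# Build a small archive in memory so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", "hello")
    z.comment = b"note for the analyst"

eocd = read_eocd(buf.getvalue())
print(eocd["entries"], eocd["comment"])
```

Searching backwards for the signature is the usual heuristic, but a crafted comment can contain a fake signature, so a careful audit would check that the comment length it finds is consistent with the end of the file.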
The extra field in both LFH and CDFH can contain arbitrary sub-records in a TLV (Tag-Length-Value) format, each sub-record prefixed by a 2-byte tag and 2-byte length. Common extra-field tags include 0x5455 (extended timestamp), 0x7875 (Unix UID/GID), and 0x0001 (ZIP64 extended information). A challenge designer can store flag data in any of these sub-records; most tools display only well-known tags and silently skip unknown ones.
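The TLV layout can be sketched as a small iterator plus a decoder for one well-known tag. The names `iter_extra` and `describe` are hypothetical, the tag 0xCAFE is made up for illustration, and the extended-timestamp payload is assumed to follow the Info-ZIP layout (one flags byte, then a 4-byte Unix modification time when bit 0 is set).

```python
import struct
from datetime import datetime, timezone

def iter_extra(extra: bytes):
    """Yield (tag, payload) for each TLV sub-record; stop on a truncated record."""
    idx = 0
    while idx + 4 <= len(extra):
        tag, length = struct.unpack_from("<HH", extra, idx)
        payload = extra[idx + 4 : idx + 4 + length]
        if len(payload) < length:  # declared length runs past the end of the field
            break
        yield tag, payload
        idx += 4 + length

def describe(tag: int, payload: bytes) -> str:
    """Interpret the extended timestamp; fall back to raw hex for anything else."""
    if tag == 0x5455 and len(payload) >= 5 and payload[0] & 1:
        (mtime,) = struct.unpack_from("<i", payload, 1)
        return "mtime " + datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
    return "raw " + payload.hex()

# Hypothetical extra field: one extended-timestamp record and one made-up tag.
extra = (struct.pack("<HHBi", 0x5455, 5, 0x01, 1_700_000_000)
         + struct.pack("<HH", 0xCAFE, 4) + b"flag")
records = [(hex(tag), describe(tag, payload)) for tag, payload in iter_extra(extra)]
print(records)
```

Falling back to raw hex for every unrecognized tag is the point of the exercise: nothing is skipped just because the parser does not know the tag.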
A less-known attack surface is "ghost" entries — entries in the central directory that have a filename consisting of null bytes or a zero-length filename. Some ZIP parsers (including some versions of the Python zipfile module) silently skip entries with empty names, while others extract them to files named oddly. Similarly, entries with null byte-terminated filenames may be listed differently by different tools, concealing the true entry count.
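One way to check for such entries is to walk the central directory by hand and look at the raw filename bytes, as in the sketch below. The helper `raw_cd_names` is hypothetical, ZIP64 and multi-disk archives are not handled, and the null-byte name is produced by a deliberately crude one-byte patch of an archive written by Python's zipfile module.

```python
import io
import struct
import zipfile

CDFH_SIG = b"PK\x01\x02"
EOCD_SIG = b"PK\x05\x06"

def raw_cd_names(data: bytes) -> list:
    """Walk the central directory and return each entry's raw filename bytes."""
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("no EOCD record found")
    declared, _cd_size, cd_offset = struct.unpack_from("<HII", data, eocd + 10)
    names, pos = [], cd_offset
    while data[pos : pos + 4] == CDFH_SIG:
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        names.append(data[pos + 46 : pos + 46 + name_len])
        pos += 46 + name_len + extra_len + comment_len
    if len(names) != declared:
        print(f"EOCD declares {declared} entries, walk found {len(names)}")
    return names

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("visible.txt", "hello")
    z.writestr("X", "hidden")
blob = bytearray(buf.getvalue())
# The last central directory entry has no extra field or comment here, so its
# one-byte name sits immediately before the EOCD record. Overwrite it with NUL.
blob[blob.rfind(EOCD_SIG) - 1] = 0
names = raw_cd_names(bytes(blob))
print(names)
```

Comparing this raw list with the names reported by a high-level tool shows where the tool has normalized, truncated, or dropped an entry.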
Password-protected entries in a ZIP can be enumerated (listed) even without the password — the central directory reveals the filename, size, and compression method of every entry regardless of encryption status. Only the content bytes are protected.
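The sketch below illustrates that listing behaviour. Because Python's zipfile module cannot create encrypted archives, it imitates one by setting general-purpose flag bit 0 in the headers of a plain archive; this is an assumption made purely for demonstration and is not real encryption.

```python
import io
import struct
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("secret.txt", "not actually encrypted")

blob = bytearray(buf.getvalue())
eocd = blob.rfind(b"PK\x05\x06")
(cd_offset,) = struct.unpack_from("<I", blob, eocd + 16)
blob[6] |= 1              # flag word of the local file header at offset 0
blob[cd_offset + 8] |= 1  # flag word of the central directory header

with zipfile.ZipFile(io.BytesIO(bytes(blob))) as z:
    # The inventory comes from the central directory and needs no password.
    inventory = [(i.filename, i.file_size, i.compress_type, bool(i.flag_bits & 1))
                 for i in z.infolist()]
    try:
        z.read("secret.txt")
        readable = True
    except RuntimeError:  # zipfile refuses to read flagged entries without a password
        readable = False

print(inventory, readable)
```

The name, size, and compression method are all available even though the read attempt is refused.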
Technical Deep-Dive
import struct
import zipfile

def full_zip_audit(path):
    with zipfile.ZipFile(path) as z:
        print(f"Archive comment: {z.comment!r}")
        print(f"Total entries: {len(z.infolist())}")
        for info in z.infolist():
            print(f"\n--- {info.filename!r} ---")
            print(f" Comment: {info.comment!r}")
            print(f" Created by: {info.create_system} v{info.create_version}")
            print(f" Compressed: {info.compress_size} bytes (method {info.compress_type})")
            print(f" Uncompressed: {info.file_size} bytes")
            print(f" CRC-32: {info.CRC:#010x}")
            print(f" Date/time: {info.date_time}")
            print(f" Flag bits: {info.flag_bits:#06x}")
            print(f" Encrypted: {bool(info.flag_bits & 1)}")
            # Parse extra field manually for unknown sub-records
            extra = info.extra
            idx = 0
            while idx + 4 <= len(extra):
                tag, length = struct.unpack_from("<HH", extra, idx)
                data = extra[idx+4 : idx+4+length]
                print(f" Extra[{tag:#06x}] len={length}: {data.hex()}")
                idx += 4 + length

full_zip_audit("challenge.zip")
# Command-line metadata inspection
zipinfo -v challenge.zip # verbose: all header fields per entry
unzip -v challenge.zip # list with CRC, method, ratio
unzip -z challenge.zip # print archive comment
unzip -p challenge.zip .zipcomment 2>/dev/null # some tools name it
# List ALL entries including potentially hidden ones
python3 -c "
import zipfile
with zipfile.ZipFile('challenge.zip') as z:
    for i, info in enumerate(z.infolist()):
        print(i, repr(info.filename), info.file_size, bool(info.flag_bits & 1))
"
# Enumerate entries even if archive comment or CD is unusual
binwalk -e challenge.zip # extracts all detected sub-files
Analytical Methodology
- Archive-level comment: `unzip -z challenge.zip` or `python -c "import zipfile; print(zipfile.ZipFile('challenge.zip').comment)"`. Archive comments are frequently overlooked and commonly used as flag slots.
- Entry enumeration with Python: Iterate `ZipFile.infolist()` and print every entry's `filename`, `comment`, and `extra` bytes. Count entries; compare with what `unzip -l` displays (some tools hide entries with empty or null names).
- Extra-field parsing: Print the raw hex of each entry's extra field and parse TLV sub-records. APPNOTE reserves header IDs 0 through 31 for PKWARE and leaves the rest to third-party vendors, so any tag that does not appear in the published lists of known IDs may be application-specific and may encode data.
- Encrypted entry inventory: Note which entries are encrypted (`flag_bits & 1`). The filenames are still readable. Try an empty passphrase and obvious candidates before brute-forcing.
- Timestamp analysis: ZIP stores modification time in MS-DOS format (local time, 2-second resolution) and optionally a Unix timestamp (UTC) in the extra field. A difference between the two that is not explained by a plausible time-zone offset, or a date in the future, can indicate tampering.
- Version made by field: The high byte encodes the host system (0 = MS-DOS/FAT, 3 = Unix, 10 = Windows NTFS). The low byte encodes the ZIP specification version supported by the creating tool (e.g., 20 = 2.0). Unexpected combinations (e.g., a claimed Unix archive with Windows paths) are anomalies worth investigating.
- Diff between tools: Compare `zipinfo -v` output with Python `zipfile` output. Any entry or field that appears in one but not the other is a candidate for the hidden data.
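The timestamp and version checks in the list above can be sketched as two small decoders. The function names are hypothetical, and the host-system table is only a subset of the codes listed in APPNOTE.

```python
# Subset of the host-system codes from APPNOTE (not exhaustive).
HOST_OS = {0: "MS-DOS/FAT", 3: "Unix", 10: "Windows NTFS", 19: "OS X (Darwin)"}

def decode_version_made_by(value: int) -> tuple:
    """Split the 16-bit field into (host system, ZIP spec version string)."""
    host, spec = value >> 8, value & 0xFF
    return HOST_OS.get(host, f"unknown ({host})"), f"{spec // 10}.{spec % 10}"

def decode_dos_datetime(dos_date: int, dos_time: int) -> tuple:
    """Unpack the 16-bit MS-DOS date and time words into (Y, M, D, h, m, s)."""
    return (
        1980 + (dos_date >> 9), (dos_date >> 5) & 0x0F, dos_date & 0x1F,
        dos_time >> 11, (dos_time >> 5) & 0x3F, (dos_time & 0x1F) * 2,
    )

# Example values: Unix host with spec version 2.0, and 2024-03-15 13:37:42.
dos_date = ((2024 - 1980) << 9) | (3 << 5) | 15
dos_time = (13 << 11) | (37 << 5) | (42 // 2)
print(decode_version_made_by(0x0314))
print(decode_dos_datetime(dos_date, dos_time))
```

Python's zipfile already exposes the same information as `create_system`, `create_version`, and `date_time`; decoding the raw words by hand is mainly useful when cross-checking what a tool reports against the bytes in the file.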
Common Analytical Errors
- Stopping at `unzip -l`: The standard listing shows only length, date, time, and name. It does not show extra fields, encryption flags, or version metadata, and not every tool prints comments in its short listing. Use `zipinfo -v` and Python for a complete picture.
- Missing the archive comment: Do not rely on a short listing to surface the archive comment. Read it explicitly with `-z` or Python's `.comment` attribute. This is one of the most common flag locations in ZIP metadata challenges.
- Not checking extra fields: Tools print human-readable interpretations of well-known extra-field sub-records but may skip unknown ones. Raw hex inspection with Python is required to catch custom sub-records.
- Overlooking zero-length entries: An entry with a zero-byte file size and a meaningful comment or extra field is easily missed when visually scanning a large listing. Iterate programmatically.
- Wrong encoding for comment text: ZIP does not mandate an encoding for comment strings. A comment that appears as garbage in UTF-8 may be Latin-1, UTF-16LE, or may be Base64 or hex that needs decoding.
- Assuming the listing is exhaustive: The central directory can contain entries that overlap in the LFH address space, or entries that point beyond the end of file. A forensically complete audit reads the CD byte-by-byte, not through a high-level API.
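For the encoding pitfall in the list above, a minimal sketch is to try several interpretations of the raw comment bytes and keep the ones that decode cleanly. The helper name `decode_candidates` and the sample comment are hypothetical.

```python
import base64
import binascii

def decode_candidates(raw: bytes) -> dict:
    """Try several interpretations of a comment; keep those that decode cleanly."""
    out = {}
    for enc in ("utf-8", "utf-16-le", "latin-1"):
        try:
            out[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            pass
    text = raw.strip()
    try:
        out["base64"] = base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        pass
    try:
        out["hex"] = bytes.fromhex(text.decode("ascii"))
    except ValueError:  # also covers UnicodeDecodeError for non-ASCII input
        pass
    return out

# Hypothetical comment bytes: a Base64-encoded string.
raw = base64.b64encode(b"flag{zip}")
result = decode_candidates(raw)
for name, value in result.items():
    print(name, repr(value))
```

Latin-1 decodes any byte sequence, so its presence in the result says nothing by itself; the useful signal is which of the stricter interpretations succeed and whether their output is readable.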
NICE Framework Alignment
| Code | Knowledge/Skill/Task Statement | How This Card Develops It |
|---|---|---|
| K0082 | Knowledge of file format standards and embedded metadata techniques | Teaches ZIP central directory structure, extra-field TLV encoding, and archive/entry comment fields |
| K0118 | Knowledge of file format structures and forensic artefacts | Connects ZIP internal structures to forensic observables: timestamps, version fields, encryption flags |
| S0065 | Skill in identifying and extracting data of forensic interest from file artifacts | Practises systematic enumeration of all metadata layers in a ZIP archive using multiple tools |
| T0048 | Task: Perform file system forensic analysis | Applies forensic audit methodology to ZIP archives to uncover concealed entries and metadata |
Further Reading
- ZIP File Format Specification 6.3.10 — PKWARE Inc. (APPNOTE.TXT)
- Metadata Forensics: Recovering Evidence from File Archives — SANS Digital Forensics and Incident Response
- Python zipfile Module Documentation — Python Software Foundation
- Anti-Forensics and Archive Manipulation — Forensic Focus Journal