Browse CTFs New CTF Sign in

XLSX cell comment forensics

network_forensics_pcap Difficulty 1–5 30 min certifiable

Theory

Why This Matters

Excel comments and embedded drawing objects represent a persistent forensic blind spot. During the 2016 Bangladesh Bank heist investigation, forensic analysts examined numerous XLSX files for evidence of transaction manipulation; comments and drawing metadata were among the artefact classes reviewed because they preserve authorship timestamps and user identity strings that can corroborate or contradict reported timelines. In corporate investigations, Excel comments frequently contain analyst notes with sensitive content that the note author assumed would be invisible to a viewer who simply read the cell values. For CTF challenges, OOXML comment and drawing XML structures are rich hiding grounds because most analysts open documents in a spreadsheet application rather than parsing the underlying ZIP.

Core Concept

In OOXML, cell comments (the yellow sticky-note annotations visible when hovering over a cell) are stored in xl/comments1.xml (and xl/comments2.xml for the second sheet, etc.). Each comment is represented by a <comment> element containing the cell reference (ref attribute) and an author string, with the visible text stored inside <text><r><t> elements. The comments XML file is referenced from the sheet XML via a relationship in xl/worksheets/_rels/sheet1.xml.rels. Separately, the visual positioning and appearance of comment boxes is stored as a VML drawing in xl/drawings/vmlDrawing1.vml (a legacy XML-based vector format). A challenge designer can place flag content in the <text> element of a comment that is attached to a cell hidden far outside the visible scroll area (e.g., row 10,000).

Drawing objects in XLSX include charts, images, and shapes (text boxes, callouts, etc.). These are stored in xl/drawings/drawing1.xml (DrawingML format). A text box placed on a worksheet stores its visible text in <xdr:txBody><a:p><a:r><a:t> elements. Unlike comments, drawing text boxes are not returned by openpyxl's cell iteration — they exist in a completely separate XML file. A flag placed in a drawing text box is invisible to any analysis that only iterates cells.

Document properties are stored in two XML files at the archive root: docProps/core.xml (Dublin Core metadata: dc:creator, dc:description, cp:lastModifiedBy, cp:revision) and docProps/app.xml (application-specific properties: Application, Company, Manager, custom <vt:lpwstr> properties in <Properties>). Custom document properties are defined in docProps/custom.xml (if present) as named key-value pairs, completely invisible from within the Excel UI unless you navigate to File > Properties > Custom.

Technical Deep-Dive

import zipfile
from lxml import etree

NS = {
    "x":   "http://schemas.openxmlformats.org/spreadsheetml/2006/main",
    "xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
    "a":   "http://schemas.openxmlformats.org/drawingml/2006/main",
    "dc":  "http://purl.org/dc/elements/1.1/",
    "cp":  "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "vt":  "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes",
}

def audit_xlsx(path):
    with zipfile.ZipFile(path) as z:
        names = z.namelist()
        print("All files in archive:")
        for n in names:
            print(" ", n)

        # --- Comments ---
        for name in names:
            if "comments" in name and name.endswith(".xml"):
                tree = etree.parse(z.open(name))
                for comment in tree.findall(".//x:comment", NS):
                    ref = comment.get("ref")
                    author = comment.get("authorId", "?")
                    texts = comment.findall(".//x:t", NS)
                    text = "".join(t.text or "" for t in texts)
                    print(f"Comment @{ref} author={author}: {text!r}")

        # --- Drawing text boxes ---
        for name in names:
            if "drawing" in name and name.endswith(".xml") and "vml" not in name:
                tree = etree.parse(z.open(name))
                for t_elem in tree.findall(".//a:t", NS):
                    if t_elem.text:
                        print(f"Drawing text [{name}]: {t_elem.text!r}")

        # --- Document properties ---
        for prop_file in ["docProps/core.xml", "docProps/app.xml",
                          "docProps/custom.xml"]:
            if prop_file in names:
                tree = etree.parse(z.open(prop_file))
                print(f"
--- {prop_file} ---")
                for elem in tree.iter():
                    if elem.text and elem.text.strip():
                        tag = elem.tag.split("}")[-1] if "}" in elem.tag else elem.tag
                        print(f"  {tag}: {elem.text.strip()!r}")

audit_xlsx("challenge.xlsx")
# Quick scan of all XML in the archive
unzip -o challenge.xlsx -d xlsx_extracted/
grep -r "" xlsx_extracted/xl/comments*.xml  2>/dev/null
grep -r "" xlsx_extracted/xl/drawings/ 2>/dev/null
grep -r "" xlsx_extracted/docProps/ 2>/dev/null

# xxd scan for binary content in drawing files
xxd xlsx_extracted/xl/drawings/vmlDrawing1.vml | head -30

Analytical Methodology

  1. Full archive listingunzip -l challenge.xlsx. Identify all files: xl/comments*.xml, xl/drawings/drawing*.xml, xl/drawings/vmlDrawing*.vml, docProps/core.xml, docProps/app.xml, docProps/custom.xml. Any of these may carry the flag.
  2. Comments extraction — Parse each xl/comments*.xml with the Python script above. Print every comment's cell reference and text content. Note authors and text that looks like encoded data.
  3. Drawing text extraction — Parse each xl/drawings/drawing*.xml for <a:t> text elements. These contain text box content entirely separate from cells.
  4. VML drawing inspectionxl/drawings/vmlDrawing*.vml files contain the visual properties of comment boxes and legacy drawing shapes. Scan for any embedded text strings with grep -i "text|CTF|flag" vmlDrawing1.vml.
  5. Document properties — Parse docProps/core.xml and docProps/app.xml for all text elements. Check docProps/custom.xml for custom properties. Pay particular attention to dc:description, cp:lastModifiedBy, and any application-specific fields.
  6. Relationship files — Check xl/worksheets/_rels/sheet*.xml.rels to confirm which comments/drawing files are associated with which sheets. An unusual relationship or a reference to an unexpected file is an anomaly.
  7. OLE embedded objects — If xl/embeddings/ exists in the archive, binary .bin files there are OLE compound documents (embedded legacy OLE objects like Word documents or Excel 97 files). Extract and analyse them separately with olevba or oletools.

Common Analytical Errors

  • Iterating only cellsopenpyxl cell iteration returns cell values, not comment text, not drawing text box content. You must explicitly access ws._comments and parse drawings XML separately.
  • Missing comments on out-of-view cells — A comment attached to cell A10000 is valid OOXML. It will appear in comments1.xml but is never scrolled to in normal use. Always iterate the comments XML directly rather than visually scanning the sheet.
  • Ignoring VML drawing files — Comments' visual representation (position, size, visibility) is stored in VML, which is a separate legacy format. Some custom text may be embedded in VML shapes rather than DrawingML.
  • Not checking custom document propertiesdocProps/custom.xml is entirely absent from most legitimate files but is a natural place for a CTF designer to store a flag under an innocuous property name like "ReviewStatus".
  • Assuming drawing1.xml is empty — Shapes and text boxes added via Insert > Shapes store their text in drawing1.xml. This file exists even if the sheet appears to have no drawn objects (they may be positioned outside the visible scroll area or be white-on-white).
  • Single-pass grep — A grep for "CTF" across all XML may miss Base64-encoded flags. Run a second pass with a regex for long Base64 strings ([A-Za-z0-9+/]{20,}={0,2}) across all XML files in the archive.

NICE Framework Alignment

Code Knowledge/Skill/Task Statement How This Card Develops It
K0082 Knowledge of file format standards and embedded document structures Teaches OOXML relationship model, DrawingML structure, VML legacy format, and document property XML layout
K0118 Knowledge of file format structures and forensic artefacts Maps OOXML artefact types (comments, drawings, properties) to their storage locations in the ZIP archive
S0065 Skill in identifying and extracting data of forensic interest from file artifacts Practises exhaustive forensic enumeration of all OOXML data layers using Python lxml and raw XML parsing
T0075 Task: Analyse forensic images to recover data Exercises data recovery from non-obvious storage locations within a complex OOXML document structure

Further Reading

  • ECMA-376: Office Open XML File Formats Standard — Ecma International (5th Edition, 2021)
  • Oletools: Analysing OLE and OOXML Documents — Philippe Lagadec (decalage.info)
  • Office Document Forensics — SANS Digital Forensics and Incident Response Reading Room
  • DrawingML Specification Reference — Microsoft Open Specifications (docs.microsoft.com/en-us/openspecs)

Challenge Lab

Reinforce your learning with a hands-on generated challenge based on this card's competency.