XLSX cell comment forensics
Theory
Why This Matters
Excel comments and embedded drawing objects represent a persistent forensic blind spot. During the 2016 Bangladesh Bank heist investigation, forensic analysts examined numerous XLSX files for evidence of transaction manipulation; comments and drawing metadata were among the artefact classes reviewed because they preserve authorship timestamps and user identity strings that can corroborate or contradict reported timelines. In corporate investigations, Excel comments frequently contain analyst notes with sensitive content that the note author assumed would be invisible to a viewer who simply read the cell values. For CTF challenges, OOXML comment and drawing XML structures are rich hiding grounds because most analysts open documents in a spreadsheet application rather than parsing the underlying ZIP.
Core Concept
In OOXML, cell comments (the yellow sticky-note annotations visible when hovering over a cell) are stored in xl/comments1.xml (and xl/comments2.xml for the second sheet, etc.). Each comment is represented by a <comment> element containing the cell reference (ref attribute) and an author string, with the visible text stored inside <text><r><t> elements. The comments XML file is referenced from the sheet XML via a relationship in xl/worksheets/_rels/sheet1.xml.rels. Separately, the visual positioning and appearance of comment boxes is stored as a VML drawing in xl/drawings/vmlDrawing1.vml (a legacy XML-based vector format). A challenge designer can place flag content in the <text> element of a comment that is attached to a cell hidden far outside the visible scroll area (e.g., row 10,000).
Drawing objects in XLSX include charts, images, and shapes (text boxes, callouts, etc.). These are stored in xl/drawings/drawing1.xml (DrawingML format). A text box placed on a worksheet stores its visible text in <xdr:txBody><a:p><a:r><a:t> elements. Unlike comments, drawing text boxes are not returned by openpyxl's cell iteration — they exist in a completely separate XML file. A flag placed in a drawing text box is invisible to any analysis that only iterates cells.
Document properties are stored in two XML files at the archive root: docProps/core.xml (Dublin Core metadata: dc:creator, dc:description, cp:lastModifiedBy, cp:revision) and docProps/app.xml (application-specific properties: Application, Company, Manager, custom <vt:lpwstr> properties in <Properties>). Custom document properties are defined in docProps/custom.xml (if present) as named key-value pairs, completely invisible from within the Excel UI unless you navigate to File > Properties > Custom.
Technical Deep-Dive
import zipfile
from lxml import etree
NS = {
"x": "http://schemas.openxmlformats.org/spreadsheetml/2006/main",
"xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
"a": "http://schemas.openxmlformats.org/drawingml/2006/main",
"dc": "http://purl.org/dc/elements/1.1/",
"cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
"vt": "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes",
}
def audit_xlsx(path):
with zipfile.ZipFile(path) as z:
names = z.namelist()
print("All files in archive:")
for n in names:
print(" ", n)
# --- Comments ---
for name in names:
if "comments" in name and name.endswith(".xml"):
tree = etree.parse(z.open(name))
for comment in tree.findall(".//x:comment", NS):
ref = comment.get("ref")
author = comment.get("authorId", "?")
texts = comment.findall(".//x:t", NS)
text = "".join(t.text or "" for t in texts)
print(f"Comment @{ref} author={author}: {text!r}")
# --- Drawing text boxes ---
for name in names:
if "drawing" in name and name.endswith(".xml") and "vml" not in name:
tree = etree.parse(z.open(name))
for t_elem in tree.findall(".//a:t", NS):
if t_elem.text:
print(f"Drawing text [{name}]: {t_elem.text!r}")
# --- Document properties ---
for prop_file in ["docProps/core.xml", "docProps/app.xml",
"docProps/custom.xml"]:
if prop_file in names:
tree = etree.parse(z.open(prop_file))
print(f"
--- {prop_file} ---")
for elem in tree.iter():
if elem.text and elem.text.strip():
tag = elem.tag.split("}")[-1] if "}" in elem.tag else elem.tag
print(f" {tag}: {elem.text.strip()!r}")
audit_xlsx("challenge.xlsx")
# Quick scan of all XML in the archive
unzip -o challenge.xlsx -d xlsx_extracted/
grep -r "" xlsx_extracted/xl/comments*.xml 2>/dev/null
grep -r "" xlsx_extracted/xl/drawings/ 2>/dev/null
grep -r "" xlsx_extracted/docProps/ 2>/dev/null
# xxd scan for binary content in drawing files
xxd xlsx_extracted/xl/drawings/vmlDrawing1.vml | head -30
Analytical Methodology
- Full archive listing —
unzip -l challenge.xlsx. Identify all files:xl/comments*.xml,xl/drawings/drawing*.xml,xl/drawings/vmlDrawing*.vml,docProps/core.xml,docProps/app.xml,docProps/custom.xml. Any of these may carry the flag. - Comments extraction — Parse each
xl/comments*.xmlwith the Python script above. Print every comment's cell reference and text content. Note authors and text that looks like encoded data. - Drawing text extraction — Parse each
xl/drawings/drawing*.xmlfor<a:t>text elements. These contain text box content entirely separate from cells. - VML drawing inspection —
xl/drawings/vmlDrawing*.vmlfiles contain the visual properties of comment boxes and legacy drawing shapes. Scan for any embedded text strings withgrep -i "text|CTF|flag" vmlDrawing1.vml. - Document properties — Parse
docProps/core.xmlanddocProps/app.xmlfor all text elements. CheckdocProps/custom.xmlfor custom properties. Pay particular attention todc:description,cp:lastModifiedBy, and any application-specific fields. - Relationship files — Check
xl/worksheets/_rels/sheet*.xml.relsto confirm which comments/drawing files are associated with which sheets. An unusual relationship or a reference to an unexpected file is an anomaly. - OLE embedded objects — If
xl/embeddings/exists in the archive, binary.binfiles there are OLE compound documents (embedded legacy OLE objects like Word documents or Excel 97 files). Extract and analyse them separately witholevbaoroletools.
Common Analytical Errors
- Iterating only cells —
openpyxlcell iteration returns cell values, not comment text, not drawing text box content. You must explicitly accessws._commentsand parse drawings XML separately. - Missing comments on out-of-view cells — A comment attached to cell
A10000is valid OOXML. It will appear incomments1.xmlbut is never scrolled to in normal use. Always iterate the comments XML directly rather than visually scanning the sheet. - Ignoring VML drawing files — Comments' visual representation (position, size, visibility) is stored in VML, which is a separate legacy format. Some custom text may be embedded in VML shapes rather than DrawingML.
- Not checking custom document properties —
docProps/custom.xmlis entirely absent from most legitimate files but is a natural place for a CTF designer to store a flag under an innocuous property name like "ReviewStatus". - Assuming drawing1.xml is empty — Shapes and text boxes added via Insert > Shapes store their text in
drawing1.xml. This file exists even if the sheet appears to have no drawn objects (they may be positioned outside the visible scroll area or be white-on-white). - Single-pass grep — A grep for "CTF" across all XML may miss Base64-encoded flags. Run a second pass with a regex for long Base64 strings (
[A-Za-z0-9+/]{20,}={0,2}) across all XML files in the archive.
NICE Framework Alignment
| Code | Knowledge/Skill/Task Statement | How This Card Develops It |
|---|---|---|
| K0082 | Knowledge of file format standards and embedded document structures | Teaches OOXML relationship model, DrawingML structure, VML legacy format, and document property XML layout |
| K0118 | Knowledge of file format structures and forensic artefacts | Maps OOXML artefact types (comments, drawings, properties) to their storage locations in the ZIP archive |
| S0065 | Skill in identifying and extracting data of forensic interest from file artifacts | Practises exhaustive forensic enumeration of all OOXML data layers using Python lxml and raw XML parsing |
| T0075 | Task: Analyse forensic images to recover data | Exercises data recovery from non-obvious storage locations within a complex OOXML document structure |
Further Reading
- ECMA-376: Office Open XML File Formats Standard — Ecma International (5th Edition, 2021)
- Oletools: Analysing OLE and OOXML Documents — Philippe Lagadec (decalage.info)
- Office Document Forensics — SANS Digital Forensics and Incident Response Reading Room
- DrawingML Specification Reference — Microsoft Open Specifications (docs.microsoft.com/en-us/openspecs)
Challenge Lab
Reinforce your learning with a hands-on generated challenge based on this card's competency.