Analyzing Advanced SMTP Exfiltration via MIME Multipart Parsing and Encoded Attachment Recovery

network_forensics_pcap Difficulty 1–5 30 min certifiable

Theory

Why This Matters

The 2022 Lapsus$ extortion group exfiltrated source code and proprietary data from Nvidia, Samsung, and Microsoft using multiple channels — one of which was SMTP: data was compressed, split across multiple emails as base64-encoded MIME attachments, and sent from internal accounts to external free-mail providers during business hours to blend with legitimate outbound mail. Post-breach analysis of the captured SMTP traffic required analysts to reconstruct multi-part MIME messages, decode base64 attachment payloads, decompress the extracted archives, and identify the data by content rather than filename — because the attackers used innocuous filenames like quarterly_report.docx for ZIP archives containing source code. Advanced SMTP exfiltration analysis is not simply reading the control channel: it requires MIME parsing, base64 decoding, binary identification, and correlation of email metadata with other PCAP events to build a complete exfiltration narrative.

Core Concept

SMTP (Simple Mail Transfer Protocol) carries email as structured RFC 5322 messages in the DATA command payload. The message format consists of header fields followed by a blank line followed by the message body. When the body contains attachments or HTML content, the Content-Type: multipart/mixed (or multipart/related, multipart/alternative) header introduces MIME (Multipurpose Internet Mail Extensions) structure.

A MIME multipart message uses a boundary string — specified in the Content-Type header — to delimit each part. Each part has its own sub-headers: Content-Type (MIME type of the part), Content-Transfer-Encoding (typically base64 or quoted-printable for non-ASCII content), and Content-Disposition (with attachment; filename="..." for file attachments).
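The boundary and sub-header structure described above can be seen by parsing a small hand-written message. This is a minimal sketch: the addresses, the boundary string XYZ, and the helper name describe_parts are illustrative assumptions, not values from any real capture.

```python
from email import message_from_bytes, policy

# Hypothetical two-part message: a text body plus one base64 attachment.
SAMPLE = (
    b"From: alice@corp.example\r\n"
    b"To: bob@mail.example\r\n"
    b"Subject: report\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"See attached.\r\n"
    b"--XYZ\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b'Content-Disposition: attachment; filename="quarterly_report.docx"\r\n'
    b"\r\n"
    b"UEsDBA==\r\n"
    b"--XYZ--\r\n"
)

def describe_parts(raw: bytes):
    """Return the boundary string and one tuple of sub-header values per leaf part."""
    msg = message_from_bytes(raw, policy=policy.default)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # the container itself has no payload
        rows.append((
            part.get_content_type(),
            str(part.get("Content-Transfer-Encoding", "7bit")),
            part.get_content_disposition(),
            part.get_filename(),
        ))
    return msg.get_boundary(), rows

boundary, rows = describe_parts(SAMPLE)
print("boundary:", boundary)
for row in rows:
    print(row)
```

Each tuple holds the part's Content-Type, Content-Transfer-Encoding, Content-Disposition, and filename, which are the four fields worth recording for every part during analysis.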

Base64 MIME encoding: binary attachment data is encoded as base64 within the MIME part body, wrapped at 76 characters per line. The encoding inflates binary size by approximately 33%. Decoding reverses this exactly.
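The size arithmetic can be checked directly. The sketch below uses the standard-library base64 module; the helper name base64_length and the sample data are assumptions made for illustration. Note that base64.encodebytes ends each line with a bare LF, whereas on the wire SMTP uses CRLF, so the transmitted size is slightly larger again.

```python
import base64
import math

def base64_length(n_bytes: int) -> int:
    """Base64 characters (padding included, line breaks excluded) for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)

raw = bytes(range(256)) * 12        # 3072 bytes of sample binary data
encoded = base64.encodebytes(raw)   # MIME-style output wrapped into 76-character lines
lines = encoded.splitlines()

print(len(raw), "raw bytes ->", base64_length(len(raw)), "base64 characters")
print("longest line:", max(len(line) for line in lines))
print("round trip ok:", base64.decodebytes(encoded) == raw)
```

Every 3 input bytes become 4 output characters, which is where the roughly 33% overhead comes from before line breaks are counted.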

Advanced exfiltration techniques exploit SMTP structure in several ways:

Custom MIME headers as covert channels: non-standard headers (X-Exfil-Chunk: 3/7, X-Session-ID: abc123) carry structured metadata about the exfiltration operation outside the message body.
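Steganographic encoding in base64 padding: when a base64 run ends in = or ==, the last data character carries two or four unused low-order bits that decoders normally ignore. Setting those bits, or varying where line breaks fall, can encode small amounts of additional data invisible to casual inspection.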
MIME type spoofing: a Content-Type: image/jpeg MIME part whose decoded payload is a ZIP archive or executable. Magic-byte identification exposes the true format.
Subject and header encoding: =?UTF-8?B?<base64>?= or =?UTF-8?Q?<qp>?= encoded words in Subject, From, or custom headers can carry hidden data.
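As one illustration of the padding channel listed above, the sketch below reads the unused low-order bits of the last data character on each padded base64 line. A canonical encoder leaves these bits at zero, so a non-zero value is worth a closer look. The function names unused_bits and scan_lines are hypothetical, and the check assumes the standard base64 alphabet; it does not cover other variants of the technique such as line-break manipulation.

```python
import string

B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def unused_bits(b64_text: str) -> int:
    """Return the value of the discarded bits in the final base64 group (0 if canonical)."""
    stripped = "".join(b64_text.split())   # remove whitespace and line breaks
    data = stripped.rstrip("=")
    pad = len(stripped) - len(data)
    if pad == 0 or not data:
        return 0                           # no padding means no spare bits
    last = B64_ALPHABET.index(data[-1])    # raises ValueError on a non-alphabet character
    mask = 0b11 if pad == 1 else 0b1111    # "=" leaves 2 spare bits, "==" leaves 4
    return last & mask

def scan_lines(body: str):
    """List (line number, spare-bit value) for every line with non-zero spare bits."""
    hits = []
    for number, line in enumerate(body.splitlines(), 1):
        value = unused_bits(line)
        if value:
            hits.append((number, value))
    return hits

print(scan_lines("QQ==\nQR==\nQUJ="))
```

Here QQ== is the canonical encoding of the byte A, while QR== carries the same byte with a non-zero spare bit, so only the second and third lines are reported.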

Impacket's tools (impacket-smbserver and related utilities) target other protocols. For SMTP analysis the relevant tool is Python's standard-library email package, which parses a full RFC 5322 message, MIME structure included, with a single call such as email.message_from_bytes().

Technical Deep-Dive

# List the TCP stream indexes that carry SMTP traffic
tshark -r capture.pcap -Y "smtp" -T fields -e tcp.stream | sort -nu

# Follow one SMTP session as ASCII (stream 0 here; substitute an index from the list above)
# -q suppresses the per-packet summary so only the followed stream is printed
tshark -r capture.pcap -q -z "follow,tcp,ascii,0" 2>/dev/null | head -200

# Extract the DATA payload (everything between DATA and the terminating lone dot)
# This contains the full RFC 5322 message including MIME parts
tshark -r capture.pcap -q -z "follow,tcp,ascii,0" 2>/dev/null \
  | tr -d '\r' \
  | sed -n '/^DATA/,/^\.$/p' > email_raw.txt

# Detect custom X- headers (potential covert channel metadata)
tshark -r capture.pcap -q -z "follow,tcp,ascii,0" 2>/dev/null \
  | grep -i "^X-" | sort -u

# List all RCPT TO addresses (identify external recipients)
# corp.(com|local|internal) is a placeholder; substitute the organisation's own domains
tshark -r capture.pcap \
  -Y 'smtp.req.command == "RCPT"' \
  -T fields -e frame.time_relative -e ip.src -e smtp.req.parameter \
  | grep -vE '@corp\.(com|local|internal)'   # drop internal recipients

# Detect encoded Subject lines (=?charset?encoding?text?= format)
tshark -r capture.pcap -q -z "follow,tcp,ascii,0" 2>/dev/null \
  | grep -i "^Subject:" | grep "=?"

# Find large DATA payloads (potential large attachment exfiltration)
# tshark separates -T fields output with tabs by default
tshark -r capture.pcap -Y "smtp" -T fields -e tcp.stream -e tcp.len \
  | awk -F'\t' '{sum[$1]+=$2} END{for(s in sum) if(sum[s]>50000) print "stream " s ": " sum[s] " bytes"}'
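The DATA extraction step can also be sketched in Python, which makes it easy to undo SMTP dot-stuffing (RFC 5321 section 4.5.2: a sender prefixes an extra dot to any body line that starts with a dot). The sketch assumes the input holds only the client-to-server bytes of one session with CRLF line endings intact; the function name extract_data_payload is hypothetical.

```python
import re

def extract_data_payload(client_bytes: bytes):
    """Return the message sent after the first DATA command, or None if there is none."""
    match = re.search(rb"(?im)^DATA\r\n", client_bytes)
    if not match:
        return None
    start = match.end()
    # The message ends at a line holding a single dot. Searching from start - 2
    # also covers an empty message, where DATA is followed directly by the dot line.
    end = client_bytes.find(b"\r\n.\r\n", start - 2)
    if end == -1:
        body = client_bytes[start:]        # capture truncated before the terminator
    else:
        body = client_bytes[start:end + 2]
    # Undo dot-stuffing: remove the leading dot from any line that starts with one.
    return re.sub(rb"(?m)^\.", b"", body)

session = (
    b"EHLO host\r\nMAIL FROM:<a@corp.example>\r\nRCPT TO:<b@mail.example>\r\n"
    b"DATA\r\nSubject: t\r\n\r\nline one\r\n..hidden\r\n.\r\nQUIT\r\n"
)
print(extract_data_payload(session))
```

The returned bytes can be passed straight to email.message_from_bytes(). Skipping the dot-unstuffing step would leave a stray dot at the start of the affected lines in the recovered message.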

# Python: extract and decode all MIME attachments from a captured SMTP session
import email
import re
from email import policy
from pathlib import Path

# Magic-byte prefixes used to identify the true format of a decoded payload
MAGIC_MAP = {
    b"PK\x03\x04": "ZIP archive",
    b"\x1f\x8b":   "gzip",
    b"MZ":         "Windows PE",
    b"\x7fELF":    "ELF binary",
    b"\x89PNG":    "PNG image",
    b"%PDF":       "PDF",
}

# Step 1: extract the raw email from the SMTP DATA section
# (remove SMTP control lines; keep everything from the first header onward)
with open("smtp_stream.txt", "rb") as fh:
    raw_session = fh.read()

# Find the start of the RFC 5322 message: the first line that begins with a
# common header name (the SMTP command "MAIL FROM:" does not match "^From:")
match = re.search(rb"(?m)^(From|Date|Received|Message-ID|MIME-Version):", raw_session)
if match:
    raw_email = raw_session[match.start():]
else:
    raw_email = raw_session  # fallback: try the whole buffer
# Drop everything from the end-of-DATA marker (a line holding a single dot) onward
raw_email = raw_email.split(b"\r\n.\r\n", 1)[0]

msg = email.message_from_bytes(raw_email, policy=policy.default)

print(f"From:    {msg['from']}")
print(f"To:      {msg['to']}")
print(f"Subject: {msg['subject']}")
print(f"Date:    {msg['date']}")

# Print all custom X- headers (potential covert channel)
for key, val in msg.items():
    if key.lower().startswith("x-"):
        print(f"Custom header: {key}: {val}")

# Walk MIME parts and extract attachments
out_dir = Path("extracted_attachments")
out_dir.mkdir(exist_ok=True)

for i, part in enumerate(msg.walk()):
    if part.is_multipart():
        continue  # containers carry no payload of their own
    ct   = part.get_content_type()
    disp = part.get_content_disposition() or ""
    # Keep only the final path component: the filename is attacker-controlled
    fname = Path(part.get_filename() or f"part_{i}.bin").name or f"part_{i}.bin"

    if "attachment" in disp or ct not in ("text/plain", "text/html"):
        payload = part.get_payload(decode=True)  # handles base64 and qp automatically
        if payload:
            (out_dir / fname).write_bytes(payload)
            # Identify true file type by magic bytes
            fmt = next((v for k, v in MAGIC_MAP.items() if payload.startswith(k)), "Unknown")
            print(f"Attachment: {fname} ({len(payload)} bytes) | declared: {ct} | actual: {fmt}")

# Python: decode =?charset?encoding?text?= encoded words from email headers
import email.header

def decode_header_value(raw: str) -> str:
    """Decode RFC 2047 encoded-word sequences in email header values."""
    parts = email.header.decode_header(raw)
    decoded = []
    for data, charset in parts:
        if isinstance(data, bytes):
            decoded.append(data.decode(charset or "utf-8", errors="replace"))
        else:
            decoded.append(data)
    return "".join(decoded)

# Example: Subject: =?UTF-8?B?Q29uZmlkZW50aWFsIERhdGE=?=
raw_subject = "=?UTF-8?B?Q29uZmlkZW50aWFsIERhdGE=?="
print(decode_header_value(raw_subject))   # Output: Confidential Data

Analytical Methodology

Open the PCAP in Wireshark. Apply display filter smtp. Note source IPs, destination IPs, and port numbers. Outbound SMTP on port 25 from an internal workstation (not a mail server) is immediately anomalous, because mail clients normally submit mail on port 587 (or 465) and port 25 is reserved for server-to-server relay.
Apply filter smtp.req.command == "RCPT" to list all envelope recipients. Identify external recipients — addresses outside the organisation's domain. Multiple external recipients in a single session, or recipients at free-mail providers, are exfiltration indicators.
For each session with external recipients, follow the TCP stream (Follow → TCP Stream). Scroll past the SMTP handshake to the DATA section. Read the RFC 5322 headers: From:, To:, Subject:, Date:, and any X- custom headers.
Locate the Content-Type: multipart/... header and note the boundary= value. In the stream, identify each MIME part between boundary markers. Note the Content-Transfer-Encoding and Content-Disposition for each part.
For base64-encoded MIME parts (attachments), copy the base64 block (everything between the sub-headers and the next boundary). Decode using base64 -d or Python. Run file on the decoded output to identify the true format by magic bytes.
For suspicious attachments that are ZIP or gzip archives, extract and inspect the contents. Filenames inside archives often reveal the data type more clearly than the email attachment filename.
Examine all X- custom headers for structured metadata that may indicate automated exfiltration tooling: chunk indices, session IDs, or other non-standard fields.
Use NetworkMiner: load the PCAP, navigate to the Files tab for automatic MIME attachment extraction, and the Messages tab for structured email metadata. NetworkMiner computes MD5 hashes for each extracted file, suitable for forensic reporting.
Correlate the exfiltration session with other PCAP events: what file access or database queries preceded the email? Is there a corresponding SMTP AUTH session identifying the account used? Build a complete timeline from internal access to external delivery.
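The archive-inspection and hashing steps above can be sketched with the standard library alone. This is an illustrative helper, not a replacement for NetworkMiner's output: the name inspect_attachment and the report layout are assumptions, and SHA-256 is used here as a common choice for evidence hashing.

```python
import hashlib
import io
import zipfile

def inspect_attachment(payload: bytes) -> dict:
    """Hash a decoded attachment and, if it is a ZIP archive, list its members."""
    report = {
        "size": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "zip_members": None,               # stays None when the payload is not a ZIP
    }
    buffer = io.BytesIO(payload)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            # Read member names and sizes only; nothing is extracted to disk.
            report["zip_members"] = [
                (info.filename, info.file_size) for info in archive.infolist()
            ]
    return report

# Build a small in-memory archive to stand in for a decoded attachment.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("src/main.c", b"int main(){}")

print(inspect_attachment(sample.getvalue()))
```

Listing members without extracting them keeps the analyst's workstation clear of attacker-supplied files while still revealing what the archive holds.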

Common Analytical Errors

Stopping at credential extraction without reading the DATA payload: Analysts who identify AUTH credentials and consider the session documented miss the actual exfiltration payload in the DATA section. Credential extraction is step one; MIME payload analysis is the forensically significant step.
Assuming MIME type from Content-Type header: Attackers commonly set Content-Type: image/jpeg for ZIP archives or PE executables. Always decode the payload and identify by magic bytes — the declared MIME type is attacker-controlled and unreliable.
Missing multi-email exfiltration sequences: A single large dataset may be split across multiple emails (multiple SMTP sessions) with sequence indicators in subject lines or custom headers. Look for a pattern of emails to the same recipient with incrementing identifiers or similar subjects.
Not decoding RFC 2047 encoded-word subjects: A Subject line containing =?UTF-8?B?...?= is base64-encoded. Reading it as raw ASCII hides the actual subject. Use Python's email.header.decode_header() to reveal the true content — it may contain operational metadata.
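Overlooking port 587 and 465 traffic: Internal mail clients submit email via port 587 (authenticated submission) or port 465 (SMTPS). An attacker exfiltrating via a mail client or SMTP library uses these ports. Ensure PCAP captures and analysis cover all three SMTP ports. Port 465 is TLS from the first byte and port 587 normally upgrades via STARTTLS, so recovering the payload on these ports requires session keys or a capture taken at a TLS-terminating point; without them only metadata (endpoints, timing, volume) is available.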
Failing to correlate SMTP timestamps with endpoint activity: SMTP exfiltration sessions must be correlated with endpoint file access logs, database query logs, or other network captures to establish that the attached data was gathered from the compromised system. The PCAP alone proves transmission; endpoint evidence proves origin.
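For the multi-email sequences mentioned above, a short script can group recovered messages and report gaps. The sketch assumes the tooling marks chunks with the illustrative X-Exfil-Chunk: index/total and X-Session-ID headers used as examples earlier in this card. Real tooling may use different header names, subject-line counters, or nothing at all, and the function name group_chunks is hypothetical.

```python
import re
from collections import defaultdict
from email import message_from_bytes, policy

CHUNK_RE = re.compile(r"^\s*(\d+)\s*/\s*(\d+)\s*$")   # matches "3/7"

def group_chunks(raw_messages):
    """Group messages by (recipient, session id) and report seen and missing chunks."""
    seen = defaultdict(set)
    totals = {}
    for raw in raw_messages:
        msg = message_from_bytes(raw, policy=policy.default)
        match = CHUNK_RE.match(str(msg.get("X-Exfil-Chunk", "")))
        if not match:
            continue                       # message carries no chunk marker
        index, total = int(match.group(1)), int(match.group(2))
        key = (str(msg.get("To", "")), str(msg.get("X-Session-ID", "")))
        seen[key].add(index)
        totals[key] = total
    return {
        key: {
            "seen": sorted(indexes),
            "missing": [i for i in range(1, totals[key] + 1) if i not in indexes],
        }
        for key, indexes in seen.items()
    }

def make_message(chunk: str) -> bytes:
    """Build a hypothetical chunked message for the demonstration."""
    return (
        b"To: drop@mail.example\r\nX-Session-ID: abc123\r\n"
        b"X-Exfil-Chunk: " + chunk.encode() + b"\r\n\r\nbody\r\n"
    )

print(group_chunks([make_message("1/3"), make_message("3/3")]))
```

A non-empty missing list means either the capture is incomplete or part of the dataset left by another channel, and both possibilities belong in the report.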

NICE Framework Alignment

Code	Knowledge/Skill/Task Statement	How This Card Develops It
K0046	Knowledge of intrusion detection methodologies and techniques	Recognising advanced SMTP exfiltration patterns that DLP and IDS detect: external recipients, large DATA payloads, MIME type mismatches, anomalous X- headers
K0093	Knowledge of network protocols such as TCP/IP, DNS, SMTP	Understanding SMTP DATA structure, RFC 5322 message format, MIME multipart boundaries, base64 encoding, and RFC 2047 encoded-word headers
K0221	Knowledge of OSI model	Situating SMTP at the application layer (7) over TCP (layer 4); understanding how MIME structure creates a secondary encoding layer within the application-layer payload
S0046	Skill in performing packet-level analysis	Using Wireshark SMTP filters, Follow TCP Stream, tshark extraction, Python email library, and NetworkMiner to reconstruct MIME messages and recover encoded attachments
T0023	Collect intrusion artifacts	Recovering decoded MIME attachments, extracting archive contents, and preserving email metadata and file hashes as forensic evidence of data exfiltration