Detecting DNS Exfiltration Through Entropy-Based Subdomain Anomaly Analysis

network_forensics_pcap Difficulty 1–5 30 min certifiable

Theory

Why This Matters

DNS exfiltration was central to the data-theft stage of the 2018 DNSpionage campaign and has appeared in APT toolkits from Winnti to OilRig. Because DNS is almost universally permitted through perimeter firewalls and often excluded from DLP inspection, it is an attractive covert channel. Recognising the distinctive signatures of data-over-DNS requires familiarity with entropy mathematics, label length constraints, and the volume patterns produced by automated exfiltration tools.

Core Concept

DNS exfiltration encodes stolen data into the subdomain labels of DNS queries directed at an attacker-controlled authoritative nameserver. The attacker's nameserver receives the queries and reassembles the data from the encoded subdomains. No direct TCP connection to the attacker is required — only recursive DNS resolution, which traverses the firewall.

Common encoding schemes include base64, base32, and hex. A typical query looks like: 4b6f6e74656e74.evil-c2.com. The attacker controls the authoritative NS for evil-c2.com and logs every query.
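To make the encoding concrete, the short Python sketch below reverses the example query. It assumes the payload is plain hex with no sequence numbers or padding, and the function name decode_hex_label is illustrative rather than taken from any tool.

```python
def decode_hex_label(qname: str, zone: str) -> bytes:
    """Strip the attacker zone from a query name and hex-decode what is left.

    Assumes every label left of the zone is plain hex (an illustrative
    simplification; real tools often add counters or session IDs).
    """
    name = qname.rstrip(".")
    suffix = "." + zone.rstrip(".")
    if not name.lower().endswith(suffix.lower()):
        raise ValueError(f"{qname!r} is not under {zone!r}")
    # Join multi-label payloads by dropping the separating dots.
    payload = name[: -len(suffix)].replace(".", "")
    return bytes.fromhex(payload)


print(decode_hex_label("4b6f6e74656e74.evil-c2.com", "evil-c2.com"))  # b'Kontent'
```

The seven decoded bytes show the cost of the channel: hex carries only half a byte per character, so a 63-character label holds at most 31 bytes of data.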

Key indicators: high query volume to a single second-level domain (SLD), unusually long subdomain labels (RFC 1035 limits labels to 63 characters; legitimate traffic rarely exceeds 30), high Shannon entropy in subdomain labels (encoded data looks random, entropy > 3.5 bits/character), and use of rare query types such as TXT or NULL that carry larger payloads than A/AAAA records.
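As a rough illustration of how the non-entropy indicators can be gathered per SLD, the sketch below summarises volume, longest subdomain label, and the share of rare query types. The input shape (pairs of query name and query type), the RARE_QTYPES set, and the function name are assumptions made for this example. Like the tshark one-liner further down, it treats the last two labels as the SLD, which is wrong for suffixes such as co.uk.

```python
from collections import defaultdict

# Assumption: the query types treated as "rare" follow the indicator list above.
RARE_QTYPES = {"TXT", "NULL"}


def summarise_by_sld(records):
    """Summarise (qname, qtype) pairs per second-level domain.

    Returns {sld: {"queries", "max_label_len", "rare_qtype_share"}}.
    Names with no subdomain portion are skipped.
    """
    stats = defaultdict(lambda: {"queries": 0, "max_label_len": 0, "rare": 0})
    for qname, qtype in records:
        labels = qname.rstrip(".").lower().split(".")
        if len(labels) < 3:
            continue  # nothing left of the SLD to inspect
        entry = stats[".".join(labels[-2:])]
        entry["queries"] += 1
        longest = max(len(label) for label in labels[:-2])
        entry["max_label_len"] = max(entry["max_label_len"], longest)
        entry["rare"] += qtype.upper() in RARE_QTYPES
    return {
        sld: {
            "queries": e["queries"],
            "max_label_len": e["max_label_len"],
            "rare_qtype_share": e["rare"] / e["queries"],
        }
        for sld, e in stats.items()
    }


sample = [
    ("4b6f6e74656e74.evil-c2.com", "TXT"),
    ("aabbccdd.evil-c2.com", "A"),
    ("www.example.com", "A"),
]
for sld, summary in summarise_by_sld(sample).items():
    print(sld, summary)
```

No single column in this summary is decisive; it is meant as a triage table to be read alongside the entropy score.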

Shannon entropy for a string: H = -Σ p(c) × log₂(p(c)), summed over each unique character c, where p(c) is the fraction of the string's characters equal to c. Random-looking encoded data approaches 4–5 bits/character; human-readable domain labels score 2–3.
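A small worked example helps calibrate the formula. The sketch below computes the per-character entropy and the ceiling a string can reach: the measured entropy can never exceed log₂ of the alphabet size, nor log₂ of the string length. The function names are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Empirical Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    total = len(s)
    return -sum((n / total) * math.log2(n / total) for n in Counter(s).values())


def entropy_ceiling(alphabet_size: int, length: int) -> float:
    """Largest entropy a string of `length` chars over the alphabet can score."""
    return math.log2(min(alphabet_size, length))


print(entropy_ceiling(16, 63))   # hex: 4.0
print(entropy_ceiling(32, 63))   # base32: 5.0
print(entropy_ceiling(64, 63))   # base64 in one 63-char label: just under 6
print(round(shannon_entropy("4b6f6e74656e74"), 2))  # 2.61
```

The example label 4b6f6e74656e74 scores about 2.61 bits/character, below the 3.5 threshold, because it is only 14 characters long and uses 7 distinct symbols. Short or hex-encoded labels can therefore slip under an entropy-only rule, which is one reason to pair entropy with label length and volume.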

Technical Deep-Dive

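# Extract all DNS queries from a PCAP and list the longest leftmost labels
# tshark separates -T fields output with tabs; frame.time contains spaces,
# so awk must split on tabs for $2 to be the query name.
tshark -r capture.pcap -Y "dns.flags.response == 0" \
  -T fields -e frame.time -e dns.qry.name \
  | awk -F'\t' '{name=$2; split(name,a,"."); label=a[1]; print length(label), name}' \
  | sort -rn | head -30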

# High-volume query count per SLD (last two labels)
tshark -r capture.pcap -Y "dns.flags.response == 0" \
  -T fields -e dns.qry.name \
  | awk -F. '{print $(NF-1)"."$NF}' \
  | sort | uniq -c | sort -rn | head -20

# Entropy calculator for DNS subdomain labels
# Input: dns_queries.txt with one queried name per line
import math
from collections import Counter

def entropy(s):
    """Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    freq = Counter(s)
    total = len(s)
    return -sum((c/total)*math.log2(c/total) for c in freq.values())

def parse_subdomain(fqdn):
    """Return everything left of the last two labels, or '' if nothing is left."""
    parts = fqdn.rstrip('.').split('.')
    return '.'.join(parts[:-2]) if len(parts) > 2 else ''

with open("dns_queries.txt") as fh:
    for line in fh:
        qname = line.strip()
        sub = parse_subdomain(qname)
        if sub:
            h = entropy(sub)
            # Longest single label in the subdomain portion
            label_len = max(len(l) for l in sub.split('.'))
            # Alert on either indicator: high entropy or an unusually long label
            if h > 3.5 or label_len > 45:
                print(f"ALERT  entropy={h:.2f}  maxlabel={label_len}  {qname}")

# Splunk: detect long DNS subdomains (a length-based proxy for encoded data) using rex + eval + stats
index=dns sourcetype=dns_logs (query_type=A OR query_type=TXT)
| rex field=query "^(?<subdomain>.+?)\.(?<dest_domain>[^.]+\.[^.]+)$"
| eval sub_len = len(subdomain)
| where sub_len > 40
| stats count dc(query) AS unique_queries BY src_ip dest_domain
| where count > 50
| sort -count

Analytical Methodology

Pull DNS query logs for the investigation window. Aggregate by second-level domain (SLD). Identify any SLD receiving more than 100 queries per hour from internal hosts — flag for further analysis.
For flagged SLDs, extract all queried FQDNs. Measure label lengths. Any label exceeding 45 characters is strong evidence of encoded data, as legitimate labels are typically short and human-readable.
Compute Shannon entropy for each subdomain portion. Scores above 3.5 bits/character suggest encoded rather than human-readable content. Combine with label length to prioritise.
Check QTYPE distribution. An elevated share of TXT, NULL, or MX queries to a single domain with no corresponding mail infrastructure is anomalous.
Reconstruct the exfiltrated data: sort queries by timestamp, strip the SLD suffix, concatenate subdomain values in order, then decode (base64 -d, xxd -r -p, or python base64.b32decode).
Examine the reassembled bytes for file magic numbers (PDF: %PDF, ZIP: PK\x03\x04). Document the data type and estimated size in the incident report.
Correlate the source IP against endpoint logs to identify the process generating the queries (Windows: Sysmon Event 22 DNS query; Linux: auditd or DNS resolver logs).
Pivot to network: confirm there is no legitimate business use for the queried SLD. Verify domain registration date (new domains are high-risk).
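The reconstruction and magic-number steps above can be sketched as follows. This is a minimal illustration, assuming hex-encoded chunks, no duplicate or retransmitted queries, and no sequence counters inside the labels; the function names and the two-entry MAGIC table are hypothetical.

```python
# Magic numbers taken from the methodology step above.
MAGIC = {b"%PDF": "PDF", b"PK\x03\x04": "ZIP"}


def reassemble(records, zone, decoder=bytes.fromhex):
    """Rebuild the payload from (timestamp, qname) pairs under `zone`.

    Queries are ordered by timestamp, the zone suffix is stripped, and the
    remaining labels are concatenated before decoding. Case is preserved
    so that a case-sensitive decoder can be substituted.
    """
    suffix = "." + zone.lower().rstrip(".")
    chunks = []
    for _, qname in sorted(records):
        name = qname.rstrip(".")
        if name.lower().endswith(suffix):
            chunks.append(name[: -len(suffix)].replace(".", ""))
    return decoder("".join(chunks))


def identify(data: bytes) -> str:
    """Return a coarse file type based on the leading magic bytes."""
    for magic, kind in MAGIC.items():
        if data.startswith(magic):
            return kind
    return "unknown"


queries = [
    (2, "2d312e37.evil-c2.com"),
    (1, "25504446.evil-c2.com"),
]
payload = reassemble(queries, "evil-c2.com")
print(payload, identify(payload), len(payload))
```

For other encodings, pass a different decoder callable, for example base64.b32decode after restoring any padding the tool stripped. Work on a copy of the evidence so the original PCAP and logs keep their verified checksums.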

Common Analytical Errors

Relying on single indicators: High volume alone is insufficient — CDN-heavy applications generate high DNS volume. Always combine volume, entropy, and label length before escalating.
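Missing base32 encoding: base32 uses an alphabet of A–Z and 2–7. Its entropy is lower than that of base64 (a 32-symbol alphabet caps at 5 bits/char versus 6 for 64 symbols, and short labels often score around 3.2 versus 4.0), so it may fall below naive thresholds. Adjust thresholds and inspect visually.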
Not accounting for DNSSEC: NSEC3 records in DNSSEC-signed zones use long base32hex-encoded hashes as owner-name labels, which can resemble encoded payloads. Exclude known DNSSEC infrastructure before flagging length anomalies.
Forgetting response data: The attacker may also send tasking back via DNS TXT responses. Capture and analyse DNS response records, not just queries.
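One way to reduce the base32 blind spot described above is to test which encoding alphabet a label fits, in addition to scoring its entropy. The sketch below is a heuristic under stated assumptions: the min_len default of 16, the digit and mixed-case requirements, and the function name guess_encoding are illustrative choices, not established thresholds.

```python
import re

HEX_RE = re.compile(r"[0-9a-f]+")
BASE32_RE = re.compile(r"[a-z2-7]+")
BASE64URL_RE = re.compile(r"[A-Za-z0-9_-]+")


def guess_encoding(label: str, min_len: int = 16):
    """Guess which encoding alphabet a DNS label fits, or return None.

    Short labels are ignored because almost any short word fits some
    alphabet. Hex is tested first since its alphabet is the narrowest.
    """
    if len(label) < min_len:
        return None
    lower = label.lower()
    if HEX_RE.fullmatch(lower) and len(lower) % 2 == 0:
        return "hex"
    # Require a digit so long all-letter words are not flagged as base32.
    if BASE32_RE.fullmatch(lower) and any(ch.isdigit() for ch in lower):
        return "base32"
    has_upper = any(ch.isupper() for ch in label)
    has_lower = any(ch.islower() for ch in label)
    has_digit = any(ch.isdigit() for ch in label)
    if BASE64URL_RE.fullmatch(label) and has_upper and has_lower and has_digit:
        return "base64"
    return None


for label in ("4b6f6e74656e74aabb", "JBSWY3DPEBLW64TMMQQQ", "internationalization"):
    print(label, guess_encoding(label))
```

The base64 branch depends on the log source preserving query case; resolvers or log pipelines that lowercase names will hide it. A match here is a reason to look closer, and should still be combined with volume and label length before escalating.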

NICE Framework Alignment

Code	Work Role Knowledge / Skill / Task	Relevance
K0046	Knowledge of intrusion detection methodologies	DNS exfiltration detection requires signature and anomaly-based detection in parallel
K0145	Knowledge of security event correlation tools	SIEM aggregation and entropy calculation applied across millions of DNS log records
K0187	Knowledge of file type abuse by adversaries	Recovered exfiltrated files may be renamed or fragmented to avoid DLP detection
S0047	Skill in preserving evidence integrity	Raw PCAP and DNS log preservation with verified checksums before any reassembly
T0049	Decrypt seized data / analyze forensic artifacts	Decoding base64/hex-encoded subdomain fragments to reconstruct exfiltrated files