Recovering Wide Strings (UTF-16LE) from Memory: C2 URL and Credential Extraction via Volatility

cloud_container_security Difficulty 1–5 30 min certifiable

Theory

Why This Matters

During the 2017 analysis of the NotPetya malware, researchers discovered that the credential harvesting module stored its collected NTLM hashes and C2 server addresses as wide (UTF-16LE) strings within the process heap. Standard strings output, which searches for ASCII sequences by default, missed these entirely. Only when analysts ran strings -el (little-endian wide strings) did the C2 infrastructure become visible. This is not an edge case: Windows uses UTF-16LE natively, and the wide (W) variants of its API functions take UTF-16LE arguments, so malware that calls them to open files, connect to URLs, spawn processes, or access the registry typically holds those string arguments as wide strings in memory. Analysts who check only ASCII strings routinely miss a large share of the string artifacts in a Windows process.

Core Concept

UTF-16LE (Unicode Transformation Format, 16-bit, Little-Endian) encodes every character in the Basic Multilingual Plane as two bytes, least-significant byte first; characters outside it take four bytes as a surrogate pair. The ASCII character A (code point U+0041) is encoded as \x41\x00 in UTF-16LE; B is \x42\x00, and so on for the Basic Latin block. Because data bytes and null bytes are interleaved, an ASCII scanner sees a UTF-16LE string as single printable characters separated by nulls, and each one-character run falls below the minimum-length threshold of strings (4 by default).
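
The interleaving described above can be checked directly in Python. This is a minimal sketch; the sample URL is a made-up placeholder, and the two regular expressions only approximate what an ASCII scanner and a wide scanner look for.

```python
import re

text = "http://c2.example/gate"  # hypothetical sample string
wide = text.encode("utf-16-le")

# Each Basic Latin character becomes its ASCII byte followed by 0x00.
first_four = wide[:8].hex(" ")  # bytes for 'h', 't', 't', 'p'

# An ASCII scanner looking for runs of 4+ printable bytes finds nothing,
# because every printable byte is followed by a null.
ascii_runs = re.findall(rb"[\x20-\x7e]{4,}", wide)

# A wide scanner that expects (printable byte, 0x00) pairs recovers the string.
wide_runs = [
    run.decode("utf-16-le")
    for run in re.findall(rb"(?:[\x20-\x7e]\x00){4,}", wide)
]

# A character outside the Basic Multilingual Plane takes four bytes.
astral_len = len("\U0001F600".encode("utf-16-le"))

print(first_four)
print(ascii_runs, wide_runs, astral_len)
```

The first print shows the byte pairs, and the empty ASCII result is the reason a default strings pass reports nothing for this buffer.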

The strings utility addresses this with the -e (encoding) flag:
- strings -e l or strings -el: little-endian 16-bit (UTF-16LE), the Windows native encoding.
- strings -e b or strings -eb: big-endian 16-bit (UTF-16BE), used on some non-x86 platforms.
- strings -e s: 7-bit ASCII (default).
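
To make the three modes concrete, the sketch below approximates each one with a regular expression. It is an illustration, not a reimplementation of strings: the scan helper, the PATTERNS table, and the sample bytes are all hypothetical names and data introduced here.

```python
import re

PRINTABLE = rb"[\x20-\x7e]"

# One pattern per mode; %d is filled with the minimum character count.
PATTERNS = {
    "s": PRINTABLE + rb"{%d,}",                  # single 7-bit bytes
    "l": rb"(?:" + PRINTABLE + rb"\x00){%d,}",   # UTF-16LE: byte, then null
    "b": rb"(?:\x00" + PRINTABLE + rb"){%d,}",   # UTF-16BE: null, then byte
}
CODECS = {"s": "ascii", "l": "utf-16-le", "b": "utf-16-be"}


def scan(data: bytes, mode: str = "s", min_chars: int = 4):
    """Return (offset, text) pairs for one encoding mode."""
    pattern = re.compile(PATTERNS[mode] % min_chars)
    return [
        (m.start(), m.group().decode(CODECS[mode]))
        for m in pattern.finditer(data)
    ]


# Hypothetical buffer holding one string in each encoding.
sample = (
    b"\x01\x02" + b"cmd.exe /c whoami" + b"\x00\xff"
    + "https://a.example".encode("utf-16-le") + b"\xff\xff"
    + "Kernel32".encode("utf-16-be") + b"\xff"
)

for mode in ("s", "l", "b"):
    print(mode, scan(sample, mode, min_chars=8))
```

On this sample the big-endian pass also reports a misaligned fragment of the little-endian URL with its first character missing, which illustrates why the encoding should match the platform that produced the dump.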

In memory dumps, wide strings appear in several key locations:
- Process heap: dynamically allocated strings constructed or received at runtime (URLs, file paths, registry keys).
- Stack frames: local wchar_t variables and LPWSTR function arguments.
- PE image sections: the .data section of a PE executable may contain wide-string constants.
- PEB / Process Parameters: the CommandLine, ImagePathName, and CurrentDirectory fields of RTL_USER_PROCESS_PARAMETERS are all UNICODE_STRING structures (a Length, MaximumLength WORD pair followed by a Buffer pointer to a wide string).

UNICODE_STRING is the fundamental Windows string type: a structure containing Length (bytes, not characters), MaximumLength, and Buffer (pointer). Volatility plugins that display process parameters parse these structures directly from the PEB.
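
As a rough illustration of that layout, the sketch below unpacks a UNICODE_STRING header from raw bytes and applies a few plausibility checks. It is a standalone sketch and not Volatility's API: parse_unicode_string, read_string, and the read_va callback are hypothetical names, the fake address and path are invented, and the even-length and non-null-pointer checks are assumptions added on top of the 10 to 520 byte range used later in this card.

```python
import struct
from typing import Callable, NamedTuple, Optional


class UnicodeString(NamedTuple):
    length: int          # bytes in use (not characters)
    maximum_length: int  # bytes allocated for the buffer
    buffer: int          # virtual address of the UTF-16LE data


def parse_unicode_string(raw: bytes, is_64bit: bool = True) -> Optional[UnicodeString]:
    # x64: two USHORTs, 4 bytes of alignment padding, 8-byte pointer (16 bytes).
    # x86: two USHORTs, 4-byte pointer (8 bytes).
    fmt = "<HH4xQ" if is_64bit else "<HHI"
    if len(raw) < struct.calcsize(fmt):
        return None
    length, maximum_length, buffer = struct.unpack_from(fmt, raw)
    plausible = (
        10 <= length <= 520            # range taken from the methodology below
        and length % 2 == 0            # assumption: whole 16-bit code units
        and length <= maximum_length
        and buffer != 0                # assumption: null pointers are skipped
    )
    return UnicodeString(length, maximum_length, buffer) if plausible else None


def read_string(us: UnicodeString, read_va: Callable[[int, int], bytes]) -> str:
    """Fetch and decode the buffer using a caller-supplied memory reader."""
    return read_va(us.buffer, us.length).decode("utf-16-le", errors="replace")


# Hypothetical example: a fake address space holding one command line.
payload = "C:\\Users\\Public\\svc.exe -k".encode("utf-16-le")
fake_va = 0x0000020000001000
header = struct.pack("<HH4xQ", len(payload), len(payload) + 2, fake_va)
memory = {fake_va: payload + b"\x00\x00"}

us = parse_unicode_string(header)
print(us, read_string(us, lambda addr, size: memory[addr][:size]))
```

Because Length counts bytes, the decoded text has half as many characters as the Length field suggests, which is an easy mistake to make when validating candidates by hand.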

Technical Deep-Dive

# ASCII strings (default)
strings -a -n 8 memdump.raw | grep -iE "(http|cmd|powershell|\.exe)" | head -30

# Wide (UTF-16LE) strings
strings -el -n 8 memdump.raw | grep -iE "(http|cmd|powershell|\.exe)" | head -30

# Side-by-side comparison: extract both and merge
strings -a  -n 8 memdump.raw > ascii_strings.txt
strings -el -n 8 memdump.raw > wide_strings.txt
wc -l ascii_strings.txt wide_strings.txt

# Filter for URL-like patterns in wide strings
strings -el memdump.raw | grep -oP 'https?://[^\s]+' | sort -u

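# Volatility 2: strings plugin maps string offsets back to owning processes.
# The plugin reads "offset:string" lines from a file, so emit decimal offsets with -td.
strings -el -td -n 8 memdump.raw > wide_offsets.txt
python vol.py -f memdump.raw --profile=Win7SP1x64 strings --string-file=wide_offsets.txt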

# Volatility 2: cmdline — uses UNICODE_STRING from PEB to get process command line
vol.py -f memdump.raw --profile=Win7SP1x64 cmdline

# Volatility 2: dlllist — wide-string DLL paths from PEB loader data
vol.py -f memdump.raw --profile=Win7SP1x64 dlllist --pid=1234

import re

def extract_wide_strings(data: bytes, min_chars: int = 6) -> list:
    """Extract UTF-16LE strings from a binary blob as (offset, text) pairs."""
    results = []
    # Pattern: runs of (printable ASCII byte, \x00) pairs, at least min_chars long
    pattern = rb'(?:[\x20-\x7e]\x00){%d,}' % min_chars
    for m in re.finditer(pattern, data):
        raw = m.group()
        try:
            decoded = raw.decode('utf-16-le').rstrip('\x00')
            results.append((m.start(), decoded))
        except UnicodeDecodeError:
            pass
    return results

with open("memdump.raw", "rb") as f:
    dump = f.read()

wide = extract_wide_strings(dump, min_chars=8)
keywords = ["http", "password", "cmd", ".exe", "secret", "token"]
for offset, s in wide:
    if any(kw in s.lower() for kw in keywords):
        print(f"  0x{offset:08x}  {s}")
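
The script above reads the whole dump into memory, which may not be practical for multi-gigabyte images. One possible alternative, sketched below, memory-maps the file and yields hits lazily. iter_wide_strings is a hypothetical helper, the temporary file only stands in for a real dump, and mapping very large files assumes a 64-bit Python build.

```python
import mmap
import os
import re
import tempfile
from typing import Iterator, Tuple

WIDE_RUN = rb"(?:[\x20-\x7e]\x00){%d,}"


def iter_wide_strings(path: str, min_chars: int = 8) -> Iterator[Tuple[int, str]]:
    """Yield (file offset, text) for UTF-16LE runs without loading the file."""
    pattern = re.compile(WIDE_RUN % min_chars)
    with open(path, "rb") as f:
        # The mapping is read-only; the OS pages data in as the regex advances.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in pattern.finditer(mm):
                yield m.start(), m.group().decode("utf-16-le")


# Demo on a small temporary file standing in for a dump (hypothetical content).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xff" * 16 + "powershell -enc AAAA".encode("utf-16-le") + b"\xff" * 16)

demo_hits = list(iter_wide_strings(tmp.name))
os.unlink(tmp.name)
print(demo_hits)
```

Yielding results instead of building a list keeps memory use flat, at the cost of the file staying mapped until the generator is exhausted or closed.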

Analytical Methodology

Run strings -a -n 8 and strings -el -n 8 on the full dump as the first quick triage step. Save output to separate files. The wide-string file is the primary source for Windows API strings (file paths, registry keys, URLs, command lines).
Use Volatility cmdline plugin to extract the command-line arguments of every process — these are UNICODE_STRING values read directly from each process's PEB. This is more reliable than grep because it resolves the PEB pointer chain.
Use Volatility dlllist to list all loaded DLLs per process — also wide strings from the PEB loader. Unexpected DLL paths (e.g., a DLL loaded from %TEMP% or a non-standard directory) are injection indicators.
For processes of interest, use Volatility memdump to extract the process address space, then run strings -el specifically on the process dump. This narrows the search space and provides process attribution for every found string.
Apply the Python wide-string extractor to the process dump with keyword filtering. For each hit, note the memory offset. Optionally cross-reference the offset against vadinfo output to determine which VAD region (heap, stack, mapped file) contained the string.
Search specifically for UNICODE_STRING structures: a 2-byte Length, a 2-byte MaximumLength, and a pointer (4 bytes on 32-bit systems; on 64-bit systems, 4 bytes of alignment padding followed by an 8-byte pointer). When Length is between 10 and 520 (the 260-character MAX_PATH limit expressed in bytes), does not exceed MaximumLength, and the pointer resolves to readable memory containing the expected wide string, the structure is likely valid.
Grep wide-string output for C2 indicators: URLs (http/https), IP address patterns (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}), and domain name patterns. Cross-reference against threat intelligence feeds.
Document each significant wide string: memory offset, containing process (PID and name), VAD region, decoded content, and relevance to the investigation.
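
The indicator search in the steps above could be sketched as follows. This is a minimal example under stated assumptions: find_indicators is a hypothetical helper, the sample strings and addresses are invented documentation values, and domain-name matching is left out because a naive pattern also matches file names such as svc.exe.

```python
import ipaddress
import re

URL_RE = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)
IPV4_RE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")


def valid_ipv4(candidate: str) -> bool:
    """Reject dotted quads that are not real addresses, such as 999.300.1.1."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def find_indicators(strings):
    """Take (offset, text) pairs and return (offset, kind, value) hits."""
    hits = []
    for offset, text in strings:
        for m in URL_RE.finditer(text):
            hits.append((offset, "url", m.group()))
        for m in IPV4_RE.finditer(text):
            if valid_ipv4(m.group()):
                hits.append((offset, "ipv4", m.group()))
    return hits


# Hypothetical wide-string output, in the (offset, text) shape used earlier.
sample_strings = [
    (0x1F40, "http://203.0.113.7/gate.php"),
    (0x2A10, "Connecting to 198.51.100.23:8443"),
    (0x3000, "version 6.1.7601.17514"),
    (0x3100, "999.300.1.1"),
]

for offset, kind, value in find_indicators(sample_strings):
    print(f"0x{offset:08x}  {kind:<5} {value}")
```

Keeping the offset with every hit means the output can feed straight into the documentation step, where each string is recorded with its location and owning process.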

Common Analytical Errors

Running only ASCII strings on Windows dumps: The single most common oversight in Windows memory forensics. On a Windows system, strings -el output is typically 30–50% of the size of strings -a output, yet it holds a disproportionate share of the operationally significant strings.
Running only the -el pass and skipping the ASCII pass: Both encodings matter. Malware sometimes stores as narrow ASCII what would normally be a wide string, as a form of obfuscation; checking only wide strings misses these.
Not attributing strings to processes: Whole-dump string output loses process context. A C2 URL found in dump-wide strings is a lead; the same URL attributed to svchost.exe PID 3456 with vaddump confirmation is actionable evidence.
Missing UNICODE_STRING header parsing: Grep on the decoded string only finds the string content. The UNICODE_STRING structure header contains the true length of the buffer, which can reveal truncation (a sign the string was partially overwritten) or padding.
Ignoring wide strings in non-heap regions: Stack frames and PE .data sections also contain wide strings. Restricting the search to the process heap misses local wchar_t buffers and LPWSTR arguments (on thread stacks) and compile-time wide constants (in the PE .data section).
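
Attribution to a memory region can be reduced to a sorted-range lookup. The sketch below assumes the string's location has already been translated to a virtual address and that region boundaries were copied by hand from vadinfo-style output with inclusive end addresses. Region, build_index, locate, and the sample ranges and labels are all hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional, Tuple


class Region(NamedTuple):
    start: int
    end: int    # inclusive end address (assumed to match vadinfo's End field)
    label: str  # analyst-assigned description, e.g. heap, stack, image


def build_index(regions: List[Region]) -> Tuple[List[int], List[Region]]:
    """Sort regions by start address and keep a parallel list of starts."""
    ordered = sorted(regions)
    return [r.start for r in ordered], ordered


def locate(address: int, starts: List[int], ordered: List[Region]) -> Optional[Region]:
    """Return the region containing address, or None if it falls in a gap."""
    i = bisect_right(starts, address) - 1
    if i >= 0 and ordered[i].start <= address <= ordered[i].end:
        return ordered[i]
    return None


# Hypothetical regions for one process.
starts, ordered = build_index([
    Region(0x7FF6A0000000, 0x7FF6A0012FFF, "image (.data and other sections)"),
    Region(0x000000A3D000, 0x000000A3FFFF, "thread stack"),
    Region(0x0000002C0000, 0x0000003BFFFF, "process heap"),
])

for va in (0x2C2280, 0xA3E100, 0x500000):
    region = locate(va, starts, ordered)
    print(f"0x{va:012x} -> {region.label if region else 'unmapped'}")
```

A lookup like this turns a bare offset into the kind of statement an investigation report needs, such as which process and which region held the string.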

NICE Framework Alignment

Code	Knowledge/Skill/Task Statement	How This Card Develops It
K0017	Knowledge of concepts and practices of processing digital forensic data	Understanding UTF-16LE encoding and UNICODE_STRING structures as Windows memory artifacts
K0042	Knowledge of incident response and handling methodologies	Extracting C2 URLs and command strings from memory as part of malware triage during incident response
K0187	Knowledge of file type abuse by adversaries for anomalous behavior	Recognising wide-string encoding as a technique that evades ASCII-only string scanners
S0047	Skill in preserving evidence integrity according to standard operating procedures	Working from memory images and documenting all findings with memory offsets and process attribution
T0049	Decrypt seized data using technical means	Decoding UTF-16LE binary sequences into human-readable strings for evidence analysis