
How to Reverse Engineer a Malicious Office Document

A weaponized .docm lands in an analyst’s inbox at 9:14 AM. By 9:18, the payload URL is in the SIEM, the second-stage hash is queued for sandbox detonation, and the IOC list is going to the firewall team. That four-minute turnaround is what good static reverse engineering of a malicious Office document looks like — and it almost never requires opening the file in Word.

Despite Microsoft blocking macros from internet-sourced files by default since 2022, malicious Office documents remain one of the most prolific initial-access vectors. Attackers have adapted: Excel 4.0 (XLM) macros, malicious .xll add-ins, template injection, embedded OLE objects, and exploits like CVE-2017-11882 and CVE-2022-30190 (Follina) all sidestep the macro warning entirely. This guide walks through the full static and dynamic reversing workflow, the tools that make it tractable, and the obfuscation patterns you will actually meet in the wild.

What “Office document” actually means under the hood

Three formats dominate, and each demands different tools.

The legacy OLE2 / Compound File Binary Format (.doc, .xls, .ppt) is a small filesystem inside a single file: storages, streams, and a directory. VBA project source lives in compressed streams with names like VBA/ThisDocument. This is what oledump.py and oletools were built to dissect.

Office Open XML / OOXML (.docx, .xlsx, .pptx and their macro-enabled cousins .docm, .xlsm, .pptm) is a ZIP archive containing XML parts plus a _rels/ directory describing relationships. By design the non-m extensions cannot host VBA, which is why a .docx that appears to do something dangerous is almost always abusing template injection or an embedded object instead. Listing the archive contents with unzip -l is a legitimate first move on these.

Rich Text Format (.rtf) is plain text with control words. RTF cannot carry VBA but happily carries embedded OLE objects — historically the delivery vehicle for Equation Editor exploits like CVE-2017-11882. rtfobj is the right hammer here.

Knowing which container you are looking at decides everything that follows. A file command and an exiftool pass take five seconds and tell you the truth even when the extension lies.
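The check that file performs can be approximated by looking at the first few bytes. Here is a minimal sketch in Python, assuming the usual magic values for the three containers. The function name sniff_container is illustrative, and a ZIP hit only means "possibly OOXML" until you have looked inside the archive.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # Compound File Binary header
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (OOXML is a ZIP)
RTF_MAGIC = b"{\\rt"                             # assumption: match the short form, since samples often truncate "{\rtf1"


def sniff_container(data: bytes) -> str:
    """Classify a document by its leading bytes, ignoring the file extension."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip-or-ooxml"
    if data.lstrip().startswith(RTF_MAGIC):
        return "rtf"
    return "unknown"
```

Feed it the first kilobyte of the sample, for example sniff_container(open(path, "rb").read(1024)), and route the file to oledump, unzip, or rtfobj accordingly.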

Build a safe lab before you touch anything

Never analyze a suspected malicious document on your daily-driver workstation. The two community-standard environments are REMnux (Linux distro for static analysis, ships with oletools, oledump, pdfid, xlmdeobfuscator, ViperMonkey, CyberChef, YARA, and more pre-installed) and FLARE-VM (Windows lab for dynamic analysis, includes Process Monitor, Wireshark, x64dbg, dnSpy).

Network: isolate the VM. Use a host-only adapter or route through a proxy like INetSim that fakes responses for any DNS or HTTP request the sample makes. You want to see the C2 domain resolve, not actually fetch the next stage.

Snapshots: take one before opening anything. Roll back after each detonation. Static analysis is non-destructive, but the moment you launch Word in a dynamic run, assume the VM is burned.

ANALYSIS PIPELINE

01 Triage: Hash, file type, format detection. Tools: file, exiftool, oleid.
02 Structure: Map streams, parts, embedded objects. Tools: oledump, unzip, oledir.
03 Extract: Pull macros, OLE objects, shellcode. Tools: olevba, rtfobj, oleobj.
04 Deobfuscate: Decode strings, emulate, pivot to IOCs. Tools: ViperMonkey, CyberChef, xlmdeobfuscator.

Triage: identify before you analyze

Three commands answer the question “what am I actually holding.”

file sample.bin returns the real format regardless of extension. A .doc that reports as Composite Document File V2 Document is genuine OLE2; one that reports as Microsoft Word 2007+ or Microsoft OOXML is a renamed OOXML archive.

exiftool sample.doc extracts metadata: author, last-modified-by, template path, application name. Phishing operators are sloppy here. A document supposedly from a US accounting firm that lists Author: Администратор and Template: Normal.dotm written under LibreOffice/24.2 is telling on itself before you open a single stream.

oleid sample.doc from the oletools suite produces a structured risk indicator output flagging VBA macros, XLM macros, encrypted streams, Flash objects, and external relationships. It is the fastest “is this worth deeper analysis” check available, with a low/medium/high risk verdict.

If the file is OOXML, also run unzip -l sample.docx to list parts. Look for unexpected .bin files (often oleObject1.bin, where embedded payloads hide), suspicious external relationships in word/_rels/document.xml.rels, or template references pointing to remote URLs.
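The same part listing can be scripted when you have a folder of samples to sort. The sketch below uses only Python's standard zipfile module and groups part names into the buckets mentioned above. The function name and bucket labels are illustrative, not part of any tool.

```python
import io
import zipfile


def flag_ooxml_parts(data: bytes) -> dict:
    """Group OOXML part names into the buckets an analyst checks first."""
    flagged = {"vba": [], "embedded_bin": [], "rels": []}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            lower = name.lower()
            if lower.endswith("vbaproject.bin"):
                flagged["vba"].append(name)           # macro project (an OLE2 file)
            elif lower.endswith(".bin"):
                flagged["embedded_bin"].append(name)  # e.g. oleObject1.bin
            elif lower.endswith(".rels"):
                flagged["rels"].append(name)          # read these for external targets
    return flagged
```

Only the ZIP directory is read here, nothing is extracted to disk, so it is safe to run outside the detonation VM.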

Map the structure with oledump

For OLE2 files, oledump.py sample.doc prints a numbered table of every stream with its size and a marker (M for a VBA module with code, m for a VBA module that holds only attribute statements, O for an embedded OLE object, ! for unusual VBA code). This is your map.

To dump a specific stream’s content, pass -s <number>. Pair with -v to decompress VBA streams. So oledump.py -s 8 -v sample.doc extracts the source of the macro module stored in stream 8. The advantage of oledump over olevba is granularity: when a sample has 40 streams and only one carries the malicious logic, you can extract surgically and pipe to your own tooling.

For OOXML, the equivalent is zipdump.py from the same Didier Stevens suite, or simply unzip followed by inspection of word/vbaProject.bin (which is itself an OLE2 container — recurse with oledump).

Extract and read the VBA

olevba sample.doc is the workhorse. It walks every supported container format, finds VBA and XLM macros, deobfuscates several common encoding schemes (Hex, Base64, Dridex, StrReverse, character-arithmetic VBA expressions), and runs a keyword scanner that flags AutoOpen, Document_Open, Workbook_Open, Shell, CreateObject, URLDownloadToFile, WScript.Shell, registry writes, and the rest of the malicious-VBA vocabulary. The output ends in a summary table with three kinds of rows: AutoExec triggers, Suspicious keywords, and IOCs (URLs, IPs, executable names). That table often tells you the entire story before you read a line of source.

For raw extraction without analysis, olevba -c sample.doc > macros.vb dumps just the VBA source. Open it in your editor of choice and read.
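Once the source is in a text file, a crude version of the keyword pass is easy to reproduce for your own triage scripts. The sketch below is not olevba's implementation. It is a small whole-word scan over a short, hand-picked subset of the keywords named above, and the function and list names are illustrative.

```python
import re

AUTOEXEC = ("AutoOpen", "Document_Open", "Workbook_Open")
SUSPICIOUS = ("Shell", "CreateObject", "URLDownloadToFile", "WScript.Shell")


def scan_macro_source(source: str) -> dict:
    """Report which auto-exec triggers and suspicious keywords appear as whole words."""
    hits = {"autoexec": [], "suspicious": []}
    for bucket, words in (("autoexec", AUTOEXEC), ("suspicious", SUSPICIOUS)):
        for word in words:
            # Lookarounds keep "Shell" from matching inside "powershell".
            pattern = r"(?<![A-Za-z0-9_])" + re.escape(word) + r"(?![A-Za-z0-9_])"
            if re.search(pattern, source, re.IGNORECASE):  # VBA is case-insensitive
                hits[bucket].append(word)
    return hits
```

A hit in both buckets (a trigger plus an execution primitive) is the combination worth prioritizing. Obfuscated samples will evade a literal scan like this, which is why the deobfuscation steps later in the workflow exist.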

A typical first-stage VBA pattern from a credential-stealer dropper looks like this once deobfuscated:

Sub AutoOpen()
    Dim s As String
    s = "powershell -nop -w hidden -enc " & b64payload
    Shell s, vbHide
End Sub

The b64payload decodes to a PowerShell stager that pulls the next stage from a C2 over HTTPS. That URL is the IOC you push to the firewall.
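Decoding that argument by hand is a two-step job: PowerShell's -enc (EncodedCommand) value is Base64 over UTF-16LE text. A minimal Python sketch follows. The URL regex is deliberately simple and the function names are illustrative.

```python
import base64
import re

# Simplistic URL pattern: stops at whitespace, quotes, angle brackets, or ")".
URL_RE = re.compile(r"https?://[^\s'\"<>)]+", re.IGNORECASE)


def decode_powershell_enc(b64payload: str) -> str:
    """Decode a PowerShell -enc argument (Base64 of UTF-16LE text)."""
    return base64.b64decode(b64payload).decode("utf-16-le", errors="replace")


def extract_urls(text: str) -> list:
    """Pull candidate URLs out of a decoded command line."""
    return URL_RE.findall(text)
```

Run decode_powershell_enc on the string recovered from the macro, then extract_urls on the result. If the decoded stager is itself obfuscated, the regex will come back empty and the payload needs another pass.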

When the document hides Excel 4.0 macros

Excel 4.0 (XLM) macros predate VBA, which arrived with Excel 5.0 in 1993, and live in hidden worksheet cells rather than VBA streams. They became fashionable for malware authors around 2020 because most security tools focused on VBA. olevba detects basic XLM presence in .xlsm files, but for serious analysis use xlmdeobfuscator (the XLMMacroDeobfuscator project), which emulates the XLM execution engine and resolves chained =FORMULA(), =CALL(), and =EXEC() cells.

Look for sheets with white text on a white background, sheet visibility set to xlVeryHidden, and Auto_Open named ranges. The deobfuscator output will give you the resolved command line — usually a regsvr32 or mshta invocation pointing at a remote scriptlet.
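For ZIP-based workbooks (.xlsx, .xlsm), hidden sheets and Auto_Open names can be read directly from xl/workbook.xml. The sketch below assumes the standard SpreadsheetML namespace and does not handle legacy .xls or binary .xlsb files. The function name is illustrative.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def hidden_sheets_and_auto_open(data: bytes) -> dict:
    """List hidden/veryHidden sheets and Auto_Open defined names in a workbook."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    hidden = [
        (sheet.get("name"), sheet.get("state"))
        for sheet in root.iter(MAIN_NS + "sheet")
        if sheet.get("state") in ("hidden", "veryHidden")
    ]
    auto_open = [
        (name.get("name"), (name.text or "").strip())
        for name in root.iter(MAIN_NS + "definedName")
        if "auto_open" in (name.get("name") or "").lower()
    ]
    return {"hidden": hidden, "auto_open": auto_open}
```

An Auto_Open name that points into a veryHidden sheet tells you where to aim xlmdeobfuscator first.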

Pulling embedded objects out of RTF

RTF is the format of choice for exploit-driven (rather than macro-driven) attacks because it can carry OLE objects with controlled type identifiers. rtfobj sample.rtf enumerates every embedded object with its CLSID, class name, and size. A CLSID of 0002CE02-0000-0000-C000-000000000046 is the Microsoft Equation Editor — CVE-2017-11882 and CVE-2018-0802 territory. Extract with rtfobj -s <id> sample.rtf and analyze the resulting binary as shellcode.
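If you want to confirm the CLSID yourself inside an extracted object, remember that GUIDs are stored on disk with their first three fields little-endian, so the byte pattern differs from the printed form. A small sketch using Python's uuid module follows. The lookup table contains only the CLSID named above, and the function name is illustrative.

```python
import uuid

KNOWN_CLSIDS = {
    "0002CE02-0000-0000-C000-000000000046": "Microsoft Equation Editor 3.0",
}


def find_known_clsids(blob: bytes) -> list:
    """Return (offset, label) for each known CLSID found in on-disk byte order."""
    hits = []
    for text, label in KNOWN_CLSIDS.items():
        needle = uuid.UUID(text).bytes_le  # first three GUID fields are little-endian
        offset = blob.find(needle)
        while offset != -1:
            hits.append((offset, label))
            offset = blob.find(needle, offset + 1)
    return hits
```

Run this against the decoded object that rtfobj saved, not the raw RTF text, where the same bytes are hex-encoded and frequently broken up by obfuscation.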

For Follina-style (CVE-2022-30190) attacks, the giveaway is in OOXML, not RTF: an external relationship in word/_rels/document.xml.rels with Target pointing to an HTML file that abuses the ms-msdt: URL handler. No macro warning, no embedded code in the document itself — the malicious logic lives on the attacker’s server and is fetched at open time.
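External relationships like this one can be listed without opening the document in Office. The sketch below walks every .rels part in the archive with Python's standard library and reports targets marked TargetMode="External". The function name is illustrative, and for hostile XML a hardened parser such as defusedxml is worth considering in place of ElementTree.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_TAG = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"


def external_relationships(data: bytes) -> list:
    """Return (rels part, relationship type, target) for every external target."""
    found = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(zf.read(name))
            for rel in root.iter(REL_TAG):
                if rel.get("TargetMode") == "External":
                    # Keep only the last path segment of the type URI, e.g. "oleObject".
                    rel_type = (rel.get("Type") or "").rsplit("/", 1)[-1]
                    found.append((name, rel_type, rel.get("Target")))
    return found
```

Ordinary hyperlinks also show up as external relationships, so read the type column: an oleObject or attachedTemplate entry pointing at a remote host deserves far more attention than a hyperlink.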

Command reference

CHEATSHEET: Office Document Reversing Commands

TRIAGE & IDENTIFICATION
file sample.bin                  Detect real format regardless of extension
exiftool sample.doc              Extract author, template, app, timestamps
oleid sample.doc                 Risk-indicator scan: VBA, XLM, encryption, Flash
unzip -l sample.docx             List OOXML parts and embedded objects

STRUCTURE MAPPING
oledump.py sample.doc            List all streams with size and type markers
oledump.py -s 8 -v sample.doc    Dump and decompress stream 8
oledir sample.doc                Show directory entries including orphans
olemeta sample.doc               Display all OLE metadata properties

MACRO & OBJECT EXTRACTION
olevba sample.docm               Full VBA scan with deobfuscation and IOCs
olevba -c sample.docm            Dump only the VBA source code
olevba -r /samples/              Recursive batch scan of a directory
mraptor sample.docm              Heuristic malicious-macro verdict (suspicious / not)
rtfobj sample.rtf                Enumerate embedded OLE objects in RTF
oleobj sample.doc                Extract embedded objects from OLE files
msodde sample.doc                Detect DDE / DDEAUTO links in Office and RTF

DEOBFUSCATION & EMULATION
xlmdeobfuscator -f sample.xlsm   Emulate Excel 4.0 macro execution to resolve formulas
vmonkey sample.docm              ViperMonkey VBA emulation for deobfuscation
pcodedmp sample.doc              Disassemble compiled VBA p-code (catches stomping)
yara rules.yar sample.doc        Match against signature rules for known families

Deobfuscation: when the macro fights back

Skilled operators do not write Shell "powershell..." in clear text. They do one or more of the following.

String concatenation and character math: Chr(112) & Chr(111) & Chr(119) & ... produces pow.... olevba normally resolves this automatically; if not, paste into CyberChef and use the From Charcode operation.

Base64 and custom encoding: the macro decodes a long string at runtime and passes it to Shell. CyberChef’s Magic operation often guesses the encoding chain in one click.

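VBA stomping: the source code stored in a VBA module stream is benign or empty, while the malicious logic survives only as precompiled p-code, which sits in the same module stream ahead of the compressed source (the __SRP_* streams hold related cached data). Office runs the cached p-code when its own version matches the one recorded in the project, so the source is never consulted. Tools that read only source miss the payload entirely. pcodedmp.py disassembles the p-code itself, and oletools includes detection logic for the inconsistency.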

ViperMonkey is the heavier hammer when static deobfuscation stalls. It emulates VBA execution: a Python-based engine interprets the macro without Office, traces calls to Shell, CreateObject, and network APIs, and prints the resolved arguments. It is slow and occasionally fragile on exotic VBA, but invaluable when a macro has six layers of self-modifying string transforms.
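For the simplest pattern in the list above, literal Chr() chains, a few lines of Python do the same job as the CyberChef recipe. This sketch only collapses chains of Chr, ChrW, Chr$, or ChrW$ calls with plain integer arguments joined by & or +. Arithmetic inside the parentheses is out of scope, and the function name is illustrative.

```python
import re

# One Chr call, then any number of "& Chr(...)" or "+ Chr(...)" continuations.
CHR_CHAIN = re.compile(
    r"Chrw?\$?\(\s*\d+\s*\)(?:\s*[&+]\s*Chrw?\$?\(\s*\d+\s*\))*", re.IGNORECASE
)
CHR_CALL = re.compile(r"Chrw?\$?\(\s*(\d+)\s*\)", re.IGNORECASE)


def resolve_chr_chains(source: str) -> str:
    """Replace each Chr(...) & Chr(...) chain with the VBA string literal it builds."""
    def collapse(match):
        text = "".join(chr(int(code)) for code in CHR_CALL.findall(match.group(0)))
        return '"' + text.replace('"', '""') + '"'  # VBA escapes quotes by doubling
    return CHR_CHAIN.sub(collapse, source)
```

Running it over the dumped source before reading usually turns a wall of Chr calls back into recognizable command fragments.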

When you have a final command line, put the resulting URL or file path through your threat-intel platform. Pivot on the C2 host. Hash the dropped second stage if you can recover it.
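The hand-off to the threat-intel platform and the firewall team usually needs three things: the host to pivot on, a defanged URL for tickets, and a hash for any recovered file. A minimal sketch with the standard library follows. The function names and the hxxp/[.] defanging convention are choices made for this example.

```python
import hashlib
from urllib.parse import urlsplit


def sha256_hex(data: bytes) -> str:
    """SHA-256 of a recovered payload, as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def c2_host(url: str) -> str:
    """Hostname to pivot on in passive DNS or threat-intel lookups."""
    return urlsplit(url).hostname or ""


def defang(url: str) -> str:
    """Rewrite a URL so it cannot be clicked or auto-linked in a ticket or chat."""
    return url.replace("http", "hxxp", 1).replace(".", "[.]")
```

Keep the original URL for blocklists and share only the defanged form in human-readable channels.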

Where reversing breaks down

Static analysis fails cleanly in three situations.

Remote template injection. The .docx itself contains nothing malicious, just an external relationship in word/_rels/settings.xml.rels pointing to http://attacker.tld/template.dotm. The malicious VBA lives in the remote template, fetched at open time. You will see the URL in the rels XML, but you must fetch and analyze the template separately, ideally through a controlled proxy.

Exploit payloads with no embedded code. Follina (CVE-2022-30190) and the Equation Editor bugs do not require macros. There is no VBA to extract. The structural anomaly is the only signal — a weird CLSID in rtfobj output, an ms-msdt: URI in a rels file. If your workflow assumes “find the macro,” you will miss these entirely.

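Encrypted documents. Word and Excel support password protection that genuinely encrypts the document body. oletools can detect encryption but cannot break it. If the phishing email contains the password (a common pattern, since it defeats most email gateways), supply it: olevba accepts a password with -p, and msoffcrypto-tool can write a decrypted copy for the rest of the toolchain. Without the password, static tools see only ciphertext, and the remaining options are recovering the password from the surrounding lure or detonating in the lab once you have it.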
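A password-protected OOXML file is not a ZIP on disk. It is an OLE2 container holding streams named EncryptionInfo and EncryptedPackage, which is why a .docx can suddenly report as a Composite Document File. The heuristic below looks for those stream names as UTF-16LE strings. It is a sketch rather than a parser, it ignores legacy .doc and .xls encryption (which works differently), and the function name is illustrative.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # Compound File Binary header


def looks_like_encrypted_ooxml(data: bytes) -> bool:
    """Heuristic: OLE2 wrapper containing both encryption stream names."""
    if not data.startswith(OLE2_MAGIC):
        return False
    # OLE2 directory entries store stream names as UTF-16LE.
    names = ("EncryptionInfo", "EncryptedPackage")
    return all(name.encode("utf-16-le") in data for name in names)
```

A True result tells you to go looking for the password in the lure before spending time on stream analysis.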

When to switch to dynamic analysis

If static reversing has yielded the IOCs you need — payload URL, dropped file name, persistence mechanism — stop. Dynamic adds risk and rarely produces information the static path missed.

Detonate when the macro is so heavily obfuscated that emulation is impractical, when the payload is fetched from a live C2 you want to map, or when you need to observe post-execution behavior (registry writes, scheduled tasks, lateral movement attempts). Run in FLARE-VM with Process Monitor capturing filesystem and registry events, Wireshark capturing traffic, and Sysmon logging process creation. Open the document, click Enable Content, wait, snapshot, roll back.

FAQ

Do I still need to learn this if macros are blocked by default? Yes. The default block applies to files with Mark-of-the-Web from the internet zone, and only since 2022. Internal phishing, archive-extracted files (which often lose MOTW, depending on the archiver), and exploit-based attacks (Follina, Equation Editor) all bypass it. Office documents remain a top-three initial-access vector in 2025 incident reports.

Can I just upload the sample to VirusTotal? For triage on a known-public sample, sure. For an unknown sample tied to a live incident, no — uploading exposes the document to anyone with VT Intelligence access, including the attacker if they monitor for their own samples. Use a private sandbox or local analysis until you understand whether OPSEC matters.

What is the difference between olevba and oledump? olevba is purpose-built for VBA: it understands the format, deobfuscates common encodings, and produces an analyst-friendly report. oledump is a generic OLE2 stream browser — more flexible, less opinionated, better when you need to extract a specific stream and pipe it into your own tooling. Most analysts use both.

How do I keep up with new tradecraft? The MITRE ATT&CK technique pages for T1566.001 (Spearphishing Attachment) and T1204.002 (User Execution: Malicious File) are the structured reference. Beyond that, the oletools GitHub releases page, Didier Stevens’ blog, and the SANS DFIR malicious-document cheat sheet by Lenny Zeltser cover most format-level developments.

The skill that does not depreciate

Tools change — xlmdeobfuscator did not exist when XLM macros were the bleeding edge in 2020, and the next obfuscation technique will demand its own tool by 2027. What persists is the workflow: identify the container, map its structure, extract suspicious components, deobfuscate, pivot to network indicators. An analyst who internalizes that loop can reverse a format they have never seen before by Tuesday afternoon. An analyst who memorized command flags for one tool is stuck the moment attackers move.

The fastest reversers are not the ones with the most exotic toolchain. They are the ones who run file, exiftool, and oleid first — every time, on every sample — and let the artifact tell them what to do next.
