This entire project -- the reverse engineering, the code, the analysis, and this blog post -- was produced by Claude Opus 4.6 running in Claude Code, Anthropic's CLI coding agent. A human (Mitchell Johnson) provided direction and made decisions, but every line of code, every binary analysis, and every word you're reading was written by Claude.
Microsoft Outlook for Mac stores your emails in a file called HxStore.hxd. There is no public documentation for this format. No specification, no SDK, no developer guide. If your email is locked inside this binary blob and you want to get it out programmatically, you're on your own.
So we decided to crack it open.
What follows is the story of how we went from an opaque binary file to a working command-line tool that can list emails, extract full HTML bodies, search across records, and export data -- using parallel AI agents to attack the problem from multiple angles simultaneously.
The Target
The file lives deep in macOS's application data:
~/Library/Group Containers/UBF8T346G9.Office/
Outlook/Outlook 15 Profiles/Main Profile/HxStore.hxd
Our test file was about 24 MB. Opening it in a hex editor reveals the magic bytes Nostromoi at offset zero -- that's all anyone publicly knew about the format. A couple of forensics blog posts from 2018 had identified the magic header and noted it was undocumented. An academic paper behind a paywall analyzed the Windows variant. That was the sum total of public knowledge.
Phase 1: Mapping the Terrain
The first step in any reverse-engineering effort is establishing the basic structure. We started with the hex dump of the file header:
00000000: 4e6f 7374 726f 6d6f 6900 0000 0000 0000 Nostromoi........
00000010: 0020 b500 0000 0000 0050 0000 0000 0000 . .......P......
00000020: 0050 0c00 0000 0000 c012 0000 0000 0000 .P..............
From this we could read the magic, and a few uint64 values that looked like offsets. But the real structure became apparent when we started scanning at page boundaries.
The file is organized into 4096-byte pages. Each page has an 8-byte value at offset +8 that acts as a store identifier, and a uint32 type field at offset +16. By scanning all 5,920 pages and counting the frequency of these store IDs, a clear pattern emerged:
- ~614 pages shared one store ID with type field = 8. These were data pages.
- ~43 pages shared a different store ID. These turned out to be B-tree index pages.
- The remaining ~5,262 pages had no consistent store ID. These were blob pages containing HTML email bodies and other large content.
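The page scan described above can be sketched in a few lines. This is a minimal illustration, assuming little-endian fields at the offsets given (store ID at +8, type at +16); it's demonstrated here on a synthetic buffer rather than a real HxStore.hxd:

```python
import struct
from collections import Counter

PAGE_SIZE = 4096

def scan_pages(data: bytes) -> Counter:
    """Count (store_id, type) pairs across all 4096-byte pages."""
    counts = Counter()
    for off in range(0, len(data) - PAGE_SIZE + 1, PAGE_SIZE):
        store_id = struct.unpack_from("<Q", data, off + 8)[0]   # uint64 at +8
        page_type = struct.unpack_from("<I", data, off + 16)[0]  # uint32 at +16
        counts[(store_id, page_type)] += 1
    return counts

# Synthetic demo: three pages sharing one store ID, plus one blank page.
page = bytearray(PAGE_SIZE)
struct.pack_into("<Q", page, 8, 0xAABB)
struct.pack_into("<I", page, 16, 8)
blob = bytes(page) * 3 + bytes(PAGE_SIZE)
print(scan_pages(blob).most_common(1))  # → [((43707, 8), 3)]
```

On a real file, the dominant `(store_id, type=8)` pair is what identifies the data pages.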
Each data page is divided into 8 slots of 512 bytes. Each slot begins with a 32-byte header:
| Field | Size | Purpose |
|---|---|---|
| Hash | 8 bytes | Lookup key |
| Store ID | 8 bytes | Must match the primary store ID |
| Type | 4 bytes | Must be 8 for data |
| size_a | 4 bytes | Compressed size |
| size_b | 4 bytes | Uncompressed size |
| Unknown | 4 bytes | Observed values: 2, 4, 6 |
The first 8 bytes after the slot header are a record ID (uint64, uncompressed). After that... something compressed.
The size_a and size_b fields were the critical clue. Records had size_b consistently 1.3x to 13.2x larger than size_a, with a median of 2.5x. This was compressed data. But compressed with what?
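Parsing one slot is a pair of struct unpacks. A sketch, with field names taken from the table above (`SlotHeader` is our own naming; whether `size_a` counts the record ID's 8 bytes is left open here, so the payload slice is simply "everything after offset 40"):

```python
import struct
from collections import namedtuple

# Field order follows the 32-byte slot header table above.
SlotHeader = namedtuple("SlotHeader", "hash store_id type size_a size_b unknown")

def parse_slot(slot: bytes):
    """Split one 512-byte slot into (header, record_id, payload)."""
    hdr = SlotHeader(*struct.unpack_from("<QQIIII", slot, 0))
    record_id = struct.unpack_from("<Q", slot, 32)[0]  # uncompressed uint64
    payload = slot[40:]  # compressed bytes; size_a/size_b give the sizes
    return hdr, record_id, payload
```

A valid data slot should satisfy `hdr.type == 8` and `hdr.store_id` equal to the primary store ID.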
Phase 2: The Compression Problem
This was the hard part. The compressed data had some unusual properties that made identification difficult:
- Readable text was visible within the compressed stream. UTF-16LE strings like "IPM.Note", sender names, and email addresses appeared as literals interspersed with binary tokens.
- Certain byte patterns recurred: XX 00 sequences (where XX was small), and 0xF0-0xFF bytes followed by parameter bytes.
- Only about 2.3% of bytes fell in the 0xF0-0xFF range -- this was a lightweight compression scheme, not a heavy-duty transform codec.
- The data contained no standard compression headers (no zlib 0x78, no gzip 0x1F8B, no zstd magic frame).
We had already tried zlib (with every wbits value), LZNT1, and MS-XPRESS LZ77 in earlier analysis -- all failed. This is where we decided to go wide.
Phase 3: The Agent Swarm
Rather than trying algorithms one by one, we launched five parallel AI agents, each pursuing a different strategy to identify the compression:
Agent 1: Standard Library Brute Force
Try every common compression library -- lz4 (block and frame), zstd, snappy, brotli, lzo, lzma, gzip -- at multiple offsets (0, 4, 8, 12 bytes into the record data). Test each combination against 20 records.
Agent 2: Microsoft-Specific Algorithms
Deep-dive into Microsoft's proprietary compression: MS-XPRESS LZ77, MS-XPRESS Huffman, LZNT1 (NTFS), LZX (CAB/Exchange), LZMS (WIM). Implement each decompressor per the MS-XCA specification and test at multiple offsets.
Agent 3: Statistical Byte Pattern Analysis
Forget trying to decompress -- analyze the data statistically. Compute Shannon entropy across 200+ records, map the distribution of 0xF0-0xFF marker bytes, measure inter-marker intervals, and compare the statistical fingerprint against known algorithm profiles.
Agent 4: Outlook Binary Reverse Engineering
Abandon the data and go to the source. Search the Outlook application bundle (/Applications/Microsoft Outlook.app) for linked compression libraries. Run strings and nm on the framework binaries. Look for compression function names in the symbol tables.
Agent 5: Known-Plaintext Attack
Find records with maximum readable content. Map every readable string to its exact byte offset. Identify the bytes between literal regions -- these are the compression tokens. Try to manually decompress a small segment by hypothesizing different token encodings.
All five agents ran in parallel while two more agents built the project scaffolding (the Python package structure, CLI commands, page parser) so we'd be ready to integrate the answer immediately.
The Breakthrough
Agent 5 -- the known-plaintext attack -- cracked it first, after about 15 minutes of analysis.
The answer was LZ4 Block format (raw blocks, no frame header).
The key insight came from mapping the token structure between literal strings. In LZ4, each "sequence" begins with a token byte:
Token byte:
High nibble (bits 7-4): Literal length (0-14, or 15 = read more)
Low nibble (bits 3-0): Match length - 4 (0-14, or 15 = read more)
The 0xF0-0xFF bytes that appeared throughout the data were LZ4 tokens where the high nibble was 15 (indicating an extended literal length). The XX 00 patterns were 2-byte little-endian match offsets where the distance was less than 256 bytes. Everything clicked.
But there was a catch. The standard Python lz4.block library -- which wraps the reference C implementation -- only successfully decompressed about 3% of records. It's strict about end-of-stream conditions and rejects records where the LZ4 stream terminates with a truncated final literal run. The solution was a lenient manual decoder that handles these edge cases gracefully:
```python
import struct

def lz4_block_decompress_lenient(data, max_output_size):
    output = bytearray()
    i = 0
    while i < len(data) and len(output) < max_output_size:
        token = data[i]; i += 1
        # Literal length: high nibble; 15 means read extension bytes
        lit_len = (token >> 4) & 0x0F
        if lit_len == 15:
            while i < len(data):
                extra = data[i]; i += 1
                lit_len += extra
                if extra != 255:
                    break
        # Copy literals (lenient: copy what's available)
        available = len(data) - i
        actual_lit = min(lit_len, available)
        output.extend(data[i : i + actual_lit])
        i += actual_lit
        if actual_lit < lit_len:
            return bytes(output)  # truncated final literal run
        if i >= len(data) or len(output) >= max_output_size:
            return bytes(output)
        # Match offset (2 bytes LE); bail out leniently if truncated or invalid
        if i + 2 > len(data):
            return bytes(output)
        match_offset = struct.unpack_from("<H", data, i)[0]; i += 2
        if match_offset == 0 or match_offset > len(output):
            return bytes(output)
        # Match length: low nibble + 4; 15 means read extension bytes
        match_len = (token & 0x0F) + 4
        if (token & 0x0F) == 15:
            while i < len(data):
                extra = data[i]; i += 1
                match_len += extra
                if extra != 255:
                    break
        # Byte-by-byte copy so overlapping (RLE-style) matches work
        for _ in range(min(match_len, max_output_size - len(output))):
            output.append(output[-match_offset])
    return bytes(output)
```
With this lenient decoder: 100% decompression success rate across all records.
Confirmation from the Binary
While Agent 5 was cracking the compression empirically, Agent 4 was searching Outlook's application bundle. It found definitive proof in HxCore.framework:
Hx::Compressor::CopyStreamToCompressed_LZ4(IStream*, IStream*)
This C++ symbol in the binary confirms LZ4 is the compression algorithm. The agent also found that mso20.framework contains a complete statically-linked LZ4 library with 100 exported symbols (core LZ4, LZ4HC high compression, LZ4F frame, and streaming APIs).
The other three agents all independently confirmed LZ4 through their own methods -- statistical analysis matched LZ4's entropy profile (4.8 bits/byte average), the Microsoft algorithms were all ruled out, and the standard library agent confirmed lz4.block worked at offset 8 with a lenient decoder.
Five agents, five independent lines of evidence, one answer.
Phase 4: The Cola Storage Engine
The binary analysis revealed far more than just the compression algorithm. Outlook's storage engine is internally called "Cola" (Hx::Storage::Cola namespace) -- a full page-based database engine with B-tree indices, collection pages, and an LRU page cache. The file magic is "Nostromoi" on Mac and "Nostromoh" on Windows.
The Two-Object Model
The deepest insight came from a 170-turn agent session that exhaustively searched every string constant, RTTI type name, and ObjC selector in HxCore.framework. It discovered that Outlook separates email metadata from body content into two distinct Cola objects:
- MessageHeader (ObjectType 201) -- lightweight: subject, sender, date, flags, preview, read status
- MessageData -- heavyweight: HTML body, plain text body, encoding, sync status
The link between them is a field called _messageDataId stored on the MessageHeader. This explained a mystery that had been bothering us: why did some records have full HTML bodies inline while others only had short text previews? The answer is that 0x03B0 format records are MessageHeaders that happen to have their MessageData content embedded directly, while 0x10013 records are MessageHeaders where the body lives in a separate object.
We found the _messageDataId field at consistent byte offsets (632, 652, 848) in decompressed 0x10013 records, pointing to the record ID of the corresponding 0x03B0 record that holds the HTML body.
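A lookup sketch for this link, using the offsets listed above. Validating each candidate against the set of known record IDs is our own safety check, since the offsets are empirical rather than derived from a schema:

```python
import struct

# Empirically observed offsets of _messageDataId in decompressed 0x10013 records
MESSAGE_DATA_ID_OFFSETS = (632, 652, 848)

def find_message_data_id(decompressed: bytes, known_record_ids):
    """Return the first uint32 at a known offset that matches an
    existing record ID, or None if no offset yields a valid ID."""
    for off in MESSAGE_DATA_ID_OFFSETS:
        if off + 4 <= len(decompressed):
            candidate = struct.unpack_from("<I", decompressed, off)[0]
            if candidate in known_record_ids:
                return candidate
    return None
```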
The Object Type System
The symbol analysis revealed 10 Cola object types:
| Type | ID | Description |
|---|---|---|
| Account | 73 | Email account |
| View | 77 | Mail folder/view |
| Calendar | 104 | Calendar |
| Appointment | 107 | Calendar event |
| MessageHeader | 201 | Email metadata |
| DataReplication | 202 | Sync state |
| AttachmentHeader | 212 | Attachment metadata |
| Contact | 224 | Contact |
| SearchSession | 288 | Search state |
| Person | 359 | Person record |
Each MessageHeader has child references to Recipients, CcRecipients, BccRecipients, and AttachmentHeaders -- explaining the full email data model.
Cracking the Timestamps
Timestamps turned out to be the most elusive field. We initially assumed Cocoa epoch (uint32, seconds since 2001-01-01) because Outlook runs on macOS. A systematic agent tested every timestamp format -- Cocoa, Unix, FILETIME, OLE, .NET ticks, BCD -- at every byte offset. The answer was .NET ticks: int64, 100-nanosecond intervals since 0001-01-01 UTC (the C# DateTime.Ticks format). They're embedded in a 48-byte block bracketed by a sentinel value FF 3F 37 F4 75 28 CA 2B. The lower 4 bytes of this sentinel had been confusing our Cocoa scanner for hours -- it decoded to "2024-04-13" as a Cocoa timestamp, appearing identically in every record. It wasn't a timestamp at all; it was the sentinel marker.
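The tick conversion itself is simple. In this sketch, reading the int64 immediately after the sentinel is an illustration only -- the exact position of the ticks inside the 48-byte bracketed block is our assumption:

```python
import struct
from datetime import datetime, timedelta, timezone

# Sentinel bracketing the 48-byte timestamp block
SENTINEL = bytes.fromhex("FF3F37F47528CA2B")
DOTNET_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)

def ticks_to_datetime(ticks: int) -> datetime:
    """Convert .NET DateTime.Ticks (100 ns intervals since 0001-01-01 UTC)."""
    return DOTNET_EPOCH + timedelta(microseconds=ticks // 10)

def timestamps_after_sentinel(record: bytes):
    """Decode an int64 tick value after each sentinel occurrence."""
    results, pos = [], record.find(SENTINEL)
    while pos != -1:
        end = pos + len(SENTINEL)
        if end + 8 <= len(record):
            ticks = struct.unpack_from("<q", record, end)[0]
            results.append(ticks_to_datetime(ticks))
        pos = record.find(SENTINEL, pos + 1)
    return results
```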
Three-Tier Body Storage
Body content uses a three-tier storage strategy based on size:
- Inline -- small bodies stored directly as variable-length properties in the Cola object
- Object stream -- medium bodies stored as separate stream entries in the KeyValueStore layer
- External file -- large bodies stored as gzip-compressed .dat files on disk
We confirmed the third tier by examining the EFMData/*.dat files in the Outlook profile directory. These files start with 1f 8b (gzip magic) and decompress to 68-152 KB of full HTML email content. Records reference them via UTF-16LE strings like ~/Files/S0/3/EFMData/42.dat.
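Decoding the third tier is just a magic-byte check plus a gzip decompress. A sketch operating on raw bytes (UTF-8 with replacement is an assumed decoding; the real files may declare their charset in the HTML):

```python
import gzip

def decode_efmdata(raw: bytes):
    """Return the HTML body from an EFMData .dat blob, or None if the
    gzip magic (1f 8b) is missing."""
    if raw[:2] != b"\x1f\x8b":
        return None
    return gzip.decompress(raw).decode("utf-8", "replace")
```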
Phase 5: Extracting Email Bodies
With the Cola architecture understood, we built a 5-tier body resolution system:
Tier 1: Inline HTML -- For 0x03B0, 0x310F4, and 0x30FF1 format records, the HTML body appears as ASCII text in the decompressed data immediately after the first "IPM.Note" marker. We extract everything from <html to </html>.
Tier 2: _messageDataId link -- For 0x10013 records (MessageHeaders without inline body), we read the _messageDataId at known byte offsets in the decompressed data. If it points to a 0x03B0 record that has inline HTML, we use that body.
Tier 3: Message-ID sibling -- When _messageDataId doesn't resolve, we fall back to searching all records for a shared Internet Message-ID (the <...@...> string). If another record with the same message-ID has HTML, we use it.
Tier 4: EFMData gzip -- For records that reference EFMData/*.dat files, we decompress the gzip file and extract the HTML.
Tier 5: Preview fallback -- When no full body is available, we use the body preview text stored as a UTF-16LE string in the record.
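Tier 1 -- the most common path -- can be sketched as a pair of marker searches. The marker strings come straight from the description above; treating the body as ASCII with replacement characters is our lenient choice:

```python
def extract_inline_html(decompressed: bytes):
    """Tier 1: everything from '<html' through '</html>' after the
    first IPM.Note marker; None if any bound is missing."""
    anchor = decompressed.find(b"IPM.Note")
    if anchor == -1:
        return None
    start = decompressed.find(b"<html", anchor)
    if start == -1:
        return None
    end = decompressed.find(b"</html>", start)
    if end == -1:
        return None
    return decompressed[start : end + len(b"</html>")].decode("ascii", "replace")
```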
The results across our 24 MB test database:
| Tier | Method | Emails | % |
|---|---|---|---|
| 1 | Inline HTML | 256 | 40% |
| 2 | _messageDataId | 80 | 12% |
| 3 | Message-ID sibling | 30 | 5% |
| 4 | EFMData gzip | 9 | 1% |
| 5 | Preview fallback | 251 | 40% |
| Full HTML total | 375 | 59% |
The remaining 40% are emails where Outlook only synced the metadata and preview to the local cache -- the full body exists on the Exchange server but was never downloaded. The _bodyDownloadStatus property we found in the symbol analysis confirms this is by design: Outlook manages body download as an async queue, only fetching full bodies when the user opens the email or bandwidth allows.
Phase 6: The B-tree Index
We also tackled the B-tree index pages (secondary store_id). Three parallel agents attacked this:
- One decoded the page structure (same 8-slot layout as data pages, with mixed index nodes and data records)
- One cross-referenced index entry fields against data records
- One examined the allocation bitmaps and section pointers
What we found: 82 index nodes across 37 pages (62 internal, 20 leaf). Each node contains 1-2 entries with 20-byte records: [key, type=8, val1, val2, 0]. The key fields are sorted ascending (83%), confirming B-tree ordering.
The surprise: val1 and val2 aren't page pointers -- they're subtree statistics. val1 correlates with cumulative compressed size at 69%, and val2 matches record IDs at 39%. The B-tree tracks aggregate sizes per subtree for query optimization.
The practical win: Index pages share the slot layout with data pages, meaning data record slots can appear on index pages. By scanning index pages for primary-store-ID slots, we recovered 61 additional data records that a naive data-pages-only scan would miss.
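That recovery scan amounts to checking every slot header on index pages against the primary store ID. A sketch, reusing the slot layout from earlier (store ID at slot offset +8, type at +16):

```python
import struct

SLOT_SIZE = 512
SLOTS_PER_PAGE = 8

def recover_data_slots(page: bytes, primary_store_id: int):
    """Yield (slot_index, slot_bytes) for slots on an index page whose
    header carries the primary data store ID and type 8 -- the scan
    that surfaced the extra data records."""
    for s in range(SLOTS_PER_PAGE):
        slot = page[s * SLOT_SIZE : (s + 1) * SLOT_SIZE]
        store_id = struct.unpack_from("<Q", slot, 8)[0]
        slot_type = struct.unpack_from("<I", slot, 16)[0]
        if store_id == primary_store_id and slot_type == 8:
            yield s, slot
```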
Full B-tree traversal remains unsolved -- the Cola engine uses range-based key comparison rather than explicit child page pointers, and the root node isn't directly referenced from the file header.
Phase 7: Cracking the Folder Hierarchy
One major gap remained: which folder does each email belong to? Inbox? Sent Mail? Spam? We had no way to tell. The HXSTORE.md documentation explicitly flagged this as an open problem: "No known way to determine which folder a record belongs to."
We launched a four-agent team to attack this from every angle simultaneously:
- Documentation deep-dive -- exhaustive analysis of everything already known about the Cola B-tree, object types, and record structure
- Code analysis -- tracing every field and extraction path in the existing parser
- Web research -- searching for MAPI property tags, PST format comparisons, and any public forensic research on HxStore
- Binary pattern analysis -- running analysis scripts directly against the database to examine metadata records, index pages, and folder-like strings
The Discovery
The binary analyst found the first critical clue: two records contained complete folder hierarchies. Record 12 (format 0x0190) and record 132118 (format 0x01A0) listed every folder in the mailbox -- Inbox, Drafts, Archive, Sent Mail, Trash, Spam -- with paired server/local name mappings.
More importantly, every 0x10013 email record had a previously unnoticed ObjectType field at offset 44 (uint16): 0xBF (191) for emails, 0x4D (77) for folders. This was hiding in plain sight in the binary header.
A follow-up team of two agents then zeroed in on the folder-to-email link:
The folder mapper discovered a 16-byte pattern at fixed offset 1520 in every email record's decompressed data:
[uint32 folder_ref_id] [00 00 00 00] [02 00 00 00] [uint32 account_root_id]
The folder_ref_id identifies which folder the email belongs to. The account_root_id points back to the master account container record.
The property decoder independently confirmed the same fixed offset and mapped the complete 0x10013 header structure -- a 475-byte schema table identical across all email records, section markers with typed headers, and the _messageDataId field at section offset +24.
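Reading that block is a single unpack. A sketch; the constant checks mirror the zero word and the 02 00 00 00 marker described above:

```python
import struct

FOLDER_REF_OFFSET = 1520  # fixed offset observed in decompressed email records

def read_folder_ref(decompressed: bytes):
    """Parse the 16-byte block at offset 1520:
    [uint32 folder_ref_id][uint32 == 0][uint32 == 2][uint32 account_root_id].
    Returns (folder_ref_id, account_root_id), or None if the record is
    too short or the constant markers don't match."""
    if len(decompressed) < FOLDER_REF_OFFSET + 16:
        return None
    folder_ref, zero, two, account_root = struct.unpack_from(
        "<IIII", decompressed, FOLDER_REF_OFFSET)
    if zero != 0 or two != 2:
        return None
    return folder_ref, account_root
```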
The Name Table
Folder names live in large 0x0190 container records (~94 KB for our test database). These contain paired folder entries in a linked-list structure. Each pair has two ref_ids: the first is the folder's own identifier, the second is the next folder's identifier (a forward pointer). By processing entries sequentially and allowing the last mapping to win, the complete folder hierarchy emerges:
| Folder | Emails |
|---|---|
| Inbox | 304 |
| Sent Mail | 208 |
| m.johnson@massey.ac.nz | 123 |
| elwesties@gmail.com | 97 |
| Uni Mass email shit | 93 |
| Spam | 15 |
| Drafts | 4 |
| Archive, Trash, Call For Papers, Conversation History, UNI | 0 each |
Cross-validation confirmed the mapping is correct: 99% of "Inbox" emails have external senders, while 93% of "Sent Mail" emails come from the user's own addresses. The remaining 7% are from alternate accounts not in the validation set.
A key finding: MAPI property tags are not used for folder linkage. Despite Microsoft's extensive MAPI property system (PidTagParentFolderId, PidTagFolderId), the Cola engine uses its own record-ID reference system. This is a completely proprietary implementation.
The Result
The final tool, hxdecode, provides 10 CLI commands:
$ hxdecode info # Database statistics
$ hxdecode mail # List emails with sender, subject, date, folder
$ hxdecode mail --folder inbox # Filter by folder name
$ hxdecode mail --direction sent # Show only sent mail
$ hxdecode mail --direction received # Show only received mail
$ hxdecode mail show ID # Full detail of one record
$ hxdecode mail body ID # Extract full email body (HTML or text)
$ hxdecode folders # List all discovered folders with email counts
$ hxdecode records # List all records with format types
$ hxdecode search QUERY # Full-text search across decompressed data
$ hxdecode blob-search Q # Search HTML bodies in blob pages
$ hxdecode attachments # List attachments with disk paths
From our 24 MB test database: 1,400 records decoded across 11 source modules, 846+ emails with sender/subject/timestamps extracted, 375 with full HTML bodies, accurate timestamps on 100% of emails via .NET ticks, 12 folders mapped with 100% email-to-folder resolution, and 389 attachments mapped to files on disk. Output in table, JSON, or CSV.
$ hxdecode mail body 32102
Record: 32102
Source: inline
---
[full email body extracted as HTML or plain text]
What We Learned About AI-Driven Reverse Engineering
This project stretched across two sessions and involved over 20 agent launches. Some observations about using AI agents for reverse engineering:
Parallel hypothesis testing works. For the compression problem, we launched 5 agents with fundamentally different strategies. The known-plaintext agent cracked it in 15 minutes. The binary analysis agent found definitive proof in the symbol table. The statistical agent confirmed it through entropy analysis. Having multiple independent lines of evidence was more convincing than any single approach.
Agent swarms find things individual passes miss. The 61 data records on index pages were only discovered because an agent was specifically tasked with analyzing every slot on every page type. The _messageDataId field at byte offset 632 was found by an agent that systematically scanned decompressed records for uint32 values matching known record IDs. These aren't insights that emerge from casual exploration.
Symbol analysis is underrated. The 170-turn agent that searched every string constant in HxCore.framework produced the single most valuable finding of the entire project -- the two-object model with _messageDataId. No amount of staring at hex dumps would have revealed that MessageHeader and MessageData are separate Cola objects linked by a property ID. The information was sitting in the binary all along, in RTTI type names and ObjC selector strings.
Wrong assumptions persist until you test systematically. The timestamp field resisted analysis for hours because we assumed Cocoa epoch -- a reasonable default on macOS. The "timestamp" that decoded to 2024-04-13 in every single record should have been a red flag, but confirmation bias kept us scanning for Cocoa dates. Only when an agent methodically tested every known timestamp format at every byte offset did the answer emerge: .NET ticks, hidden behind a sentinel marker whose bytes happened to look like a valid Cocoa date. The lesson is that plausible-but-wrong hypotheses are the hardest to shake.
The last 40% is always the hardest. We went from 0% to 59% body resolution relatively quickly. The remaining 41% turned out to be a sync-depth limitation in Outlook itself, not a parsing problem. Knowing why you can't read something is almost as valuable as reading it.
Scale matters for structural analysis. The folder mapping required six agents across two rounds of investigation. The first round (four agents) identified that folder data existed and located the key records. The second round (two agents) produced contradictory results -- one claimed 100% resolution, the other said folder membership wasn't encoded in email records at all. A verification agent resolved the contradiction: both were partly right. The folder reference IS in the email data, but at offset 1520 -- deep in the section payload, far beyond what the second agent had examined (it only checked bytes 0-400). The lesson: when agents disagree, a focused verification pass that tests specific claims is more productive than re-running the full analysis.
Try It Yourself
The code is open source: github.com/mitchell-johnson/hxstore-decode
The format documentation (780+ lines, 17 sections): HXSTORE.md
pip install .
hxdecode mail --format json
hxdecode mail body 12345
If you have Outlook for Mac, you have an HxStore.hxd. Grant your terminal Full Disk Access, and hxdecode will find it automatically.
As far as we can determine, this is the first public identification of LZ4 as the HxStore compression algorithm, the first documentation of the Cola storage engine's two-object model and type system, the first decoding of the folder hierarchy structure, and the first open-source tool capable of extracting full email bodies and folder metadata from the database. If you find something we missed, PRs are welcome.