I hate losing stuff.
I’m not talking about things like keys or wallets (although I don’t particularly enjoy misplacing these either). I mean stuff that has emotional and sentimental value to me. And most of it is digital.
Take, for instance, my first serious programming project: As a teenager, I decided to write a hangman game in Java. I got it all sorts of wrong and only got it working with my dad’s help. But in the end, it worked! I proudly shared it with my dad and with a friend of mine who also was into computers. I even uploaded it to Sourceforge!
Then I lost the source code. I don’t know when or why. Maybe I switched to a new computer and didn’t transfer all my data properly. Maybe I managed to screw up my OS once again and didn’t back everything up when wiping the disk for the reinstall. And at some point I also deleted the Sourceforge project. Who would even care about a small silly game like that, right? Turns out, I would, if only to remember how beautifully awful the code was.
This and other occurrences of valuable digital memories being lost taught me to be a little obsessive when it comes to my personal data. I store my stuff on an external RAID that is backed up online. As for data stored in various online services, I have made a habit (or rather a reminder that still refuses to settle into habit) of downloading a copy of my data every month (thanks GDPR!) so it can be backed up properly.
For most of my online data, that’s not very complicated. Most services provide your data as some form of JSON, some hand you a collection of HTML files, and a few have their data stored as CSV.
And then there is Microsoft OneNote.
Don’t get me wrong. I am a heavy OneNote user, with multiple thousands of notes, a subset of which I use almost every day. It’s where I keep my journal, collect quotes from books I read, keep a local copy of articles and blog posts I find interesting. It’s where my wife and I planned our wedding, where I make lists of gifts for the Christmas season, and where I keep a ton of archived notes from my time at university.
In other words, of all the services I use, OneNote is where losing data would be most painful to me. After all, I don’t think it’s unreasonable to assume that Microsoft might lose interest in supporting OneNote at some point in the next 40 to 50 years. But the notes will be valuable to me even then (or arguably more so!).
A couple of years ago I sat down to research how to back up my digital notebooks in order to keep them accessible even if OneNote is shut down. What I found was both good and bad news. The good news was: You can download your OneNote notebooks quite easily by selecting the notebook files in OneDrive web and downloading them. The bad news: The file format they use is (or rather was, at the time) virtually unsupported by the rest of the world. There were no open source file viewers or converters. There were reports that Evernote on Windows could import from OneNote. But that was just another vendor that could go out of business at any point in the future.
Then I found two documents, [MS-ONESTORE] and [MS-ONE]. Together they specify how OneNote files work. The first one is a specification for the OneNote Revision Store File Format which describes a revision store that is able to store arbitrary data in a list of revisions that represents the object’s history. The second one describes the data model OneNote uses and how this data is stored using the OneNote revision store. In other words, as long as I keep these documents around, I can build a parser for OneNote files even if Microsoft should decide to discontinue it. I downloaded a copy of the specifications and moved on.
A journey into the woods
I couldn’t hold back for too long though: In the fall of 2020 my curiosity got the better of me and I started working on a OneNote file parser based on the specifications I found earlier. I started, as you’d expect, by reading the specs end to end. Both are around 100 pages long (due to the number of tables involved in explaining all byte offsets), plus another good 100 or so pages of related documents. They are dense, formal, and surprisingly thorough. With patience and a willingness to keep flipping back to earlier sections, you can implement a parser for OneNote files using nothing but the spec.
So that’s what I did. Sort of. The spec tells you what the bytes mean, but no one quite walks you through how to go from raw bytes to a revision store, and from a revision store to the actual object model. This is where libmson, dropbox/onenote-parser and Apache Tika’s OneNote parser all helped me understand one aspect or another of how to turn the spec into actual, working code. The bytes-to-objects pipeline came together over a few weekends, and by the end of it I could load a notebook and walk its tree of pages, sections, and objects in memory.
What the spec doesn’t tell you, however, is what to do with those objects. [MS-ONE] describes the data model (here’s an outline element, here’s a rich text run, here are the properties it might have), but it does not describe how OneNote actually lays a page out. How is the indentation of a nested list computed? What if I unindent some parts of the list but not others and then add a note tag? The spec describes the file format, that is, how the notes are stored. It doesn’t explain how they are displayed.
So when it came to producing something that looked like a OneNote page, I had no documentation at all. The only complete reference implementation in existence was OneNote itself.
That’s where the web app came in, and I came to it almost by accident. Surely, I assumed, the web app would use a completely different encoding — some JSON representation served from Microsoft’s backend, the binary format converted server-side into something more web-friendly. I could maybe take some clues from what DOM was rendered for a given page. But there was no way the web app would include a semi-complete parser of OneNote’s binary data, right?
With that, I began to dig. Just a little at first: look at a rich text run to figure out how it is rendered. But it didn’t take me long to realize that what I was looking at were the same .one files I’d been working with, wrapped in another binary format but with ultimately the same semantics, right in front of me.
Though that “in front of me” is doing a lot of work in that sentence. What I was in fact looking at were parts of a complex desktop application — almost certainly originally written in C# or C++ — compiled to JavaScript and minified to within an inch of their life. Now function a calls function b with arguments c and d, and b does something to a property called e on an object whose constructor is f. Reading it is less like reading code and more like decoding a substitution cipher where all symbols are shuffled anew for every function you look at.
And then, every couple of weeks, Microsoft would push an update with re-randomized names. Every note I’d taken about which function did what would become useless overnight. The first time this happened I was annoyed; the second time I started keeping my notes in terms of structure rather than labels — “the function with three parameters and a three-way if-else that calls two functions in each branch,” “the method that sets the margins of some DOM object” — and re-locating those functions after each update became a routine fifteen-minute exercise rather than a setback.
It was, in an odd way, an invaluable debugging environment. Slow, hostile, full of dead ends, but with the unique property that any question I had about how a given file element should render, I could answer empirically: load the file in OneNote web, find the relevant function in the debugger, step through it, and watch what it did. The web app was a Rosetta stone hiding in plain sight — once you accepted that the stone was going to be re-engraved every fortnight.
A dirty little secret
With the OneNote web app as my guide, I implemented the rendering part of the project and by November 2020 I had a working parser and renderer for my dearly beloved OneNote files.
All’s well that ends well, right? Parser working, renderer working, notebooks safely converted to HTML, data preservation problem solved. That’s where the story might have ended.
Except for a minor detail that Microsoft does not advertise when you download the specification PDFs. They are incomplete. Not in the sense that they’re outdated (Microsoft does keep them updated every few years), but in the sense that significant categories of content are simply not described. Ink and drawings: not in the spec. Math equations: not in the spec. Whatever you draw with a stylus, whatever formulas you’ve typed using OneNote’s equation editor while studying for that math exam: none of it is covered by the documents that are supposed to specify the file format.
Those bytes are still in the file, of course. OneNote writes them, OneNote reads them. You can find out the object type IDs easily enough, but then you’re left with a slurp of bytes without any guidance on what to do with them. Well, that’s where this story actually begins. Because onenote.rs and one2html both have full support for ink and math equations anyway.
Drawing in the dark
Of the two missing features, ink turned out to be the more tractable one — though not by much, and not at first. The good news, when I started looking at it, was that ink data lives in OneNote’s normal property-set machinery. That meant the structure of an ink object was visible to my parser even without documentation: I could see the JCIDs, I could see which objects pointed to which, I could see what properties each one carried. The tree looked something like this:
graph TD
accTitle: OneNote ink object hierarchy
accDescr: An InkContainer references an InkDataNode, which references one or more InkStrokeNodes; each StrokeNode references a StrokePropertiesNode. The DataNode carries an inline InkBoundingBox property (four optional unsigned 32-bit integers); the StrokeNode carries an inline InkPath (a signed multi-byte integer stream); the PropertiesNode carries an inline InkDimensions array (32 bytes per entry).
%% Main Nodes
Container["<b>InkContainer</b>"]
DataNode["<b>InkDataNode</b>"]
StrokeNode["<b>InkStrokeNode</b>"]
PropsNode["<b>StrokePropertiesNode</b>"]
%% Property Leaf Nodes
BBox>InkBoundingBox:<br/>4 × u32 optional]
Path>InkPath:<br/>signed multi-byte integer stream]
Dims>InkDimensions:<br/>32-byte-per-entry array]
%% Object References (Solid Lines)
Container -- "InkData (ObjectID)" --> DataNode
DataNode -- "InkStrokes (ObjectIDs)" --> StrokeNode
StrokeNode -- "InkStrokeProperties (ObjectID)" --> PropsNode
%% Inline Properties (Dotted Lines)
DataNode -.-> BBox
StrokeNode -.-> Path
PropsNode -.-> Dims
%% Styling for leaf nodes to look like data properties
style BBox fill:#f4f4f9,stroke:#999,stroke-width:1px,stroke-dasharray: 3 3
style Path fill:#f4f4f9,stroke:#999,stroke-width:1px,stroke-dasharray: 3 3
style Dims fill:#f4f4f9,stroke:#999,stroke-width:1px,stroke-dasharray: 3 3
The semantics of most of these came essentially for free. With the help of the API docs for the JS Add-In, a strange little island of documentation that only covers OneNote on the web and hasn't moved past version 1.1 since 2016, I was quickly able to figure out the semantics of the objects I was staring at. With the structure mapped and the names mostly recovered, I could render the bounding box of every stroke in a notebook within an evening. They were all in the right places. They were all empty rectangles.
The actual stroke data — the path the pen had traced — lived in InkPath, and InkPath was where I spent the most time.
What I had to start with was a sequence of bytes and the knowledge that the web app eventually turned them into SVGs. Fixed-width integers didn’t work. None of the obvious variable-length encodings worked. A naive read produced numbers that bore no relationship to where the strokes should be on the page, and ink is unforgiving in this regard: any small mistake in decoding makes the entire stroke unreadable. A path is either right or it’s an unrecognizable scribble in the wrong half of the canvas.
The breakthrough was finding a format Microsoft had published — just not in the OneNote specs. The Ink Serialized Format (ISF), written for the Tablet PC SDK back in the 2000s, describes a multi-byte encoding of signed numbers that exactly matched what InkPath contained. The decoding goes:
- Read bytes one at a time, accumulating their lower 7 bits into an unsigned integer, until you hit a byte whose top bit is clear. That byte ends the number.
- The least significant bit of the resulting integer is the sign (0 = positive, 1 = negative). The remaining bits are the magnitude.
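The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code from onenote.rs: the function name is made up, and it assumes the 7-bit groups arrive least significant first, which the description above does not spell out.

```python
from typing import List


def decode_multibyte_signed(data: bytes) -> List[int]:
    """Decode a byte stream into signed integers (hypothetical helper)."""
    values: List[int] = []
    acc = 0    # unsigned accumulator for the number being read
    shift = 0  # bit position of the next 7-bit group
    for byte in data:
        # Assumption: groups are stored least significant first.
        acc |= (byte & 0x7F) << shift
        shift += 7
        if (byte & 0x80) == 0:  # top bit clear: this byte ends the number
            magnitude = acc >> 1  # everything above the sign bit
            values.append(-magnitude if acc & 1 else magnitude)
            acc, shift = 0, 0
    if shift != 0:
        raise ValueError("stream ends in the middle of a number")
    return values
```

Under that assumption, the bytes `02 03 80 01` decode to `1`, `-1` and `64`: the first two numbers fit in a single byte each, while the third needs a continuation byte.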
That gave me a stream of signed integers. Still not coordinates — the integers were differential, not absolute, and they were laid out in a way the spec doesn’t describe. The path is dimension-first: all of the X values, then all of the Y values, then any other dimensions the stroke happens to record. Apply the per-dimension scaling factor, accumulate the differentials, and you have an SVG path.
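To make the dimension-first layout concrete, here is a rough Python sketch of that last step. The function name and the `scales` parameter are placeholders, and it makes a few simplifying assumptions: the input holds only the differentials, the number of points is inferred from the length of the stream, and the first differential of each dimension is taken relative to zero.

```python
from itertools import accumulate
from typing import Sequence


def deltas_to_svg_path(values: Sequence[int], scales: Sequence[float]) -> str:
    """Turn dimension-first differentials into an SVG path (hypothetical helper).

    `scales` holds one scaling factor per recorded dimension; only the
    first two dimensions (X and Y) end up in the path.
    """
    dims = len(scales)
    if dims < 2 or len(values) % dims != 0:
        raise ValueError("value count does not match the number of dimensions")
    n = len(values) // dims  # assumption: every dimension has one entry per point
    if n == 0:
        return ""
    columns = [
        # Slice out one dimension, sum up the differentials, then scale.
        [total * scale for total in accumulate(values[d * n:(d + 1) * n])]
        for d, scale in enumerate(scales)
    ]
    xs, ys = columns[0], columns[1]
    return "M " + " L ".join(f"{x:g} {y:g}" for x, y in zip(xs, ys))


# Three points: X differentials 10, 5, -5, then Y differentials 20, 0, 10.
print(deltas_to_svg_path([10, 5, -5, 20, 0, 10], [1.0, 0.5]))
# prints "M 10 10 L 15 10 L 10 15"
```

Scaling after accumulating gives the same result as scaling each differential first, so the order of those two steps is a matter of taste here.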
By this point I had ink mostly working, with a small list of remaining oddities I couldn’t fully account for. So I put it aside and created an issue on the GitHub repo with nothing but the title as a reminder to come back to this later. Then Sebastian a.k.a. blu-base found it.
Sebastian had been independently working on libmson, a parallel OneNote parser whose existence had already helped me — I’d referred to it more than once while writing the initial version of my own parser. By the time Sebastian found my issue, they’d ended up in roughly the same place I had with ink: ISF located, multi-byte encoding mostly understood, embedded ink (which is when ink is mixed with regular text) behaving strangely.
We compared notes in the issue thread. Sebastian had some clues I had been missing, particularly regarding the first entry in the stroke byte array. I had figured out that I had to skip it unless I wanted the strokes to start at seemingly arbitrary points on the page, but had no idea why this worked. Turns out it was simply a length prefix and ignoring it was the right thing to do all along.
With ISF decoding in place, the length prefix accounted for, and some other issues figured out, ink rendered correctly. I published my commit and moved on to math.
Solving for \(X\)
Math turned out to be the harder one, and it was not because of the encoding.
With ink, the encoding seemed like incomprehensible gibberish until I stumbled upon the ISF spec. In contrast, the encoding for inline math wasn’t bad at all. Math content lives inside ordinary text runs, alongside the same formatting machinery OneNote uses for bold and italics. Each math-formatted run carries a small set of properties — a discriminator that says what kind of math construct this is (Fraction, Nary, Brackets, and so on), an argument count, alignment selectors, and up to three u16 values that encode the operator’s primary, secondary, and tertiary characters. The properties are right there in the file, ready to be interpreted.
The hard part was that nothing told me how to interpret them.
So I began to scour the web for anything that might be even remotely related. There was UnicodeMath, which was developed at Microsoft but wasn’t used here. There were some of the later RTF specs which were more helpful but used a wholly different encoding.
This time, there was no sudden breakthrough but a series of small revelations. I first found a blog post by Murray Sargent — the Microsoft engineer who designed both OfficeMath and UnicodeMath. It helped explain some of the weird symbols that the math text runs contained along with some hints about the Unicode-based formatting they seemed to be using.
The next revelation was when I found ITextRange2::GetInlineObject from Windows’ tom.h (Text Object Model). That page contained not only all supported inline object types but also the arguments they supported. With that, I could start to build a table of math object types along with their arguments and meanings and iteratively fill the gaps by composing some truly horrific equations in the OneNote desktop app and then looking at the resulting math data.
I’d build a test equation in OneNote — a triple integral with nested fractions and accents on the variables, a piecewise function with embedded summations, a 3×3 matrix of mixed types — save the file, parse it with my code, and compare the property sets I got against what I knew the equation looked like. Then I’d hand-trace each row. Object type 21 with three arguments, character 0x222D? That’s ∭, the n-ary integral. Object type 13 with two characters, 0x2016 and 0x2016? Norm brackets, double vertical bars on both sides. Each test equation answered three or four questions and raised one or two new ones. The matrix tests, in particular, took several rounds — getting MathInlineObjectCol to come out right for non-square matrices required a notebook full of progressively weirder shapes.
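The kind of lookup table this process fills in can be sketched as follows. The record layout and all names are hypothetical, and the table only contains the two type IDs mentioned above; the real set of properties and types is larger.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MathInlineObject:
    """Hypothetical record for the properties of one math-formatted run."""
    object_type: int                 # discriminator for the math construct
    chars: Tuple[int, ...] = ()      # up to three u16 operator characters
    arg_count: Optional[int] = None  # number of arguments, when known


# Only the two type IDs from the examples above; everything else is left out.
KNOWN_TYPES = {13: "brackets", 21: "n-ary"}


def describe(obj: MathInlineObject) -> str:
    """Return a short human-readable label for a math object."""
    kind = KNOWN_TYPES.get(obj.object_type, f"unknown type {obj.object_type}")
    glyphs = " ".join(chr(c) for c in obj.chars)
    return f"{kind}: {glyphs}" if glyphs else kind


print(describe(MathInlineObject(21, (0x222D,), arg_count=3)))  # n-ary: ∭
print(describe(MathInlineObject(13, (0x2016, 0x2016))))        # brackets: ‖ ‖
```

Printing the unknown types instead of failing on them is what makes the trial-and-error loop workable: each new test equation shows exactly which IDs still need a name.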
The most surprising piece of the puzzle is how that operator tree gets serialized into a flat run of text. Math content doesn’t live in a separate document structure with its own tree — math runs are regular text runs, in regular paragraphs, sharing space with prose. The tree of operators has to live inside the linear text somehow, and OneNote’s solution is to use three Unicode “noncharacter” code points as in-band markers:
- U+FDD0 opens an operator (let’s call it ‹S› for simplicity).
- U+FDEE separates its arguments (‹|›).
- U+FDEF closes it (‹E›).
These code points have no glyph and no defined meaning in Unicode; they exist precisely so applications can repurpose them as private signaling characters. RichEdit — and therefore OneNote — uses them as a parenthesis-like grammar embedded directly in the text stream. A paragraph containing the Pythagorean theorem ends up stored, in part, as a string that looks something like ‹S›a‹|›2‹E›+‹S›b‹|›2‹E›=‹S›c‹|›2‹E›. Each marker is paired by document position with one entry in a parallel array of MathInlineObjects, which is what tells the lexer what kind of operator each ‹S› opens. In rendered form, this of course is \(a^2+b^2=c^2\), just flattened out with separators around each operation.
Once that structure clicked, rendering to MathML was straightforward(ish). A lexer consumes the text linearly, treating each ‹S› as the start of an operator with a known type, each ‹|› as an argument boundary, and each ‹E› as a close. Each operator’s MathInlineObjectType maps to a MathML element — mfrac for Fraction, msup for Superscript, mroot or msqrt for Radical, and so on — and the children of that element are the arguments delimited by ‹|› markers in the source. The result is a tree of MathML elements that any modern browser can render, which is exactly what one2html emits.
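A stripped-down version of such a lexer might look like the Python sketch below. It is an illustration, not the one2html implementation: it assumes one operator name per ‹S› marker in order of appearance (a simplification of the parallel array), knows only two of the operators named above, wraps every argument in an `mrow`, and classifies leaf characters very crudely.

```python
import re
from html import escape
from typing import List

START, SEP, END = "\uFDD0", "\uFDEE", "\uFDEF"

# Only mappings named in the text; Radical and the rest are omitted here.
ELEMENTS = {"Fraction": "mfrac", "Superscript": "msup"}


def leaf(text: str) -> str:
    """Wrap plain text in mn/mi/mo elements (crude classification)."""
    out = []
    for token in re.findall(r"\d+|.", text, re.S):
        tag = "mn" if token[0].isdigit() else "mi" if token.isalpha() else "mo"
        out.append(f"<{tag}>{escape(token)}</{tag}>")
    return "".join(out)


def to_mathml(text: str, op_types: List[str]) -> str:
    """Convert marker-delimited text to MathML (hypothetical helper)."""
    types = iter(op_types)
    pos = 0

    def parse_sequence() -> str:
        # Consume text and nested operators up to the next SEP or END.
        nonlocal pos
        parts, run_start = [], pos
        while pos < len(text) and text[pos] not in (SEP, END):
            if text[pos] == START:
                parts.append(leaf(text[run_start:pos]))
                pos += 1
                parts.append(parse_operator())
                run_start = pos
            else:
                pos += 1
        parts.append(leaf(text[run_start:pos]))
        return "".join(parts)

    def parse_operator() -> str:
        nonlocal pos
        name = next(types, None)
        if name is None:
            raise ValueError("more operators than type entries")
        element = ELEMENTS[name]
        args = []
        while True:
            args.append(f"<mrow>{parse_sequence()}</mrow>")
            if pos >= len(text):
                raise ValueError("unterminated operator")
            marker = text[pos]
            pos += 1
            if marker == END:
                return f"<{element}>{''.join(args)}</{element}>"

    body = parse_sequence()
    if pos != len(text):
        raise ValueError("unbalanced marker")
    return f"<math>{body}</math>"
```

Calling `to_mathml` with the Pythagorean string from above and `["Superscript"] * 3` produces three `msup` elements joined by `<mo>+</mo>` and `<mo>=</mo>`. Nesting falls out of the recursion: an operator inside an argument simply consumes the next entry from the type list.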
It isn’t perfect. Some object types have semantics that aren’t fully described anywhere I could find. There are constructs I don’t support, like no-build-up rendering, where OneNote shows an equation in linear form rather than its built-up form. For common math (the kind that appears in journal entries and study notes, which is what’s actually in my notebooks), the implementation is solid. For exotic constructs there are almost certainly cases where the rendered output doesn’t quite match what OneNote would have shown. I decided that was acceptable. If the web version of OneNote doesn’t have pixel-perfect fidelity to the desktop version (which it doesn’t), I’m willing to accept a little imperfection too.
Surprise Synergy
I’d like to say I shipped math support shortly after the implementation worked. I didn’t. By mid-2022 I had a working-enough math implementation sitting on my hard drive, with the rough edges of any parser that handles a poorly-documented format — some object types unhandled, some rendering quirks I hadn’t tracked down — and I told myself I’d polish it up and merge it soon. Then life happened. “Soon” turned into months, and months turned into years.
What I didn’t realize, while I was failing to ship math, was that other people had quietly started using the parts of the project I had shipped.
In late 2025, almost three years after my last meaningful work on the project, I came across a Joplin issue: a user trying to import a OneNote notebook into Joplin and hitting an error from somewhere deep in onenote.rs. That was surprising for two reasons. The first was that I didn’t know Joplin had OneNote import. The second was that the error came from code I had written.
Joplin, it turned out, had forked onenote.rs and one2html into their monorepo as onenote-converter. They’d done so for entirely sensible reasons. The upstream project had been inactive for years; they needed changes; I wasn’t around to look at the open issues; and they had real users asking for OneNote import. So they took the code, compiled it to WebAssembly so it could run inside Joplin’s Electron app, added support for the OneNote Desktop file format, improved error handling so a single bad page wouldn’t tank an entire import, and, relevantly to this post, were working on implementing inline math themselves.
This is exactly how open source is supposed to work, of course. The whole point of putting code on GitHub under a permissive license is that someone else might pick it up if you don’t. But it is nonetheless humbling when you’ve half-forgotten about a project and then open a stranger’s repository to find it doing useful work you never imagined anyone would pull off. The notebooks I’d been worried about preserving were now being preserved, in a small way, for users I’d never meet, by developers who’d never asked me anything because I had been missing in action.
I introduced myself in the issue thread. Within a day, Henry a.k.a. personalizedrefrigerator — the Joplin contributor who’d actually been maintaining the OneNote subsystem — had opened an issue on my repo to track what would need to flow upstream. They’d been doing the OneNote work on Joplin’s side for a while, so they knew exactly which downstream changes mattered: the OneNote Desktop file format support, the WASM target, the per-page error handling. The exchange was short and entirely cordial, and within a few days we’d agreed on the obvious solution: merge Joplin’s improvements back upstream so future fixes and features could flow in both directions instead of forking further apart. Most of those changes are now landing in onenote.rs, and I finally — three and a half years late — pushed my math implementation across the finish line and shipped it as part of the v1.3.0 release of one2html.
So the math support that exists in onenote.rs and one2html today is the version Joplin nudged me to finish, and the OneNote Desktop support that exists is the version Joplin built while I wasn’t looking. That’s a better outcome than the version of this story where I shipped everything myself in 2022. Joplin got something that helps their users; I got an excuse to revive a project I cared about and the satisfaction of seeing it doing more than I’d ever imagined.
The original goal of all this — backing up my notebooks against the day OneNote disappears — is, by now, slightly redundant. My notebooks are converted, sitting safely in HTML in my backups. But the project has outgrown that very personal goal. Anyone with a OneNote notebook and a reason to leave can now do what I did, either with one2html directly or by importing into Joplin. That’s the version of data preservation I didn’t know I was building toward.
Closing thoughts
Microsoft did not publish [MS-ONE] out of the goodness of their hearts or because they wanted hobbyists to write OneNote parsers in their spare time. In the mid-2000s, under pressure from the EU antitrust case and the broader fight over open document standards, Microsoft begrudgingly committed to publishing the formats underlying its Office products and made them available under the Open Specification Promise. [MS-ONE] and [MS-ONESTORE] are part of that long tail — sitting alongside [MS-DOC], [MS-XLS], and a hundred others.
When I was first trying to decipher how to turn the seemingly random sequence of bytes into ink strokes, I was annoyed at Microsoft. They had gone through all the work of publishing the specifications for the OneNote format. Why leave them incomplete?
The answer, I came to realize, is obvious. OneNote was not written as a single, monolithic piece of software. Why would it be? There was an ink encoding from a team down the hallway (or halfway across the globe), and it would have been more expensive to re-derive it from first principles than to reuse the existing SDK. There was a math equation engine sitting right there in the RichEdit components the application was using anyway. Even more so for the web app: why rebuild all the parsing logic from scratch when there’s a C-something-to-JavaScript transpiler, probably used across half a dozen other products, waiting to be used?
From outside, decades later, the lack of details about ink and math parsing looks like a hole in the documentation, because the documentation was scoped to “the OneNote file format,” and the bytes in question aren’t really the OneNote file format. Which, in a way, was a blessing. As annoying as reverse engineering ink and math rendering was, it would have been all but impossible if the OneNote team had had the time and budget to develop their own version from scratch.
Which is to say that reverse engineering an old file format isn’t really a software problem. It is archeology of sorts. You’re not trying to crack a secret code, you’re trying to reconstruct the world the code was written in with nothing but a bunch of pottery shards and a few fragmented pieces of half-forgotten poetry. The strange part is that once you’ve done it once, you can’t stop seeing the historical layers in everything else — every file format, every protocol, every piece of software older than a few months. Because once you learn to read the shards, history is everywhere you look.
Resources
If you’re picking up where I left off, starting your own OneNote parsing project, or working on any of the adjacent formats, here are the documents and projects that mattered most.
OneNote file format
- [MS-ONESTORE] and [MS-ONE] are Microsoft’s official specifications for the format.
- The OneNote object model diagram from the OneNote Add-In documentation is a useful sanity check on the high-level structure of a notebook. Also look at the JS Add-In API docs for more details on the object hierarchy.
Ink data
- The Ink Serialized Format specification is the document Microsoft does not link from the OneNote spec but should.
Math equations
- Murray Sargent’s Math in Office blog is the best account of how OfficeMath works and the design rationale behind it.
- The ITextRange2::GetInlineObject documentation on Microsoft Learn lists the inline object types and their arguments. The tomConstants enumeration gives the names of all the values.
The missing specs
The two pieces of OneNote’s file format that Microsoft never documented — ink and math equations — are now described in a pair of specs I’ve published in the onenote.rs wiki. They’re written in the same style as [MS-ONE], cover the property sets, encodings, and tree structures discussed above, and are intended to be the documents I wish I’d had when I started. If you find any issues or have further insights, feel free to leave a note.