Parsing an Undocumented File Format

blog.vivekpanyam.com

143 points by vpanyam 2 years ago · 64 comments

danielvaughn 2 years ago

Reminds me of a situation I ran into years ago. I worked at a fintech startup where we were reverse engineering the mobile APIs of retail stock brokerages. Eventually we ran out of brokers in the US and began looking overseas. The first one we looked at was a large broker in Singapore.

Their API responses were in some absolutely insane markup language that I'd never seen before. I actually had to spend a good deal of time reading up on the history of markup languages, carefully going through each one to see if the syntax matched.

Eventually I gave up and just had to write a parser myself. The worst bit was that the attributes didn't use quotation marks around the values. So you'd literally have markup like:

  <something name=Hello world />
It was...fun times.
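
A minimal Python sketch of the guessing a parser like that is forced into (the heuristic that a value runs until the next key= or the end of the tag is an assumption on my part, not what the broker actually did):

  import re

  # Heuristic attribute parser for unquoted values: assume a value runs
  # until the next "key=" or the end of the tag body. Pure guesswork -
  # which is exactly the problem with a format like this.
  def parse_attrs(tag_body):
      parts = re.split(r"(\w+)=", tag_body)  # ['', 'name', 'Hello world ']
      return {key: value.strip()
              for key, value in zip(parts[1::2], parts[2::2])}

  print(parse_attrs("name=Hello world "))  # {'name': 'Hello world'}
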
  • m463 2 years ago

    I'm reminded of the comment in XeePhotoshopLoader.m:

      // At this point, I'd like to take a moment to speak to you about the Adobe PSD format.
      // PSD is not a good format. PSD is not even a bad format. Calling it such would be an
      // insult to other bad formats, such as PCX or JPEG. No, PSD is an abysmal format. Having
      // worked on this code for several weeks now, my hate for PSD has grown to a raging fire
      // that burns with the fierce passion of a million suns.
      // If there are two different ways of doing something, PSD will do both, in different
      // places. It will then make up three more ways no sane human would think of, and do those
      // too. PSD makes inconsistency an art form. Why, for instance, did it suddenly decide
      // that *these* particular chunks should be aligned to four bytes, and that this alignment
      // should *not* be included in the size? Other chunks in other places are either unaligned,
      // or aligned with the alignment included in the size. Here, though, it is not included.
      // Either one of these three behaviours would be fine. A sane format would pick one. PSD,
      // of course, uses all three, and more.
      // Trying to get data out of a PSD file is like trying to find something in the attic of
      // your eccentric old uncle who died in a freak freshwater shark attack on his 58th
      // birthday. That last detail may not be important for the purposes of the simile, but
      // at this point I am spending a lot of time imagining amusing fates for the people
      // responsible for this Rube Goldberg of a file format.
      // Earlier, I tried to get a hold of the latest specs for the PSD file format. To do this,
      // I had to apply to them for permission to apply to them to have them consider sending
      // me this sacred tome. This would have involved faxing them a copy of some document or
      // other, probably signed in blood. I can only imagine that they make this process so
      // difficult because they are intensely ashamed of having created this abomination. I
      // was naturally not gullible enough to go through with this procedure, but if I had done
      // so, I would have printed out every single page of the spec, and set them all on fire.
      // Were it within my power, I would gather every single copy of those specs, and launch
      // them on a spaceship directly into the sun.
      //
      // PSD is not my favourite file format.
    • mschuster91 2 years ago

      > Why, for instance, did it suddenly decide that these particular chunks should be aligned to four bytes, and that this alignment should not be included in the size?

      Probably because, like many other ancient document formats (e.g. MS Office), it was a straight dump of memory structures into a file [1]. Obviously a very bad idea in hindsight (especially given the truckload of deserialization vulns resulting from it), but computers from that age were so memory-constrained that anything else wouldn't cut it, and by the time computers got more powerful the old formats were hopelessly entrenched.

      [1] https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...
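
      Illustration: Python's struct module exposes exactly this difference - native layout ('@') includes the compiler's padding, standard layout ('=') doesn't, which is why a raw memory dump carries invisible alignment bytes:

        import struct

        # A C struct { uint8_t a; uint32_t b; } dumped raw from memory
        # ships the compiler's padding along with the data.
        print(struct.calcsize("@BI"))  # typically 8: 1 byte + 3 pad + 4
        print(struct.calcsize("=BI"))  # 5: no padding, fields packed tight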

      • samus 2 years ago

        Flash and PDF have the same illnesses. A suspicious number of Adobe file formats are both overly complicated and full of features that are far too powerful in combination and impossible to support properly.

    • disruptiveink 2 years ago

      As a side note, I really miss Xee. It was clearly a labour of love and it showed - MacPaw has been a terrible steward ever since Dag handed it over to them.

      No updates other than straight up SDK bumps and recompiles, broken loading of random images in recent macOS/Apple Silicon, they somehow managed to break cropping in one of the two or three updates they did, still an Intel binary. Clearly they haven't tested it more than just checking if the app opens.

      I really wish Dag had just open-sourced Xee3 instead; my opinion of MacPaw plummeted after seeing how they massacred my boy Xee.

      The Archive Browser was equally neglected. At least The Unarchiver still works, which in retrospect was clearly the only app MacPaw wanted to take off Dag's hands.

      • Wevah 2 years ago

        The Unarchiver also still fails, fairly often, to handle multipart RAR files correctly.

    • danielvaughn 2 years ago

      This is hilarious, and not at all surprising. The format has been around longer than most programmers today have been alive, with all the legacy cruft you'd expect from decades of changes.

    • samus 2 years ago

      > I can only imagine that they make this process so difficult because they are intensely ashamed of having created this abomination.

      They also don't want anybody building a dependency on that sh*t, which would prevent them from ever cleaning up the mess.

  • neilv 2 years ago

    This "like standard protocol/format X, but strangely invalid" is a thing I've seen many times.

    I speculate that one of the ways this happens is that someone decides, or is told, to use format Foo. Then they and possibly some collaborators implement both the writer and the reader for their idea of Foo from scratch, never testing with an off-the-shelf standard parser.

    You'd think that doing XML like this is unlikely, given how easily available correct and validating parsers have been. But I've nevertheless seen this with XML too. I speculate that sometimes the programmer is on a platform that doesn't have an easily available off-the-shelf parser/writer, or they simply don't know about it.

    I've also seen a variation of this in half-butted "integrations" built so sales can check off a "we can generate X" feature. These are sometimes tested only lightly, and sometimes not at all (such as when the developers don't have access to the tool that consumes the format and were just working from poor documentation or an example). It's a thing.
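
    The cheap insurance against this failure mode is to round-trip the writer's output through a stock validating parser in a test. A sketch in Python (write_foo here is a hypothetical stand-in for whatever home-grown writer you have):

      import xml.etree.ElementTree as ET

      # Guard against "like XML, but strangely invalid": parse your own
      # writer's output with an off-the-shelf parser before shipping it.
      def test_writer_emits_real_xml(write_foo):
          doc = write_foo({"name": "Hello world"})
          root = ET.fromstring(doc)  # raises ParseError on invalid XML
          assert root.get("name") == "Hello world"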

    • lpapez 2 years ago

      > I speculate that sometimes the programmer is on a platform that doesn't have an easily available off-the-shelf parser/writer, or they simply don't know about it.

      I bet this sounds surreal to people visiting this site, but there really are corporations out there running on software written by people who have never heard of XML. Another example is a "database" implementation I saw at a multi-billion-dollar company, which relied on a hierarchy of directories containing JSON files mimicking the tables and rows of a relational DB.

      The particular product in question had tens of millions of dollars in yearly revenue.

    • tonyedgecombe 2 years ago

      >You'd think that doing XML like this is unlikely, given how easily available correct and validating parsers have been. But I've nevertheless seen this with XML too.

      Guilty.

      Although in my defence it was during the early days of XML and the platform options had their own problems.

  • icsa 2 years ago

    You are not alone. I did a project to upgrade "sort of XML" to standard XML. Your example content gave me flashback shivers.

  • cedws 2 years ago

    I have seen code that produces output like this first-hand. Instead of doing proper serialization, they were using string templating to construct the response and never bothered to validate the output. Laziness and stupidity, basically.
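
    The fix is one line, which makes it all the more galling. A Python rendering of the same mistake:

      import json

      user_input = 'He said "hi"'

      # String templating: breaks as soon as the data contains a quote.
      broken = '{"msg": "%s"}' % user_input   # {"msg": "He said "hi""}

      # Proper serialization: escaping is handled for you.
      correct = json.dumps({"msg": user_input})  # {"msg": "He said \"hi\""}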

    • danielvaughn 2 years ago

      You're very likely correct, which is funny because they turned out to have incredible security. We hacked the APIs of all the US brokers without an issue, but I didn't even make it past the auth stage with this Singapore broker.

      One morning I was working on their login flow - not doing anything crazy, mind you. Just a bit weird; logging in and out, watching the req/res cycles with Charles Proxy. All of a sudden my boss comes over and tells me to stop immediately. Apparently I set off so many alarm bells at the broker that the CTO was woken up (it was 2am where they were). That was a fun gig lol.

    • tacone 2 years ago

      In my former PHP life I saw people looping through objects and constructing a JSON string by hand, instead of making a single, simpler json_encode() call.

malux85 2 years ago

Reverse engineering a file format or protocol is almost a rite of passage for programmers. It is incredibly fun and rewarding - something I'd recommend all medium/senior programmers get into at least once.

A few years ago I was using LiDAR scanners from a manufacturer that didn't provide a Linux driver, only Windows. The way it worked is that you programmed the firmware to fire UDP packets at a specified IP and port, and when the device powered up it would push a continuous stream of data at you: 300,000 points a second.

So I started capturing these UDP packets and decoding them with Python. Eventually I had to write a plugin in C to do the high-performance parsing and bit packing, but nothing beats the feeling when you're stumped on what a bit of data means and then a eureka moment hits you in the shower and the project advances!
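
The capture loop itself is tiny; the work is all in figuring out the payload. A sketch of the Python side (the port and the 12-byte x/y/z float layout are invented for illustration - the real packet format was whatever the firmware defined):

  import socket
  import struct

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.bind(("0.0.0.0", 2368))  # port is an assumption

  while True:
      packet, _addr = sock.recvfrom(65535)
      usable = len(packet) // 12 * 12  # keep whole 12-byte records only
      for x, y, z in struct.iter_unpack("<fff", packet[:usable]):
          pass  # accumulate points; at ~300k/s you batch, or drop to C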

  • ikari_pl 2 years ago

    Some of the worst code of my life - written 20 years ago when I was a teenager, and today posted openly on my GitHub - was a reverse-engineered custom chat-server protocol: I wrote my own client to replace the Java applet, and to have logs.

    The catch is... I didn't have any Internet connection. I was going to an internet cafe, logging onto the chat server, and chatting, while recording the connection with Wireshark.

    At home, I'd print the hex + ASCII connection dump on my dot matrix printer and use a highlighter and ballpoint pen to mark the fields of each message packet.

    Then I'd code something around it, plan new tests, compile a new version of the app and... take it with me on a hard drive to the internet cafe to test the next day, or the next weekend.

    I think I was way smarter and more goal-oriented than I am today.

    • malux85 2 years ago

      I did this too! Having no internet at home, I would take a box of floppy disks to the library every day and save articles on C programming, OpenGL programming, and raw weather satellite images from NOAA - I was trying to re-create the 3D weather fly-over I saw in Jurassic Park when I was 7. I agree - having that disconnection was ultimately good, because it made me think a lot for myself rather than just Google a solution, and it developed my fundamentals a lot.

  • nemothekid 2 years ago

    >is almost a rite of passage for programmers, it is incredibly fun and rewarding,

    Only when you are doing it for yourself, or when it's a known undertaking. It can be very frustrating when you are integrating with some hardware, you are 99% complete, you've told everyone you are ready to ship, and the last 1% turns out to be a surprise protocol reverse-engineering project.

  • sedatk 2 years ago

    That’s what I was gonna say. Reverse engineering was the only way in the '90s, as documentation was scarce. I had to reverse-engineer things if I wanted to understand how they worked.

    Here is an extractor I wrote for Westwood PAK and Lucasarts LFD files when I was 16: https://gist.github.com/ssg/e3e9654612be916336c01e104b10ddc7

    • andrewf 2 years ago

      I picked apart some of the Dark Forces files myself as a kid. The GOB file was pretty obvious - I was familiar with Doom WADs, and it's basically the same. I figured out the graphics formats by setting up VGA 320x200 mode with the contents of a .PAL file and dumping the graphics file into screen memory, then looking at the noise around the familiar patterns to figure out all the run/skip-length stuff that wasn't just pixel values.
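
      For anyone curious what "basically the same" means: a WAD (and a GOB, give or take field sizes) is just a header pointing at a directory of named lumps. A minimal Doom WAD directory reader in Python:

        import struct

        def read_wad_directory(path):
            with open(path, "rb") as f:
                magic, numlumps, diroffs = struct.unpack("<4sii", f.read(12))
                assert magic in (b"IWAD", b"PWAD"), "not a WAD file"
                f.seek(diroffs)
                for _ in range(numlumps):
                    filepos, size, name = struct.unpack("<ii8s", f.read(16))
                    yield name.rstrip(b"\0").decode("ascii"), filepos, size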

      • sedatk 2 years ago

        Fantastic! I love how we’ve been through the same stuff despite being perhaps continents apart.

  • mattpallissard 2 years ago

    Agreed, reverse engineering a job submission and validation protocol that ran over a Unix socket was some of the most fun I ever had on the clock.

    By the time it was all said and done, I had basically re-implemented the software that was listening, in order to build a test suite.

    • inferiorhuman 2 years ago

      Even the less challenging stuff can be fun. After seeing that SimCity-themed Chicago zoning map, I got to thinking about whether it would be possible to do the same for the Bay Area. It turns out the cities I've looked at so far publish all sorts of machine-readable information; the fun part is in finding and exfiltrating it.

vg_head 2 years ago

My experience with parsing undocumented binary formats is with Skia's skp files. Unfortunately they don’t publish any docs regarding the format. Instead I relied on their source code (which is very convoluted), and in the process I discovered two tools that proved pretty useful:

- Kaitai [1], which takes as input a YAML file and can parse binary files based on it (and even generate parsers).

- ImHex [2], which has a pattern language [3] that allows parsing and seems more powerful than what Kaitai offers. I stumbled on some limitations with it, but it was still useful.

[1]: https://kaitai.io/

[2]: https://github.com/WerWolv/ImHex

[3]: https://docs.werwolv.net/pattern-language/

LASR 2 years ago

More than a decade ago, I wrote a parser for Google Sketchup object files as part of a computer graphics assignment. We were not supposed to write parsers, but instead to manually write out a 3D model vertex by vertex and then texture-map it. I instead decided to write a parser and build my model in Sketchup.

It was a TON of fun doing that. And I learned a lot from that exercise.

Fast forward to a few weeks ago: I "wrote" a parser/serializer for handling knowledge graphs as input/output between my app and LLMs.

I used ChatGPT to walk me through the whole thing. It did a very good job of converting between mermaidjs and an object type I defined in TypeScript. It wrote the code, the unit tests - the whole thing.

I don't understand how it works. The code is great. But not as satisfying.
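
For reference, the mermaid-to-object direction really can be small. A rough Python equivalent of what such code presumably does (names invented; only the basic flowchart edge syntax is handled):

  import re

  EDGE = re.compile(r"^\s*(\w+)\s*--(?:\s*(.*?)\s*--)?>\s*(\w+)\s*$")

  def parse_mermaid(source):
      graph = {}
      for line in source.splitlines():
          m = EDGE.match(line)
          if not m:
              continue  # skips "graph TD" and anything unrecognized
          src, label, dst = m.groups()
          graph.setdefault(src, []).append((dst, label))
      return graph

  print(parse_mermaid("graph TD\nA --> B\nB -- uses --> C"))
  # {'A': [('B', None)], 'B': [('C', 'uses')]}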

  • samus 2 years ago

    > We were not supposed to write parsers, but instead manually write out a 3D model vertex by vertex and then texture map them.

    Good grief. It's like writing assembly: a good exercise, but only for trivial or particularly tricky parts of a program. For everything else, proper tooling (a compiler or a 3D modeler) is the way to go. I fully agree that the best learning experience is to build tools to automate away the annoying parts :-D

peteforde 2 years ago

It's very strange to have worked on an undocumented file format parser all night before checking HN before I sleep and seeing this.

In my case, I am trying to unpack MIDI files that have been packaged in a proprietary format by a company called ToonTrack.

They have two product lines: expensive VST instruments and MIDI files designed to be played on those instruments. It's an open secret that you don't need the expensive VST instruments if you're willing to navigate a somewhat tortured folder hierarchy.

Well, for this new instrument, they thought they'd be clever and bundle their MIDI packs so that you have to buy the expensive instrument to play them. Also: no refunds.

You can see where this is going...

kaetemi 2 years ago

Reverse engineering data formats is one of the most fun and satisfying things to do in programming.

  • mdaniel 2 years ago

    Followed, hopefully, by publishing that info so it's not lost to the dustbin of time.

    Where I throw stones at myself is that I have a "perfect is the enemy of good" problem: I don't publish write-ups that are incomplete, because I keep hoping that if I just spend a few more holidays banging on it, I'll figure out why the parser only works most of the time.

  • samus 2 years ago

    Reverse-engineering a Postman backup file was fun as well. There are OpenAPI schemas, but I found many undocumented fields, and it was not at all obvious when default values for settings were set and when not. Probably a mix of glitches from schema evolution and fields that Postman uses internally.

fjfaase 2 years ago

I did some reverse engineering of binary file formats myself in the past, wrote some beginner notes, and collected some links. No idea if any of the links still work.

https://www.iwriteiam.nl/Ha_HTCABFF.html

mabster 2 years ago

Back in the day I had a dual-boot machine; I was using a corner skin of Winamp on the Windows side and missed it on the Linux side.

So I started reverse engineering the Winamp skinning engine, with the intention of making an engine that could run it in one of the Linux media apps. I did this by writing short programs and looking at the generated binary.

I had about 95% of it figured out when we had a robbery and the thief took the laptop I was using. By the time I got a new machine I had completely lost interest!

I wouldn't be surprised if it was using a well-known VM that I just didn't know about at the time!

cjbprime 2 years ago

Once you think you're at parity, it would also be instructive to try to use differential fuzzing with a compiler (where the fuzzer is synthesizing valid programs to compile) to look for inconsistencies.
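
The same idea works for file-format parity: throw generated inputs at both the reference implementation and yours and flag any divergence. A toy Python loop (parse_reference and parse_mine are hypothetical stand-ins):

  import random

  def fuzz_once(parse_reference, parse_mine, size=64):
      data = bytes(random.randrange(256) for _ in range(size))
      try:
          expected = parse_reference(data)
      except Exception:
          return  # the reference rejected it; not an interesting case
      assert parse_mine(data) == expected, f"divergence on {data.hex()}"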

KolmogorovComp 2 years ago

Whenever I see stories like that I always wonder if anyone has succeeded at parsing an undocumented file format that included custom compression scheme.

Parsing a binary file is tedious, but at least you can progress steadily; with an unknown compression scheme, you can never be sure you've even decompressed correctly before you even try to decode the format.

Fortunately this is mostly a theoretical problem. There are very few cases where custom compression would be more efficient than slapping .zip/.zstd/.tar on the format if it ever gets too big.

  • jvdvegt 2 years ago

    There was a story on HN last week about reverse engineering high-speed Broadcom network card firmware. That included a custom compression scheme, if I remember correctly.

    This is the post: https://news.ycombinator.com/item?id=38772862

    • KolmogorovComp 2 years ago

      Thanks for the link! Impressive work indeed. Relevant snippets:

      > but had no idea as to how the image was compressed. It clearly wasn't compressed with any common compression algorithm. Mercifully unlike the MIPS firmware, it had at least a few strings, which is how I was able to tell it was compressed; a hex dump showed chunks of human-readable text with garbage interrupting them.

      > A hunch. After extensive amounts of time trying and failing to eyeball the compression algorithm from hexdumps of compressed code, and trying any decompression algorithm I could think of against it,

      But they eventually could break through by reverse engineering the decompression code.

      > Once I finally had a concise, sane description of the decompression algorithm in C, the algorithm turned out to be hilariously simple. I was also then able to figure out the origins of the compression algorithm; it's called LZSS
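
      LZSS really is hilariously simple once you see it. A minimal decompressor sketch in Python - flag polarity, bit widths, and the 12-bit distance / 4-bit length split below are assumptions, since every LZSS variant picks its own:

        def lzss_decompress(data):
            out = bytearray()
            it = iter(data)
            try:
                while True:
                    flags = next(it)
                    for bit in range(8):
                        if flags & (1 << bit):       # literal byte
                            out.append(next(it))
                        else:                        # (distance, length) pair
                            b1, b2 = next(it), next(it)
                            dist = ((b2 & 0xF0) << 4) | b1
                            length = (b2 & 0x0F) + 3
                            for _ in range(length):
                                out.append(out[-dist])  # overlap-safe copy
            except StopIteration:
                pass
            return bytes(out)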

Dwedit 2 years ago

Just watch out for the encrypted file formats. You need a debugger to figure out what they're doing.

  • lifthrasiir 2 years ago

    They are relatively easier to detect, though. Compressed file formats or virtual machine code in disguise tend to be more annoying, especially since they can look somewhat structured and waste your time.

    • Dwedit 2 years ago

      Meanwhile, for zlib-compressed files, just look for the 'x' after the headers.
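
      That works because a zlib stream's first byte (CMF) is almost always 0x78, which is ASCII 'x'. Scanning a blob for candidate streams is a few lines of Python:

        import zlib

        def find_zlib_streams(blob):
            for i in range(len(blob) - 1):
                if blob[i] == 0x78 and blob[i + 1] in (0x01, 0x5E, 0x9C, 0xDA):
                    d = zlib.decompressobj()  # tolerates trailing data
                    try:
                        yield i, d.decompress(blob[i:])
                    except zlib.error:
                        pass  # false positive; keep scanning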

SuperNinKenDo 2 years ago

This is something I've been interested in for a while.

I've collected a few links people have already posted to their own projects or write-ups here and elsewhere, but is there any single excellent resource for learning how to do this?

I've a number of dead and/or proprietary formats that I've always wanted to crack open, but I'm totally overwhelmed with where to start.

  • lifthrasiir 2 years ago

    While I don't have any handy links, I have reverse-engineered several file formats without any documentation, and I can offer a few pointers.

    First, make sure that you know what the format is actually supposed to encode. For example, if a file weighs only (say) 40 KB, it is unlikely to be a raster image. The file name, if any, helps a lot to narrow the scope.

    Second, you should have some understanding of similar file formats. I generally recommend studying PNG first, because it is a good example of both typical structured file formats and raster image formats. (Don't delve into the compression though - bitwise analysis is much harder.) This is also why you need to know what the format is for: formats with the same goal tend to have similar structures.

    Third, collect as many examples as possible. You can line them up to see commonalities and differences and spot patterns. It is even better if you can actively generate different files. This is generally the last hope when you have run out of reasonable hypotheses.

    Fourth, optimize the feedback loop. You will have to do a lot of hypothesizing, validating, and automating. You can't really reduce the number of iterations, but you can reduce the time a single iteration takes. Use a comfortable scripting language with good binary-handling support. I tend to use vanilla Python with struct and build everything else on my own, but there are several libraries that help greatly if you don't feel like doing that.
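
    To make the third and fourth points concrete, here is the kind of throwaway Python I mean (the struct layout at the end is invented for illustration):

      import struct
      from itertools import zip_longest

      # Point three: line up sample files and print the offsets where
      # they differ - those columns are usually the interesting fields.
      def diff_offsets(samples):
          for off, column in enumerate(zip_longest(*samples)):
              if len(set(column)) > 1:
                  print(f"{off:#06x}: " + " ".join(
                      "--" if b is None else f"{b:02x}" for b in column))

      # Point four: once you have a hypothesis, encode it with struct.
      magic, version, count = struct.unpack_from(
          "<4sHI", open("sample.bin", "rb").read())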

  • signaru 2 years ago

    I have reverse-engineered some ASCII file formats. While probably overkill, my background in parsing simple programming languages (for which there are many good educational resources) was really helpful for the approach I use: I tokenize, try to figure out syntax structures from the order of token types, and from there extract the information I need into my program's data representation. I'm not sure whether this is the approach everyone else uses, but it seems natural for someone with a CS/PL-implementation background.

    But first, it helps to have sample files so you can see recurring structures. Ideally, you also have access to the software that generates these files. That lets you work with simpler files containing less information to reason about, and lets you make small changes within the program and compare the corresponding changes in the file.
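
    In that spirit, a minimal regex tokenizer sketch in Python (the token classes are invented; adjust them to whatever the format actually uses):

      import re

      TOKEN = re.compile(r"""
          (?P<NUMBER> -?\d+(?:\.\d+)?)
        | (?P<IDENT>  [A-Za-z_][A-Za-z0-9_]*)
        | (?P<STRING> "[^"]*")
        | (?P<PUNCT>  [{}()\[\],;=])
        | (?P<WS>     \s+)
      """, re.VERBOSE)

      def tokenize(text):
          for m in TOKEN.finditer(text):
              if m.lastgroup != "WS":
                  yield m.lastgroup, m.group()

      print(list(tokenize('record = { id 42, name "probe" }')))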

darekkay 2 years ago

Since everyone is sharing their stories: I once reverse-engineered the Foobar2000 index file format to sync my playlists with Plex.

https://darekkay.com/blog/foobar2000-playlist-index-format/

true_pk 2 years ago

Great job! Good to know I wasn’t alone in parsing weird binary formats. Back then, I created this helper module in Python to make my days a bit more bearable: https://github.com/sergeyshilin/byter

zubairq 2 years ago

Brings back memories of a similar task I have: importing MS Access files (the MS Access file format is undocumented).

I am trying to parse MS Access files using NodeJS/JavaScript. I last tried about 3 years ago and it was really tough going, with a lot of trial and error. I am able to parse some basic MS Access files, but I need to figure out a way to read a whole database more reliably. My effort is here:

https://github.com/yazz/noaccess

  • 0xDEADFED5 2 years ago

    oh god, that sounds like a painful thing to try to do. there's an Access redistributable at https://www.microsoft.com/en-us/download/details.aspx?id=549...

    i've never attempted JS interop, so i don't know how much it sucks, but it definitely seems doable. it'll likely be an adventure on its own, but it's gotta be 1000x less fuckery than trying to reverse the binary format of Access

    the saner option would probably just be a small .Net program that creates an endpoint for your JS

    unless i'm assuming too much, and you're just doing it for sheer masochistic pleasure, and in that case: i salute you

    • zubairq 2 years ago

      Yes, it is extremely painful. Thanks for the link, that may come in useful. But in general, the reason I didn't use the Access redistributable is that I am building a Low Code system that needs to import Access database files and work on Linux, Windows, or Mac.

pabs3 2 years ago

A wiki about file formats:

http://fileformats.archiveteam.org/

pabs3 2 years ago

Back in the day I did some reverse engineering on the files inside Microsoft Windows CHM documentation files, after some other folks did the container format.

http://chmspec.nongnu.org/latest/

t0suj4 2 years ago

The best binary parser I've used was in the MP3 chapter of the Practical Common Lisp book. I was able to use it to reverse-engineer several custom formats used to stitch files together, including a FoxPro DB format I had no idea existed.

SubiculumCode 2 years ago

I have some 3D binary image files from a long-lost proprietary application and was trying to recover the data, but because I have little experience, I ran into a wall.

I was wondering whether ChatGPT could be used to read in the byte sequences and, umm, do something.

predictsoft 2 years ago

I got £2000 for reverse engineering the Psion 5 Word and Sheet file formats (for a Psion-based company) in 1998. I tried to get a job with the same company, but they didn't want me, which is very sad.

arch_rust 2 years ago

Cool seeing https://github.com/sharksforarms/deku in the wild

msla 2 years ago

Doing this with text files used to be a big part of my job. I had to write Java (well, mainly Java... cf. the ingredients of scumble on Discworld) programs to parse the files school districts used to describe bus stops: depot to school in the AM, school to depot in the PM, among other arrangements. Often, the best you could say about a file was that someone had likely worked fairly hard to make it look like it was software-generated. The files had just enough structure that parsing them with a program was the correct option, but enough irregularities that the program was never going to be pretty, because there's no pretty way to parse an ugly file.

It's a wonderful example of inductive reasoning: deriving general rules from a collection of specific examples.
