Reverse engineering the Notability file format

userbinator · on April 1, 2018

To summarise, it's a .zip file containing a binary serialisation of XML, which then contains the base64'd raw coordinates...

Reminds me of this (it gets even better in the comments): https://thedailywtf.com/articles/XXL-XML

Nothing in here was really complicated – it was just some existing standard formats (zip! apple plist! an array of floats!) combined together in a pretty simple way.

It's rather telling that wrapping points in 3 layers of encodings is considered "pretty simple", almost the norm today, when years ago doing such a thing would probably elicit reactions more towards WTF.

derefr · on April 1, 2018

• Very likely the files are only .zip files for network portability, and when in their native environment (iOS?) they’re unzipped document bundles (see e.g. the RTFd format)

• Likely this file isn’t literally a PList, but the output of a data storage framework (e.g. CoreData) that uses PLists as one possible backend. In the code, it just looks like adding annotations to native data structures to make them “storable” within a database-like object.

• Only the text-formatted PLists contain base64’ed binary data. The binary-formatted PLists just contain length-prefixed binary data. There isn’t the “binary in base64 in XML in binary” that you’re imagining.

jvns · on April 1, 2018

thanks for this!! I've never worked with iOS so I was definitely fumbling in the dark. the CoreData hypothesis explains why the PList file looks like a huge weird unstructured array! So in the code there's just some object with all the data in it and then CoreData serializes that object?

This explains the feeling throughout of "this doesn't feel like a format the app developer designed".

theamk · on April 1, 2018

It's just the opposite -- years ago, it was much, much worse.

I remember trying to reverse engineer older Symbian notes format -- it was a custom datastore format containing various binary "chunks", each with semi-custom compression. I had to recombine multiple chunks just to get a formatted text, and I still had occasional failure.

Or look at other vector format -- like CGM, first made in 1987. It has its own serialization mechanism, it's own model (file->body->figure->segment), and 3 compression types specifically designed for it. Look at http://standards.iso.org/ittf/PubliclyAvailableStandards/c03... -- do you want this today? I'd that zip of plist of float anytime over that.

wpietri · on April 1, 2018

Julia Evans is the best. I've been following her on Twitter for years because she brings so much enthusiasm about learning technology that she makes me see familiar things anew. By the end of an exploration, even it it's something I know well, I end up saying, "Holy shit, strace is cool."

To get a feel for her work, a bunch of her posts are here: https://jvns.ca/categories/favorite/

Anderkent · on April 1, 2018

Reminds me of when I had to reverse-engineer the configuration and language format for an audiobook reader - I think it was the Milestone 312 [1]. We bought it for our granpa who is hard of seeing; but installing the polish language pack would stop the device from working correctly.

In the end I had to compare a working english pack and the disfunctional one; the actual issue was something pretty simple - the names of the properties in their ini file were bad for some languages, and some checksums were wrong, or something like that. But it made for interesting black-box experience; there's of course no feedback about what went wrong in the device, nor any accessible logs :P

https://bones.ch/milestone312.php

UncleEntity · on April 1, 2018

Totally randomly I came across a paper that directly addresses this, The Next 700 Data Description Languages[0].

  Today, many programmers tackle the challenge of ad hoc data by writing scripts in a language
  like PERL. Unfortunately, this process is slow, tedious, and unreliable. Error checking
  and recovery in these scripts is often minimal or nonexistent because when present,
  such error code swamps the main-line computation. The program itself is often unreadable
  by anyone other than the original authors (and usually not even them in a month or two)
  and consequently cannot stand as documentation for the format. Processing code often
  ends up intertwined with parsing code, making it difficult to reuse the parsing code for different
  analyses. Hence, in general, software produced in this way is not the high-quality,
  reliable, efficient and maintainable code one should demand.

[0] https://www.cs.princeton.edu/~dpw/papers/ddc-jacm.pdf (pdf)

acemarke · on April 1, 2018

_Way_ back in the day, I built a PocketPC HTML editor. It used a plain textbox control. Meanwhile, a lot of people were building apps and had expressed interest in having some kind of a rich text edit control, similar to the RichEdit control available as part of the desktop Win32 API. Microsoft's PocketWord app used a rich text control, but it wasn't actually part of the mobile Win32 API, or documented at all.

After finding a couple hints on what WndProc messages the control would respond to, I spent a few weeks reverse-engineering it, and wrote up my findings here: https://groups.google.com/d/msg/microsoft.public.dotnet.fram... .

I never did wind up building anything useful with that information, but it was a pretty fun exercise.

ggambetta · on April 1, 2018

Great fun :) Good insight into the process of reverse-engineering and re-generating. I'm working (on and off) on something similar, to convert Kdenlive project files to Final Cut Pro, so that I can collaborate with people who don't use Linux. Both formats are XML, but figuring out the semantics can be tricky!

gravypod · on April 1, 2018

Does anyone here know of a native linux drawing toolkit? Something like the apple pencil? I've wanted a portable not taking device and all of the native Linux not taking software is not up to spec with OSX/Windows counterparts.

It's kind of a tangent but it's somewhat related.

killjoywashere · on April 1, 2018

There's a developer at SUSE who has done a lot of work on tablet mode, fwiw.

nchelluri · on April 1, 2018

I kind of want to see that ACID comic now.

jvns · on April 1, 2018

https://drawings.jvns.ca/acid/

anomie31 · on April 1, 2018

Nice write-up.