On-demand JSON: A better way to parse documents?

onlinelibrary.wiley.com

166 points by warpech 2 years ago · 54 comments

kristianp 2 years ago

So they're creating a DOM-like API in front of a SAX-style parser and getting faster results (barring FPGA and GPU research). It's released as part of simdjson.

I wonder if that kind of front end was done in the age of SAX parsers?

Such a well-written paper.
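
For reference, the On-Demand front end in simdjson ends up looking roughly like this (a sketch adapted from memory of the simdjson quickstart; the file and field names are only illustrative):

  #include <iostream>
  #include "simdjson.h"
  using namespace simdjson;

  int main() {
    ondemand::parser parser;
    // Load the whole file into a padded buffer; no values are decoded yet.
    padded_string json = padded_string::load("twitter.json");
    ondemand::document doc = parser.iterate(json);
    // Only the fields we actually touch are located and decoded.
    std::cout << uint64_t(doc["search_metadata"]["count"]) << " results\n";
  }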

  • twic 2 years ago

    SAX is a push parser, presumably this is on top of a pull API like StAX.

    The Jakarta JSON streaming API sort of gets at this:

    https://jakarta.ee/specifications/platform/9/apidocs/jakarta...

    The basic interface to a JSON document is something like an iterator, which lets you advance through the document, token by token, and read out values when you encounter them. So if you have an array of objects with x and y fields, you read a start of array, start of object, key "x", first x value, key "y", first y value, end of object, start of object, key "x", second x value, key "y", second y value, end of object, etc. Reading tokens, not anything tree/DOM-like. But there are also methods getObject() and getArray(), which pull a whole structure out of the document from wherever the iterator has got to. So you could read start of array, read object, read object, etc. That lets you process a document incrementally, without having to materialise the whole thing as a tree, but still having a nice tree-like interface at the leaves.

    In principle, you could implement getObject() and getArray() in a way which does not eagerly materialise their contents - each node could know a range in a backing buffer, and parse contents on demand. But I don't think implementations actually do this.

    Wrapping a tree-like interface round incremental parsing that doesn't require eager parsing or retaining arbitrary amounts of data, and doesn't leak implementation details, sounds astoundingly hard, perhaps even impossible. But then I am not Daniel Lemire. And I have not read the paper.
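
    For a concrete sketch of the x/y example above, simdjson's On-Demand interface has roughly this shape: a forward-only cursor underneath, with a tree-like view at each leaf (written from memory of the simdjson docs, so details may differ by version):

      #include <iostream>
      #include "simdjson.h"
      using namespace simdjson;

      int main() {
        // An array of objects with x and y fields, as described above.
        padded_string json = R"([{"x":1.0,"y":2.0},{"x":3.0,"y":4.0}])"_padded;
        ondemand::parser parser;
        ondemand::document doc = parser.iterate(json);
        double sum = 0;
        // The loop advances the cursor object by object; nothing is
        // materialised eagerly, and fields are decoded only when asked for.
        for (ondemand::object obj : doc) {
          double x = obj["x"];
          double y = obj["y"];
          sum += x + y;
        }
        std::cout << sum << "\n";
      }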

    • dmurray 2 years ago

      > Wrapping a tree-like interface round incremental parsing that doesn't require eager parsing or retaining arbitrary amounts of data, and doesn't leak implementation details, sounds astoundingly hard, perhaps even impossible.

      I don't think they promise this, and I suspect this fails to parse some pathological but correct JSON files, e.g. one that starts with 50 GB of `[`s.

  • phh 2 years ago

    > I wonder if that kind of front end was done in the age of SAX parsers?

    I thought that XPath over SAX was a thing, and that XSLT was doing SAX-like parsing, but it turns out I'm wrong. Which is logical, considering XPath can refer to previous nodes. That being said, it looks like there is streamable XSLT in XSLT 3.0, but that seems more niche.

    • riedel 2 years ago

      I did some work on automata for parsing, transformation and compression in my PhD. I think that XPath is the major failure in XML standardization, with XSLT building on it. If we had a stricter language, we could easily compile much of the XML stuff and do binary XML much more extensively.

    • jbverschoor 2 years ago

      Often a combination of SAX and DOM is useful. You get many GBs of SAX stream, but it usually contains the same kind of documents. Building a DOM once you reach a specific token means fast processing, but still the ease of use of a DOM.

eternityforest 2 years ago

Why not just use msgpack? The advantage of JSON is that support is already built into everything and you don't have to think about it.

If you start having to actually make an effort to fuss with it, then why not consider other formats?

This does have nice backwards compatibility with existing JSON stuff though, and sticking to standards is cool. But msgpack is also pretty nice.

  • benatkin 2 years ago

    This seems to be geared towards using a heavily adopted format.

    Some would want to move to binary, but it's hard to find an ideal universal binary format.

    msgpack doesn't support a binary chunk bigger than 4gb, which is unfortunate. Also the JavaScript library doesn't handle Map vs plain object.

    In JSON you could have a 10GB Base64 blob, such as a video, in a string, no problem (from the format side, with a library YMMV).

    For one that supports up to 64-bit lengths, check out CBOR: https://cbor.io/ With libraries maybe it could be the ideal universal binary format (universal in the same sense JSON is - I've heard it called that). https://www.infoworld.com/article/3222851/what-is-json-a-bet...

    • klabb3 2 years ago

      > msgpack doesn't support a binary chunk bigger than 4gb, which is unfortunate.

      I don’t care what behemoths people store in the formats they use but at the point you exceed “message size” the universality of any format is given up on. (Unless your format is designed to act as a database, like say a SQLite file.)

      > In JSON you could have a 10GB Base64 blob, such as a video, in a string, no problem

      Almost every stdlib json parser would choke on that, for good reason. Once you start adding partiality to a format, you get into tradeoffs with no obvious answers. Streaming giant arrays of objects? Scanning for keys in a map? How to validate duplicate keys without reading the whole file? Heck, just validating generally is now a deferred operation. Point is, it opens up a can of worms, where people argue endlessly about which use-cases are important, and meanwhile interop goes down the drain.

      By all means, the stateful streaming / scanning space is both interesting and underserved. God knows we need something. Go build one, perhaps json can be used internally even. But cramming it all inside of json (or any message format) and expecting others to play along is a recipe for (another) fragmentation shitshow, imo.

  • serial_dev 2 years ago

    Example: you work on the mobile team, the backend team is large and focuses on serving the web app, they send huge JSON payloads that the mobile app only partially needs, and asking the backend team to now also serve msgpack is out of the question, as coordinating with the backend and web teams has proven to be a PITA.

    In this scenario, writing a new JSON library, or bundling someone else's, can significantly improve things.

    • lpapez 2 years ago

      Sounds like a dysfunctional organization to be honest. Why couldn't you agree on the contents of the mobile API? If the backend team is slow, why can't the mobile team implement a proxy towards the backend to serve only the data they need?

      • serial_dev 2 years ago

        I totally agree, it was a dysfunctional org, but sometimes fixing the org is extremely hard and you want to make progress today focusing on what you can do to improve the user's experience, instead of focusing on what others should change and do.

    • OnlyMortal 2 years ago

      I worked at TomTom on the Home application. There was a similar problem with map updates where the payload was XML.

      The devices only needed a sub-range of the XML, so I used an XML parser to ignore everything until I got the tag I needed, then read until the end tag arrived.

      This avoided a DOM and the huge amount of memory needed to hold that. It was also significantly faster.
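
      The general shape of that trick with a streaming parser such as Expat looks something like the sketch below (written from memory; the tag name and the tiny inline document are made up for illustration):

        #include <cstring>
        #include <iostream>
        #include <string>
        #include <expat.h>

        struct State { bool inside = false; std::string text; };

        // Flip a flag when the interesting tag opens, clear it when it closes.
        static void XMLCALL on_start(void *ud, const XML_Char *name, const XML_Char **) {
          if (std::strcmp(name, "MapUpdate") == 0) static_cast<State *>(ud)->inside = true;
        }
        static void XMLCALL on_end(void *ud, const XML_Char *name) {
          if (std::strcmp(name, "MapUpdate") == 0) static_cast<State *>(ud)->inside = false;
        }
        // Collect character data only while inside the tag we care about.
        static void XMLCALL on_text(void *ud, const XML_Char *chars, int len) {
          State *s = static_cast<State *>(ud);
          if (s->inside) s->text.append(chars, len);
        }

        int main() {
          State state;
          XML_Parser p = XML_ParserCreate(nullptr);
          XML_SetUserData(p, &state);
          XML_SetElementHandler(p, on_start, on_end);
          XML_SetCharacterDataHandler(p, on_text);
          const char *doc = "<Root><Junk>skip</Junk><MapUpdate>keep this</MapUpdate></Root>";
          XML_Parse(p, doc, (int)std::strlen(doc), 1 /* final chunk */);
          XML_ParserFree(p);
          std::cout << state.text << "\n";  // prints "keep this"
        }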

    • secondcoming 2 years ago

      That’s an issue with your teams rather than the message format.

      Even sending data that the mobile app doesn’t need raises flags.

wruza 2 years ago

Alternatively, jsonl/ndjson. The largest parts of JSON documents are usually arrays, not dictionaries. So you can e.g.:

  {<a header about foos and bars>}
  {<foo 1>}
  ...
  {<foo N>}
  {<bar 1>}
  ...
  {<bar N>}
It is compatible with streaming, database JSON columns, and code editors.

xiphias2 2 years ago

I don't really understand what's new here compared to what SIMDJSON supported already.

Anyways, it's the best JSON parser I've found (in any language); I implemented fastgron (https://github.com/adamritter/fastgron) on top of it because of the on-demand library's performance.

One problem with the library was that it needed extra padding at the end of the JSON, so it didn't support streaming / memory mapping.
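
For what it's worth, one workaround is to copy the mapped bytes into simdjson's padded buffer type, which costs one extra copy but restores the padding guarantee (a rough sketch; the "id" field is just illustrative):

  #include "simdjson.h"
  using namespace simdjson;

  // `data`/`length` might come from mmap() or a network buffer.
  int64_t read_id(const char *data, size_t length) {
    // padded_string copies the input and appends the spare bytes the parser
    // requires at the end, so a bare memory-mapped region can't be used as-is.
    padded_string json(data, length);
    ondemand::parser parser;
    ondemand::document doc = parser.iterate(json);
    int64_t id = doc["id"];
    return id;
  }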

  • TkTech 2 years ago

    This on-demand model has been implemented in simdjson for a while. This is just the release of the paper.

    Previously, simdjson only had a DOM model, where the entire document was parsed in one shot.

  • asa400 2 years ago

    Nice work! I will have to check out your implementation and see if I can borrow any of your optimization ideas. I built jindex (https://github.com/ckampfe/jindex) because I also wanted a faster gron!

bawolff 2 years ago

Is this different from what everyone was doing with XML back in the day?

  • da_chicken 2 years ago

    JSON has gotten a lot of optimization that XML never did, which I think says more about general interest in XML than anything. Even today, my experience is that XML processing varies wildly from "perfectly reasonable" to "maybe I can just do this with regex instead", even with widely used parsers.

    Also, XML has a number of features to care about, like attributes as well as elements, and potentially schemas. It's also needlessly verbose. Even though elements open and close in a stack, there isn't a universal "close" tag. That is, if `<Tag1><Tag2></Tag1></Tag2>` is always considered malformed, then why isn't the syntax simply `<Tag1><Tag2></></>`?

    • bemusedthrow75 2 years ago

      > That is, if `<Tag1><Tag2></Tag1></Tag2>` is always considered malformed, then why isn't the syntax simply `<Tag1><Tag2></></>`?

      XML isn't just a structured data format where close tags always run up against each other and whitespace is insignificant. It's also a descriptive document format which is often hand-authored.

      I think the argument is that the close tags being named makes those documents easier for a human author to understand. It certainly is my experience.

      • da_chicken 2 years ago

        No, I think that's true only in theory. It's only true in the hypothetical case where you're literally hand-authoring the markup and no tools exist yet.

        In the vast majority of cases, XML, like YAML or JSON, is machine written and machine parsed. Further, there's an almost unlimited number of tools available for manipulation. That's why nobody makes markup languages like SGML anymore unless they have to.

        Heck, there's LaTeX, a document markup language which HN users themselves seem to insist is incredibly easy to write by hand in a text editor, and that doesn't have verbose closing tags. Never mind programming languages, etc.

        No, SGML and its descendants are weird in their insistence that structures must be as verbose as possible.

      • anonyme-honteux 2 years ago

        Nobody has trouble reading a declarative DSL like

        Tag1 { Tag2 { } }

        • bemusedthrow75 2 years ago

          We literally have editors that colour bracket pairs to make this stuff easier to deal with, though.

pkulak 2 years ago

This is a real “why didn’t I think of that” moment for sure. So many systems I've written have profiled with most of the CPU time and allocations in the JSON parser, when all they need is a few fields. But rewriting it all SAX-style is just not worth the trouble.

jensneuse 2 years ago

Sounds similar to a technique we're using to dynamically aggregate and transform JSON. We call this package "astjson" as we're doing operations like "walking" through the JSON or "merging" fields at the AST level. We wrote about the topic and how it helped us to improve the performance of our API gateway written in Go, which makes heavy use of JSON aggregations: https://wundergraph.com/blog/astjson_high_performance_json_t...

hwestiii 2 years ago

On the face of it, this sounds kind of like the XML::Twig Perl module.

SushiHippie 2 years ago

Related submission from yesterday:

https://news.ycombinator.com/item?id=39319746 - JSON Parsing: Intel Sapphire Rapids versus AMD Zen 4 - 40 points and 10 comments

pshirshov 2 years ago

I solved this problem with a custom indexed format: https://github.com/7mind/sick

Waterluvian 2 years ago

> The JSON specification has six structural characters (‘[’, ‘{’, ‘]’, ‘}’, ‘:’, ‘,’) to delimit the location and structure of objects and arrays.

Wouldn’t a quote “ also be a structural character? It doesn’t actually represent data, it just delimits the beginning and end of a string.

I get why I'm probably wrong: a string isn't a structure of chars, because that's not a type in JSON. The above six are the pieces of the two collections in JSON.

jesprenj 2 years ago

Relevant: LEJP - libwebsockets json parser.

You specify what you're interested in and then the parser calls your callback whenever it reads the part of a large JSON stream that has your key.

https://libwebsockets.org/lws-api-doc-main/html/md_READMEs_R...

skibz 2 years ago

Pretty cool!

This reminds me of oboe.js: https://github.com/jimhigson/oboe.js

basil-rash 2 years ago

> The JSON syntax is nearly a strict subset of the popular programming language JavaScript.

What JSON isn’t valid JS?

  • sp332 2 years ago

    "Any JSON text is a valid JavaScript expression, but only after the JSON superset revision. Before the revision, U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR are allowed in string literals and property keys in JSON; but the same use in JavaScript string literals is a SyntaxError."

    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

    "In fact, since JavaScript does not support bare objects, the simple statement {"k":"v"} will emit an error in JavaScript"

    https://medium.com/@ExplosionPills/json-is-not-javascript-5d...

    • bawolff 2 years ago

      > "In fact, since JavaScript does not support bare objects, the simple statement {"k":"v"} will emit an error in JavaScript"

      This is kind of a silly "well, technically". It's a valid expression; it's not a valid statement. It is valid JavaScript in the sense most people mean when they ask whether something is valid JavaScript.

    • turnsout 2 years ago

      Eh, this is slightly dated—{"k":"v"} does work in the WebKit and Blink consoles, and the superset proposal was approved, so those separators should work fine too.

      • pitaj 2 years ago

        The console evaluates what is passed to it as an expression, which is a different context from how scripts or modules are evaluated.

  • zerocrates 2 years ago

    The one thing I've seen mentioned before is the use of "__proto__" as an object property key. Though it's valid syntax in both JSON and JS like any other string key, it somewhat uniquely does something different if interpreted as JS (setting the created object's prototype) than it does if interpreted as JSON.

    • basil-rash 2 years ago

      That's fair, though somewhat benign barring a prototype pollution vulnerability. The object still behaves the same as it would had you JSON.parse'd the same string (Object.getPrototypeOf aside).

      • zerocrates 2 years ago

        One simple issue would be if your object looks like

        x = {"__proto__": {"foo": "bar"}}

        now x.foo is "bar" if that's JS code, but undefined if you JSON.parse that same object definition from a string.

fanseepawnts 2 years ago

Sorry, I would never use this. Before I consume any JSON from any source or for any purpose, I validate it. Lazy loading serves no purpose if you need validation.

Hint: you need validation.

  • andix 2 years ago

    If you already know it's validated and coming from a trusted source, there is no reason to validate it again. For example, JSON from a database that only allows inserting valid JSON. In such cases, even the structure might be known and some assumptions can safely be made.

    • fanseepawnts 2 years ago

      Sorry, there's no such thing as pre-validated JSON. You can do it in a sidecar all you want.

      In-process validation is required. There are no trusted sources. You're confusing valid JSON with JSON that is valid according to a schema for a specific purpose.

      Lazy-loading JSON parsers have no need to exist, at all, ever. This is why they don't exist.

  • Seb-C 2 years ago

    There are cases where the JSON does not come from user input and can be trusted without a validation layer.

    Also you may want to stream-validate it.

  • ysleepy 2 years ago

    You need a parser for validation, preferably a fast one, possibly even a streaming one.

    • samatman 2 years ago

      Which this is not.

      A validating parser, that is. The paper clearly indicates that invalid JSON like [1, 1b] will pass, unless your code happens to try and decode the 1b.

  • creatonez 2 years ago

    The purpose seems to be for parsing a JSON document where large parts of it are irrelevant to your use case. It seems cursed, but it seems that with this method you can leave leeway for unparsed parts and still validate what you are actually using, if the parser is robust enough.

  • jesprenj 2 years ago

    You don't need to load the entire JSON object as a DOM into RAM just to validate it. Validation can easily be done using a stack and iteration, with space complexity being the depth of the JSON (the stack) and time linear in the length of the object.
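
    As a toy sketch, here is just the bracket-matching part of that idea; it deliberately ignores commas, colons, numbers and literals, so it is far weaker than real validation, but the stack never grows beyond the nesting depth:

      #include <stack>
      #include <string_view>

      bool brackets_balanced(std::string_view json) {
        std::stack<char> open;                    // holds at most `depth` characters
        for (size_t i = 0; i < json.size(); ++i) {
          char c = json[i];
          if (c == '"') {                         // skip over string contents
            for (++i; i < json.size() && json[i] != '"'; ++i) {
              if (json[i] == '\\') ++i;           // skip the escaped character
            }
          } else if (c == '{' || c == '[') {
            open.push(c);
          } else if (c == '}' || c == ']') {
            char expected = (c == '}') ? '{' : '[';
            if (open.empty() || open.top() != expected) return false;
            open.pop();
          }
        }
        return open.empty();
      }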

  • beached_whale 2 years ago

    simdjson is validating as it moves through the file.
