More Top-Down Parsing: JSON
I’ll replace this status report with an article about topdown parsing of JSON at a later time. If you’re looking for a production-quality JSON library, get simplejson. If performance is critical, try python-cjson.
While the main motivation for my work on topdown parsing was a number of bigger projects (at least one of them will appear on a site near you in a not too distant future), I’ve also used some spare time to play with TDOP-based parsing of other formats.
One example is a JSON parser, the first version of which took about an hour to write, and which resulted in about 60 lines of JSON-specific code (30 lines for the parser definitions, 30 lines for the tokenizer), and 120 lines in total (including the parser framework, that is). Even the JSON-specific parts were mostly cut and pasted from the Python expression parser, and then just tweaked a little to fit.
The code is not quite ready for public consumption (just a leetle more testing is required…), but I’ll get there.
I did find the benchmarks a bit interesting. Using a 10k JSON record, I get the following timings:
- tdopjson: 12.6 ms
- simplejson, with C accelerator: 20.2 ms
- simplejson, without C accelerator: 24.8 ms
That is, this “quick hack” is about twice as fast as a robust and well-optimized production parser, and is a quite a bit faster than even a Python/C hybrid parser. Not bad.
This specific JSON record happened to be Python-compatible, so for comparison, I also fed it directly to Python’s eval (which isn’t something you’d want to do with arbitrary JSON, of course):
- eval: 5.3 ms
Removing the “interpretation” code from the TDOP parser only shaves off a few milliseconds, so using TDOP for validation and then passing the result to eval would be a step back (it’ll still beat simplejson, though).
I also tried the JSON parser in the pyparsing example collection, but that couldn’t parse any of my existing samples (all based on samples from the JSON specification or real-life JSON services). Using the sample provided with the parser, pyparsing seems to be about 30-40 times slower than the topdown approach.
(I’m sure a simple table-driven C implementation (dead link) would run circles around both implementations, but that’s a somewhat different story. I know how to write fast code in C/C++; in this case, I’m more interested in finding ways to write fast code in Python.)