Show HN: A 6KB, fast and mostly-correct XML parser in C

dev.yorhel.nl

3 points by yorhel 12 years ago · 5 comments

Reader

This looks interesting. Is it brand new? I've seen parsers with similar names, but, of course, almost all short letter combinations containing xml have been used, so that's expected.

It doesn't allocate or buffer, so when it says that it verifies proper nesting, I assume that doesn't mean verifying that start tags and end tags have matching names. I think that would be impossible.

I've been looking for a decent C parser similar to this to try some experiments with. Dealing with XML can be dreary, but I like to try to think of how you might make an API that doesn't make your code suck. It would be at a higher level, so I might be able to use this.

It looks like you pass the parse function a character at a time. To be fast, you'd want to somehow encourage that to get inlined. I'm not sure how easy or possible it is when you put the function in a separate .c file. There would probably be ways to stuff everything into the header.

I'm interested particularly in parsers that don't buffer. It seems like a fundamentally more flexible and potentially better way to parse XML is by memory mapping and letting the VM do what it does best.

yorhelOP 12 years ago

It's new, yes. I wrote it about two months ago because I needed a parser and found the existing solutions too bloated.
> It doesn't allocate or buffer, so when it says that it verifies proper nesting, I assume that doesn't mean verifying that start tags and end tags have matching names. I think that would be impossible.
It does match start tags and end tags, and in order to do so it does, indeed, need a small buffer. However, the only thing that needs to be stored in this buffer is a stack of element names, and that rarely needs much space - a simple fixed-size buffer is sufficient. It can be safely allocated on the stack if the application wants to. See the documentation for the yxml_init() for more details. The "mostly-bufferless" part of the homepage refers to the lack of a complicated input buffer and the avoidance of complex buffer management for output where possible.
> It looks like you pass the parse function a character at a time. To be fast, you'd want to somehow encourage that to get inlined.
The function is quite large. You can probably get it inlined when put inside a header file and when it's only called in a single place, and you may indeed see a performance boost. I haven't measured how significant such an improvement may be.
However, if performance is the primary goal (it wasn't for yxml), a completely different design would probably be better. Yxml needs to manage an explicit state object for each input character. For each new character it has to look at the state to know how it's handled, handle it, and then update the state for the next character. I was honestly surprised to see that yxml performed quite well in my benchmarks.
A more efficient design is probably the goto-based state machine approach that, e.g. ragel[1] can output. The reason I didn't go for that approach is that it complicates buffering. You either need to have the full file mapped into memory (not possible when reading from a socket or from a compressed stream), or you need to implement your own buffering and provide hooks so that the application can fill the buffer when it is depleted. The latter approach is used by most other XML parsers.
[1] http://www.complang.org/ragel/
- mtdewcmu 12 years ago
  
  I was interested in yours specifically because it's the only one I've seen that doesn't buffer (almost). You might as well drop the last little buffer and be completely bufferless. Then it will never cause an error or block, and you'll have no dependencies. If you want to support validation, make a higher-level interface to this and do it there. It will be just as easy.
  - yorhelOP 12 years ago
    
    Hmm? The stack buffer in yxml will never cause parsing to block, and there's no added dependencies on... anything? I don't think that buffer causes any problems even on a size-restricted microcontroller that doesn't have malloc(). As long as you can find ~512 bytes or so of free memory you can parse a lot of files.
    The only situation in which that buffer would cause an error is when the application used a too small buffer, or when the document is far too deeply nested or has extremely long element/property names. Both the maximum nesting level name lengths should, IMO, be limited in the parser in order to protect against malicious documents. Most parsers have separate settings for that, yxml simplifies that by letting the application control the size of a buffer.
    The stack buffer in yxml is also used to make the API a bit easier to use. With the buffer I can pass element/property names as a single zero-terminated C string to the application, without it I would have to use the same mechanism as used for attribute values and element contents, and that mechanism isn't all that easy to use. (This is the one case where I chose convenience over simplicity, but I kinda wanted the validation anyway so that wasn't really a problem)
    
    mtdewcmu 12 years ago
    
    Ah. I have not yet looked too far into how yxml works. I never came up with a perfectly satisfactory solution to the zero-buffer problem myself, but you've hit on a lot of the things that make it a hassle either way. It's almost impossible to do a usable xml parser under the assumption that it will not buffer and it will not be guaranteed access to more than one character at a time. I started developing one like that based on a goto-driven state machine, but I stopped working on it, because the interface was going to be too inconvenient.
    What I meant about blocking was blocking on malloc. It sounds like you're expecting the caller to take care of allocation?

Settings

Show HN: A 6KB, fast and mostly-correct XML parser in C

Keyboard Shortcuts