Show HN: A fast ISO8601 date-time parser for Python
hack.close.io
A regex only seems to take ~1µs.
In [7]: iso_regex = re.compile('(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}(?:\\.?\\d+))')
In [8]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
1000000 loops, best of 3: 1.05 µs per loop
But hey, once it's written in C, why go back?
I'm missing the timezone, but the OP left that out, so I did too. For comparison, dateutil's parse takes ~76µs for me. Kinda makes me wonder why aniso8601 is so slow. (It's also missing a few other things, depending on whether you count all the non-time forms as valid input.)
That said, cool! I might use this. One of the things that makes dateutil's parse slower is that it'll parse more than just ISO-8601: it parses many things that look like dates, including some very non-intuitive ones that have caused "bugs"¹. Usually in APIs, it's "dates are always ISO-8601", and all I really need is an ISO-8601 parser. While I appreciate the theory behind "be liberal in what you accept", sometimes I'd rather error out than build the expectation that sending garbage — er, stuff that requires a complicated parse algorithm that I don't really understand — is okay.
¹dateutil.parser.parse('') is midnight of the current date. Why, I don't know. Also, dateutil.parser.parse('noon') is "TypeError: 'NoneType' object is not iterable".
The library has the following features your regex is missing:
* Every part from month onwards is optional
* Separator characters are optional
* Date/time separator can be a space as well as T
* Timezone information
* Parsing the strings into numbers
* Actually creates a datetime object
I expect adding all of those will bump up the time a bit.
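Concretely, the feature list above means calls like these should all work (a quick sketch based on that list; the exact defaults are my assumption, not something I've verified against the library):

import ciso8601

ciso8601.parse_datetime('2014-01-09T21:48:00.921000')  # full timestamp
ciso8601.parse_datetime('2014-01-09 21:48')  # space separator, seconds omitted
ciso8601.parse_datetime('20140109')  # separator characters omitted
ciso8601.parse_datetime('2014-01')  # presumably datetime(2014, 1, 1, 0, 0)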
I'm not much of a regex wizard, but I tried to add all the features listed other than parsing the result and creating the datetime object.
It seems like it performs quite a bit worse than the library, which creates the full object:

iso_regex = re.compile(
    '([0-9]{4})-?([0-9]{1,2})'
    '(?:-?([0-9]{1,2})'
    '(?:[T ]([0-9]{1,2})'
    '(?::?([0-9]{1,2})'
    '(?::?([0-9]{1,2}(?:\\.?[0-9]+)?))?'
    '(?:(Z)|([+-][0-9]{1,2}):?([0-9]{1,2}))?'
    ')?)?)?')
In the interest of intellectual pursuit, is there anything that can be done to the regex to speed it up?

In [82]: %timeit ciso8601.parse_datetime('2014-01-09T21:48:00.921000')
1000000 loops, best of 3: 368 ns per loop
In [83]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
100000 loops, best of 3: 9.72 µs per loop
Note you still need to convert your regex match to a datetime object which is likely going to add some significant overhead.
Good idea with the regex, haven't tried it. That being said, you didn't take into account the time to construct a datetime object, let alone attach time zone information. ciso8601 supports time zones using pytz's FixedOffset and UTC classes; see https://github.com/elasticsales/ciso8601 for additional benchmarks. There's potential for further speedup by using a tzoffset subclass written in C, but in our case all dates were UTC anyway, so we didn't need the time zone.
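To give a feel for that conversion overhead, here's a sketch of the glue code you'd need on top of the nine-group regex from upthread (my own illustration, not part of ciso8601):

import datetime
from pytz import FixedOffset, utc

def regex_to_datetime(m):
    # Group order from the regex above:
    # year, month, day, hour, minute, second, 'Z', signed tz hour, tz minute
    year, month, day, hour, minute, second, zulu, tzh, tzm = m.groups()
    secs = float(second) if second else 0.0
    if zulu:
        tz = utc
    elif tzh:
        sign = -1 if tzh.startswith('-') else 1
        tz = FixedOffset(sign * (abs(int(tzh)) * 60 + int(tzm)))
    else:
        tz = None
    return datetime.datetime(int(year), int(month), int(day or 1),
                             int(hour or 0), int(minute or 0), int(secs),
                             int(round(secs % 1 * 1000000)), tz)

All that pure-Python plumbing typically costs several µs per call, which is exactly where the C extension wins.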
Pandas (a data analysis library for Python) has a lot of Cython and C optimizations for datetime string parsing:
They have their own C function which parses ISO-8601 datetime strings: https://github.com/pydata/pandas/blob/2f1a6c412c3d1cbdf56610...
They have a version of strptime written in Cython: https://github.com/pydata/pandas/blob/master/pandas/tslib.py...
I'm not saying these are better/worse than your solution, I haven't done any benchmarks and the pandas functions sometimes cut a few corners, but perhaps there is something useful there for reference anyways. They also don't deal directly in datetime.datetime objects, they use pandas specific intermediate objects, but should be simple enough to grok.
Having done some work with dateutil, I will tell you that dateutil.parser.parse is slow, but its main use case shouldn't be converting strings to datetimes if you already know the format. If you know the format already you should use datetime.strptime or some faster variant (like the one above). There is a nice feature of pandas where given a list of datetime-y strings of an arbitrary format, it will attempt to guess the format using dateutil's lexer (https://github.com/pydata/pandas/blob/master/pandas/tseries/...) combined with trial/error, and then try to use a faster parser instead of dateutil.parser.parse to convert the array if possible. In the general case this resulted in about a 10x speedup over dateutil.parser.parse if the format was guessable.
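For reference, the known-format stdlib route is a one-liner (shown with the timestamp format used in the benchmarks above):

from datetime import datetime

# Exact-format parse; raises ValueError on anything else instead of guessing
dt = datetime.strptime('2014-01-09T21:48:00.921000', '%Y-%m-%dT%H:%M:%S.%f')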
It would have been nice if Pandas had split this out into a separate package so that you didn't need to pull down all of Pandas to use it. This is why people are duplicating your efforts.
I tried to do a fair comparison between the main date implementations. ciso8601 is really fast, 3.73 µs on my computer (MBA 2013). aniso8601, iso8601, isodate and arrow are all between 45 and 100 µs. The dateutil parser is the slowest (150 µs).
>>> ds = u'2014-01-09T21:48:00.921000+05:30'
>>> %timeit ciso8601.parse_datetime(ds)
100000 loops, best of 3: 3.73 µs per loop
>>> %timeit dateutil.parser.parse(ds)
10000 loops, best of 3: 157 µs per loop
A regex[1] can be fast, but the parsing is just a small part of the time spent.

>>> %timeit regex_parse_datetime(ds)
100000 loops, best of 3: 13 µs per loop
>>> %timeit match = iso_regex.match(ds)
100000 loops, best of 3: 2.18 µs per loop
Pandas is also slow. However it is the fastest for a list of dates, just 0.43 µs per date!!

>>> %timeit pd.to_datetime(ds)
10000 loops, best of 3: 47.9 µs per loop
>>> l = [u'2014-01-09T21:{}:{}.921000+05:30'.format(
("0"+str(i%60))[-2:], ("0"+str(int(i/60)))[-2:])
for i in xrange(1000)]  # 1000 different dates
>>> len(set(l)), len(l)
(1000, 1000)
>>> %timeit pd.to_datetime(l)
1000 loops, best of 3: 437 µs per loop
NB: pandas is however very slow on ill-formed dates, like u'2014-01-09T21:00:0.921000+05:30' (only one digit for the seconds): 230 µs, with no speedup from vectorization.

So if you care about speed and your dates are well formatted, make a vector of dates and use pandas. If you can't use it, go for ciso8601. For thomas-st: it may be possible to speed up parsing of lists of dates like pandas does. Another nice feature would be caching.
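On that last point, a minimal caching sketch (my own illustration, not an existing ciso8601 feature); it pays off when the same timestamp strings recur, e.g. in log data, and sharing the cached objects is safe because datetimes are immutable:

import ciso8601

_cache = {}

def cached_parse(s):
    # A dict hit costs ~100 ns, far cheaper than even the fastest re-parse
    try:
        return _cache[s]
    except KeyError:
        dt = _cache[s] = ciso8601.parse_datetime(s)
        return dt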
Extremely simple and straightforward C code too, which is also nice to read. At 320 ns (on what processor?), assuming a 2-3 GHz x86 clock, that's around 1K instructions, several orders of magnitude less than before. But that still works out to a few dozen instructions per character of the string... so I'm inclined to believe that it could go an order of magnitude faster if you really wanted it to, but at that point the Python overhead (PyArg_ParseTuple et al.) is going to dominate.
I'm not sure this would be any better than just manually writing out both trivial iterations of the loop:

for (i = 0; i < 2; i++)

I did all the timeit benchmarks on the latest 13" retina MacBook Pro, 2.6 GHz Intel Core i5. The profiler screenshot is from one of our servers on EC2.
Of course there is always potential for optimization, but at this point it's fast enough for our purposes. If you can make it significantly faster please don't hesitate to submit a PR though :)
EDIT: Wouldn't most C compilers unroll the simple "for" loops? Direct link to the C code: https://github.com/elasticsales/ciso8601/blob/master/module....
I don't currently have a use for this library, but I'm going to bookmark it anyways because it looks like a nice introduction to writing a module in C. It does something non-trivial but is still simple enough to grok quickly. Thanks!
Does it cover all of ISO8601? I'm sure it covers the common cases, so it's a valuable library anyway, but I seem to remember that ISO8601 is quite complicated.
It doesn't cover all of it, for example week dates or ordinal dates are not currently supported. But feel free to submit any patches :)
Sorry. I'm busy with other things at the moment.
It may also be better not to cover everything, if that keeps the performance and simplicity, but I just like to understand the trade-offs.
My quick look at this shows that unless you Cython-wrap the call, this is going to be slower than using pandas' to_datetime on anything with an array layout.
I've never really spent much time looking at pandas' to_datetime, but I believe it has to handle a lot of variety in what you pass to it (lists, arrays, Series), which probably causes a bit of a perf hit.
http://dl.dropboxusercontent.com/u/14988785/ciso8601_compari...
If you control the source data, store it as epoch and you can avoid this parsing.
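For example (stdlib only; the float below is the epoch equivalent of the example timestamp upthread, assuming it's UTC):

from datetime import datetime

epoch = 1389304080.921000  # store this instead of '2014-01-09T21:48:00.921000'
dt = datetime.utcfromtimestamp(epoch)  # float to datetime, no string parsing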
Not quite related: Is there any python library that can handle timezone parsing, like the java SimpleDateFormat (http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDat...)? The timezone could be in utc offset and short name format (EST, EDT,...). I am surprised that I couldn't find one.
While profiling I noticed the same thing about dateutil.parser.parse a few years ago. We standardized all our interacting systems on UTC so we have a regex that matches UTC and if that fails to match we call dateutil. That way the vast majority of cases are optimized but we still support other timezones.
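Roughly like this (a sketch of the approach, not our exact code; the fast path returns a naive datetime, so attach pytz.utc afterwards if you need an aware one):

import re
import datetime
import dateutil.parser

# Fast path: UTC timestamps in one known shape; everything else falls back
UTC_RE = re.compile(
    r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})\.(\d{6})(?:Z|\+00:00)$')

def parse_dt(s):
    m = UTC_RE.match(s)
    if m:
        return datetime.datetime(*map(int, m.groups()))
    return dateutil.parser.parse(s)  # slow path handles other timezones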
How many dates are you parsing at a time that optimizing this would make a noticeable difference to users?
The post says: "For large object structures with thousands of date time objects this can easily add up." At 0.1ms per parse, that's 100ms per thousand dates, within the range of noticeable. (Their profiler screenshot has it taking 589ms.)
0.1ms to parse a date???
Even the standard PHP string parser does 0.017ms on my 3 year old netbook.
Seems like this solves a non-existing issue.

<?php
$st = microtime(true);
$cnt = 10000;
for ($i = 0; $i < $cnt; $i++)
    strtotime('2014-01-09T21:48:00.921000');
echo 1000 * (microtime(true) - $st) / $cnt;

You can see the issue it solves pretty clearly here: https://github.com/elasticsales/ciso8601#benchmark
Python != PHP
Actually both Python and PHP are ridiculously slow languages. Though Python is slower.
Some implementations of Python are slowish for some tasks. Many parts, like the module being discussed, are written in C/assembly/Fortran/Java.
Python with a JIT is PyPy: http://speed.pypy.org/
Also PHP has some fast _implementations_ of PHP.
Languages aren't slow. Interpreters are.
As a human, I can think a lot faster in Python. So for me, it's a faster language.
This also doesn't do the same thing, since you're not constructing a DateTime object.
I do notice a lot of people on Hacker News who clearly have never had to write high-throughput software.
Lots of people deal with data rates that make webscale throughput look pretty pathetic; you're just less likely to know about it, as it will be proprietary tech.
There are other parsers that exist already too. For example, did you try this one? https://pypi.python.org/pypi/iso8601
How do these all compare to each other?
I think you actually mean RFC3339. ISO8601 is probably a lot larger than you think.
RFC3339 makes most of the fields mandatory, while this library leaves them optional, so it is more accurately a subset of ISO8601 than an implementation of RFC3339. That said, you could describe it as an extension of RFC3339.
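To make the difference concrete, forms like these are valid ISO8601 but not RFC3339 (examples reconstructed from memory of the spec, so double-check before relying on them):

2014-W02-4      # week date: Thursday of ISO week 2, i.e. 2014-01-09
2014-009        # ordinal date: the 9th day of 2014
2014-01         # reduced precision: no day, time, or offset
20140109T2148   # basic format: no separator characters

RFC3339, by contrast, requires a complete date-time with an offset, e.g. 2014-01-09T21:48:00.921000+05:30.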
This seems like the type of thing that's good to ffi out of you're using it a lot. I highly doubt the c version would take this long.
What does that mean?

> good to ffi out of you're using it a lot

Foreign Function Interface
Assuming "foreign function interface". http://en.wikipedia.org/wiki/Foreign_function_interface
s/of/if/
Spell correct is a helluva drug. I get bitten by of/if all the time.
Would it make more sense to modify the core library and send off a patch?
Most of the speed comes from only parsing a frequently used subset of ISO8601. For a core library, you probably want a more complete implementation.