Road to NumPy 2.0
hackmd.io
While we are trying to minimize the disruptions, there is one thing project maintainers should do right now: pin the maximum NumPy to <2.0 in their ~`pyproject.toml`~ project dependencies. This will ensure they do not inadvertently upgrade before they are ready to do so. Once NumPy 2.0 is released, you can check that your code works with it, and then release the pin.
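For reference, a pin like the one described is typically written as the requirement string `numpy<2.0` in the dependency list. As a rough sketch of the same bound checked at runtime (the helper name `below_major` is made up for illustration and is not part of NumPy or any packaging tool):

```python
def below_major(version: str, major_cap: int = 2) -> bool:
    """Return True if the major component of `version` is below `major_cap`.

    Hypothetical helper: it only looks at the text before the first dot,
    so "1.26.4" has major 1 and "2.0.0rc1" has major 2.
    """
    major = int(version.split(".")[0])
    return major < major_cap


# A 1.x release satisfies the "<2.0" style bound, a 2.x release does not.
ok_with_pin = below_major("1.26.4")
blocked_by_pin = not below_major("2.0.0")
```

In a real project one would pass `numpy.__version__` to a check like this, or simply rely on the installer to enforce the pin.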
Definitely good advice on pinning the upper range for major versions. Pandas 2.0 inadvertently broke Spark's toPandas for (at the time) all existing Spark versions, which some of our users lean on. Pandas' response to this was apparently to upgrade Spark to a version that was not yet released. We lost about half a day's work from a small handful of devs trying to identify why some of our users were seeing sudden failures.
I think that ndarray is the most successful abstraction I've come across. Numerical computing is a domain ripe for terrible code, but multi-indexing, broadcasting, mask arrays, .reshape(), .where(), linspace(), and all that are so well made and useful they are now the standard grammar of data science. Yes you've seen horrible numpy code before, but how much worse would it be if it had been written by the same person in raw C with only malloc and pointer arithmetic?
Just FYI, numpy by no means pioneered this concept. There was Fortran before, as a language built around n-dimensional arrays. And even Fortran was not the first one, as some stack based programming languages (such as APL, IIRC) have similar concepts. Also languages such as R and eventually Matlab (as a kind-of nicer frontend to Fortran libs) pioneered this concept. However, Numpy was the first library bringing this into a general-purpose language such as Python.
Fortran, Matlab and all array-based languages (APL, J, K, Nial, etc) have these constructs at their core. Ndarray may be a great implementation but the ideas have all been there. I've been exploring J recently and the flexibility and compactness is tremendous.
Numeric Python, the original numpy, was based mainly on the MATLAB array object. Of course, it was also influenced by FORTRAN and other languages, but ultimately, it was taken from the mental model of MATLAB.
and MATLAB was entirely influenced by Fortran's array-based syntax.
s/stack/array/
Fortran was created in the 1950s. APL — 60s.
I remember implementing a sliding overlapping 2d windowing by playing with the strides. I don't know how common that trick is but at the time it felt like magic.
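For anyone who hasn't seen the strides trick, here is a minimal sketch of one way to do it with `numpy.lib.stride_tricks.as_strided`. The function name `sliding_windows_2d` and the sample array are only illustrative, and this is not necessarily how the commenter did it:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def sliding_windows_2d(a, win):
    """Return a read-only view of every overlapping (win_h, win_w) window of a 2-D array."""
    a = np.asarray(a)
    win_h, win_w = win
    rows, cols = a.shape
    # One output cell per window position, each holding a win_h x win_w block.
    out_shape = (rows - win_h + 1, cols - win_w + 1, win_h, win_w)
    # Moving to the next window or the next element inside a window uses
    # the same byte steps as the original array, so no data is copied.
    out_strides = a.strides + a.strides
    return as_strided(a, shape=out_shape, strides=out_strides, writeable=False)


grid = np.arange(12).reshape(3, 4)
windows = sliding_windows_2d(grid, (2, 2))
```

Here `windows` has shape (2, 3, 2, 2) and shares memory with `grid`. Newer NumPy releases (1.20 and later) also ship `numpy.lib.stride_tricks.sliding_window_view`, which wraps the same idea with bounds checking.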
The really interesting page IMO is the actual project roadmap:
numpy is a great library, but I find the pytorch take on it superior. It allows for much more method-chaining, which I find to be very readable in longer computations. I frequently want to reach out for it in numpy, only to see that it's a np.* function. In functional languages I would use an infix pipeline-combinator like (|>) but that's not possible in python.
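To make the complaint concrete: a chain of `np.*` functions reads inside-out when nested. One possible workaround, sketched here with a made-up `pipe` helper (not a NumPy feature), is to fold the functions over the value left to right:

```python
from functools import reduce

import numpy as np


def pipe(value, *funcs):
    """Apply each function in `funcs` to the running result, left to right."""
    return reduce(lambda acc, f: f(acc), funcs, value)


x = np.array([-9.0, 4.0, -1.0])

# Nested np.* calls have to be read from the innermost call outward.
nested = np.sort(np.sqrt(np.abs(x)))

# The same computation, written in the order the steps happen.
piped = pipe(x, np.abs, np.sqrt, np.sort)
```

Both forms give the same array; the second is only a readability choice and does not add real method chaining to ndarray.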
While not really python, there's the Coconut language that has the infix pipeline operator and compiles into python.
This service HackMD looks pretty cool.
Still no NA?
I’m working on adding missing data support for strings as part of adding a UTF-8 variable-width string type to NumPy. Not a general solution but should help with a lot of use-cases. https://numpy.org/neps/nep-0055-string_dtype.html
The current memory use of string arrays is another major issue, glad to see this being worked on!
np.nan? Not trying to be funny, but hoping to learn whether I'm missing something about limitations of np.nan which would be solved by some other kind of missing value indicator.
np.nan is only for floats, doesn't help with integer, boolean, string etc. Also, datetimes have NaT, but it's troublesome to e.g. do different checks np.isnan() or np.isnat() depending on the data type. And we don't even have np.nat, but need np.datetime64("NaT"), so it's just confusing.
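A small sketch of the dtype juggling described here. The `is_missing` helper is hypothetical; it just dispatches on `dtype.kind`, and for dtypes with no missing-value marker (integers, booleans) it can only report False everywhere:

```python
import numpy as np


def is_missing(arr):
    """Hypothetical helper: choose the missing-value check based on dtype kind."""
    arr = np.asarray(arr)
    if arr.dtype.kind in "mM":   # timedelta64 / datetime64 use NaT
        return np.isnat(arr)
    if arr.dtype.kind in "fc":   # float / complex use NaN
        return np.isnan(arr)
    # Integers, booleans, etc. have no built-in missing marker.
    return np.zeros(arr.shape, dtype=bool)


floats = np.array([1.0, np.nan])
dates = np.array(["2024-01-01", "NaT"], dtype="datetime64[D]")
ints = np.array([1, 2])

missing_floats = is_missing(floats)
missing_dates = is_missing(dates)
missing_ints = is_missing(ints)
```

The last branch is the real limitation: there is simply no value an integer array can hold to say "missing".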
Why not use a structured array with an 'isna' field to use as a mask when performing operations?
How is that convenient? Missing data support belongs deep in NumPy itself (or any other similar package) so that operations can do the right thing and missing values propagate correctly. For example, let's say you want by definition missing values to sort last. If you roll out your custom missing value marker, you'll also need to roll out your own custom sort function. And the same for a whole lot more stuff.
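As an illustration of the "roll your own sort" point, here is a rough sketch using a structured array with an 'isna' field as suggested above. The field names and sample data are assumptions for the example. Getting missing values to sort last already needs a hand-written two-key sort:

```python
import numpy as np

# Each record carries its own missing flag; the value of a missing record is a dummy 0.
records = np.array(
    [(3, False), (0, True), (1, False), (2, False)],
    dtype=[("value", "i8"), ("isna", "?")],
)

# np.lexsort treats the LAST key as the primary one, so rows are ordered
# by the flag first (present before missing) and then by value.
order = np.lexsort((records["value"], records["isna"]))
sorted_records = records[order]
```

A plain `np.sort(records["value"])` would instead put the dummy 0 first, which is the kind of wrong answer that built-in missing data support would avoid.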
What about a MaskedArray? ndarrays are homogeneous by definition.
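For comparison, a minimal sketch of the MaskedArray route using `numpy.ma`, with made-up sample data. Reductions skip masked entries, and `np.ma.sort` places them at the end by default:

```python
import numpy as np

# The second entry is marked as missing via the mask.
counts = np.ma.masked_array([4, 7, 1, 9], mask=[False, True, False, False])

total = counts.sum()        # masked entries are ignored in reductions
n_valid = counts.count()    # number of unmasked entries

# Masked entries are placed last (endwith=True is the default).
ordered = np.ma.sort(counts)
```

This works for integer data too, at the cost of carrying a separate boolean mask alongside the values.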