Road to NumPy 2.0

68 points by adm_ 2 years ago · 23 comments

mattip 2 years ago

While we are trying to minimize the disruptions, there is one thing project maintainers should do right now: pin the maximum NumPy to <2.0 in their `pyproject.toml` project dependencies. This will ensure they do not inadvertently upgrade before they are ready to do so. Once NumPy 2.0 is released, you can check that your code works with it, and then release the pin.

appplication 2 years ago

Definitely good advice on pinning the upper range for major versions. Pandas 2.0 inadvertently broke Spark's toPandas for (at the time) all existing Spark versions, which some of our users lean on. Pandas' response to this was apparently to upgrade Spark to a version that was not yet released. We lost about half a day's work from a small handful of devs trying to identify why some of our users were seeing sudden failures.

fouronnes3 2 years ago

I think that ndarray is the most successful abstraction I've come across. Numerical computing is a domain ripe for terrible code, but multi-indexing, broadcasting, mask arrays, .reshape(), .where(), linspace(), and all that are so well made and useful they are now the standard grammar of data science. Yes you've seen horrible numpy code before, but how much worse would it be if it had been written by the same person in raw C with only malloc and pointer arithmetic?
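
For readers who have not used the pieces named above, here is a minimal sketch of how they combine. The array values and variable names are made up for illustration, and note that `where` is used as the function `np.where` here.

```python
import numpy as np

# Five evenly spaced samples on [0, 1], endpoints included.
x = np.linspace(0.0, 1.0, 5)

# Broadcasting: a (3, 1) column against a (5,) row gives a (3, 5) grid.
scale = np.array([1.0, 2.0, 3.0]).reshape(3, 1)
grid = scale * x

# Boolean mask plus np.where: replace everything above 1.0 with 1.0.
mask = grid > 1.0
clipped = np.where(mask, 1.0, grid)

# Indexing with a mask returns the selected elements as a flat array.
large = grid[mask]
```

No explicit loop appears anywhere, which is most of the point being made in the comment.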

ktpsns 2 years ago

Just FYI, numpy by no means pioneered this concept. There was Fortran before, as a language built around n-dimensional arrays. And even Fortran was not the first one, as some stack based programming languages (such as APL, IIRC) have similar concepts. Also languages such as R and eventually Matlab (as a kind-of nicer frontend to Fortran libs) pioneered this concept. However, Numpy was the first library bringing this into a general-purpose language such as Python.
- rajandatta 2 years ago
  
  Fortran, Matlab and all array based languages (APL, J, K, Nial, etc) have these constructs at their core. Ndarray may be a great implementation but the ideas have all been there. I've been exploring J recently and the flexibility and compactness is tremendous.
- dekhn 2 years ago
  
  Numeric Python, the original numpy, was based mainly on the MATLAB array object. Of course, it was also influenced by FORTRAN and other languages, but ultimately, it was taken from the mental model of MATLAB.
  - TheRealKing 2 years ago
    
    and MATLAB was entirely influenced by Fortran's array-based syntax.
- yakubin 2 years ago
  
  s/stack/array/
- d0mine 2 years ago
  
  Fortran was created in the 1950s. APL — 60s.
jcarrano 2 years ago

I remember implementing a sliding overlapping 2d windowing by playing with the strides. I don't know how common that trick is but at the time it felt like magic.
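
A minimal sketch of what such a stride trick can look like, assuming a small 4x4 integer array and 2x2 windows (both chosen only for illustration; this is not the commenter's code). `as_strided` does no bounds checking, so the shape arithmetic has to be right; `sliding_window_view`, available since NumPy 1.20, builds the same view more safely.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

a = np.arange(16, dtype=np.int64).reshape(4, 4)
win = (2, 2)

# One (2, 2) window per valid top-left corner.
out_shape = (a.shape[0] - win[0] + 1, a.shape[1] - win[1] + 1) + win

# Reuse the array's own strides twice: the first pair steps between
# windows, the second pair steps inside a window. No data is copied.
windows = as_strided(a, shape=out_shape, strides=a.strides * 2, writeable=False)

# Same view, built by the library with bounds checks.
safe = sliding_window_view(a, win)
```

Both results are views onto `a`, so overlapping windows cost no extra memory; passing `writeable=False` guards against accidentally writing through overlapping elements.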

uoaei 2 years ago

The real interesting page IMO is the actual project roadmap:

https://github.com/orgs/numpy/projects/9

LeanderK 2 years ago

numpy is a great library, but I find the pytorch take on it superior. It allows for much more method-chaining, which I find to be very readable in longer computations. I frequently want to reach out for it in numpy, only to see that it's a np.* function. In functional languages I would use an infix pipeline-combinator like (|>) but that's not possible in python.

syockit 2 years ago

While not really python, there's Coconut language that has the infix pipeline operator and compiles into python.
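
To make the chaining point concrete, here is a rough sketch in plain Python. The `pipe` helper is hypothetical (it is not part of NumPy or Coconut); it only imitates a left-to-right pipeline with `functools.reduce`.

```python
from functools import reduce

import numpy as np


def pipe(value, *funcs):
    """Hypothetical helper: feed value through funcs from left to right."""
    return reduce(lambda acc, f: f(acc), funcs, value)


x = np.array([-4.0, 9.0, -16.0])

# Nested np.* calls read inside-out.
nested = np.sum(np.sqrt(np.abs(x)))

# The same computation written left to right.
piped = pipe(x, np.abs, np.sqrt, np.sum)

# Partial chaining: .sum() exists as a method, np.sqrt does not.
chained = np.sqrt(np.abs(x)).sum()
```

All three compute the same value; the difference is only in reading order.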

whalesalad 2 years ago

This service HackMD looks pretty cool.

otsaloma 2 years ago

Still no NA?

ngoldbaum 2 years ago

I’m working on adding missing data support for strings as part of adding a UTF-8 variable-width string type to NumPy. Not a general solution but should help with a lot of use-cases. https://numpy.org/neps/nep-0055-string_dtype.html
- otsaloma 2 years ago
  
  The current memory use of string arrays is another major issue, glad to see this being worked on!
EForEndeavour 2 years ago

np.nan? Not trying to be funny, but hoping to learn whether I'm missing something about limitations of np.nan which would be solved by some other kind of missing value indicator.
- otsaloma 2 years ago
  
  np.nan is only for floats, doesn't help with integer, boolean, string etc. Also, datetimes have NaT, but it's troublesome to e.g. do different checks np.isnan() or np.isnat() depending on the data type. And we don't even have np.nat, but need np.datetime64("NaT"), so it's just confusing.
  - sheepshear 2 years ago
    
    Why not use a structured array with an 'isna' field to use as a mask when performing operations?
    
    otsaloma 2 years ago
    
    How is that convenient? Missing data support belongs deep in NumPy itself (or any other similar package) so that operations can do the right thing and missing values propagate correctly. For example, let's say you want by definition missing values to sort last. If you roll out your custom missing value marker, you'll also need to roll out your own custom sort function. And the same for a whole lot more stuff.
    
    sheepshear 2 years ago
    
    What about a MaskedArray? ndarrays are homogeneous by definition.
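
A short sketch of the two approaches discussed in this subthread, with made-up data. `is_missing` is a hypothetical helper, not a NumPy function; it shows the per-dtype dispatch that otsaloma describes as troublesome. The second half uses `np.ma`, where the mask travels with the array and `np.ma.sort` places masked entries last by default.

```python
import numpy as np


def is_missing(arr):
    """Hypothetical helper: choose the missing-value check by dtype kind."""
    arr = np.asarray(arr)
    if arr.dtype.kind in "mM":   # timedelta64 / datetime64 use NaT
        return np.isnat(arr)
    if arr.dtype.kind in "fc":   # float / complex use NaN
        return np.isnan(arr)
    # Integer, boolean and fixed-width string dtypes have no sentinel value.
    return np.zeros(arr.shape, dtype=bool)


floats = np.array([1.0, np.nan, 3.0])
dates = np.array(["2024-01-01", "NaT"], dtype="datetime64[D]")

# A masked array carries an explicit mask, so it also works for integers.
ints = np.ma.masked_array([3, 1, 2], mask=[False, True, False])
total = ints.sum()           # the masked entry is skipped
ordered = np.ma.sort(ints)   # masked entries sort last (endwith=True)
```

This covers sorting and reductions, but it only works through the `np.ma` functions and methods, which is part of the objection above that missing-data support is not built into plain ndarrays.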
