Ask HN: How did Python become the lingua franca of ML/AI?

78 points by heyzk 4 years ago · 98 comments

I've looked around a bit and can't really find a satisfying answer to this question. There are posts on answer sites, but these often boil down to "dynamic languages are good at glue", or "Tensorflow / Jupyter".

That can't be the whole story, can it? Or if it is, why did these projects choose Python over other scripting languages?

I bet there's some interesting history here.

cameldrv 4 years ago

NumPy, SciPy and the ecosystem around them. So much of what you do in ML involves matrix operations. People used to do this stuff in Matlab. Matlab is good at numerics but it's not a very good programming language, and doesn't have very good libraries outside of the numeric domain. The open source nature of NumPy and Python encouraged a big open source community that is hard to get going if you're building open source on top of a language that costs thousands of dollars per seat.

Python's dynamic nature also made a lot of what's in NumPy and the various ML libraries possible or more convenient to use. The performance is not as much of an issue if you start thinking in NumPy terms, doing operations on whole arrays where the loops are then in C. Really, Python itself is just acting as orchestration for a bunch of C code that's doing all the work. In the case of something like Tensorflow or PyTorch, it's actually a bunch of CUDA code that's doing all the work and orchestrated by Python.
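
For instance, a minimal sketch of "thinking in NumPy terms" (array names here are purely illustrative):

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # One vectorized expression: the million-element loop runs in C inside
    # NumPy, while Python only dispatches the operation.
    c = 2.0 * a + b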

  • tpoacher 4 years ago

    Matlab is an excellent language, with great packages, C/C++ interoperability, seamless GPU support, and JIT compilation. Arguably it is easier than Python for this purpose.

    But. It's commercial. And thus prohibitive to the hobbyists and enthusiasts who are ultimately responsible for this kind of network effect.

    And while I have a lot of love for octave, without the slew of proprietary packages and functionality available to matlab, it is hard for it to compete in such an ecosystem, despite some nice courses out there that use it (notably Andrew Ng's ML course).

    If more people contributed open source packages to octave I'm sure it would become as big a player as python.

    (inb4 julia: yes, but julia has other problems)

    • evgen 4 years ago

      Regardless of whether Matlab is a good language, it is looked at as a one-trick pony. You would never instinctively reach for Matlab to write a quick web server or to scrape a third-party API. What Python brings is all of the math and data analysis code that was noted and everything else in the Python ecosystem. It is not just that you get the Numpy, Scipy, scikit, etc collection but you also get to add everything else Python is used for as a bonus. This leads to virtuous circles where the AI/ML code is made easy to apply to other domains, so it becomes more widely known within that domain and this in turn leads to more support for those same AI/ML libraries.

    • bsder 4 years ago

      MATLAB is fine and could have continued to be the dominant player. MathWorks the company is hot garbage and basically destroyed the ubiquity of MATLAB.

      That's pretty much the alpha and omega. MATLAB had a 20 year head start on everybody and wasted it because everybody hated MathWorks so badly.

      • tpoacher 4 years ago

        It's weird ... if you asked me to provide evidence for it, I would struggle to find THE ONE major thing (or let's say 4 or 5) that did this for me, but what you described is exactly how I feel about Matlab and Mathworks. Perhaps it was just death by a thousand paper cuts?

        Which is odd, because while Octave is effectively 99% the same language, I adore octave and hate matlab with a passion.

        • tpoacher 4 years ago

          The one aspect of the company I do admire however, is the quality of the documentation they provide.

          Matlab documentation really sets the bar on this. Python doesn't even come close (let alone Julia, where documentation is, alas, more often than not an afterthought...)

      • rwallace 4 years ago

        I'm curious, having heard of Matlab but never used it, what exactly did MathWorks do to turn everyone away from them?

MattGaiser 4 years ago

My last job was at an ML company.

Most ML people there could not build large robust systems, and some struggled with the non-algorithmic bits of software. I am sure that some out there in the world can, but for the most part our ML people were very good at creating models and not very good at the development part, especially as the program grew (part of the motivation to hire devs like me in the first place).

Python gets rid of as much of the developmental complexity as possible. No types, no memory management, libraries for everything, no need to create a class to run "hello world". Pip makes it trivial to import things. Use PyCharm and you just need to click the run button, with no complicated JRE and JDK setup.

It is the fastest way to start writing models.

  • mynameisash 4 years ago

    Your description is pretty much spot-on.

    I'm a data engineer at FAANG. I love the data scientists I work with. They are, generally speaking, crazy smart and simultaneously humble about their (in)ability to write code - they're highly specialized in ML, not so much SWE. I therefore have generally good job security working in operationalizing and optimizing their code. (Recently tweaked a script a scientist wrote and dropped runtime from 5hrs to 5min.)

    If I never saw another line of Python again, I'd probably be quite happy. But I think they - and even some ML engineers I work with - love them some Python precisely because they can mash the keyboard a bit and take the shortcut to the finish line. And I don't begrudge them this at all - good on them that they can get their job done quickly! But it's a pain to make things stable and efficient.

  • gonab 4 years ago

    I don't think this is a fair assessment of "most ML people"

    Some of the biggest distributed systems built today are used for statistical inference or scientific computation

    Most "ML people" I know are highly versatile in software, networks and deep hardware knowledge, i.e., essentially they have a very good understanding of what a computer is and what it is capable of

    It's very naive to think that you can assemble machine learning systems without having a solid understanding of computers and statistics

    You know who also likes python a lot? Hackers. I wonder why

    • perth 4 years ago

      The most insane thing about Python is how you can override single methods in classes and use the class like normal. One time I was working on getting FIFO working on Windows, and none of the Python built-ins were set up to handle any random process writing to a named pipe that wasn't within the same Python instance. So what I did was take the closest implementation Python offered, which was in the multiprocessing module [1], and override a single one of its methods to do what I wanted it to do. The module still handled all of the cleanup, so I was confident there weren't any memory leak issues, and I was able to make a simple change to the flags it was passing Windows to allow for the functionality I wanted. A language whose standard library is itself hackable is insane, and I don't think I've seen that in any other language. In addition, I've found the C/C++ bindings for Python wonderful and intuitive to work with. The setup takes no effort at all and it "just works", batteries included, via ctypes.

      https://github.com/python/cpython/blob/main/Lib/multiprocess...
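
      As a toy illustration of the ctypes part (not the named-pipe code; just calling into the C math library on a Unix-like system):

          import ctypes
          import ctypes.util

          # Load libm and call cos() directly -- no compile step needed.
          libm = ctypes.CDLL(ctypes.util.find_library("m"))
          libm.cos.restype = ctypes.c_double
          libm.cos.argtypes = [ctypes.c_double]

          print(libm.cos(0.0))  # 1.0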

      • gonab 4 years ago

        I understand exactly what you are saying. Let me add that Python has a beautiful learning curve

        In the beginning it's very simple to churn out code and do whatever you want, but the deeper you get into it, the more you realize that there are endless possibilities

        It's a great language for beginners and even better for experts who just want to solve problems with code, without thinking too much about whether the code is beautiful, or feeling cool or arrogant about it

        It just works

      • friedman23 4 years ago

        Monkey patching is a terrible practice outside of unit testing and can lead to extremely difficult to debug bugs.

        Also monkey patching isn't unique to python.

        • perth 4 years ago

          FWIW I tried a few other languages, and monkey patching the standard library seemed hit-or-miss:

          In JavaScript it didn't work:

            class testClass {
              constructor() { }
              
              callHoHe() {
                console.log('ho', 'he');
              }
            }
          
            let hi = new testClass();
            hi.callHoHe();
          
            // This assigns a static property on the class object itself;
            // instances still call the prototype method.
            testClass.callHoHe = () => {
              console.log('haha');
            }
          
            let heh = new testClass();
            heh.callHoHe();
          
          This ended up just printing 'ho', 'he' twice.

          For Java, people didn't think it was possible:

          https://stackoverflow.com/questions/47006118/is-there-any-wa...

          They said there that you just have to write your own similar implementation.

          And for C# they have some pretty intense restrictions on overriding standard library stuff:

          https://stackoverflow.com/questions/21302768/where-we-can-ov...

          Golang doesn't seem to have this functionality either:

          https://stackoverflow.com/questions/37079225/golang-monkey-p...

          P.S. It would have been nice to have monkey patching when dealing with btoa and atob in JavaScript, since they behave differently on Node.js vs the browser.

          • diatone 4 years ago

            Pretty close with the JS, just change testClass.callHoHe to testClass.prototype.callHoHe and you're good to go. Agreed about btoa and atob, since they're globally scoped and I'm not sure if they can be overwritten...

          • friedman23 4 years ago

            >Ps. it would have been nice to have monkey patching when dealing with btoa and atob in JavaScript, since they have different function on NodeJS vs the browser.

            The better solution is to encapsulate the class and override the methods. Monkey patching is terrible because the behavior of the function is changing at run time. If someone is not aware that you are monkey patching a function the only way for them to determine what is going on is to step through the code with a debugger.

          • burntoutfire 4 years ago

            In Scala it's possible, but only at object creation time.

        • MeinBlutIstBlau 4 years ago

          This was my thought exactly. I understand why someone would want to do it. However, when a problem comes up, good luck debugging it in python.

      • machiaweliczny 4 years ago

        Ruby has the same ability, although it might be a footgun if you (or a dependency of a dependency) modify or extend the original class.

      • dragonwriter 4 years ago

        > The most insane thing about Python is how you can override single methods in classes and use the class like normal.

        Isn't that true of basically every language supporting class-based OOP and inheritance?

        • danwills 4 years ago

          It's called 'monkey patching' and python does make it particularly easy, simply:

          class.methodName = newMethod

          .. kinda thing, future callers now get your method instead of the original.

          This does seem a fair bit easier than other languages make it?
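
          A runnable sketch of that (toy class, names purely illustrative):

              class Greeter:
                  def greet(self):
                      return "hello"

              def excited_greet(self):
                  return "HELLO!"

              g = Greeter()
              Greeter.greet = excited_greet  # patch the class attribute
              print(g.greet())               # HELLO! -- even existing instances see it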

          • alanfranz 4 years ago

              class.methodName = newMethod
             
            > .. kinda thing, future callers now get your method instead of the original.

            Which is as powerful as it is a problem, since doing this kind of monkeypatching will change the behaviour of all other instances, including already-created ones that know nothing about your trick.

            Any part of the program can modify any other part of the program in a significant way, making local reasoning and debugging very hard.

            So, great for quick-and-dirty single-file scripts/ipython notebooks. Terrible for large systems.

            That's the very issue with Python. The way it doesn't enforce sane, clean programming behaviour makes it easy for a beginner/non-programmer to work with it. But a large system with a lot of external libraries is very hard to maintain.

            Source: Python user since ~2004

          • dragonwriter 4 years ago

            > It's called 'monkey patching' and python does make it particularly easy

            Inheritance and overriding the method in a descendant class is cleaner, and more broadly supported. When you need monkey patching, sure, it's nice that most modern dynamic OO languages support it quite naturally. (Ruby even supports scoped monkey patching via refinements, as well as classic monkey patching and per-object overrides.) But this is not at all unique to Python.
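
            For instance, a minimal sketch of the subclassing approach (toy names, purely illustrative):

                class Base:
                    def connect(self):
                        return "default connection"

                class Patched(Base):
                    def connect(self):  # override just this one method
                        return "custom connection"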

          • perth 4 years ago

            You see exactly what I mean!

    • dustintrex 4 years ago

      Perhaps we're talking about different sets of people? You seem to be describing the people who build ML systems, while the previous poster was talking about the people who use them. Your average data scientist most definitely does not have (or, really, even need to have) deep hardware knowledge or any understanding of networking.

      • gonab 4 years ago

        I personally consider the term "data scientist" a very successful creation by a clever marketer

        It's sexy and most of the time ...

    • exdsq 4 years ago

      Ethical hackers have written some of the worst code I’ve ever seen. They have to know a ton about the security of frameworks, networking, etc… it’s a very complex role for sure, but they are not shining beacons of software engineering quality

      • gonab 4 years ago

        I agree. Their objective is not to build beautiful code but to provide a working prototype

        I'm sure every single one of them is capable of writing world class code if they feel like it

        Their exploits are world class and their focus is to exploit

        I have seen stuff in JavaScript exploitation where I can't even scratch the surface. It feels like I have been playing piano for 15 years and still can't tell whether what's playing is even music

    • gonab 4 years ago

      Python is used for quick and dirty experiments. You haven't figured out the answer yet, so there is no point in building dedicated optimized code which might turn out to be useless in the end

      Almost always, ML production models end up being a binary file of weight matrices. This file can be loaded in whatever language or device you decide to use

    • Agingcoder 4 years ago

      Most ml people (aka data scientists) I've met have little understanding of what a computer is, and it's fine. They understand stats.

      The tiny subset of people who build ML systems (say, TensorFlow core devs, people who write actual distributed systems, etc.) are actually HPC specialists, and have all the qualities you describe.

      Of course, you may work somewhere where you're lucky enough to have everyone be good at everything!

  • a-nikolaev 4 years ago

    Python is a kind of a new Fortran for this generation of scientists. (And Python makes things even easier, ofc, compared to Fortran.)

  • z5h 4 years ago

    > Most ML people there cannot build large robust systems and some struggled with the non-algorithmic bits of software.

    I have experience working alongside ML people and interviewing them, and have to agree (anecdotally) that this is often the case. Not only that, but they often do not make it a priority to get good at these things.

  • billfruit 4 years ago

    JRE/JDK is a rather simple thing to set up. Compared to the complexity in most other stacks, Java is rather straightforward to set up and start building things with.

auntienomen 4 years ago

Python's a lingua franca in AI/NN because it was already a dominant language in scientific computing. Its dominance in scientific computing grew steadily through the 1990s and 2000s, for a few reasons:

1) Python -- specifically CPython -- made it easy to wrap existing, thoroughly tested high performance libraries in Python APIs. So you got easy access to things like GSL and BLAS and LAPACK, but you got to call numpy.linalg.svd instead of GESDD (see the sketch after this list).

2) Python was a general purpose language, unlike R or MATLAB, so you could extend existing systems to do more without running into a wall.

3) Python was a heck of a lot less effort to use than C++.
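
A quick sketch of point 1 in action (assuming NumPy is installed):

    import numpy as np

    A = np.random.rand(5, 3)
    # NumPy dispatches this to LAPACK's GESDD routine under the hood.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)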

alanfranz 4 years ago

I don't think there's a master plan or a design idea about that.

There's an old saying that goes like "Python is the second best language for anything".

Python isn't the best for any kind of task; but you can do almost anything in any field with python and some libraries. It's reasonably easy for a non-programmer to use it.

I think my first experience with GPU programming was using CUDA with C (I think it was a kind of customized C, in the mid-2000s), so Python hasn't been there forever.

But if you need to do a bit of web scraping/input data manipulation, a bit of "offering a GUI" (e.g. a small web server that shows the data), a bit of matrix/vectorized operations, a bit of model training or even just inference... Python has everything, and everything is reasonably good. At least some of those operations would be cumbersome in other programming languages.

Try using R for general-purpose programming. Or Java for number crunching/matrix operations. They just suck.

Try finding the "greatest common divisor", functionality-wise, of the many tasks that you need in an ML system (as in many other systems), and you'll find Python.

The drawback is, IMHO, that it doesn't "scale" well. Python makes great proof of concepts and prototypes, but I'll always pick a different stack (possibly with multiple languages and technologies) if I want a long-running, maintainable production system.

bearly_legal 4 years ago

The same reason that Python is heavily used in scientific computing.

ML/AI/Scientists aren't systems people. They don't want to care about memory management/parallelization/etc. - they want to write perfect little mathematical poems which get executed on a perfect Turing machine.

Python is good at that. Thanks to the efforts of actual systems people, its libraries (numpy, scipy, etc.) run quick enough to be practical on a lot of workloads.

bbulkow 4 years ago

Another way to analyze the problem: what other language would it have been, given the moment ml hit?

You say "compared to other scripting languages". Let's list them.

Ruby: no numeric support.
Go: unnecessary typing, modest numeric support, shitty generics.
Bash: ha ha ha.
Scala, Java, C, C++: not scripting languages, complex.
Tcl, PHP: out of favor.
Rust: hadn't happened yet.
R: in-memory bias, not as simple.

Other languages were obscure or owned by monoliths (Kotlin, Swift, C#).

Python also has multiple implementations. A minor thing, but not really: PyPy keeps CPython on its toes.

C# really could be a contender. I am more productive in C# than in any other language except Python (although I think I will be more productive in Rust).

Python is, almost unarguably, the easiest language to code in right now, period. It has the greatest expressiveness and the simplest syntax. I use it for large scale open source art projects, and you can use it for AI.

Why are you asking?

  • specproc 4 years ago

    R is definitely a good language for quant work. In some ways it could have been the natural choice, and there are still places where it's a better choice than Python.

    It's just far too fragmented, only really good for numeric work (and thus harder to integrate with production systems), and full of the weirdest gotchas.

    https://www.burns-stat.com/pages/Tutor/R_inferno.pdf

    • mFixman 4 years ago

      R has a few advantages over Python for data science work, but Python has a big one: it's also widely used by software engineers who are not data scientists.

      I found it easy to jump from the software side of things to the data side of things because I already knew the quirks and tricks of Python. Having to learn a new language would have made this transition harder.

  • bbulkow 4 years ago

    I left out JavaScript. It is not a simple language, with its service heritage. It is bound up in the Node runtime in a way that doesn't really work right for data processing.

  • cgb223 4 years ago

    Non ML/AI coder here:

    Why does ML/AI work need to be written in a scripting language?

    Why can’t it be something like C++ etc instead?

    • cameldrv 4 years ago

      One reason is that you usually need to try a lot of things before you get something to work. Language productivity is at a premium. You really want some sort of interactive shell where you can do calculations and pull up plots etc. This used to be done with IPython, which evolved into Jupyter.

    • mynameisash 4 years ago

      It doesn't need to / you could theoretically do it in C++. It's just that Python (as with other scripting languages) provides really nice, high-level expressiveness and also has a decent module system. You can write code in the REPL or just write a quick-and-dirty script and test it out without write-compile-run cycles.

      NumPy is highly optimized for things like matrix math. You get great speed with the C-level module, and you drive it with really simple Python code. So you want to multiply two matrices? The code literally looks no different than multiplying two scalars. That's Python's superpower.

      I haven't written C++ in nearly 20 years; maybe it's good enough to be able to do ML work. But the heavy lifting library in C/C++ plus the high-level driving Python is a really good fit.

      • r-zip 4 years ago

        Well, it is slightly different than multiplying two scalars:

          c = a * b

        vs

          C = A @ B

    • r-zip 4 years ago

      As another commenter said, speed of experimentation is an important factor. Also, dynamic types are nice when you're dealing with exploratory data work. Combine that with the library ecosystem and Python's ease of use, and there you have it.

    • WithinReason 4 years ago

      Just one reason is that some Python libraries (numpy, tensorflow, pytorch) allow you to work with high dimensional arrays (3-4 dimensions) without for loops.

      ML also needs reverse autodifferentiation, which would be a real pain in C++.
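
      A minimal sketch of both points in PyTorch (assuming torch is installed):

          import torch

          # A 3-D array, manipulated without explicit for loops:
          x = torch.rand(4, 5, 6, requires_grad=True)
          loss = (x ** 2).sum()

          # Reverse-mode autodiff in one call:
          loss.backward()
          print(x.grad.shape)  # torch.Size([4, 5, 6])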

  • rak1507 4 years ago

    APL/array languages.

lokimedes 4 years ago

I can only provide anecdotal material, but back in 2007-2014 when I was a particle physics researcher, we saw a high uptake of python for steering data analysis jobs. The actual calculations were done in C++. Gradually over the years, as more students joined the LHC conquest, our tools evolved to allow more of the analyses to be directly programmed in Python. R was never a thing among the 10000+ physicists in our community. These people have since then drifted around the world working on Big Data, ML and recently Data Science. It’s hard to keep count, but I routinely recognize fellow particle physicists at various ML companies.

For the curious, our primary hammer was “ROOT” https://root.cern - note its well-evolved ability to connect Python and C++ code.

sbashyal 4 years ago

I have a historical perspective on this topic. Data Science popularity was rapidly growing and "R" was the lingua franca around 2010 - 2014.

I attended the Strata conference in 2014, and after visiting various technology exhibition booths there, I saw a common theme: tech companies were building data solutions using Python, as R was no good for the purpose.

In a meeting scheduled to share my takeaways from the conference, I predicted "Python will emerge to be the language of Data Science in a few years".

lordnacho 4 years ago

NumPy, SciPy, pandas, etc. are a way to use scripty syntax to write C++ code.

Under the hood you get the benefits of C++: stuff is dense in cache, operations are efficient.

But you can write it without a bunch of types, templates and allocators, which confuse people who aren't used to it. Most numeric code doesn't have a load of types anyway, it's just a few operations on some very large matrices.

Add to that the benefit that you can draw on Python's universe of libraries, which is quite large compared to rivals like MATLAB or R. Want to serve your model as a website? Jam it into Flask. Need a crypto lib to grab the data? No problem, just pip install it and import.
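
A minimal sketch of the Flask idea (the "model" here is a hypothetical stand-in, not a real one):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        x = request.get_json()["x"]
        return jsonify(prediction=2 * x)  # substitute a real model call here

    if __name__ == "__main__":
        app.run()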

holonomically 4 years ago

Python has always been used as a nice layer over various C libraries, so when ML started taking off and people started using GPUs to accelerate training and inference, it was a natural choice for the high-level code interfacing with the low-level GPU code.

There were some other DSLs that were being developed at the time but the ones that stuck were the Python ones. [1]

1: https://terralang.org/

ThePhysicist 4 years ago

Python's USP was and still is its ability to provide a simple & intuitive "glue layer" over lower-level libraries. Most of the performance-critical functionality that Python relies on for ML is written in C/C++/Fortran, and Python mostly provides the UI layer (this is an oversimplification of course).

Wrapper generators and compiler tools like Cython, and before that SWIG, made it very easy to glue existing functionality to Python, so together with Python's great usability and user-friendliness it created a killer combination for productive data science & ML.

That said, other languages could've pulled this off as well, Ruby for example. Python had more early traction in the scientific and high-performance computing communities, though, whereas Ruby was more popular in web development (due to Rails), which ultimately gave Python the edge and attracted more and more toolmakers to its ecosystem, which in turn spurred further growth. Great "IDEs" like the IPython/Jupyter notebook were also a key factor in Python's success, as they provided a super user-friendly UI for data scientists.

chubot 4 years ago

Because Python has NumPy, which implements vectorized math on arrays and matrices. Machine learning algorithms are implemented naturally and efficiently with those primitives. PyTorch, TensorFlow, and I think every other machine learning framework in Python all use NumPy.

JavaScript, Ruby, and Perl either don't have this abstraction at all, or they have much weaker versions of it, and many fewer scientific libraries.

NumPy started in the early 2000's and continues to this day. It takes decades to build up this infrastructure! This recent interview with NumPy creator Travis Oliphant is great:

https://www.youtube.com/watch?v=gFEE3w7F0ww

He talks about how there were competing abstractions like "Numeric" and another library, and his goal with NumPy was to unify them. And how there are still some open design issues / regrets.

There were multiple people in the nascent Python community who were tired of MATLAB, not just because it's proprietary, but because it's a weak and inefficient language for anything other than its scientific use cases. You won't have a good time trying to write a web app wrapper in MATLAB, for example.

The much more recent Julia language is also inspired positively and negatively by MATLAB, and is very suitable for machine learning, though it doesn't have the decades of libraries that Python has.

-----

The NumPy extension was in turn enabled by operator overloading in Python (which is actually a very C++ influenced mechanism). JavaScript doesn't have operator overloading; I'm pretty sure Perl doesn't, but not sure about Ruby. Lua and Tcl do not have it. (Lua does have a machine learning framework though -- http://torch.ch/ -- but I think PyTorch is more popular now.)

So if Guido hadn't designed Python with operator overloading, then NumPy would not have grown out of it.

Also relevant is Guy Steele's famous talk Growing a Language (late 90's or early 2000's I think). He advocates for operator overloading in Java so end users can evolve the language with their domain expertise! Well, Java never got it, and Python ended up having the capabilities to grow linear algebra.

Guido has even said he doesn't really use or even "get" NumPy! So it turns out that an extensible design does have the benefits that Steele suggested (although it's a very difficult language design problem.) There have been several enhancements to Python driven by the NumPy community, like slicing syntax and semantics and the @ matrix multiplication operator. And I think many parts of the C API like buffers.
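
A toy sketch of the mechanism (illustrative only; this is not how NumPy is implemented):

    class Vec:
        def __init__(self, xs):
            self.xs = list(xs)

        def __add__(self, other):      # enables v + w
            return Vec(a + b for a, b in zip(self.xs, other.xs))

        def __matmul__(self, other):   # enables v @ w (a dot product here)
            return sum(a * b for a, b in zip(self.xs, other.xs))

    print((Vec([1, 2]) + Vec([3, 4])).xs)  # [4, 6]
    print(Vec([1, 2]) @ Vec([3, 4]))       # 11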

-----

Another interesting thing from Oliphant's interview is that he really liked that Python has complex numbers. (I don't think any of JavaScript, Ruby, Perl, or Lua have them in the core, which is important.) That piqued his interest and kicked off a few decades of hacking on Python.

He was an electrical engineering Ph.D. student and professor, and complex numbers are ubiquitous in that domain. Example:

    $ python3 -c 'print(3j * 2 + 1)'
    (1+6j)

This is another simple type built on Python's extensible core, and it's short.

    Python-3.9.4$ wc -l Objects/complexobject.c
    1125 Objects/complexobject.c

I recommend writing a Python extension in C if you want to see how it works. See Modules/xx*.c in the Python source code for some templates / examples. IMO the Python source code is a lot more approachable than Perl, Ruby, or any JS engine I've looked at.

  • chubot 4 years ago

    I should also add a cultural / social reason why Python is used in scientific computing and machine learning much more than JS/Ruby/Perl:

    Python was the only one of those languages (partially) funded by government research agencies. Guido was a research programmer in the Netherlands at CWI, and then he moved to the US when he was hired by CNRI, a research agency headed by Bob Kahn (loosely connected with DARPA as far as I remember).

    If you look at the backgrounds of Brendan Eich, Matz, and Larry Wall (creators of JS, Ruby, and Perl), they are quite different. None of them really worked in a research setting, and they certainly didn't develop their language in a research setting.

    https://en.wikipedia.org/wiki/History_of_Python

    3 hour oral history with Guido: https://www.youtube.com/watch?v=Pzkdci2HDpU&t=12s

    Lex Fridman interview with Guido: https://www.youtube.com/watch?v=ghwaIiE3Nd8

    Lua was developed in a research setting, funded partially by Brazilian oil companies as far as I remember, but I don't think it ever had a "scientific computing" focus. It was picked up more in games and apps due to the C embeddability and features like coroutines. The ML framework Torch was built on LuaJIT because it has math nearly as fast as C. But I think the language Lua is less suited toward linear algebra, again due to the lack of operator overloading.

    Not to mention that Lua doesn't even have separate ints and floats! This is also an issue with using JavaScript for scientific computing.

    • kragen 4 years ago

      Perl was originally written at JPL, which is the epitome of government-funded research, and for the first many years of its life most of its numerous contributors were at one or another government-funded research institution, because people who weren't didn't have internet access.

      Lua does support operator overloading:

          $ luajit
          LuaJIT 2.1.0-beta3 -- Copyright (C) 2005-2017 Mike Pall. http://luajit.org/
          JIT: ON SSE2 SSE3 SSE4.1 AMD fold cse dce fwd dse narrow loop abc sink fuse
          > x = setmetatable({}, {__add=function() return 37 end})
          > print(x+5)
          37
      
      Not sure if that was true 20 years ago.

      • chubot 4 years ago

        Hm interesting, it's hard to find references to Wall working at JPL, but here's a very non-authoritative one:

        https://old.reddit.com/r/perl/comments/5lj9ms/did_larry_wall...

        Wikipedia doesn't mention it:

        https://en.wikipedia.org/wiki/Larry_Wall

        https://en.wikipedia.org/wiki/Perl#Early_versions

        That does sound right, since I vaguely recall an interview with Wall talking about JPL.

        -----

        I think there's still a difference because Python was literally funded as a research project by CNRI, a government research institution. It wasn't created there, and it was funded by different entities afterward, but I think that's the period when contributors with a scientific background like Travis Oliphant, Jim Hugunin, and David Beazley started working on Python's libraries and infrastructure.

        At best it seems like Wall worked at JPL for a short time and started Perl there. It also matters what kind of research it was. Perl is aimed much more at text processing and not linear algebra, while Python is more general purpose in this respect.

        Also, if my memory is right, by early 2000's JPL had jobs in Python, and python.org said JPL was a user. I could be wrong but I don't think Perl ever caught on as much as Python did at JPL.

        -----

        Yes good point about Lua's metatable mechanism.

        • kragen 4 years ago

          I think Larry was at JPL from before the first version of Perl in 01986; his job before that was evidently at Unisys, maybe starting around 01979: http://web.archive.org/web/19970626151153/https://www1.acm.o... https://spu.edu/depts/uc/response/spr2k/wall.html He was still at JPL in 01991; I suspect but am not sure that he kept working at JPL even after changing his email address and official affiliation to NetLabs.

          Rich $alz reposted Perl in 01988, bearing a "1987" copyright date, but I think the first version really was released in 01986; at any rate by this point Larry was definitely at JPL: https://www.tuhs.org/Usenet/comp.sources.unix/1988-February/...

          Unfortunately Google has decided to remove the ability to view source from Google Groups, so we can't see the return-path for https://groups.google.com/g/comp.lang.perl/c/t4RumjajsXA/m/7..., one of the earliest messages he posted from netlabs.com, so we can't see what NNTP server he was using at the time. (I guess that's what we get for letting Google take the responsibility for "making information universally accessible": we have no recourse when they decide that means making previously public information inaccessible.)

          The Wikipedia article says he released the first version of Perl when he was still at Unisys, citing the first edition of Programming Perl, which I don't have. The second edition (01996) is silent on the question. It also cites https://www.oreilly.com/pub/au/148, which does say that and presumably is at least subject to Larry's veto.

          So, at any rate, JPL was funding Perl development from at least 01987 to 01991, four years, almost the same amount of time that Guido van Rossum was working at CNRI, 01995 to 02000. But you're probably right that Guido had to write grants and write progress reports on Python, and Larry didn't on Perl; he had the discretion to just do it. Also, I suspect this was true when Guido was at CWI too, as you say. AFAICT the only research paper Guido published at CNRI that was about Python was the CP4E paper.

          I don't think it's accurate to say, "Perl is aimed much more at text processing and not linear algebra, while Python is more general purpose in this respect." PDL/perldl is from 01996, just the year after Numeric in 01995, and Perl 5 offers pretty much exactly the same set of facilities as Python for this sort of thing (dynamically loadable language extensions, operator overloading, dynamic typing --- though I guess at least Python's indexing syntax is more comfortable, because x[i:j, ..., 3] is a valid Python expression that preserves all that indexing structure, and has been since at least 1.5.2).

          If memory serves, PDL was a lot better at 3-D plotting than Numeric was when I first tried it in about 01998; it could pop up a rotatable 3-D plot in an X window, and Numeric couldn't.

          I think what happens is that a lot of people started working on Numeric (including those you mention --- although keep in mind dabeaz was also working on Perl's libraries and infrastructure!) and so it started getting better faster than PDL. Part of this was that Python is just a more pleasant, less clumsy language, so people chose it when Perl didn't have a killer advantage, like native mutable strings or a relevant CPAN module. But that's not about Python being well-suited for linear algebra; its only linear-algebra-specific feature is Python 3's @.

          I think it would be very hard to find anyplace in 01995 or later that had Unix systems and didn't have huge piles of Perl. You're probably the person in the world most aware of the shortcomings of the primary alternative in the early 01990s (ksh/pdksh/bash). But it's also true that lots of Perl-heavy sysadmin shops never got into writing "real programs" in Perl like Slic3r, Perlbal, Movable Type, and Frozen Bubble, and so Perl didn't show up in their job descriptions. And nowadays most of those huge piles of Perl are sort of regrettable, and regretted.

          • chubot 4 years ago

            OK I went down a rabbit hole and also realized how terrible Usenet search is. I tried to Google for Usenet search engines and ended up with mostly spam :-( Seems like Hacker News is sadly a better search for these kinds of things.

            Anyway I agree with most of what you say, EXCEPT I think Perl's focus on text vs. Python's more general purpose focus can be seen from the creators' very early release announcements!

            One thing I've realized while working on shell is that the "bones" of a language are set in stone VERY early. Then comes 10-50 years of compatible changes that necessarily must obey many early design decisions.

            Also I'm not saying the focus on text is bad -- in fact a big motivation for Oil is that Python is not convenient enough for text processing :) (and that it's awkward for process-based concurrency)

            Perhaps my experience with Oil shows me all the stuff I'm NOT doing to support scientific computing. Even basic stuff like exponentiation x^0.2 is a huge bit of code, as well as scanning and printing floating point numbers, all of which shells lack. Oil should have proper floats but not in the initial versions. (Early in the project I also thought it would have efficient homogeneous vectors, before understanding why Python settles on heterogeneous lists and punts the rest to extensions)

            From your link:

               Perl is a interpreted language optimized for scanning arbitrary text
               files, extracting information from those text files, and printing
               reports based on that information.  It's also a good language for many
               system management tasks.  The language is intended to be practical
               (easy to use, efficient, complete) rather than beautiful (tiny,
               elegant, minimal).
            
            
            https://github.com/smontanaro/python-0.9.1 (I think this is from 1990 or so)

               This is Python, an extensible interpreted programming language that
               combines remarkable power with very clear syntax.
            
               This is version 0.9 (the first beta release), patchlevel 1.
            
               Python can be used instead of shell, Awk or Perl scripts, to write
               prototypes of real applications, or as an extension language of large
               systems, you name it.
            
            
            This is very revealing! And prescient! The intent of the creators does seem largely borne out. Perl was extended to more domains but I'd argue that the "bones" prevented it from evolving as Python did.

            optimized for scanning arbitrary text files, extracting information from those text files

            to write prototypes of real applications, or as an extension language of large systems, you name it.

            • kragen 4 years ago

              > Anyway I agree with most of what you say, EXCEPT I think Perl's focus on text vs. Python's more general purpose focus can be seen from the creators' very early release announcements!

              Oh, I agree with that part, too; Perl's growth into a general-purpose language was very uncomfortable and surprising. I just think they were about equally terrible at linear algebra to begin with.

              What would make a language good at linear algebra? I think you'd want, as you say, efficient homogeneous vectors, and also multidimensional arrays (or at least two-dimensional), non-copying array slicing, different precisions of floating-point numbers, comma-free numerical vector syntax (maybe even delimiter-free, like APL), zero division that produces NaNs instead of halting execution, control over rounding modes, arguably 1-based indexing, plotting, and infix operators that either natively have linear-algebra semantics or are abundant and overrideable enough to have them. Python didn't have any of those built in, and a lot of them can't be added with pure-Python code.

              You'd also want flexible indexing syntax (that either does the right linear-algebra thing by default or can be overridden to do so), complex numbers, infix syntax for exponentiation, and a full math library (with things like erf, gamma, log1p, arcsinh, Chebyshev coefficients, and Bessel functions, not just log, exp, sin, cos, tan, atan2, and the like). Python 0.9.1 evidently didn't have any of those (you can do x[2:] or x[:5] but even x[2, 5] is a syntax error), but they were mostly all added pretty early, though its standard math library is still a bit anemic. Like Perl, though, the first version of Python did have arrays and floating-point support (arithmetic, reading, printing, formatting, serializing) from very early on; unlike Perl before Perl 5, its arrays were nestable. (Perl 5, in 01994, also added a slightly simplified version of Python's module and class systems to Perl. I forget if "use overload" was already in there, but it seems to be documented in the 01996 edition of the Camel Book, so I guess it was in Perl 5 from pretty early versions.)

              Numeric and Numpy added most of these things to Python, and IPython, Matplotlib, and SciPy added most of the others. Adding them to Perl 5 would have been about the same amount of work and would have worked about as well, but the people who were doing the work chose to do it in Python instead. It isn't the choice I would have made at the time, but I'm glad they had better technical judgment than I did.

              Nowadays, for a language to be good at linear algebra, you'd probably also want automatic differentiation, JIT compilation, efficient manycore parallelization, GPGPU support, and some kind of support for Observablehq-style reactivity. Julia fulfills most of these but they're hard to retrofit to CPython.

              A shell is sort of an "orchestration language", in the sense that a shell script tells how to coordinate fairly large-grained chunks of computation to achieve some desired effect. We've seen an explosion of such things in the last ten or fifteen years: Dockerfiles, Vagrant, Puppet, Chef, Apache SPARK, Terraform, Nix, Ansible, etc. Most of these are pretty limited, so there's a lot of duplication of functionality between them. And most of them don't really incorporate failure handling explicitly, but failures are unavoidable for the kinds of large computations that most need orchestration of large-grained chunks of computation. I wonder if this situation is optimal.

  • dragonwriter 4 years ago

    > JavaScript, Ruby, and Perl either don't have this abstraction at all, or they have much weaker versions of it, and many fewer scientific libraries.

    I don't know that Numo for Ruby is "much weaker" than NumPy. It looks like installation is rougher since it doesn't bundle dependencies, and it's newer, so there is less downstream ecosystem.

    > JavaScript doesn't have operator overloading; I'm pretty sure Perl doesn't, but not sure about Ruby

    Ruby and Perl both have operator overloading. (Perl has “use overload”, and in Ruby operators are defined via overridable methods.)

  • oezi 4 years ago

    Ruby does have operator overloading.

    And it is kind of sad to me that Python is so much more popular than it, even though Ruby has a much cleaner object-oriented foundation. Not to speak of underscores...

    • chubot 4 years ago

      Ruby is arguably more OOP than Python, but I'd claim that doesn't help much in the scientific programming / machine learning use case. It might even hurt a little.

      This kind of code is naturally expressed in "functions and data" rather than "objects" (data being vectors, matrices, etc.).

      And I say this as someone who uses objects in most of my code! (which is not scientific code)

z3phyr 4 years ago

At the time the ML craze hit, Python was already very popular as a beginner's language and had good numeric libraries.

Many people in STEM fields without any programming background had their first taste of programming with Python. And it caught on.

Also, the real stuff is probably written in C/C++/CUDA/ASM. It's only the interface that is Python (because of its inertial popularity).

powersnail 4 years ago

In most ML/AI research, Python is not even used as a "glue" language.

It is used as a shell. It's merely an interface to some gigantic, highly optimized libraries (numpy, scipy, and later, Tensorflow, Pytorch, etc.), and it does a very decent job at being an interface.

- The language is easy to grasp, at least the part that is used in data science and ML;

- The syntax is "familiar", as compared with R;

- There are many more general purpose libraries in Python than in R;

- There are no memory management problems;

- The standard library is packed with batteries;

- No compiling, which is important for being a shell;

- It's better than bash etc. at dealing with non-text data, especially numerical values;

- The community was already writing extensions in C;

Some other language could have worked well too, had someone written a NumPy for it at the time. But there really aren't that many people who are capable, interested, and invested enough to write such a marvelous library.

habibur 4 years ago

This happened when MIT switched from Scheme to Python some time in the '00s. Python's adoption increased further in the scientific community, and here we are.

  • auntienomen 4 years ago

    The scientific community was already making heavy use of Python by the time that happened. I suspect MIT switched to Python in large part because it was what the STEM faculty there were actually using.

oivey 4 years ago

Python had basically already won numerical computing thanks to NumPy, SciPy, and Matplotlib before ML really blew up. The other two serious contenders were R and Matlab. Python is a much better general purpose language than either of those, and Matlab is proprietary.

rg111 4 years ago

1. It's extremely easy. Before the so-called revolution, with CS people trying to get into the field in droves, it was a niche topic dominated by lifelong-researcher types. They could not be bothered with complex code; writing code should not get in the way. Back then, Lisp dominated the ML/AI scene. Now Python does, for this reason, to some extent. Python being easy also helps non-CS engineering and other science grads learn quickly.

2. Python has a huge ecosystem: NumPy, SciPy, and now TensorFlow, PyTorch, JAX. These make life easier.

3. Python and its ecosystem are FOSS. Students and hobbyists can learn it for free. (Quick anecdote: my uni in India, a very reputed non-IIT one with sub-optimal funding, switched two years ago to Python + its ecosystem for Physics and CS courses, both major and minor. This switch happened directly from C. Before that, Fortran was used. MATLAB, SPSS, etc. were never an option for cash-starved Indian unis. This is pretty much the same all across India, and thus you get a huge talent pool already trained in Python coming out of hard-to-get-into unis.)

4. Python being general purpose also helps vis-a-vis R. R is heavily constrained; you cannot do much in it. R is used in analysis and Data Science. I have never seen it being used in ML, DL, or RL. If you learn Python, you can do non-trivial file manipulation in it. Good luck doing that with R or MATLAB.

5. The number of people who need to write code that reaches the metal is very small. I never needed to look under the hood. I spend my life writing PyTorch, fastai, and TFLite. A friend of mine doing a PhD needed to write custom CUDA code and then a wrapper so that it could be accessed from Python. He said it was a very horrible experience. But the number of such people is too small to bring Julia into the mainstream. Julia removes the "two-language problem", but most people never need to use anything besides Python.

snicker7 4 years ago

It comes down to timing, really. Just like most technology fads.

Python is interpreted/dynamic, open source, general-purpose, relatively popular, and is easy to write low-overhead wrappers to C/C++/Fortran libraries. In 2008-2010, when ML took off, Python was the only language with these properties.

Python, however, has its problems: an atrocious concurrency story (GIL, colored coroutines, the asyncio vs trio rift, fork is inefficient thanks to GC, etc.), poor composability (especially in scientific computing), and broken package management. The language is also inherently slow (unfixable). It will be replaced by something else eventually.

threeseed 4 years ago

I would put it mostly down to Spark.

Originally it was only available in Scala/Java, but then they added Python support courtesy of Py4J. And since Python was massively simpler than Scala, it exploded in popularity, very quickly becoming the default language.

So then you had Data Scientists who were already writing a lot of data transformations in Spark looking around at the rest of the Python ecosystem finding libraries like pandas, IDEs like Jupyter and basically staying there since it was so much easier than alternatives.

Their interests aren't really in computer science and so they look for whatever language can get them to an outcome as quickly and easily as possible. Even if it's not the most optimal, elegant or maintainable.

  • pedrosorio 4 years ago

    Spark started getting industry adoption in ~2013-2014 when it became an Apache project.

    The roots of Python as a language used for numerical/scientific/data science use cases are much older than that, with numpy and scipy back in the '90s and early 2000s, followed by pandas and scikit-learn in the late 2000s.

  • kragen 4 years ago

    By the time SPARK was born (02010?) Python had already eclipsed the non-JS alternatives (Scheme, Perl, Tcl, Ruby, awk, BASIC, Lush). I'd put the crossover point around 02002. IPython notebooks and pandas came even later than SPARK.

  • rcarmo 4 years ago

    PySpark certainly helped kill Hadoop for data munging, but I would say it only really got going in 2014.

blunte 4 years ago

Because a lot of the early development in these areas was done by mathematicians and physicists who weren’t programmers (and who had less exposure to languages). These are folks who just wanted an answer to a question or a premise, and the elegance of the path that took them to the answer was utterly insignificant.

In some cases you might see a 3000 line python script with no defined functions… just loops and conditionals and lots of copy-pasted code with small variations in each section.

It’s really a shame, since there are so many more elegant languages which are equally or more powerful. But python is not a terrible language… it’s just an everyman get-shit-done language. We could be worse off.

nijave 4 years ago

I suspect some of the popularity came from IT and engineering departments preferring to run Python compared to, say, Matlab, Excel, or some other GUI based application not designed to run on servers. R also has a lot of adoption but I think Python is a little more natural to run since so many infrastructure tools are written in Python

It also works fairly well cross platform. That means you can develop on Windows and run on Linux without too many issues (at least for ML/AI stuff, common frameworks usually have per platform binaries published)

i.e. Python is familiar to the people supporting production systems

ksec 4 years ago

>I bet there's some interesting history here.

ML/AI grew out of Data Science, and so it was built on the same Python foundation.

As for why Python for Data Science: one needs to remember that most people doing Data Science, or any Matlab type of work, do not consider themselves programmers. They don't want to learn 20 reasons why functional programming or object-oriented programming is better, plus hundreds of other best practices and thousands of tricks for writing the same program.

Although I do wonder if Julia may have a chance to dethrone it in the next 10 years.

rcarmo 4 years ago

I saw Fortran being wrapped in it because it could be compiled down into libraries with C-compatible ABIs, and Python can load C libraries directly without fuss.

A lot of NumPy and SciPy ensued, and the rest is history.

xvedejas 4 years ago

My narrow perspective is that Python is the one language that both companies engaged in ML research (like Google) have been using, and also a very common language of instruction, for instance in the CS program at the school where I studied. If you were a CS student interested in ML/AI, starting school ten to fifteen years ago, but not particularly interested in software engineering, you'd be able to get by with only really knowing Python. Depending on how widespread this is, I'm guessing it's a part of the picture.

jandrewrogers 4 years ago

Python became the de facto glue language for supercomputing a very long time ago because you could easily bind C code into it. If you needed linear algebra etc to run on a massive supercomputer, there were highly optimized Python libraries for that so the researcher didn’t have to write C/C++/Fortran. This massively improved iteration times for a lot of scientific computing efforts with only a modest loss of performance. By the time data science/ML/AI/etc became a thing these tools were already very mature and also relevant.

The tl;dr: Python had the advantage of a mature legacy in supercomputing doing many of the same types of computations done in AI/ML. Those libraries and bindings provided a massive leg up versus other scripting languages that did not have this kind of capability effectively built-in.

peter_retief 4 years ago

Python is the default in many fields besides ML; however, ML has its own very efficient languages. I recently discovered Octave through a course I attended. Really worth having a look: https://www.gnu.org/software/octave/index is mostly compatible with Matlab.

morelandjs 4 years ago

I think there was a strong cohort of scientists using Matlab, and Python and NumPy adopt the same language conventions. Going from Matlab to Python is effortless.

Moreover, scientists are typically so-so programmers, so not having to worry about complexities like dereferencing pointers, specifying types, etc. makes the language much easier to pick up.

marto1 4 years ago

At least in academic circles, Python has always looked like executable pseudocode for C (or similar), so everyone has "used it" at some point or another to describe algorithms and such. The same academic circles do a lot of ML/AI research, so Python had a natural advantage.

karmasimida 4 years ago

Python was already the lingua franca before the whole deep learning/AI thing. It has numpy/scipy/pandas/scikit-learn, etc. And when did numpy happen? It was in 1996.

Arguably its biggest competitor then was R, but R is not well accepted by programmers. Yet another alternative is Matlab, but OMG, using Matlab for anything string-related is killing me.

While there is some history to it, that Python won in the end isn't a surprise to anyone. It is simple but not toyish for real-world systems. I work at one of the big techs, and Python is running the production workload for most AI services just fine.

I take a LOT of issue with dynamic typing, but for ML/AI you are going to write a lot of ad-hoc data wrangling code, and sometimes even Python feels verbose.

TL;DR: It had already won.

  • aasasd 4 years ago

    Yup, it was already everywhere before ‘data science’ was a thing. Turns out, people in actual fundamental science want to slap calculations together fast instead of fussing with memory and whatnot.

    I have a friend who writes C or C++ (can't remember which) for clusters processing data from particle accelerators, but he will still reach for Python when he wants anything simpler than that—and, I guess, less interactive than Matlab.
