A Python Compiler for Big Data

continuum.io

156 points by bluemoon 13 years ago · 35 comments

ezl 13 years ago

I just want to point this out because I feel like there's a good chance a lot of people won't have gotten this far:

Because our implementation does not explicitly depend on Python we are able to overcome many of the shortcomings of the Python runtime such as running without the GIL and utilising real threads to dispatch custom Numba kernels running at near C speed without the performance limitations of Python.

  • freyrs3 13 years ago

    Yes, using Numba we can just-in-time compile numeric Python logic straight down to machine code, so naturally we can achieve some pretty impressive numbers on kernel execution.

    In case many people didn't reach the bottom, here are the links to the repo and the docs. The project is still in its early stages, but it is public and released under a BSD license.

    * http://blaze.pydata.org/docs/

    * https://github.com/ContinuumIO/blaze
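    To make the "near C speed" claim concrete, here is a minimal sketch (my illustration, not project code) of the kind of kernel Numba targets: plain Python scalar arithmetic in a tight loop, which its @jit decorator can compile to machine code.

```python
# Hypothetical example of a Numba-friendly kernel: scalar arithmetic in
# a tight loop. With Numba installed you would decorate it, e.g.
#
#     from numba import jit
#
#     @jit
#     def sum_sq_diff(xs, ys): ...
#
# The undecorated pure-Python version below has the same semantics.
def sum_sq_diff(xs, ys):
    """Sum of squared differences between two equal-length sequences."""
    total = 0.0
    for a, b in zip(xs, ys):
        d = a - b
        total += d * d
    return total

print(sum_sq_diff([1.0, 2.0, 3.0], [1.0, 4.0, 3.0]))  # 4.0
```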

rpm4321 13 years ago

Bit of a tangent, but I'm wondering if anyone here has had any luck with Cython?

I'm starting to run into some performance bottlenecks with Python, and so I'm just now looking at Cython, PyPy, Psyco, and... gasp... C.

From what little I've read, Cython is supposed to be as easy as adding some typing and modifying a few loops here and there, and you are in business.

  • IanOzsvald 13 years ago

    I taught High Performance Python covering the tools you mention at PyCon 2012 (and EuroPython last year), maybe my videos+write-up will be helpful. I also cover profiling, shedskin, pyCUDA etc:

    http://ianozsvald.com/2012/03/18/high-performance-python-1-f...

  • Erwin 13 years ago

    Depends on your application. Ideally you want to change your code so that as much computation as possible happens in pure C code on pure C data types (using Cython). If you have a big class tree with many callbacks and work spread over hundreds of methods, that can be difficult.

    Before you go that far, I'd recommend making sure you know all the Python gotchas (for example, maybe you have some inner loop that does for x in range(100000) all the time) and that your algorithms are in order. Sometimes even silly micro-optimization can make a difference if a small function accounts for a significant amount of your runtime. Using multiple processes with e.g. the multiprocessing module can be an option too.

    Depending on what data types you operate on, numpy (and now this new thing) can do some amazing things.

    PS: check things like http://packages.python.org/line_profiler/ beyond the ordinary profiling.
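    A hypothetical example of the kind of gotcha described above: rebuilding an invariant value inside a hot loop rather than hoisting it out.

```python
# Hypothetical illustration of a common Python gotcha: reconstructing an
# invariant object inside a hot loop instead of hoisting it out.
def count_vowels_slow(words):
    n = 0
    for w in words:
        for ch in w:
            if ch in set("aeiou"):   # set rebuilt for every character
                n += 1
    return n

def count_vowels_fast(words):
    vowels = set("aeiou")            # hoisted: built once
    n = 0
    for w in words:
        for ch in w:
            if ch in vowels:
                n += 1
    return n

words = ["banana", "cherry"]
print(count_vowels_slow(words), count_vowels_fast(words))  # 4 4
```

    Same result, but the fast version avoids an allocation per character; a profiler (or line_profiler, as above) makes this kind of hotspot obvious.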

  • travisoliphant 13 years ago

    Cython is helpful, but you have to spell out a lot of type information that is not strictly necessary. You might also try Numba; the easiest way to get it is via Anaconda CE or Wakari. Both at http://continuum.io

  • chadillac83 13 years ago

    Lua might be good for this too; it's pretty fast on its own, and from what I've read (not tested, mind you) its C API is supposed to be pretty great.

    http://benchmarksgame.alioth.debian.org/u32/which-programs-a...

    http://en.wikipedia.org/wiki/Lua_(programming_language)#C_AP...

  • lrem 13 years ago

    Cython is good, but sometimes it's a bit tricky to bend it to do exactly what you want[1]. You'll probably still want to write that hot piece in C... But gluing it with Cython is IMHO much nicer than using the plain Python API.

    [1] - on the other hand, it comes with a tool showing exactly how each of your lines of Cython looks in the resulting C, with color-coding for a high-level overview of which pieces translated smoothly

  • frozenport 13 years ago

    I would go with C/C++ as the ways to address performance are well studied. There are many tools out there like callgrind or nvvp that will make it pain-free.

    I can narrow down performance problems in C/C++ quite quickly, but neither I nor anybody I know has done much of this for Python. Many people I work with consider a Python implementation a prototype, while Fortran/C/C++ is mature, real code worthy of attention.

    The only real downside is that C/C++ requires a little knowledge of POSIX/Linux or Windows. That represents a learning curve, but once you are over it, the skills are quite durable and long-lasting.

greenonion 13 years ago

So, is there anyone using Python for machine learning in production systems (i.e. not just for prototyping)? I would love to do it, but it seems Java/Mahout is a safer choice, performance-wise.

I wonder whether Blaze is a step towards that direction.

  • law 13 years ago

    I use Python for nearly all of my ETL processes that involve text processing. Even in production systems, I'd be hard-pressed to admit any significant performance issues. Python facilitates implementing algorithms in a functional style, which I tend to prefer over the imperative style (i.e., Java). With C++11 and boost, I'm able to translate my Python code to C++ while preserving the functional style, which has immensely simplified prototyping/deploying NLP/ML algorithms while simultaneously begetting enormous performance gains. I see Python as an extremely viable alternative to Java.
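    As a sketch of what that functional style can look like in Python (my example, not law's code): a small map/filter/reduce token-counting step with no mutable shared state, which translates fairly directly to C++11 lambdas and algorithms.

```python
# Hypothetical sketch of a functional-style text-processing (ETL) step:
# a map -> filter -> reduce pipeline with no mutable shared state.
from functools import reduce

def tokenize(line):
    return line.lower().split()

def count_tokens(lines, stopwords):
    token_lists = map(tokenize, lines)                              # map
    kept = (t for toks in token_lists
            for t in toks if t not in stopwords)                    # filter
    return reduce(lambda counts, t: {**counts, t: counts.get(t, 0) + 1},
                  kept, {})                                         # reduce

print(count_tokens(["The cat sat", "the mat"], stopwords={"the"}))
# {'cat': 1, 'sat': 1, 'mat': 1}
```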

    • greenonion 13 years ago

      You got me a bit confused here. If I understand correctly what you're saying, you're still using Python for prototyping the core algorithms and C++ in actual production systems. I'm not saying Python is not good for production systems in general; I'm wondering whether it is good enough for real-world implementations of machine learning algorithms.

      Also, I believe most people would consider Java as an alternative to C++, hence all the Java-based Apache projects, such as Mahout, Solr etc.

      • law 13 years ago

        I use Python in production for text pre-processing and other ETL-related processes, which is part of a larger reinforcement learning approach. Additionally, I use Python to prototype the core ML algorithms, which I sometimes re-implement in C++. However, for many of those algorithms, numpy actually performs identically to BLAS in C++.
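        To illustrate the numpy/BLAS point (my sketch, not law's code): numpy hands dense linear algebra to the BLAS it was built against, so a matrix product written in Python and one calling BLAS from C++ do the same underlying work.

```python
# numpy delegates np.dot on float arrays to the underlying BLAS (dgemm),
# so this Python expression and a C++ BLAS call perform the same
# machine-level computation.
import numpy as np

a = np.arange(6.0).reshape(2, 3)   # 2x3 matrix
b = np.arange(6.0).reshape(3, 2)   # 3x2 matrix
c = np.dot(a, b)                   # BLAS dgemm under the hood

print(c.tolist())                  # [[10.0, 13.0], [28.0, 40.0]]
```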

        • greenonion 13 years ago

          I get it now, thanks. It's very interesting, maybe I will give Python for ML a chance!

    • cmccabe 13 years ago

      Have you tried Scala? It might let you write in a functional style and then not have to translate it to something else. Please don't interpret this as a troll; I'm genuinely curious what the pros/cons of these approaches are.

      • law 13 years ago

        I've never tried Scala, but I suppose I should give it a chance. I'm a fan of Lisp, and the two languages seem to have a lot in common. Scala's expressive type system seems like it has the potential to be both a blessing and a curse, but admittedly, I know next to nothing about the language.

        • peatmoss 13 years ago

          I may be missing something here, but if you're a fan of lisp and want easy interaction with libraries on the JVM, please tell me you've heard of Clojure. It's a modern lisp that strongly favors functional programming, and that has great concurrency support. Plus, there is already a data analysis / statistical platform built on top of it called Incanter.

  • dwiel 13 years ago

    We also use python in production at plotwatt for machine learning. We started by prototyping in matlab and then porting to c++, but have since found it much much easier to just do everything in python and numpy. When speed was an issue, we slightly changed the way we implemented the algorithm rather than implement the same algorithm in a faster language. Admittedly this isn't always possible.
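    A hypothetical example of the kind of reshaping dwiel describes: computing a moving average with whole-array numpy operations instead of a Python-level loop over windows, rather than porting the loop to C++.

```python
# Hypothetical example of reshaping an algorithm for speed: replace a
# Python-level loop over windows with whole-array numpy operations.
import numpy as np

def moving_average_loop(x, w):
    # straightforward, but does O(n*w) work in the interpreter
    return [sum(x[i:i + w]) / w for i in range(len(x) - w + 1)]

def moving_average_numpy(x, w):
    # cumulative-sum trick: one vectorized pass, no Python loop
    c = np.cumsum(np.concatenate(([0.0], x)))
    return (c[w:] - c[:-w]) / w

x = [1.0, 2.0, 3.0, 4.0]
print(moving_average_loop(x, 2))            # [1.5, 2.5, 3.5]
print(moving_average_numpy(x, 2).tolist())  # [1.5, 2.5, 3.5]
```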

davidf18 13 years ago

It would be great to eventually have a GPU version as well (as in the cases of Matlab and R). I saw a brief demo of Matlab on a Mac Retina Pro 15 where the GPU version ran 30x faster than the CPU version.

Caligula 13 years ago

I read about Continuum after the fellow who developed numpy left a few days ago to work on Continuum. I am curious to see actual projects using Continuum, or some sort of writeups.

andrewcooke 13 years ago

how does this compare to theano? it seems like some of the ideas are similar?

http://deeplearning.net/software/theano/

in general, i like (ie i don't see a better solution than) the idea of having an AST constructed via an embedded language that is implemented by a library. but it does have downsides - integration with other python features is going to be much more limited (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).

are there more details? i guess the AST is fed to something that does the work. and that something will have an API and be replaceable. but is that something also composable? does it have, say, a part related to moving data and another to evaluating data? so that you can combine "distributed across local machines" with "evaluate on GPU"?

  • freyrs3 13 years ago

    > how does this compare to theano? it seems like some of the ideas are similar?

    It's quite similar; we just take some of the ideas further and try to generalize the data storage to include storage backends that data scientists use more frequently (e.g. SQL, CSV, S3, etc.). We're very friendly with the Theano developers and hope to bridge the projects with a compatibility layer at some point.

    > (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).

    I would argue that's what makes Python a great numeric language, and NumPy so successful. You get a high-level language where you can express domain knowledge, but also a near 1:1 mapping to fast code execution at the C level. Blaze is the continuation of that vision.

    > i guess the AST is fed to something that does the work. and that something will have an API and be replaceable.

    Precisely. We build up an intermediate form called ATerm out of the constructed expression objects, do type inference and graph rewriting, and then pattern-match our layout, metadata, and type information against a number of backends to find the best one to perform the execution. Or, if needed, we build a custom kernel with Numba informed by all the type and data-layout information we've inferred from the graph.

    We don't aim to solve all the subproblems in this area (expression optimization passes, distributed scheduling), but I think we have a robust enough system that others can build extensions to Blaze to do expression evaluation in whatever fashion they like.
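    A toy sketch of that overall shape (mine, not Blaze's actual machinery): overloaded operators build an expression tree without computing anything, and a trivial "backend" then pattern-matches the nodes to evaluate it.

```python
# Toy sketch (not Blaze's real internals): deferred expression objects
# record an AST, which a backend later pattern-matches and executes.
class Expr:
    def __add__(self, other):
        return Op("add", self, other)
    def __mul__(self, other):
        return Op("mul", self, other)

class Term(Expr):
    def __init__(self, value):
        self.value = value

class Op(Expr):
    def __init__(self, name, left, right):
        self.name, self.left, self.right = name, left, right

def evaluate(node):
    # A trivial "backend": dispatch on node type and operation name.
    if isinstance(node, Term):
        return node.value
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    return ops[node.name](evaluate(node.left), evaluate(node.right))

expr = Term(2) + Term(3) * Term(4)   # builds the tree, computes nothing
print(evaluate(expr))                # 14
```

    A real system would run type inference and rewrites over the tree before choosing a backend, but the deferred-build-then-dispatch shape is the same.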

    > are there more details?

    Yes! See: http://blaze.pydata.org/

lucian1900 13 years ago

Interesting approach to modelling data that lives elsewhere, in fact quite similar to SQLAlchemy's.

  • piqufoh 13 years ago

    ... but you can't use numpy operations efficiently on SQLAlchemy data

    • lucian1900 13 years ago

      That's not what I meant. Both this and SA turn python expressions into expressions to be run elsewhere, on data that isn't necessarily in the process' memory.
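      A toy sketch of that shared idea (not SQLAlchemy's actual API): an overloaded operator captures a Python comparison so it can be rendered as a SQL fragment and run elsewhere, instead of being evaluated in-process.

```python
# Toy sketch (not SQLAlchemy's real API): operator overloading turns a
# Python comparison into a SQL fragment for execution elsewhere.
class Column:
    def __init__(self, name):
        self.name = name
    def __gt__(self, other):
        # Return a SQL snippet instead of computing a boolean in-process.
        return "{} > {!r}".format(self.name, other)

age = Column("age")
print("SELECT * FROM users WHERE " + (age > 21))
# SELECT * FROM users WHERE age > 21
```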
