R vs Python for simple interactive data analysis
I'm fairly familiar with Python, somewhat familiar with R, and fairly familiar with Stata. The scenarios where you're doing OLS analysis can generally be broken into two categories:
1) Exploration of a given data set.
2) Automatic collection and analysis of data.
In most academic uses of data, (1) is the case. Here, if the data is already rectangular and in a nice format, I would actually prefer Stata. However, as soon as any sort of manipulation is required, or I need to graph something more difficult than a scatterplot, I shift to R.
R really shines when you need to do complicated analysis on a fixed data set. I've found that Hadley Wickham's `reshape` and `ggplot` packages are invaluable. They easily produce graphics that are more informative and better looking than those of any other graphics package I've seen. Additionally, R has packages for essentially any statistical analysis that you could want to do.
While R is able to pull data from a database or other places, as soon as you have more dynamic data you enter into case (2). This is when it might make sense to start using Python. But even then I've found Python is mostly useful for curating the data so that it can be used by R.
I entirely agree, though my personal preference is for lattice over ggplot, as it is faster and more flexible at the present time (the "limitations" of ggplot reflect the preferences of its creator, e.g. not being able to have two axes for the same plot). But reshape and also plyr are quite useful.
R does have surprising capability for shell scripting and text processing, albeit slower than Python. I also use Python for the rapid text-processing necessary (possibly populating an SQL database) for R to eventually use.
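A minimal sketch of the "text processing in Python, analysis in R" hand-off described above, using only the standard library. The log format, the table name `readings`, and the in-memory database are all assumptions made for illustration; in practice you would connect to a file or server that R can also reach.

```python
import re
import sqlite3

# Hypothetical raw text; one malformed line is included on purpose.
RAW_LOG = """\
2011-06-01 sensor=A temp=21.5
2011-06-01 sensor=B temp=19.0
malformed line
2011-06-02 sensor=A temp=22.1
"""

LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) sensor=(\w+) temp=([\d.]+)$")


def parse(text):
    """Yield (day, sensor, temp) tuples, silently skipping lines that do not match."""
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            day, sensor, temp = match.groups()
            yield day, sensor, float(temp)


# ":memory:" keeps the example self-contained; use a file path to share with R.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day TEXT, sensor TEXT, temp REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", parse(RAW_LOG))
conn.commit()

# A quick sanity check before handing the table over for analysis.
rows = conn.execute(
    "SELECT sensor, AVG(temp) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)
```

The result is a rectangular table, which is the shape the statistical tools discussed here expect.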
I've been keeping an eye on SciPy, but there still seems to be a lot of "the source code is the documentation", whereas in R the documentation is usually superb and well-structured. And Matplotlib, while beautiful, seems to be more verbose than Matlab or R when it comes to customizing details of the graphics (e.g., axes, etc.). That's just my impression, but I wouldn't mind being shown otherwise.
I've built the equivalent of most of the plyr and reshape/reshape2 packages inside pandas (http://pandas.sourceforge.net; note that I am in the midst of overhauling the documentation for the upcoming release). I plan to write a decent number of side-by-side code comparisons, which should be useful for folks with R experience wishing to use Python for data analysis / statistics. Feedback on pandas from savvier R users than myself would also be extremely helpful.
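For readers coming from R, here is a rough sketch of what the reshape/plyr-style operations look like in pandas. It uses the present-day pandas API, which may differ from the release discussed above, and the toy data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point.
wide = pd.DataFrame({
    "subject": ["a", "b", "c"],
    "group": ["ctl", "ctl", "trt"],
    "t1": [1.0, 2.0, 3.0],
    "t2": [2.0, 4.0, 6.0],
})

# Roughly reshape's melt(): wide -> long.
long = pd.melt(wide, id_vars=["subject", "group"],
               var_name="time", value_name="score")

# Roughly plyr's ddply(): split by group and time, apply mean, combine.
means = long.groupby(["group", "time"], as_index=False)["score"].mean()

# Roughly reshape's cast(): long -> wide again, one column per time point.
table = means.pivot(index="group", columns="time", values="score")
print(table)
```

The mapping to the R functions is approximate, not one-to-one; it is meant only to show the general melt / split-apply-combine / cast workflow.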
Building a plotting library with the ease, sophistication, and beauty of ggplot2 in Python would be a big deal. A number of people I know are interested in venturing down that path (ggpy, anyone?).
That sounds great -- I only recently learned of pandas (I normally don't use much of NumPy/SciPy), but I will keep an eye on this.
Thanks for your work on pandas, I'm looking forward to being able to stick with one language for most of my data analysis tool chain.
Yes, a Python version of ggplot2 is well overdue!
This makes a lot of sense, and meshes with my own experience. I've created a lot of programs that collect data using Python/Ruby/similar and then use R for the analysis.
While using two languages is an overhead, it feels like it plays to the strengths of both sides.
I think you want ggplot2 and reshape2 ;)
The advantages and disadvantages the author cites seem more pertinent to his own idiosyncratic preferences than to the more general features one might look for when doing interactive data analysis. They also seem easy to address. For example, an hour spent building a few quick functions would address most of his complaints about Python. I've personally used Python, Matlab, R and Stata in my research and view the first three as about equally capable. In my opinion Stata is less comparable to the others, as it is more a WYSIWYG collection of tools and functions. Matlab has good support for large data sets via memory mapping, has mex extensibility for building your own fast functions, and is very good for interactive plotting, but doesn't produce publication-quality final figures. Python is great for getting from an idea to working code quickly and without fuss, and can push data into Matlab via mlabraw. R has well-developed stats packages and a huge user base. I disagree with the author regarding documentation for R: maybe he is right for the core, but depending on the package you may have trouble finding documentation beyond a man page. Ggplot is excellent but eccentric.
I’m more interested in Python for its real-time capabilities. With Python running statistics, I can do more user-facing things with the data, whereas R can do more statistical things with the data. Plus, an Arduino-based sensor array interfacing with R sounds shaky.
And since Python does web easily: http://rapache.net/
LOESS is fairly simple to do in Python, or you can find an implementation via Google e.g. http://www.koders.com/python/fid5A91A606E15507B6823DEC7A0594...
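To give a sense of how small a basic version can be, here is a sketch of a LOESS-style smoother using NumPy: a local linear fit with tricube weights at each point. It is not the implementation linked above, it omits the robustness iterations of full LOWESS, and the function name `loess` and the `frac` parameter are my own choices.

```python
import numpy as np


def loess(x, y, frac=0.5):
    """Smooth y against x with local linear fits and tricube weights.

    frac is the fraction of points used in each local window.
    No robustness (outlier down-weighting) iterations are performed.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))  # number of points per window
    design = np.column_stack([np.ones(n), x])
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        # Bandwidth: distance to the k-th nearest point (guarded against zero).
        h = max(np.sort(dist)[k - 1], np.finfo(float).eps)
        # Tricube weights, zero outside the window.
        w = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3
        sw = np.sqrt(w)
        # Weighted least squares via ordinary lstsq on sqrt-weighted rows.
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted


# Usage: exactly linear data should pass through unchanged.
xs = np.linspace(0.0, 10.0, 21)
ys = 2.0 * xs + 1.0
smoothed = loess(xs, ys, frac=0.5)
```

This refits at every point, so it is quadratic in the number of observations; that is fine for interactive exploration of small data sets but not for large ones.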
I'd be curious to see an updated comparison with LOESS added to the Python code!
We recently implemented the lowess smoother in statsmodels (pending pull request: https://github.com/statsmodels/statsmodels/pull/5), so it will work its way into the mix soon.
The non-parametric stats functions in R are better than Py