My Journey from R to Julia

drtomasaragon.github.io

102 points by michelpereira 3 years ago · 119 comments

karencarits 3 years ago

> For example, in R, we try to avoid loops because they are very inefficient

This was true before, but the performance of for loops has improved a lot in recent years, and while vectorization is still faster, for loops are no longer a no-no.

See https://www.r-bloggers.com/2022/02/avoid-loops-in-r-really/

  • _Wintermute 3 years ago

    It's a really sticky misconception. I've seen many beginners telling others to "never ever use loops in R", and so you end up with nested sapply()s or whatever soon-to-be-deprecated tidyverse functions are in vogue that nobody can reason about.

    • vharuck 3 years ago

      Agreed. The most common reason loops become bottlenecks is people "adding onto" vectors or dataframes. This causes a whole new vector to be created, the data from the old one copied into it, and then the new data filled in at the end. You'll rarely notice the performance hit unless you stick it in a loop that runs tens of thousands of times.

      For those who want to avoid it and still use a loop, you can create a vector beforehand with the final length and fill it in. If you don't know the final length, create a vector with a good guess for length, double its length whenever it gets full, and then crop off the unused tail when you're done.
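      A minimal R sketch of the difference (the vector length and the squaring are made up for illustration):

      ```r
      n <- 1e5

      # Growing: each c() call allocates a new, longer vector and copies the old one
      grow <- c()
      for (i in 1:n) grow <- c(grow, i^2)

      # Preallocating: the vector is created once and filled in place
      pre <- numeric(n)
      for (i in 1:n) pre[i] <- i^2

      identical(grow, pre)  # TRUE, but the preallocated loop avoids n reallocations
      ```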

    • em500 3 years ago

      So Rob Pike’s rule 1 and 2 again:

      Rule 1. You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.

      Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.

      https://users.ece.utexas.edu/~adnan/pike.html

      • teruakohatu 3 years ago

        That holds true in general, but when doing numerical calculations on a large amount of data, taking speed into account is necessary. You usually know approximately the time penalty for not doing so, and can weigh the extra time spent coding against the time spent waiting for results.

        For example, if I am writing a toy neural network with a small dataset, I don't care how optimized it is, or if it runs slowly on the CPU.

        But when training a large network on a large amount of data, it is well worth spending extra effort from the start to ensure as much work as possible is done on a GPU, and writing it to ensure it can support multiple GPUs.

      • pxmpxm 3 years ago

        That's some pretty generic premature optimization cargo culting.

        If you have a huge data set and some understanding of what you're doing, the bottlenecks will be pretty obvious.

        • em500 3 years ago

          Apparently not obvious enough for people to estimate in advance if they should avoid looping.

          Pike's point is not just to avoid premature optimization. It's to measure bottlenecks, because due to changing language and hardware developments, what you think you know to be true might become outdated.

        • pjmlp 3 years ago

          One of the reasons why Matt Godbolt created Compiler Explorer was to prove to teammates that what seemed pretty obvious to them actually wasn't.

          • markkitti 3 years ago

            In Julia I do not need the Godbolt compiler explorer. The macros `@code_llvm` and `@code_native` show me the LLVM IR and native code for a function.

              julia> @code_llvm debuginfo=:none 5.0 + 3
              define double @"julia_+_156"(double %0, i64 signext %1) #0 
              {
              top:
                %2 = sitofp i64 %1 to double
                %3 = fadd double %2, %0
                ret double %3
              }
            
              julia> @code_native debuginfo=:none 5.0 + 3
              ...
                vcvtsi2sd %rdi, %xmm1, %xmm1
                vaddsd %xmm0, %xmm1, %xmm0
                retq
              ...
          • adgjlsfhk1 3 years ago

            There's a reason Godbolt was made for C/C++ rather than python/R. In a fast language you need to know what the compiler is doing to know what's slow. In a slow language, the slow part is pretty much always just "code that does anything in the language".

            • pjmlp 3 years ago

              Python is only slow because so far there has been a huge disregard for JIT implementations, versus how other dynamic languages have decided to deal with performance issues.

              • adgjlsfhk1 3 years ago

                PyPy definitely shows that python could be 5x faster than it is, however this would still be ~10x slower than Julia/C/C++ (and R is roughly 5-10x slower than python now)

                • pjmlp 3 years ago

                  Yes, but it will never be part of the development workflow of most Python developers, and the ongoing CPython JIT work sponsored by Microsoft will hardly win any JIT performance prizes, given its design goals.

                  In any case, there is a reason why Python has a profiler in the box: what one thinks and what actually is aren't the same. Which was my starting point.

      • naijaboiler 3 years ago

        Avoid the allure of premature optimization

        • DennisP 3 years ago

          But embrace the repulsion from belated pessimization. As Len Lattanzi said, it's the leaf of no good.

    • deng 3 years ago

      > so you end up with nested sapply()s

      And that's usually not even vectorizing anything, it just hides the for-loop that is buried somewhere in the apply-code...
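      The point is easy to demonstrate: `sapply` calls the R function once per element, just like an explicit loop, rather than vectorizing anything (toy example):

      ```r
      xs <- 1:5

      # sapply invokes the anonymous function once per element...
      a <- sapply(xs, function(x) x^2)

      # ...exactly as this explicit loop does
      b <- numeric(length(xs))
      for (i in seq_along(xs)) b[i] <- xs[i]^2

      identical(a, b)  # TRUE
      ```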

  • johnmyleswhite 3 years ago

    Does the article you linked not show that a loop is 10x slower than vectorization for computing square roots? The fact that 10x is better than the 60x slowdown for vapply isn't really evidence that loops are a reasonable alternative to vectorization yet.

  • rlh2 3 years ago

    The obsession with CPU speed almost always confuses me in these topics. Time it takes to program is way more important, and that's where a terse language like R shines. The base/most common functions are almost always executing C anyway. It's kind of like Lisp in that it's easy to write slow code, but who cares if it's "fast enough"? Also, it's almost always easy to speed up if necessary at the R level, and R's C API is also easy to use for numeric computing/optimization if you want to work at the C level.

    • kkoncevicius 3 years ago

      It depends. Take for example any omic dataset where you might need to run a GLM model on ~500,000 rows. Code I've seen for this operation can range from taking 30 minutes to 2 days.

      My takeaway here is that, sure, for one operation the speed is not that critical, but there is always the case where that one operation will be used close to a million times in one analysis, and then it all adds up. On top of that, if it's implemented in C, then the invocation from R to C and back will be happening that many times, which adds to the slowness.

      • derbOac 3 years ago

        Yes, I use R, Julia, and Python from time to time depending on the case and my mood and they all have their advantages and disadvantages.

        R is more than fast enough for straightforward prototypical analyses where a lot of the code is calling C or something lower level and you're not introducing something "new" to the interpreter system. But if you want to do some unusual optimization there's going to be something that bottlenecks everything unless you go into C/C++/Fortran yourself, and then Julia is a good compromise. I've had times when Julia didn't save any time whatsoever, and other times when it took something that would literally run over a week at least in R and it was done in 30 minutes in Julia.

        Having said that, the more I use Julia the more I find myself scratching my head about it. It's very elegant but it's just low-level enough that sometimes I wonder if it's worth it over, say, modern C++ or something similarly low level, which tends to have nice abstracted libraries that have accumulated over the years. I also have the general impression, mentioned in a controversial post discussed here on HN, that a lot of Julia libraries I've used just don't quite work for mysterious reasons I've never been able to figure out. Everything with Julia has gotten better with time but I still have this sense that I could put a lot of time into some codebase, and have it just hit a wall because of some dependency that's not operating as documented.

        There's kind of an embarrassment of riches in numerical computing today, and yet I still have the feeling there's room for something else. Maybe that's the mythical golden language that's lured all sorts of language developers since the beginning though.

        • BrandonS113 3 years ago

          I have been thinking the same and have had similar timing experiences. As Julia is lower level than R/Python, there are a lot of annoying things to take care of that are not needed in R/Python. And then why not use, say, Rust? Or just Rcpp in R. We just did a small test program in Rust that is called very often on the command line and takes a couple of seconds to run. Very happy with the experience. Same run speed as Julia, 10 times faster than R/Python, and no 60-second load time like Julia.

          • markkitti 3 years ago

            Julia 1.9, now in beta, implements native code caching. Precompiling a Julia package now creates a native shared library: a ".so", ".dylib", or ".dll" file. For some packages, this lowers load time considerably. It may take some time before many packages take full advantage of this.

            The promise of Julia is that you can have the high-level interface and the low-level code in the same language. The alternative would be coding the low level code in Rust or C and then creating bindings for Python or R.

            For a while, Julia made the most sense for long-running code that is executed almost as often as it is modified (e.g. scientific computing). In this situation Rust or C static compilation times become a hindrance. As ahead-of-time and static compilation features get added to Julia, this scope will expand.

            • BrandonS113 3 years ago

              Yes I follow this. The load time keeps getting better. And am looking forward to 1.9.

              I really don't want to come across as negative; Julia is a fantastic language, and my hope is that it will continue its impressive improvement path.

              But to follow from the thread's sentiment, I have the feeling Julia lives in an unstable equilibrium. It is lower level than R/Python but doesn't quite deliver the benefits of Rust/C/Fortran/C++. I find my colleagues gravitate to one of the two equilibria.

              Maybe your last paragraph crystallizes it. If one lives in the REPL, Julia is wonderful. Not how I work. I prefer the command line. Have new data, run code on it. Data changes in real time, code not. My code may run millions of times on different operating systems and only infrequently change.

              • markkitti 3 years ago

                We already have some early prototypes of running ahead-of-time compiled native Julia code from the command line.

                https://github.com/brenhinkeller/StaticTools.jl

                I think what we'll end up with is a language that can be used in both a fully static mode and in a dynamic mode along with some possible mixing. We may yet get the benefits of a statically compiled language as the tooling continues to develop. I do not see anything inherent in the language that would prevent that from happening.

              • DNF2 3 years ago

                In Julia you can go low-level, but there is no requirement. You can write purely high-level, generic, untyped code, with good performance. So I'm a bit reluctant to accept the claim that it's lower level.

                What are the things where low-level code is required in Julia, but not in Python/R?

    • Hasnep 3 years ago

      One of the key points of Julia is that the language you use for performance critical parts is also Julia. That applies to both the libraries like DataFrames.jl and for situations where you'd drop to a lower level language when optimising. I think being productive in Fortran or C++ is unrealistic for most scientific programmers.

    • freehorse 3 years ago

      It is a trade-off, and the sweet spot has a lot to do with the specific context and background. Run speed matters a lot when the difference is between running your code on a dataset for half an hour vs through the whole night. Once you have prototyped your code, you are going to use it more and more (not to mention runs to tweak parameters or validate results), and R's speed is not satisfying enough for my work. Python and MATLAB are easy and fast enough to program in, and much faster for tasks that are computing-heavy. If I got into C, I would not save as much time as I would have to put into learning how to run e.g. parallel tasks there safely. Moreover, R is not necessarily faster to program in, always; real (i.e. tidyverse-style) R is quite idiosyncratic, and if you come from a programming rather than a statistics background it will probably take more time to learn than it is worth, unless it is something important in your work environment.

    • CyberDildonics 3 years ago

      When someone understands what is happening when their program executes they will write faster programs without much more effort.

      You might like writing slow programs, but that doesn't mean people like using them.

  • npalli 3 years ago

    Sorry, not following the logic here. From the article, vectorization[1] is more than 10 times faster than a loop. How is this an endorsement of "for" loops?

       [1] Vectorization is more than ten times faster than the naive loop.
kkoncevicius 3 years ago

R can handle the examples in the article with generic functions:

  oddsratio         <- function(x, ...)     UseMethod("oddsratio", x)
  oddsratio.integer <- function(a, b, c, d) (a * d) / (b * c)
  oddsratio.numeric <- function(p1, p0)     ((p1)/(1 - p1)) / ((p0)/(1 - p0))
  oddsratio.matrix  <- function(x)          (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])
Then:

  oddsratio(12L, 6L, 2L, 29L)         # 29
  oddsratio(12/(12+2), 6/(6+29))      # 29
  oddsratio(matrix(c(12,6,2,29), 2))  # 29
  • civilized 3 years ago

    This works, but only because you can tell which function you need to call using only the first argument.

    A more compelling example for Julia would have to have two modes of operation where the first argument has the same type in both modes, but later arguments have different types.
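    A sketch of what that looks like in Julia (the `describe` methods are hypothetical):

    ```julia
    # Both methods take a String first; the *second* argument decides which one runs
    describe(name::String, age::Int) = "$name is $age years old"
    describe(name::String, tags::Vector{Symbol}) = "$name is tagged $(join(tags, ", "))"

    describe("Ada", 36)            # "Ada is 36 years old"
    describe("Ada", [:math, :cs])  # "Ada is tagged math, cs"
    ```

    S3's first-argument dispatch cannot distinguish these two calls.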

    • Cosi1125 3 years ago

          setMethod(myfun, signature = c("integer", "character"), ...)
          setMethod(myfun, signature = c("list", "data.frame", "logical"), ...)
          setMethod(myfun, signature = "foo", ...)
      
      This takes into account as many arguments as you wish.
    • kkoncevicius 3 years ago

      True, but in R (at least in S3) it's probably avoided by design because of default parameters. A short example:

        genericfun.type1 <- function(x, sub=1) x - sub
        genericfun.type2 <- function(x, y)     sum(x, y)
      
      The single case can be differentiated:

        genericfun(x)
      
      But which function are we calling with:

        genericfun(x, y)
      
      I don't know about Julia and how it solves this. Maybe by not allowing to pass nameless optional arguments.
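      For what it's worth, that is roughly how Julia sidesteps the ambiguity: keyword arguments must be named at the call site and never participate in dispatch, so only positional arguments select a method. A small sketch (hypothetical `genericfun` mirroring the S3 example):

      ```julia
      genericfun(x::Int; sub = 1) = x - sub  # "type1": optional parameter is a keyword
      genericfun(x::Int, y::Int) = x + y     # "type2": second positional argument dispatches

      genericfun(5)           # 4  (keyword default used)
      genericfun(5; sub = 2)  # 3  (keyword must be named)
      genericfun(5, 2)        # 7  (unambiguously the two-argument method)
      ```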
      • disgruntledphd2 3 years ago

        S4 can dispatch on multiple arguments, just like Julia (though it seems a lot more natural in Julia).

vitorsr 3 years ago

This has been said multiple times before, but with these languages it is rarely about the languages themselves but their ecosystems:

https://cran.r-project.org/web/packages/available_packages_b...

To go from R to Julia, as an example, one would have to give up on a hundred or so high-quality packages potentially related to their activities.

  • getoffmycase 3 years ago

    Having looked at the source code of a large number of R packages, I do hesitate to freely label R packages as generally high-quality. I've been operating on the "trust, but verify" principle.

  • NicolasL-S 3 years ago

    You don't have to. Just use RCall.

    Of course R has been here longer. Eleven years after its creation, R had fewer than 500 packages. Julia was released in 2012 and today has over 7,000 packages.

    • vitorsr 3 years ago

          R
          Cited in: 8,589 Publications
          7,353 [Citing Publications in] Statistics (62-XX)
          https://zbmath.org/software/771
      
          Julia
          Cited in: 442 Publications
          64 [Citing Publications in] Statistics (62-XX)
          https://zbmath.org/software/13986
  • BrandonS113 3 years ago

    That is exactly the issue. No language comes close to the richness of the R statistical package ecosystem.

    • dunefox 3 years ago

      It's not an issue at all with RCall and PyCall.

      • BrandonS113 3 years ago

        Of course it is. Sometimes one needs to pass data represented as R objects (like zoo) to R functions, or to receive an R object that is to be worked on by further R calls before being passed on. That is very clumsy to manage with RCall.

        • RosanaAnaDana 3 years ago

          A perfect example of this is image processing.

          R is delightful for working with spatial raster data. It's an 'it just works' space, especially with raster and stars. There have been several attempts to disrupt the raster package but none have really stuck.

          Granted, it's not fast. However, trying to do the same kinds of things in Python with rasterio is just kludgy. And half the time you end up making system calls to gdal anyway. Guess what? I can make the system calls from R too.

sfpotter 3 years ago

This is a pretty weak article. The author lists five reasons an epidemiologist would be interested in Julia and then only gives a (kind of simple and contrived) example for one of them.

  • bigger_cheese 3 years ago

    My father is a (retired) veterinary epidemiologist (i.e. he looked at the spread of diseases in animal populations). Probably a function of his age, but he wrote most of his models and simulations in Pascal and then eventually switched to Basic.

    From my conversations with him the programming language was not the bottleneck for him it was the integration with GIS software and spatial mapping packages which caused him problems. A lot of programming languages did not mesh together very well with spatial mapping tools available to him at the time.

    The first language I ever saw him use was Turbo Pascal; later he would use QBasic and GWBasic, and eventually he was using Visual Basic.

    Towards the end of his career I believe he looked at other languages like Python and Java, but I don't believe he found them very compelling. Python I believe has better spatial tools available now, but it would have been relatively early in the language's life when my Dad was looking at it, and those packages probably did not exist.

  • jgalt212 3 years ago

    For the epidemiologists, I wonder how Julia stacks up against R when calculating the benefits (but not the costs) of long-lasting lockdowns and school closures.

bluedino 3 years ago

I work with PhD chemists at a F500 company; most everyone uses Python, and we have a pocket of users who are on the R train. Mostly RStudio mixed with Python.

Someone asked to install Julia on the compute cluster just last week, so we'll see how many others start using it.

BrandonS113 3 years ago

Just now I was thinking of moving a long calculation from R to Julia (non-linear optimisation of a simple function with multiple local minima, for a lot of different datasets). No loops. Embarrassingly parallel. And to my great surprise, R and Julia took the same time.

  • tkuraku 3 years ago

    I think there are a lot of performance pitfalls with Julia. In my experience you have to pay close attention to https://docs.julialang.org/en/v1/manual/performance-tips/. Otherwise you can easily get lackluster performance. However, there are times when the speed of Julia really shines and it just feels magical compared to R, Python, etc.

  • Lyngbakr 3 years ago

    I only used Julia for a short time, but I didn't see the blazing fast speeds I was promised. I've seen the benchmarks, of course, on which the claims are founded, but the C-like speeds weren't obvious to me in everyday data science workflows. In the end, there wasn't sufficient motivation for me to switch to Julia as my weapon of choice. I do like Pluto[0], though...

    [0] https://plutojl.org/

    • BrandonS113 3 years ago

      If doing data science, I find Julia's tools to be inferior to Python and R. But in my work, when it comes to long computations, not only does Julia usually vastly outperform both, we write Julia code faster with fewer errors.

      • thejosh 3 years ago

        What kind of questions are you trying to answer?

        What's the area for the long computations you're doing?

        Just curious :)

    • rightbyte 3 years ago

      You need to know how the compiler propagates types in detail to write performant code. It is quite hard.

      • CyberDildonics 3 years ago

        It's not 'quite hard'; you just need to put types on your variable and function declarations.

        If you leave them off you can make a function generic, but you need to make sure you don't have multiple returns that could each return different types.

  • ChrisRackauckas 3 years ago

    Did you use JuMP? It would be interesting to see the JuMP code.

    • BrandonS113 3 years ago

      No, NLopt. That's why we could easily port from R to Julia, as NLopt exists for both (it's C).

      • harshreality 3 years ago

        If you use the same native C library from both Julia and some other language, and (presumably) that library is where the algorithm spends the vast majority of its time, why would you expect Julia to be faster?

        • BrandonS113 3 years ago

          No. The code spends 99% of its time in 5 lines in the objective function. And my usual experience is that Julia is much faster than R. Just not always, apparently.

      • ChrisRackauckas 3 years ago

        Ahh then it's all based on the speed of the objective function and the gradient calculation code. Do you have an example to look at for that? Or did you try modelingtoolkitizing and running an auto-simplified form?

        • celrod 3 years ago

          I've noticed that R often defaults to much higher tolerances than Julia, even when it's wrappers to the same C library, like cubature.

          R cubature[0]: 1e-5. Cubature.jl[1]: 1e-8.

          The difference for NLopt in R vs Julia is smaller. `NLopt.DEFAULT_OPTIONS`[2] in Julia shows `1e-7` for `ftol_rel`, `xtol_rel`, and `constrtol_abs`, while in R `xtol_rel` is `1e-6` and the others are `0.0`[3]. So, the options at least aren't the same with nlopt. Anyway, I always recommend confirming that you're comparing the same settings.

          And of course, in Julia, you'll probably want to `JET.report_opt` your function and fix any glaring performance issues.

          NLopt seems like it may be a bit of an exception, but I noticed this is a pretty common pattern elsewhere, uniroot[4] being another example, with a default tolerance of eps()^(1/4), far higher than what Julia root solvers will use.

          [0] https://cran.r-project.org/web/packages/cubature/cubature.pd... [1] https://github.com/JuliaMath/Cubature.jl [2] https://github.com/JuliaOpt/NLopt.jl/blob/6ade25740362895bbf... [3] https://cran.r-project.org/web/packages/nloptr/nloptr.pdf [4] https://www.rdocumentation.org/packages/stats/versions/3.6.2...

          • BrandonS113 3 years ago

            We take care of that by setting all the NLopt options to be the same across calling languages. We get pretty much the same number of calls in Julia, R and C.

        • BrandonS113 3 years ago

          Yes, well, no gradients. It's a simple parametric function (6 parameters) applied to a vector of length 10 to 20. All vectorized in R. All powers, logs, exponentials, ratios, and sums. Large number of local maxima, and places where the function cannot be evaluated. So a probabilistic optimizer is very important.

          Come to think of it, for this sort of calculation, R and Julia should take the same time.

          • DNF2 3 years ago

            Of course, this piques one's curiosity. It might be, if the function is simple enough, that there is little advantage to Julia here.

            But if you are combining multiple operations on a vector, there could be opportunities for Julia: in-place operations, fusing, SIMD. Maybe even StaticArrays.

            Any chance of sharing that little piece of code?

            • BrandonS113 3 years ago

              Sure. For parameters par (what is optimized over), data vector x (typical length from 10 to 20), and constants n and n2, a typical function is

                if ((1 - par[3]^2) < 0) return(100)
                if (par[1] + par[2] * par[5] * sqrt(1 - par[3]^2) < 0)
                  return(500)
                tmp <- (x - par[4])^2 + par[5]^2
                tmp2 <- par[1] + par[2] * (par[3] * (x - par[4]) + sqrt(tmp))
                result <- ((sqrt(tmp2) - Ce) / n2)^2
                result <- sqrt(sum(n * result) / n)
                if (is.na(result)) return(1111)
                return(result)
              • DNF2 3 years ago

                I'm not that good at reading R. But if the Julia code is similar, then this code is type unstable, sometimes returning an Int, sometimes a Float. That harms performance.

                Generally, it looks like a function where Julia could have a significant performance advantage.

                • BrandonS113 3 years ago

                  R looks pretty much the same as Julia here. Don't think it's unstable. It's all float. NLopt passes parameters as float, data is float. Can't see how it could return an int.

                  But I'll take a second look at the Julia version of it. We optimised the R as much as possible; might have missed a Julia trick. SIMD perhaps. Except we also run it on ARM.

                  • DNF2 3 years ago

                    Maybe it's all floats in R, but in Julia, `return 500` means an Int is returned. But it's really hard to determine whether the Julia code is unstable based only on the R code.
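                    A sketch of what that means, using made-up functions in the shape of the penalty branches above:

                    ```julia
                    # Unstable: the penalty branch returns an Int, the normal path a Float64
                    f_unstable(x) = x < 0 ? 500 : sqrt(x)

                    # Stable: every branch returns a Float64
                    f_stable(x) = x < 0 ? 500.0 : sqrt(x)

                    f_stable(-1.0)  # 500.0
                    ```

                    In the REPL, `@code_warntype f_unstable(-1.0)` flags the `Union{Float64, Int64}` return type.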

mxkopy 3 years ago

From my understanding Julia is closer to the metal than R. This means the semantics are much more specific than R's, and the syntax is more consistent/rigid.

For example, plotting in R always baffled me.

plot(x, y, col=..., col.name=...)

In this case, col.name is literally just a symbol. But in another context col.name is the data with index 'name' stored in col. Or something, it's been a while.

R seems to have a lot of these 'special contexts' that A. make understanding and writing code much quicker and B. reward familiarity over intuition. One line in R can be 100 in Julia, and both compile to 80 machine instructions, for example.

I'd say if you can agree with others on what R code does and you're comfortable with R, then use R. If you need to build something performant with many domains, then Julia is a great language for that sort of thing.

  • kgwgk 3 years ago

        > plot(x, y, col=..., col.name=...)
        > In this case, col.name is literally just a symbol.
    
    In this case, col.name is literally… made up?
fithisux 3 years ago

I'm afraid R is dragged down by its S legacy. I think it's time for the two to evolve separately. I see Julia can do what R already does while following software engineering practices, with cleaner code and typing.

Julia is the new R for me. Unless R re-invents itself.

aljabadi 3 years ago

It’s important to note that R’s S4 Object System now supports multiple dispatch & I have enjoyed using it. I would agree that it’s not quite as elegant as Julia’s. See https://www.mpjon.es/2021/05/31/r-julia-multiple-dispatch/
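A minimal S4 sketch of dispatch on more than one argument (the generic `area` is hypothetical):

```r
setGeneric("area", function(a, b) standardGeneric("area"))

# The method is chosen by the classes of both arguments
setMethod("area", signature("numeric", "numeric"), function(a, b) a * b)
setMethod("area", signature("numeric", "missing"), function(a, b) a * a)

area(3, 4)  # 12
area(3)     # 9
```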

gozzoo 3 years ago

The article is supposed to tell us why Julia is better than R, but it mainly focuses on one feature: multiple dispatch. Can someone please explain: does multiple dispatch provide any advantage over other function call strategies, and even if it does, how much effort would it save, and how much shorter or less ambiguous would our code become?

maxboone 3 years ago

I'd love to see more Julia (or Python) adoption in non-cs/math/phys academic research.

It's a breeze doing such analyses with Stata, and with a bunch of weird syntax, some libraries and more lines you can get it done in R as well.

But I tried assisting my SO with setting up their statistical methods in Python and it was so much more work than Stata (or R).

dan-robertson 3 years ago

Interestingly, I found myself going the other way. Let me first say that R is a hilariously weird-feeling and janky language. The Julia features mentioned (structures are good for organising; compilation and better data structures mean you need to worry less about accidentally writing code that is 10x or 100x slower than it ought to be, which tends to matter a lot for interactive use) are definitely useful, and magically getting e.g. arbitrary precision arithmetic is pretty cool.

I think the example in the post shows an annoying way for Julia’s generic functions to be difficult because the function seems to take a matrix but secretly it only wants a 2x2 matrix. If such a function gets called with the wrong value deep in some other computation, and especially if it silently doesn’t complain, you may end up with some pretty annoying bugs. This kind of bug can happen in R too (functions may dispatch on the type of their first arg and many are written to be somewhat generic by inspecting types at runtime). I think it’s a little less likely only because data structures are more limited. A related example that trips me up in R is min vs pmin.

The biggest issue I had in practice is that for either language, I wanted to input some data, fiddle with it, draw some graphs, maybe fit some models, and suchlike. R seems to have better libraries for doing the latter but maybe I just didn’t find the right Julia libraries.

- I feel like I had more difficulties reading csvs with Julia. But then when I was using Julia, I wanted to read a bunch of ns-precision time stamps which the language didn’t really like, and with R I didn’t happen to need this. I found neither language had amazing datetime type support (partly this is things like precision. Partly this is things like wanting to group by week/day/whatever. Partly this is things like wanting sensible graphs to appear for a time axis)

- R has a bigger standard library of functions that are useful to me, e.g. approx or nlm or cut. I think it’s a reasonable philosophy for Julia to want a small stdlib but it is less fun trying to find the right libraries all the time. Presumably if I knew the canonical libraries I would have been happier.

- R seems to have better libraries for stats.

- I found manipulating dataframes in Julia to be less ergonomic than dplyr, but maybe I just wasn't using the Julia equivalent. In particular, instead of e.g. mutate(x = cumsum(y * filter)), I would have to write something like transform(df, [:y, :filter] => ((y, f) -> cumsum(y .* f)) => :x). I didn't like it, even though it's clearly more explicit about scoping, which I find desirable in a less interactive language.

- I much preferred ggplot2 to the options in Julia. It seems the standard thing is Plots.jl but I never had a great time with that. Gadfly seemed to have a better interface but had similar issues to manipulating data frames, and I found myself hitting many annoying bugs with it. Ggplot is fairly slow, however.

- Pluto crashed a lot on me, which wasn’t super fun. In general, I felt like Julia was buggier. Though I also get an annoying bug with R where it starts printing new prompts every second or so, and sometimes it just crashes after that. Pluto also doesn’t work with Julia’s parallelism features (but maybe it does now?)

- The thing that most frustrated me with Pluto/Gadfly was that I would want to take a bunch of data, draw it nice and big, and have a good look at it. Ggplot (probably because of bad hidpi support) does this well by throwing up the plot with a tiny font size in a nice 4k window and, with appropriate options, not doing a ton of X draw calls for partial results (downside: it is still quite slow with a lot of points). Gadfly in Pluto wants to generate an SVG with a massive font size and thick borders on chonky scatter-plot shapes, and it crams the result into a tiny rectangle in Pluto. Maybe this is more aesthetic or something, but generally I plot things because I want to look at the data, and this is not an easy way to look at it. The option to hide the thick borders in Gadfly is hilariously obscure. I never bothered learning how to not generate the SVG in the notebook; I would just suffer terrible performance while I zoomed in to get a higher-resolution screenshot (before deleting the SVG in the dev console) or generate a PNG file.

That said, there are still things I don’t know how to do with either plotting system, like reversing a datetime scale, or having a scale where the output coordinate goes as -pseudolog(1-y) to see the tail of an ecdf, or having a scale where the labels come from one source but positions come from some weight, e.g. time on the x axis weighted by cpu-hours so that an equal x distance between points corresponds to equal cpu-hours rather than equal wall-time. Maybe I will learn how to do it someday with ggplot.

  • stillyslalom 3 years ago

    Take a look at DataFramesMeta for nicer manipulation of Julia's dataframes. Your example would look like

      julia> df = DataFrame(y = rand(10^6), filter=randn(10^6));
    
      julia> @transform!(df, :x = cumsum(:y .* :filter))
      1000000×3 DataFrame
           Row │ y          filter      x
               │ Float64    Float64     Float64
      ─────────┼────────────────────────────────────
             1 │ 0.0726663   1.7213       0.125081
             2 │ 0.183898   -0.392131     0.0529686
             3 │ 0.150274    1.08083      0.21539
          ⋮    │     ⋮          ⋮            ⋮
    
    It's particularly nice in conjunction with @chain [1].

    [1] https://juliadata.github.io/DataFramesMeta.jl/dev/#Chaining-...

bluenose69 3 years ago

The key comment is that it's hard to know more than 1.5 languages. I think everyone has their own number for that. My number is higher than the author's. I use R for most work, but a lot of my computations involve large binary datasets that are best read with C/C++, so I use C/C++ and R in tandem for my data-analysis work.

Separate from that, I use python when I'm writing (undemanding) system-level work. I see it as a great replacement for the shell. (Python took over from perl, and once I got to 20% proficiency with python I had a sigh of relief, knowing that I would never really need to write in perl again.)

And, yes, I also use Julia. This is mainly for writing small numerical models. It is a lovely language. I would never start to write a small model in fortran anymore. But that doesn't mean I can leave fortran behind because it is still the language used for large numerical models. (These models involve many tens of person-years of effort by world experts. This is not just a coding thing.)

I suspect that quite a lot of people have language limits more like mine than the 1.5 stated by the author. For such people, Julia is definitely an arrow that ought to be in the quiver. It is elegant. It is fast. It is modern. Parts of it are simply delightful. But there are downsides.

1. The startup is slow enough to be annoying for folks (like me) who like to use makefiles to coordinate a lot of steps in an analysis, as opposed to staying in a language environment all day long. (Note, though, that Julia is getting faster. In particular, the time-to-first-plot has been decreasing from an annoying minute or so down to perhaps half a minute.)

2. The error messages often emanate from a low level, making it hard to understand what is wrong. In this, R and Python and even C/C++ are much superior.

3. The language is still in rapid development, so quite often the advice you find on the web will not be the best advice.

4. There are several graphics systems, and they work differently. This wild-west approach is confusing to users. Which one to choose? If I run into problems with one and see advice to switch to another, what new roadblocks will I run into?

5. The graphical output is fairly crude, compared with R.

6. It has some great libraries, but in sheer number, depth, and published documentation, it cannot really hold a candle to R. Nearly every statistics PhD involves R code, and I think quite a lot of packages come from that crucible. This environment ought not to be underestimated.

The bottom line? It only takes an hour or so to see that Julia is a wonderful open-source replacement for matlab, and for small tasks that might otherwise be done in Fortran. Anyone with a language capacity of 2 or 3 or more (and I suspect this is many folks on HN) will find Julia to be a great tool to learn, for certain tasks.

  • BrandonS113 3 years ago

    1.5 is on the low side. I use Python, R, Julia, and LaTeX professionally: Python for OS/internet/data work, R for stats, and Julia for numerical calculations (some very large scale). So I know the useful parts of all 4.

usgroup 3 years ago

TLDR: Author switched to Julia because he “fell in love” with it, with no further qualification.

He then speaks a bit about multiple dispatch and how it’s useful when it’s suitable.

Personally I saw nothing here that might actually convince someone to switch. R + Tidyverse + Rcpp + CRAN is formidable.

  • heywhatupboys 3 years ago

    > Rcpp

    Rcpp is the worst thing that ever happened to humanity. Crazy build system, impossible magic words and macros, poisons an entire C or C++ project with new headers etc., extremely to downright impossibly hard to compile without R-specific compiler tools. Two different build systems for whatever reason in sourceCpp, the compiler just includes arbitrary files, and the maintainer is, ahem, extremely condescending to any Q&A questions on SO and GH and doesn't understand why crazy long errors aren't just obvious.

hnarayanan 3 years ago

I get confused by this every time this comes up. Is multiple dispatch the same as function-overloading (e.g. in C++)?

  • sfpotter 3 years ago

    They're different. IIRC, multiple dispatch is dynamic (i.e., happens at runtime) while C++'s function overloading is static (happens at compile time).
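A toy sketch of the distinction (my own example, not from the thread): in Julia the method is chosen from the runtime types of all arguments, which C++ overload resolution, fixed at compile time, cannot do through a base-class reference.

```julia
abstract type Pet end
struct Dog <: Pet end
struct Cat <: Pet end

# Four methods; dispatch inspects the runtime types of BOTH arguments.
meets(::Dog, ::Dog) = "sniff"
meets(::Dog, ::Cat) = "chase"
meets(::Cat, ::Dog) = "hiss"
meets(::Cat, ::Cat) = "ignore"

pets = Pet[Dog(), Cat()]   # eltype is the abstract Pet, so the compiler
                           # cannot know the concrete element types statically
@assert meets(pets[1], pets[2]) == "chase"   # method resolved at runtime
@assert meets(pets[2], pets[1]) == "hiss"
```

The C++ analogue, `meets(const Pet&, const Pet&)`, would pick one overload at compile time regardless of the dynamic types; single-dispatch virtual functions only recover this for the first argument.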

    • borodi 3 years ago

      One interesting thing is that if Julia can prove at compile time what types a function will be called with, it doesn't have to do dynamic dispatch, so there is no overhead. This is what the Julia folks call type-stable code.

      • sfpotter 3 years ago

        If ifs and buts were candy and nuts...

        • markkitti 3 years ago

          This becomes an important part of optimizing Julia code. There is some tooling for this. Below, identity is a type-stable function because we know that an Int64 argument results in an Int64 output. The macro code_warntype reveals the type analysis:

            julia> @code_warntype identity(5)
            MethodInstance for identity(::Int64)
              from identity(x) in Base at operators.jl:526
            Arguments
              #self#::Core.Const(identity)
              x::Int64
            Body::Int64
            1 ─     return x

          This, by contrast, is type unstable and results in dynamic dispatch, because we are not sure whether the argument to identity will be an Int64 or a Float64:

            julia> f(x) = identity(x ≥ 0 ? x : x + 0.0)
            f (generic function with 1 method)
            f (generic function with 1 method)
          
            julia> @code_warntype f(5)
            MethodInstance for f(::Int64)
              from f(x) in Main at REPL[6]:1
            Arguments
              #self#::Core.Const(f)
              x::Int64
            Locals
              @_3::Union{Float64, Int64}
            Body::Union{Float64, Int64}
            1 ─ %1 = (x ≥ 0)::Bool
            └──      goto #3 if not %1
            2 ─      (@_3 = x)
            └──      goto #4
            3 ─      (@_3 = x + 0.0)
            4 ┄ %6 = @_3::Union{Float64, Int64}
            │   %7 = Main.identity(%6)::Union{Float64, Int64}
            └──      return %7
    • moelf 3 years ago

      > happens at runtime

      is not technically true, because that would imply a massive slowdown. It's more accurate to say that, behavior-wise, it's always equivalent to dynamic dispatch, but because of Julia's just-ahead-of-time compilation, the dynamic dispatch is often eliminated at run time.

      • sfpotter 3 years ago

        It is technically true, Julia and other programming language's implementation of multiple dispatch notwithstanding.

        First sentence from the Wikipedia article on multiple dispatch:

        "Multiple dispatch or multimethods is a feature of some programming languages in which a function or method can be dynamically dispatched based on the run-time (dynamic) type or, in the more general case, some other attribute of more than one of its arguments."

        And later:

        "Multiple dispatch should be distinguished from function overloading, in which static typing information, such as a term's declared or inferred type (or base type in a language with subtyping) is used to determine which of several possibilities will be used at a given call site, and that determination is made at compile or link time (or some other time before program execution starts) and is thereafter invariant for a given deployment or run of the program."

        • DNF2 3 years ago

          The dispatch is based on the "runtime type", but that does not necessarily "happen at runtime", because the runtime/dynamic type can often be determined statically.

    • hnarayanan 3 years ago

      Yes, thank you! I keep re-learning and forgetting this. Trying to research now and see how this manifests in practice.

  • bicepjai 3 years ago

    This video helps understand the difference. https://youtu.be/kc9HwsxE1OY

adenozine 3 years ago

I’ve made most of my career turning scientific and mathematical code into maintainable and aesthetic code, and the red flag for me in this article is that he evidently couldn’t keep up with the Python learning curve and chose instead a language with no traits, no interfaces, and no classes. So, the amount of organization in his code is effectively zero.

I understand that Julia 2.0 is slated to have some sort of concrete interface mechanism, so that’s good. Thus far, I’ve seen some pretty low-quality results. There’s just no way to have intuition about what method is going to be called in Julia. In Python, I know it’s either going to be somewhere in dir(some_obj) or it’s going to be some funky metaclass stuff. Either way, PyCharm can literally just hyperlink me.

Until Julia has the same capability, it just won’t be suitable for general purpose code. I know there will be some Julia fan in the replies about how I can approximate the behavior, and how Julia is the future and blah blah blah.

Just fix interfaces. It’s not that hard. They’ve got MIT grads for crying out loud!

I’m a little appalled there’s PhDs doing computer science work with public money that can’t wrap their head around python. That’s a failed curriculum imo.

  • andyferris 3 years ago

    I find this a bit perplexing - it is possible to make organised, maintainable code without class-based polymorphism or statically enforced interfaces (traits, type classes). And Julia is so much easier than several other languages I know to organise your code in a neat, modular way (while still being quite generic and reusable).

    Better static analysis tools (or traits/interfaces in the type system) would of course be welcome. But in my experience that’s more to catch silly mistakes and typos than to aid in healthy modularity or easy discoverability (which to me are remarkably good already).
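On discoverability, Julia does ship runtime introspection for "which method will this call hit": `methods` and `@which` are standard tools (the `area` function below is my own toy example).

```julia
using InteractiveUtils  # provides @which (auto-loaded in the REPL, not in scripts)

# A generic function with two methods.
area(r::Real) = pi * r^2         # circle from a radius
area(w::Real, h::Real) = w * h   # rectangle from width and height

@assert length(methods(area)) == 2

# @which reports the exact method definition a call would dispatch to.
println(@which area(2.0))   # points at the one-argument method
println(@which area(3, 4))  # points at the two-argument method

@assert area(3, 4) == 12
```

This is a REPL-level answer rather than an IDE hyperlink, but editors built on the language server can follow the same information to a method's source location.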

  • mxkopy 3 years ago

    > no traits, no interfaces, and no classes. So, the amount of organization in his code is effectively zero.

    This is an utterly deranged take. Do you mean to say that adherence to OOP and code organization are the same thing ?

  • BrandonS113 3 years ago

    That is harsh. I know PhDs doing comp science work, even with PhDs in comp science, coming to the same conclusion. Python is an excellent language. But coding in numpy is not its strength.

  • jpfr 3 years ago

    I'm surprised. At JuliaCon ~6 months ago the message was that no 2.0 is in the works [1], i.e. no backwards-incompatible changes to the language.

    I checked the usual places and did not find any information on 2.0 and an interface mechanism. Do you have a pointer?

    [1] https://youtu.be/N4h46_TCmGc?t=1656

    • adgjlsfhk1 3 years ago

      No, he's just making stuff up.

      • adenozine 3 years ago

        Oh, well here's a post by Jeff Bezanson, since you've decided to just talk out of your ass...

        https://discourse.julialang.org/t/what-dont-you-like-about-j...

        Hope the uh... co-creator warrants enough merit as a source.

        • jpfr 3 years ago

          So the 2 year old post by Jeff on discourse says that breaking changes would have to go into an eventual 2.0.

          Whereas in the 6 month old keynote the same Jeff says there is no plan for breaking changes and hence no 2.0 is planned.

          So Julia will remain stable. And maybe we get interfaces (or rather traits) at some point in the 1.x release family.

          • adenozine 3 years ago

            Sure, if it happens, then I'll change my tune pretty quick.

            I don't stay plugged in beyond reading the changelog whenever a new version comes out, so I didn't know that bit about the keynote. I've noticed as well that the Julia community produces an absolutely extraordinary amount of conference talks/video content, I don't have the time for the finer details.

            I've also noticed a distinct and crucial lack of long-term vision for Julia from the co-founders. I've also read some dramatic (unverified) claims that I won't repeat here about one of them. Frankly, I think a better group of people could be assembled to steward the language and its ecosystem, but I don't see much chance of that happening. Julia exists in a weird little place where being terrible to read and maintain doesn't matter as long as it lives up to its promises of being very, very fast, in most of the applications it is used for. That's all well and good, but you don't build a solid foundation for a community to really depend on that tool for bedrock tasks.

            I've been unfortunate enough to read some .jl code in '22, and it was dreadful. I truly don't understand how multiple dispatch makes anybody's life easier, it's an absolute nightmare of unmaintainable code that calls any of dozens or hundreds of methods, the performance of the entire application essentially dependent on whether or not that type is stable and of course, there's no way to know that without reading the definition of the unknown method that gets called.

            Personally, I have my eyes peeled for github.com/exaloop/codon for higher performance stuff with Python. It's already an order of magnitude faster than pypy for most cases, and equally more usable for practical work than Julia, imo.

            Anyhow, when it's all said and done, there's a lot of computing being done and a lot of money changing hands and so forth. All's well that ends well, despite never really being done well. /shrug

            • DNF2 3 years ago

              > I don't stay plugged in beyond reading the changelog whenever a new version comes out

              > I've also noticed a distinct and crucial lack of long-term vision for Julia from the co-founders.

              Don't you find those two statements contradictory? There's a pretty striking contrast between your claim to ignorance and your confidently sweeping generalization (sadly, those two often do come in pairs).

              > I've also read some dramatic (unverified) claims that I won't repeat here about one of them.

              If this is what I think it is, the claims were not substantiated in any way (even though it would be easy), and seemed quite outlandish, frankly.

              > I've been unfortunate enough to read some .jl code in '22, and it was dreadful. I truly don't understand how multiple dispatch makes anybody's life easier

              It's really hard for me to understand this opinion, given that Julia code, to me, is far more appealing than all the most common alternatives. In particular, multiple dispatch is such an obviously natural paradigm that it's hard to fathom why everyone doesn't just 'get it' right away. I mean, not taking all input argument types into account now seems to me like a completely artificial, even perverse, restriction. Why?

              I guess this is why people argue on in the internet.

              • adenozine 3 years ago

                Ah, well I should’ve clarified. I was very, very excited about the language from about 0.4 to 1.2ish.

                In that time, I was nearly obsessive in reading all the news, though I wasn’t writing any jl code at all during that time. After 1.0, I used it for a few little things here and there in my business, some basic csv munging and a few other one-off tasks. It did fine, ofc.

                The claims, yeah, I mean, I don’t know the parties involved personally, but we are not far apart in the social graph and I don’t need the headache this close to retirement. I love my opinions very much, but not when they involve personal controversy.

                Multiple dispatch is a write-only benefit. In my line of work, I might come across different machines, different memory setups, different instruction sets, different c compilers, different latency profiles, different spoken language as documentation, etc.

                I’m not a mathematician, but I clean up and leave businesses with pragmatic, clean, maintainable code after the mathematicians do their thing and move on. For me, good code is about how specifically and how clearly I can communicate an idea so that the next person (who won’t know shit about shit) can add a feature without breaking something, change a deploy, fix a small bug, update documentation, etc, without breaking things. In that regard, Julia code reads like someone’s manuscript about the issue and not a program written by a programmer.

                To you, that’s desired. To me, that’s my actual worst nightmare because it means the longest mental checklist between getting onsite and leaving a finished product and getting paid.

                You mention not taking all argument types, it’s just so stereotypically mathematician of an assumption that the code will just work with whatever garbage you send it.

                Because maybe that function needs to run on a 32bit system somewhere, and it will fail?

                Maybe some of the other methods have a bug, and you won’t notice until you use the original method with a certain type, that you hadn’t considered? Julia can’t prove this sort of thing, to my knowledge.

                I’m not saying python can, necessarily, but I know that there are very strong confidence levels and that if something borks, I can arrive at the scene of the crime within a few minutes of any given stacktrace and docker image, or whatever.

                I don’t mean to argue. It’s clear we disagree. That’s fine. I just can’t take Julia seriously because they treat data like math on a whiteboard, and it’s just a ridiculous way to program a computer. Like I said, it gets a lot of computing done, and a lot of people are going home paid well. It’s all good, to that end.

                Edit: I likely won’t reply further, it’s really annoying to scroll back in HN to see replies without linked notifications. Be well!

                • DNF2 3 years ago

                  > In that regard, Julia code reads like someone’s manuscript about the issue and not a program written by a programmer. To you, that’s desired. To me, that’s my actual worst nightmare

                  That's a pretty dubious attribution of intention, though, that I desire this. I want clean, maintainable code.

                  > You mention not taking all argument types, it’s just so stereotypically mathematician of an assumption that the code will just work with whatever garbage you send it.

                  But that's not what I said, and it's not what multiple dispatch means. You are talking about generic functions and duck typing -- essentially, code that has no type restrictions. Multiple dispatch means that types of all arguments are considered, and you are free to be as restrictive as you wish. You can specifically and concretely type every single input, and probably make the code much more predictable and to your liking.

                  The amount of genericity vs type safety is a trade-off between different advantages, and you have a lot of freedom to choose.

        • adgjlsfhk1 3 years ago

          You misread (although it wasn't especially clearly phrased):

          "Interfaces are a really common topic of discussion and I think at this point we’re determined to do something about it in julia 2.0 (if it requires breaking changes)."

          This means we really want a solution for interfaces and if we had a good enough design for interfaces that would require 2.0, they are important enough that it could be worth breaking existing code (and releasing a 2.0 with interfaces). However there still isn't a plan for interfaces (breaking or non-breaking).

          • adenozine 3 years ago

            If we are to continue this conversation, I need it to be addressed that you falsely characterized my top level comment with “he’s making stuff up.”

            Otherwise, further correction would be wasting my time.
