What it's like to be on the data science job market
treycausey.com
Outstanding post, especially the supplemental info from others. Most of the opportunities I've seen in DS have also emphasized engineering over science. (Maybe that's due to my job history.)
I've also wondered what fraction of DS employers use Hadoop without having enough data to warrant it. The DJIA giant-pharma company where I work certainly doesn't.
That's bog standard -- every company uses hadoop. Then when you see the actual datasets, they're maybe a couple hundred gigs completely denormalized. Yet you still have to use hadoop/hive/spark to access them, with all the inefficiencies, complexity, and slowness those bring.
One very annoying thing Trey skipped -- he covered only the first two -- is that the big divisions in the data science field are data scientist/analyst, data scientist/builder, and data engineer/ETL. Data scientists' work sits on top of a giant pile of data engineering, and companies often (imo intentionally) try to hire data scientists by dangling interesting analysis or implementation work; but when you dig deep enough, or worse, accept the offer, it's really 80%+ data engineering. (And they get pissy when you quit two months in after discovering this, both because that's not what you signed up for and because relationships founded on lies tend not to work out well for employees.)
The other very difficult thing is project tests: it's hard to test anything deeply in 5 hours. Even when companies claim to want to test statistics knowledge, the tests almost always end up dominated by data ingestion/cleaning work. Or they're simply too much work. E.g. Stitchfix wanted me to spend 10+ hours implementing an analysis after just speaking to a recruiter, without ever having spoken to one of their data scientists because they were "too busy". The recruiter was grumpy when I stopped responding to email.
> Then when you see the actual datasets, they're maybe a couple hundred gigs completely denormalized. Yet you still have to use hadoop/hive/spark to access them, with all the inefficiencies, complexity, and slowness those bring.
I was always under the impression that one of the benefits of NoSQL was its speed, but watching a webcast the other day, I was shocked at how slow a query on a very small dataset was -- in contrast to another demo where a different query was mind-bogglingly fast compared to comparable performance on a traditional SQL platform. (Yes, I know the particulars matter here and it's not a good question without that specificity, but any light you could shed on this would be appreciated.)
For data of "a couple hundred gigs", what platform would you say is more appropriate?
No, the benefit of NoSQL, at least for data science, is scalability, i.e. what you do when you can't fit the data on a single machine. That worked great at a former employer, which really did have PB-scale datasets. The vast, vast majority of companies do not have PB-scale datasets. Most don't even have TB-scale ones.
As for what to do: Postgres/MySQL, pandas/R, or roll your own code, depending on precisely what you need. You can rack a pretty beefy box with 256 GB of RAM, 2 Xeons, and a ton of SSD + spindle disk for $10k. At that scale, there's nothing NoSQL or Hadoop or Spark does that can't be done more easily, written way faster, executed faster, and kept running with less effort on a single box -- or better yet, in a single process.
For example: at my current gig, I work on 20-40 GB raw datasets. Ingesting to pandas and externalizing the user agent strings drops it to 5 GB or so. That process takes 30 to 60 minutes, but I do it once, cache the results, and update incrementally.
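The "externalize user agent strings" step above can be sketched like this -- a minimal pandas illustration, assuming the details (column names, the parquet cache paths) since the comment doesn't give them. The idea is just to swap a repeated high-cardinality string column for small integer codes plus a lookup table:

```python
# Sketch of externalizing a high-cardinality string column with pandas.
# Column names and cache paths are illustrative assumptions, not from the post.
import pandas as pd

def externalize_column(df, col):
    """Replace a string column with integer codes plus a separate lookup
    table; when the same strings repeat across millions of rows, the main
    frame shrinks dramatically."""
    codes, uniques = pd.factorize(df[col])  # codes in order of first appearance
    lookup = pd.DataFrame({f"{col}_id": range(len(uniques)), col: uniques})
    slim = df.drop(columns=[col]).assign(**{f"{col}_id": codes})
    return slim, lookup

# Toy stand-in for a 20-40 GB raw log: a few rows with repeated UA strings.
raw = pd.DataFrame({
    "url": ["/a", "/b", "/a", "/c"],
    "user_agent": ["Mozilla/5.0 (X11)", "curl/7.64",
                   "Mozilla/5.0 (X11)", "curl/7.64"],
})
slim, ua_lookup = externalize_column(raw, "user_agent")

# Cache once, then append new rows incrementally instead of re-ingesting:
# slim.to_parquet("cache/events.parquet")
# ua_lookup.to_parquet("cache/user_agents.parquet")
```

On real data you'd likely reach for `Categorical` dtype or parquet's dictionary encoding, which do the same trick transparently.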
Postgres, or, depending on the particulars, just start rolling your own.
For anyone who isn't a data scientist, much of this applies to any interview experience, especially the latter part of the article.
Recruiters and companies tend to flip out when I push back, but you know what? That is an excellent signal that this is the wrong job for you. I only have a very brief time to form an impression of your company and your team; why come to me with behavior that's irrational or unsupported by the data? I really don't understand it. Everyone claims hiring is really hard, and then they do everything they can to alienate the interviewee, then wonder why the offer is turned down, or why the person failed to perform on demand some stupid coding trick they last saw, maybe, 20 years ago in a classroom.
Make the interviewee like you and want to work for you. That shouldn't be hard to understand. Then figure out what work you need to have done, and talk to them about it. It'll be readily clear in most cases. If you are lucky and land a live one, their mind will stray far beyond the constraints of your little problem; they'll have pretty much grasped your business and your problems and be full of ideas for improving them all. If not, you probably still have a good worker (if they are able to do the work, i.e. didn't lie on their resume).
I would add to this article: do what you can to see the source code[1]. If you can't, often questions can expose what it is like. Most won't give good answers, but if you are put through the normal wringer, one of the 6-12 people you talk to will be fairly open and honest. Every place has warts and limitations -- the question is whether these are due to unavoidable tradeoffs (jump on board) or a horrible culture/infrastructure (run away unless you are being very, very well compensated to fix the problem).
[1] trawl the github/bitbucket page of every engineer if you have to, or of course the company's pages if they do open source. It's surprising how much undocumented spaghetti is released by companies in 'support' of their products. I'm mulling pulling it up on a laptop and doing a little code review if the questions for me get silly. But realistically, I'll probably not accept the offer to interview if it is really bad.
Has anyone tried asking technical questions back? E.g. a list of Putnam questions.
Can you fit a parabola of arc length 4 inside the unit circle?
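Assuming this is the classic Putnam reading of the question (can a parabolic arc inside a circle of radius 1 have length greater than 4? -- in which case length exactly 4 follows by continuity, shrinking the parabola until the length drops to 4), here's a quick numeric check. Take the parabola $y = kx^2 - 1$, vertex at the bottom of the circle; it meets $x^2 + y^2 = 1$ at $x = \pm\sqrt{2k-1}/k$, and since $y' = 2kx$, the arc length $2\int_0^{x_0}\sqrt{1 + 4k^2x^2}\,dx$ has a closed form via $\operatorname{asinh}$:

```python
# Numeric check: a parabolic arc inside the unit circle can exceed length 4.
# Parabola y = k*x**2 - 1 (vertex touching the bottom of the circle);
# it intersects x**2 + y**2 = 1 at x = +/- sqrt(2k-1)/k.
import math

def arc_length(k):
    x0 = math.sqrt(2 * k - 1) / k        # intersection with the circle
    a = 2 * k                            # slope factor: y' = 2kx
    # antiderivative of sqrt(1 + a^2 x^2): x*sqrt(1+a^2x^2)/2 + asinh(ax)/(2a)
    return 2 * (x0 * math.sqrt(1 + (a * x0) ** 2) / 2
                + math.asinh(a * x0) / (2 * a))

print(arc_length(1))      # a shallow parabola: well under 4
print(arc_length(100))    # a steep one: just over 4
```

For large $k$ the arc hugs the vertical diameter (down one side, up the other, limiting length 4), but for moderate $k$ the bulge pushes it slightly past 4 -- so the answer is yes.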