tldr: it used to be hard to munge data. Now that AI puts most data science skills in everyone’s hands, data scientists must adapt.
The myth of the data scientist as a magical unicorn coming into an organization to unlock insights and hidden value from existing dormant data lying around was probably never true. But like all good myths, it did inspire and influence how data scientists were perceived and perceived themselves. This perception influenced things like salaries, team and organizational structures, team processes, and levels of domain knowledge. As the friction of generating insights from clean data drops and the number of people who can explore data expands, the myth of the unicorn no longer holds.
In my job talk in the last year of my PhD, in the spring of 2010, I used the recently published cover of the Economist issue on the Data Deluge as the title page for my presentation, to highlight my currency with the emerging big data era and the new science it promised.
I’d been a daily R user for a few years at that point, and had just built a surveillance system that downloaded the most recent field surveys from an email server, structured the survey data into clean csv files, inserted them into a raw postgres table, cleaned and normalized the data into the application schema, integrated environmental data, ran some stats and models, made some graphs, and emailed out reports. In today’s vernacular, we would use words like orchestration (i.e., shell scripts and cron), ETL (i.e., psql calls), medallion architecture (messy data cleaned up into cleaner data), MLOps (my R files, other analytics software, and some logging), and maybe a BI layer (HTML reports emailed out to people, built I believe with Sweave, which mashed up R with LaTeX). So I was feeling pretty confident overall with my data skills, which I saw as a means to do better and different kinds of science, not an end in themselves.
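In today’s terms, the flow of that kind of system can be caricatured in a few lines. This is a toy sketch only; every function name and the data here are invented for illustration, not the actual system:

```python
# Hypothetical sketch of the pipeline stages described above, with invented
# names and toy data standing in for the real surveillance system.

def fetch_surveys():
    # "download the most recent field surveys" (cron would call this)
    return ["site=A1,count=3", "site=B2,count=7"]

def to_rows(raw):
    # "structure the survey data" into dict rows (the csv/raw-table stage)
    return [dict(kv.split("=") for kv in line.split(",")) for line in raw]

def normalize(rows):
    # "clean up and normalize" into the application schema (typed fields)
    return [{"site": r["site"], "count": int(r["count"])} for r in rows]

def report(rows):
    # "run some stats ... and email out reports" (here, just a summary string)
    return f"total count: {sum(r['count'] for r in rows)}"

print(report(normalize(to_rows(fetch_surveys()))))  # -> total count: 10
```

Each stage maps onto one of the vernacular terms above: orchestration calls `fetch_surveys`, ETL is `to_rows` plus `normalize`, and the BI layer is `report`.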
Looking back today, the data pipeline was probably not the most valuable part of that work, despite what I may have thought at the time. The value was in the new data, the relationships we built during the project, the hands-on workshops we gave to participants and students, getting organizations talking to each other about data when they hadn’t before, and the publications that flowed out of it. In the end the process maybe worked as designed, but for a period the tech took centre stage. This, I think, resembles the moment we’re in with data science: as AI takes on the mechanics of working with data, the era of tech dominance is ending and the era of real results and impact will soon be upon us.
By 2022, a decade had passed since data science was heralded by the media as the Sexiest Job of the 21st Century. In both industry and academia, data science and data-intensive analytics were aimed at uncovering new insights from existing data. First this was relational data, in the analytics data warehouse era, soon followed by all sorts of data - with the move to Hadoop, s3/parquet, and eventually the lakehouse-type platforms we see today.
As the toolchain became more standardized, the relative value of the expertise required to apply these tools declined. When anyone can load up sklearn, load some data, and hit model.fit, tool use has decreasing value as an area of specialization. Since tools are applied to solve problems, the degree of specialization has to shift as the tools become more ubiquitous. We have few experts in keyboarding and typing today; instead we have writers.
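To make the low-friction point concrete, this is roughly all it takes with scikit-learn’s bundled iris dataset - a sketch, not a recommendation:

```python
# The near-zero-friction workflow: load a toy dataset, fit a model.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# High training accuracy - and largely meaningless on its own.
print(round(model.score(X, y), 2))
```

The hard parts - framing the question, validating on data the model will actually see, and interpreting the result for stakeholders - are exactly what this snippet leaves out.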
The point of this article is that with the onset of AI, there is no more low-hanging fruit. To succeed, data scientists must specialize: in highly technical roles or in highly specialized domain expertise. The era of the generalist data scientist is, I think, over - as it should be.
The term ‘low-hanging fruit’ is one I almost never heard in academia, but have heard frequently in industry. The notion that we attack the easy-to-access problems, that we start with the simple stuff, is almost never questioned. But most high-value data that could be analyzed has been analyzed at this point - the easy analytics are done.
How many ‘big data analytics’ projects were spun up that resulted in tons of computation and modelling, only to arrive at totally obvious and well-known conclusions? This continues today with AI. As the first wave of AI demos and PoCs has demonstrated, with buggy Clippys unveiled on almost every major website and application, it is not that difficult to stitch together a RAG app - it is much more difficult to create an AI assistant that works reliably and is highly valued by actual users.
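To illustrate how little ‘stitching together’ the retrieval half of a RAG app can involve, here is a naive keyword-overlap retriever over an invented two-document corpus. Everything a reliable assistant needs - embeddings, the LLM call, guardrails, and above all evaluation - is exactly what this sketch omits:

```python
# Toy retrieval step for a RAG app: pick the document sharing the most
# words with the question. Corpus and question are invented examples.
import re

DOCS = {
    "returns": "Items may be returned within 30 days with a receipt.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

def tokens(text):
    # Lowercased word tokens, punctuation stripped.
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question):
    # Naive relevance: size of the word overlap with the question.
    q = tokens(question)
    return max(DOCS.values(), key=lambda doc: len(q & tokens(doc)))

print(retrieve("how many days for shipping?"))
```

A real system would then pass the retrieved text plus the question to an LLM; the demo-to-product gap is that none of the failure modes of this step are visible until real users hit them.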
In the late 2000s and early 2010s, it would have been impossible to be detached from real-world problems, because you learned the tools to solve the problems at hand rather than learning the tools as an end in themselves. When the commodification of data science meant that people who had only ever worked with the iris or Kaggle datasets could consider themselves ‘data scientists’, a whole class of practitioners came up having never worked on a real problem (regardless of how well specific dimensions of that problem are represented in data). The danger of this is that you generate very little value: you do things with data that do not help anyone.
The more general point here is that value is generated by friction - a well-known idea if you google it (not all scarce things have value, but most valuable things have scarcity). So when AI eliminates most of the friction associated with working with data, the mere ability to work with data loses its value. Data scientists need to adapt to remain valuable parts of their organizations.
One area of adaptation is to work tirelessly to understand the data. The process of applying data science techniques to clean data has almost no friction, and almost no value. But most data is not clean: most organizations continually build and rebuild systems, change data models, run special projects, and change identifiers. All of these things add friction to the process of learning from data, generating new findings, and optimizing existing processes. It is indeed possible that at some point, with enough context and organizational knowledge, AI can integrate these factors to produce credible insights. But for the foreseeable future there remains a significant role for data scientists who bring a deep understanding of the data-generating processes to help build systems with AI. Consider this an adaptation toward data engineering, but data engineering informed by the domain. This is not about creating data pipelines. Moving data from postgres to csv to parquet is even more automatable today than exploratory data analysis. It is about knowing that last February the data feed went down silently, and that when it came back the data type changed because the source system changed. Or that when that new system was integrated, not all departments were loaded, so the baseline frequencies differ systematically.
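As a minimal sketch of what such domain-informed checks can look like, assuming a hypothetical survey schema, a row-level type check would have flagged the silently changed feed the moment it came back:

```python
# Hypothetical sketch: a schema check that catches the kind of silent
# change described above (a feed resuming with a different data type).
# The schema and the example row are invented for illustration.
EXPECTED_SCHEMA = {"site_id": str, "survey_date": str, "count": int}

def check_row(row: dict) -> list:
    """Return a list of schema violations for one parsed row."""
    problems = []
    for col, expected_type in EXPECTED_SCHEMA.items():
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], expected_type):
            problems.append(
                f"{col}: expected {expected_type.__name__}, "
                f"got {type(row[col]).__name__}"
            )
    return problems

# After the upstream change, 'count' starts arriving as a string:
print(check_row({"site_id": "A1", "survey_date": "2024-02-01", "count": "12"}))
# -> ['count: expected int, got str']
```

The check itself is trivial; knowing which checks matter, and why the feed changed, is where the domain knowledge lives.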
This type of data specialization generally requires time, and is directly proportional to time spent at an organization.
A second area of adaptation is to dive deep into domain knowledge. This may traditionally have been the role of the data analyst, so consider this an adaptation in that direction. You can specialize in a domain by collaborating and putting in the work to understand, as much as possible, what the data are meant to represent. There is usually a nexus of actual domain knowledge and information-systems knowledge that combines to create the data models, processing, and representations of the data.
You may not become an expert in the field, but you can build what sociologists of expertise call interactional expertise: the ability to converse with domain experts, to probe problems, and to understand the main debates in the field, so that you - the data scientist - can frame the data, the approach, and the interpretation of results in a way that tracks with your stakeholders. This can’t be learned overnight, but it can be learned.
For example, one of my first academic publications was a paper in Forest Science on the spread of a forest insect, the mighty Dendroctonus ponderosae. I am neither an entomologist nor a biologist. But I could read enough, learn enough, and talk to enough experts in the field to understand how beetles disperse, how pheromones interact with tree density, and the candidate hypotheses for how this impacts the rate and direction of spread. With this understanding, I could analyze patterns in infested trees and draw some conclusions about these processes. As an area of adaptation this is not easy; it requires hard work, humility, and extensive collaboration, but it is possible.
Data scientists with interactional expertise, of course, exist today and have been and will continue to be important parts of their organizations. However, the approaches to interdisciplinarity, collaboration, and technical integration with domain specialists remain to be formalized.
The purest data science type of adaptation is to specialize in a particular tech stack, model, or class of models. This is arguably the most difficult to achieve, as you either have to be in a research role innovating on the development of models themselves, at a university or frontier lab, or work in an area niche enough that little exists for it in generally available software tools.
Without question, the agentic systems, architectures, and tools proliferating today will continue to do so, and data scientists will be at the forefront of building high-value systems with this technology. Evaluation will become an increasingly important technical focus here: the need for human oversight of agentic systems working with data remains, and data scientists are still best suited to check the results, parameters, and decisions these systems produce.
However, we are in a sense in the 2012 data science era of agentic AI, where just knowing how to stitch together MCP servers, skills files, LLM calls, token budgets, and deployment configurations is a high-value skillset and probably the key area of activity across the field. But as a decade earlier, the tech stack will consolidate and mature. For example, the field has already shifted from differentiation of models to differentiation of model harnesses, memory, tools, and skills. We need to avoid replicating the past: focusing on tools over outcomes, and tech over impact.
As my most recent experience building a machine-learning-model-based application reminded me (two hours from idea to dashboard with model results), there remains a lot of nuance to building and interpreting models that are useful. While AI-created artefacts are often described in terms of time saved, the value payoff is yet to be demonstrated. In the rare cases where you have clean data, robust pipelines, and clear targets for modelling, perhaps much of this will be AI-generated or in the hands of domain specialists.
Perhaps this is the role of the ML researcher, the AI scientist, or a yet-to-be-coined role. Even if researcher-agents enable more rapid progress, people are needed to craft tools into solutions, to expand on tools and frameworks, and to push the technology to its limits. But this will be a small proportion of data scientists, as most organizations do not invest in building technology; they use technology for their primary purpose.
Adaptations in the spirit of those discussed here move us toward a more challenging but potentially more fruitful era of data science, where the work is closer to value, more embedded with end users, and requires deeper collaboration. Data scientists become giraffes, with adaptations that enable them to specialize to the environment around them, now that all of the low-hanging fruit are consumed by AI agents. In reality, though, those AI agents work in the service of data science, and data scientists must be accountable for the insights and decisions based on their outputs.
In time, Giraffe Data Science begins to resemble more just plain old applied science - understanding the problem, selecting the right tools, modifying or extending tools when the problem changes, interpreting results, and charting out new questions.
Views here are my own and do not represent my current or previous employers.

