Over the past two years, the data ecosystem has evolved rapidly, with new tools emerging every month in the modern data stack. In a hype cycle, it becomes hard to distinguish the signal from the noise. Which of those tools will end up as simple features, and which will become actual products that we will still be using in a few years?
In addition to the growing number of tools, we've seen a few new trends, such as declarative approaches appearing everywhere (infrastructure as code with Kubernetes, orchestration as code, and even integration as code).
Other trends include the rise of the Semantic Layer, Rust becoming the future of performance-intensive applications in data (potentially replacing Spark eventually), and even data modeling coming back with the explosion of the modern data stack. All this is without mentioning AI and vector-based engines used for small data, such as DuckDB, along with newer ones such as Pinecone and Qdrant that support the AI wave behind the scenes.
So much is going on that we had to take a step back and look at the evolution of the data engineer.
To make sense of it all, we need all the data we can get. Fortunately, this State of Data 2023 survey is the largest data engineering survey made to date. It will help us take a step back and understand what the community is using and excited about, and what is noise or signal in the modern data stack.
The research first details the demographics of the survey participants. Then it goes through the data stack, as well as the blogs, podcasts, and newsletters that we follow most.
The best insights are usually discovered when using the filters at your disposal, per company size and per experience, so you can drill down on the information that matters most to you.
Now, let's see what we discovered with the survey.
This is the largest data engineering survey made to date.
Where do you currently reside?
Source: Airbyte
How many years of experience in your current field do you have?
Source: Airbyte
If you're currently employed, how large is your company?
Source: Airbyte
If you're currently employed, how large is the data team at your company?
Source: Airbyte
Which option best describes your current role?
Source: Airbyte
Is your data team currently hiring?
Source: Airbyte
Cash compensation ($K USD) vs Years of experience
Source: Airbyte
Cash compensation ($K USD) vs Company Size
Source: Airbyte
Cash compensation ($K USD) vs Location
Source: Airbyte
Brand recognition and adoption - Data Ingestion
Source: Airbyte
Extra poll - people care most about Correctness, Stability, and Performance for data integration
Source: Airbyte
Extra poll - more than 30% of teams maintain more than 10 connectors
Source: Airbyte
Brand recognition and adoption - Data Transformation
Source: Airbyte
There are a few things that I find surprising and exciting about the State of Data survey. First, I'm surprised that Pandas is still leading the pack for data transformations. This is also exciting because it points to a need for continued education and development around new tooling like Polars, which has a lot to offer. I find it surprising that Databricks isn't used more, but that also has a bright side: there is a lot of room for growth toward those tools. That is exciting both from an education perspective, as a content creator, and because it points to exciting times still ahead for the Data Engineering community as more people and teams adopt new technologies.
Brand recognition and adoption - Data Warehouses
Source: Airbyte
Brand recognition and adoption - Data Orchestration
Source: Airbyte
It's unsurprising to find data quality as the number one concern of data engineers. Yet no one seems to own it across companies, which will become an increasingly important issue. I also found it interesting that most people are using self-hosted Airflow, even though we have heard that Airflow is very difficult to set up. I believe the choice between self-hosted and managed Airflow will, in the future, depend more on what those managed solutions bring to quickly onboard teams, give a better dev experience, and help solve quality/observability issues.
Brand recognition and adoption - Business Intelligence
Source: Airbyte
Brand recognition and adoption - Data Quality
Source: Airbyte
I am particularly happy to see the growth of Data Quality tools that have evolved for good. This signals that maturity is coming along. It's no shocker to me that Airbyte is still leading the way for the Data Ingestion layer.
Brand recognition and adoption - Reverse ETL
Source: Airbyte
Brand recognition and adoption - Data Catalog
Source: Airbyte
Amazing Data Engineering survey! I highly recommend checking out the insights into the adoption of engineering tools, from data ingestion and transformation to reverse ETL and data catalogs. That section was my highlight. Congratulations to Airbyte for leading the Data Ingestion section.