State of Data 2023

Over the past two years, the data ecosystem has evolved rapidly. New tools emerge every month in the modern data stack, and in a hype cycle it becomes hard to distinguish the signal from the noise. Which of those tools will end up as simple features, and which will become actual products that we are still using in a few years?

Beyond the growing number of tools, we've seen a few new trends, such as declarative approaches appearing everywhere (from Kubernetes, where we have infrastructure as code, to orchestration as code and even integration as code).

Other trends include the rise of the Semantic Layer, Rust becoming the future of performance-intensive applications in data (potentially replacing Spark eventually), and even data modeling coming back with the exposing of the modern data stack. All this is without mentioning AI and vector-based engines being used for small data, such as DuckDB, along with newer ones, such as Pinecone and Qdrant, that support the AI wave behind the curtains.

So much is going on that we had to take a step back and look at the evolution of the data engineer.

To make sense of it all, we need all the data we can get. Fortunately, this State of Data 2023 survey is the largest data engineering survey made to date. It will help us take a step back and understand what the community is using and excited about, and what is noise or signal in the modern data stack.

The research first gives details on the demographics of the survey participants. Then it goes through the data stack, as well as the blogs, podcasts, and newsletters that we follow most.

The best insights are usually discovered when using the filters at your disposal, per company size and per experience, so you can drill down on the information that matters most to you.
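For readers who prefer to work with raw responses rather than the interactive filters, the same kind of drill-down can be sketched in a few lines of Python. This is only an illustration: the rows, the field names (`company_size`, `years_experience`, `orchestrator`) and the helper functions are all hypothetical, and the values are made up rather than taken from the survey.

```python
from collections import Counter

# Hypothetical rows for illustration only; these are NOT actual survey results.
responses = [
    {"company_size": "1-50", "years_experience": 2, "orchestrator": "Airflow"},
    {"company_size": "1-50", "years_experience": 7, "orchestrator": "Dagster"},
    {"company_size": "1000+", "years_experience": 4, "orchestrator": "Airflow"},
    {"company_size": "1000+", "years_experience": 9, "orchestrator": "Airflow"},
    {"company_size": "51-1000", "years_experience": 6, "orchestrator": "Prefect"},
]


def drill_down(rows, company_size=None, min_years=None):
    """Return the rows that match the optional company-size and experience filters."""
    selected = []
    for row in rows:
        if company_size is not None and row["company_size"] != company_size:
            continue
        if min_years is not None and row["years_experience"] < min_years:
            continue
        selected.append(row)
    return selected


def tool_counts(rows, field):
    """Count how often each value of `field` appears in the given rows."""
    return Counter(row[field] for row in rows)


# Filter per company size, then per experience, and tally one tool category.
large = drill_down(responses, company_size="1000+")
senior = drill_down(responses, min_years=5)

print(tool_counts(large, "orchestrator"))
print(tool_counts(senior, "orchestrator"))
```

Combining both arguments in one call narrows the slice further, which mirrors applying two filters at once on the report page.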

Now, let's see what we discovered with the survey.

This is the largest data engineering survey made to date.

Where do you currently reside?

Source: Airbyte

How many years of experience in your current field do you have?

Source: Airbyte

If you're currently employed, how large is your company?

Source: Airbyte

If you're currently employed, how large is the data team at your company?

Source: Airbyte

Which option best describes your current role?

Source: Airbyte

Is your data team currently hiring?

Source: Airbyte

Cash compensation ($K USD) vs Years of experience

Source: Airbyte

Cash compensation ($K USD) vs Company Size

Source: Airbyte

Cash compensation ($K USD) vs Location

Source: Airbyte

Brand recognition and adoption - Data Ingestion

Source: Airbyte

Extra poll - people care most about Correctness, Stability, and Performance for data integration

Source: Airbyte

Extra poll - more than 30% of teams maintain more than 10 connectors

Source: Airbyte

Brand recognition and adoption - Data Transformation

Source: Airbyte

There are a few things that I find surprising and exciting about the State of Data survey. First, I’m surprised that Pandas is still leading the pack for data transformations. This is also exciting because it points to a need for continued education and development around new tooling like Polars, which has a lot to offer. I also find it surprising that Databricks isn’t used more, but that has a bright side too: there is a lot of room for growth toward those tools. Both from an education perspective, as a content creator, and more broadly, it points to exciting times ahead for the Data Engineering community as more people and teams adopt new technologies.

Brand recognition and adoption - Data Warehouses

Source: Airbyte

Brand recognition and adoption - Data Orchestration

Source: Airbyte

It's unsurprising to find data quality as the number one concern of data engineers. Yet no one seems to own it across companies, which will become an increasingly important issue. I also found it interesting that most people are using self-hosted Airflow, even though we have heard that Airflow is very difficult to set up. I believe the choice between self-hosted and managed Airflow will, in the future, depend more on what those managed solutions bring to quickly onboard teams, give a better dev experience, and help solve quality/observability issues.

Brand recognition and adoption - Business Intelligence

Source: Airbyte

Brand recognition and adoption - Data Quality

Source: Airbyte

I am particularly happy to see the growth of Data Quality tools that have evolved for good. This signals that maturity is coming along. It's not a shocker to me that Airbyte is still leading the way for the Data Ingestion Layer.

Brand recognition and adoption - Reverse ETL

Source: Airbyte

Brand recognition and adoption - Data Catalog

Source: Airbyte

Amazing Data Engineering survey! I highly recommend checking out the insights into the adoption of engineering tools, from Data Ingestion and transformation to reverse ETL and Data Catalogs. That section was my highlight. Congratulations to Airbyte for leading the Data Ingestion section.