Stack Overflow in 2023: Predicting with ARIMA and BigQuery

Press enter or click to view image in full size

Can you predict the top Stack Overflow tags of 2023? BigQuery makes this easy, with its new support for training time-series analysis models with ARIMA.

Important update: I left Google and joined Snowflake in 2020 — so I’m unable to keep my older posts updated. If you want to try Snowflake, join us — I’m having a lot of fun ❄️.

Finding the top Stack Overflow tags of 2023

Let’s start by visualizing the top Stack Overflow tags on June 2020:

Press enter or click to view image in full size

Top 10 Stack Overflow tags, June 2020

As you can see in the chart, by number of pageviews during the last quarter the current top Stack Overflow tags are:

Python
JavaScript
Java
C#
HTML
Android
PHP
CSS
jQuery
C++

As you look back in the chart, you can see a lot of movement. Two years ago Python was in the third place, and now it’s the only tag in the top 10 showing growth.

Just by looking at the above chart — how would you extend these lines 3 years into the future? With BigQuery and some SQL magic, we can easily do that:

Press enter or click to view image in full size

Top 10 Stack Overflow tags, June 2020 — projected into the future with ARIMA

The predictions by the ARIMA model make a lot of sense: Python keeps going up, JavaScript stays stable, Java keeps going down… Now, can we find the top Stack Overflow tags of 2023 using these trends? Sure:

Press enter or click to view image in full size

Top 10 Stack Overflow tags, June 2023 — projected by ARIMA

Now this is interesting! C++ and jQuery are gone — the two new top-10 tags are ReactJS and Angular.

This is really cool if we want to identify the top trends of the future. We can use all the currently available information, project it into the future, and base our decisions on visible trends.

Other interesting trends

Now that we have these projections and a dashboard, we can re-visit my 2018 post “The real Stack Overflow trends: Use the pageviews”.

React vs Angular vs Vue.js

Right now both React and Angular gather a similar amount of attention
React is projected to get a slight advantage into the future.

Press enter or click to view image in full size

Machine learning

TensorFlow, Keras, and PyTorch are show strong growth.
Back in 2018 Pytorch and MxNet were equally ignored. Nevertheless PyTorch trended up, and MxNet stayed ignored.
Caffe had some attention in 2017, but it just faded out.
TensorFlow is projected to keep the top spot in 2023.

Press enter or click to view image in full size

Upcoming languages: Swift, Go, Haskell, Kotlin, Rust

Swift gathers cyclical attention, but shows no growth.
Go and Kotlin grow in popularity each quarter, but not enough to catch up to Swift anytime soon.
Haskell and Rust don’t show much action, but at least Rust is going up while Haskell stays stable.
On the downward slope, for comparison: Ruby. It still gathers more attention to Go and Kotlin, but probably not for long.

Press enter or click to view image in full size

AWS Lambda vs GCP Functions vs Azure Functions

AWS has the advantage, but GCP and Azure keep going up and to the right.

Press enter or click to view image in full size

Find your own trends

Play with the Data Studio interactive dashboard, and share what you find.

How-to

ARIMA in BigQuery

To generate time-series predictions with BigQuery, all you need to do is create a model with a couple lines of SQL:

CREATE OR REPLACE MODEL `temp.merged202006_arima_dailyquarter`
OPTIONS(model_type='ARIMA',
        time_series_id_col='id_col',
        time_series_data_col='data_col',    
        time_series_timestamp_col='date_col'
)

That’s it. As long as you have time-series, BigQuery can do its best to find a fitting ARIMA model.

For a quick check, let’s generate 4 basic time-series: Linear, quadratic growth, exponential growth, and cyclical:

CREATE OR REPLACE MODEL `temp.test_arima`
OPTIONS(model_type='ARIMA',
        time_series_id_col='trend',
        time_series_data_col='i',    
        time_series_timestamp_col='date'
) ASSELECT DATE_ADD('2020-01-01', INTERVAL i DAY) date
  , 10*i i, 'linear' trend
FROM UNNEST(GENERATE_ARRAY(0,10)) i
UNION ALL
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY)
  , i*i, 'square'
FROM UNNEST(GENERATE_ARRAY(0,10)) i
UNION ALL
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY)
  , EXP(i)/400, 'exponential'
FROM UNNEST(GENERATE_ARRAY(0,10)) i
UNION ALL
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY)
  , 100*SIN(i*ACOS(-1)/4), 'sin'
FROM UNNEST(GENERATE_ARRAY(0,10)) i

Then we can ask this model to predict future values for each of the time series:

SELECT DATE(forecast_timestamp) date, forecast_value i, trend 
FROM ML.FORECAST(MODEL `temp.test_arima` 
  , STRUCT(10 AS horizon, 0.1 AS confidence_level))

Once in a chart, the results look pretty good:

Press enter or click to view image in full size

Testing ARIMA. Half of these values were predicted by the model.

Note that even when all of these predictions look good — one didn’t “read my mind”: The exponential growth curve got projected with linear growth by ARIMA.

What’s especially interesting is how the seasonality in the sin cyclical “sinus” series is handled by BigQuery ML time series. There is way more than ARIMA in the BQML time series, from the official docs:

What’s inside a BigQuery ML time series model
ARIMA is considered the core algorithm used in BigQuery ML time series. However, it is not the only model used in the model creation pipeline. The pipeline consists of the following components, listed roughly in the order the steps are run:
- Automatic cleaning adjustments to the input time series, including missing values, duplicated timestamps, spike anomalies, and accounting for abrupt level changes in the time series history.
- Holiday effects adjustments.
- Seasonal and trend decomposition using the Seasonal and Trend decomposition using Loess (STL) algorithm.
- Seasonality extrapolation using the double exponential smoothing (ETS) algorithm.
- Trend modeling using the ARIMA model and the auto.ARIMA algorithm for automatic hyper-parameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The best model comes with the lowest Akaike information criterion (AIC).

You can also ask BigQuery to reveal the optimal coefficients it found for each series:

SELECT  *
FROM  ML.ARIMA_COEFFICIENTS(MODEL `temp.test_arima`)

ML.ARIMA_COEFFICIENTS for each ARIMA trend

Trends in Stack Overflow

Check my previous post “The real Stack Overflow trends: Use the pageviews”.

Putting it all together with dbt

I shared the full code for generating the Stack Overflow predictions on GitHub:

https://github.com/fhoffa/…/stack_overflow_dbt

And make sure to check my 2 videos with

where we build these predictions:

Hacking ARIMA

Note that ARIMA in BigQuery is currently in beta.
It does a really good job! In this case I was able to build a model and predictions for 15,000 different tags in only 20 minutes.
One problem I expect the team can fix soon: ARIMA didn’t work very well with quarterly data, but that’s all I had for this example. So I had to convert the dates from a quarterly cadence to a daily one, and the convert back.

Next steps

Anomaly detection

Instead of predicting 3 years into the future, predict for today, and compare the predicted results with the actual ones.

For anything outside the confidence interval from ARIMA, you can consider that an anomaly that deserves to be checked closer.

Check

’s post:

Want more?

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.

Can you predict the top Stack Overflow tags of 2023? BigQuery makes this easy, with its new support for training time-series analysis models with ARIMA.

Finding the top Stack Overflow tags of 2023

Other interesting trends

React vs Angular vs Vue.js

Machine learning

Upcoming languages: Swift, Go, Haskell, Kotlin, Rust

AWS Lambda vs GCP Functions vs Azure Functions

Find your own trends

How-to

ARIMA in BigQuery

Trends in Stack Overflow

Putting it all together with dbt

Hacking ARIMA

Next steps

Anomaly detection

Related posts

Want more?