Stack Overflow in 2023: Predicting with ARIMA and BigQuery

6 min read Original article ↗

Press enter or click to view image in full size

Can you predict the top Stack Overflow tags of 2023? BigQuery makes this easy, with its new support for training time-series analysis models with ARIMA.

Felipe Hoffa

Important update: I left Google and joined Snowflake in 2020 — so I’m unable to keep my older posts updated. If you want to try Snowflake, join us — I’m having a lot of fun ❄️.

Finding the top Stack Overflow tags of 2023

Let’s start by visualizing the top Stack Overflow tags on June 2020:

Press enter or click to view image in full size

Top 10 Stack Overflow tags, June 2020

As you can see in the chart, by number of pageviews during the last quarter the current top Stack Overflow tags are:

  1. Python
  2. JavaScript
  3. Java
  4. C#
  5. HTML
  6. Android
  7. PHP
  8. CSS
  9. jQuery
  10. C++

As you look back in the chart, you can see a lot of movement. Two years ago Python was in the third place, and now it’s the only tag in the top 10 showing growth.

Just by looking at the above chart — how would you extend these lines 3 years into the future? With BigQuery and some SQL magic, we can easily do that:

Press enter or click to view image in full size

Top 10 Stack Overflow tags, June 2020 — projected into the future with ARIMA

The predictions by the ARIMA model make a lot of sense: Python keeps going up, JavaScript stays stable, Java keeps going down… Now, can we find the top Stack Overflow tags of 2023 using these trends? Sure:

Press enter or click to view image in full size

Top 10 Stack Overflow tags, June 2023 — projected by ARIMA

Now this is interesting! C++ and jQuery are gone — the two new top-10 tags are ReactJS and Angular.

This is really cool if we want to identify the top trends of the future. We can use all the currently available information, project it into the future, and base our decisions on visible trends.

Other interesting trends

Now that we have these projections and a dashboard, we can re-visit my 2018 post “The real Stack Overflow trends: Use the pageviews”.

React vs Angular vs Vue.js

  • Right now both React and Angular gather a similar amount of attention
  • React is projected to get a slight advantage into the future.

Press enter or click to view image in full size

Machine learning

  • TensorFlow, Keras, and PyTorch are show strong growth.
  • Back in 2018 Pytorch and MxNet were equally ignored. Nevertheless PyTorch trended up, and MxNet stayed ignored.
  • Caffe had some attention in 2017, but it just faded out.
  • TensorFlow is projected to keep the top spot in 2023.

Press enter or click to view image in full size

Upcoming languages: Swift, Go, Haskell, Kotlin, Rust

  • Swift gathers cyclical attention, but shows no growth.
  • Go and Kotlin grow in popularity each quarter, but not enough to catch up to Swift anytime soon.
  • Haskell and Rust don’t show much action, but at least Rust is going up while Haskell stays stable.
  • On the downward slope, for comparison: Ruby. It still gathers more attention to Go and Kotlin, but probably not for long.

Press enter or click to view image in full size

AWS Lambda vs GCP Functions vs Azure Functions

  • AWS has the advantage, but GCP and Azure keep going up and to the right.

Press enter or click to view image in full size

Find your own trends

Play with the Data Studio interactive dashboard, and share what you find.

How-to

ARIMA in BigQuery

To generate time-series predictions with BigQuery, all you need to do is create a model with a couple lines of SQL:

CREATE OR REPLACE MODEL `temp.merged202006_arima_dailyquarter`
OPTIONS(model_type='ARIMA',
time_series_id_col='id_col',
time_series_data_col='data_col',
time_series_timestamp_col='date_col'
)

That’s it. As long as you have time-series, BigQuery can do its best to find a fitting ARIMA model.

For a quick check, let’s generate 4 basic time-series: Linear, quadratic growth, exponential growth, and cyclical:

CREATE OR REPLACE MODEL `temp.test_arima`
OPTIONS(model_type='ARIMA',
time_series_id_col='trend',
time_series_data_col='i',
time_series_timestamp_col='date'
) AS
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY) date
, 10*i i, 'linear' trend
FROM UNNEST(GENERATE_ARRAY(0,10)) i
UNION ALL
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY)
, i*i, 'square'
FROM UNNEST(GENERATE_ARRAY(0,10)) i
UNION ALL
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY)
, EXP(i)/400, 'exponential'
FROM UNNEST(GENERATE_ARRAY(0,10)) i
UNION ALL
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY)
, 100*SIN(i*ACOS(-1)/4), 'sin'
FROM UNNEST(GENERATE_ARRAY(0,10)) i

Then we can ask this model to predict future values for each of the time series:

SELECT DATE(forecast_timestamp) date, forecast_value i, trend 
FROM ML.FORECAST(MODEL `temp.test_arima`
, STRUCT(10 AS horizon, 0.1 AS confidence_level))

Once in a chart, the results look pretty good:

Press enter or click to view image in full size

Testing ARIMA. Half of these values were predicted by the model.

Note that even when all of these predictions look good — one didn’t “read my mind”: The exponential growth curve got projected with linear growth by ARIMA.

What’s especially interesting is how the seasonality in the sin cyclical “sinus” series is handled by BigQuery ML time series. There is way more than ARIMA in the BQML time series, from the official docs:

What’s inside a BigQuery ML time series model

ARIMA is considered the core algorithm used in BigQuery ML time series. However, it is not the only model used in the model creation pipeline. The pipeline consists of the following components, listed roughly in the order the steps are run:

- Automatic cleaning adjustments to the input time series, including missing values, duplicated timestamps, spike anomalies, and accounting for abrupt level changes in the time series history.

- Holiday effects adjustments.

- Seasonal and trend decomposition using the Seasonal and Trend decomposition using Loess (STL) algorithm.

- Seasonality extrapolation using the double exponential smoothing (ETS) algorithm.

- Trend modeling using the ARIMA model and the auto.ARIMA algorithm for automatic hyper-parameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The best model comes with the lowest Akaike information criterion (AIC).

You can also ask BigQuery to reveal the optimal coefficients it found for each series:

SELECT  *
FROM ML.ARIMA_COEFFICIENTS(MODEL `temp.test_arima`)
ML.ARIMA_COEFFICIENTS for each ARIMA trend

Trends in Stack Overflow

Check my previous post “The real Stack Overflow trends: Use the pageviews”.

Putting it all together with dbt

I shared the full code for generating the Stack Overflow predictions on GitHub:

And make sure to check my 2 videos with

where we build these predictions:

Hacking ARIMA

  • Note that ARIMA in BigQuery is currently in beta.
  • It does a really good job! In this case I was able to build a model and predictions for 15,000 different tags in only 20 minutes.
  • One problem I expect the team can fix soon: ARIMA didn’t work very well with quarterly data, but that’s all I had for this example. So I had to convert the dates from a quarterly cadence to a daily one, and the convert back.

Next steps

Anomaly detection

Instead of predicting 3 years into the future, predict for today, and compare the predicted results with the actual ones.

For anything outside the confidence interval from ARIMA, you can consider that an anomaly that deserves to be checked closer.

Related posts

Check

’s post:

Want more?

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.