Press enter or click to view image in full size
Can you predict the top Stack Overflow tags of 2023? BigQuery makes this easy, with its new support for training time-series analysis models with ARIMA.
Important update: I left Google and joined Snowflake in 2020 — so I’m unable to keep my older posts updated. If you want to try Snowflake, join us — I’m having a lot of fun ❄️.
Finding the top Stack Overflow tags of 2023
Let’s start by visualizing the top Stack Overflow tags on June 2020:
Press enter or click to view image in full size
As you can see in the chart, by number of pageviews during the last quarter the current top Stack Overflow tags are:
- Python
- JavaScript
- Java
- C#
- HTML
- Android
- PHP
- CSS
- jQuery
- C++
As you look back in the chart, you can see a lot of movement. Two years ago Python was in the third place, and now it’s the only tag in the top 10 showing growth.
Just by looking at the above chart — how would you extend these lines 3 years into the future? With BigQuery and some SQL magic, we can easily do that:
Press enter or click to view image in full size
The predictions by the ARIMA model make a lot of sense: Python keeps going up, JavaScript stays stable, Java keeps going down… Now, can we find the top Stack Overflow tags of 2023 using these trends? Sure:
Press enter or click to view image in full size
Now this is interesting! C++ and jQuery are gone — the two new top-10 tags are ReactJS and Angular.
This is really cool if we want to identify the top trends of the future. We can use all the currently available information, project it into the future, and base our decisions on visible trends.
Other interesting trends
Now that we have these projections and a dashboard, we can re-visit my 2018 post “The real Stack Overflow trends: Use the pageviews”.
React vs Angular vs Vue.js
- Right now both React and Angular gather a similar amount of attention
- React is projected to get a slight advantage into the future.
Press enter or click to view image in full size
Machine learning
- TensorFlow, Keras, and PyTorch are show strong growth.
- Back in 2018 Pytorch and MxNet were equally ignored. Nevertheless PyTorch trended up, and MxNet stayed ignored.
- Caffe had some attention in 2017, but it just faded out.
- TensorFlow is projected to keep the top spot in 2023.
Press enter or click to view image in full size
Upcoming languages: Swift, Go, Haskell, Kotlin, Rust
- Swift gathers cyclical attention, but shows no growth.
- Go and Kotlin grow in popularity each quarter, but not enough to catch up to Swift anytime soon.
- Haskell and Rust don’t show much action, but at least Rust is going up while Haskell stays stable.
- On the downward slope, for comparison: Ruby. It still gathers more attention to Go and Kotlin, but probably not for long.
Press enter or click to view image in full size
AWS Lambda vs GCP Functions vs Azure Functions
- AWS has the advantage, but GCP and Azure keep going up and to the right.
Press enter or click to view image in full size
Find your own trends
Play with the Data Studio interactive dashboard, and share what you find.
How-to
ARIMA in BigQuery
To generate time-series predictions with BigQuery, all you need to do is create a model with a couple lines of SQL:
CREATE OR REPLACE MODEL `temp.merged202006_arima_dailyquarter`
OPTIONS(model_type='ARIMA',
time_series_id_col='id_col',
time_series_data_col='data_col',
time_series_timestamp_col='date_col'
) That’s it. As long as you have time-series, BigQuery can do its best to find a fitting ARIMA model.
For a quick check, let’s generate 4 basic time-series: Linear, quadratic growth, exponential growth, and cyclical:
CREATE OR REPLACE MODEL `temp.test_arima`
OPTIONS(model_type='ARIMA',
time_series_id_col='trend',
time_series_data_col='i',
time_series_timestamp_col='date'
) ASSELECT DATE_ADD('2020-01-01', INTERVAL i DAY) date
, 10*i i, 'linear' trend
FROM UNNEST(GENERATE_ARRAY(0,10)) i
UNION ALL
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY)
, i*i, 'square'
FROM UNNEST(GENERATE_ARRAY(0,10)) i
UNION ALL
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY)
, EXP(i)/400, 'exponential'
FROM UNNEST(GENERATE_ARRAY(0,10)) i
UNION ALL
SELECT DATE_ADD('2020-01-01', INTERVAL i DAY)
, 100*SIN(i*ACOS(-1)/4), 'sin'
FROM UNNEST(GENERATE_ARRAY(0,10)) i
Then we can ask this model to predict future values for each of the time series:
SELECT DATE(forecast_timestamp) date, forecast_value i, trend
FROM ML.FORECAST(MODEL `temp.test_arima`
, STRUCT(10 AS horizon, 0.1 AS confidence_level))Once in a chart, the results look pretty good:
Press enter or click to view image in full size
Note that even when all of these predictions look good — one didn’t “read my mind”: The exponential growth curve got projected with linear growth by ARIMA.
What’s especially interesting is how the seasonality in the sin cyclical “sinus” series is handled by BigQuery ML time series. There is way more than ARIMA in the BQML time series, from the official docs:
What’s inside a BigQuery ML time series model
ARIMA is considered the core algorithm used in BigQuery ML time series. However, it is not the only model used in the model creation pipeline. The pipeline consists of the following components, listed roughly in the order the steps are run:
- Automatic cleaning adjustments to the input time series, including missing values, duplicated timestamps, spike anomalies, and accounting for abrupt level changes in the time series history.
- Holiday effects adjustments.
- Seasonal and trend decomposition using the Seasonal and Trend decomposition using Loess (STL) algorithm.
- Seasonality extrapolation using the double exponential smoothing (ETS) algorithm.
- Trend modeling using the ARIMA model and the auto.ARIMA algorithm for automatic hyper-parameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The best model comes with the lowest Akaike information criterion (AIC).
You can also ask BigQuery to reveal the optimal coefficients it found for each series:
SELECT *
FROM ML.ARIMA_COEFFICIENTS(MODEL `temp.test_arima`)Trends in Stack Overflow
Check my previous post “The real Stack Overflow trends: Use the pageviews”.
Putting it all together with dbt
I shared the full code for generating the Stack Overflow predictions on GitHub:
And make sure to check my 2 videos with
where we build these predictions:
Hacking ARIMA
- Note that ARIMA in BigQuery is currently in beta.
- It does a really good job! In this case I was able to build a model and predictions for 15,000 different tags in only 20 minutes.
- One problem I expect the team can fix soon: ARIMA didn’t work very well with quarterly data, but that’s all I had for this example. So I had to convert the dates from a quarterly cadence to a daily one, and the convert back.
Next steps
Anomaly detection
Instead of predicting 3 years into the future, predict for today, and compare the predicted results with the actual ones.
For anything outside the confidence interval from ARIMA, you can consider that an anomaly that deserves to be checked closer.
Related posts
Check
’s post:
Want more?
I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.