We compare neo-graph databases, Postgres, LadybugDB, and DuckDB, to understand the user experience and general performance.
You are getting early access to this article as a subscriber. Your support makes articles like this possible. Thank you.
This article presents a comparison of graph capabilities in three different databases: DuckDB (v1.4.4 with duckpgq), LadybugDB (0.16.1), and PostgreSQL (19devel). We will load a large volume of records (5,635,972 rows of baseball data covering people, parks, team records, and game play-by-plays) into each database, define the entities and relationships, and write a variety of queries that take full advantage of the graph structure.
There are three goals I aim to hit with my methodology:
- Provide a side-by-side comparison of the look and feel of interfacing with each database via their respective syntaxes.
- Provide a general sense of performance when running queries in each database.
- Equip you to use graph database capabilities in your own explorations!
We leave investigating the database internals as an exercise for the reader. And we leave out discussion of the elder in the space, Neo4j. This article is also not an introduction to graph query languages, so be prepared!
Introducing the players#
Like any good talent scout, we pick players through careful observation and ask some key questions. Will they fit with the team? What sets them apart? Do they excel at only one thing? What is their utility? What makes them tick? By the end of the article, these questions will get answered!
Let's break down the roster.
Postgres: a league veteran learns a new position#
SQL:2023 Property Graph Queries (also referred to as SQL/PGQ) is being actively developed by the pgsql-hackers community. This feature is not yet released, but available in the PostgreSQL master branch, meaning graph database functionality will no longer depend on extensions like Apache AGE. To use it, you must build from source. Easy enough!
sudo apt-get update -y
sudo apt-get install -y git build-essential pkg-config libicu-dev bison flex libreadline-dev liburing-dev zlib1g-dev liblz4-dev hyperfine
sudo useradd postgres
sudo mkdir /postgres
git clone https://github.com/postgres/postgres.git
cd postgres
git checkout master
./configure --prefix /postgres --with-liburing --with-lz4
make -j8
sudo make install
sudo chown -R postgres:postgres /postgres/
sudo -u postgres /postgres/bin/initdb -D /postgres/data
sudo -u postgres /postgres/bin/pg_ctl -D /postgres/data -l /postgres/logfile start
sudo -u postgres /postgres/bin/createdb graph_baseball
We will take a closer look at Property Graphs when working with DuckDB as well, highlighting the similarities and differences between their implementations.
Let's also tune our Postgres settings while we're at it. I used the following PGTune configuration parameters for PostgreSQL 19devel running on a Digital Ocean Ubuntu 24.04 VM with 2 CPUs and 4GB RAM.
echo "max_connections = 200
shared_buffers = 1GB
effective_cache_size = 3GB
maintenance_work_mem = 256MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 5041kB
huge_pages = off
jit = off
wal_compression = lz4
io_method = io_uring
min_wal_size = 1GB
max_wal_size = 4GB" | sudo tee -a /postgres/data/postgresql.conf
And restart Postgres.
sudo -u postgres /postgres/bin/pg_ctl -D /postgres/data -l /postgres/logfile restart
DuckDB and LadybugDB: two promising rookies#
DuckDB and LadybugDB are two recent MIT-licensed DBMS projects that have made waves in the database world and take a very interesting approach to how they operate and work with data. Both databases can run in-memory (which is especially powerful for exploring small datasets) or on-disk (when persisting data or working with larger-than-memory datasets). They both provide standalone executable CLI tools as well as first-party client APIs supporting a plethora of popular programming languages (Python, Java, Golang, Rust and even WASM).
DuckDB bills (haha) itself as a simple, portable, extensible and blazingly fast column-oriented database that specializes in online analytical processing (OLAP) workloads to run complex queries against large databases. v1.0.0 first released in 2024, it also supports Property Graphs through the duckpgq first-party extension, which we focus on in this article.
Note that while duckpgq is first-party, it does not support the latest version of DuckDB (1.5.2), only being available for DuckDB 1.5.0 in that minor version. (It is available for the LTS version 1.4.4 which is what we'll use.) Extensions, even first-party ones, are apparently not built for each patch version. And extensions are not compatible even across patch versions. So grab the LTS version and install the duckpgq extension.
curl https://install.duckdb.org | DUCKDB_VERSION=1.4.4 sh
export PATH="$HOME/.duckdb/cli/1.4.4":$PATH
duckdb graph_baseball.duckdb -c "INSTALL duckpgq FROM community;"
LadybugDB is even newer than DuckDB: its first release was in November 2025. Like DuckDB, it is column-oriented but primarily positions itself as a graph database, where relationships are modeled as nodes and edges (the lines connecting nodes). It is the successor to Kuzu, which was acquired by Apple and subsequently archived in October 2025.
LadybugDB implements OpenCypher and should feel familiar to Neo4j users (perhaps the most popular graph database).
curl -s https://install.ladybugdb.com | bash
I encourage readers to look through the 'Getting Started' guides for both databases (you really can get started with both in mere seconds!) to get a feel of what each offers.
On baseball metrics#
Since baseball season is just underway, I thought it would be interesting and fun to use data from the game. Some readers may be familiar with Sabermetrics (from the acronym SABR, the Society for American Baseball Research), whose practitioners seek to obtain the objective truth of how runs are created and games are won through statistical analysis. Major League Baseball is a fantastic medium for this pursuit for a couple reasons.
First, the massive volume of games (162 in a regular season for each of the 30 teams), three-game series (the same two teams play each other over three days, with one team hosting all games), and balanced scheduling (every team plays every other team in the league at least once) ensures that statistical outliers and flukes (noise) are swallowed. Trends across eras and seasons are visible and statistically significant, and reflect causality in rule changes, ballpark changes, Performance-Enhancing Drug usage, etc.
Second, the atomic structure of events and unambiguity of outcomes works well for their recording in structured formats like box scores, spreadsheets, and database tables. Baseball has kept consistent records (along with varying statistics) since the 1870s; the ability to isolate individual performances from team performances lets historians and statisticians alike appreciate the stars of the game who may not be household names.
The discrete nature of the game is aptly summarized in the now-classic book “The Hidden Game of Baseball” by John Thorn and Pete Palmer, first published in 1984:
Baseball offers a model of perfection, a utopian, zero-sum system in which every action by the offense has a corresponding and inverse act by the defense, and everything balances in the end. The box score reads like the Book of Life held by St. Peter at the pearly gates; no action is left unaccounted.
Thorn and Palmer were inspired in large part by "The Bill James Historical Baseball Abstract", the 1977 foundational work that launched the Sabermetrics movement. Then in 1984, Bill James and Dr. David Smith started Project Scoresheet, which Smith pursued further by starting Retrosheet in 1989. We will use this invaluable resource to obtain our baseball data below.
Acquiring data from Retrosheet#
Retrosheet is an all-volunteer organization that generously compiles and provides a plethora of historic data for download; including players, teams, ballparks, as well as collections for complete season records. We will be looking at 16 complete seasons of MLB data from 2010 through 2025.
sudo apt-get update -y && sudo apt-get install -y unzip
curl -O https://www.retrosheet.org/downloads/biodata.zip
unzip -q "biodata.zip" -d "biodata"
rm "biodata.zip"
for year in {2010..2025}; do
url="https://www.retrosheet.org/downloads/${year}/${year}csvs.zip"
curl -O "$url"
if [ $? -eq 0 ]; then
mkdir -p "${year}csvs"
unzip -q "${year}csvs.zip" -d "${year}csvs"
rm "${year}csvs.zip"
else
echo "Failed to download ${year}csvs.zip"
fi
done
Key files among the downloaded CSVs have the following breakdown:
| Entity | File(s) | Number of records |
|---|---|---|
| Player | biodata/biofile0.csv | 26,961 |
| Umpire | biodata/umpires0.csv | 1,676 |
| Team | biodata/teams0.csv | 172 |
| Ballpark | biodata/ballparks0.csv | 506 |
| Game | 20XX/20XXgameinfo.csv | 37,975 |
| Event | 20XX/20XXplays.csv | 3,008,209 |
These entities represent the “nodes” in our graph-based schema, and by defining a handful of relationships between them, we will create the “edges” that can model the game of baseball!
When all the nodes and relationships are populated in their respective tables, there are 24,768,432 records in total. Keep in mind that this number includes duplication across tables; in an effort to keep the data-to-database pipeline as simple as possible for readers, I intentionally did NOT optimize the schema design to thoughtfully separate node and edge table concerns. Their bloated nature requires scanning and loading more rows or columns than necessary into memory during execution.
For instance, LadybugDB cannot effectively remove redundant tuples during factorization or take maximal advantage of compressed sparse row (CSR) adjacency indices. The vectorized query engines of both LadybugDB and DuckDB are also hampered by missed push-down opportunities for early filtering, suboptimal join order caused by ambiguity of duplicate properties, and wider intermediate tuples decreasing batch density. With these issues in mind, the findings presented here will most likely represent a worst-case scenario that can certainly be avoided and improved by using a cleaner schema separation.
Preprocessing the data#
Retrosheet provides a useful reference for all the columns in each file. So let’s write a little script to facilitate copying the Retrosheet CSV files into the databases.
Run this (ugly) script in the directory containing the downloaded data from Retrosheet (biodata, 20XXcsvs, etc.) and it will recursively parse and output all the ready-to-copy files into a single /parsed directory. I'll explain what it's doing below.
sed -i 's/\r//' $(find . -name "*.csv")
mkdir -p parsed
find . -name "*.csv" | while read -r f; do
t="parsed/${f##*/}"
awk -F, -v f="$f" 'BEGIN{OFS=","}
FNR==1{delete c;for(i=1;i<=NF;i++)if(tolower($i)~/date|last_|first_|debut_/)c[i];
if(f~/plays.csv/){for(i=1;i<=NF;i++){if($i=="gid")g=i;if($i=="pn")p=i}printf "event_id,";print;next}
if(f~/fielding|pitching|batting|teamstats/){x=$1;$1=$2;$2=x}
if(f~/relatives.csv/){x=$2;$2=$3;$3=x}
if(f~/allplayers.csv/){x=$6;for(i=6;i>2;i--)$i=$(i-1);$2=x}
if(f~/gameinfo.csv/){x=$4;for(i=4;i>2;i--)$i=$(i-1);$2=x}
if(f~/coaches0.csv/){y=$2;tm=$3;$2=tm;$3=y}
print;next}
{if(f~/plays.csv/)printf "%s_%s,",$g,$p;
if(f~/fielding|pitching|batting|teamstats/){x=$1;$1=$2;$2=x}
if(f~/relatives.csv/){x=$2;$2=$3;$3=x}
if(f~/allplayers.csv/){x=$6;for(i=6;i>2;i--)$i=$(i-1);$2=x}
if(f~/gameinfo.csv/){x=$4;for(i=4;i>2;i--)$i=$(i-1);$2=x}
if(f~/coaches0.csv/){y=$2;tm=$3;$2=tm;$3=y}
for(i in c){
if(length($i)<8)$i="";
if(substr($i,5,2)=="00")$i=substr($i,1,4)"01"substr($i,7,2);
if(substr($i,7,2)=="00")$i=substr($i,1,6)"01";
if($i~/^(1[89]|20)[0-9]{6}$/)$i=substr($i,1,4)"-"substr($i,5,2)"-"substr($i,7,2)}print}' "$f" > "$t"
done
First, this script creates an event_id field in the *plays.csv files to serve as the primary key for each event in a game. It takes the form <Game ID>_<Play Number>, e.g. CHN202503180_15, combining the gid and pn fields from the Retrosheet CSVs.
Second, to allow LadybugDB to seamlessly populate the respective relationship tables directly from files, the first two columns must represent the FROM and TO nodes. So the script makes some ordering changes.
*fielding.csv,*pitching.csv, and*batting.csv: Moveidbeforegid(Player IN Game)*teamstats.csv: Moveteaminto the second column aftergid(Team IN Game)coaches0.csv: Moveteaminto the second column afterid(Coach ON Team)*allplayers.csv: Moveteaminto the second column afterid(Player ON Team)*gameinfo.csv:Movesiteinto the second column aftergid(Game IN Ballpark)relatives.csv: Moveid2into the second column afterid1(Player TO Player)
Lastly, Retrosheet CSV files do not use a datestring format that databases understand, and any unknown parts are filled with 00. So these get translated to a YYYY-MM-DD format that substitutes 01 for 00 and clears any malformed dates with an empty string.
With our data cleaned up, let's get ready for the game!
Inside baseball#
As a sidenote, LadybugDB has a really cool browser-based explorer. The image below is taken directly from this tool and visually captures the entire schema of our baseball data. Even though this visualization came from LadybugDB, this schema is consistent across the DuckDB and PostgreSQL databases as well.
LadybugDB is our leadoff batter, and we will use it to kick things off for our other players.
At the plate: LadybugDB#
Unlike DuckDB (which can automatically create tables directly from CSV files) LadybugDB requires tables to be created with an explicit schema. The wall of text below may look dense, but it simply captures the Retrosheet CSV data for each table with the specified types.
// Nodes
CREATE NODE TABLE IF NOT EXISTS Player(id STRING PRIMARY KEY, lastname STRING, usename STRING, fullname STRING, birthdate DATE, birthcity STRING, birthstate STRING, birthcountry STRING, deathdate DATE, deathcity STRING, deathstate STRING, deathcountry STRING, cemetery STRING, cem_city STRING, cem_state STRING, cem_ctry STRING, cem_note STRING, birthname STRING, altname STRING, debut_p DATE, last_p DATE, debut_c DATE, last_c DATE, debut_m DATE, last_m DATE, debut_u DATE, last_u DATE, bats STRING, throws STRING, height DOUBLE, weight DOUBLE, HOF STRING);
CREATE NODE TABLE IF NOT EXISTS Umpire(id STRING PRIMARY KEY, lastname STRING, firstname STRING, first_g DATE, last_g DATE);
CREATE NODE TABLE IF NOT EXISTS Team(team STRING PRIMARY KEY, city STRING, nickname STRING, first_g DATE, last_g DATE);
CREATE NODE TABLE IF NOT EXISTS Ballpark(site STRING PRIMARY KEY, name STRING, city STRING, state STRING, first_g DATE, last_g DATE);
CREATE NODE TABLE IF NOT EXISTS Game(gid STRING PRIMARY KEY, site STRING, visteam STRING, hometeam STRING, date DATE, number INT64, starttime STRING, daynight STRING, innings INT64, tiebreaker INT64, usedh BOOLEAN, htbf STRING, timeofgame INT64, attendance INT64, fieldcond STRING, precip STRING, sky STRING, temp INT64, winddir STRING, windspeed INT64, oscorer STRING, forfeit STRING, suspend INT64, umphome STRING, ump1b STRING, ump2b STRING, ump3b STRING, umplf STRING, umprf STRING, wp STRING, lp STRING, save STRING, gametype STRING, vruns INT64, hruns INT64, wteam STRING, lteam STRING, line STRING, batteries STRING, lineups STRING, box STRING, pbp STRING, season INT64);
CREATE NODE TABLE IF NOT EXISTS Event(event_id STRING PRIMARY KEY, gid STRING, event STRING, inning INT64, top_bot INT64, vis_home INT64, site STRING, batteam STRING, pitteam STRING, score_v INT64, score_h INT64, batter STRING, pitcher STRING, lp INT64, bat_f INT64, bathand STRING, pithand STRING, balls INT64, strikes INT64, `count` INT64, pitches STRING, nump INT64, pa INT64, ab INT64, `single` INT64, double INT64, triple INT64, hr INT64, sh INT64, sf INT64, hbp INT64, walk INT64, k INT64, xi INT64, roe INT64, fc INT64, othout INT64, noout INT64, oth INT64, bip INT64, bunt INT64, ground INT64, fly INT64, line INT64, iw INT64, gdp INT64, othdp INT64, tp INT64, fle INT64, wp INT64, pb INT64, bk INT64, oa INT64, di INT64, sb2 INT64, sb3 INT64, sbh INT64, cs2 INT64, cs3 INT64, csh INT64, pko1 INT64, pko2 INT64, pko3 INT64, k_safe INT64, e1 INT64, e2 INT64, e3 INT64, e4 INT64, e5 INT64, e6 INT64, e7 INT64, e8 INT64, e9 INT64, outs_pre INT64, outs_post INT64, br1_pre STRING, br2_pre STRING, br3_pre STRING, br1_post STRING, br2_post STRING, br3_post STRING, lob_id1 STRING, lob_id2 STRING, lob_id3 STRING, pr1_pre STRING, pr2_pre STRING, pr3_pre STRING, pr1_post STRING, pr2_post STRING, pr3_post STRING, run_b STRING, run1 STRING, run2 STRING, run3 STRING, prun_b STRING, prun1 STRING, prun2 STRING, prun3 STRING, ur_b INT64, ur1 INT64, ur2 INT64, ur3 INT64, rbi_b INT64, rbi1 INT64, rbi2 INT64, rbi3 INT64, runs INT64, rbi INT64, er INT64, tur INT64, l1 STRING, l2 STRING, l3 STRING, l4 STRING, l5 STRING, l6 STRING, l7 STRING, l8 STRING, l9 STRING, lf1 INT64, lf2 INT64, lf3 INT64, lf4 INT64, lf5 INT64, lf6 INT64, lf7 INT64, lf8 INT64, lf9 INT64, f2 STRING, f3 STRING, f4 STRING, f5 STRING, f6 STRING, f7 STRING, f8 STRING, f9 STRING, po0 INT64, po1 INT64, po2 INT64, po3 INT64, po4 INT64, po5 INT64, po6 INT64, po7 INT64, po8 INT64, po9 INT64, a1 INT64, a2 INT64, a3 INT64, a4 INT64, a5 INT64, a6 INT64, a7 INT64, a8 INT64, a9 INT64, fseq INT64, batout1 INT64, batout2 INT64, batout3 INT64, brout_b INT64, brout1 INT64, brout2 INT64, brout3 INT64, firstf INT64, loc STRING, hittype STRING, dpopp INT64, pivot INT64, pn INT64, umphome STRING, ump1b STRING, ump2b STRING, ump3b STRING, umplf STRING, umprf STRING, date DATE, gametype STRING, pbp STRING);
// Edges
CREATE REL TABLE IF NOT EXISTS PLAYED_FOR_TEAM(FROM Player TO Team, last STRING, first STRING, bat STRING, throw STRING, g INT64, g_p INT64, g_sp INT64, g_rp INT64, g_c INT64, g_1b INT64, g_2b INT64, g_3b INT64, g_ss INT64, g_lf INT64, g_cf INT64, g_rf INT64, g_of INT64, g_dh INT64, g_ph INT64, g_pr INT64, first_g DATE, last_g DATE);
CREATE REL TABLE IF NOT EXISTS COACHED_FOR_TEAM(FROM Player TO Team, year INT64, role STRING, first_g DATE, last_g DATE);
CREATE REL TABLE IF NOT EXISTS PLAYED_IN_GAME(FROM Team TO Game, inn1 INT64, inn2 INT64, inn3 INT64, inn4 INT64, inn5 INT64, inn6 INT64, inn7 INT64, inn8 INT64, inn9 INT64, inn10 INT64, inn11 INT64, inn12 INT64, inn13 INT64, inn14 INT64, inn15 INT64, inn16 INT64, inn17 INT64, inn18 INT64, inn19 STRING, inn20 STRING, inn21 STRING, inn22 STRING, inn23 STRING, inn24 STRING, inn25 STRING, inn26 STRING, inn27 STRING, inn28 STRING, lob INT64, mgr STRING, stattype STRING, b_pa INT64, b_ab INT64, b_r INT64, b_h INT64, b_d INT64, b_t INT64, b_hr INT64, b_rbi INT64, b_sh INT64, b_sf INT64, b_hbp INT64, b_w INT64, b_iw INT64, b_k INT64, b_sb INT64, b_cs INT64, b_gdp INT64, b_xi INT64, b_roe INT64, p_ipouts INT64, p_noout INT64, p_bfp INT64, p_h INT64, p_d INT64, p_t INT64, p_hr INT64, p_r INT64, p_er INT64, p_w INT64, p_iw INT64, p_k INT64, p_hbp INT64, p_wp INT64, p_bk INT64, p_sh INT64, p_sf INT64, p_sb INT64, p_cs INT64, p_pb INT64, d_po INT64, d_a INT64, d_e INT64, d_dp INT64, d_tp INT64, d_pb INT64, d_wp INT64, d_sb INT64, d_cs INT64, start_l1 STRING, start_l2 STRING, start_l3 STRING, start_l4 STRING, start_l5 STRING, start_l6 STRING, start_l7 STRING, start_l8 STRING, start_l9 STRING, start_f1 STRING, start_f2 STRING, start_f3 STRING, start_f4 STRING, start_f5 STRING, start_f6 STRING, start_f7 STRING, start_f8 STRING, start_f9 STRING, start_f10 STRING, date DATE, number INT64, site STRING, vishome STRING, opp STRING, win INT64, loss INT64, tie INT64, gametype STRING, box STRING, pbp STRING);
CREATE REL TABLE IF NOT EXISTS BATTED_IN_GAME(FROM Player TO Game, team STRING, b_lp INT64, b_seq INT64, stattype STRING, b_pa INT64, b_ab INT64, b_r INT64, b_h INT64, b_d INT64, b_t INT64, b_hr INT64, b_rbi INT64, b_sh INT64, b_sf INT64, b_hbp INT64, b_w INT64, b_iw INT64, b_k INT64, b_sb INT64, b_cs INT64, b_gdp INT64, b_xi INT64, b_roe INT64, dh INT64, ph INT64, pr INT64, date DATE, number INT64, site STRING, vishome STRING, opp STRING, win INT64, loss INT64, tie INT64, gametype STRING, box STRING, pbp STRING);
CREATE REL TABLE IF NOT EXISTS FIELDED_IN_GAME(FROM Player TO Game, team STRING, d_seq INT64, d_pos INT64, stattype STRING, d_ifouts INT64, d_po INT64, d_a INT64, d_e INT64, d_dp INT64, d_tp INT64, d_pb INT64, d_wp INT64, d_sb INT64, d_cs INT64, d_gs INT64, date DATE, number INT64, site STRING, vishome STRING, opp STRING, win INT64, loss INT64, tie INT64, gametype STRING, box STRING, pbp STRING);
CREATE REL TABLE IF NOT EXISTS PITCHED_IN_GAME(FROM Player TO Game, team STRING, p_seq INT64, stattype STRING, p_ipouts INT64, p_noout INT64, p_bfp INT64, p_h INT64, p_d INT64, p_t INT64, p_hr INT64, p_r INT64, p_er INT64, p_w INT64, p_iw INT64, p_k INT64, p_hbp INT64, p_wp INT64, p_bk INT64, p_sh INT64, p_sf INT64, p_sb INT64, p_cs INT64, p_pb INT64, wp INT64, lp INT64, save INT64, p_gs INT64, p_gf INT64, p_cg INT64, date DATE, number INT64, site STRING, vishome STRING, opp STRING, win INT64, loss INT64, tie INT64, gametype STRING, box STRING, pbp STRING);
CREATE REL TABLE IF NOT EXISTS UMPIRED_IN_GAME(FROM Umpire TO Game, position STRING);
CREATE REL TABLE IF NOT EXISTS OCCURRED_IN_GAME (FROM Event TO Game, event STRING, date DATE);
CREATE REL TABLE IF NOT EXISTS PITCHED_IN_EVENT (FROM Player TO Event, pa INT64, pb INT64, pithand STRING, nump INT64, balls INT64, strikes INT64, count INT64, pitches STRING, wp INT64, bk INT64, runs INT64, er INT64, tur INT64, hr INT64, k INT64, walk INT64, iw INT64, hbp INT64, bip INT64, bunt INT64, ground INT64, fly INT64, line INT64, bathand STRING, risp BOOLEAN);
CREATE REL TABLE IF NOT EXISTS BATTED_IN_EVENT (FROM Player TO Event, pa INT64, ab INT64, bat_f INT64, bathand STRING, lineup_pos INT64, `single` INT64, double INT64, triple INT64, hr INT64, k INT64, walk INT64, iw INT64, hbp INT64, sh INT64, sf INT64, xi INT64, roe INT64, fc INT64, gdp INT64, tp INT64, bip INT64, bunt INT64, ground INT64, fly INT64, line INT64, rbi INT64, outs_pre INT64, risp BOOLEAN, k_safe INT64, fseq STRING);
CREATE REL TABLE IF NOT EXISTS ADVANCED_IN_EVENT (FROM Player TO Event, start_base INT64, end_base INT64, is_out INT64, rbi_credited INT64);
CREATE REL TABLE IF NOT EXISTS FIELDED_IN_EVENT (FROM Player TO Event, position INT64, putouts INT64, assists INT64, errors INT64);
CREATE REL TABLE IF NOT EXISTS PLAYED_IN_BALLPARK(FROM Game TO Ballpark);
CREATE REL TABLE IF NOT EXISTS CONNECTED_TO (FROM Event TO Event, inning INT64, top_bot INT64, from_event_string STRING, to_event_string STRING, runs_scored INT64, outs_recorded INT64, has_risp BOOLEAN);
CREATE REL TABLE IF NOT EXISTS RELATED_TO(FROM Player TO Player, relation STRING);
ladybugdb/schema.cypher
Right away we know this is NOT SQL thanks to the NODE and REL specifiers in the syntax alongside the fact that LadybugDB is a graph database. As I mentioned in the introduction, LadybugDB uses OpenCypher as its query language. Within our graph, Player, Umpire, Team, Ballpark, Game, and Event form the vertices and the various *_IN_*, *_FOR_*, and *_TO form the edges between them.
One interesting thing to note: unlike the other CREATE TABLE statements, the REL TABLE must not explicitly include the related node ids in the list of fields. This is handled automatically by the FROM Node TO Node syntax. This also means those two columns must be in the first and second positions of each of the edge’s .csv files, as stated in the previous section.
Let’s try it out:
lbug graph_baseball.lbug < ladybugdb/schema.cypher
LadybugDB COPY statements#
Now let's load our CSV data into LadybugDB with the COPY command that supports a glob operator (*) to dynamically match all files in a directory.
// Nodes
COPY Player FROM 'parsed/biofile0.csv' (header=true, ignore_errors=false);
COPY Umpire FROM 'parsed/umpires0.csv' (header=true, ignore_errors=false);
COPY Team FROM 'parsed/teams0.csv' (header=true, ignore_errors=false);
COPY Ballpark FROM 'parsed/ballparks0.csv' (header=true, ignore_errors=false);
COPY Game FROM 'parsed/*gameinfo.csv' (header=true, ignore_errors=false);
COPY Event FROM 'parsed/*plays.csv' (header=true, ignore_errors=false);
// Edges
COPY PLAYED_FOR_TEAM FROM 'parsed/*allplayers.csv' (header=true, ignore_errors=false);
COPY COACHED_FOR_TEAM FROM 'parsed/coaches0.csv' (header=true, ignore_errors=false);
COPY PLAYED_IN_GAME FROM 'parsed/*teamstats.csv' (header=true, ignore_errors=false);
COPY BATTED_IN_GAME FROM 'parsed/*batting.csv' (header=true, ignore_errors=false);
COPY FIELDED_IN_GAME FROM 'parsed/*fielding.csv' (header=true, ignore_errors=false);
COPY PITCHED_IN_GAME FROM 'parsed/*pitching.csv' (header=true, ignore_errors=false);
COPY RELATED_TO FROM 'parsed/relatives.csv' (header=true, ignore_errors=false);
COPY OCCURRED_IN_GAME FROM (
MATCH (e:Event)
WHERE e.event_id IS NOT NULL AND e.gid IS NOT NULL
RETURN e.event_id, e.gid, e.event, e.date
);
COPY UMPIRED_IN_GAME FROM (
MATCH (g:Game)
UNWIND [
{id: g.umphome, pos: 'home'},
{id: g.ump1b, pos: '1b'},
{id: g.ump2b, pos: '2b'},
{id: g.ump3b, pos: '3b'},
{id: g.umplf, pos: 'lf'},
{id: g.umprf, pos: 'rf'}
] AS ump
WITH g, ump
WHERE ump.id <> '(none)' AND ump.id <> ''
MATCH (u:Umpire {id: ump.id})
RETURN u.id, g.gid, ump.pos
);
COPY PITCHED_IN_EVENT FROM (
MATCH (e:Event)
WHERE e.pitcher IS NOT NULL AND e.event_id IS NOT NULL
RETURN
e.pitcher, e.event_id, e.pa, e.pb, e.pithand, e.nump, e.balls, e.strikes,
e.`count`, e.pitches, e.wp, e.bk, e.runs, e.er, e.tur, e.hr, e.k, e.walk,
e.iw, e.hbp, e.bip, e.bunt, e.ground, e.fly, e.line, e.bathand,
(e.br2_pre IS NOT NULL OR e.br3_pre IS NOT NULL)
);
COPY BATTED_IN_EVENT FROM (
MATCH (e:Event)
WHERE e.batter IS NOT NULL AND e.event_id IS NOT NULL
WITH e, CASE e.batter
WHEN e.l1 THEN 1 WHEN e.l2 THEN 2 WHEN e.l3 THEN 3
WHEN e.l4 THEN 4 WHEN e.l5 THEN 5 WHEN e.l6 THEN 6
WHEN e.l7 THEN 7 WHEN e.l8 THEN 8 WHEN e.l9 THEN 9
ELSE 0
END AS lineup_pos
RETURN
e.batter, e.event_id, e.pa, e.ab, e.bat_f, e.bathand, lineup_pos,
e.`single`, e.`double`, e.triple, e.hr, e.k, e.walk, e.iw, e.hbp, e.sh,
e.sf, e.xi, e.roe, e.fc, e.gdp, e.tp, e.bip, e.bunt, e.ground, e.fly,
e.line, e.rbi, e.outs_pre, (e.br2_pre IS NOT NULL OR e.br3_pre IS NOT NULL), e.k_safe, e.fseq
);
COPY ADVANCED_IN_EVENT FROM (
MATCH (e:Event)
WITH e,
list_creation(e.batter, e.br1_pre, e.br2_pre, e.br3_pre) AS pre,
list_creation(e.br1_post, e.br2_post, e.br3_post) AS post,
list_creation(e.run_b, e.run1, e.run2, e.run3) AS runs
UNWIND pre as b
WITH e, b, pre, post, runs
WHERE b IS NOT NULL
WITH e, b, list_position(pre, b) - 1 AS start_base,
CASE
WHEN list_has(runs, b) THEN 4
WHEN list_has(post, b) THEN list_position(post, b)
ELSE 0
END AS end_base
RETURN b, e.event_id, start_base, end_base, (CASE WHEN end_base = 0 THEN 1 ELSE 0 END), (CASE WHEN end_base = 4 AND e.rbi > 0 THEN 1 ELSE 0 END)
);
COPY FIELDED_IN_EVENT FROM (
MATCH (e:Event)
UNWIND [
{id: e.pitcher, pos: 1, po: e.po1, a: e.a1, err: e.e1},
{id: e.f2, pos: 2, po: e.po2, a: e.a2, err: e.e2},
{id: e.f3, pos: 3, po: e.po3, a: e.a3, err: e.e3},
{id: e.f4, pos: 4, po: e.po4, a: e.a4, err: e.e4},
{id: e.f5, pos: 5, po: e.po5, a: e.a5, err: e.e5},
{id: e.f6, pos: 6, po: e.po6, a: e.a6, err: e.e6},
{id: e.f7, pos: 7, po: e.po7, a: e.a7, err: e.e7},
{id: e.f8, pos: 8, po: e.po8, a: e.a8, err: e.e8},
{id: e.f9, pos: 9, po: e.po9, a: e.a9, err: e.e9}
] AS fielder
WITH e, fielder
WHERE fielder.po > 0 OR fielder.a > 0 OR fielder.err > 0
RETURN fielder.id, e.event_id, fielder.pos, fielder.po, fielder.a, fielder.err
);
COPY PLAYED_IN_BALLPARK FROM (
MATCH (g:Game)
RETURN g.gid, g.site
);
COPY CONNECTED_TO FROM (
MATCH (e1:Event), (e2:Event)
WHERE e1.gid = e2.gid
AND e1.inning = e2.inning
AND e1.top_bot = e2.top_bot
AND e2.pn = e1.pn + 1
RETURN
e1.event_id, e2.event_id, e1.inning, e1.top_bot, e1.event, e2.event, e1.runs,
(e1.outs_post - e1.outs_pre),
(e1.br2_pre IS NOT NULL OR e1.br3_pre IS NOT NULL)
);
ladybugdb/copy.cypher
You may also notice that not every relationship table is populated directly from a CSV file; instead we COPY the result of a MATCH subquery directly into the table! This is similar to INSERT INTO with SELECT, but with no CTEs and JOINs the syntax is slightly more compact than the SQL equivalent.
The COPY command infers the column names from the specified table, so there is no need to alias in the RETURN statement, just keep in mind the position of fields returned must match the schema order.
Let’s give it a try:
lbug graph_baseball.lbug < ladybugdb/copy.cypher
LadybugDB EXPORT#
With LadybugDB loaded, we can execute EXPORT DATABASE to get our graph data in a format we can more easily ingest into DuckDB and Postgres.
By default, LadybugDB exports data as Parquet files. While DuckDB supports Parquet, you'd need an extension to read or write Parquet from Postgres. So we'll export LadybugDB data as CSV which all three natively support.
echo "EXPORT DATABASE './ladybugdb/export' (format='csv', header=true);" | lbug graph_baseball.lbug
With this data, let's populate DuckDB and Postgres, our next batters in the lineup!
On deck: DuckDB#
Creating the tables in DuckDB has no surprises, its dialect is fairly similar to PostgreSQL and supports the standard SQL types (LadybugDB’s STRING becomes VARCHAR, INT64 becomes BIGINT). You might notice there are NO Foreign Key constraints included in the DDL for our many-to-many tables. This is intentional: they will be defined in our Property Graph Tables, which act as a layer on top of our data.
-- Entities Tables
CREATE TABLE player (id VARCHAR PRIMARY KEY, lastname VARCHAR, usename VARCHAR, fullname VARCHAR, birthdate DATE, birthcity VARCHAR, birthstate VARCHAR, birthcountry VARCHAR, deathdate DATE, deathcity VARCHAR, deathstate VARCHAR, deathcountry VARCHAR, cemetery VARCHAR, cem_city VARCHAR, cem_state VARCHAR, cem_ctry VARCHAR, cem_note VARCHAR, birthname VARCHAR, altname VARCHAR, debut_p DATE, last_p DATE, debut_c DATE, last_c DATE, debut_m DATE, last_m DATE, debut_u DATE, last_u DATE, bats VARCHAR, throws VARCHAR, height DOUBLE, weight DOUBLE, HOF VARCHAR);
CREATE TABLE umpire( id VARCHAR PRIMARY KEY, lastname VARCHAR, firstname VARCHAR, first_g DATE, last_g DATE);
CREATE TABLE team (team VARCHAR PRIMARY KEY, city VARCHAR, nickname VARCHAR, first_g DATE, last_g DATE);
CREATE TABLE ballpark ("site" VARCHAR PRIMARY KEY, "name" VARCHAR, city VARCHAR, state VARCHAR, first_g DATE, last_g DATE);
CREATE TABLE game (gid VARCHAR PRIMARY KEY, "site" VARCHAR, visteam VARCHAR, hometeam VARCHAR, date DATE, number BIGINT, starttime VARCHAR, daynight VARCHAR, innings BIGINT, tiebreaker BIGINT, usedh BOOLEAN, htbf VARCHAR, timeofgame BIGINT, attendance BIGINT, fieldcond VARCHAR, precip VARCHAR, sky VARCHAR, "temp" BIGINT, winddir VARCHAR, windspeed BIGINT, oscorer VARCHAR, forfeit VARCHAR, suspend BIGINT, umphome VARCHAR, ump1b VARCHAR, ump2b VARCHAR, ump3b VARCHAR, umplf VARCHAR, umprf VARCHAR, wp VARCHAR, lp VARCHAR, save VARCHAR, gametype VARCHAR, vruns BIGINT, hruns BIGINT, wteam VARCHAR, lteam VARCHAR, line VARCHAR, batteries VARCHAR, lineups VARCHAR, box VARCHAR, pbp VARCHAR, season BIGINT);
CREATE TABLE "event" (event_id VARCHAR PRIMARY KEY, gid VARCHAR, "event" VARCHAR, inning BIGINT, top_bot BIGINT, vis_home BIGINT, site VARCHAR, batteam VARCHAR, pitteam VARCHAR, score_v BIGINT, score_h BIGINT, batter VARCHAR, pitcher VARCHAR, lp BIGINT, bat_f BIGINT, bathand VARCHAR, pithand VARCHAR, balls BIGINT, strikes BIGINT, count BIGINT, pitches VARCHAR, nump BIGINT, pa BIGINT, ab BIGINT, single BIGINT, "double" BIGINT, triple BIGINT, hr BIGINT, sh BIGINT, sf BIGINT, hbp BIGINT, walk BIGINT, k BIGINT, xi BIGINT, roe BIGINT, fc BIGINT, othout BIGINT, noout BIGINT, oth BIGINT, bip BIGINT, bunt BIGINT, ground BIGINT, fly BIGINT, line BIGINT, iw BIGINT, gdp BIGINT, othdp BIGINT, tp BIGINT, fle BIGINT, wp BIGINT, pb BIGINT, bk BIGINT, oa BIGINT, di BIGINT, sb2 BIGINT, sb3 BIGINT, sbh BIGINT, cs2 BIGINT, cs3 BIGINT, csh BIGINT, pko1 BIGINT, pko2 BIGINT, pko3 BIGINT, k_safe BIGINT, e1 BIGINT, e2 BIGINT, e3 BIGINT, e4 BIGINT, e5 BIGINT, e6 BIGINT, e7 BIGINT, e8 BIGINT, e9 BIGINT, outs_pre BIGINT, outs_post BIGINT, br1_pre VARCHAR, br2_pre VARCHAR, br3_pre VARCHAR, br1_post VARCHAR, br2_post VARCHAR, br3_post VARCHAR, lob_id1 VARCHAR, lob_id2 VARCHAR, lob_id3 VARCHAR, pr1_pre VARCHAR, pr2_pre VARCHAR, pr3_pre VARCHAR, pr1_post VARCHAR, pr2_post VARCHAR, pr3_post VARCHAR, run_b VARCHAR, run1 VARCHAR, run2 VARCHAR, run3 VARCHAR, prun_b VARCHAR, prun1 VARCHAR, prun2 VARCHAR, prun3 VARCHAR, ur_b BIGINT, ur1 BIGINT, ur2 BIGINT, ur3 BIGINT, rbi_b BIGINT, rbi1 BIGINT, rbi2 BIGINT, rbi3 BIGINT, runs BIGINT, rbi BIGINT, er BIGINT, tur BIGINT, l1 VARCHAR, l2 VARCHAR, l3 VARCHAR, l4 VARCHAR, l5 VARCHAR, l6 VARCHAR, l7 VARCHAR, l8 VARCHAR, l9 VARCHAR, lf1 BIGINT, lf2 BIGINT, lf3 BIGINT, lf4 BIGINT, lf5 BIGINT, lf6 BIGINT, lf7 BIGINT, lf8 BIGINT, lf9 BIGINT, f2 VARCHAR, f3 VARCHAR, f4 VARCHAR, f5 VARCHAR, f6 VARCHAR, f7 VARCHAR, f8 VARCHAR, f9 VARCHAR, po0 BIGINT, po1 BIGINT, po2 BIGINT, po3 BIGINT, po4 BIGINT, po5 BIGINT, po6 BIGINT, po7 BIGINT, po8 BIGINT, po9 BIGINT, a1 BIGINT, a2 BIGINT, a3 BIGINT, a4 BIGINT, a5 BIGINT, a6 BIGINT, a7 BIGINT, a8 BIGINT, a9 BIGINT, fseq BIGINT, batout1 BIGINT, batout2 BIGINT, batout3 BIGINT, brout_b BIGINT, brout1 BIGINT, brout2 BIGINT, brout3 BIGINT, firstf BIGINT, loc VARCHAR, hittype VARCHAR, dpopp BIGINT, "pivot" BIGINT, pn BIGINT, umphome VARCHAR, ump1b VARCHAR, ump2b VARCHAR, ump3b VARCHAR, umplf VARCHAR, umprf VARCHAR, date DATE, gametype VARCHAR, pbp VARCHAR);
-- Many-to-Many Tables
CREATE TABLE players_teams (id VARCHAR, team VARCHAR, "last" VARCHAR, "first" VARCHAR, bat VARCHAR, throw VARCHAR, g BIGINT, g_p BIGINT, g_sp BIGINT, g_rp BIGINT, g_c BIGINT, g_1b BIGINT, g_2b BIGINT, g_3b BIGINT, g_ss BIGINT, g_lf BIGINT, g_cf BIGINT, g_rf BIGINT, g_of BIGINT, g_dh BIGINT, g_ph BIGINT, g_pr BIGINT, first_g DATE, last_g DATE);
CREATE TABLE coaches_teams (id VARCHAR, team VARCHAR, "year" BIGINT, "role" VARCHAR, first_g DATE, last_g DATE, PRIMARY KEY(id, first_g));
CREATE TABLE teams_games (team VARCHAR, gid VARCHAR, inn1 BIGINT, inn2 BIGINT, inn3 BIGINT, inn4 BIGINT, inn5 BIGINT, inn6 BIGINT, inn7 BIGINT, inn8 BIGINT, inn9 BIGINT, inn10 BIGINT, inn11 BIGINT, inn12 BIGINT, inn13 BIGINT, inn14 BIGINT, inn15 BIGINT, inn16 BIGINT, inn17 BIGINT, inn18 BIGINT, inn19 VARCHAR, inn20 VARCHAR, inn21 VARCHAR, inn22 VARCHAR, inn23 VARCHAR, inn24 VARCHAR, inn25 VARCHAR, inn26 VARCHAR, inn27 VARCHAR, inn28 VARCHAR, lob BIGINT, mgr VARCHAR, stattype VARCHAR, b_pa BIGINT, b_ab BIGINT, b_r BIGINT, b_h BIGINT, b_d BIGINT, b_t BIGINT, b_hr BIGINT, b_rbi BIGINT, b_sh BIGINT, b_sf BIGINT, b_hbp BIGINT, b_w BIGINT, b_iw BIGINT, b_k BIGINT, b_sb BIGINT, b_cs BIGINT, b_gdp BIGINT, b_xi BIGINT, b_roe BIGINT, p_ipouts BIGINT, p_noout BIGINT, p_bfp BIGINT, p_h BIGINT, p_d BIGINT, p_t BIGINT, p_hr BIGINT, p_r BIGINT, p_er BIGINT, p_w BIGINT, p_iw BIGINT, p_k BIGINT, p_hbp BIGINT, p_wp BIGINT, p_bk BIGINT, p_sh BIGINT, p_sf BIGINT, p_sb BIGINT, p_cs BIGINT, p_pb BIGINT, d_po BIGINT, d_a BIGINT, d_e BIGINT, d_dp BIGINT, d_tp BIGINT, d_pb BIGINT, d_wp BIGINT, d_sb BIGINT, d_cs BIGINT, start_l1 VARCHAR, start_l2 VARCHAR, start_l3 VARCHAR, start_l4 VARCHAR, start_l5 VARCHAR, start_l6 VARCHAR, start_l7 VARCHAR, start_l8 VARCHAR, start_l9 VARCHAR, start_f1 VARCHAR, start_f2 VARCHAR, start_f3 VARCHAR, start_f4 VARCHAR, start_f5 VARCHAR, start_f6 VARCHAR, start_f7 VARCHAR, start_f8 VARCHAR, start_f9 VARCHAR, start_f10 VARCHAR, date DATE, number BIGINT, site VARCHAR, vishome VARCHAR, opp VARCHAR, win BIGINT, loss BIGINT, tie BIGINT, gametype VARCHAR, box VARCHAR, pbp VARCHAR);
CREATE TABLE batting_games (id VARCHAR, gid VARCHAR, team VARCHAR, b_lp BIGINT, b_seq BIGINT, stattype VARCHAR, b_pa BIGINT, b_ab BIGINT, b_r BIGINT, b_h BIGINT, b_d BIGINT, b_t BIGINT, b_hr BIGINT, b_rbi BIGINT, b_sh BIGINT, b_sf BIGINT, b_hbp BIGINT, b_w BIGINT, b_iw BIGINT, b_k BIGINT, b_sb BIGINT, b_cs BIGINT, b_gdp BIGINT, b_xi BIGINT, b_roe BIGINT, dh BIGINT, ph BIGINT, pr BIGINT, date DATE, number BIGINT, site VARCHAR, vishome VARCHAR, opp VARCHAR, win BIGINT, loss BIGINT, tie BIGINT, gametype VARCHAR, box VARCHAR, pbp VARCHAR);
CREATE TABLE fielding_games (id VARCHAR, gid VARCHAR, team VARCHAR, d_seq BIGINT, d_pos BIGINT, stattype VARCHAR, d_ifouts BIGINT, d_po BIGINT, d_a BIGINT, d_e BIGINT, d_dp BIGINT, d_tp BIGINT, d_pb BIGINT, d_wp BIGINT, d_sb BIGINT, d_cs BIGINT, d_gs BIGINT, date DATE, number BIGINT, site VARCHAR, vishome VARCHAR, opp VARCHAR, win BIGINT, loss BIGINT, tie BIGINT, gametype VARCHAR, box VARCHAR, pbp VARCHAR);
CREATE TABLE pitching_games (id VARCHAR, gid VARCHAR, team VARCHAR, p_seq BIGINT, stattype VARCHAR, p_ipouts BIGINT, p_noout BIGINT, p_bfp BIGINT, p_h BIGINT, p_d BIGINT, p_t BIGINT, p_hr BIGINT, p_r BIGINT, p_er BIGINT, p_w BIGINT, p_iw BIGINT, p_k BIGINT, p_hbp BIGINT, p_wp BIGINT, p_bk BIGINT, p_sh BIGINT, p_sf BIGINT, p_sb BIGINT, p_cs BIGINT, p_pb BIGINT, wp BIGINT, lp BIGINT, save BIGINT, p_gs BIGINT, p_gf BIGINT, p_cg BIGINT, date DATE, number BIGINT, site VARCHAR, vishome VARCHAR, opp VARCHAR, win BIGINT, loss BIGINT, tie BIGINT, gametype VARCHAR, box VARCHAR, pbp VARCHAR);
CREATE TABLE umpires_games (id VARCHAR, gid VARCHAR, position VARCHAR);
CREATE TABLE events_games (event_id VARCHAR, gid VARCHAR, "event" VARCHAR, "date" DATE);
CREATE TABLE pitching_events (id VARCHAR, event_id VARCHAR, pa BIGINT, pb BIGINT, pithand VARCHAR, nump BIGINT, balls BIGINT, strikes BIGINT, count BIGINT, pitches VARCHAR, wp BIGINT, bk BIGINT, runs BIGINT, er BIGINT, tur BIGINT, hr BIGINT, k BIGINT, walk BIGINT, iw BIGINT, hbp BIGINT, bip BIGINT, bunt BIGINT, ground BIGINT, fly BIGINT, line BIGINT, bathand VARCHAR, risp BOOLEAN);
CREATE TABLE batting_events (id VARCHAR, event_id VARCHAR, pa BIGINT, ab BIGINT, bat_f BIGINT, bathand VARCHAR, lineup_pos BIGINT, single BIGINT, double BIGINT, triple BIGINT, hr BIGINT, k BIGINT, walk BIGINT, iw BIGINT, hbp BIGINT, sh BIGINT, sf BIGINT, xi BIGINT, roe BIGINT, fc BIGINT, gdp BIGINT, tp BIGINT, bip BIGINT, bunt BIGINT, ground BIGINT, fly BIGINT, line BIGINT, rbi BIGINT, outs_pre BIGINT, risp BOOLEAN, k_safe BIGINT, fseq VARCHAR);
CREATE TABLE advancing_events (id VARCHAR, event_id VARCHAR, start_base BIGINT, end_base BIGINT, is_out BIGINT, rbi_credited BIGINT);
CREATE TABLE fielding_events (id VARCHAR, event_id VARCHAR, position BIGINT, putouts BIGINT, assists BIGINT, errors BIGINT);
CREATE TABLE connected_events (event_id1 VARCHAR, event_id2 VARCHAR, inning BIGINT, top_bot BIGINT, from_event_string VARCHAR, to_event_string VARCHAR, runs_scored BIGINT, outs_recorded BIGINT, has_risp BOOLEAN);
CREATE TABLE players_relationships (id1 VARCHAR, id2 VARCHAR, relation VARCHAR);
duckdb/schema.sql
Run it with DuckDB:
duckdb graph_baseball.duckdb < duckdb/schema.sql
Now let's load all our data into DuckDB. We save ourselves some data-wrangling effort by populating some of the many-to-many tables with the CSV files generated by LadybugDB’s export command.
-- Entities Tables
COPY player FROM 'parsed/biofile0.csv';
COPY umpire FROM 'parsed/umpires0.csv';
COPY team FROM 'parsed/teams0.csv';
COPY ballpark FROM 'parsed/ballparks0.csv';
COPY game FROM 'parsed/*gameinfo.csv';
COPY "event" FROM 'parsed/*plays.csv';
-- Many to Many Tables
COPY players_teams FROM 'parsed/*allplayers.csv';
COPY coaches_teams FROM 'parsed/coaches0.csv';
COPY teams_games FROM 'parsed/*teamstats.csv';
COPY batting_games FROM 'parsed/*batting.csv';
COPY fielding_games FROM 'parsed/*fielding.csv';
COPY pitching_games FROM 'parsed/*pitching.csv';
COPY players_relationships FROM 'parsed/relatives.csv';
COPY events_games FROM './ladybugdb/export/OCCURRED_IN_GAME_Event_Game.csv';
COPY umpires_games FROM './ladybugdb/export/UMPIRED_IN_GAME_Umpire_Game.csv';
COPY pitching_events FROM './ladybugdb/export/PITCHED_IN_EVENT_Player_Event.csv';
COPY batting_events FROM './ladybugdb/export/BATTED_IN_EVENT_Player_Event.csv';
COPY advancing_events FROM './ladybugdb/export/ADVANCED_IN_EVENT_Player_Event.csv';
COPY fielding_events FROM './ladybugdb/export/FIELDED_IN_EVENT_Player_Event.csv';
COPY connected_events FROM './ladybugdb/export/CONNECTED_TO_Event_Event.csv';
duckdb/copy.sql
Run it with DuckDB:
duckdb graph_baseball.duckdb < duckdb/copy.sql
Now we can try out Property Graph Queries in DuckDB via duckpgq.
Property Graph Queries in DuckDB#
While we can define the graph with primary key and foreign key constraints, let's instead define it in the graph itself for DuckDB. (Later on with Postgres we'll go the primary key and foreign key route.)
LOAD 'duckpgq';
CREATE PROPERTY GRAPH baseball VERTEX TABLES (
player,
umpire,
team,
ballpark,
game,
"event"
) EDGE TABLES (
players_teams SOURCE KEY (id) REFERENCES player(id)
DESTINATION KEY (team) REFERENCES team(team)
LABEL played_for_team,
coaches_teams SOURCE KEY (id) REFERENCES player(id)
DESTINATION KEY (team) REFERENCES team(team)
LABEL coached_for_team,
teams_games SOURCE KEY (team) REFERENCES team(team)
DESTINATION KEY (gid) REFERENCES game(gid)
LABEL played_in_game,
batting_games SOURCE KEY (id) REFERENCES player(id)
DESTINATION KEY (gid) REFERENCES game(gid)
LABEL batted_in_game,
fielding_games SOURCE KEY (id) REFERENCES player(id)
DESTINATION KEY (gid) REFERENCES game(gid)
LABEL fielded_in_game,
pitching_games SOURCE KEY (id) REFERENCES player(id)
DESTINATION KEY (gid) REFERENCES game(gid)
LABEL pitched_in_game,
umpires_games SOURCE KEY (id) REFERENCES umpire(id)
DESTINATION KEY (gid) REFERENCES game(gid)
LABEL umpired_in_game,
events_games SOURCE KEY (event_id) REFERENCES "event"(event_id)
DESTINATION KEY (gid) REFERENCES game(gid)
LABEL occurred_in_game,
pitching_events SOURCE KEY (id) REFERENCES player(id)
DESTINATION KEY (event_id) REFERENCES "event"(event_id)
LABEL pitched_in_event,
batting_events SOURCE KEY (id) REFERENCES player(id)
DESTINATION KEY (event_id) REFERENCES "event"(event_id)
LABEL batted_in_event,
advancing_events SOURCE KEY (id) REFERENCES player(id)
DESTINATION KEY (event_id) REFERENCES "event"(event_id)
LABEL advanced_in_event,
fielding_events SOURCE KEY (id) REFERENCES player(id)
DESTINATION KEY (event_id) REFERENCES "event"(event_id)
LABEL fielded_in_event,
game AS games_ballparks SOURCE KEY (gid) REFERENCES game(gid)
DESTINATION KEY ("site") REFERENCES ballpark("site")
LABEL played_in_ballpark,
connected_events SOURCE KEY (event_id1) REFERENCES "event"(event_id)
DESTINATION KEY (event_id2) REFERENCES "event"(event_id)
LABEL connected_to,
players_relationships SOURCE KEY (id1) REFERENCES player(id)
DESTINATION KEY (id2) REFERENCES player(id)
LABEL related_to
);
duckdb/graph.sql
And run it.
duckdb graph_baseball.duckdb < duckdb/graph.sql
In the hole: Postgres#
Postgres is stricter. Postgres knows the rules are what make the game beautiful. We must do our work in a specific order to ensure Postgres is happy:
- Use
TEXTinstead of VARCHAR1 - Rename columns to avoid name-collision with other columns when creating our Property Graph2
- Create tables WITHOUT constraints to avoid validation for every row during copy
- Add primary key constraints and foreign key constraints after copying from the CSVs
- Validate the constraints
- Create the Property Graph
-- Entities Tables
CREATE TABLE player (id TEXT PRIMARY KEY, lastname TEXT, usename TEXT, fullname TEXT, birthdate DATE, birthcity TEXT, birthstate TEXT, birthcountry TEXT, deathdate DATE, deathcity TEXT, deathstate TEXT, deathcountry TEXT, cemetery TEXT, cem_city TEXT, cem_state TEXT, cem_ctry TEXT, cem_note TEXT, birthname TEXT, altname TEXT, debut_p DATE, last_p DATE, debut_c DATE, last_c DATE, debut_m DATE, last_m DATE, debut_u DATE, last_u DATE, bats TEXT, throws TEXT, height NUMERIC(4,2), weight NUMERIC(5,2), HOF TEXT);
CREATE TABLE umpire( id TEXT PRIMARY KEY, lastname TEXT, firstname TEXT, first_g DATE, last_g DATE);
CREATE TABLE team (team TEXT PRIMARY KEY, city TEXT, nickname TEXT, first_g DATE, last_g DATE);
CREATE TABLE ballpark ("site" TEXT PRIMARY KEY, "name" TEXT, city TEXT, state TEXT, first_g DATE, last_g DATE);
-- We rename the TEXT fields wp, lp, save and line to winning_pitcher, losing_pitcher, saving_pitcher, and linescore_known to not conflict with those BIGINT fields in other tables.
CREATE TABLE game (gid TEXT PRIMARY KEY, "site" TEXT, visteam TEXT, hometeam TEXT, date DATE, number BIGINT, starttime TEXT, daynight TEXT, innings BIGINT, tiebreaker BIGINT, usedh BOOLEAN, htbf TEXT, timeofgame BIGINT, attendance BIGINT, fieldcond TEXT, precip TEXT, sky TEXT, "temp" BIGINT, winddir TEXT, windspeed BIGINT, oscorer TEXT, forfeit TEXT, suspend BIGINT, umphome TEXT, ump1b TEXT, ump2b TEXT, ump3b TEXT, umplf TEXT, umprf TEXT, winning_pitcher TEXT, losing_pitcher TEXT, saving_pitcher TEXT, gametype TEXT, vruns BIGINT, hruns BIGINT, wteam TEXT, lteam TEXT, linescore_known TEXT, batteries TEXT, lineups TEXT, "box" TEXT, pbp TEXT, season BIGINT);
CREATE TABLE "event" (event_id TEXT PRIMARY KEY, gid TEXT, "event" TEXT, inning BIGINT, top_bot BIGINT, vis_home BIGINT, site TEXT, batteam TEXT, pitteam TEXT, score_v BIGINT, score_h BIGINT, batter TEXT, pitcher TEXT, lp BIGINT, bat_f BIGINT, bathand TEXT, pithand TEXT, balls BIGINT, strikes BIGINT, count BIGINT, pitches TEXT, nump BIGINT, pa BIGINT, ab BIGINT, single BIGINT, "double" BIGINT, triple BIGINT, hr BIGINT, sh BIGINT, sf BIGINT, hbp BIGINT, walk BIGINT, k BIGINT, xi BIGINT, roe BIGINT, fc BIGINT, othout BIGINT, noout BIGINT, oth BIGINT, bip BIGINT, bunt BIGINT, ground BIGINT, fly BIGINT, line BIGINT, iw BIGINT, gdp BIGINT, othdp BIGINT, tp BIGINT, fle BIGINT, wp BIGINT, pb BIGINT, bk BIGINT, oa BIGINT, di BIGINT, sb2 BIGINT, sb3 BIGINT, sbh BIGINT, cs2 BIGINT, cs3 BIGINT, csh BIGINT, pko1 BIGINT, pko2 BIGINT, pko3 BIGINT, k_safe BIGINT, e1 BIGINT, e2 BIGINT, e3 BIGINT, e4 BIGINT, e5 BIGINT, e6 BIGINT, e7 BIGINT, e8 BIGINT, e9 BIGINT, outs_pre BIGINT, outs_post BIGINT, br1_pre TEXT, br2_pre TEXT, br3_pre TEXT, br1_post TEXT, br2_post TEXT, br3_post TEXT, lob_id1 TEXT, lob_id2 TEXT, lob_id3 TEXT, pr1_pre TEXT, pr2_pre TEXT, pr3_pre TEXT, pr1_post TEXT, pr2_post TEXT, pr3_post TEXT, run_b TEXT, run1 TEXT, run2 TEXT, run3 TEXT, prun_b TEXT, prun1 TEXT, prun2 TEXT, prun3 TEXT, ur_b BIGINT, ur1 BIGINT, ur2 BIGINT, ur3 BIGINT, rbi_b BIGINT, rbi1 BIGINT, rbi2 BIGINT, rbi3 BIGINT, runs BIGINT, rbi BIGINT, er BIGINT, tur BIGINT, l1 TEXT, l2 TEXT, l3 TEXT, l4 TEXT, l5 TEXT, l6 TEXT, l7 TEXT, l8 TEXT, l9 TEXT, lf1 BIGINT, lf2 BIGINT, lf3 BIGINT, lf4 BIGINT, lf5 BIGINT, lf6 BIGINT, lf7 BIGINT, lf8 BIGINT, lf9 BIGINT, f2 TEXT, f3 TEXT, f4 TEXT, f5 TEXT, f6 TEXT, f7 TEXT, f8 TEXT, f9 TEXT, po0 BIGINT, po1 BIGINT, po2 BIGINT, po3 BIGINT, po4 BIGINT, po5 BIGINT, po6 BIGINT, po7 BIGINT, po8 BIGINT, po9 BIGINT, a1 BIGINT, a2 BIGINT, a3 BIGINT, a4 BIGINT, a5 BIGINT, a6 BIGINT, a7 BIGINT, a8 BIGINT, a9 BIGINT, fseq BIGINT, batout1 BIGINT, batout2 BIGINT, batout3 BIGINT, brout_b BIGINT, brout1 BIGINT, brout2 BIGINT, brout3 BIGINT, firstf BIGINT, loc TEXT, hittype TEXT, dpopp BIGINT, "pivot" BIGINT, pn BIGINT, umphome TEXT, ump1b TEXT, ump2b TEXT, ump3b TEXT, umplf TEXT, umprf TEXT, date DATE, gametype TEXT, pbp TEXT);
-- Many-to-Many Tables
CREATE TABLE players_teams (id TEXT, team TEXT, "last" TEXT, "first" TEXT, bat TEXT, "throw" TEXT, g BIGINT, g_p BIGINT, g_sp BIGINT, g_rp BIGINT, g_c BIGINT, g_1b BIGINT, g_2b BIGINT, g_3b BIGINT, g_ss BIGINT, g_lf BIGINT, g_cf BIGINT, g_rf BIGINT, g_of BIGINT, g_dh BIGINT, g_ph BIGINT, g_pr BIGINT, first_g DATE, last_g DATE);
CREATE TABLE coaches_teams (id TEXT, team TEXT, "year" BIGINT, "role" TEXT, first_g DATE, last_g DATE);
CREATE TABLE teams_games (team TEXT, gid TEXT, inn1 BIGINT, inn2 BIGINT, inn3 BIGINT, inn4 BIGINT, inn5 BIGINT, inn6 BIGINT, inn7 BIGINT, inn8 BIGINT, inn9 BIGINT, inn10 BIGINT, inn11 BIGINT, inn12 BIGINT, inn13 BIGINT, inn14 BIGINT, inn15 BIGINT, inn16 BIGINT, inn17 BIGINT, inn18 BIGINT, inn19 TEXT, inn20 TEXT, inn21 TEXT, inn22 TEXT, inn23 TEXT, inn24 TEXT, inn25 TEXT, inn26 TEXT, inn27 TEXT, inn28 TEXT, lob BIGINT, mgr TEXT, stattype TEXT, b_pa BIGINT, b_ab BIGINT, b_r BIGINT, b_h BIGINT, b_d BIGINT, b_t BIGINT, b_hr BIGINT, b_rbi BIGINT, b_sh BIGINT, b_sf BIGINT, b_hbp BIGINT, b_w BIGINT, b_iw BIGINT, b_k BIGINT, b_sb BIGINT, b_cs BIGINT, b_gdp BIGINT, b_xi BIGINT, b_roe BIGINT, p_ipouts BIGINT, p_noout BIGINT, p_bfp BIGINT, p_h BIGINT, p_d BIGINT, p_t BIGINT, p_hr BIGINT, p_r BIGINT, p_er BIGINT, p_w BIGINT, p_iw BIGINT, p_k BIGINT, p_hbp BIGINT, p_wp BIGINT, p_bk BIGINT, p_sh BIGINT, p_sf BIGINT, p_sb BIGINT, p_cs BIGINT, p_pb BIGINT, d_po BIGINT, d_a BIGINT, d_e BIGINT, d_dp BIGINT, d_tp BIGINT, d_pb BIGINT, d_wp BIGINT, d_sb BIGINT, d_cs BIGINT, start_l1 TEXT, start_l2 TEXT, start_l3 TEXT, start_l4 TEXT, start_l5 TEXT, start_l6 TEXT, start_l7 TEXT, start_l8 TEXT, start_l9 TEXT, start_f1 TEXT, start_f2 TEXT, start_f3 TEXT, start_f4 TEXT, start_f5 TEXT, start_f6 TEXT, start_f7 TEXT, start_f8 TEXT, start_f9 TEXT, start_f10 TEXT, "date" DATE, "number" BIGINT, "site" TEXT, vishome TEXT, opp TEXT, win BIGINT, loss BIGINT, tie BIGINT, gametype TEXT, box TEXT, pbp TEXT);
CREATE TABLE batting_games (id TEXT, gid TEXT, team TEXT, b_lp BIGINT, b_seq BIGINT, stattype TEXT, b_pa BIGINT, b_ab BIGINT, b_r BIGINT, b_h BIGINT, b_d BIGINT, b_t BIGINT, b_hr BIGINT, b_rbi BIGINT, b_sh BIGINT, b_sf BIGINT, b_hbp BIGINT, b_w BIGINT, b_iw BIGINT, b_k BIGINT, b_sb BIGINT, b_cs BIGINT, b_gdp BIGINT, b_xi BIGINT, b_roe BIGINT, dh BIGINT, ph BIGINT, pr BIGINT, date DATE, number BIGINT, site TEXT, vishome TEXT, opp TEXT, win BIGINT, loss BIGINT, tie BIGINT, gametype TEXT, box TEXT, pbp TEXT);
CREATE TABLE fielding_games (id TEXT, gid TEXT, team TEXT, d_seq BIGINT, d_pos BIGINT, stattype TEXT, d_ifouts BIGINT, d_po BIGINT, d_a BIGINT, d_e BIGINT, d_dp BIGINT, d_tp BIGINT, d_pb BIGINT, d_wp BIGINT, d_sb BIGINT, d_cs BIGINT, d_gs BIGINT, date DATE, number BIGINT, site TEXT, vishome TEXT, opp TEXT, win BIGINT, loss BIGINT, tie BIGINT, gametype TEXT, box TEXT, pbp TEXT);
CREATE TABLE pitching_games (id TEXT, gid TEXT, team TEXT, p_seq BIGINT, stattype TEXT, p_ipouts BIGINT, p_noout BIGINT, p_bfp BIGINT, p_h BIGINT, p_d BIGINT, p_t BIGINT, p_hr BIGINT, p_r BIGINT, p_er BIGINT, p_w BIGINT, p_iw BIGINT, p_k BIGINT, p_hbp BIGINT, p_wp BIGINT, p_bk BIGINT, p_sh BIGINT, p_sf BIGINT, p_sb BIGINT, p_cs BIGINT, p_pb BIGINT, wp BIGINT, lp BIGINT, save BIGINT, p_gs BIGINT, p_gf BIGINT, p_cg BIGINT, date DATE, number BIGINT, site TEXT, vishome TEXT, opp TEXT, win BIGINT, loss BIGINT, tie BIGINT, gametype TEXT, box TEXT, pbp TEXT);
CREATE TABLE umpires_games (id TEXT, gid TEXT, ump_position TEXT);
CREATE TABLE events_games (event_id TEXT, gid TEXT, "event" TEXT, "date" DATE);
CREATE TABLE pitching_events (id TEXT, event_id TEXT, pa BIGINT, pb BIGINT, pithand TEXT, nump BIGINT, balls BIGINT, strikes BIGINT, count BIGINT, pitches TEXT, wp BIGINT, bk BIGINT, runs BIGINT, er BIGINT, tur BIGINT, hr BIGINT, k BIGINT, walk BIGINT, iw BIGINT, hbp BIGINT, bip BIGINT, bunt BIGINT, ground BIGINT, fly BIGINT, line BIGINT, bathand TEXT, risp BOOLEAN);
CREATE TABLE batting_events (id TEXT, event_id TEXT, pa BIGINT, ab BIGINT, bat_f BIGINT, bathand TEXT, lineup_pos BIGINT, single BIGINT, double BIGINT, triple BIGINT, hr BIGINT, k BIGINT, walk BIGINT, iw BIGINT, hbp BIGINT, sh BIGINT, sf BIGINT, xi BIGINT, roe BIGINT, fc BIGINT, gdp BIGINT, tp BIGINT, bip BIGINT, bunt BIGINT, ground BIGINT, fly BIGINT, line BIGINT, rbi BIGINT, outs_pre BIGINT, risp BOOLEAN, k_safe BIGINT, fseq BIGINT);
CREATE TABLE advancing_events (id TEXT, event_id TEXT, start_base BIGINT, end_base BIGINT, is_out BIGINT, rbi_credited BIGINT);
CREATE TABLE fielding_events (id TEXT, event_id TEXT, position BIGINT, putouts BIGINT, assists BIGINT, errors BIGINT);
CREATE TABLE connected_events (event_id1 TEXT, event_id2 TEXT, inning BIGINT, top_bot BIGINT, from_event_string TEXT, to_event_string TEXT, runs_scored BIGINT, outs_recorded BIGINT, has_risp BOOLEAN);
CREATE TABLE players_relationships (id1 TEXT, id2 TEXT, relation TEXT);
postgres/schema.sql
Run it against Postgres.
/postgres/bin/psql -U postgres -d graph_baseball -f postgres/schema.sql
Now let's load the data into Postgres.
Loading data into Postgres#
Because Postgres's COPY does not support the glob operator and assumes the files are available to the Postgres server itself, here's a little shell script that uses the /postgres/bin/psql client’s \copy command instead.
static=(
"player:parsed/biofile0.csv"
"umpire:parsed/umpires0.csv"
"team:parsed/teams0.csv"
"ballpark:parsed/ballparks0.csv"
"coaches_teams:parsed/coaches0.csv"
"players_relationships:parsed/relatives.csv"
"events_games:ladybugdb/export/OCCURRED_IN_GAME_Event_Game.csv"
"umpires_games:ladybugdb/export/UMPIRED_IN_GAME_Umpire_Game.csv"
"pitching_events:ladybugdb/export/PITCHED_IN_EVENT_Player_Event.csv"
"batting_events:ladybugdb/export/BATTED_IN_EVENT_Player_Event.csv"
"advancing_events:ladybugdb/export/ADVANCED_IN_EVENT_Player_Event.csv"
"fielding_events:ladybugdb/export/FIELDED_IN_EVENT_Player_Event.csv"
"connected_events:ladybugdb/export/CONNECTED_TO_Event_Event.csv"
)
for entry in "${static[@]}"; do
IFS=":" read -r table file <<< "$entry"
/postgres/bin/psql -U postgres -d graph_baseball -c "\copy ${table} FROM '$(pwd)/${file}' WITH (FORMAT csv, HEADER)"
done
for yr in {2010..2025}; do
base="$(pwd)/parsed/${yr}"
/postgres/bin/psql -U postgres -d graph_baseball -c "\copy game FROM '${base}gameinfo.csv' WITH (FORMAT csv, HEADER)"
/postgres/bin/psql -U postgres -d graph_baseball -c "\copy \"event\" FROM '${base}plays.csv' WITH (FORMAT csv, HEADER)"
/postgres/bin/psql -U postgres -d graph_baseball -c "\copy players_teams FROM '${base}allplayers.csv' WITH (FORMAT csv, HEADER)"
/postgres/bin/psql -U postgres -d graph_baseball -c "\copy teams_games FROM '${base}teamstats.csv' WITH (FORMAT csv, HEADER)"
/postgres/bin/psql -U postgres -d graph_baseball -c "\copy batting_games FROM '${base}batting.csv' WITH (FORMAT csv, HEADER)"
/postgres/bin/psql -U postgres -d graph_baseball -c "\copy fielding_games FROM '${base}fielding.csv' WITH (FORMAT csv, HEADER)"
/postgres/bin/psql -U postgres -d graph_baseball -c "\copy pitching_games FROM '${base}pitching.csv' WITH (FORMAT csv, HEADER)"
done
Now that our data is in PostgreSQL, let’s establish primary keys and foreign keys.
Postgres table constraints#
It is important to note that for join tables (which will form our edge tables in the property graph), Postgres requires that a primary key exist, otherwise it raises an error during Property Graph creation:
ERROR: no key specified and no suitable primary key exists for definition of element "players_teams"
LINE 9: players_teams SOURCE player DESTINATION team LABEL played…
To satisfy this condition, we create a "synthetic" BIGSERIAL primary key, which adds an auto-incrementing BIGINT value for each entry in the table.
We also add foreign key constraints, which we validate in the next step. Including NOT VALID allows for non-blocking constraint addition, avoiding the need to scan and check each entry when foreign keys are added. This approach runs faster for the large volume of data.
ALTER TABLE players_teams ADD COLUMN players_teams_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE coaches_teams ADD COLUMN coaches_teams_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE teams_games ADD COLUMN teams_games_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE batting_games ADD COLUMN batting_games_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE fielding_games ADD COLUMN fielding_games_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE pitching_games ADD COLUMN pitching_games_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE umpires_games ADD COLUMN umpires_games_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE events_games ADD COLUMN events_games_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE pitching_events ADD COLUMN pitching_events_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE batting_events ADD COLUMN batting_events_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE advancing_events ADD COLUMN advancing_events_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE fielding_events ADD COLUMN fielding_events_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE connected_events ADD COLUMN connected_events_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE players_relationships ADD COLUMN players_relationships_pk BIGSERIAL PRIMARY KEY;
ALTER TABLE game ADD CONSTRAINT fk_game_site FOREIGN KEY ("site") REFERENCES ballpark(site) NOT VALID;
ALTER TABLE "event" ADD CONSTRAINT fk_event_gid FOREIGN KEY (gid) REFERENCES game(gid) NOT VALID;
ALTER TABLE players_teams ADD CONSTRAINT fk_pt_player FOREIGN KEY (id) REFERENCES player(id) NOT VALID, ADD CONSTRAINT fk_pt_team FOREIGN KEY (team) REFERENCES team(team) NOT VALID;
ALTER TABLE coaches_teams ADD CONSTRAINT fk_ct_player FOREIGN KEY (id) REFERENCES player(id) NOT VALID, ADD CONSTRAINT fk_ct_team FOREIGN KEY (team) REFERENCES team(team) NOT VALID;
ALTER TABLE teams_games ADD CONSTRAINT fk_tg_game FOREIGN KEY (gid) REFERENCES game(gid) NOT VALID, ADD CONSTRAINT fk_tg_team FOREIGN KEY (team) REFERENCES team(team) NOT VALID;
ALTER TABLE batting_games ADD CONSTRAINT fk_bg_game FOREIGN KEY (gid) REFERENCES game(gid) NOT VALID, ADD CONSTRAINT fk_bg_player FOREIGN KEY (id) REFERENCES player(id) NOT VALID;
ALTER TABLE fielding_games ADD CONSTRAINT fk_fg_game FOREIGN KEY (gid) REFERENCES game(gid) NOT VALID, ADD CONSTRAINT fk_fg_player FOREIGN KEY (id) REFERENCES player(id) NOT VALID;
ALTER TABLE pitching_games ADD CONSTRAINT fk_pg_game FOREIGN KEY (gid) REFERENCES game(gid) NOT VALID, ADD CONSTRAINT fk_pg_player FOREIGN KEY (id) REFERENCES player(id) NOT VALID;
ALTER TABLE umpires_games ADD CONSTRAINT fk_ug_umpire FOREIGN KEY (id) REFERENCES umpire(id) NOT VALID, ADD CONSTRAINT fk_ug_game FOREIGN KEY (gid) REFERENCES game(gid) NOT VALID;
ALTER TABLE events_games ADD CONSTRAINT fk_eg_event FOREIGN KEY (event_id) REFERENCES "event"(event_id) NOT VALID, ADD CONSTRAINT fk_eg_game FOREIGN KEY (gid) REFERENCES game(gid) NOT VALID;
ALTER TABLE pitching_events ADD CONSTRAINT fk_pe_player FOREIGN KEY (id) REFERENCES player(id) NOT VALID, ADD CONSTRAINT fk_pe_event FOREIGN KEY (event_id) REFERENCES "event"(event_id) NOT VALID;
ALTER TABLE batting_events ADD CONSTRAINT fk_be_player FOREIGN KEY (id) REFERENCES player(id) NOT VALID, ADD CONSTRAINT fk_be_event FOREIGN KEY (event_id) REFERENCES "event"(event_id) NOT VALID;
ALTER TABLE advancing_events ADD CONSTRAINT fk_ae_player FOREIGN KEY (id) REFERENCES player(id) NOT VALID, ADD CONSTRAINT fk_ae_event FOREIGN KEY (event_id) REFERENCES "event"(event_id) NOT VALID;
ALTER TABLE fielding_events ADD CONSTRAINT fk_fe_player FOREIGN KEY (id) REFERENCES player(id) NOT VALID, ADD CONSTRAINT fk_fe_event FOREIGN KEY (event_id) REFERENCES "event"(event_id) NOT VALID;
ALTER TABLE connected_events ADD CONSTRAINT fk_ce_event1 FOREIGN KEY (event_id1) REFERENCES "event"(event_id) NOT VALID, ADD CONSTRAINT fk_ce_event2 FOREIGN KEY (event_id2) REFERENCES "event"(event_id) NOT VALID;
ALTER TABLE players_relationships ADD CONSTRAINT fk_pr_id1 FOREIGN KEY (id1) REFERENCES player(id) NOT VALID, ADD CONSTRAINT fk_pr_id2 FOREIGN KEY (id2) REFERENCES player(id) NOT VALID;
postgres/constraints.sql
Run it.
/postgres/bin/psql -U postgres -d graph_baseball -f postgres/constraints.sql
And then let's validate constraints in a second pass.
ALTER TABLE game VALIDATE CONSTRAINT fk_game_site;
ALTER TABLE "event" VALIDATE CONSTRAINT fk_event_gid;
ALTER TABLE players_teams VALIDATE CONSTRAINT fk_pt_player, VALIDATE CONSTRAINT fk_pt_team;
ALTER TABLE coaches_teams VALIDATE CONSTRAINT fk_ct_player, VALIDATE CONSTRAINT fk_ct_team;
ALTER TABLE players_relationships VALIDATE CONSTRAINT fk_pr_id1, VALIDATE CONSTRAINT fk_pr_id2;
ALTER TABLE teams_games VALIDATE CONSTRAINT fk_tg_game, VALIDATE CONSTRAINT fk_tg_team;
ALTER TABLE batting_games VALIDATE CONSTRAINT fk_bg_game, VALIDATE CONSTRAINT fk_bg_player;
ALTER TABLE fielding_games VALIDATE CONSTRAINT fk_fg_game, VALIDATE CONSTRAINT fk_fg_player;
ALTER TABLE pitching_games VALIDATE CONSTRAINT fk_pg_game, VALIDATE CONSTRAINT fk_pg_player;
ALTER TABLE umpires_games VALIDATE CONSTRAINT fk_ug_umpire, VALIDATE CONSTRAINT fk_ug_game;
ALTER TABLE events_games
VALIDATE CONSTRAINT fk_eg_event,
VALIDATE CONSTRAINT fk_eg_game;
ALTER TABLE pitching_events
VALIDATE CONSTRAINT fk_pe_player,
VALIDATE CONSTRAINT fk_pe_event;
ALTER TABLE batting_events
VALIDATE CONSTRAINT fk_be_player,
VALIDATE CONSTRAINT fk_be_event;
ALTER TABLE advancing_events
VALIDATE CONSTRAINT fk_ae_player,
VALIDATE CONSTRAINT fk_ae_event;
ALTER TABLE fielding_events
VALIDATE CONSTRAINT fk_fe_player,
VALIDATE CONSTRAINT fk_fe_event;
ALTER TABLE connected_events
VALIDATE CONSTRAINT fk_ce_event1,
VALIDATE CONSTRAINT fk_ce_event2;
postgres/validate.sql
Run it.
/postgres/bin/psql -U postgres -d graph_baseball -f postgres/validate.sql
Now we can create the graph.
Defining the property graph for Postgres#
Since we already had to define all primary key and foreign key relationships to create a Property Graph in Postgres, we can now use a shorter syntax; the relationships are inferred from the constraints.
If an edge table has more than one primary key and foreign key relationship defined with the same vertex table, both the source and destination keys must be defined to avoid ambiguity, as seen with connected_events and player_relationships. The games_ballparks edge is also explicitly defined as well, since the game table itself contains both the source (gid) AND the destination (site). This edge table is also aliased to not conflict with the vertex table.
CREATE PROPERTY GRAPH baseball VERTEX TABLES (
player,
umpire,
team,
ballpark,
game,
"event" LABEL "Event" -- Postgres is case-sensitive for quoted values
) EDGE TABLES (
players_teams SOURCE player DESTINATION team LABEL played_for_team,
coaches_teams SOURCE player DESTINATION team LABEL coached_for_team,
teams_games SOURCE team DESTINATION game LABEL played_in_game,
batting_games SOURCE player DESTINATION game LABEL batted_in_game,
fielding_games SOURCE player DESTINATION game LABEL fielded_in_game,
pitching_games SOURCE player DESTINATION game LABEL pitched_in_game,
umpires_games SOURCE umpire DESTINATION game LABEL umpired_in_game,
events_games SOURCE "event" DESTINATION game LABEL occurred_in_game,
pitching_events SOURCE player DESTINATION "event" LABEL pitched_in_event,
batting_events SOURCE player DESTINATION "event" LABEL batted_in_event,
advancing_events SOURCE player DESTINATION "event" LABEL advanced_in_event,
fielding_events SOURCE player DESTINATION "event" LABEL fielded_in_event,
game AS games_ballparks
SOURCE KEY (gid) REFERENCES game(gid)
DESTINATION KEY ("site") REFERENCES ballpark("site")
LABEL played_in_ballpark,
connected_events
SOURCE KEY (event_id1) REFERENCES "event"(event_id)
DESTINATION KEY (event_id2) REFERENCES "event"(event_id)
LABEL connected_to,
players_relationships
SOURCE KEY (id1) REFERENCES player(id)
DESTINATION KEY (id2) REFERENCES player(id)
LABEL related_to
);
postgres/graph.sql
And run it.
/postgres/bin/psql -U postgres -d graph_baseball -f postgres/graph.sql
And we're done with data loading! With the 'bases loaded, let’s take a strategic timeout to create a few helper functions; this keeps the upcoming queries more readable and saves you keystrokes (or tokens!).
Mathematical macros#
Sabermetricians work hard to create advanced formulas that reveal truer insights not captured by traditional statistics (which often contain biases like team context or luck). If you aren’t familiar with these statistics or their goals (the acronyms are quite cryptic), the MLB Glossary contains a comprehensive reference. With a basic understanding of baseball, their meanings should become intuitive enough to make sense of our findings and what they tell us about the game.
The Weighted On-Base Average (wOBA), On Base Plus Slugging Plus (OPS+), and Skill-interactive Earned Run Average (SIERA) functions depend upon a scale factor, league-average constants, or computed coefficients; the default values are the most current, but be sure to adjust them when working with different seasons or eras of data. Baseball Reference is a great resource for season statistics and FanGraphs Guts! publishes wOBA weights, park factors and more.
We will not be using all of these calculations in this article, but they will provide a great starting point for your own statistical explorations!
LadybugDB and DuckDB#
Both LadybugDB and DuckDB support macros; these are written as to be compatible in both Cypher and SQL
CREATE MACRO calculate_ba(h, d, t, hr, ab) AS
COALESCE(CAST(h AS DOUBLE) / NULLIF(ab, 0), 0.0);
CREATE MACRO calculate_slg(h, d, t, hr, ab) AS
COALESCE(CAST((h - d - t - hr) + (2 * d) + (3 * t) + (4 * hr) AS DOUBLE) / NULLIF(ab, 0), 0.0);
CREATE MACRO calculate_iso(d, t, hr, ab) AS
COALESCE(CAST(d + (2 * t) + (3 * hr) AS DOUBLE) / NULLIF(ab, 0), 0.0);
CREATE MACRO calculate_babip(h, hr, sf, k, ab) AS
COALESCE(CAST(h - hr AS DOUBLE) / NULLIF(ab - k - hr + sf, 0), 0.0);
CREATE MACRO calculate_obp(h, walk, hbp, ab, sf) AS
COALESCE(CAST(h + walk + hbp AS DOUBLE) / NULLIF(ab + walk + hbp + sf, 0), 0.0);
CREATE MACRO calculate_woba(h, d, t, hr, sf, w, iw, hbp, ab,
woba_scale:=1.00,
w_bb:=0.697,
w_hbp:=0.727,
w_1b:=0.855,
w_2b:=1.248,
w_3b:=1.575,
w_hr:=2.014
) AS
COALESCE(woba_scale * (w_bb * (w - iw) + w_hbp * hbp + w_1b * (h - d - t - hr) + w_2b * d + w_3b * t + w_hr * hr) / NULLIF(ab + w - iw + sf + hbp, 0), 0.0);
CREATE MACRO calculate_ops_plus(obp, slg, lg_obp:=0.315, lg_slg:=0.404) AS
COALESCE(100.0 * (obp / NULLIF(lg_obp, 0) + slg / NULLIF(lg_slg, 0) - 1.0), 0.0);
CREATE MACRO calculate_siera(
pa, so, bb, gb, fb, pu,
lg_c:=6.145,
k_coeff:=16.986,
bb_coeff:=11.434,
gb_fb_coeff:=1.858,
k_sq_coeff:=7.653,
gb_fb_sq_coeff:=6.664,
k_pa_gb_fb_coeff:=10.130,
bb_pa_gb_fb_coeff:=5.195
) AS
lg_c
- (k_coeff * (CAST(so AS DOUBLE) / NULLIF(pa, 0)))
+ (bb_coeff * (CAST(bb AS DOUBLE) / NULLIF(pa, 0)))
- (gb_fb_coeff * (CAST(gb - fb - pu AS DOUBLE) / NULLIF(pa, 0)))
+ (k_sq_coeff * pow((CAST(so AS DOUBLE) / NULLIF(pa, 0)), 2))
- (gb_fb_sq_coeff * sign(CAST(gb - fb - pu AS DOUBLE)) * pow((CAST(gb - fb - pu AS DOUBLE) / NULLIF(pa, 0)), 2))
+ (k_pa_gb_fb_coeff * (CAST(so AS DOUBLE) / NULLIF(pa, 0)) * (CAST(gb - fb - pu AS DOUBLE) / NULLIF(pa, 0)))
- (bb_pa_gb_fb_coeff * (CAST(bb AS DOUBLE) / NULLIF(pa, 0)) * (CAST(gb - fb - pu AS DOUBLE) / NULLIF(pa, 0)));
macros.txt
Load the macros into LadybugDB and DuckDB.
lbug graph_baseball.lbug < macros.txt
duckdb graph_baseball.duckdb < macros.txt
Postgres#
Postgres does not support macros, but does support functions. The math is identical, but we use Postgres’ NUMERIC type for flexibility and precision. The types of each function argument must also be specified.
CREATE OR REPLACE FUNCTION calculate_ba(h NUMERIC, d NUMERIC, t NUMERIC, hr NUMERIC, ab NUMERIC)
RETURNS NUMERIC
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT COALESCE(CAST(h AS NUMERIC) / NULLIF(ab, 0), 0)
$$;
CREATE OR REPLACE FUNCTION calculate_slg(h NUMERIC, d NUMERIC, t NUMERIC, hr NUMERIC, ab NUMERIC)
RETURNS NUMERIC
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT COALESCE(CAST((h - d - t - hr) + (2 * d) + (3 * t) + (4 * hr) AS NUMERIC) / NULLIF(ab, 0), 0)
$$;
CREATE OR REPLACE FUNCTION calculate_iso(d NUMERIC, t NUMERIC, hr NUMERIC, ab NUMERIC)
RETURNS NUMERIC
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT COALESCE(CAST(d + (2 * t) + (3 * hr) AS NUMERIC) / NULLIF(ab, 0), 0)
$$;
CREATE OR REPLACE FUNCTION calculate_babip(h NUMERIC, hr NUMERIC, sf NUMERIC, k NUMERIC, ab NUMERIC)
RETURNS NUMERIC
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT COALESCE(CAST(h - hr AS NUMERIC) / NULLIF(ab - k - hr + sf, 0), 0)
$$;
CREATE OR REPLACE FUNCTION calculate_obp(h NUMERIC, walk NUMERIC, hbp NUMERIC, sf NUMERIC, ab NUMERIC)
RETURNS NUMERIC
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT COALESCE(CAST(h + walk + hbp AS NUMERIC) / NULLIF(ab + walk + hbp + sf, 0), 0)
$$;
CREATE OR REPLACE FUNCTION calculate_woba(
h NUMERIC, d NUMERIC, t NUMERIC, hr NUMERIC, sf NUMERIC,
w NUMERIC, iw NUMERIC, hbp NUMERIC, ab NUMERIC,
woba_scale NUMERIC DEFAULT 1.00,
w_bb NUMERIC DEFAULT 0.697,
w_hbp NUMERIC DEFAULT 0.727,
w_1b NUMERIC DEFAULT 0.855,
w_2b NUMERIC DEFAULT 1.248,
w_3b NUMERIC DEFAULT 1.575,
w_hr NUMERIC DEFAULT 2.014
)
RETURNS NUMERIC
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT COALESCE(woba_scale * (w_bb * (w - iw) + w_hbp * hbp + w_1b * (h - d - t - hr) + w_2b * d + w_3b * t + w_hr * hr) / NULLIF(ab + w - iw + sf + hbp, 0), 0)
$$;
CREATE OR REPLACE FUNCTION calculate_ops_plus(obp NUMERIC, slg NUMERIC, lg_obp NUMERIC DEFAULT 0.315, lg_slg NUMERIC DEFAULT 0.404)
RETURNS NUMERIC
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT COALESCE(100.0 * (obp / NULLIF(lg_obp, 0) + slg / NULLIF(lg_slg, 0) - 1.0), 0)
$$;
CREATE OR REPLACE FUNCTION calculate_siera(
pa NUMERIC, so NUMERIC, bb NUMERIC, gb NUMERIC, fb NUMERIC, pu NUMERIC,
lg_c NUMERIC DEFAULT 6.145,
k_coeff NUMERIC DEFAULT 16.986,
bb_coeff NUMERIC DEFAULT 11.434,
gb_fb_coeff NUMERIC DEFAULT 1.858,
k_sq_coeff NUMERIC DEFAULT 7.653,
gb_fb_sq_coeff NUMERIC DEFAULT 6.664,
k_pa_gb_fb_coeff NUMERIC DEFAULT 10.130,
bb_pa_gb_fb_coeff NUMERIC DEFAULT 5.195
)
RETURNS NUMERIC
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT lg_c
- (k_coeff * (CAST(so AS NUMERIC) / NULLIF(pa, 0)))
+ (bb_coeff * (CAST(bb AS NUMERIC) / NULLIF(pa, 0)))
- (gb_fb_coeff * (CAST(gb - fb - pu AS NUMERIC) / NULLIF(pa, 0)))
+ (k_sq_coeff * pow((CAST(so AS NUMERIC) / NULLIF(pa, 0)), 2))
- (gb_fb_sq_coeff * sign(CAST(gb - fb - pu AS NUMERIC)) * pow((CAST(gb - fb - pu AS NUMERIC) / NULLIF(pa, 0)), 2))
+ (k_pa_gb_fb_coeff * (CAST(so AS NUMERIC) / NULLIF(pa, 0)) * (CAST(gb - fb - pu AS NUMERIC) / NULLIF(pa, 0)))
- (bb_pa_gb_fb_coeff * (CAST(bb AS NUMERIC) / NULLIF(pa, 0)) * (CAST(gb - fb - pu AS NUMERIC) / NULLIF(pa, 0)))
$$;
postgres/functions.sql
And run it.
/postgres/bin/psql -U postgres -d graph_baseball -f postgres/functions.sql
We are now ready for some graph-powered action! We will be writing a variety of queries to discover how game dynamics change statistics, how a graph structure forces specific query strategies, and the impact of recursive relationships in Cypher (along with a necessary "hack" for SQL/PGQ).
Left/right platoon splits#
A left/right platoon is when a right handed batter faces a left-handed pitcher (and vice versa). Batters tend to consistently perform better against opposite-handed pitchers, resulting in a positive platoon split (a platoon advantage). When the pitcher and batter have the same-handedness, it often works against the batter (a negative platoon split) which typically gives the pitcher the advantage.
We will write some queries that try to measure the impact of platoon splits using our Sabermetric statistics, first by looking at every single event across every game, and then taking a closer look at “high leverage” scenarios that occur in close-scoring, late-game situations.
Looking at all events#
Our first query will start off with a straightforward MATCH for every pair of BATTED_IN_EVENT and PITCHED_IN_EVENT edges. From these, we get the batter’s handedness, pitcher’s handedness and some additional information about the batting in the event. Pay attention to the distinction between Plate Appearances and At-bats, since certain statistical calculations (including our macros and functions) use one or the other.
LadybugDB#
We do not bind variables for either the Player or Event nodes because we will not be using them in our query. This could be made more compact by passing variables directly to the RETURN clause, but the way it is written keeps our query very similar to the SQL syntax: the RETURN statement is identical to the SELECT, the WITH almost identical to COLUMNS.
MATCH (:Player)-[be:BATTED_IN_EVENT]->(e:Event)<-[pe:PITCHED_IN_EVENT]-(:Player)
WITH
CASE WHEN be.bathand <> pe.pithand THEN 'Batter Advantage' ELSE 'Pitcher Advantage' END AS matchup_type,
// 'All' AS matchup_type, // To aggregate across all matchup types
SUM(CASE WHEN e.run_b is not null THEN 1 ELSE 0 END) AS batter,
SUM(CASE WHEN e.run1 is not null THEN 1 ELSE 0 END) AS first,
SUM(CASE WHEN e.run2 is not null THEN 1 ELSE 0 END) AS second,
SUM(CASE WHEN e.run3 is not null THEN 1 ELSE 0 END) AS third,
SUM(be.pa) AS pa,
SUM(be.ab) AS ab,
SUM(be.bip) AS bip,
SUM(be.`single`) AS `single`,
SUM(be.double) AS double,
SUM(be.triple) AS triple,
SUM(be.hr) AS hr,
SUM(be.ground) AS ground,
SUM(be.walk) AS walk,
SUM(be.hbp) AS hbp,
SUM(be.gdp) AS gdp
RETURN
matchup_type, pa,
(batter + first + second + third) AS runs,
ROUND(CAST(batter + first + second + third AS DECIMAL) / NULLIF(pa, 0), 4) AS runs_per_pa,
ROUND(calculate_slg(`single` + double + triple + hr, double, triple, hr, ab), 3) AS SLG,
ROUND(calculate_iso(double, triple, hr, ab), 3) AS ISO,
ROUND(CAST(gdp AS DECIMAL) / NULLIF(ground, 0), 3) AS gdp_per_ground,
ROUND(CAST(double + triple AS DECIMAL) / NULLIF(bip, 0), 3) AS xbh_per_bip,
ROUND(CAST(hbp + walk AS DECIMAL) / NULLIF(pa, 0), 3) AS hbp_bb_rate
ORDER BY matchup_type ASC;
ladybugdb/platoon_advantage.cypher
And run it.
$ lbug graph_baseball.lbug < ladybugdb/platoon_advantage.cypher
┌───────────────────┬─────────┬────────┬─────────────┬──────────┬──────────┬────────────────┬─────────────┬─────────────┐
│ matchup_type │ pa │ runs │ runs_per_pa │ SLG │ ISO │ gdp_per_ground │ xbh_per_bip │ hbp_bb_rate │
│ STRING │ INT128 │ INT128 │ DOUBLE │ DOUBLE │ DOUBLE │ DOUBLE │ DOUBLE │ DOUBLE │
├───────────────────┼─────────┼────────┼─────────────┼──────────┼──────────┼────────────────┼─────────────┼─────────────┤
│ Batter Advantage │ 1545949 │ 178233 │ 0.115300 │ 0.419000 │ 0.164000 │ 0.060000 │ 0.077000 │ 0.099000 │
│ Pitcher Advantage │ 1330487 │ 156649 │ 0.117700 │ 0.393000 │ 0.148000 │ 0.068000 │ 0.069000 │ 0.084000 │
└───────────────────┴─────────┴────────┴─────────────┴──────────┴──────────┴────────────────┴─────────────┴─────────────┘
Now on to DuckDB.
DuckDB#
Even though this query is SQL/PGQ, it should look very similar: it is essentially an inverted version of the Cypher query above!
Both DuckDB and Postgres implement SQL:2023 SQL/PGQ for label/type constraints, but use different syntax. DuckDB follows Cypher convention and uses a colon (:), while PostgreSQL prioritizes SQL convention and uses the IS operator. The underlying logic is identical; only the syntactic representation differs.
Unlike LadybugDB and Postgres, it is important to note that every node and edge in the MATCH statement must be bound to a variable, or DuckDB will raise a Constraint Error.
LOAD duckpgq;
SELECT
CASE WHEN BatHand <> PitHand THEN 'Batter Advantage' ELSE 'Pitcher Advantage' END AS matchup_type,
SUM(pa) AS PA,
SUM(CASE WHEN run_b IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run1 IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run2 IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run3 IS NOT NULL THEN 1 ELSE 0 END) AS runs,
ROUND(CAST(SUM(CASE WHEN run_b IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run1 IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run2 IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run3 IS NOT NULL THEN 1 ELSE 0 END) AS DECIMAL)
/ NULLIF(SUM(pa), 0), 4) AS runs_per_pa,
ROUND(calculate_slg(SUM(single + double + triple + hr), SUM(double), SUM(triple), SUM(hr), SUM(ab)), 3) AS SLG,
ROUND(calculate_iso(SUM(double), SUM(triple), SUM(hr), SUM(ab)), 3) AS ISO,
ROUND(CAST(SUM(gdp) AS NUMERIC) / NULLIF(SUM(ground), 0), 3) AS gdp_per_ground,
ROUND(CAST(SUM(double + triple) AS NUMERIC) / NULLIF(SUM(bip), 0), 3) AS xbh_per_bip,
ROUND(CAST(SUM(hbp + walk) AS NUMERIC) / NULLIF(SUM(pa), 0), 3) AS hbp_bb_rate
FROM GRAPH_TABLE (baseball
MATCH (batter:Player)-[be:BATTED_IN_EVENT]->(e:"Event")<-[pe:PITCHED_IN_EVENT]-(pitcher:Player)
COLUMNS (
e.run_b,
e.run1,
e.run2,
e.run3,
be.bathand AS bathand,
pe.pithand AS pithand,
be.pa AS pa,
be.ab AS ab,
be.bip AS bip,
be.single AS single,
be.double AS double,
be.triple AS triple,
be.hr AS hr,
be.ground AS ground,
be.walk AS walk,
be.hbp AS hbp,
be.gdp AS gdp
)
)
GROUP BY matchup_type ORDER BY matchup_type ASC;
duckdb/platoon_advantage.sql
Give it a run.
$ duckdb graph_baseball.duckdb < duckdb/platoon_advantage.sql
┌───────────────────┬─────────┬────────┬─────────────┬────────┬────────┬────────────────┬─────────────┬─────────────┐
│ matchup_type │ PA │ runs │ runs_per_pa │ SLG │ ISO │ gdp_per_ground │ xbh_per_bip │ hbp_bb_rate │
│ varchar │ int128 │ int128 │ double │ double │ double │ double │ double │ double │
├───────────────────┼─────────┼────────┼─────────────┼────────┼────────┼────────────────┼─────────────┼─────────────┤
│ Batter Advantage │ 1545949 │ 178233 │ 0.1153 │ 0.419 │ 0.164 │ 0.06 │ 0.077 │ 0.099 │
│ Pitcher Advantage │ 1330487 │ 156649 │ 0.1177 │ 0.393 │ 0.148 │ 0.068 │ 0.069 │ 0.084 │
└───────────────────┴─────────┴────────┴─────────────┴────────┴────────┴────────────────┴─────────────┴─────────────┘
Now on to Postgres.
Postgres#
The SQL/PGQ query for Postgres is practically identical to DuckDB, with the two key distinctions mentioned in the previous section:
- Postgres uses the
ISoperator in the MATCH statement - Postgres can work with unbound labels. Since we do not use any properties for
Playernodes, we can omit their binding entirely.
SELECT
CASE WHEN BatHand <> PitHand THEN 'Batter Advantage' ELSE 'Pitcher Advantage' END AS matchup_type,
SUM(pa) AS PA,
SUM(CASE WHEN run_b IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run1 IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run2 IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run3 IS NOT NULL THEN 1 ELSE 0 END) AS runs,
ROUND(CAST(SUM(CASE WHEN run_b IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run1 IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run2 IS NOT NULL THEN 1 ELSE 0 END)
+ SUM(CASE WHEN run3 IS NOT NULL THEN 1 ELSE 0 END) AS DECIMAL)
/ NULLIF(SUM(pa), 0), 4) AS runs_per_pa,
ROUND(calculate_slg(SUM(single + double + triple + hr), SUM(double), SUM(triple), SUM(hr), SUM(ab)), 3) AS SLG,
ROUND(calculate_iso(SUM(double), SUM(triple), SUM(hr), SUM(ab)), 3) AS ISO,
ROUND(CAST(SUM(gdp) AS NUMERIC) / NULLIF(SUM(ground), 0), 3) AS gdp_per_ground,
ROUND(CAST(SUM(double + triple) AS NUMERIC) / NULLIF(SUM(bip), 0), 3) AS xbh_per_bip,
ROUND(CAST(SUM(hbp + walk) AS NUMERIC) / NULLIF(SUM(pa), 0), 3) AS hbp_bb_rate
FROM GRAPH_TABLE (baseball
MATCH ()-[be IS BATTED_IN_EVENT]->(e IS "Event")<-[pe IS PITCHED_IN_EVENT]-()
COLUMNS (
e.run_b,
e.run1,
e.run2,
e.run3,
be.bathand AS bathand,
pe.pithand AS pithand,
be.pa AS pa,
be.ab AS ab,
be.bip AS bip,
be.single AS single,
be.double AS double,
be.triple AS triple,
be.hr AS hr,
be.ground AS ground,
be.walk AS walk,
be.hbp AS hbp,
be.gdp AS gdp
)
)
GROUP BY matchup_type ORDER BY matchup_type ASC;
postgres/platoon_advantage.sql
And give it a run.
$ /postgres/bin/psql -U postgres -d graph_baseball -f postgres/platoon_advantage.sql
matchup_type | pa | runs | runs_per_pa | slg | iso | gdp_per_ground | xbh_per_bip | hbp_bb_rate
-------------------+---------+--------+-------------+-------+-------+----------------+-------------+-------------
Batter Advantage | 1545949 | 178233 | 0.1153 | 0.419 | 0.164 | 0.060 | 0.077 | 0.099
Pitcher Advantage | 1330487 | 156649 | 0.1177 | 0.393 | 0.148 | 0.068 | 0.069 | 0.084
Query results#
The results from each database are consistent, and we can see there is a ~0.53 to ~0.47 split for all plate appearances. I will not dive too deeply into the statistical analysis, but logically the numbers follow some expected trends.
Slugging and Isolated Power Increase (SLG and ISO)
Unsurprisingly, when batters have an advantage, they hit for more power and for more bases (SLG and ISO), a ~6% and ~11% increase over the disadvantage. This is also reinforced by our calculated extra-bases per ball-in-play rate (xbh_per_bip).
Runs per Plate Appearance Remain Nearly Constant (runs_per_pa)
Fascinatingly, the runs per plate appearance are identical to the thousandths place! This tells us that in pitcher-advantaged situations, there must be other methods used to score runs than just big bats, perhaps tactical hitting to move baserunners like bunts or sacrificial flyballs.
Advantaged Batters Get More First Base Passes (hbp_bb_rate)
Batters with a handedness advantage see an ~18% increase in walks and hit-by-pitches (taking first base) over disadvantaged batters, telling us that they can afford to have more discipline at the plate and be more selective with their swings. Breaking balls, like sliders or curveballs, also tend to move in towards the batter’s body in these matchups, which makes them easier to see (or get hit by).
Advantaged Pitchers Create More Double-Plays (gdp_per_ground)
Conversely to the above, batters at the same side as the pitcher have balls break away from their body, which are harder to see and hit. This tends to result in weaker contact and more ground balls. Ground balls with runners on base are particularly dangerous for batters because fielders can tag out two (or more) runners at once and turn a double (or triple) play. Indeed, there is around a 12% increase in double-play rate for all ground balls with a pitcher advantaged matchup.
In comparison#
Let's see how the databases compare.
$ hyperfine --warmup 3 \
'lbug graph_baseball.lbug < ladybugdb/platoon_advantage.cypher' \
'/root/.duckdb/cli/1.4.4/duckdb graph_baseball.duckdb < duckdb/platoon_advantage.sql' \
'/postgres/bin/psql -U postgres -d graph_baseball -f postgres/platoon_advantage.sql'
Benchmark 1: lbug graph_baseball.lbug < ladybugdb/platoon_advantage.cypher
Time (mean ± σ): 3.806 s ± 0.137 s [User: 6.865 s, System: 0.248 s]
Range (min … max): 3.636 s … 4.053 s 10 runs
Benchmark 2: /root/.duckdb/cli/1.4.4/duckdb graph_baseball.duckdb < duckdb/platoon_advantage.sql
Time (mean ± σ): 2.949 s ± 0.130 s [User: 4.858 s, System: 0.776 s]
Range (min … max): 2.725 s … 3.157 s 10 runs
Benchmark 3: /postgres/bin/psql -U postgres -d graph_baseball -f postgres/platoon_advantage.sql
Time (mean ± σ): 37.380 s ± 6.623 s [User: 0.002 s, System: 0.002 s]
Range (min … max): 30.500 s … 49.237 s 10 runs
Summary
/root/.duckdb/cli/1.4.4/duckdb graph_baseball.duckdb < duckdb/platoon_advantage.sql ran
1.29 ± 0.07 times faster than lbug graph_baseball.lbug < ladybugdb/platoon_advantage.cypher
12.68 ± 2.31 times faster than /postgres/bin/psql -U postgres -d graph_baseball -f postgres/platoon_advantage.sql
This query touches every single event, and the columnar vectorized query execution engine of DuckDB and LadybugDB really shines; it only scans the columns required by the query and processes batches of 2048 tuples. Postgres is slower since it scans entire rows and processes them one-by-one.
Tough luck shutouts and stranded runners#
I asked one of my baseball-stat loving friends for the most specific question he could think of, and he proposed “What is the highest OPS (On-base Plus Slugging) a team had in a game where they scored no runs and allowed at least 5 earned runs”. Basically, games where the losing team was productive with base-hits but failed to score any runs, and (to add insult to injury) allowed the other team to score 5 or more runs without the benefit of defensive mishaps. I took this idea one step further to also look at the batters who stranded a significant percentage of the total runners left on base.
LadybugDB#
This query gives us a look at the “composability” of queries in Cypher, using WITH to carry bound graph objects into subsequent queries. This allows for us to first filter and sort high OPS shutout games, then match batters and baserunners in the third-out events that occurred only in those specific games. This powerful Cypher syntax makes it easy to build complex queries piece-by-piece and chain them together.
MATCH
(t1:Team)-[r1:PLAYED_IN_GAME {b_r: 0}]->(g:Game)<-[r2:PLAYED_IN_GAME]-(t2:Team)
WHERE
t1.team <> t2.team
AND r1.b_ab > 0 and r2.b_ab > 0
AND r1.p_er >= 5
WITH
r1, g,
t1.team AS Team,
t2.team AS Opp,
r1.lob AS LOB,
ROUND(
calculate_obp(r1.b_h, r1.b_w, r1.b_hbp, r1.b_ab, r1.b_sf)
+ calculate_slg(r1.b_h, r1.b_d, r1.b_t, r1.b_hr, r1.b_ab),
3) AS OPS
ORDER BY OPS DESC
LIMIT 11
MATCH
(e:Event {outs_post: 3})-[:OCCURRED_IN_GAME]->(g),
(batter:Player)-[:BATTED_IN_EVENT]->(e)<-[adv:ADVANCED_IN_EVENT]-(runner:Player)
WHERE
batter <> runner
AND (
(e.top_bot = 0 AND r1.vishome = 'v')
OR (e.top_bot = 1 AND r1.vishome = 'h')
)
WITH
Team, Opp, OPS, LOB,
g.date AS GameDate,
r1.p_er AS OppEarned,
batter.fullname AS batter_name,
count(*) AS StrandedCount
RETURN
GameDate,
Team,
OPS,
Opp,
OppEarned,
LOB,
COLLECT(
DISTINCT CASE WHEN StrandedCount >= (0.33 * LOB) THEN batter_name END
) AS StrandingBatters
ORDER BY OPS DESC
ladybugdb/stranded_runners.cypher
Give it a run.
$ lbug graph_baseball.lbug < ladybugdb/stranded_runners.cypher
┌────────────┬────────┬──────────┬────────┬───────────┬───────┬─────────────────────────────────────────┐
│ GameDate │ Team │ OPS │ Opp │ OppEarned │ LOB │ StrandingBatters │
│ DATE │ STRING │ DOUBLE │ STRING │ INT64 │ INT64 │ STRING[] │
├────────────┼────────┼──────────┼────────┼───────────┼───────┼─────────────────────────────────────────┤
│ 2015-05-31 │ HOU │ 0.833000 │ CHA │ 5 │ 6 │ [Luis Adan Valbuena,Jose Carlos Altuve] │
│ 2018-06-01 │ KCA │ 0.669000 │ OAK │ 16 │ 7 │ [Michael Christopher Moustakas] │
│ 2010-04-29 │ MIL │ 0.668000 │ SDN │ 8 │ 8 │ [Carlos Argelis (Pena) Gomez] │
│ 2011-07-05 │ LAN │ 0.668000 │ NYN │ 6 │ 13 │ │
│ 2018-08-19 │ BAL │ 0.667000 │ CLE │ 8 │ 7 │ │
│ 2018-05-20 │ BAL │ 0.659000 │ BOS │ 5 │ 14 │ [Joseph Anthony Mancini] │
│ 2017-09-14 │ MIA │ 0.658000 │ PHI │ 9 │ 10 │ │
│ 2021-05-25 │ ARI │ 0.651000 │ SFN │ 8 │ 11 │ │
│ 2013-09-01 │ ATL │ 0.650000 │ MIA │ 7 │ 10 │ │
│ 2017-05-05 │ ATL │ 0.649000 │ SLN │ 8 │ 10 │ │
│ 2018-05-13 │ PIT │ 0.649000 │ SFN │ 5 │ 11 │ [Joshua Evan Bell] │
└────────────┴────────┴──────────┴────────┴───────────┴───────┴─────────────────────────────────────────┘
DuckDB#
SQL/PGQ does not have a similar mechanism to Cypher’s WITH, so the query must be structured differently. Since DuckDB supports multiple patterns in a MATCH statement, our query can be written within a single GRAPH_TABLE even though we have two separate subgraphs that join at the Game node g (forming a double-star graph/caterpillar tree).
LOAD duckpgq;
SELECT
GameDate,
Team,
OPS,
Opp,
OppEarned,
LOB,
LIST(DISTINCT Batter)
FILTER (WHERE StrandedCount >= 0.33 * LOB) AS StrandingBatters
FROM (
SELECT
GameDate,
Team,
Opp,
OppEarned,
LOB,
Batter,
OPS,
COUNT(runner_id) OVER (
PARTITION BY game_id, Batter
) AS StrandedCount
FROM GRAPH_TABLE (baseball
MATCH
(t1:Team)-[r1:PLAYED_IN_GAME WHERE r1.b_r = 0]->(g:Game)<-[r2:PLAYED_IN_GAME]-(t2:Team),
(batter:Player)-[be:BATTED_IN_EVENT]->(e:Event WHERE e.outs_post = 3)<-[adv:ADVANCED_IN_EVENT]-(runner:Player),
(e:Event)-[o:OCCURRED_IN_GAME]->(g:Game)
WHERE t1.team <> t2.team
AND r1.b_ab > 0
AND r2.b_ab > 0
AND r1.p_er >= 5
AND batter.id <> runner.id
AND (
(e.top_bot = 0 AND r1.vishome = 'v')
OR (e.top_bot = 1 AND r1.vishome = 'h')
)
COLUMNS (
g.gid AS game_id,
g.date AS GameDate,
t1.team AS Team,
t2.team AS Opp,
r1.p_er AS OppEarned,
r1.lob AS LOB,
ROUND(
calculate_obp(r1.b_h, r1.b_w, r1.b_hbp, r1.b_ab, r1.b_sf)
+ calculate_slg(r1.b_h, r1.b_d, r1.b_t, r1.b_hr, r1.b_ab),
3) AS OPS,
batter.fullname AS Batter,
runner.id AS runner_id
)
)
) sub
GROUP BY GameDate, Team, OPS, Opp, OppEarned, LOB
ORDER BY OPS DESC
LIMIT 11;
duckdb/stranded_runners.sql
And give it a run.
$ duckdb graph_baseball.duckdb < duckdb/stranded_runners.sql
┌────────────┬─────────┬────────┬─────────┬───────────┬───────┬──────────────────────────────────────────┐
│ GameDate │ Team │ OPS │ Opp │ OppEarned │ LOB │ StrandingBatters │
│ date │ varchar │ double │ varchar │ int64 │ int64 │ varchar[] │
├────────────┼─────────┼────────┼─────────┼───────────┼───────┼──────────────────────────────────────────┤
│ 2015-05-31 │ HOU │ 0.833 │ CHA │ 5 │ 6 │ [Jose Carlos Altuve, Luis Adan Valbuena] │
│ 2018-06-01 │ KCA │ 0.669 │ OAK │ 16 │ 7 │ [Michael Christopher Moustakas] │
│ 2010-04-29 │ MIL │ 0.668 │ SDN │ 8 │ 8 │ ['Carlos Argelis (Pena) Gomez'] │
│ 2011-07-05 │ LAN │ 0.668 │ NYN │ 6 │ 13 │ NULL │
│ 2018-08-19 │ BAL │ 0.667 │ CLE │ 8 │ 7 │ NULL │
│ 2018-05-20 │ BAL │ 0.659 │ BOS │ 5 │ 14 │ [Joseph Anthony Mancini] │
│ 2017-09-14 │ MIA │ 0.658 │ PHI │ 9 │ 10 │ NULL │
│ 2021-05-25 │ ARI │ 0.651 │ SFN │ 8 │ 11 │ NULL │
│ 2013-09-01 │ ATL │ 0.65 │ MIA │ 7 │ 10 │ NULL │
│ 2017-05-05 │ ATL │ 0.649 │ SLN │ 8 │ 10 │ NULL │
│ 2018-05-13 │ PIT │ 0.649 │ SFN │ 5 │ 11 │ [Joshua Evan Bell] │
├────────────┴─────────┴────────┴─────────┴───────────┴───────┴──────────────────────────────────────────┤
│ 11 rows 7 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Postgres#
As mentioned earlier, Postgres does not support multi-pattern path matching3, and there is no way to structure our MATCH statement in a single path. We must write two separate queries, one for each subgraph, which are later combined with a JOIN on the common Game gid.
Subqueries within a GRAPH_TABLE reference are also not supported, so we cannot include only the high-OPS games from our first subquery to filter only the events in those games within the second subquery. This requires Postgres to scan every event (~3 million), which takes a much longer time than DuckDB.
SELECT
game_date AS GameDate,
team AS Team,
ops AS OPS,
opp AS Opp,
p_er AS OppEarned,
lob AS LOB,
NULLIF(
ARRAY_REMOVE(
ARRAY_AGG(DISTINCT CASE WHEN stranded_count >= (0.33 * lob) THEN batter_name END),
NULL
),
'{}'
) AS StrandingBatters
FROM (
SELECT
top_games.game_date,
top_games.team,
top_games.ops,
top_games.opp,
top_games.p_er,
top_games.lob,
third_outs.batter_name,
COUNT(*) OVER (
PARTITION BY top_games.gid, third_outs.batter_name
) AS stranded_count
FROM (
SELECT
gid,
"date" AS game_date,
team,
opp,
p_er,
lob,
vishome,
ROUND(
calculate_obp(b_h, b_w, b_hbp, b_ab, b_sf)
+ calculate_slg(b_h, b_d, b_t, b_hr, b_ab),
3) AS ops
FROM GRAPH_TABLE (baseball
MATCH
(t1 IS Team)-[r1 IS PLAYED_IN_GAME]->(g IS Game)<-[r2 IS PLAYED_IN_GAME]-(t2 IS Team)
WHERE
t1.team <> t2.team
AND r1.b_r = 0
AND r1.b_ab > 0 AND r2.b_ab > 0
AND r1.p_er >= 5
COLUMNS (
g.gid,
g.date,
t1.team,
t2.team AS opp,
r1.p_er, r1.lob, r1.vishome,
r1.b_h, r1.b_w, r1.b_hbp,
r1.b_ab, r1.b_sf,
r1.b_d, r1.b_t, r1.b_hr
)
)
ORDER BY ops DESC
LIMIT 11
) AS top_games
JOIN (
SELECT
gid,
batter_name,
top_bot
FROM GRAPH_TABLE (baseball
MATCH
(batter IS Player)-[ IS BATTED_IN_EVENT]->(e IS "Event")<-[adv IS ADVANCED_IN_EVENT]-(runner IS Player)
-[ IS ADVANCED_IN_EVENT]->(e)-[ IS OCCURRED_IN_GAME]->(g IS Game)
WHERE
e.outs_post = 3
AND batter.id <> runner.id
COLUMNS (
g.gid,
batter.fullname AS batter_name,
e.top_bot
)
)
) AS third_outs
ON top_games.gid = third_outs.gid
AND (
(third_outs.top_bot = 0 AND top_games.vishome = 'v')
OR (third_outs.top_bot = 1 AND top_games.vishome = 'h')
)
) AS combined
GROUP BY game_date, team, ops, opp, p_er, lob
ORDER BY ops DESC;
postgres/stranded_runners.sql
And run it.
$ /postgres/bin/psql -U postgres -d graph_baseball -f postgres/stranded_runners.sql
gamedate | team | ops | opp | oppearned | lob | strandingbatters
------------+------+-------+-----+-----------+-----+---------------------------------------------
2015-05-31 | HOU | 0.833 | CHA | 5 | 6 | {"Jose Carlos Altuve","Luis Adan Valbuena"}
2018-06-01 | KCA | 0.669 | OAK | 16 | 7 | {"Michael Christopher Moustakas"}
2010-04-29 | MIL | 0.668 | SDN | 8 | 8 | {"Carlos Argelis (Pena) Gomez"}
2011-07-05 | LAN | 0.668 | NYN | 6 | 13 |
2018-08-19 | BAL | 0.667 | CLE | 8 | 7 |
2018-05-20 | BAL | 0.659 | BOS | 5 | 14 | {"Joseph Anthony Mancini"}
2017-09-14 | MIA | 0.658 | PHI | 9 | 10 |
2021-05-25 | ARI | 0.651 | SFN | 8 | 11 |
2013-09-01 | ATL | 0.650 | MIA | 7 | 10 |
2017-05-05 | ATL | 0.649 | SLN | 8 | 10 |
2018-05-13 | PIT | 0.649 | SFN | 5 | 11 | {"Joshua Evan Bell"}
Query results#
Even though each query differed in structure, the results are consistent. As expected, the high OPS corresponds with a high number of runners left on base. Play around with the stranded ratio (0.33) to filter batters.
$ hyperfine --warmup 3 \
'lbug graph_baseball.lbug < ladybugdb/stranded_runners.cypher' \
'/root/.duckdb/cli/1.4.4/duckdb graph_baseball.duckdb < duckdb/stranded_runners.sql' \
'/postgres/bin/psql -U postgres -d graph_baseball -f postgres/stranded_runners.sql'
Benchmark 1: lbug graph_baseball.lbug < ladybugdb/stranded_runners.cypher
Time (mean ± σ): 2.318 s ± 0.243 s [User: 2.222 s, System: 0.510 s]
Range (min … max): 2.064 s … 2.637 s 10 runs
Benchmark 2: /root/.duckdb/cli/1.4.4/duckdb graph_baseball.duckdb < duckdb/stranded_runners.sql
Time (mean ± σ): 864.6 ms ± 66.2 ms [User: 1279.4 ms, System: 232.4 ms]
Range (min … max): 802.8 ms … 1017.5 ms 10 runs
Benchmark 3: /postgres/bin/psql -U postgres -d graph_baseball -f postgres/stranded_runners.sql
Time (mean ± σ): 498.067 s ± 55.300 s [User: 0.001 s, System: 0.003 s]
Range (min … max): 409.754 s … 611.634 s 10 runs
Summary
/root/.duckdb/cli/1.4.4/duckdb graph_baseball.duckdb < duckdb/stranded_runners.sql ran
2.68 ± 0.35 times faster than lbug graph_baseball.lbug < ladybugdb/stranded_runners.cypher
576.04 ± 77.68 times faster than /postgres/bin/psql -U postgres -d graph_baseball -f postgres/stranded_runners.sql
This was DuckDB's fastest query and Postgres' slowest, while LadybugDB remained fairly consistent with its previous queries. Remember that Postgres required two independent queries to be joined (since it does not support multi-pattern match statements), resulting in a significant speed disadvantage which may have affected benchmark stability. Once multi-pattern matching is implemented, the DuckDB version of the query can be modified for Postgres and it should run faster.
Run expectancy matrices (RE24)#
Run expectancy based on 24 base-out states (RE24) empirically calculates the number of runs expected to score based on the current number of runners on base and current number of outs. With 8 possible baserunner combinations and 3 outs, we get a 24x24 state matrix. The work of Tom Tango, an analyst and co-author of The Book: Playing the Percentages in Baseball, helped integrate this foundational statistic into mainstream baseball analysis.
LadybugDB#
Generating the matrices is fundamentally an ordered accumulation operation over a flat sequence. However, there is no ordered stateful row operation in Cypher, so we use the Kleene star operator syntax to match variable-length relationships using an unbounded upper limit and sum runs from the current event to the end of the half-inning. The recursive nature of this query takes quadratic time O(k²) for each half-inning (recursing ~4 levels for each). This is not a catastrophic recursion (it would be if EVERY event was connected), but it still has to resolve ~6.6M adjacency steps on a single thread.
MATCH (e:Event)-[:OCCURRED_IN_GAME]->(g:Game {gametype: 'regular'})
WHERE e.outs_pre < 3 AND e.inning < 9
OPTIONAL MATCH (e)-[:CONNECTED_TO* 0..]->(next:Event)
WITH e, SUM(next.runs) AS runs_roi,
e.outs_pre AS outs,
CASE
WHEN e.br1_pre IS NULL AND e.br2_pre IS NULL AND e.br3_pre IS NULL THEN 0
WHEN e.br1_pre IS NOT NULL AND e.br2_pre IS NULL AND e.br3_pre IS NULL THEN 1
WHEN e.br1_pre IS NULL AND e.br2_pre IS NOT NULL AND e.br3_pre IS NULL THEN 2
WHEN e.br1_pre IS NULL AND e.br2_pre IS NULL AND e.br3_pre IS NOT NULL THEN 3
WHEN e.br1_pre IS NOT NULL AND e.br2_pre IS NOT NULL AND e.br3_pre IS NULL THEN 4
WHEN e.br1_pre IS NOT NULL AND e.br2_pre IS NULL AND e.br3_pre IS NOT NULL THEN 5
WHEN e.br1_pre IS NULL AND e.br2_pre IS NOT NULL AND e.br3_pre IS NOT NULL THEN 6
WHEN e.br1_pre IS NOT NULL AND e.br2_pre IS NOT NULL AND e.br3_pre IS NOT NULL THEN 7
END AS state_idx
WITH state_idx,
CASE
WHEN state_idx = 0 THEN "_ _ _"
WHEN state_idx = 1 THEN "1 _ _"
WHEN state_idx = 2 THEN "_ 2 _"
WHEN state_idx = 3 THEN "_ _ 3"
WHEN state_idx = 4 THEN "1 2 _"
WHEN state_idx = 5 THEN "1 _ 3"
WHEN state_idx = 6 THEN "_ 2 3"
WHEN state_idx = 7 THEN "1 2 3"
END AS `Base State`,
ROUND(AVG(CASE WHEN outs = 0 THEN runs_roi ELSE NULL END), 3) AS `0 Outs`,
ROUND(AVG(CASE WHEN outs = 1 THEN runs_roi ELSE NULL END), 3) AS `1 Out`,
ROUND(AVG(CASE WHEN outs = 2 THEN runs_roi ELSE NULL END), 3) AS `2 Outs`
RETURN `Base State`, `0 Outs`, `1 Out`, `2 Outs`
ORDER BY state_idx ASC LIMIT 8
ladybugdb/re24.cypher
Give it a run.
$ lbug graph_baseball.lbug < ladybugdb/re24.cypher
┌────────────┬──────────┬──────────┬──────────┐
│ Base State │ 0 Outs │ 1 Out │ 2 Outs │
│ STRING │ DOUBLE │ DOUBLE │ DOUBLE │
├────────────┼──────────┼──────────┼──────────┤
│ _ _ _ │ 0.500000 │ 0.266000 │ 0.102000 │
│ 1 _ _ │ 0.889000 │ 0.527000 │ 0.228000 │
│ _ 2 _ │ 1.136000 │ 0.687000 │ 0.324000 │
│ _ _ 3 │ 1.387000 │ 0.963000 │ 0.364000 │
│ 1 2 _ │ 1.490000 │ 0.927000 │ 0.445000 │
│ 1 _ 3 │ 1.795000 │ 1.179000 │ 0.497000 │
│ _ 2 3 │ 2.006000 │ 1.402000 │ 0.575000 │
│ 1 2 3 │ 2.340000 │ 1.592000 │ 0.779000 │
└────────────┴──────────┴──────────┴──────────┘
DuckDB#
While duckpgq does support path-finding syntax in the MATCH statement, in practice it is too slow for querying every event. As a smaller concrete example, the players_relationships table contains 1,320 rows, and a query that matches all of them using
MATCH p = ANY SHORTEST (a:Player)-[r:RELATED_TO]->+(b:Player)
takes ~10 minutes to run on an M4 MacBook Air with 16GB of memory.
However, the strictly linear relationship between half-inning events allows us to "cheat" and use a pure SQL feature: window functions. This makes a single sorted pass over flat columnar data, never touching the graph structure at all. Each input row gets its own calculation result based on an unbounded "window" of related rows. The graph topology of CONNECTED_TO edges is completely unused!
LOAD duckpgq;
SELECT
CASE state_idx
WHEN 0 THEN '_ _ _'
WHEN 1 THEN '1 _ _'
WHEN 2 THEN '_ 2 _'
WHEN 3 THEN '_ _ 3'
WHEN 4 THEN '1 2 _'
WHEN 5 THEN '1 _ 3'
WHEN 6 THEN '_ 2 3'
WHEN 7 THEN '1 2 3'
END AS "Base State",
ROUND(AVG(CASE WHEN outs = 0 THEN runs_roi END), 3) AS "0 Outs",
ROUND(AVG(CASE WHEN outs = 1 THEN runs_roi END), 3) AS "1 Out",
ROUND(AVG(CASE WHEN outs = 2 THEN runs_roi END), 3) AS "2 Outs"
FROM (
SELECT
outs_pre AS outs,
CASE
WHEN br1_pre IS NULL AND br2_pre IS NULL AND br3_pre IS NULL THEN 0
WHEN br1_pre IS NOT NULL AND br2_pre IS NULL AND br3_pre IS NULL THEN 1
WHEN br1_pre IS NULL AND br2_pre IS NOT NULL AND br3_pre IS NULL THEN 2
WHEN br1_pre IS NULL AND br2_pre IS NULL AND br3_pre IS NOT NULL THEN 3
WHEN br1_pre IS NOT NULL AND br2_pre IS NOT NULL AND br3_pre IS NULL THEN 4
WHEN br1_pre IS NOT NULL AND br2_pre IS NULL AND br3_pre IS NOT NULL THEN 5
WHEN br1_pre IS NULL AND br2_pre IS NOT NULL AND br3_pre IS NOT NULL THEN 6
ELSE 7
END AS state_idx,
SUM(runs) OVER (
PARTITION BY gid, inning, top_bot
ORDER BY pn
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) AS runs_roi
FROM GRAPH_TABLE (baseball
MATCH (e:"Event")-[o:OCCURRED_IN_GAME]->(g:Game)
WHERE g.gametype = 'regular'
AND e.outs_pre < 3 AND e.inning < 9
COLUMNS (
g.gid,
e.pn,
e.inning,
e.top_bot,
e.runs,
e.outs_pre,
e.br1_pre,
e.br2_pre,
e.br3_pre
)
)
)
GROUP BY state_idx
ORDER BY state_idx;
duckdb/re24.sql
Give it a run.
$ duckdb graph_baseball.duckdb < duckdb/re24.sql
┌────────────┬────────┬────────┬────────┐
│ Base State │ 0 Outs │ 1 Out │ 2 Outs │
│ varchar │ double │ double │ double │
├────────────┼────────┼────────┼────────┤
│ _ _ _ │ 0.5 │ 0.266 │ 0.102 │
│ 1 _ _ │ 0.889 │ 0.527 │ 0.228 │
│ _ 2 _ │ 1.136 │ 0.687 │ 0.324 │
│ _ _ 3 │ 1.387 │ 0.963 │ 0.364 │
│ 1 2 _ │ 1.49 │ 0.927 │ 0.445 │
│ 1 _ 3 │ 1.795 │ 1.179 │ 0.497 │
│ _ 2 3 │ 2.006 │ 1.402 │ 0.575 │
│ 1 2 3 │ 2.34 │ 1.592 │ 0.779 │
└────────────┴────────┴────────┴────────┘
Postgres#
We utilize the window function for Postgres as well (the only difference in this SQL query is the IS operator in the MATCH statement.)
SELECT
CASE state_idx
WHEN 0 THEN '_ _ _'
WHEN 1 THEN '1 _ _'
WHEN 2 THEN '_ 2 _'
WHEN 3 THEN '_ _ 3'
WHEN 4 THEN '1 2 _'
WHEN 5 THEN '1 _ 3'
WHEN 6 THEN '_ 2 3'
WHEN 7 THEN '1 2 3'
END AS "Base State",
ROUND(AVG(CASE WHEN outs = 0 THEN runs_roi END), 3) AS "0 Outs",
ROUND(AVG(CASE WHEN outs = 1 THEN runs_roi END), 3) AS "1 Out",
ROUND(AVG(CASE WHEN outs = 2 THEN runs_roi END), 3) AS "2 Outs"
FROM (
SELECT
outs_pre AS outs,
CASE
WHEN br1_pre IS NULL AND br2_pre IS NULL AND br3_pre IS NULL THEN 0
WHEN br1_pre IS NOT NULL AND br2_pre IS NULL AND br3_pre IS NULL THEN 1
WHEN br1_pre IS NULL AND br2_pre IS NOT NULL AND br3_pre IS NULL THEN 2
WHEN br1_pre IS NULL AND br2_pre IS NULL AND br3_pre IS NOT NULL THEN 3
WHEN br1_pre IS NOT NULL AND br2_pre IS NOT NULL AND br3_pre IS NULL THEN 4
WHEN br1_pre IS NOT NULL AND br2_pre IS NULL AND br3_pre IS NOT NULL THEN 5
WHEN br1_pre IS NULL AND br2_pre IS NOT NULL AND br3_pre IS NOT NULL THEN 6
ELSE 7
END AS state_idx,
SUM(runs) OVER (
PARTITION BY gid, inning, top_bot
ORDER BY pn
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) AS runs_roi
FROM GRAPH_TABLE (baseball
MATCH (e IS "Event")-[o IS OCCURRED_IN_GAME]->(g IS Game)
WHERE g.gametype = 'regular'
AND e.outs_pre < 3 AND e.inning < 9
COLUMNS (
g.gid,
e.pn,
e.inning,
e.top_bot,
e.runs,
e.outs_pre,
e.br1_pre,
e.br2_pre,
e.br3_pre
)
)
)
GROUP BY state_idx
ORDER BY state_idx;
postgres/re24.sql
Give it a run.
$ /postgres/bin/psql -U postgres -d graph_baseball -f postgres/re24.sql
Base State | 0 Outs | 1 Out | 2 Outs
------------+--------+-------+--------
_ _ _ | 0.500 | 0.266 | 0.102
1 _ _ | 0.889 | 0.527 | 0.228
_ 2 _ | 1.136 | 0.687 | 0.324
_ _ 3 | 1.387 | 0.963 | 0.364
1 2 _ | 1.490 | 0.927 | 0.445
1 _ 3 | 1.795 | 1.179 | 0.497
_ 2 3 | 2.006 | 1.402 | 0.575
1 2 3 | 2.340 | 1.592 | 0.779
Query results#
Our query crunches 15 seasons of data; if you filter the query to only the 2010 through 2015 seasons, they are very consistent with Tom Tango's RE24 numbers.
$ hyperfine --warmup 3 \
'lbug graph_baseball.lbug < ladybugdb/re24.cypher' \
'/root/.duckdb/cli/1.4.4/duckdb graph_baseball.duckdb < duckdb/re24.sql' \
'/postgres/bin/psql -U postgres -d graph_baseball -f postgres/re24.sql'
Benchmark 1: lbug graph_baseball.lbug < ladybugdb/re24.cypher
Time (mean ± σ): 174.212 s ± 2.261 s [User: 186.779 s, System: 2.416 s]
Range (min … max): 169.992 s … 177.670 s 10 runs
Benchmark 2: /root/.duckdb/cli/1.4.4/duckdb graph_baseball.duckdb < duckdb/re24.sql
Time (mean ± σ): 3.094 s ± 0.121 s [User: 4.634 s, System: 1.109 s]
Range (min … max): 2.940 s … 3.327 s 10 runs
Benchmark 3: /postgres/bin/psql -U postgres -d graph_baseball -f postgres/re24.sql
Time (mean ± σ): 23.747 s ± 4.956 s [User: 0.001 s, System: 0.003 s]
Range (min … max): 19.207 s … 33.392 s 10 runs
Summary
/root/.duckdb/cli/1.4.4/duckdb graph_baseball.duckdb < duckdb/re24.sql ran
7.68 ± 1.63 times faster than /postgres/bin/psql -U postgres -d graph_baseball -f postgres/re24.sql
56.31 ± 2.32 times faster than lbug graph_baseball.lbug < ladybugdb/re24.cypher
We see that the quadratic time complexity of the Kleene star operator bit LadybugDB hard, resulting in the slowest query of the entire study and the second-highest standard deviation. The optimization of the SQL window function is made apparent for both DuckDB and Postgres, running significantly faster and with greater consistency.
In summary#
DuckDB claimed first place in every benchmark, beating out LadybugDB by about 0.7 seconds on average (ignoring the extreme outlier of RE24), and beating Postgres by multiple seconds across the board. Keep in mind that SQL/PGQ is very cutting edge: duckpgq is considered a research project and does not yet target the latest DuckDB version, while Postgres is currently missing some important features like multi-pattern matching which limit its applicability.
It is pragmatic to use the right tool for the job, and Cypher has long been the graph database language of choice. The conversation might change with the emergence of SQL/PGQ providing the best of pure SQL and property graphs in a single engine. This approach helped DuckDB fly through the RE24 query while LadybugDB slowed to a crawl. With the addition of duckpgq, DuckDB becomes a veritable contender against the graph-native LadybugDB, making the selection between SQL/PGQ and Cypher a tradeoff between performance and maturity. Once SQL/PGQ reaches production readiness, the convergence could fundamentally reshape how graph workloads are handled in databases.
That's the ballgame!#
I hope you enjoyed learning about the power of graph databases, and maybe gained a new appreciation for baseball and Sabermetrics along the way. Graphs are extremely powerful data structures, and while graph databases traditionally filled a fairly specific niche, SQL/PGQ makes the distinction fuzzier. It will be interesting to see if graphs become more mainstream as the capabilities of these databases mature! Our methodologies revealed each database is capable, (albeit with a few caveats and tradeoffs) and should only get better with time.
I appreciate all the hard work of the communities behind LadybugDB, DuckDB, and Postgres, and I really look forward to their growth, adoption, and continued development.
Keep building, sharing, and contributing, and most importantly have fun!
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org.
Trying to use VARCHAR raised the following error during Property Graph creation:
no equality operator exists for SOURCE key comparison of edge "games_ballparks".From the Postgres documentation on Character Types:
Textis PostgreSQL's native string data type, in that most built-in functions operating on strings are declared to take or returntextnotcharacter varying. Character varying acts as though it were a domain over text. Any operator or function that can be applied to the underlying type will work on the domain type.I do not know the root cause of this error, but I thought it was interesting that within the Property Graph, the equality operator between two TEXT types works, but VARCHAR does not. This seems to contradict the final sentence above!↩
Some column names are re-used across different Retrosheet files and refer to different entities. For instance,
lp(losing pitcher) can either refer to the id of the player or a boolean (as a 0 or 1). This causes issues for Postgres, raising:ERROR: property "lp" data type mismatch: text vs. bigintDETAIL: In a property graph, a property of the same name has to have the same data type in each label.As a workaround, we'll rename the conflicting columns in the
gametable to avoid issues.↩There has been very recent work in the pgsql-hackers mailing list to add support under the subject.↩
Noticed a mistake? Have a question or comment? Write to the editor.