Not long ago, I came across the StarRocks database. It intrigued me because many people praise its high performance, especially on large volumes of data. I decided to dig deeper into its functionality and try out some basic operations.
So, what is StarRocks all about? StarRocks is a high-performance, scalable database designed for real-time analytics on large volumes of data. It is known for handling large amounts of structured data and serving analytical queries and reports quickly. Its key advantages include:
- Efficient OLAP operations
- Distributed operations
- High query speed
- Flexibility and scalability
Let’s move on to the test! :)
Previously, I had already installed the StarRocks image in Docker.
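For reference, the StarRocks quickstart uses an all-in-one image; a typical run looks like this (the image tag and port mapping may differ in your setup):

docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 -itd --name starrocks starrocks/allin1-ubuntu

Port 9030 is the MySQL-protocol query port, which is what the connection string below points at.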
Now let’s try to connect and create a table. I will perform all operations and commands in Jupyter Notebook.
Test file from Kaggle: https://www.kaggle.com/datasets/declanmcalinden/time-series-uk-supermarket-data
import pandas as pd
import time
from sqlalchemy import create_engine, Table, MetaData, Column, String, Numeric, Text, Date, Boolean
from sqlalchemy.sql import text

%load_ext sql
%config SqlMagic.autocommit=False
%sql mysql+pymysql://root:@localhost:9030
%sql CREATE DATABASE jupysql;
%sql USE jupysql;
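As a quick sanity check that the connection works and the database exists (this step is my own addition):

%sql SHOW DATABASES;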
%%sql
CREATE TABLE supermarket_prices_new (
    supermarket VARCHAR(6),
    prices NUMERIC(10, 2),
    prices_unit NUMERIC(10, 2),
    unit VARCHAR(5),
    names TEXT,
    date DATE,
    category VARCHAR(25),
    own_brand BOOLEAN
);
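One caveat: on older StarRocks releases (before 3.0), the table model and data distribution had to be specified explicitly, so the statement above may be rejected there. In that case, a duplicate-key variant along these lines should work; the key column and bucket count here are my own illustrative choices:

CREATE TABLE supermarket_prices_new (
    supermarket VARCHAR(6),
    prices NUMERIC(10, 2),
    prices_unit NUMERIC(10, 2),
    unit VARCHAR(5),
    names TEXT,
    date DATE,
    category VARCHAR(25),
    own_brand BOOLEAN
) ENGINE = OLAP
DUPLICATE KEY(supermarket)
DISTRIBUTED BY HASH(supermarket) BUCKETS 10;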
I became curious how long it would take to insert such a dataframe into the table.
For the test, I sampled 100,000 rows from the file and inserted them in batches of 10,000 rows each.
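For reference, here is how such a sample could be produced; the source file name is my assumption, substitute the actual CSV from the Kaggle dataset:

full_df = pd.read_csv('uk_supermarket_data.csv')  # hypothetical file name
full_df.sample(n=100000, random_state=42).to_csv('sample_100000.csv', index=False)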
df = pd.read_csv('sample_100000.csv')

engine = create_engine('mysql+pymysql://root:@localhost:9030/jupysql')
metadata = MetaData()
table_name = 'supermarket_prices_new'
supermarket_prices_new = Table(
    table_name,
    metadata,
    Column('supermarket', String(6)),
    Column('prices', Numeric(10, 2)),
    Column('prices_unit', Numeric(10, 2)),
    Column('unit', String(5)),
    Column('names', Text),
    Column('date', Date),
    Column('category', String(25)),
    Column('own_brand', Boolean),
    # Only takes effect if the StarRocks SQLAlchemy dialect is installed;
    # with plain mysql+pymysql it is ignored with a warning.
    starrocks_engine='OLAP',
    extend_existing=True
)
# The table already exists from the %%sql cell above, so create_all()
# (with its default checkfirst=True) will skip re-creating it.
metadata.create_all(engine)

data = df.to_dict(orient='records')
batch_size = 10000
start_time = time.time()

with engine.connect() as connection:
    transaction = connection.begin()
    try:
        # Insert in chunks of batch_size rows; each execute() call is an
        # executemany over the list of row dicts in the batch.
        for i in range(0, len(data), batch_size):
            batch = data[i:i+batch_size]
            connection.execute(
                supermarket_prices_new.insert(),
                batch
            )
        transaction.commit()
        print("Data inserted successfully.")
    except Exception as e:
        transaction.rollback()
        print(f"Error occurred: {e}, transaction rolled back.")

end_time = time.time()
runtime_seconds = end_time - start_time
print(f"Script execution time: {runtime_seconds:.2f} seconds")
Result for 100,000 rows:
Data inserted successfully.
Script execution time: 11.70 seconds

The result was really good — around 11 seconds for the entire dataframe.
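To double-check the load, a quick row count against the table (this query is my own addition):

%sql SELECT COUNT(*) FROM supermarket_prices_new;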
In the next section, we will test how fast queries are through StarRocks and compare them with other databases.