Not long ago, I came across the StarRocks database. It intrigued me because many people praise its high performance, especially on large volumes of data. I decided to dig deeper into its functionality and try out some basic operations.
So, what is StarRocks all about? StarRocks is a high-performance, scalable database designed for real-time analytics on large volumes of data. It is known for handling large amounts of structured data and serving analytical queries and reports quickly. Its key advantages include:
- Efficient OLAP operations
- Distributed operations
- High query speed
- Flexibility and scalability
Let’s move on to the test! :)
Previously, I had already installed the StarRocks image in Docker.
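For reference, the StarRocks quickstart uses an all-in-one image; a typical run looks like this (the image tag and port mapping may differ in your setup):

docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 -itd --name starrocks starrocks/allin1-ubuntu

Port 9030 is the MySQL-protocol query port, which is what the connection string below points at.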
Now let’s try to connect and create a table. I will perform all operations and commands in Jupyter Notebook.
Test file from Kaggle: https://www.kaggle.com/datasets/declanmcalinden/time-series-uk-supermarket-data
import pandas as pd
import time
from sqlalchemy import create_engine, Table, MetaData, Column, String, Numeric, Text, Date, Boolean
from sqlalchemy.sql import text

%load_ext sql
%config SqlMagic.autocommit=False
%sql mysql+pymysql://root:@localhost:9030
%sql CREATE DATABASE jupysql;
%sql USE jupysql;
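As a quick sanity check that the connection works and the database exists (this step is my own addition):

%sql SHOW DATABASES;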
%%sql
CREATE TABLE supermarket_prices_new (
    supermarket VARCHAR(6),
    prices NUMERIC(10, 2),
    prices_unit NUMERIC(10, 2),
    unit VARCHAR(5),
    names TEXT,
    date DATE,
    category VARCHAR(25),
    own_brand BOOLEAN
);
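One caveat: on older StarRocks releases (before 3.0), the table model and data distribution had to be specified explicitly, so the statement above may be rejected there. In that case, a duplicate-key variant along these lines should work; the key column and bucket count here are my own illustrative choices:

CREATE TABLE supermarket_prices_new (
    supermarket VARCHAR(6),
    prices NUMERIC(10, 2),
    prices_unit NUMERIC(10, 2),
    unit VARCHAR(5),
    names TEXT,
    date DATE,
    category VARCHAR(25),
    own_brand BOOLEAN
) ENGINE = OLAP
DUPLICATE KEY(supermarket)
DISTRIBUTED BY HASH(supermarket) BUCKETS 10;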
I became curious how long it would take to insert such a dataframe into the table.
For the test, I sampled 100,000 rows from the file and inserted them in batches of 10,000 rows each.
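For reference, here is how such a sample could be produced; the source file name is my assumption, substitute the actual CSV from the Kaggle dataset:

full_df = pd.read_csv('uk_supermarket_data.csv')  # hypothetical file name
full_df.sample(n=100000, random_state=42).to_csv('sample_100000.csv', index=False)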
df = pd.read_csv('sample_100000.csv')

engine = create_engine('mysql+pymysql://root:@localhost:9030/jupysql')
metadata = MetaData()
table_name = 'supermarket_prices_new'
supermarket_prices_new = Table(
    table_name,
    metadata,
    Column('supermarket', String(6)),
    Column('prices', Numeric(10, 2)),
    Column('prices_unit', Numeric(10, 2)),
    Column('unit', String(5)),
    Column('names', Text),
    Column('date', Date),
    Column('category', String(25)),
    Column('own_brand', Boolean),
    # Only takes effect if the StarRocks SQLAlchemy dialect is installed;
    # with plain mysql+pymysql it is ignored with a warning.
    starrocks_engine='OLAP',
    extend_existing=True
)
# The table already exists from the %%sql cell above, so create_all()
# (with its default checkfirst=True) will skip re-creating it.
metadata.create_all(engine)

data = df.to_dict(orient='records')
batch_size = 10000
start_time = time.time()

with engine.connect() as connection:
    transaction = connection.begin()
    try:
        # Insert in chunks of batch_size rows; each execute() call is an
        # executemany over the list of row dicts in the batch.
        for i in range(0, len(data), batch_size):
            batch = data[i:i+batch_size]
            connection.execute(
                supermarket_prices_new.insert(),
                batch
            )
        transaction.commit()
        print("Data inserted successfully.")
    except Exception as e:
        transaction.rollback()
        print(f"Error occurred: {e}, transaction rolled back.")

end_time = time.time()
runtime_seconds = end_time - start_time
print(f"Script execution time: {runtime_seconds:.2f} seconds")
Result for 100,000 rows:
Data inserted successfully.
Script execution time: 11.70 seconds

The result was really good — around 11 seconds for the entire dataframe.
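To double-check the load, a quick row count against the table (this query is my own addition):

%sql SELECT COUNT(*) FROM supermarket_prices_new;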
In the next section, we will test how fast queries are through StarRocks and compare them with other databases.