GitHub - ankane/ruby-polars: Blazingly fast DataFrames for Ruby

Ruby Polars

🔥 Blazingly fast DataFrames for Ruby, powered by Polars

Installation

Add this line to your application’s Gemfile:
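
The gem is published as `polars-df`, so the Gemfile entry would look like the fragment below. No version constraint is shown; pin one as your project requires.

```ruby
# Gemfile
gem "polars-df"
```

Then run `bundle install`.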

Getting Started

This library follows the Polars Python API.

Polars.scan_csv("iris.csv")
  .filter(Polars.col("sepal_length") > 5)
  .group_by("species")
  .agg(Polars.all.sum)
  .collect

You can follow Polars tutorials and convert the code to Ruby in many cases. Feel free to open an issue if you run into problems.

Reference

Examples

Creating DataFrames

From a CSV

Polars.read_csv("file.csv")

# or lazily with
Polars.scan_csv("file.csv")

From Parquet

Polars.read_parquet("file.parquet")

# or lazily with
Polars.scan_parquet("file.parquet")

From Active Record

Polars.read_database(User.all)
# or
Polars.read_database("SELECT * FROM users")

From JSON

Polars.read_json("file.json")
# or
Polars.read_ndjson("file.ndjson")

# or lazily with
Polars.scan_ndjson("file.ndjson")

From Feather / Arrow IPC

Polars.read_ipc("file.arrow")

# or lazily with
Polars.scan_ipc("file.arrow")

From Avro

Polars.read_avro("file.avro")

From Iceberg (experimental, requires iceberg)

Polars.scan_iceberg(table)

From Delta Lake (experimental, requires deltalake-rb)

Polars.read_delta("./table")

# or lazily with
Polars.scan_delta("./table")

From a hash

Polars::DataFrame.new({
  a: [1, 2, 3],
  b: ["one", "two", "three"]
})

From an array of hashes

Polars::DataFrame.new([
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
])

From an array of series

Polars::DataFrame.new([
  Polars::Series.new("a", [1, 2, 3]),
  Polars::Series.new("b", ["one", "two", "three"])
])

Attributes

Get number of rows

Get column names

Check if a column exists

Selecting Data

Select a column

Select multiple columns

Select first rows

Select last rows

Filtering

Filter on a condition

df.filter(Polars.col("a") == 2)
df.filter(Polars.col("a") != 2)
df.filter(Polars.col("a") > 2)
df.filter(Polars.col("a") >= 2)
df.filter(Polars.col("a") < 2)
df.filter(Polars.col("a") <= 2)

And, or, and exclusive or

df.filter((Polars.col("a") > 1) & (Polars.col("b") == "two")) # and
df.filter((Polars.col("a") > 1) | (Polars.col("b") == "two")) # or
df.filter((Polars.col("a") > 1) ^ (Polars.col("b") == "two")) # xor

Operations

Basic operations

df["a"] + 5
df["a"] - 5
df["a"] * 5
df["a"] / 5
df["a"] % 5
df["a"] ** 2
df["a"].sqrt
df["a"].abs

Rounding

df["a"].round(2)
df["a"].ceil
df["a"].floor

Logarithm

df["a"].log # natural log
df["a"].log(10)

Exponentiation

Trigonometric functions

df["a"].sin
df["a"].cos
df["a"].tan
df["a"].arcsin
df["a"].arccos
df["a"].arctan

Hyperbolic functions

df["a"].sinh
df["a"].cosh
df["a"].tanh
df["a"].arcsinh
df["a"].arccosh
df["a"].arctanh

Summary statistics

df["a"].sum
df["a"].mean
df["a"].median
df["a"].quantile(0.90)
df["a"].min
df["a"].max
df["a"].std
df["a"].var

Grouping

Group

Works with all summary statistics

Multiple groups

df.group_by(["a", "b"]).count

Combining Data Frames

Add rows

Add columns

Inner join

df.join(other_df, on: "a")

Left join

df.join(other_df, on: "a", how: "left")

Encoding

One-hot encoding

Conversion

Array of hashes

Hash of series

CSV

df.to_csv
# or
df.write_csv("file.csv")

Parquet

df.write_parquet("file.parquet")

JSON

df.write_json("file.json")
# or
df.write_ndjson("file.ndjson")

Feather / Arrow IPC

df.write_ipc("file.arrow")

Avro

df.write_avro("file.avro")

Iceberg (experimental)

df.write_iceberg(table, mode: "append")

Delta Lake (experimental)

df.write_delta("./table")

Numo array

Types

You can specify column types when creating a data frame:

Polars::DataFrame.new(data, schema: {"a" => Polars::Int32, "b" => Polars::Float32})

Supported types are:

  • boolean - Boolean
  • decimal - Decimal
  • float - Float32, Float64
  • integer - Int8, Int16, Int32, Int64, Int128
  • unsigned integer - UInt8, UInt16, UInt32, UInt64, UInt128
  • string - String, Categorical, Enum
  • temporal - Date, Datetime, Duration, Time
  • nested - Array, List, Struct
  • other - Binary, Object, Null, Unknown

Get column types

For a specific column

Cast a column

df["a"].cast(Polars::Int32)

Visualization

Add Vega to your application’s Gemfile:
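
The Vega gem is published as `vega`, so the Gemfile entry would look like the fragment below. No version constraint is shown.

```ruby
# Gemfile
gem "vega"
```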

And use:

df.plot("a", "b", type: "line")

Supports line, pie, column, bar, area, and scatter plots

Group data

df.plot("a", "b", group: "c", type: "line")

Stacked columns or bars

df.plot("a", "b", group: "c", type: "column", stacked: true)

Plot a series [unreleased]

Supports hist, kde, and line plots

History

View the changelog

Contributing

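Everyone is encouraged to help improve this project. Here are a few ways you can help:

  • Report bugs
  • Fix bugs and submit pull requests
  • Write, clarify, or fix documentation
  • Suggest or add new features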

To get started with development:

git clone https://github.com/ankane/ruby-polars.git
cd ruby-polars
bundle install
bundle exec rake compile
bundle exec rake test
bundle exec rake test:docs