GitHub - CIG-GitHub/serif: A Python-native library for handling vectors, tables and multi-dimensional arrays.

serif

A clean, typed, composable data layer for Python, built on Vector and Table.

Vector provides the foundation; Table is your primary tool for readable data modeling and analysis workflows.

30-Second Example

from serif import Table

# Create a table with automatic column name sanitization
t = Table({
    "price ($)": [10, 20, 30],
    "quantity":  [4, 5, 6]
})

# Add calculated columns with dict syntax
t >>= {'total': t.price * t.quantity}
t >>= {'tax': t.total * 0.1}

t
# 'price ($)'   quantity   total      tax
#      .price  .quantity  .total     .tax
#       [int]      [int]   [int]  [float]
#          10          4      40      4.0
#          20          5     100     10.0
#          30          6     180     18.0
#
# 3×4 table <mixed>

Real-World Example: Interactive CSV Exploration

from serif import read_csv

t = read_csv("sales.csv")  # Messy column names? No problem.

# Discover columns interactively (no print needed!)
#   t. + [TAB]      → shows all sanitized column names
#   t.pr + [TAB]    → t.price
#   t.qua + [TAB]   → t.quantity

# Compose expressions naturally
total = t.price * t.quantity

# Add derived columns
t >>= {'total': total}

# Inspect (original names preserved in display!)
t
# 'price ($)'  'quantity'   'total'
#      .price   .quantity    .total
#          10           4        40
#          20           5       100
#          30           6       180
#
# 3×3 table <int>

The power: You don't need to know the CSV contents upfront. Tab completion guides you, the repr shows you everything, and messy column names are automatically cleaned for dot-access.

Installation

Zero external dependencies. In a fresh environment:

pip freeze
# serif==0.x.y

Why serif?

Explicit, predictable vector semantics
Tables compose cleanly from vectors
Readable "spreadsheet-like" workflows
Table-owns-storage: building a table copies inputs so tables never share columns by accident
Controlled mutation: column vectors are live views; in-place updates mutate only that table
Immediate visual feedback via __repr__

Quickstart

Vectors: elementwise operations

from serif import Vector

a = Vector([1, 2, 3, 4, 5])
b = Vector([10, 20, 30, 40, 50])

a + b           # Vector([11, 22, 33, 44, 55])
a * 2           # Vector([2, 4, 6, 8, 10])
a > 3           # Vector([False, False, False, True, True])

Tables: compose vectors with `>>`

from serif import Table

# Column names auto-sanitize to valid Python attributes
t = Table({
    "first name": [1, 2, 3],
    "price ($)":  [10, 20, 30]
})

t.first_name    # Vector([1, 2, 3])
t.price         # Vector([10, 20, 30])

# Add columns with >>= (recommended)
t >>= (t.first_name * t.price).alias("total")

t
# 'first name'  'price ($)'  total
#  .first_name       .price  .total
#            1           10      10
#            2           20      40
#            3           30      90
#
# 3×3 table <int>

Boolean masking

filtered = t[t.price > 15]

filtered
# 'first name'  'price ($)'  total
#  .first_name       .price  .total
#            2           20      40
#            3           30      90
#
# 2×3 table <int>

Joins

customers = Table({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
scores = Table({'id': [2, 3, 4], 'score': [85, 90, 95]})

result = customers.inner_join(scores, left_on='id', right_on='id')

result
#    id  name           id   score
#   .id  .name      .id__2  .score
# [int]  [str]       [int]   [int]
#     2  'Bob'           2      85
#     3  'Charlie'       3      90
#
# 2×4 table <mixed>

Aggregations

t = Table({'customer': ['A', 'B', 'A'], 'amount': [100, 200, 150]})

result = t.aggregate(
    over=t.customer,
    sum_over=t.amount,
    count_over=t.amount
)

result
# customer  amount_sum  amount_count
#    [str]       [int]         [int]
#      'A'         250             2
#      'B'         200             1
#
# 2×3 table <mixed>

See docs/joins-aggregations.md for detailed examples.

Key Features

Automatic `repr`: Instant Visual Feedback

# Dictionary syntax: quick and familiar
t = Table({'id': range(100), 'value': [x**2 for x in range(100)]})

# Or compose from vectors: showcases Vector's design philosophy
a = Vector(range(100), name='id')
t = a >> (a**2).alias('value')

t
# id  value
#  0      0
#  1      1
#  2      4
#  3      9
#  4     16
#... ...
# 95   9025
# 96   9216
# 97   9409
# 98   9604
# 99   9801
#
# 100×2 table <int>

Head/tail preview + type annotations + dimensions—no need for .head(), .info(), etc.

Column Name Sanitization

Column names are sanitized to valid Python identifiers so you can access them with dot notation:

t = Table({"2023-Q1 Revenue ($M)": [1, 2, 3]})
t.c2023_q1_revenue_m  # Deterministic, predictable access

Rules:

Non-alphanumeric characters become _
Leading digits get c prefix
All lowercase

Unnamed columns use system names: t.col0_, t.col1_, etc. Original column names are preserved; sanitization only affects attribute access.

Typed Subclasses

Vector auto-creates typed subclasses with method proxying:

from datetime import date

dates = Vector([date(2023, 6, 29), date(2024, 1, 2), date(2024, 12, 28)])
dates += 5       # Add 5 days to each date
dates.year       # Vector([2023, 2024, 2025]) - one crossed the year boundary!

Works for int, float, str, date types.

Common Gotchas

Don't use subscript lists—use boolean masks

# ANTI-PATTERN
indices = [1, 5, 9]
result = v[indices]  # Slow, emits warning

# IDIOMATIC
mask = (v > threshold)
result = v[mask]

Operator overloading: avoid `.index()` on Vector lists

# WRONG: invokes elementwise equality
cols = [table.year, table.month]
idx = cols.index(table.year)  # Returns boolean vector!

# CORRECT: use enumerate
for idx, col in enumerate(cols):
    if col is table.year:  # identity check
        ...

`None` handling

None is excluded from aggregations but counted in len():

v = Vector([10, None, 20])
v.sum()   # 30 (None excluded)
len(v)    # 3 (None counted)

Just Write Python

Not every task fits neatly into a vectorized expression. When a loop is the clearest approach, serif keeps it efficient.

for row in table: is fully supported and stays lightweight, so you can use whichever style makes the code easiest to understand.

Design Philosophy

serif prioritizes clarity and workflow ergonomics.

What you get:

Readable, debuggable code
No hidden state or aliasing bugs (copy-on-write)
Deterministic operations
Zero dependencies
O(1) fingerprinting for change detection

When to use serif:

Modeling-scale data (10K–1M rows)
Correctness and maintainability matter most
Interactive workflows (Jupyter notebooks, REPL)
Projects where zero dependencies is important

Further Documentation

Performance & Complexity — O(n) analysis for joins, aggregations, indexing
Exception Handling — Custom exception types and error handling
Aliasing & Fingerprints — Copy-on-write and change detection
Joins & Aggregations — Detailed examples and patterns
Development Guide — Running tests, project structure

Philosophy

Clarity beats cleverness
Explicit beats implicit
Modeling should feel intuitive
You should always know what your code is doing

License

MIT

Contributing

See CONTRIBUTING.md and CODE_OF_CONDUCT.md.