blobsearch

High-performance log storage and analytics using Parquet compression and DuckDB. Built for modern web applications.

Why? Cloud monitoring solutions, while easy to use, are lucrative for vendors: free tiers exist to capture the market, with the expectation of squeezing profits later through usage-based pricing. BlobSearch is designed to be a cost-effective alternative to cloud-based solutions, providing a flexible and scalable platform for log storage and analytics.

Comparison: BlobSearch vs Cloud-Based Log Solutions

| Feature | BlobSearch | Cloud-Based Solutions |
|---------|------------|-----------------------|
| Hosting | Self-hosted | SaaS (CloudWatch, Datadog, LogDNA, Papertrail, etc.) |
| Cost Model | Storage + compute only | Per-log ingestion + retention + host pricing |
| Data Ownership | Your S3 bucket | Vendor's servers |
| Query Language | SQL (DuckDB) | Proprietary query languages |
| Data Format | Parquet (open standard) | Vendor-specific formats |
| Portability | Full (export from S3) | Vendor lock-in |
| Setup Time | 5 minutes | 5-30 minutes |
| Ingestion Rate | 28K+ logs/sec | High (with throttling and tier limits) |
| Compression | 3.7x (Snappy) | ~2x |
| Query Performance | <50ms on 56K logs | Fast on recent data; historical queries can be slower or costlier |
| Alerting | DIY | Native |
| Visualization | DIY | Built-in dashboards |
| Retention Cost | S3 standard rates (predictable) | High for long retention (scales with usage) |
| Open Source | ✅ Yes | ❌ No |

When to Choose BlobSearch

Perfect for:

  • Startups wanting predictable costs
  • Teams with SQL experience
  • Multi-cloud or hybrid environments
  • Projects requiring data portability
  • Long-term log retention (months/years)
  • Privacy-sensitive applications
  • Pure log storage without analytics overhead

Trade-offs:

  • No built-in UI (use BI tools like Grafana, Metabase)
  • Alerting requires integration (e.g., CloudWatch Alarms on S3 metrics, Lambda functions)
  • Self-managed infrastructure

🚀 Quickstart

Run with Your S3

docker run -d -p 8080:8080 \
  -e ENDPOINT=https://s3.amazonaws.com \
  -e ACCESS_KEY=your-key \
  -e SECRET_KEY=your-secret \
  -e BUCKET=your-bucket \
  -e REGION=us-east-1 \
  ghcr.io/amr8t/blobsearch/ingestor:latest

# Send logs (any JSON format with timestamp and level fields)
echo '{"timestamp":"2024-01-15T10:30:00Z","level":"error","message":"Database connection failed"}' | \
  curl -X POST --data-binary @- http://localhost:8080/ingest

# Flush to S3
curl -X POST http://localhost:8080/flush

Query with DuckDB

INSTALL httpfs; LOAD httpfs;
SET s3_region='us-east-1';
SET s3_access_key_id='your-key';
SET s3_secret_access_key='your-secret';

SELECT * FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
WHERE date = '2024-01-15' AND level = 'error'
LIMIT 10;
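
To avoid repeating the read_parquet call in every query, you can wrap it in a view (a convenience sketch; the view name logs is arbitrary):

CREATE VIEW logs AS
SELECT * FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true);

SELECT level, COUNT(*) AS count FROM logs GROUP BY level;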

For development/testing with MinIO: See harness/README.md

Features

  • Format Agnostic - Works with any JSON log format via configurable field extraction
  • Fast - 28K+ entries/sec ingestion
  • Efficient - Parquet + Snappy (3.7x compression)
  • Quick Queries - DuckDB queries in <50ms on 56K logs
  • S3-Compatible - AWS S3, MinIO, DigitalOcean Spaces, R2, etc.
  • Partitioned - Hive-style partitioning by date/level (no redundant part suffixes)
  • Auto-Flush - Configurable automatic flushing (default: 90s)
  • Dedupe - Optional deduplication

Production Deployment

Docker run example

Run the ingestor on the host; containers send logs to the host IP:

# Run ingestor on host
docker run -d \
  --name blobsearch-ingestor \
  --restart unless-stopped \
  -p 8080:8080 \
  -p 12201:12201 \
  -e ENDPOINT=https://s3.amazonaws.com \
  -e ACCESS_KEY=your-key \
  -e SECRET_KEY=your-secret \
  -e BUCKET=my-logs \
  -e REGION=us-east-1 \
  ghcr.io/amr8t/blobsearch/ingestor:latest

# Configure containers to use the GELF driver
# (172.17.0.1 is the default docker0 bridge gateway IP)
docker run -d \
  --log-driver=gelf \
  --log-opt gelf-address=tcp://172.17.0.1:12201 \
  --log-opt tag=my-app \
  your-app:latest

Docker Compose Example

version: '3.8'

services:
  ingestor:
    image: ghcr.io/amr8t/blobsearch/ingestor:latest
    ports:
      - "8080:8080"
      - "12201:12201"
    environment:
      ENDPOINT: https://s3.amazonaws.com
      ACCESS_KEY: ${AWS_ACCESS_KEY}
      SECRET_KEY: ${AWS_SECRET_KEY}
      BUCKET: my-logs
      REGION: us-east-1

  app:
    image: your-app:latest
    logging:
      driver: gelf
      options:
        gelf-address: "tcp://ingestor:12201"
        tag: "my-app"
    depends_on:
      - ingestor

Kubernetes

Deploy as a DaemonSet so every node runs an ingestor. Containers send logs to localhost:12201:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: blobsearch-ingestor
spec:
  selector:
    matchLabels:
      app: blobsearch-ingestor
  template:
    metadata:
      labels:
        app: blobsearch-ingestor
    spec:
      hostNetwork: true
      containers:
      - name: ingestor
        image: ghcr.io/amr8t/blobsearch/ingestor:latest
        ports:
        - containerPort: 12201
          hostPort: 12201
        - containerPort: 8080
          hostPort: 8080
        env:
        - name: ENDPOINT
          value: "https://s3.amazonaws.com"
        - name: ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: access-key
        - name: SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: secret-key
        - name: BUCKET
          value: "my-logs"
        - name: REGION
          value: "us-east-1"

Configure your pods to use GELF logging:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: your-app:latest
    # Docker GELF driver configuration
    # Add to docker daemon.json or use logging sidecar

Note: Configure Docker daemon on nodes with GELF driver pointing to tcp://127.0.0.1:12201
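
For example, a minimal /etc/docker/daemon.json on each node (a sketch; this assumes the nodes use the Docker runtime, so containerd-based clusters need a different approach, such as a log-forwarding sidecar):

{
  "log-driver": "gelf",
  "log-opts": {
    "gelf-address": "tcp://127.0.0.1:12201"
  }
}

Restart the Docker daemon after changing this file.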

Field Extraction

BlobSearch is format-agnostic and works with any JSON log format. It extracts two key fields for partitioning:

Timestamp Field

Used for time-based partitioning (date=YYYY-MM-DD). Configure which JSON fields to check:

TIMESTAMP_FIELDS="timestamp,time,@timestamp"  # Default

BlobSearch checks each field in order and uses the first one found. Supports RFC3339, RFC3339Nano, and common ISO formats.

Examples:

{"timestamp": "2024-01-15T10:30:00Z", ...}        # ✓ Works
{"time": "2024-01-15T10:30:00.123Z", ...}         # ✓ Works
{"@timestamp": "2024-01-15 10:30:00", ...}        # ✓ Works

Level Field

Used for severity-based partitioning (level=error|warn|info|debug). Configure which JSON fields to check:

LEVEL_FIELDS="level,severity,severityText"  # Default

Supports both string values ("ERROR", "error") and numeric values (syslog/OTLP scales).

Examples:

{"level": "error", ...}                           # ✓ Works
{"severity": "ERROR", ...}                        # ✓ Works (normalized to lowercase)
{"severityText": "WARN", ...}                     # ✓ Works (OpenTelemetry)
{"severityNumber": 17, ...}                       # ✓ Works (OTLP numeric)

Common Formats Supported:

  • Generic JSON logs with level field
  • OpenTelemetry (OTEL) logs with severityText/severityNumber
  • Structured logs with severity field
  • Custom formats via field configuration

Configuration

Ingestor Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| ENDPOINT | (required) | S3 endpoint URL |
| ACCESS_KEY | (required) | S3 access key |
| SECRET_KEY | (required) | S3 secret key |
| BUCKET | blobsearch | S3 bucket name |
| REGION | us-east-1 | S3 region |
| PREFIX | logs | S3 key prefix |
| BATCH_SIZE | 10000 | Logs per Parquet file |
| COMPRESSION | snappy | snappy, gzip, or none |
| WITH_TIMESTAMPS | true | Parse timestamps from logs |
| DEDUPLICATE | false | Enable deduplication |
| DEDUP_WINDOW | 100000 | Dedup cache size |
| AUTO_FLUSH | true | Enable automatic periodic flushing |
| AUTO_FLUSH_INTERVAL | 90 | Auto-flush interval in seconds |
| TIMESTAMP_FIELDS | timestamp,time,@timestamp | Comma-separated JSON fields to check for the timestamp |
| LEVEL_FIELDS | level,severity,severityText | Comma-separated JSON fields to check for the log level |
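
For example, to match a custom log schema (the field names ts and lvl here are hypothetical), override the extraction fields at startup:

docker run -d -p 8080:8080 \
  -e ENDPOINT=https://s3.amazonaws.com \
  -e ACCESS_KEY=your-key \
  -e SECRET_KEY=your-secret \
  -e BUCKET=your-bucket \
  -e TIMESTAMP_FIELDS="ts,timestamp" \
  -e LEVEL_FIELDS="lvl,level" \
  ghcr.io/amr8t/blobsearch/ingestor:latest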

API

POST /ingest

Ingest logs (newline-delimited text or JSON).

cat app.log | curl -X POST --data-binary @- http://localhost:8080/ingest
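
Multiple entries can be sent in one request as newline-delimited JSON, for example:

printf '%s\n%s\n' \
  '{"timestamp":"2024-01-15T10:30:00Z","level":"info","message":"request handled"}' \
  '{"timestamp":"2024-01-15T10:30:01Z","level":"warn","message":"slow query"}' | \
  curl -X POST --data-binary @- http://localhost:8080/ingest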

POST /gelf

Ingest GELF formatted logs (HTTP endpoint).

# GELF messages can be sent via HTTP POST
curl -X POST -H "Content-Type: application/json" \
  --data '{"version":"1.1","host":"myhost","short_message":"Log message","timestamp":1234567890.123,"level":6}' \
  http://localhost:8080/gelf

TCP Port 12201

Accept GELF messages via TCP (Docker GELF logging driver).

This is automatically enabled when running in HTTP mode. Configure your Docker containers:

logging:
  driver: gelf
  options:
    gelf-address: "tcp://ingestor:12201"
    tag: "my-app"

Note: TCP is the default for reliability. For high-throughput scenarios where some message loss is acceptable, you can use UDP by starting a UDP server.
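
To smoke-test the TCP listener by hand, you can send a single frame (a sketch assuming standard GELF TCP framing, where each message is terminated by a null byte; nc flag behavior varies by variant):

printf '{"version":"1.1","host":"test","short_message":"hello over tcp","level":6}\0' | \
  nc localhost 12201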

POST /flush

Flush buffered logs to S3.

curl -X POST http://localhost:8080/flush

GET /stats

Get ingestion statistics.

curl http://localhost:8080/stats

Querying Logs

Basic Queries

-- Count logs by level
SELECT level, COUNT(*) as count
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
GROUP BY level;

-- Recent errors
SELECT timestamp, message
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
WHERE level = 'error'
ORDER BY timestamp DESC
LIMIT 10;

-- Time series
SELECT date, COUNT(*) as log_count
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
GROUP BY date
ORDER BY date DESC;

Working with JSON Logs

-- Extract fields from JSON messages
SELECT 
    timestamp,
    level,
    json_extract_string(message, '$.service') as service,
    json_extract_string(message, '$.error_code') as error_code,
    message
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
WHERE level = 'error'
LIMIT 10;

-- Count by service
SELECT 
    json_extract_string(message, '$.service') as service,
    COUNT(*) as count
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
WHERE message LIKE '{%'
GROUP BY service
ORDER BY count DESC;

Advanced Queries

See QUERY_GUIDE.md for:

  • Partition pruning for faster queries (see the sketch after this list)
  • Deduplication strategies
  • Performance optimization
  • Complex aggregations
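
A minimal illustration of partition pruning (same pattern as the quickstart query; date and level are the partition columns, so DuckDB can skip non-matching date=/level= prefixes entirely):

-- Filters on partition columns prune whole directories before any file is read
SELECT timestamp, message
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
WHERE date BETWEEN '2024-01-10' AND '2024-01-15'
  AND level = 'error'
ORDER BY timestamp DESC
LIMIT 100;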

Log Collection Methods

GELF Logging Driver (Recommended)

Use Docker's native GELF logging driver for seamless integration:

services:
  app:
    image: your-app:latest
    logging:
      driver: gelf
      options:
        gelf-address: "tcp://ingestor:12201"
        tag: "my-app"

Advantages:

  • No additional containers
  • No Docker socket access needed
  • Native Docker integration
  • TCP ensures reliable delivery
  • Minimal overhead

See examples/docker/README.md for full examples.

How It Works

1. Parquet Storage

Columnar format with compression:

  • Fast analytical queries
  • Excellent compression ratios (3-4x)
  • Efficient for time-series data
  • Native support in DuckDB
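
To check the compression of a written file, DuckDB's parquet_metadata function can inspect it (the object key below is illustrative):

SELECT row_group_id, path_in_schema, compression,
       total_compressed_size, total_uncompressed_size
FROM parquet_metadata('s3://your-bucket/logs/date=2024-01-15/level=error/batch.parquet');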

2. Hive Partitioning

Logs are partitioned by: date=YYYY-MM-DD/level=error/

Benefits:

  • Query only relevant partitions
  • 99.9% reduction in files scanned
  • Sub-second queries on millions of logs
  • Optimized for time-based and level-based filtering
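
The resulting S3 layout looks like this (object names are illustrative; logs/ is the PREFIX default):

s3://my-logs/logs/date=2024-01-15/level=error/batch-0001.parquet
s3://my-logs/logs/date=2024-01-15/level=info/batch-0002.parquet
s3://my-logs/logs/date=2024-01-16/level=error/batch-0003.parquet

A query filtered to date = '2024-01-16' AND level = 'error' touches only the last prefix.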

3. Structured Logs

Optimized for JSON structured logs from modern frameworks:

  • Next.js
  • Rails
  • Express
  • FastAPI
  • Any app outputting JSON logs

Documentation

Examples

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Use the harness for testing: cd harness && make docker-up
  4. Make changes and test
  5. Run tests: go test -v ./...
  6. Submit a pull request

Support

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.

This license requires that if you run a modified version of this software on a server and provide network access to users, you must make the modified source code available to those users.

Credits

Built with: