High-performance log storage and analytics using Parquet compression and DuckDB. Built for modern web applications.
Why? Cloud monitoring services are easy to use, but their free tiers exist mainly to capture the market; the expectation is that usage-based pricing will extract the profit later. BlobSearch is designed as a cost-effective alternative to cloud-based solutions, providing a flexible and scalable platform for log storage and analytics.
Comparison: BlobSearch vs Cloud-Based Log Solutions
| Feature | BlobSearch | Cloud-Based Solutions |
|---|---|---|
| Hosting | Self-hosted | SaaS (CloudWatch, Datadog, LogDNA, Papertrail, etc.) |
| Cost Model | Storage + compute only | Per log ingestion + retention + host pricing |
| Data Ownership | Your S3 bucket | Vendor's servers |
| Query Language | SQL (DuckDB) | Proprietary query languages |
| Data Format | Parquet (open standard) | Vendor-specific formats |
| Portability | Full - export from S3 | Vendor lock-in |
| Setup Time | 5 minutes | 5-30 minutes |
| Ingestion Rate | 28K+ logs/sec | High (with throttling and tier limits) |
| Compression | 3.7x (Snappy) | ~2x |
| Query Performance | <50ms on 56K logs | Good on recent data |
| Alerting | DIY | Native |
| Visualization | DIY | Built-in dashboards |
| Retention Cost | S3 standard rates (predictable) | High for long retention (scales with usage) |
| Open Source | ✅ Yes | ❌ No |
When to Choose BlobSearch
Perfect for:
- Startups wanting predictable costs
- Teams with SQL experience
- Multi-cloud or hybrid environments
- Projects requiring data portability
- Long-term log retention (months/years)
- Privacy-sensitive applications
- Pure log storage without analytics overhead
Trade-offs:
- No built-in UI (use BI tools like Grafana, Metabase)
- Alerting requires integration (e.g., CloudWatch Alarms on S3 metrics, Lambda functions)
- Self-managed infrastructure
🚀 Quickstart
Run with Your S3
docker run -d -p 8080:8080 \
  -e ENDPOINT=https://s3.amazonaws.com \
  -e ACCESS_KEY=your-key \
  -e SECRET_KEY=your-secret \
  -e BUCKET=your-bucket \
  -e REGION=us-east-1 \
  ghcr.io/amr8t/blobsearch/ingestor:latest

# Send logs (any JSON format with timestamp and level fields)
echo '{"timestamp":"2024-01-15T10:30:00Z","level":"error","message":"Database connection failed"}' | \
  curl -X POST --data-binary @- http://localhost:8080/ingest

# Flush to S3
curl -X POST http://localhost:8080/flush
Query with DuckDB
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';
SET s3_access_key_id='your-key';
SET s3_secret_access_key='your-secret';

SELECT *
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
WHERE date = '2024-01-15' AND level = 'error'
LIMIT 10;
For development/testing with MinIO: See harness/README.md
Features
- Format Agnostic - Works with any JSON log format via configurable field extraction
- Fast - 28K+ entries/sec ingestion
- Efficient - Parquet + Snappy (3.7x compression)
- Quick Queries - DuckDB queries in <50ms on 56K logs
- S3-Compatible - AWS S3, MinIO, DigitalOcean Spaces, R2, etc.
- Partitioned - Hive-style partitioning by date/level (no redundant part suffixes)
- Auto-Flush - Configurable automatic flushing (default: 90s)
- Dedupe - Optional deduplication
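The README does not describe how deduplication works internally. As a rough illustration only, a bounded "seen recently" cache is one common way to get this behaviour; the sketch below assumes exact-match deduplication on the raw log line and a fixed-size window (compare the DEDUP_WINDOW setting). The class name `DedupWindow` and its method are hypothetical and not part of BlobSearch.

```python
import hashlib
from collections import OrderedDict


class DedupWindow:
    """Remember the most recent `size` distinct log lines and reject repeats.

    Hypothetical sketch: BlobSearch's real deduplication may work differently.
    """

    def __init__(self, size: int = 100_000) -> None:
        self.size = size
        self._seen = OrderedDict()  # insertion-ordered set of line hashes

    def is_new(self, line: str) -> bool:
        # Hash the raw line so the cache stores fixed-size keys.
        key = hashlib.sha256(line.encode("utf-8")).digest()
        if key in self._seen:
            return False
        self._seen[key] = None
        # Evict the oldest hash once the window is over capacity.
        if len(self._seen) > self.size:
            self._seen.popitem(last=False)
        return True


window = DedupWindow(size=2)
results = [window.is_new(line) for line in ["a", "b", "a", "c", "a"]]
print(results)
```

With a window of two, the third entry is dropped as a repeat, while the last "a" is accepted again because it has already been evicted. A small window therefore only catches duplicates that arrive close together.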
Production Deployment
Docker run example
Run the ingestor on the host; application containers send their logs to the host IP:
# Run ingestor on host
docker run -d \
  --name blobsearch-ingestor \
  --restart unless-stopped \
  -p 8080:8080 \
  -p 12201:12201 \
  -e ENDPOINT=https://s3.amazonaws.com \
  -e ACCESS_KEY=your-key \
  -e SECRET_KEY=your-secret \
  -e BUCKET=my-logs \
  -e REGION=us-east-1 \
  ghcr.io/amr8t/blobsearch/ingestor:latest

# Configure containers to use GELF
docker run -d \
  --log-driver=gelf \
  --log-opt gelf-address=tcp://172.17.0.1:12201 \
  --log-opt tag=my-app \
  your-app:latest
Docker Compose Example
version: '3.8'
services:
  ingestor:
    image: ghcr.io/amr8t/blobsearch/ingestor:latest
    ports:
      - "8080:8080"
      - "12201:12201"
    environment:
      ENDPOINT: https://s3.amazonaws.com
      ACCESS_KEY: ${AWS_ACCESS_KEY}
      SECRET_KEY: ${AWS_SECRET_KEY}
      BUCKET: my-logs
      REGION: us-east-1
  app:
    image: your-app:latest
    logging:
      driver: gelf
      options:
        gelf-address: "tcp://ingestor:12201"
        tag: "my-app"
    depends_on:
      - ingestor
Kubernetes
Deploy as a DaemonSet so every node runs an ingestor. Containers send logs to localhost:12201:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: blobsearch-ingestor
spec:
  selector:
    matchLabels:
      app: blobsearch-ingestor
  template:
    metadata:
      labels:
        app: blobsearch-ingestor
    spec:
      hostNetwork: true
      containers:
        - name: ingestor
          image: ghcr.io/amr8t/blobsearch/ingestor:latest
          ports:
            - containerPort: 12201
              hostPort: 12201
            - containerPort: 8080
              hostPort: 8080
          env:
            - name: ENDPOINT
              value: "https://s3.amazonaws.com"
            - name: ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: access-key
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: secret-key
            - name: BUCKET
              value: "my-logs"
            - name: REGION
              value: "us-east-1"
Configure your pods to use GELF logging:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: your-app:latest
      # Docker GELF driver configuration
      # Add to docker daemon.json or use logging sidecar
Note: Configure Docker daemon on nodes with GELF driver pointing to tcp://127.0.0.1:12201
Field Extraction
BlobSearch is format-agnostic and works with any JSON log format. It extracts two key fields for partitioning:
Timestamp Field
Used for time-based partitioning (date=YYYY-MM-DD). Configure which JSON fields to check:
TIMESTAMP_FIELDS="timestamp,time,@timestamp" # Default
BlobSearch checks each field in order and uses the first one found. Supports RFC3339, RFC3339Nano, and common ISO formats.
Examples:
{"timestamp": "2024-01-15T10:30:00Z", ...} # ✓ Works
{"time": "2024-01-15T10:30:00.123Z", ...} # ✓ Works
{"@timestamp": "2024-01-15 10:30:00", ...} # ✓ Works
Level Field
Used for severity-based partitioning (level=error|warn|info|debug). Configure which JSON fields to check:
LEVEL_FIELDS="level,severity,severityText" # Default
Supports both string values ("ERROR", "error") and numeric values (syslog/OTLP scales).
Examples:
{"level": "error", ...} # ✓ Works
{"severity": "ERROR", ...} # ✓ Works (normalized to lowercase)
{"severityText": "WARN", ...} # ✓ Works (OpenTelemetry)
{"severityNumber": 17, ...} # ✓ Works (OTLP numeric)
Common Formats Supported:
- Generic JSON logs with level field
- OpenTelemetry (OTEL) logs with severityText/severityNumber
- Structured logs with severity field
- Custom formats via field configuration
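The lookup described in this section (try each configured field in order, take the first one present, then derive a date=.../level=... partition) can be sketched as follows. This is an illustration under stated assumptions, not the ingestor's actual code: the helper names are hypothetical, timestamps without an offset are treated as UTC, the partition date is taken in UTC, and numeric severity values and nanosecond precision are not handled.

```python
from datetime import datetime, timezone

# Defaults taken from TIMESTAMP_FIELDS and LEVEL_FIELDS in this README.
TIMESTAMP_FIELDS = ["timestamp", "time", "@timestamp"]
LEVEL_FIELDS = ["level", "severity", "severityText"]


def first_present(record: dict, fields: list):
    """Return the value of the first listed field that exists in the record."""
    for name in fields:
        if name in record:
            return record[name]
    return None


def parse_timestamp(value: str) -> datetime:
    # Accept a space separator and a trailing "Z", then parse as ISO 8601.
    normalized = value.strip().replace(" ", "T", 1)
    if normalized.endswith("Z"):
        normalized = normalized[:-1] + "+00:00"
    parsed = datetime.fromisoformat(normalized)
    # Assumption of this sketch: timestamps without an offset are UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def partition_for(record: dict) -> str:
    """Build the Hive-style partition path for one JSON log record."""
    raw_ts = first_present(record, TIMESTAMP_FIELDS)
    raw_level = first_present(record, LEVEL_FIELDS)
    if raw_ts is None or raw_level is None:
        raise KeyError("record has no configured timestamp or level field")
    day = parse_timestamp(str(raw_ts)).astimezone(timezone.utc)
    # String levels are lowercased, matching the normalization noted above.
    return f"date={day:%Y-%m-%d}/level={str(raw_level).lower()}"


print(partition_for({"timestamp": "2024-01-15T10:30:00Z", "level": "error"}))
```

Changing the two lists at the top mirrors what setting TIMESTAMP_FIELDS or LEVEL_FIELDS does for a custom log format.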
Configuration
Ingestor Environment Variables
| Variable | Default | Description |
|---|---|---|
| ENDPOINT | required | S3 endpoint URL |
| ACCESS_KEY | required | S3 access key |
| SECRET_KEY | required | S3 secret key |
| BUCKET | blobsearch | S3 bucket name |
| REGION | us-east-1 | S3 region |
| PREFIX | logs | S3 key prefix |
| BATCH_SIZE | 10000 | Logs per Parquet file |
| COMPRESSION | snappy | snappy, gzip, or none |
| WITH_TIMESTAMPS | true | Parse timestamps from logs |
| DEDUPLICATE | false | Enable deduplication |
| DEDUP_WINDOW | 100000 | Dedup cache size |
| AUTO_FLUSH | true | Enable automatic periodic flushing |
| AUTO_FLUSH_INTERVAL | 90 | Auto-flush interval in seconds |
| TIMESTAMP_FIELDS | timestamp,time,@timestamp | Comma-separated JSON field names to check for timestamp |
| LEVEL_FIELDS | level,severity,severityText | Comma-separated JSON field names to check for log level |
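Before starting the container it can help to check an environment against the variables listed above. The following is a minimal pre-flight sketch, not the ingestor's own loader: the function `load_config` is hypothetical, the defaults are copied from this README, and the rule that only the string "true" enables a boolean is an assumption.

```python
import os

REQUIRED = ("ENDPOINT", "ACCESS_KEY", "SECRET_KEY")
DEFAULTS = {
    "BUCKET": "blobsearch",
    "REGION": "us-east-1",
    "PREFIX": "logs",
    "BATCH_SIZE": "10000",
    "COMPRESSION": "snappy",
    "WITH_TIMESTAMPS": "true",
    "DEDUPLICATE": "false",
    "DEDUP_WINDOW": "100000",
    "AUTO_FLUSH": "true",
    "AUTO_FLUSH_INTERVAL": "90",
    "TIMESTAMP_FIELDS": "timestamp,time,@timestamp",
    "LEVEL_FIELDS": "level,severity,severityText",
}
BOOLEANS = ("WITH_TIMESTAMPS", "DEDUPLICATE", "AUTO_FLUSH")
INTEGERS = ("BATCH_SIZE", "DEDUP_WINDOW", "AUTO_FLUSH_INTERVAL")
LISTS = ("TIMESTAMP_FIELDS", "LEVEL_FIELDS")


def load_config(env=None) -> dict:
    """Validate an environment mapping and return typed settings."""
    env = dict(os.environ if env is None else env)
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        raw = env.get(name, default)
        if name in BOOLEANS:
            # Assumption: only "true" (any case) switches a flag on.
            config[name] = raw.strip().lower() == "true"
        elif name in INTEGERS:
            config[name] = int(raw)
        elif name in LISTS:
            config[name] = [part.strip() for part in raw.split(",") if part.strip()]
        else:
            config[name] = raw
    if config["COMPRESSION"] not in ("snappy", "gzip", "none"):
        raise ValueError("COMPRESSION must be snappy, gzip, or none")
    return config


example = load_config({
    "ENDPOINT": "https://s3.amazonaws.com",
    "ACCESS_KEY": "your-key",
    "SECRET_KEY": "your-secret",
    "BATCH_SIZE": "500",
})
print(example["BUCKET"], example["BATCH_SIZE"], example["AUTO_FLUSH_INTERVAL"])
```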
API
POST /ingest
Ingest logs (newline-delimited text or JSON).
cat app.log | curl -X POST --data-binary @- http://localhost:8080/ingest
POST /gelf
Ingest GELF formatted logs (HTTP endpoint).
# GELF messages can be sent via HTTP POST
curl -X POST -H "Content-Type: application/json" \
  --data '{"version":"1.1","host":"myhost","short_message":"Log message","timestamp":1234567890.123,"level":6}' \
  http://localhost:8080/gelf
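For applications that build GELF messages themselves, the payload in the curl call above can be assembled as shown in this sketch. It follows the GELF 1.1 conventions (syslog-style numeric level, custom fields prefixed with an underscore, `_id` reserved); the helper name `gelf_message` is hypothetical, and which additional fields the ingestor keeps is not stated in this README.

```python
import json
import time


def gelf_message(host: str, short_message: str, level: int = 6,
                 timestamp: float = None, **extra) -> dict:
    """Build a GELF 1.1 message; keyword extras become _prefixed fields."""
    message = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        # Seconds since the Unix epoch, with optional fractional part.
        "timestamp": time.time() if timestamp is None else timestamp,
        # Syslog severity: 3 = error, 4 = warning, 6 = informational.
        "level": level,
    }
    for key, value in extra.items():
        if key == "id":
            raise ValueError("GELF reserves the additional field name _id")
        message["_" + key] = value
    return message


message = gelf_message("myhost", "Log message", timestamp=1234567890.123, service="api")
body = json.dumps(message)
print(body)
```

The resulting `body` string can be sent as the request body of POST /gelf with Content-Type: application/json.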
TCP Port 12201
Accept GELF messages via TCP (Docker GELF logging driver).
This is automatically enabled when running in HTTP mode. Configure your Docker containers:
logging:
  driver: gelf
  options:
    gelf-address: "tcp://ingestor:12201"
    tag: "my-app"
Note: TCP is the default for reliability. For high-throughput scenarios where some message loss is acceptable, you can use UDP by starting a UDP server.
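When sending GELF over TCP without the Docker driver, each JSON message is conventionally terminated by a single null byte, which is how GELF TCP streams are usually framed. The sketch below assumes the ingestor follows that convention; the README does not confirm it, and the function names are hypothetical.

```python
import json
import socket


def frame_gelf_tcp(message: dict) -> bytes:
    """Serialize one GELF message and append the null-byte delimiter."""
    # json.dumps escapes control characters, so the payload itself
    # never contains a raw null byte.
    return json.dumps(message, separators=(",", ":")).encode("utf-8") + b"\x00"


def split_gelf_frames(buffer: bytes):
    """Split received bytes into complete messages plus any unfinished tail."""
    *complete, remainder = buffer.split(b"\x00")
    return [json.loads(frame) for frame in complete if frame], remainder


def send_gelf_tcp(address: tuple, message: dict, timeout: float = 5.0) -> None:
    """Open a TCP connection and send one framed message."""
    with socket.create_connection(address, timeout=timeout) as conn:
        conn.sendall(frame_gelf_tcp(message))


first = {"version": "1.1", "host": "myhost", "short_message": "one", "level": 6}
second = {"version": "1.1", "host": "myhost", "short_message": "two", "level": 3}
stream = frame_gelf_tcp(first) + frame_gelf_tcp(second) + b'{"ver'
messages, tail = split_gelf_frames(stream)
print(len(messages), tail)
```

A call such as `send_gelf_tcp(("localhost", 12201), first)` would deliver a message to a locally running ingestor, assuming port 12201 is published as in the deployment examples.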
POST /flush
Flush buffered logs to S3.
curl -X POST http://localhost:8080/flush
GET /stats
Get ingestion statistics.
curl http://localhost:8080/stats
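The HTTP endpoints above can also be called from application code. This is a small standard-library sketch, assuming the ingestor listens on http://localhost:8080 as in the Quickstart; the helper names are hypothetical, and responses are returned as raw bytes because the response format is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed local ingestor address


def ingest_request(records: list, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /ingest request with one JSON document per line."""
    body = "\n".join(json.dumps(record) for record in records).encode("utf-8")
    return urllib.request.Request(base_url + "/ingest", data=body, method="POST")


def flush_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /flush request with an empty body."""
    return urllib.request.Request(base_url + "/flush", data=b"", method="POST")


def stats_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET /stats request."""
    return urllib.request.Request(base_url + "/stats", method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> bytes:
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


request = ingest_request([
    {"timestamp": "2024-01-15T10:30:00Z", "level": "error", "message": "Database connection failed"},
    {"timestamp": "2024-01-15T10:30:01Z", "level": "info", "message": "Retrying"},
])
print(request.get_method(), request.full_url)
```

With an ingestor running, `send(request)` followed by `send(flush_request())` mirrors the two curl commands from the Quickstart.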
Querying Logs
Basic Queries
-- Count logs by level
SELECT level, COUNT(*) as count
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
GROUP BY level;

-- Recent errors
SELECT timestamp, message
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
WHERE level = 'error'
ORDER BY timestamp DESC
LIMIT 10;

-- Time series
SELECT date, COUNT(*) as log_count
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
GROUP BY date
ORDER BY date DESC;
Working with JSON Logs
-- Extract fields from JSON messages
SELECT
  timestamp,
  level,
  json_extract_string(message, '$.service') as service,
  json_extract_string(message, '$.error_code') as error_code,
  message
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
WHERE level = 'error'
LIMIT 10;

-- Count by service
SELECT
  json_extract_string(message, '$.service') as service,
  COUNT(*) as count
FROM read_parquet('s3://your-bucket/logs/date=*/level=*/*', hive_partitioning=true)
WHERE message LIKE '{%'
GROUP BY service
ORDER BY count DESC;
Advanced Queries
See QUERY_GUIDE.md for:
- Partition pruning for faster queries
- Deduplication strategies
- Performance optimization
- Complex aggregations
Log Collection Methods
GELF Logging Driver (Recommended)
Use Docker's native GELF logging driver for seamless integration:
services:
  app:
    image: your-app:latest
    logging:
      driver: gelf
      options:
        gelf-address: "tcp://ingestor:12201"
        tag: "my-app"
Advantages:
- No additional containers
- No Docker socket access needed
- Native Docker integration
- TCP ensures reliable delivery
- Minimal overhead
See examples/docker/README.md for full examples.
How It Works
1. Parquet Storage
Columnar format with compression:
- Fast analytical queries
- Excellent compression ratios (3-4x)
- Efficient for time-series data
- Native support in DuckDB
2. Hive Partitioning
Logs partitioned by: date=YYYY-MM-DD/level=error/
Benefits:
- Query only relevant partitions
- Up to 99.9% fewer files scanned when filters match the partition keys
- Sub-second queries on millions of logs
- Optimized for time-based and level-based filtering
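Because the partition keys are part of the object path, a query can be limited to specific days and levels by narrowing the path pattern itself rather than scanning date=*/level=*. The sketch below generates one pattern per day and wraps them in a `read_parquet` call, which accepts a list of paths. The helper names are hypothetical, and the layout assumes the default PREFIX of logs.

```python
from datetime import date, timedelta


def partition_globs(bucket: str, start: date, end: date,
                    level: str = "*", prefix: str = "logs") -> list:
    """Return one S3 glob per day in [start, end] for the given level."""
    if end < start:
        raise ValueError("end date must not be before start date")
    days = (end - start).days + 1
    return [
        f"s3://{bucket}/{prefix}/date={(start + timedelta(days=i)).isoformat()}/level={level}/*"
        for i in range(days)
    ]


def read_parquet_sql(paths: list) -> str:
    """Render a DuckDB read_parquet(...) call over a list of path patterns."""
    quoted = ", ".join("'" + path.replace("'", "''") + "'" for path in paths)
    return f"read_parquet([{quoted}], hive_partitioning=true)"


globs = partition_globs("your-bucket", date(2024, 1, 15), date(2024, 1, 16), level="error")
query = f"SELECT timestamp, message FROM {read_parquet_sql(globs)} LIMIT 10;"
print(query)
```

The generated query lists only the objects under the two requested date partitions, so the number of files touched grows with the date range instead of the total retention period.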
3. Structured Logs
Optimized for JSON structured logs from modern frameworks:
- Next.js
- Rails
- Express
- FastAPI
- Any app outputting JSON logs
Documentation
- QUERY_GUIDE.md - Advanced querying and performance optimization
- harness/README.md - Development/Testing environment
Examples
- examples/standalone-docker/ - Simple Docker Compose with pre-built image
- examples/kubernetes/ - Ultra-simple Kubernetes DaemonSet example
Contributing
- Fork the repository
- Create a feature branch
- Use the harness for testing: cd harness && make docker-up
- Make changes and test
- Run tests: go test -v ./...
- Submit a pull request
Support
- Issues: https://github.com/amr8t/blobsearch/issues
- Discussions: https://github.com/amr8t/blobsearch/discussions
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.
This license requires that if you run a modified version of this software on a server and provide network access to users, you must make the modified source code available to those users.
Credits
Built with:
- Apache Parquet for columnar storage
- DuckDB for analytics
- MinIO for testing with S3-compatible storage