High-Performance Bioinformatics IO for Mojo - Zero-Copy to GPU
BlazeSeq is a high-throughput parser for biological sequence and interval data in Mojo. It combines SIMD-accelerated parsing, a unified reader layer, and GPU-ready data layouts to support production pipelines from local files to accelerated kernels.
BlazeSeq currently supports:
- FASTQ via FastqParser (zero-copy views, owned records, and SoA batches)
- FASTA via FastaParser (including multi-line sequence normalization)
- FAI via FaiParser (FASTA/FASTQ index rows)
- BED via BedParser (genomic interval records)
Project Goals
- Build one cohesive parsing stack for common genomics formats in Mojo.
- Keep throughput high by default with SIMD and low-allocation APIs.
- Bridge CPU parsing and GPU compute with explicit batch types and upload utilities.
- Offer ergonomic APIs for both exploratory scripting and production workflows.
Key Features
- Multi-format parsing: FASTQ, FASTA, FAI, and BED in a single library (support for other formats will follow).
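- Unified I/O layer: parsers are generic over the reader type, so the same parser code works with FileReader or RapidgzipReader.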
- Three unified access modes:
  - views() for zero-copy streaming
  - records() for owned records
  - batches() for Structure-of-Arrays GPU pipelines
- GPU-oriented data flow: FastqBatch plus device upload support for accelerated kernels.
- Parallel gzip decode: RapidgzipReader enables multithreaded .fastq.gz ingestion.
- Compile-time tuning: Toggle validation checks for speed/safety trade-offs.
- Python bindings (experimental): Wheel package for Python integration.
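To make the Structure-of-Arrays idea behind batches() more concrete, the following Python sketch packs several reads into one contiguous byte buffer plus an offsets array, which is the general shape of layout that uploads well to a GPU. It only illustrates the concept: pack_soa and read_at are hypothetical helper names, and the actual fields of FastqBatch are not described in this README.

```python
from array import array


def pack_soa(seqs):
    """Pack byte sequences into one buffer plus offsets (hypothetical helper).

    Read i occupies data[offsets[i]:offsets[i + 1]].
    """
    data = bytearray()
    offsets = array("q", [0])
    for s in seqs:
        data += s
        offsets.append(len(data))
    return bytes(data), offsets


def read_at(data, offsets, i):
    """Recover read i from the packed layout."""
    return data[offsets[i]:offsets[i + 1]]


data, offsets = pack_soa([b"ACGT", b"GG", b"TTTAA"])
second = read_at(data, offsets, 1)
```

Two flat arrays can be transferred in two bulk copies, whereas a list of per-record objects would need one transfer (or one pointer chase) per read.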
Quick Start
Install as a Mojo dependency (Pixi)
Add BlazeSeq to your pixi.toml:
    [dependencies]
    blazeseq = { git = "https://github.com/MoSafi2/BlazeSeq", branch = "main" }
Then install dependencies:
Python bindings (experimental)
Install from PyPI:
or:
Python usage details are documented in python/README.md.
Usage Examples
FASTQ: iterate owned records
    from blazeseq import FastqParser, FileReader
    from std.pathlib import Path

    def main() raises:
        var parser = FastqParser[FileReader](FileReader(Path("data.fastq")), "sanger")
        var reads = 0
        var bases = 0
        for record in parser.records():
            reads += 1
            bases += len(record)
        print(reads, bases)
FASTQ: maximum speed with validation off
    from blazeseq import FastqParser, FileReader
    from blazeseq.fastq import ParserConfig
    from std.pathlib import Path

    def main() raises:
        var parser = FastqParser[
            FileReader, ParserConfig(check_ascii=False, check_quality=False)
        ](FileReader(Path("data.fastq")), "generic")
        for view in parser.views():
            _ = len(view)
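As a rough illustration of the per-byte work that a quality check implies, and therefore what is skipped when validation is turned off, here is a small Python sketch that validates and decodes Phred+33 quality strings. The bounds (ASCII 33 to 126) are the conventional Sanger range and are an assumption here; the ranges BlazeSeq enforces for each schema name may differ, and decode_quality is a hypothetical helper.

```python
PHRED_OFFSET = 33  # Phred+33 (Sanger-style) encoding


def decode_quality(qual: bytes, lo: int = 33, hi: int = 126) -> list:
    """Check every quality byte against [lo, hi], then decode to Phred scores."""
    for b in qual:
        if not lo <= b <= hi:
            raise ValueError(f"quality byte {b} outside [{lo}, {hi}]")
    return [b - PHRED_OFFSET for b in qual]


scores = decode_quality(b"II5!")
```

Skipping such a check is reasonable for input that has already been validated upstream; for untrusted input, leaving validation on is the safer default.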
FASTA: parse multi-line records
    from blazeseq import FastaParser, FileReader
    from std.pathlib import Path

    def main() raises:
        var parser = FastaParser[FileReader](FileReader(Path("ref.fa")))
        for record in parser:
            print(record.id(), len(record))
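For readers unfamiliar with multi-line sequence normalization, this Python sketch shows the idea: wrapped sequence lines belonging to one record are joined into a single contiguous sequence. It is a concept illustration, not BlazeSeq's implementation; parse_fasta is a hypothetical name, and treating the first whitespace-delimited header token as the id is an assumption.

```python
def parse_fasta(lines):
    """Yield (id, sequence) pairs, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the id is the first token of the header line.
            parts = line[1:].split(None, 1)
            name, chunks = (parts[0] if parts else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


records = list(parse_fasta([">chr1 test\n", "ACGT\n", "GGCC\n", ">chr2\n", "TT\n"]))
```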
FAI: read FASTA/FASTQ index entries
    from blazeseq import FaiParser, FileReader
    from std.pathlib import Path

    def main() raises:
        var parser = FaiParser[FileReader](FileReader(Path("ref.fa.fai")))
        for rec in parser:
            print(rec.name(), rec.length(), rec.offset())
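Index rows are mainly useful for random access. As a sketch of how the fields combine, the Python below uses the standard faidx columns (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) to compute the byte position of a 0-based base within the FASTA file. The helper names are hypothetical, and the README only shows the name(), length(), and offset() accessors, so how BlazeSeq exposes the remaining columns is not assumed here.

```python
def parse_fai_line(line):
    """Split one tab-separated .fai row into its first five columns."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return name, int(length), int(offset), int(linebases), int(linewidth)


def byte_offset(offset, linebases, linewidth, pos):
    """Byte position in the FASTA file of 0-based base `pos`."""
    return offset + (pos // linebases) * linewidth + pos % linebases


fasta = b">chr1\nACGT\nGGCC\nTT\n"
name, length, offset, linebases, linewidth = parse_fai_line("chr1\t10\t6\t4\t5\n")
base = fasta[byte_offset(offset, linebases, linewidth, 5):][:1]
```

The integer division counts the full lines before the target base, and each of those lines costs linewidth bytes because of the newline.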
BED: stream genomic intervals
    from blazeseq import BedParser, FileReader
    from std.pathlib import Path

    def main() raises:
        var parser = BedParser[FileReader](FileReader(Path("regions.bed")))
        for interval in parser.views():
            print(interval.chrom(), interval.chrom_start, interval.chrom_end)
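When consuming chrom_start and chrom_end, it helps to remember that BED coordinates are 0-based and half-open: the start is inclusive and the end is exclusive. The Python sketch below shows what that means for interval length and overlap tests. Interval, parse_bed_line, and overlaps are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def parse_bed_line(line):
    """Parse the three required BED columns; extra columns are ignored."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return Interval(chrom, int(start), int(end))


def overlaps(a, b):
    """Half-open overlap: intervals that merely touch do not overlap."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


a = parse_bed_line("chr1\t100\t200\tfeature\n")
length = a.end - a.start  # no +1 needed with half-open coordinates
```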
Gzip FASTQ with parallel decoding
    from blazeseq import FastqParser, RapidgzipReader
    from std.pathlib import Path

    def main() raises:
        var reader = RapidgzipReader(Path("data.fastq.gz"), parallelism=4)
        var parser = FastqParser[RapidgzipReader](reader^, "illumina_1.8")
        for record in parser.records():
            _ = record.id()
GPU alignment example
Run the end-to-end GPU Needleman-Wunsch example:
pixi run mojo run examples/nw_gpu/main.mojo
FASTQ Access Modes and Trade-offs
| API | Return Type | Copies Data? | Best For |
|---|---|---|---|
| next_view() / views() | FastqView | No | Streaming transforms and filtering where data is consumed immediately |
| next_record() / records() | FastqRecord | Yes | General scripting and in-memory storage |
| next_batch() / batches() | FastqBatch | Yes | GPU and parallel batch compute |
Important: FastqView spans remain valid only until the parser advances to its next operation. Copy out any data that must outlive that point, or use records() instead.
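The hazard with zero-copy views can be mimicked in plain Python with a memoryview over a reused buffer. This is only an analogy under the assumption that a view aliases a buffer the parser later refills; it does not use any BlazeSeq API, and Mojo's own lifetime rules may catch some of these mistakes at compile time.

```python
def demo():
    buf = bytearray(b"ACGT")     # stands in for the parser's internal buffer
    view = memoryview(buf)[0:4]  # zero-copy: aliases buf, no bytes copied
    owned = bytes(view)          # owned copy, independent of buf
    buf[0:4] = b"TTTT"           # "parser advances" and refills the buffer
    # The view now reflects the new contents; the owned copy is unchanged.
    return bytes(view), owned


after_advance, kept = demo()
```

The view is cheap but shows whatever the buffer currently holds, while the copy costs an allocation and stays stable. That is the same trade-off as views() versus records().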
Benchmarks
See benchmark/README.md for benchmark commands and comparisons against other sequence parsers.
Documentation
- API docs: https://mosafi2.github.io/BlazeSeq/
- Examples: examples/
- Benchmark scripts: benchmark/
Limitations
- FASTQ parser expects standard four-line records (multi-line FASTQ is not supported).
- Paired-end specific APIs are not yet implemented.
- Parsers are stream-oriented; random seek/index-aware scanning is limited to indexed formats.
- Python package is currently wheel-only.
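To clarify what the four-line requirement in the first limitation means in practice, here is a minimal Python sketch of a strict four-line FASTQ reader. It rejects records whose sequence is wrapped across several lines. This is an illustration of the format constraint only, with a hypothetical read_fastq name; BlazeSeq's own error behaviour on such input is not documented here.

```python
def read_fastq(lines):
    """Yield (id, seq, qual) from strictly four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            seq = next(it).rstrip("\r\n")
            plus = next(it).rstrip("\r\n")
            qual = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


reads = list(read_fastq(["@r1\n", "ACGT\n", "+\n", "IIII\n"]))
```

A wrapped record such as `@r1 / AC / GT / + / IIII` fails the separator check because the third line is sequence data rather than a `+` line.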
Testing
Run all Mojo tests:
The test corpus, derived from the Biopython project, covers valid and invalid edge cases across the FASTQ, FASTA, FAI, and BED parsers and the I/O layer.
Project History
BlazeSeq began as a rewrite of MojoFastTrim and has since expanded into a broader parsing and compute-ready genomics toolkit with a unified architecture.
Acknowledgements
The FASTQ parser design is inspired by needletail, adapted and optimized for Mojo's SIMD-oriented programming model.
License
This project is licensed under the MIT License.
