Smash++

Smash++ is a fast utility for identifying and visualizing rearrangements in DNA sequences.

Installation

Smash++ requires CMake 4.0.0 or newer and a compiler with C++20 support.

Conda

conda install -y bioconda::smashpp

Docker

docker pull smortezah/smashpp
docker run -it smortezah/smashpp

Build From Source

git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh

By default, install.sh builds in ./build and installs smashpp, smashpp-inv-rep, and exclude_N into ./dist/bin.

You can customize the build with environment variables:

PREFIX=/your/path BUILD_TYPE=Debug PARALLEL=16 bash install.sh

Ubuntu

apt update && apt install -y git g++ python3-pip
pip3 install --user "cmake~=4.0.0"

git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh

macOS

brew install git python
pip3 install --user "cmake~=4.0.0"

git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh

Windows

Install Visual Studio 2022 Build Tools with the Desktop C++ workload, plus Python 3.

py -m pip install --user "cmake~=4.0.0"
git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
powershell -ExecutionPolicy Bypass -File .\install.ps1

The PowerShell installer supports the same knobs as the shell script, for example:

powershell -ExecutionPolicy Bypass -File .\install.ps1 -BuildType Debug -Prefix .\dist

Usage

If you used the default source install, run the binaries from ./dist/bin.

./dist/bin/smashpp [OPTIONS] -r <REF_FILE> -t <TAR_FILE>
./dist/bin/smashpp viz [OPTIONS] -o <SVG_FILE> <POS_FILE>

For best results, keep the reference and target filenames short.

Smash++ Options

Use smashpp --help to print the full CLI help.

Option	Value	Description	Default
`-r`, `--reference`	`<FILE>`	Reference file in `seq`, `FASTA`, or `FASTQ` format.	Required
`-t`, `--target`	`<FILE>`	Target file in `seq`, `FASTA`, or `FASTQ` format.	Required
`-l`, `--level`	`<INT>`	Compression level from `0` to `6`.	`3`
`-m`, `--min-segment-size`	`<INT>`	Minimum segment size.	`50`
`-fmt`, `--format`	`<STRING>`	Output format: `pos` or `json`.	`pos`
`-e`, `--entropy-N`	`<FLOAT>`	Entropy assigned to `N` bases.	`2.0`
`-n`, `--num-threads`	`<INT>`	Number of worker threads.	`4`
`-mem`, `--max-memory`	`<SIZE>`	Maximum estimated memory use. Supports `B`, `K`, `M`, `G`, and `T` suffixes; `0` disables the check.	Auto
`-f`, `--filter-size`	`<INT>`	Filter window size.	`100`
`-ft`, `--filter-type`	`<INT/STRING>`	Window function: `0/rectangular`, `1/hamming`, `2/hann`, `3/blackman`, `4/triangular`, `5/welch`, `6/sine`, `7/nuttall`.	`hann`
`-fs`, `--filter-scale`	`<STRING>`	Filter scale: `S/small`, `M/medium`, or `L/large`.	Auto
`-d`, `--sampling-step`	`<INT>`	Sampling step.	Auto
`--approx-sampled-models`	`-`	Use faster approximate updates between sampled positions in multi-model runs.	Disabled
`-th`, `--threshold`	`<FLOAT>`	Segmentation threshold.	`1.5`
`-rb`, `--reference-begin-guard`	`<INT>`	Reference begin guard.	`0`
`-re`, `--reference-end-guard`	`<INT>`	Reference end guard.	`0`
`-tb`, `--target-begin-guard`	`<INT>`	Target begin guard.	`0`
`-te`, `--target-end-guard`	`<INT>`	Target end guard.	`0`
`-ar`, `--asymmetric-regions`	`-`	Consider asymmetric regions.	Disabled
`-nr`, `--no-self-complexity`	`-`	Skip self-complexity computation.	Disabled
`-sb`, `--save-sequence`	`-`	Keep temporary `.seq` files generated from FASTA/FASTQ input.	Disabled
`-sp`, `--save-profile`	`-`	Save profile output.	Disabled
`-sf`, `--save-filtered`	`-`	Save filtered output.	Disabled
`-ss`, `--save-segmented`	`-`	Save extracted segment files.	Disabled
`-sa`, `--save-profile-filtered-segmented`	`-`	Save profile, filtered, and segmented outputs.	Disabled
`-rm`, `--reference-model`	`<STRING>`	Custom reference model chain.	Auto from `--level`
`-tm`, `--target-model`	`<STRING>`	Custom target model chain.	Auto from `--level`
`-ll`, `--list-levels`	`-`	Print the built-in compression levels.	-
`-h`, `--help`	`-`	Show the help message.	-
`-v`, `--verbose`	`-`	Print detailed progress information.	Disabled
`-V`, `--version`	`-`	Show the program version.	-

Model Parameter Fields

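Custom model strings use the form k,[w,d,]ir,a,g/t,ir,a,g:... where the bracketed w,d pair is optional (sketch-based models), the fields after / configure a tolerant model, and : separates the models in a chain.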

Field	Meaning
`k`	Context size.
`w`	Sketch width given in log2 form, for example `10` means $2^{10} = 1024$.
`d`	Sketch depth.
`ir`	Inverted-repeat mode: `0` regular, `1` inverted only, `2` regular plus inverted.
`a`	Estimator.
`g`	Forgetting factor in the range `0.0` to `1.0`.
`t`	Threshold for the number of substitutions in a tolerant model.
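
To make the field positions concrete, here is a minimal sketch that labels each field of a model chain by position. It assumes that : separates models and that the part after / holds the tolerant-model fields, which is a reading of the form shown in this section rather than a statement about the Smash++ parser. describe_models is a hypothetical helper, not part of Smash++, and it does not validate values.

```shell
#!/usr/bin/env bash
# describe_models CHAIN: hypothetical helper, not part of Smash++.
# Labels each field by position, following the field table in this section.
describe_models() {
  local chain=$1 model main
  local -a models f t
  IFS=':' read -r -a models <<< "$chain"          # assumed: ":" separates models
  for model in "${models[@]}"; do
    main=${model%%/*}                             # part before any "/"
    IFS=',' read -r -a f <<< "$main"
    case ${#f[@]} in
      4) echo "k=${f[0]} ir=${f[1]} a=${f[2]} g=${f[3]}" ;;
      6) echo "k=${f[0]} w=${f[1]} (width $((1 << ${f[1]}))) d=${f[2]} ir=${f[3]} a=${f[4]} g=${f[5]}" ;;
      *) echo "unexpected field count in '$main'" >&2; return 1 ;;
    esac
    if [[ $model == */* ]]; then                  # optional tolerant part
      IFS=',' read -r -a t <<< "${model#*/}"
      echo "  tolerant: t=${t[0]} ir=${t[1]} a=${t[2]} g=${t[3]}"
    fi
  done
}

describe_models "20,10,5,0,0.002,0.95"
# prints: k=20 w=10 (width 1024) d=5 ir=0 a=0.002 g=0.95
```

The width is printed as 2 raised to w, matching the log2 convention in the table.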

Output Compatibility

Smash++ output is deterministic for the same executable, options, input files, and platform. Profile files saved with -sp or -sa still serialize entropy values using the profile precision shown by the program, but filtering and segmentation use full-precision entropy internally.

Because of that, .fil, .pos, and .json output may differ slightly from older Smash++ releases in the final decimal places or in threshold-adjacent segment boundaries. These differences are deterministic and come from avoiding an older round-to-text-and-parse-back step in the compression hot path.

--approx-sampled-models is opt-in. It speeds up sampled multi-model runs by updating only contexts between sampled positions, so its .prf, .fil, .pos, and .json output should be treated as an approximate mode rather than byte-compatible output with the default model update path.
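
When checking output against an older release, an exact diff will flag the last-digit differences described above. The following is a minimal sketch of a tolerance-based comparison. It assumes the files being compared are whitespace-separated text with the same layout, which has not been checked against the actual .pos or .fil formats. approx_same is a hypothetical helper, not part of Smash++, and the default tolerance is an arbitrary choice.

```shell
#!/usr/bin/env bash
# approx_same OLD NEW [TOL]: hypothetical helper, not part of Smash++.
# Succeeds when both files have the same number of lines and fields, every
# numeric field differs by at most TOL, and every other field matches exactly.
approx_same() {
  local old=$1 new=$2 tol=${3:-1e-6}
  awk -v tol="$tol" '
    function isnum(s) { return s ~ /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ }
    NR == FNR { line[FNR] = $0; n = FNR; next }   # remember lines of OLD
    {
      m = FNR
      k = split(line[FNR], a)
      if (k != NF) { bad = 1; exit }              # field count changed
      for (i = 1; i <= NF; i++) {
        if (isnum(a[i]) && isnum($i)) {
          d = a[i] - $i
          if (d < 0) d = -d
          if (d > tol) { bad = 1; exit }          # numeric drift too large
        } else if (a[i] != $i) { bad = 1; exit }  # text fields must match
      }
    }
    END { if (bad || n + 0 != m + 0) exit 1 }     # also catch line-count changes
  ' "$old" "$new"
}

# Example (file names are placeholders):
# approx_same old/ref.tar.pos new/ref.tar.pos 1e-6 && echo "within tolerance"
```

A shifted segment boundary is an integer change larger than any small tolerance, so this check reports it as a difference, which is usually what you want to review by hand.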

Troubleshooting zero segments

If Smash++ finishes with 0 segments in both regular and inverted modes, it still writes an empty output file. For chromosome-scale or more divergent genome comparisons, the first tuning knobs to try are:

increase -th / --threshold
reduce -m / --min-segment-size
use -fs L / --filter-scale L for broader smoothing
lower -d / --sampling-step for finer resolution

See the Large and eukaryotic genomes section below for additional guidance on multi-gigabyte inputs.

Large and eukaryotic genomes

Smash++ was originally benchmarked on viral and bacterial genomes (kilobytes to low megabytes). When comparing large eukaryotic assemblies — for example human vs. chimpanzee — the automatic sampling step grows proportionally to file size in bytes (ceil(min(ref_bytes, tar_bytes) / 5000)), which can reduce resolution to the point where no segments survive filtering and thresholding.
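
To see how quickly the automatic step grows, the formula above can be evaluated directly. This is a small sketch of that arithmetic only; auto_sampling_step is a hypothetical helper, not part of Smash++, and the byte counts are made-up inputs.

```shell
#!/usr/bin/env bash
# auto_sampling_step REF_BYTES TAR_BYTES: hypothetical helper, not part of Smash++.
# Evaluates ceil(min(ref_bytes, tar_bytes) / 5000) with integer arithmetic.
auto_sampling_step() {
  local ref_bytes=$1 tar_bytes=$2
  local smaller=$(( ref_bytes < tar_bytes ? ref_bytes : tar_bytes ))
  echo $(( (smaller + 4999) / 5000 ))   # ceiling division by 5000
}

auto_sampling_step 50000 80000          # prints 10
auto_sampling_step 250000000 240000000  # prints 48000
```

For real inputs you could pass file sizes, for example auto_sampling_step "$(wc -c < ref.fa)" "$(wc -c < tar.fa)", assuming the byte count of the input file is what the formula uses.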

Recommended workflow:

Compare individual chromosomes rather than whole-genome FASTA files. Concatenated multi-chromosome files add cross-chromosome noise and inflate the auto-sampling step:
```
# Extract chr1 from each assembly, then compare
smashpp -r human_chr1.fa -t chimp_chr1.fa
smashpp viz -o chr1_map.svg human_chr1.fa.chimp_chr1.fa.pos
```
Lower the sampling step for multi-megabyte or gigabyte inputs so that the profile retains enough resolution:
```
smashpp -r ref_chr.fa -t tar_chr.fa -d 50
```
Raise the segmentation threshold — eukaryotic genomes contain more repetitive and divergent background, so a threshold of 1.5 (the default) may be too strict:
```
smashpp -r ref_chr.fa -t tar_chr.fa -th 2.5
```
Use sketch-based models with explicit width for memory-efficient processing of large chromosomes. The 6-field model format k,w,d,ir,a,g lets you control the sketch size:
```
smashpp -r ref_chr.fa -t tar_chr.fa \
    -rm "20,10,5,0,0.002,0.95" \
    -tm "20,10,5,0,0.002,0.95"
```
Here w=10 means a sketch width of 2^10 = 1024 buckets with depth d=5.
A practical starting point for chromosome-to-chromosome eukaryotic comparison:
```
smashpp -r human_chr1.fa -t chimp_chr1.fa \
    -l 0 -m 500 -th 2.5 -fs L -d 50 -n 8
```
Adjust -th and -m based on the expected divergence between the species.
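
The first step of the workflow assumes you already have one FASTA file per chromosome. If you start from a multi-record assembly, a minimal sketch for pulling out a single record with awk is shown below. extract_record is a hypothetical helper, not part of Smash++, and the record and file names are placeholders; it assumes the record name is the header text up to the first whitespace.

```shell
#!/usr/bin/env bash
# extract_record NAME FASTA: hypothetical helper, not part of Smash++.
# Prints the record whose header is ">NAME", optionally followed by a
# description after whitespace.
extract_record() {
  local name=$1 fasta=$2
  awk -v name="$name" '
    /^>/ { keep = (substr($1, 2) == name) }   # decide at each header line
    keep                                      # print lines of the kept record
  ' "$fasta"
}

# Example (placeholder names):
# extract_record chr1 human.fa > human_chr1.fa
# extract_record chr1 chimp.fa > chimp_chr1.fa
```

The comparison is an exact match on the name, so chr1 does not also select chr10.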

Visualizer Options

Use smashpp viz --help to print the full CLI help.

Option	Value	Description	Default
`<POS_FILE>`	File	Position file generated by Smash++ in `.pos` or `.json` format.	Required
`-o`, `--output`	`<SVG_FILE>`	Output SVG path.	`map.svg`
`-rn`, `--reference-name`	`<STRING>`	Override the displayed reference label.	Header value
`-tn`, `--target-name`	`<STRING>`	Override the displayed target label.	Header value
`-l`, `--link`	`<INT>`	Link style between the two maps.	`1`
`-c`, `--color`	`<INT>`	Color mode: `0` or `1`.	`0`
`-p`, `--opacity`	`<FLOAT>`	Connector opacity.	`0.9`
`-w`, `--width`	`<INT>`	Sequence bar width.	`10`
`-s`, `--space`	`<INT>`	Space between sequences.	`40`
`-tc`, `--total-colors`	`<INT>`	Total number of colors to use.	Auto
`-rt`, `--reference-tick`	`<INT>`	Reference tick spacing.	Auto
`-tt`, `--target-tick`	`<INT>`	Target tick spacing.	Auto
`-th`, `--tick-human-readable`	`<INT>`	Human-readable tick labels: `0` false, `1` true.	`1`
`-m`, `--min-block-size`	`<INT>`	Minimum block size to display.	`1`
`-vv`, `--vertical-view`	`-`	Render a vertical layout.	Disabled
`-nrr`, `--no-relative-redundancy`	`-`	Hide relative redundancy coloring.	Disabled
`-nr`, `--no-redundancy`	`-`	Hide redundancy coloring.	Disabled
`-ni`, `--no-inverted`	`-`	Hide inverted matches.	Disabled
`-ng`, `--no-regular`	`-`	Hide regular matches.	Disabled
`-n`, `--show-N`	`-`	Highlight `N` bases.	Disabled
`-stat`, `--statistics`	`-`	Save statistics to CSV.	`stat.csv`
`-h`, `--help`	`-`	Show the help message.	-
`-v`, `--verbose`	`-`	Print detailed plotting information.	Disabled
`-V`, `--version`	`-`	Show the program version.	-

Example

After running the default installer, the example workflow looks like this:

cd example
../dist/bin/smashpp -r ref -t tar
../dist/bin/smashpp viz -o example.svg ref.tar.pos

JSON output is available too:

cd example
../dist/bin/smashpp --reference ref --target tar --format json
../dist/bin/smashpp viz --output example.svg ref.tar.json

If smashpp is already on your PATH, you can drop the ../dist/bin/ prefix.
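
One way to do that for the current shell session, assuming the default ./dist/bin install prefix and that you run it from the repository root:

```shell
# Prepend the default install directory to PATH for this shell session only.
# Assumes the current directory is the smashpp repository root.
export PATH="$PWD/dist/bin:$PATH"
```

If you installed with a custom PREFIX, substitute that prefix's bin directory.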

Testing and Benchmarks

After configuring and building from source, run the regression suite with:

ctest --test-dir build --output-on-failure

To make warnings fail the build in local development or CI, configure with:

cmake -S . -B build -DSMASHPP_STRICT_WARNINGS=ON

The repository also includes CMake presets for common maintainer workflows:

cmake --preset strict
cmake --build --preset strict
ctest --preset strict

Focused test labels are available for narrower checks, for example:

ctest --preset strict -L compatibility
ctest --preset strict -L packaging
ctest --preset benchmark-smoke

For local performance checks, run the benchmark target:

cmake --build build --target smashpp-benchmark

To compare against another executable, configure with:

cmake -S . -B build -DSMASHPP_BENCHMARK_BASELINE=/path/to/other/smashpp
cmake --build build --target smashpp-benchmark

The benchmark generates deterministic small and large inputs and writes timing rows to build/benchmarks/summary.csv. When a baseline executable is configured, it also writes build/benchmarks/comparison.csv with median timings and speedups for each scenario. The default large benchmark input is 256 MiB per file. Override the generated input sizes with byte counts when you need a shorter smoke run or a larger production check:

cmake -S . -B build \
  -DSMASHPP_BENCHMARK_SMALL_BYTES=131072 \
  -DSMASHPP_BENCHMARK_LARGE_BYTES=268435456

Use the same compiler, build type, input sizes, and machine when comparing results.

To create portable release archives from the install rules, run:

cmake --build build --target package

The archives are written to build/packages/.

Cite

If you find Smash++ useful in your research, please acknowledge our work by citing:

M. Hosseini, D. Pratas, B. Morgenstern, A.J. Pinho, "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements," GigaScience, vol. 9, no. 5, 2020. DOI: 10.1093/gigascience/giaa048

Issues

If you encounter an issue, please let us know.

Contributing

Development workflow, testing, benchmarking, and pull request guidance are in CONTRIBUTING.md.

License

Smash++ is distributed under the GNU GPL v3 license.