GitHub - smortezah/smashpp: Find and visualize rearrangements in DNA sequences

Smash++

Smash++ is a fast utility for identifying and visualizing rearrangements in DNA sequences.

Installation

Smash++ requires CMake 4.0.0 or newer and a compiler with C++20 support.
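
To check the CMake requirement before building, a small shell helper along these lines may be handy. The name version_ge is hypothetical and used only in this sketch; it compares the major, minor, and patch numbers and ignores any suffix such as -rc1.

```shell
# version_ge HAVE WANT
# Succeeds (exit status 0) when dotted version HAVE is >= WANT.
# Only major, minor, and patch are compared; missing parts count as 0.
version_ge() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    split(have, h, ".")
    split(want, w, ".")
    for (i = 1; i <= 3; i++) {
      # Adding 0 forces a numeric comparison ("0-rc1" becomes 0).
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

# Usage (left commented out so the helper can be sourced safely).
# The first line of `cmake --version` reads "cmake version X.Y.Z".
# cmake_version=$(cmake --version | awk 'NR == 1 { print $3 }')
# version_ge "$cmake_version" 4.0.0 || echo "CMake 4.0.0 or newer is required" >&2
```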

Conda

conda install -y bioconda::smashpp

Docker

docker pull smortezah/smashpp
docker run -it smortezah/smashpp

Build From Source

git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh

By default, install.sh builds in ./build and installs smashpp, smashpp-inv-rep, and exclude_N into ./dist/bin.

You can customize the build with environment variables:

PREFIX=/your/path BUILD_TYPE=Debug PARALLEL=16 bash install.sh

Ubuntu

apt update && apt install -y git g++ python3-pip
pip3 install --user "cmake~=4.0.0"

git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh

macOS

brew install git python
pip3 install --user "cmake~=4.0.0"

git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh

Windows

Install Visual Studio 2022 Build Tools with the Desktop C++ workload, plus Python 3.

py -m pip install --user "cmake~=4.0.0"
git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
powershell -ExecutionPolicy Bypass -File .\install.ps1

The PowerShell installer supports the same knobs as the shell script, for example:

powershell -ExecutionPolicy Bypass -File .\install.ps1 -BuildType Debug -Prefix .\dist

Usage

If you used the default source install, run the binaries from ./dist/bin.

./dist/bin/smashpp [OPTIONS] -r <REF_FILE> -t <TAR_FILE>
./dist/bin/smashpp viz [OPTIONS] -o <SVG_FILE> <POS_FILE>

For best results, keep the reference and target filenames short: output files are named after both inputs, for example ref.tar.pos for the inputs ref and tar.

Smash++ Options

Use smashpp --help to print the full CLI help.

| Option | Value | Description | Default |
|---|---|---|---|
| -r, --reference | <FILE> | Reference file in seq, FASTA, or FASTQ format. | Required |
| -t, --target | <FILE> | Target file in seq, FASTA, or FASTQ format. | Required |
| -l, --level | <INT> | Compression level from 0 to 6. | 3 |
| -m, --min-segment-size | <INT> | Minimum segment size. | 50 |
| -fmt, --format | <STRING> | Output format: pos or json. | pos |
| -e, --entropy-N | <FLOAT> | Entropy assigned to N bases. | 2.0 |
| -n, --num-threads | <INT> | Number of worker threads. | 4 |
| -mem, --max-memory | <SIZE> | Maximum estimated memory use. Supports B, K, M, G, and T suffixes; 0 disables the check. | Auto |
| -f, --filter-size | <INT> | Filter window size. | 100 |
| -ft, --filter-type | <INT/STRING> | Window function: 0/rectangular, 1/hamming, 2/hann, 3/blackman, 4/triangular, 5/welch, 6/sine, 7/nuttall. | hann |
| -fs, --filter-scale | <STRING> | Filter scale: S/small, M/medium, or L/large. | Auto |
| -d, --sampling-step | <INT> | Sampling step. | Auto |
| --approx-sampled-models | - | Use faster approximate updates between sampled positions in multi-model runs. | Disabled |
| -th, --threshold | <FLOAT> | Segmentation threshold. | 1.5 |
| -rb, --reference-begin-guard | <INT> | Reference begin guard. | 0 |
| -re, --reference-end-guard | <INT> | Reference end guard. | 0 |
| -tb, --target-begin-guard | <INT> | Target begin guard. | 0 |
| -te, --target-end-guard | <INT> | Target end guard. | 0 |
| -ar, --asymmetric-regions | - | Consider asymmetric regions. | Disabled |
| -nr, --no-self-complexity | - | Skip self-complexity computation. | Disabled |
| -sb, --save-sequence | - | Keep temporary .seq files generated from FASTA/FASTQ input. | Disabled |
| -sp, --save-profile | - | Save profile output. | Disabled |
| -sf, --save-filtered | - | Save filtered output. | Disabled |
| -ss, --save-segmented | - | Save extracted segment files. | Disabled |
| -sa, --save-profile-filtered-segmented | - | Save profile, filtered, and segmented outputs. | Disabled |
| -rm, --reference-model | <STRING> | Custom reference model chain. | Auto from --level |
| -tm, --target-model | <STRING> | Custom target model chain. | Auto from --level |
| -ll, --list-levels | - | Print the built-in compression levels. | - |
| -h, --help | - | Show the help message. | - |
| -v, --verbose | - | Print detailed progress information. | Disabled |
| -V, --version | - | Show the program version. | - |

Model Parameter Fields

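Custom model strings use the form k,[w,d,]ir,a,g/t,ir,a,g:... where the bracketed sketch fields w,d are optional, the fields after / configure a tolerant model, and : separates consecutive models in the chain.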

| Field | Meaning |
|---|---|
| k | Context size. |
| w | Sketch width given in log2 form, for example 10 means $2^{10} = 1024$. |
| d | Sketch depth. |
| ir | Inverted-repeat mode: 0 regular, 1 inverted only, 2 regular plus inverted. |
| a | Estimator. |
| g | Forgetting factor in the range 0.0 to 1.0. |
| t | Threshold for the number of substitutions in a tolerant model. |
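
To make the w and d fields concrete, the sketch below assembles a six-field model string and estimates how many counters a sketch of that shape would hold. Both helper names are hypothetical. The width times depth counter count is an assumption based on how count-min style sketches are usually laid out, not a statement about Smash++ internals.

```shell
# sketch_model K W D IR A G
# Prints the six fields joined with commas, in the order k,w,d,ir,a,g.
sketch_model() {
  printf '%s,%s,%s,%s,%s,%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# sketch_cells W D
# Prints 2^W * D: the assumed number of counters for a sketch whose
# width is given in log2 form (W) and whose depth is D.
sketch_cells() {
  echo $(( (1 << $1) * $2 ))
}

# Usage (commented out; smashpp itself is not invoked here):
# model=$(sketch_model 20 10 5 0 0.002 0.95)
# smashpp -r ref -t tar -rm "$model" -tm "$model"
```

With w=10 and d=5, sketch_cells 10 5 prints 5120, that is 1024 buckets in each of 5 rows.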

Output Compatibility

Smash++ output is deterministic for the same executable, options, input files, and platform. Profile files saved with -sp or -sa still serialize entropy values using the profile precision shown by the program, but filtering and segmentation use full-precision entropy internally.

Because of that, .fil, .pos, and .json output may differ slightly from older Smash++ releases in the final decimal places or in threshold-adjacent segment boundaries. These differences are deterministic and come from avoiding an older round-to-text-and-parse-back step in the compression hot path.

--approx-sampled-models is opt-in. It speeds up sampled multi-model runs by updating only contexts between sampled positions, so its .prf, .fil, .pos, and .json output should be treated as an approximate mode rather than byte-compatible output with the default model update path.

Troubleshooting zero segments

If Smash++ finishes with 0 segments in both regular and inverted modes, it still writes an empty output file. For chromosome-scale or more divergent genome comparisons, the first tuning knobs to try are:

  • increase -th / --threshold
  • reduce -m / --min-segment-size
  • use -fs L / --filter-scale L for broader smoothing
  • lower -d / --sampling-step for finer resolution

See the Large and eukaryotic genomes section below for additional guidance on multi-gigabyte inputs.

Large and eukaryotic genomes

Smash++ was originally benchmarked on viral and bacterial genomes (kilobytes to low megabytes). When comparing large eukaryotic assemblies — for example human vs. chimpanzee — the automatic sampling step grows proportionally to file size in bytes (ceil(min(ref_bytes, tar_bytes) / 5000)), which can reduce resolution to the point where no segments survive filtering and thresholding.
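
As a rough illustration, the shell function below reproduces ceil(min(ref_bytes, tar_bytes) / 5000) with integer arithmetic. The name auto_sampling_step is made up for this sketch and is not part of Smash++. It only mirrors the formula quoted above, so treat the printed value as an estimate of what the tool would choose.

```shell
# auto_sampling_step REF_BYTES TAR_BYTES
# Prints ceil(min(REF_BYTES, TAR_BYTES) / 5000).
auto_sampling_step() {
  smaller=$1
  if [ "$2" -lt "$smaller" ]; then
    smaller=$2
  fi
  # Adding (divisor - 1) before integer division rounds up.
  echo $(( ($smaller + 4999) / 5000 ))
}

# Usage (commented out; assumes both FASTA files exist):
# auto_sampling_step $(wc -c < human_chr1.fa) $(wc -c < chimp_chr1.fa)
```

For two inputs of about 250 MB each this prints 50000, a step roughly a thousand times coarser than an explicit -d 50.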

Recommended workflow:

  1. Compare individual chromosomes rather than whole-genome FASTA files. Concatenated multi-chromosome files add cross-chromosome noise and inflate the auto-sampling step:

    # Extract chr1 from each assembly, then compare
    smashpp -r human_chr1.fa -t chimp_chr1.fa
    smashpp viz -o chr1_map.svg human_chr1.fa.chimp_chr1.fa.pos

  2. Lower the sampling step for multi-megabyte or gigabyte inputs so that the profile retains enough resolution:

    smashpp -r ref_chr.fa -t tar_chr.fa -d 50

  3. Raise the segmentation threshold. Eukaryotic genomes contain more repetitive and divergent background, so the default threshold of 1.5 may be too strict:

    smashpp -r ref_chr.fa -t tar_chr.fa -th 2.5

  4. Use sketch-based models with explicit width for memory-efficient processing of large chromosomes. The 6-field model format k,w,d,ir,a,g lets you control the sketch size:

    smashpp -r ref_chr.fa -t tar_chr.fa \
        -rm "20,10,5,0,0.002,0.95" \
        -tm "20,10,5,0,0.002,0.95"

    Here w=10 means a sketch width of 2^10 = 1024 buckets with depth d=5.

  5. A practical starting point for chromosome-to-chromosome eukaryotic comparison:

    smashpp -r human_chr1.fa -t chimp_chr1.fa \
        -l 0 -m 500 -th 2.5 -fs L -d 50 -n 8

    Adjust -th and -m based on the expected divergence between the species.
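
The workflow above can be scripted per chromosome. The following is a minimal dry-run sketch: it only prints the commands so they can be reviewed first. It assumes per-chromosome FASTA files named <prefix>_<chr>.fa already exist, and it derives the position file name from the <REF_FILE>.<TAR_FILE>.pos pattern seen in the examples. The function name chr_commands and the file layout are illustrative, not something Smash++ provides.

```shell
# chr_commands REF_PREFIX TAR_PREFIX CHR...
# Prints one comparison command and one visualization command per chromosome.
chr_commands() {
  ref_prefix=$1
  tar_prefix=$2
  shift 2
  for chr in "$@"; do
    ref="${ref_prefix}_${chr}.fa"
    tar="${tar_prefix}_${chr}.fa"
    # Same starting parameters as step 5 above.
    echo "smashpp -r $ref -t $tar -l 0 -m 500 -th 2.5 -fs L -d 50 -n 8"
    # Position file name follows the <REF_FILE>.<TAR_FILE>.pos pattern.
    echo "smashpp viz -o ${chr}_map.svg $ref.$tar.pos"
  done
}

# Usage (commented out): review the output, then pipe it to sh to run it.
# chr_commands human chimp chr1 chr3
# chr_commands human chimp chr1 chr3 | sh
```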

Visualizer Options

Use smashpp viz --help to print the full CLI help.

| Option | Value | Description | Default |
|---|---|---|---|
| <POS_FILE> | File | Position file generated by Smash++ in *.pos or *.json format. | Required |
| -o, --output | <SVG_FILE> | Output SVG path. | map.svg |
| -rn, --reference-name | <STRING> | Override the displayed reference label. | Header value |
| -tn, --target-name | <STRING> | Override the displayed target label. | Header value |
| -l, --link | <INT> | Link style between the two maps. | 1 |
| -c, --color | <INT> | Color mode: 0 or 1. | 0 |
| -p, --opacity | <FLOAT> | Connector opacity. | 0.9 |
| -w, --width | <INT> | Sequence bar width. | 10 |
| -s, --space | <INT> | Space between sequences. | 40 |
| -tc, --total-colors | <INT> | Total number of colors to use. | Auto |
| -rt, --reference-tick | <INT> | Reference tick spacing. | Auto |
| -tt, --target-tick | <INT> | Target tick spacing. | Auto |
| -th, --tick-human-readable | <INT> | Human-readable tick labels: 0 false, 1 true. | 1 |
| -m, --min-block-size | <INT> | Minimum block size to display. | 1 |
| -vv, --vertical-view | - | Render a vertical layout. | Disabled |
| -nrr, --no-relative-redundancy | - | Hide relative redundancy coloring. | Disabled |
| -nr, --no-redundancy | - | Hide redundancy coloring. | Disabled |
| -ni, --no-inverted | - | Hide inverted matches. | Disabled |
| -ng, --no-regular | - | Hide regular matches. | Disabled |
| -n, --show-N | - | Highlight N bases. | Disabled |
| -stat, --statistics | - | Save statistics to CSV. | stat.csv |
| -h, --help | - | Show the help message. | - |
| -v, --verbose | - | Print detailed plotting information. | Disabled |
| -V, --version | - | Show the program version. | - |

Example

After running the default installer, the example workflow looks like this:

cd example
../dist/bin/smashpp -r ref -t tar
../dist/bin/smashpp viz -o example.svg ref.tar.pos

JSON output is available too:

cd example
../dist/bin/smashpp --reference ref --target tar --format json
../dist/bin/smashpp viz --output example.svg ref.tar.json

If smashpp is already on your PATH, you can drop the ../dist/bin/ prefix.

Testing and Benchmarks

After configuring and building from source, run the regression suite with:

ctest --test-dir build --output-on-failure

To make warnings fail the build in local development or CI, configure with:

cmake -S . -B build -DSMASHPP_STRICT_WARNINGS=ON

The repository also includes CMake presets for common maintainer workflows:

cmake --preset strict
cmake --build --preset strict
ctest --preset strict

Focused test labels are available for narrower checks, for example:

ctest --preset strict -L compatibility
ctest --preset strict -L packaging
ctest --preset benchmark-smoke

For local performance checks, run the benchmark target:

cmake --build build --target smashpp-benchmark

To compare against another executable, configure with:

cmake -S . -B build -DSMASHPP_BENCHMARK_BASELINE=/path/to/other/smashpp
cmake --build build --target smashpp-benchmark

The benchmark generates deterministic small and large inputs and writes timing rows to build/benchmarks/summary.csv. When a baseline executable is configured, it also writes build/benchmarks/comparison.csv with median timings and speedups for each scenario. The default large benchmark input is 256 MiB per file. Override the generated input sizes with byte counts when you need a shorter smoke run or a larger production check:

cmake -S . -B build \
  -DSMASHPP_BENCHMARK_SMALL_BYTES=131072 \
  -DSMASHPP_BENCHMARK_LARGE_BYTES=268435456

Use the same compiler, build type, input sizes, and machine when comparing results.
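
Because the size overrides take raw byte counts, a pair of tiny conversion helpers can reduce typos. The helper names are hypothetical; the arithmetic is plain binary units (1 KiB = 1024 bytes, 1 MiB = 1024 KiB).

```shell
# kib_to_bytes N: prints N KiB expressed in bytes.
kib_to_bytes() {
  echo $(( $1 * 1024 ))
}

# mib_to_bytes N: prints N MiB expressed in bytes.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Usage (commented out): a shorter smoke run with 64 KiB and 32 MiB inputs.
# cmake -S . -B build \
#   -DSMASHPP_BENCHMARK_SMALL_BYTES="$(kib_to_bytes 64)" \
#   -DSMASHPP_BENCHMARK_LARGE_BYTES="$(mib_to_bytes 32)"
```

Here kib_to_bytes 128 prints 131072 and mib_to_bytes 256 prints 268435456, the two values used in the command above.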

To create portable release archives from the install rules, run:

cmake --build build --target package

The archives are written to build/packages/.

Cite

If you find Smash++ useful in your research, please acknowledge our work by citing:

  • M. Hosseini, D. Pratas, B. Morgenstern, A.J. Pinho, "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements," GigaScience, vol. 9, no. 5, 2020. DOI: 10.1093/gigascience/giaa048

Issues

If you encounter an issue, please let us know.

Contributing

Development workflow, testing, benchmarking, and pull request guidance are in CONTRIBUTING.md.

License

Smash++ is distributed under the GNU GPL v3 license.