Genomic ranges
Understand the genome's coordinate system
June 10, 2025
Ranges are everywhere in genomics. Typically represented as chromosome:start-end, they track where interesting things are found (exons, protein binding sites, etc.), or how metrics change along the genome (GC content, copy-number, etc.). Here we explore how ranges work, and how to analyze them on the command line with bedtools.
Merge overlapping ranges
Let's consider a simple example with 3 ranges:
Try moving and resizing the ranges!
Next to the visualization is a BED file, a tab-separated file for storing ranges. BED files must have the 3 columns shown (chromosome, start, end), but can also include extra columns. A popular tool for wrangling BED files is bedtools, which we use here.
For example, these ranges could represent the locations of protein binding sites from a ChIP-seq experiment. To get a concise
representation, let's merge overlapping ranges with bedtools merge:
Output of bedtools merge -i chipseq.bed:
Modify the ranges at the top to update this visualization.
What happens when all the ranges overlap?
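To make the merge logic concrete, here is a minimal Python sketch of the idea for ranges on a single chromosome. It is not how bedtools is implemented, and the function name `merge_ranges` and the example coordinates are made up for illustration. One real detail worth knowing: bedtools merge expects its input sorted by chromosome and then by start position, which is why the sketch sorts first.

```python
def merge_ranges(ranges):
    """Merge overlapping or touching (start, end) ranges on one chromosome."""
    merged = []
    for start, end in sorted(ranges):
        # Extend the previous merged range if this one starts inside it
        # or right at its end; otherwise start a new merged range.
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


# Hypothetical ChIP-seq peaks, not the ranges from the visualization
peaks = [(10, 40), (30, 70), (100, 130)]
print(merge_ranges(peaks))  # [(10, 70), (100, 130)]
```

When every range overlaps its neighbor, the loop never starts a new merged range, so a single range comes out.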
Merging doesn't have to be all or nothing: you can merge nearby ranges that might be biologically related. Let's merge ranges
within 30 basepairs of each other using the parameter -d:
Output of bedtools merge -i chipseq.bed -d 30:
Tweak the value of -d to see how it impacts whether ranges are combined or not.
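The effect of -d can be sketched in Python as well. This is an illustration under two assumptions: the distance between two ranges is measured as the next start minus the previous end, and ranges at exactly that distance are still merged. The name `merge_within` and the coordinates are hypothetical.

```python
def merge_within(ranges, max_gap=0):
    """Merge (start, end) ranges whose gap is at most max_gap basepairs."""
    merged = []
    for start, end in sorted(ranges):
        # Gap between this range and the last merged one (assumed definition).
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


# Hypothetical peaks: the last two merged groups are 30 bp apart
peaks = [(10, 40), (30, 70), (100, 130)]
print(merge_within(peaks, max_gap=29))  # [(10, 70), (100, 130)]
print(merge_within(peaks, max_gap=30))  # [(10, 130)]
```

A single basepair of difference in the threshold decides whether the two groups stay separate, which is worth keeping in mind when picking a value.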
Intersect overlapping ranges
Another very common operation is to intersect two sets of ranges to get shared ranges. Here we have 2 BED files: exons.bed contains the locations of exons, and cpg.bed contains the locations of CpG
islands.
To find the CpG islands that overlap exons, we intersect those ranges using bedtools intersect:
Output of bedtools intersect -a exons.bed -b cpg.bed:
Try tweaking the ranges to get more intersections in the output.
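As a rough mental model, the default intersect behavior can be sketched in a few lines of Python. This is a simplification for a single chromosome, not bedtools' actual algorithm, and the function name and coordinates are invented for the example.

```python
def intersect(a_ranges, b_ranges):
    """Return the portion of each A range that overlaps each B range."""
    shared = []
    for a_start, a_end in a_ranges:
        for b_start, b_end in b_ranges:
            start = max(a_start, b_start)
            end = min(a_end, b_end)
            # Keep the clipped range only if at least 1 bp is shared.
            if start < end:
                shared.append((start, end))
    return shared


# Hypothetical exons and CpG islands on the same chromosome
exons = [(100, 200), (300, 400)]
cpg = [(150, 250), (380, 390), (500, 600)]
print(intersect(exons, cpg))  # [(150, 200), (380, 390)]
```

Comparing every pair is fine for a handful of ranges, but real tools rely on sorted input or indexing to avoid the quadratic cost.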
As shown above, where Input A intersects Input B, bedtools intersect returns the portion of A that overlaps B. In
some cases, you might instead want to filter down Input A by keeping only the ranges that intersect Input B. This is where the
flags -wa and -v come in.
To find exons that overlap with CpG islands and output the original ranges, use the flag -wa. And to return the ranges in Input A that don't intersect Input B, use the flag -v:
Output of bedtools intersect -a exons.bed -b cpg.bed:
Try moving the ranges so that one range from exons.bed overlaps 2 ranges in cpg.bed. What do you notice about the output?
Can you think of a bedtools command that would remove duplicate ranges?
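The two filtering modes can be sketched in Python as follows. Again, this is a toy model on one chromosome with made-up names and coordinates, meant only to show what "keep the original A range" and "keep A ranges with no overlap" look like side by side.

```python
def overlaps(a, b):
    """True if two (start, end) ranges share at least 1 bp."""
    return max(a[0], b[0]) < min(a[1], b[1])


def keep_overlapping(a_ranges, b_ranges):
    """Like -wa: report the original A range once per overlapping B range."""
    return [a for a in a_ranges for b in b_ranges if overlaps(a, b)]


def keep_non_overlapping(a_ranges, b_ranges):
    """Like -v: report A ranges that overlap nothing in B."""
    return [a for a in a_ranges if not any(overlaps(a, b) for b in b_ranges)]


# Hypothetical exons and CpG islands
exons = [(100, 200), (300, 400), (700, 800)]
cpg = [(150, 250), (380, 390)]
print(keep_overlapping(exons, cpg))      # [(100, 200), (300, 400)]
print(keep_non_overlapping(exons, cpg))  # [(700, 800)]
```

Unlike the default mode, the ranges in the output are never clipped: each one is either reported whole or not at all.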
Calculate genome coverage
Another common operation is to calculate coverage, that is, at every position in the genome, how many ranges are found
at that position? Say you ran a sequencing experiment, mapped the reads to the genome, and obtained the result shown below. We
can use bedtools genomecov to create a histogram of the number of bases that are covered by 0, 1, 2, or 3 ranges:
Output of bedtools genomecov -i reads.bed -g genome.txt
How can you modify the ranges to hide the histogram bar for 0 coverage?
This command needs to know the size of each chromosome so it can count the positions with no coverage—we stored that
information in genome.txt. For simplicity, we used a .bed file as the input, but bedtools also
supports .bam files, in which case you don't need genome.txt because the .bam has that information already in the header.
Also, keep in mind that bedtools outputs a text file, not a pretty histogram. Use the embedded terminal to see
what that output looks like and how you can interpret it as a histogram. Use the manual to understand what
the last 2 columns of the output represent.
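If it helps to see the counting spelled out, here is a small Python sketch that builds the same kind of histogram for one chromosome: for each depth, how many bases are covered by exactly that many ranges. It only produces the depth and base-count part of the picture, and the function name, reads, and chromosome size are all hypothetical.

```python
from collections import Counter


def coverage_histogram(ranges, chrom_size):
    """Map each coverage depth to the number of bases with that depth."""
    depth = [0] * chrom_size
    for start, end in ranges:
        # BED-style ranges: start is included, end is not.
        for pos in range(start, min(end, chrom_size)):
            depth[pos] += 1
    return dict(sorted(Counter(depth).items()))


# Hypothetical reads on a 20 bp chromosome
reads = [(0, 10), (5, 15), (8, 12)]
print(coverage_histogram(reads, chrom_size=20))  # {0: 5, 1: 8, 2: 5, 3: 2}
```

The chromosome size is what makes the 0-coverage bar possible: without it, there is no way to know how many bases no range touches.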
Pitfalls
This article wouldn't be complete if I didn't mention the ways bioinformatics conspires to make you question your life choices. When it comes to genomic ranges specifically, keep in mind that:
- BED files use 0-based indexing, meaning chromosome coordinates range from 0 to chromosome length - 1. Just to keep it interesting, some formats like VCF use 1-based indexing.
- Coordinates exist within the context of a reference genome. When you're running range operations on multiple BED files (like intersect), make sure they all use coordinates from the same reference genome! The changes between 2 genome versions might be so small that you might not notice.
- If your BED files contain forward/reverse strand information (column 6) and you don't want to merge/intersect ranges on different strands, make sure to use the flag -s in your bedtools commands.
- If you find yourself modifying a BED file manually (not that I've ever done that myself), make sure your IDE doesn't helpfully change your tabs into spaces on one of the lines, because that renders the file incorrect, and software like bedtools will give you an error. You can use tools like bedqc to validate your BED files.
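The indexing pitfall is easiest to see with a tiny Python example. BED ranges are 0-based and half-open (the start position is included, the end position is not), so converting to a 1-based, inclusive range only shifts the start. The helper names below are made up for illustration.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open BED range to 1-based, inclusive."""
    return start + 1, end


def bed_length(start, end):
    """Number of bases in a BED range; no +1 needed with half-open ends."""
    return end - start


# The first 100 bases of a chromosome
print(bed_to_one_based(0, 100))  # (1, 100)
print(bed_length(0, 100))        # 100
```

Forgetting this shift when moving between formats leads to off-by-one errors that are easy to miss, because the results still look plausible.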
What's next?
Here we covered commonly used operations on ranges, but there are many, many more. You don't need to know all of them, but it's
good to briefly browse the list of bedtools commands,
just so you know what's possible for future reference. You'll be surprised by how much bedtools can do.
Another way to dig deeper is to use the terminal below and the bedtools manual to answer the following questions:
- Merging ranges is nice and all, but how would you go about also counting how many intervals were merged? BED files usually have more columns than we did here, but in this case, choosing any column number works.
- When using bedtools intersect, how would you intersect ranges only if they overlap by a significant amount, say 50%?
- What happens if you run bedtools intersect with the flag -wb? How does the output look different?
- How would you use bedtools to get a list of regions along the genome where no ranges exist? Consult the manual to see which command could help.