The nearest hospital to every place on Earth, in a single S2 range query

30 min read Original article ↗

How far is the nearest hospital, for every place on Earth? At first, this sounds like a distance problem with billions of pairs to check. However, with the right tools, it isn’t a distance problem at all: with S2 indexing it’s the same query as which country, region, and locality contains this point? We can solve it with a plain integer range check against a single index.

Last week I was advising a client whose geo pipeline was getting slower every week. Their team had started with the obvious approach: given a few hundred thousand points and several hundred thousand polygons, for each point they scanned every polygon to find the one that contained it. As the input size grew, they watched the nightly batch pipeline become considerably slower and were running out of options.

As we sketched applicable approaches on a whiteboard, I realized I was drawing a picture I’d drawn before, a decade before at Google. The trick, using S2 geometry to turn spatial joins into key joins, is one of the most elegant and underrated primitives I’ve come across: the kind of indexing idea that, like Ukkonen’s suffix trees, collapses an apparently quadratic problem into something nearly linear.

The problem I was solving back then had the same shape but on a different application: the typical price of a hotel stay anywhere on Earth, any night of the year, served at 10ms latency, which required precomputing everything via batch pipelines. One step had to associate every hotel with every named place that might contain it: neighborhood, town, city, region, POI.

Written naively, that step is a cartesian product. Given H hotels and R regions, generate all H x R pairs and only keep the ones where the hotel falls inside the region’s polygon.

Spatial realityR1R2R3h1h2h3h4h5h6Each hotel lands in ~one region, so the answer is sparse.Naive computation (H × R)regionshotelsR1R2R3h1h2h3h4h5h6But H×R tests every hotel against every region (18 cells, 5 true).
The association step. The answer is a sparse matrix: each hotel falls inside a handful of regions. But the naive cartesian product still runs all H×R point-in-polygon tests to find it.

I’d built it as a Flume pipeline, computing summaries for every night of the year over data that already lived distributed across several storage backends. Running it on Flume was justified by the layout of the data, the multitude of prices we had, and the 365-day time dimension. However, joining points-against-polygons was always within reach of one machine. The primitive, the indexing trick at the heart of it, never needed a cluster around it.

To demonstrate it, I wanted to solve a fresh public-data version of that same problem on a pretty typical machine: an AMD Ryzen 9 7900 desktop (12 cores, 64 GB of RAM). The question I picked: where on Earth is your nearest hospital? Its naive form is 437 billion pairs. The S2 index collapses it to a single integer range-join: about an hour to build the index, and then the answer for every locality on Earth comes back in seconds. This post is the story of how.

The worst place to get injured

If you live on the Kerguelen archipelago (a French sub-Antarctic research outpost halfway between Madagascar and Antarctica) and you need a hospital, the nearest one is 3,362 km away, on Rodrigues Island in Mauritius. That’s farther than New York to Los Angeles. And it’s the loneliest result in a worldwide leaderboard of localities ranked by distance to their nearest healthcare facility, covering every place on Earth that anyone has put on the map.

The top of the leaderboard is unsurprising: the top three are all settlements on Kerguelen, the next seven are all Tuamotu atolls in French Polynesia. Interestingly, both are French overseas territories; together they sweep the entire global top 10 because:

  1. each small atoll is its own locality polygon,
  2. they really are extraordinarily remote, and
  3. France maps them well.

The last point was interesting: the underlying map data is a treasure, but has varying coverage by country. I suppose it’s due to a combination of factors, one of which is how active the local mapping community is. I guess it makes sense considering how much of it comes from OpenStreetMap volunteers.

A more interesting exploration asks the question per country, restricted to countries most readers will recognize. Click any row to see the actual location on OpenStreetMap.

CountryLocalitykm
RussiaDikson, Arctic Ocean coast2,901
CanadaRead Island, Northwest Territories2,264
United StatesAdak, Aleutian Islands, Alaska1,546
Greenland (Denmark)Nanortalik, southern Greenland1,171
AustraliaMundrabilla, Nullarbor Plain631
China黑瞎子岛镇, Bolshoy Ussuriysky Island584
United KingdomRockall, North Atlantic islet503
MauritaniaAkdernit, Sahara365
MadagascarBeroboka Nord, western coast245
BrazilOriximiná, Amazon interior230
MaliTin Zaouatine, Algerian border / Sahara219
ArgentinaSierra Colorada, Patagonia173
JapanKitadaitōjima, remote Pacific island165
ChileNatales, Patagonia / Magallanes153
MexicoProgreso, Yucatán125
ItalyUstica, volcanic island57
FranceÎle-de-Sein, Brittany45

A few things this list reveals (I had to look up every single one of them except for Ustica):

  • The UK’s most-isolated mapped locality is Rockall: a tiny rocky islet in the North Atlantic, contested between four countries, with a Royal-Marines-occupied flagpole. So small, it turns out, that nobody lives on it at all (more on that below).
  • China’s is Bolshoy Ussuriysky: a river island the PRC and Russia split between them in 2008.
  • Australia’s Nullarbor Plain at 631 km is genuinely remote; the Outback is really isolated.
  • Italy and France show what well-covered countries look like: the most-isolated locality is a tiny offshore island just ~50 km from the nearest hospital. That’s what most of Europe looks like.

Behind every entry on that table is one integer per locality, half a million of them, classified against every healthcare POI in Overture (the open global map dataset, introduced below), every (locality, POI) pair a potential great-circle distance. The primitive that makes this possible at scale is S2 cell indexing, originally built in the mid-2000s to power Google Maps. The same primitive simultaneously answers a different-looking question (what country, region, and city does each hospital belong to?) from the same index. That’s the part this post is built around.

The hotels-to-cities problem, again

The Flume pipeline from my memory was a hotels-to-cities job: for every (hotel, place) pair, answer is this hotel inside this place? The pipeline in my demo works for both hospitals-to-administrative-units and hospitals-to-radius-bands. S2 cell indexing makes these problems tractable by turning geographic questions (is point P inside polygon Q?) into an integer-key question (is the integer P′ in the sorted set Q′?). Then, a sort-merge pass handles the rest and the shape of the problem goes from quadratic-on-geometry to linear-on-integers.

I wanted to apply the lesson to a fresh problem with public data and real-world stakes, and settled for these two facets:

  • Distance to the nearest hospital is something everybody understands. Mapping every hospital on Earth to every country, region, locality it belongs to, at the same time, is the same question as the hotels-to-cities use case from my experience.
  • How far is the nearest hospital from each locality is the same question shape, just with distance bands instead of admin levels.

And S2’s hierarchy makes those two kinds of “level” indistinguishable to the algorithm. That’s the surprise from the beginning, spelled out: distance and containment are the same query. Let’s see how.

S2 in one paragraph

S2 partitions the surface of the Earth into a quadtree of cells. A quadtree is a tree-like data structure where each node represents a region of a 2-dimensional space (i.e. our cells), which is then divided in 4 smaller quadrants. Key S2 principles:

  • every cell has a unique 64-bit integer ID;
  • every cell has a (larger) parent at the next coarser level;
  • every cell has exactly four (smaller) children at the next finer level;
  • every leaf cell at level 30 (about 1 cm²) is the descendant of exactly one cell at every level above it;
  • it’s possible to walk the S2 hierarchy all the way up to six face cells.

This strict parent-child invariant — meaning that every leaf has a unique ancestor at every level — is what makes S2 special. It lets us encode every cell as an interval [range_min, range_max] over leaf-cell IDs. A leaf L is contained in cell C if and only if range_min(C) ≤ L ≤ range_max(C).

In my mind, what makes S2 unique is how much depth of information can be encoded in a simple integer-interval check. The whole post is an attempt to illustrate this one specific property.

Also, each S2 Cell ID is positional. Three bits map to one of six cube faces, then two bits per level map to one of four children, and a final 1 bit marks where the cell stops. A coarser cell is therefore a binary prefix of all its descendants:

cell C        011 10 00 11 · 1 · 0000…0
range_min(C)  011 10 00 11 · 0000…00 · 1
...
range_max(C)  011 10 00 11 · 1111…11 · 1

Every leaf under C shares the prefix 011 10 00 11. What varies are only the bits below it, from all-zeros to all-ones. That span is the interval [range_min, range_max]. So the question “is this leaf inside this cell?” becomes “does this integer fall in that range?”.

The Hilbert ordering shown below adds another useful property: nearby cells get nearby IDs, which enables representing whole regions with few cells instead of thousands. See below for an example.

Example showing S2 cells on one face at level 4, walked in Hilbert order. The blue space-filling curve is the order S2 uses to number cells: consecutive cells along the curve are represented via consecutive cell IDs. The red square is the 16 leaf-cell descendants of a single parent cell two levels up: they form a contiguous run of cell IDs (a single integer interval) that's also a connected sub-region of the curve.
Example showing S2 cells on one face at level 4, walked in Hilbert order. The blue space-filling curve is the order S2 uses to number cells: consecutive cells along the curve are represented via consecutive cell IDs. The red square is the 16 leaf-cell descendants of a single parent cell two levels up: they form a contiguous run of cell IDs (a single integer interval) that's also a connected sub-region of the curve.

The data

The dataset I used for the demo is Overture Maps, a public release of Meta’s, Microsoft’s, AWS’s, and TomTom’s joint cleanup and merge of OpenStreetMap, Microsoft Building Footprints, and proprietary data from many other sources. As of the May 2026 release it has 54 million POIs and 1.07 million administrative polygons, all openly available as Parquet on S3, byte-range-readable with no auth required, which made it a great candidate for exploration.

I pulled the global tile to my workstation via the overturemaps CLI in a few minutes. From that, the input set for this post featured:

  • 770,440 healthcare POIs: hospital, medical_center, emergency_room, urgent_care_clinic. This deliberately excludes pharmacies, dental clinics, and specialist offices. However, you could run the same pipeline with any other type of POI.
  • 567,307 localities: cities, towns, villages, wards.
  • ~4,700 regions, ~380 country polygons (note: some countries have multiple polygons in the source, such as overseas territories, exclaves, and disputed islands, which get rolled up to one ISO code at aggregation time).

The next sections cover some preparation steps to make this data more suitable to be used as an S2 index.

Preparing an S2 index of administrative regions

For each admin polygon (country, region, locality) we then build its cell-union: the smallest set of S2 cells whose union covers the polygon. This is where the S2 properties we mentioned earlier earn their keep: cell-unions are a relatively compact and efficient way to represent regions of space as unions of intervals. Additionally, each cell in the union carries an INTERIOR / BOUNDARY tag. INTERIOR cells are entirely inside the polygon; BOUNDARY cells straddle its edge. This can be done quite easily by leveraging s2 libraries:

from geo.s2_covering import cover_polygon

# `polygon` is a shapely geometry for one admin region.
# `cover_polygon` returns a mixed-level cell-union: small cells along
# the boundary, big cells inside the bulk.
tagged = cover_polygon(polygon, min_level=4, max_level=12, max_cells=200)

rows = [
    (admin_id, c.range_min, c.range_max, c.tag == "INTERIOR")
    for c in tagged
]

Building the cell-union table for over half a million admin polygons produces ~5.7 million rows in 36 minutes on my workstation. Each row is a small struct: (admin_id, range_min, range_max, is_interior).

Italy's S2 cell-union at levels 4–7, overlaid on the OSM base map. Green cells (INTERIOR) fall entirely inside the country polygon: a POI whose leaf cell falls in any of them is auto-confirmed as inside Italy. Orange cells (BOUNDARY) fall on the polygon edge; POIs in them have to be re-checked with a polygon-contains call. INTERIOR cells seem rare here, and that's quite common: on every real admin polygon I checked, INTERIOR cells are a small minority of the cell-union. The next section covers why the join is still fast.

On the other side of the question, for each healthcare POI, we compute its leaf cell, a 64-bit integer. Processing all 770k of them takes four seconds.

The spatial join

Next we can answer our questions via a single SQL range-join in DuckDB (conceptually similar to SQLite, but for analytical work on columnar data):

SELECT poi.id, admin.id, admin.subtype
FROM poi_leaves poi
JOIN unified_cell_union admin
  ON poi.leaf BETWEEN admin.range_min AND admin.range_max

What’s impressive is that it works across all admin levels at once. On my machine, it resulted in ~5.6 million candidate (POI, admin) pairs in under a second:

  • ~965k INTERIOR matches are auto-confirmed: the leaf POI is inside an INTERIOR cell, no further checks needed.
  • ~4.7M BOUNDARY matches need a polygon-contains check; ~3.3M pass.

The refinement for boundary cells is the only place in which geometry is still required. To keep it fast, we can group the candidates by admin polygon: load each polygon’s geometry once, then call shapely.contains against every POI that landed in one of its boundary cells:

from shapely.geometry import Point

confirmed = []
for admin_id, group in boundary_candidates.groupby('admin_id'):
    poly = admin_polygons[admin_id]      # loaded once per admin
    confirmed.extend(
        (poi_id, admin_id)
        for poi_id, lon, lat in group.itertuples(index=False)
        if poly.contains(Point(lon, lat))
    )

Most of the runtime is spent in polygon loads; the contains calls are cheap because the admin polygon was already simplified upstream of the index. This step takes a few minutes, and results in ~3.3M confirmed matches (from ~4.7M candidates).

Total: ~4.2 million confirmed (POI, admin) pairs, across country and region and locality, in 6 minutes of compute.

For a concrete example, let’s take a real hospital: Bergamo’s Ospedale Papa Giovanni XXIII, located at roughly (45.6917° N, 9.6692° E). Its S2 leaf cell ID is 5,152,488,575,548,925,233. Each admin polygon that contains it is a union of many cells; the cell whose interval contains the leaf cell ID of the hospital is its matching cell. A cell and its interval are the same object: a cell at level k is exactly the leaf range [range_min, range_max]. The three matching cells:

Admin polygonMatching cellrange_minrange_max
Italia (country)level 65,152,117,973,711,847,4255,152,680,923,665,268,735
Lombardia (region)level 75,152,399,448,688,558,0815,152,540,186,176,913,407
Bergamo (locality)level 115,152,488,509,130,407,9375,152,489,058,886,221,823

The SQL query above resolves country, region, and locality containment with one BETWEEN per row, against one unified table. The canonical way to answer “which polygon contains this point?” is a spatial index like an R-tree, the data structure behind PostGIS and most geo databases. But an R-tree covers one layer of polygons at a time, so all three admin levels mean three separate indexes and three separate tree searches. With S2, we can replace all of it with integer comparisons against a single table, and can scale better.

Each region is a union of cells; one cell per row contains the leafItaliacountryLombardiaregionBergamolocalityS2 cell ID: one 64-bit integer axishospital leaf cell5,152,488,575,548,925,233Schematic, not to scale. Bergamo's real cells are ~0.1% the size of Italy's, and a country's union runs to ~200 cells.
Each administrative level's cell-union is a set of S2 cells. Drawn on a single axis of cell IDs, the highlighted cell in each row is the one whose interval contains the hospital's leaf cell (red).

As we said before, the cell-union skews heavily toward BOUNDARY cells. This is because most admin polygons have long, jagged perimeters relative to their area, and RegionCoverer breaks up the edge cells: each level of refinement typically turns 1 parent into ~2 straddling children, and a few edge cells at coarse levels turn into many leaf BOUNDARY cells at max_level. Meanwhile the bulk of the polygon’s area gets covered by a handful of large INTERIOR cells.

That skew doesn’t slow the join down, though. The range-join itself is just BETWEEN comparisons, so it doesn’t matter how the cells are tagged. The only cost is the BOUNDARY refinement, and most of those checks land on leaf cells at max_level, where the geometries are small and shapely.contains runs in microseconds. INTERIOR cells help at the margin, auto-confirming the POIs that fall in a polygon’s bulk, but the join is fast mainly because even the boundary path is cheap: checking a tiny leaf against an already-simplified polygon results in a microsecond contains call.

The refinement work scales with a polygon’s perimeter at max_level, not with area.

Distance is just another kind of hierarchy

Now for a question that seems harder at first glance: how far is the nearest hospital from each locality? The naive solution would be to treat it as a nearest-neighbor problem: for each locality, scan all hospitals, find the closest one. Even with a spatial index it looks like a different kind of problem from the admin-rollup join.

Is it though?

Let’s see how it can be reduced to a similar-shape problem as the one before.

First, for each healthcare POI, we can build multiple spherical caps, one for each radius band (e.g. within 1 km, within 5 km, within 15 km, within 30 km, within 100 km) and then cover each cap with S2 cells. Doing this in S2 is quite straightforward:

import s2sphere
EARTH_RADIUS_KM = 6371.0

center = s2sphere.LatLng.from_degrees(poi.lat, poi.lon).to_point()
cap = s2sphere.Cap.from_axis_angle(
    center,
    s2sphere.Angle.from_radians(radius_km / EARTH_RADIUS_KM),
)
cells = region_coverer.get_covering(cap)

Here:

  • get_covering is the same primitive used for the admin polygons;
  • Cap is just a shape RegionCoverer knows how to handle.
“Within r km of a hospital” is a spherical capθ = r / Rrcentre, radius Rhospital(cap axis)spherical capflat-plane analogra diskcurve the plane →the disk becomes a capSchematic, not to scale: θ is drawn far wider than any real band — 100 km is ≈ 0.9° on Earth. The code passes θ = radius_km / EARTH_RADIUS_KM, the arc radius in radians.
Why a cap, not a circle: Earth is a sphere, so “within r km of a hospital” is the portion of surface within an angular radius θ of the axis pointing at that hospital: a spherical cap. S2 can construct it given an angle: θ = radius_km / EARTH_RADIUS_KM, which is the value we pass to Cap.from_axis_angle.

Then we build five cell-unions (one per radius band) across the caps for all 770k POIs. When building each union, individual caps overlap and fuse into a single ragged region: solid where hospitals cluster, full of gaps over oceans, deserts, and empty highlands. For a given radius then, the cell-union covers the parts of the planet that are within that radius of some hospital, and the holes cover the places that aren’t.

Then, for each locality, we take its representative point’s leaf cell and ask: what’s the smallest band whose cell-union contains me? That’s its isolation distance.

The check is identical to the admin one: leaf BETWEEN range_min AND range_max. The hierarchy now spans distance scales instead of administrative scales, and the algorithm handles it the same way. I see that as more proof of the elegance in the S2 design: country / region / locality / 1 km / 5 km / 30 km are all the same kind of thing to the index. They are all representable as nested cell-unions over a quadtree, and queryable with a single integer-interval check.

Five concentric radius-band S2 cell-unions around one example hospital in Reykjavik, overlaid on OSM. Each colored band is everything within 1 / 5 / 15 / 30 / 100 km of the hospital, covered with S2 cells whose [range_min, range_max] intervals can be checked against a leaf cell in a single integer comparison. Bigger radii merge into coarser parent cells, the same Hilbert-locality property that made the country figure work. To the algorithm, this picture and Italy's are the same concept.

That’s one hospital. Now, do it for all hospitals and build the union for each band, and the tidy concentric rings become a single coverage map with holes:

Iceland's 50 km hospital-coverage band: the S2 cap-unions of every Icelandic hospital, merged into one. The colored wash covers every place within 50 km of some hospital; the bare patches (the interior highlands and the remote fjords) cover the localities that fall through into a coarser band. Each marked dot is a real inhabited locality whose nearest hospital is farther than that. Under the hood, it's the same primitive and the same leaf BETWEEN range_min AND range_max check as the admin join; only the polygons changed.

This is where S2 differs from H3, Uber’s hexagonal grid system, and from R-trees. R-trees can do range queries but need one tree per question: four levels of hierarchy require four trees and four traversals. H3 can index polygons but its parent-child relation is approximate, so a multi-level union breaks the integer-interval property. S2’s strict quadtree parent-child invariant is the property that makes the same primitive work for both kinds of hierarchy.

Here are the cell counts per band, after normalization:

RadiusCells (after normalize)
1 km1,575,449
5 km724,157
15 km317,056
30 km129,497
100 km11,021

Bigger radii merge into coarser parents, yielding fewer cells for the same fidelity. Building all five bands across 770k POIs took 30 minutes; classifying 567k localities into bands took 30 seconds.

A booby-trap

As is often the case, the first time I reviewed the results, I tripped into some fun edge cases. The first version of the global leaderboard had Archipel des Crozet at 2,008 km. Being unfamiliar with many of these locations, I decided to do some spot-checking. Crozet is a sub-Antarctic French island group in the southern Indian Ocean, about as far from anything as land gets, so I expected a big number. But 2,008 felt too low: Crozet to Cape Town is 2,400 km, Crozet to Madagascar is 2,400 km, and there’s nothing in between. A hospital can’t sit closer than the nearest land, I thought. So where was this 2,008-km hospital?

A query of my data gave me the answer: at coordinates (-30.75°, 63.63°), the middle of the open Indian Ocean. Querying that location in Overture’s data turned up:

서울병원 (Seoul Hospital), country=KR

A Korean hospital tagged ~10,000 km from Korea, due to (I suppose) a geocoding error somewhere upstream in the OSM ingest. That phantom POI was the “nearest hospital” pulling Crozet down to 2,008 km. It did worse damage next door: it also sat ~2,150 km from Kerguelen, well inside Kerguelen’s true ~3,360 km, so it had also bumped genuine top results off the board. Three more ghosts turned up at similar mid-ocean coordinates, each leading us to understate how isolated some real place is.

The fix is one filter, against the same table we built for the join:

-- Keep POIs whose leaf falls inside some country polygon.
-- Misgeocoded ocean POIs never match because no country range covers
-- the open ocean.
SELECT DISTINCT poi.id
FROM poi_leaves poi
JOIN unified_cell_union admin
  ON poi.leaf BETWEEN admin.range_min AND admin.range_max
 AND admin.subtype = 'country';

That’s the same BETWEEN primitive, restricted to one admin level. The index doesn’t need to know what a “country” is; it just trusts the cell-union for subtype = 'country'. After the filter, Kerguelen reclaims the top three and Crozet settles at its real ~2,400 km (nearest hospital in Madagascar): still extraordinarily remote, just below the Kerguelen and Tuamotu tier at the top of this post.

The lesson, more useful than the leaderboard for a toy problem: even with a clean algorithm and public data on a workstation, the first answer is wrong in interesting ways. The question shape is right; the data has booby-traps; you find them by looking at outliers and asking “does this make sense geographically?”

You’ll often hear advice that in any data problem (whether it’s for ML or data analysis), spending time eyeballing the data for patterns and outliers pays off. This is one more example of that.

See it

The interactive map above zooms through three layers of the same join focused on my home country: Italy. 20 regioni, 8,577 comuni (municipalities), and 12,579 healthcare POIs. It’s rendered as a choropleth, where polygon fills are colored by what fraction of contained comuni are within 5 km of healthcare: blue is good, red is bad, cream in the middle. Hover any feature to see its per-area stats. Toggle “Density” to switch from access to healthcare-POIs-per-km².

Why Italy specifically? Of course I have ties to the country, but beyond that, there are more interesting reasons.

While the algorithm is universal, the input data isn’t: Overture’s locality coverage is dense across Italy (every comune is mapped) and several other countries, but patchy elsewhere. Showing the join over Italy keeps the demo interesting and realistic: every visible polygon is a real comune with a real population, not a hole in OSM’s coverage. When I worked on similar problems at Google, we were also dependent on the quality of data from Google Maps, and we knew that some countries were better mapped than others.

A few patterns the Italian view shows:

  • The alpine arc (Valle d’Aosta, Trentino, the northern fringe of Lombardia and Veneto) reads warmer: small mountain comuni without their own hospital, leaning on the valley town next door for healthcare access. An interesting follow-up study might be considering driving distance, which I’d expect to affect POI accessibility even more.
  • Po Valley, Tuscany, Campania around Naples, the Adriatic coast read cool: dense comune mosaic, dense hospital coverage.
  • Sardinia and Sicily show internal isolation: interior mountain comuni warm, coastal comuni cool.
  • Zoom in far enough and the POI dots light up: deep red for hospitals, blue for emergency rooms.

See the join structure itself

Now enable Show S2 cells in the toolbar. The choropleth gets overlaid with the actual S2 cell-union used in the join: green cells are INTERIOR (entirely inside the polygon, auto-confirmed in the range query), orange cells are BOUNDARY (straddle the edge, refined via shapely.contains). At low zoom you see the per-regione cell-union; at closer zoom levels the map transitions to the per-comune cell-unions. At intermediate zoom levels you can briefly see both layers on top of each other: a Lombardia-tier cell sitting above the Milan / Monza / Lodi comune cells it contains. That visual nesting mirrors the integer-interval check from the SQL block earlier in this article.

One thing the overlay shows: the INTERIOR/BOUNDARY distribution looks completely different depending on what you count. Barely 1% of the cells in Italy’s cell-union are INTERIOR, while nearly all the rest cover the jagged comune and regione edges. That sounds like the algorithm’s “fast path” is doing almost no work, but the matches skew far more INTERIOR than the cells do, because POIs typically cluster in the bulk of polygons, not on their edges.

What the answer doesn’t (and can’t) say

Four caveats deserve more than a footnote:

OSM coverage varies by country. As we mentioned before, the dataset’s tagging fidelity isn’t uniform. Some examples:

  • Italy’s espresso bars are tagged bar and not coffee_shop;
  • Vietnam tags every cà phê;
  • Thailand tags every clinic;
  • rural Russia barely tags anything.

For isolation distance, this means the answer is an upper bound wherever coverage is thin: when we say “120 km to the nearest hospital,” the best we can tell is “120 km to the nearest hospital that someone bothered to map.” That’s an OK answer to what does the public dataset say, but it’s not the answer to where are the actual hospital deserts without continued investment in mapping ground truth as open data. (Which, incidentally, happens to be a topic I am passionate about.)

Density is partly a tagging-density story. When generalizing this pipeline to the world, the “densest healthcare” leaderboard is dominated by Bangkok wards (top entry: 254 healthcare POIs per km², four of the top five are in central Bangkok), then Delhi, Jakarta, Taipei, Seoul, Saigon. Bangkok’s medical-tourism culture and high small-clinic density are probably a factor. But Thai OSM mappers are also unusually thorough for any POI type, not just healthcare. The dataset’s leaderboards are always partly a leaderboard of map quality and coverage.

Outright vandalism happens. OSM is open, which sometimes means that some entries are jokes. Greenland’s top-isolated locality in the raw data came back as a Skibidi-meme joke name that someone added to the map. The shape of the problem is the same as the hospital in the middle of the ocean we covered before: a public dataset’s leaderboard depends on the public dataset’s quality, and you’d want to invest in quality measures for any real-world applications.

A locality isn’t necessarily inhabited. When I looked into the UK’s lonely winner, Rockall, it turned out to have no permanent population at all. There’s nothing wrong with the data, though: Overture does have an optional population field, but it’s sparse, and the locality subtype only asserts that someone put a named place on the map, not that anyone lives there. For any real-world use case, this would be an area of improvement.

Locality coverage is wildly uneven across countries, which is why the live demo in this post is Italy. Overture inherits OSM’s country-level mapping density: Italy has 8,577 mapped comuni, France 36,752 communes, Germany 22,967. Conversely, Greece has just 178. You see similar range within the US, where Massachusetts is fully tiled with mapped places while Virginia is full of gaps. A global choropleth at city zoom reads great over Western Europe and becomes visibly patchy elsewhere. And that’s not because the algorithm breaks, but because the input data is uneven. The interactive widget in this post uses Italy because every visible polygon is a real comune with a real population.

Why this is an S2 post and not just a “use a spatial index” post

The cell-set inner join, on its own, doesn’t distinguish S2 from any other spatial index. R-trees can do range queries just fine. H3 cells can index polygons. DuckDB’s plain ST_Contains is already quite fast at this scale on its own: for a single-level point-in-polygon problem, a generic spatial join would land in the same ballpark.

What S2 buys with its versatility, and what I wanted to show here, is how the same primitive can handle such breadth with just one index, serving every scale at once:

  • The same cell-union table resolves country and region and locality for any point: single integer-interval lookup against a unified table, which is why it worked great with DuckDB.
  • The same primitive handles what 1 km / 5 km / 30 km radius am I in for any point: it’s the same lookup on structurally identical tables.
  • The two kinds of polygon “level” are handled in the same way by the algorithm. R-trees would need four trees and four traversals.

One thing it took me a moment to figure out when I first looked into this is why H3 can’t do the same. H3 also has parents after all. Turns out, that’s not the issue. The issue is that H3 cell IDs across different resolutions live in disjoint integer spaces. The encoding embeds the resolution in the ID itself, so a leaf cell at res 9 and its res-5 ancestor are unrelated integers. There’s no equivalent to S2’s [range_min, range_max] that lets a single leaf ID be range-checked against a cell at any coarser level. What this means is that, while with H3 you can do lookups, you can only work at one fixed resolution at a time. S2’s strict quadtree invariant, that every cell at every level is representable as an integer interval over leaf IDs, is what makes the multi-resolution unified table work.

Concretely, here’s a visualization of what that costs us. The same patch of Lombardia, covered two ways:

Lombardia covered with S2 vs H3. On the left, S2's mixed-level cell-union: 59 cells at levels 5–9. Three large green INTERIOR cells cover the polygon's bulk; the remaining 56 BOUNDARY cells (orange, finer) handle the edges. Mixing levels is possible because S2's quadtree guarantees each cell's [range_min, range_max] over leaf IDs cleanly contains all its descendants, so coarse and fine cells live in the same integer-indexed table. On the right, two resolutions of H3 cells overlaid: res 4 (15 large blue hexes) and res 5 (97 red strokes laid on top). The red boundaries cross the blue ones because the smaller hexes do not nest inside the bigger ones. A res-4 cell well inside Lombardia has a ~14% difference (symmetric) with the union of its 7 res-5 children: children spill out and leave gaps. That's why H3 can't do the S2 trick of mixing resolutions in one cell-union table.
Lombardia covered with S2 vs H3. On the left, S2's mixed-level cell-union: 59 cells at levels 5–9. Three large green INTERIOR cells cover the polygon's bulk; the remaining 56 BOUNDARY cells (orange, finer) handle the edges. Mixing levels is possible because S2's quadtree guarantees each cell's [range_min, range_max] over leaf IDs cleanly contains all its descendants, so coarse and fine cells live in the same integer-indexed table. On the right, two resolutions of H3 cells overlaid: res 4 (15 large blue hexes) and res 5 (97 red strokes laid on top). The red boundaries cross the blue ones because the smaller hexes do not nest inside the bigger ones. A res-4 cell well inside Lombardia has a ~14% difference (symmetric) with the union of its 7 res-5 children: children spill out and leave gaps. That's why H3 can't do the S2 trick of mixing resolutions in one cell-union table.

S2’s mixed-level union is a clean tiling, leveraging a handful of large interior cells, a band of small boundary cells, leaving no gaps and no overlaps. H3’s geometry can’t deliver that: pick one resolution and you get a sensible cover; try to mix two and the cell boundaries don’t line up cleanly.

Another way to get the feel of the impact of this design is to look at cell counts across the three nested admin polygons:

S2 (mixed levels)H3 res 5 (~8.5 km)H3 res 7 (~1.2 km)H3 res 9 (~174 m)
Italia (country)1781,70183,3224,081,854
Lombardia (region)59974,653227,466
Bergamo (locality)1618380

S2 keeps its counts small by mixing levels (large cells for the interior, fine cells only at the edges) while H3 must pick one resolution, and none serves all three rows: res 5 collapses Bergamo to a single hexagon, res 9 blows Italy up to four million cells. While you can build a working spatial join with H3, you can’t build this particular multi-scale unified-index join with it.

When it comes down to practical considerations: the unified cell-union table over 572k admin polygons is 46 MB. The five radius-band cell-unions over 770k POIs total ~70 MB. Both indexes are tiny, and building them is the only slow part: the admin cell-unions take 36 minutes and the radius bands another 30, so both indexes are ready in about an hour. Everything after that is a query. The range-join that resolves country, region, and locality for all 770k POIs returns in under a second; classifying every one of the 567k localities on Earth into a nearest-hospital band takes 30 seconds. The one query-side step that costs minutes, the boundary refinement, is geometry-loading, not the join — a one-shot batch cost that would happen once and then be amortized away.

Which takes us back to the beginning. When I used a distributed batch job via Flume for a similar problem at Google, it was not because the algorithm needed one but because the operations downstream needed huge datasets. Here the entire pipeline (building both indexes, joins, polygon refinement, aggregation) runs comfortably on a single CPU. I find it remarkable that we can turn a point-in-polygon question into an integer-interval question and utilize simple techniques like sort-merge. By building an index once, we can answer planet-scale queries in seconds of work. As is often the case, the index is what does the real work, at fleet scale and on a desktop alike.

Reproduce

The code is at github.com/abahgat/s2-spatial-join (public, MIT-licensed). All data sources are open and publicly available with no auth required. End-to-end on an average desktop computer, from raw Overture tiles to the finished leaderboard, runs in under two hours, with almost all the time spent in a one-time index build.

uv sync
uv run overturemaps download -t place         -f geoparquet -o data/cache/world_places.parquet
uv run overturemaps download -t division_area -f geoparquet -o data/cache/world_divisions.parquet
uv run python scripts/bake_healthcare_world.py            # phases A–D: indexes
uv run python scripts/bake_healthcare_world_geojson.py    # phase E: aggregate + leaderboard
CONTAINMENT · which admin units contain each POIDISTANCE · nearest hospital per localityOverture places770k healthcare POIsOverture divisions572k admin polygonsAAdmin S2 cell-union572k polys · 46 MBBRange-join · DuckDBleaf ∈ [min, max]5.6M candidates · <1 sCBoundary refineshapely · 14% of POIs4.24M pairs · 6 minDRadius-band cell-unions1 / 5 / 15 / 30 / 100 km70 MB · 30 minDClassify each localityby smallest bandEAggregate per localityGeoJSON + leaderboardphases A–D: scripts/bake_healthcare_world.py · phase E: …_geojson.py · one S2 primitive, two questions
The whole pipeline illustrated. Two open Overture datasets feed one S2 index which can then answer two questions: containment (which country, region, and locality contain each POI) and distance (the nearest hospital per locality, via radius-band cell-unions). Both tracks converge into the per-locality aggregate that becomes the leaderboard. Phase tags A–E match the script comments above.

If you want to explore the same kind of question for a different POI type (most isolated locality from a school, a pharmacy, a fire station, a bookstore, a coffee shop) the pipeline generalizes by changing one filter, and runs in a comparable amount of time.

The engineering leader I was coaching: their pipeline is now leveraging S2. They didn’t need to buy more computers.


A Note on Tools: I used AI to assist in generating the vector illustrations and refining the sample data pipeline for this post.