Libpostal: C library for parsing/normalizing street addresses around the world

github.com

104 points by nateb2022 22 days ago


kleiba - 22 days ago

Relevant? -> "Falsehoods programmers believe about addresses" (https://www.mjt.me.uk/posts/falsehoods-programmers-believe-a...)

Discussed on HN here: https://news.ycombinator.com/item?id=8907301

kerkeslager - 21 days ago

I think fundamentally, no parsing/normalizing library can be effective for addresses. A much better approach is to have a search library which finds the address you're looking for within a dataset of all the addresses in the world.

Addresses are fundamentally unstructured data. You can't validate them structurally. It's trivial to create nonexistent addresses which any parsing library will parse just fine. On the flipside, there's enough variety in real addresses that your parser has to be extremely tolerant in what it accepts--so tolerant that it basically tolerates everything. The entire purpose of a parser for addresses is to reject invalid addresses, so if your parser tolerates everything it's pointless.

The only validation that makes any sense is "does this address exist in the real world?". And the way to do that is not parsing, it's by comparing to a dataset of all the addresses in the world.

I haven't evaluated this project enough to understand confidently what they're doing, but I hope they're approaching this as a search engine for address datasets, and not as a parsing/normalizing library.
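To make the "search, don't parse" idea concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: `KNOWN_ADDRESSES` is a tiny hypothetical stand-in for a real address dataset, and `normalize` and `find_address` are made-up helper names. It says nothing about how libpostal itself works.

```python
import difflib
from typing import Optional

# Hypothetical stand-in for a real address dataset; a real system would
# query an index built from authoritative data instead of a Python list.
KNOWN_ADDRESSES = [
    "781 Franklin Ave, Brooklyn, NY 11216, USA",
    "10 Downing Street, London SW1A 2AA, UK",
    "Tiwi College, Melville Island, NT 0822, Australia",
]


def normalize(text: str) -> str:
    """Lowercase, drop commas, and collapse whitespace before comparing."""
    return " ".join(text.lower().replace(",", " ").split())


def find_address(query: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest known address, or None if nothing is similar enough."""
    by_key = {normalize(address): address for address in KNOWN_ADDRESSES}
    hits = difflib.get_close_matches(
        normalize(query), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[hits[0]] if hits else None
```

The validation question becomes "is there a close enough entry in the dataset?" rather than "does this string fit a grammar?". The `cutoff` value is an arbitrary choice here and would need tuning against real data.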

RobinL - 22 days ago

There are many useful applications of libpostal, and it's an impressive library, but one use I would caution against is address matching, at least as the 'primary' approach.

The problem is that the hardest-to-parse addresses are often also the hardest to match, making the problem somewhat circular. I wrote about this more in a recent blog post on address matching: https://www.robinlinacre.com/address_matching/

shakna - 21 days ago

I somehow doubt this will pass the sniff test of one of my old addresses, which Australia Post successfully delivered to on a weekly basis:

    Third on right of main,
    Tiwi College,
    Melville Island, 0822, AU.

You can try to normalize that... But "Main Road" is in another city. Because I wasn't living in a city. There were no road names. And the 3rd position was an empty plot, not the third house. We had a bunch of houses around a strip of land, a few minutes from the airstrip - the only egress.

jandrese - 22 days ago

Wow, ambitious project. Anybody who has tried to verify addresses can tell you that the staggering number of different formats and conventions around the world makes it an almost intractable problem. So many countries have wildly informal standards, with people putting down just whatever they want because the mailman "just knows".

weinzierl - 22 days ago

In the same vein, there is also Google's excellent libphonenumber for parsing, formatting, and validating international phone numbers.

And because I had no idea before I worked on a project where we had to deal with customer data: many companies also use commercial services for address and phone number validation and normalization.

claytongulick - 21 days ago

Libpostal is great and was a lifesaver for me, but anyone who is interested in using it should be aware that it is NOT lightweight.

IIRC it takes gigs of storage space and has significant runtime requirements.

Also, while it's implemented in C, there are language bindings for most major languages [1].

It's one of those things where it's most likely best deployed as an independent service on a dedicated machine.

[1] https://github.com/openvenues/libpostal?tab=readme-ov-file#b...

degamad - 22 days ago

Previously:

<https://news.ycombinator.com/item?id=18775099> Libpostal: A C library for parsing/normalizing street addresses around the world - 117 points by polm23 on Dec 29, 2018 (25 comments)

<https://news.ycombinator.com/item?id=11173920> Libpostal: international street address parsing in C trained on OpenStreetMap (mapzen.com) 74 points by riordan on Feb 25, 2016 (7 comments)

Ameo - 22 days ago

I used this at a previous company with quite good success.

With relatively minimal effort, I was able to spin up a little standalone container that wrapped around the service and exposed a basic API to parse a raw address string and return it as structured data.
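Roughly the core of such a wrapper, as a sketch and not the code I actually ran: it assumes the Python binding (pypostal), whose `parse_address` returns a list of `(value, label)` pairs. `to_structured` and `parse` are hypothetical helper names, and the HTTP layer is left out.

```python
from collections import defaultdict

try:
    # pypostal, the Python binding; it needs libpostal and its data files.
    from postal.parser import parse_address
except ImportError:
    parse_address = None


def to_structured(pairs):
    """Group (value, label) pairs into {label: [values]}; labels may repeat."""
    grouped = defaultdict(list)
    for value, label in pairs:
        grouped[label].append(value)
    return dict(grouped)


def parse(raw):
    """Parse a raw address string into structured data via libpostal."""
    if parse_address is None:
        raise RuntimeError("pypostal is not installed")
    return to_structured(parse_address(raw))
```

Values are kept in lists because the same label can appear more than once in a single parse.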

Address parsing is definitely an extremely complex problem space with practically infinite edge cases, but libpostal does just about as well as I could expect it to.

ttw44 - 21 days ago

When I was first getting into web development a year ago, I was making forms that took addresses. Coming from a C and C++ background, I kept asking, what if they lived in a specific country? How can I make my database truly safe? What is the best way to store all these addresses? I immediately gave up on that effort. Very impressive.

gorgoiler - 21 days ago

I have a real soft spot for these codifications of everyday things. A lot of us do. See also tzdata, GNU units, pluralize(noun), humanize(timestamp), and SPICE astronavigation. And yes, locating Mars in the night sky is indeed an everyday thing!

What are some others?

alganet - 21 days ago

Having used it in the past, I can firmly say it performs better than regex.
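For anyone who hasn't tried the regex route, a deliberately naive sketch of what it tends to look like. The pattern and the `regex_parse` helper are made up for illustration and assume a "number street, city, postcode" layout.

```python
import re

# A deliberately naive pattern for "<number> <street>, <city>, <postcode>".
# It illustrates the regex approach; it is not a recommended parser.
NAIVE_ADDRESS = re.compile(
    r"^\s*(?P<house_number>\d+)\s+(?P<road>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*(?P<postcode>[\w ]+?)\s*$"
)


def regex_parse(text):
    """Return a dict of named fields, or None when the pattern does not fit."""
    match = NAIVE_ADDRESS.match(text)
    return match.groupdict() if match else None
```

It handles something like "781 Franklin Ave, Brooklyn, 11216", but returns None for the Tiwi College address upthread, because that one has no leading house number and more comma-separated parts than the pattern expects.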