Show HN: Open-source text-to-geolocation models
Yachay is an open-source community that works with the most accurate text-to-geolocation models on the market right now.

This is _really_ cool. Early in the pandemic I released a local news aggregation tool that aimed to aggregate COVID-related content and score it for relevance using an ensemble of ML classification models, including one that would attempt to infer an article's geographic coordinates. Accuracy peaked at about 70-80%, which was just not quite high enough for this use case. With a large enough dataset of geotagged documents I'm pretty sure we could've improved that by another 10-15%, which would've likely been "good enough" for our purposes. But one of the surprising things I took away from the project was that there's no well-defined label for this category of classification problems, and as a result there are few datasets or benchmarks to encourage progress.

Thanks! COVID is a great example of the use case, and we agree problems like this need more attention - we've shared some data already and will continue to share more with the public to encourage collaboration. Hope you will find something useful there for your future projects :)

This does look interesting, but as other comments have pointed out, without data or weights it's not clear how well this works. The training notebook seems to suggest it is not actually improving all that much on the training data.

As mentioned in the response below, we have posted a challenge with a corresponding data set of ~500k tweets from 100 regions around the world - https://github.com/1712n/yachay-public/tree/master/conf_geot... We are working on adding more data as well - feel free to create a GitHub issue if there's more you need; we're doing everything we can to help developers here :)

Depending upon your use case, you can get pretty good results by using spaCy for named entity recognition, then matching on the titles of Wikipedia articles that have coördinates (see the sketch after this thread).

Agreed. That said, more often than not, as mentioned in the comment above (the COVID use case), we'd look for a higher recall value in the predictions - there, NER, although helpful, wouldn't be our go-to solution. This is exactly the reason why we open-sourced the infrastructure and are rolling out the data.

Tried this in the past; it's too limited... There are too many ways certain locations can be referred to. Take: New York City, NYC, NY, New York, NYCity, and so on...

Wikipedia handles “New York City” and “NYC” as intended. “NY” and “New York” are ambiguous to both machines and humans (are you referring to the city or the state?), and if you have a resolution strategy for this, then Wikipedia gives you the options to disambiguate. I’ve never seen “NYCity” used by anybody.

If you start processing web articles on the scale of millions, you'll be surprised by how creative people can be. Not talking about tweets, just news and blog articles.

Not surprised, just not relevant. The criterion here is “you can get pretty good results”, not “you must be able to process millions of articles without failure”.

If a method is not generalizable to the entire dataset, it's not that useful. When processing text at large scale, the usefulness of heuristic approaches like the one we're discussing diminishes rapidly.

> If a method is not generalizable to the entire dataset, it's not that useful.

No, in many situations, something doesn’t have to be perfect to be useful.

Again, I think you are missing the original point being made:

> Depending upon your use-case, you can get pretty good results by…

You seem to be responding as if I said:

> For all use-cases, you can get flawless results by…

Pointing out that this is not perfect is irrelevant to the point I was making. “Good enough” is usually good enough.
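A minimal sketch of the spaCy + Wikipedia-titles approach debated above, assuming spaCy's en_core_web_sm model is installed and a gazetteer mapping Wikipedia article titles to coordinates has already been extracted from a dump. The WIKI_COORDS dict and guess_coordinates helper are hypothetical and only illustrative; they are not part of the Yachay repo:

    # Hypothetical sketch of the "NER + Wikipedia titles" heuristic described above.
    # Assumes en_core_web_sm is installed (python -m spacy download en_core_web_sm)
    # and that a title -> (lat, lon) gazetteer has been built elsewhere.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Toy gazetteer; a real one would hold millions of Wikipedia titles with coordinates.
    WIKI_COORDS = {
        "New York City": (40.7128, -74.0060),
        "London": (51.5074, -0.1278),
    }

    def guess_coordinates(text):
        """Return (entity, (lat, lon)) guesses for place-like entities in the text."""
        doc = nlp(text)
        guesses = []
        for ent in doc.ents:
            # GPE = countries/cities/states, LOC = other locations, FAC = facilities
            if ent.label_ in {"GPE", "LOC", "FAC"}:
                coords = WIKI_COORDS.get(ent.text)
                if coords is not None:
                    guesses.append((ent.text, coords))
        return guesses

    print(guess_coordinates("Protests continued in New York City on Tuesday."))

In practice the hard part is the disambiguation step the thread argues about (NY the city vs. NY the state); the lookup above simply skips any entity it cannot resolve to a title.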
There are no weights and no data, only some code to create a PyTorch character-based network and train it. Will you provide weights or data in the future? Do you have any benchmark against Nominatim or Google Maps? I think something like this (but with more substance) could be helpful for some people, especially in the social sciences.

Yeah, I was expecting a general-purpose model or a dataset to train one. The idea is great, but - as it currently stands - of no use to most people.

We have posted a challenge with the corresponding data set of ~500k tweets from 100 regions around the world - https://github.com/1712n/yachay-public/tree/master/conf_geot...

This would have been tremendously useful in a project I worked on a few years ago. It's a really difficult task to parse text at large scale with accurate geographical tagging.

What was the goal of the project?

Probably should combine this with DELFT.

Has anyone got this working? Curious if someone could PR a dependencies file that can be used to run this.

We have updated the wiki and published the dependencies :) Feel free to create a GitHub issue for any further requests, or ask away in our Discord - https://discord.gg/msWFtcfmwe
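For readers wondering what a "character-based network" for this task might look like, here is a minimal, hypothetical sketch of a character-level model that regresses latitude/longitude from raw text. It is only an illustration of the general idea, not the architecture actually used in the Yachay repo; the class name, byte-level encoding, and layer sizes are all assumptions:

    # Illustrative only: a character-level model that predicts (lat, lon) from text.
    # Texts are assumed to be encoded as byte values (0-255) and padded to a fixed length.
    import torch
    import torch.nn as nn

    class CharGeoModel(nn.Module):
        def __init__(self, vocab_size=256, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 2)  # predict (latitude, longitude)

        def forward(self, char_ids):
            x = self.embed(char_ids)        # (batch, seq_len, embed_dim)
            _, h = self.encoder(x)          # h: (1, batch, hidden_dim)
            return self.head(h.squeeze(0))  # (batch, 2)

    model = CharGeoModel()
    batch = torch.randint(0, 256, (4, 140))  # four toy "tweets" of 140 characters
    print(model(batch).shape)                # torch.Size([4, 2])

A model like this would typically be trained against geotagged text (e.g. the ~500k-tweet challenge data mentioned above) with a distance-based or coordinate regression loss.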