Show HN: A database of everything (over 55M keys)

outpan.com

97 points by outpan 9 years ago · 42 comments

syats 9 years ago

There are some more complex and curated projects out there, among them:

https://www.wikidata.org/wiki/Wikidata:Main_Page

GuiA 9 years ago

Signup page is blanking out for me.

Really curious to try it. It's a really neat idea. Has this been done before? I've never seen anything like it.

The full value of this would likely come from interesting, productive, insightful visualizations of the underlying graph that is being built.

Questions that come to my mind:

- What if you write bots that scrape Wikipedia, Twitter, etc. and output entries from semantic analysis performed on these sources?

- If many people write such bots, how similar would the graphs be? What are the parameters that determine graph overlap?

- Can you use this to tie in to the real world?

"A key can be any unique string such as product barcodes, book ISBNs, email addresses, URLs, domain names, names, phone numbers ..."

Interesting stuff there... A way to make this into an interesting social network is if people curated their own graphs, e.g. of books and webpages and favorite restaurants, and other people could browse these graphs in a read only mode. (perhaps they can clone it to their graphs or so and then start to edit them)

The temporal aspect might be interesting too. Would there be value in seeing how my graph has changed over time? When I was in academia, none of the tools to keep track of the papers I read suited me. I could see this working well for this use case scenario.

(if you could write a Prolog against this, would it have interesting properties?)

I see that a submitter just posted 7630028603780 -> volume -> 75ml (the first number being probably the bar code of some beverage), that's neat.

Someone else just posted the following entry:

5010438013621 -> ingredients -> Water, sugar, mixed fruit juices from concentrate 10% (Grape, blackcurrant, raspberry), acid (Citric acid), Vimto flavouring (Includes natural extracts of fruits, herbs, barley malt and spices), colouring food (Concentrate of carrot, hibiscus), acidity regulator (Sodium citrate), preservatives (Sodium benzoate, Potassium sorbate), Vitamin C, Sweeteners (Sucralose, Acesulfame K).

Now what the site should be doing is converting this to something like:

5010438013621 -> ingredients -> water

5010438013621 -> ingredients -> sugar

5010438013621 -> ingredients -> vitamin C

...
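The split described above could be sketched roughly as follows. This is only an illustration, not how Outpan works: `split_top_level` and `to_triples` are hypothetical names, and lowercasing every value is an assumed normalization. Commas inside parentheses are kept together so that sub-ingredient lists such as "(Grape, blackcurrant, raspberry)" stay attached to their parent ingredient.

```python
def split_top_level(text: str, sep: str = ",") -> list[str]:
    """Split on `sep`, ignoring separators nested inside parentheses."""
    parts: list[str] = []
    current: list[str] = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate unbalanced input
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]  # drop empty fragments


def to_triples(key: str, attr: str, raw_value: str) -> list[tuple[str, str, str]]:
    """Turn one multi-valued entry into one (key, attr, value) triple per item."""
    items = split_top_level(raw_value.rstrip("."))
    # Assumption: values are lowercased so "Water" and "water" collapse.
    return [(key, attr, item.lower()) for item in items]


if __name__ == "__main__":
    raw = ("Water, sugar, mixed fruit juices from concentrate 10% "
           "(Grape, blackcurrant, raspberry), Vitamin C.")
    for triple in to_triples("5010438013621", "ingredients", raw):
        print(" -> ".join(triple))
```

A real pipeline would also need to decide what to do with the parenthesized sub-lists and percentages; this sketch leaves them inside the value.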

(Outpan person/people, I'm in SF, if you want to chat more around coffee, there's an email in my profile)

  • exogen 9 years ago

    > Has this been done before?

    Depends what you mean by "this"! RDF [1] and most of the technology surrounding it and the "Semantic Web" are based on (subject, predicate, object) triples almost exactly like this, where each element is often a URI, and objects are often strings just like they are here.

    It has even taken this idea to the next level, where the statements expressed by such a triple can themselves be given an "anonymous" ID, which can then be used as a subject or object – meaning you can make meta statements about the statement itself, all while still using this simple system of triples.

    There are even entire languages built around querying graphs of such triples: https://www.w3.org/TR/sparql11-query/

    DBpedia [2] is one such project that attempts to encode data from Wikipedia in triples like this; their About page says that the 2014 version of the database had 3 billion triples, so that number is probably much higher now. Here's a preview if you want to see what these triples look like:

    • Homepages of things: http://downloads.dbpedia.org/preview.php?file=2015-10_sl_cor...

    • Genders of things: http://downloads.dbpedia.org/preview.php?file=2015-10_sl_cor...

    etc. You'll notice that RDF predicates are all namespaced by URIs; that way you can unambiguously know in what sense "homepage" and "gender" are used (consider more ambiguous properties like "length"). That means there can be other uses of "homepage", "gender", "length" etc. that mean different things, and those will be namespaced by a different URI.

    Anyway, this Outpan project is obviously a more loose and freeform version of that – but only slightly; RDF is not very strict at all, it's just that people have thought a lot about how to successfully model the entire world's information, and so real-world RDF ontologies end up looking somewhat complicated. I'm not sure if a freeform version like this has been widely attempted before.

    [1] https://en.wikipedia.org/wiki/Resource_Description_Framework

    [2] http://wiki.dbpedia.org/

  • outpanOP 9 years ago

    I think graph overlap is what actually determines what data is "accurate". There are currently bots writing to the database by people who are not connected and I have yet to see the overlap (since the key space is too large for the number of bots right now). I'm excited to see how that plays out.

    I will send you an email :)

    • outpanOP 9 years ago

      As for your question about real world applications: Outpan was only a product database up until last week. It is used in over a hundred apps, some with more than a million users.

      I expect this to work on a larger key space as well. It is interesting to see how the expansion works out in terms of usage patterns.

yawniek 9 years ago

does silicon valley now suddenly discover the semantic web and triplestores :O

if you want really all the data: http://lod-cloud.net/

but still, neat project.

  • outpanOP 9 years ago

    Semantic web is too good of an idea not to iterate on constantly :)

ethernetsalad 9 years ago

As much as I like the idea, I can already see someone called "Drumpf" on the front page linking attributes about Hillary Clinton and Donald Trump to their Twitter accounts and not their names. I guess if you're after "everything" then curation makes no sense but you'll end up with a bunch of nonsense.

  • outpanOP 9 years ago

    Twitter url seems like a strong key since it is referenced in many other contexts.

    As for curation, it is intended to provide examples of what key, attr and values are regardless of their content. This is an experimental feature and might be removed/tweaked...

tisryno 9 years ago

Great concept, but I can see it falling out of line very quickly. On the homepage I spotted "berlin -> country -> germany" followed by "england -> capital -> London".

If you search for the key "germany", there are no results; searching for "london" finds nothing either.

The fluidity of the data is definitely a hindrance, if you wanted to use the dataset you'd have to already know what you are looking for to find the value.
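To make the problem concrete, here is a minimal sketch of a store that indexes entries by value as well as by key, so a term that only ever appears on the right-hand side can still be found. Nothing here reflects Outpan's actual backend or API: `TripleStore`, `add` and `lookup` are hypothetical names, and case-insensitive matching is an assumption.

```python
from collections import defaultdict


class TripleStore:
    """Toy key -> attr -> value store with a reverse index on values."""

    def __init__(self) -> None:
        self._by_key: dict[str, list[tuple[str, str]]] = defaultdict(list)
        self._by_value: dict[str, list[tuple[str, str]]] = defaultdict(list)

    @staticmethod
    def _norm(term: str) -> str:
        # Assumption: "London" and "london" should match each other.
        return term.strip().casefold()

    def add(self, key: str, attr: str, value: str) -> None:
        self._by_key[self._norm(key)].append((attr, value))
        self._by_value[self._norm(value)].append((key, attr))

    def lookup(self, term: str) -> dict[str, list[tuple[str, str]]]:
        """Return entries where `term` is the key and where it is the value."""
        n = self._norm(term)
        return {
            "as_key": list(self._by_key.get(n, [])),
            "as_value": list(self._by_value.get(n, [])),
        }


if __name__ == "__main__":
    store = TripleStore()
    store.add("berlin", "country", "germany")
    store.add("england", "capital", "London")
    print(store.lookup("germany"))
    print(store.lookup("london"))
```

With the two homepage entries loaded, `lookup("germany")` has nothing under `as_key` but returns `("berlin", "country")` under `as_value`, which is the kind of discovery a key-only search cannot offer.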

erikb 9 years ago

Is 55 million keys a lot for tracking "everything"? My expectation was that "everything" would need more than 55 trillion keys.

  • outpanOP 9 years ago

    "a journey of a thousand miles begins with a single step"

    • nenadg 9 years ago

      "a journey of a everything begins with a 55M keys"

    • erikb 9 years ago

      Yeah but it doesn't say "we collect all kinds of data" but "this is a database of everything". I can accept the first, but then one shouldn't promise the second.

jrochkind1 9 years ago

So it's RDF without the globally unique identifiers?

kristopolous 9 years ago

So searches like "Linux" "California" "lincoln" "Hitler" "Disney" and "red" return 0 results

  • outpanOP 9 years ago

    I was driving so couldn't stay on the phone. The number of keys for different concepts is just very large (if not infinite for the sake of avoiding philosophical debates). It will take some time to have enough data to cover even popular keys. Fortunately I have a lot of time :D

  • outpanOP 9 years ago

    Up until last week, outpan was strictly used for gathering data on product barcodes. We are in the process of adding data in other categories.

jitl 9 years ago

What sort of backend storage does this use?

  • Arcsech 9 years ago

    Given that it's effectively key->key->value, I'm guessing Cassandra for the main backend. The data model fits very well, and it would give you the kind of scalability you would need for this sort of thing.

    • smarx007 9 years ago

      Given how ridiculously basic the website is, I think you're right. It should have been a real triplestore with a SPARQL endpoint though.

  • 0xmohit 9 years ago

    Yes, it would be interesting to get some insight into the tech stack.

0xmohit 9 years ago

Is it possible to look up all the attributes for a given value, say Trump?

ivoras 9 years ago

Shame it falls into the trap of natural language ambiguity. So on the front page I see "England -> capital -> London" and at a glance I thought the capital (as in money) flows from England and is accumulated in London.

  • OJFord 9 years ago

    I agree in general (and it's the fault of freely user-definable keys/attributes) - but that's not a great example, since the arrows aren't read as a direction of flow.

    Perhaps that reveals that a forward slash might be a better separator though - like a URI.

firewalkwithme 9 years ago

The header looks like fastmail :)
