Ask HN: Tools for exploratory analysis of 10-100GB graphs

2 points by craggyjaggy 4 years ago · 8 comments · 1 min read


I've never had to work with any dataset bigger than L3 cache, so I'm somewhat out of my depth here. I have a sample of the (relational) data that's about 10GB, with another 80GB available that may or may not be mostly garbage.

In the end I would like to have the graph in a visual interface to zoom and pan through it, and a way to experiment with different clustering algorithms based on some proximity measure (I have an idea for what those might look like).

I'm not a data scientist so I have no overview of the tooling landscape here and find it difficult to filter through endless pages of marketing for vaguely ML/Big Data related products. I'm not looking for an expensive ready-made solution, I do like to hack on things after all :)
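As a starting point for the "experiment with different clustering algorithms" idea, here is a minimal sketch of one classic graph-clustering approach, label propagation, in plain Python with no graph library assumed. The function name and the toy edge list are illustrative, not from the thread:

```python
import random
from collections import Counter, defaultdict

def label_propagation(edges, iterations=10, seed=0):
    """Cluster an undirected graph by label propagation: every node
    starts in its own cluster, then repeatedly adopts the most common
    label among its neighbours until labels stop changing."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    labels = {n: n for n in adj}          # start: each node is its own cluster
    nodes = list(adj)
    for _ in range(iterations):
        rng.shuffle(nodes)                # visit nodes in random order
        changed = False
        for n in nodes:
            counts = Counter(labels[m] for m in adj[n])
            best = counts.most_common(1)[0][0]
            if labels[n] != best:
                labels[n] = best
                changed = True
        if not changed:
            break                         # converged
    return labels

# Two triangles joined by a single bridge edge:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = label_propagation(edges)
```

Swapping in a different proximity measure mostly means changing how the neighbour labels are weighted in the `Counter`; for anything beyond a prototype, a library such as NetworkX or igraph provides tuned implementations of this and many alternatives.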

PaulHoule 4 years ago

At scales far below what you're talking about, people experience grave difficulties making sense of big graphs.

https://cambridge-intelligence.com/how-to-fix-hairballs/

One of my favorite examples is this guy

https://en.wikipedia.org/wiki/Mark_Lombardi

I saw an art exhibit that showed some of the sketches he made, and it was clear that he worked really hard drawing and redrawing each graph; they went from being hairballish to telling a clear story.

You're also very insightful to be talking about the specific scale you're working at, because it matters. Graph workloads can drive you batty because they frequently defeat caches by being very nonlocal.

For your small data set you are in the range where you can get a "big" computer with, say, 64GB or 128GB of RAM and be able to work in RAM. You might be a little disappointed with the performance (it takes a while to touch every memory address in a 128GB machine) but it will be good enough if you're efficient and disciplined.
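To put rough numbers on "it takes a while to touch every memory address" and on why nonlocal access is the real killer, here is a back-of-the-envelope sketch; the bandwidth and latency figures are ballpark assumptions, not measurements of any particular machine:

```python
# Rough time to visit all of a 128GB machine's RAM, two access patterns.
ram_gb = 128
seq_bw_gb_s = 20     # sequential scan bandwidth in GB/s (assumption)
miss_ns = 100        # latency of one random DRAM access in ns (assumption)
line_bytes = 64      # bytes fetched per cache line

# Streaming through memory: bandwidth-bound.
seq_s = ram_gb / seq_bw_gb_s

# Pointer-chasing (typical graph traversal): one cache miss per line.
rand_s = ram_gb * 1e9 / line_bytes * miss_ns / 1e9

print(f"sequential sweep: ~{seq_s:.0f} s")    # ~6 s
print(f"pointer-chasing:  ~{rand_s:.0f} s")   # ~200 s
```

The two-orders-of-magnitude gap between the streaming and pointer-chasing estimates is why graph layout and node ordering matter so much at this scale.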

As an RDF fanatic I'll share that I have handled data sets on the small end of your scale with

https://virtuoso.openlinksw.com/

  • craggyjaggyOP 4 years ago

    I have a million nodes, which I'm confident I can prune to fewer than 100k now that I know I'll have to, each with ~100 edges that I can probably filter down to 10-20. That should get me down to <3GB, which might be more within reach?
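A back-of-the-envelope check of that pruning estimate; the per-node and per-edge byte counts below are assumptions (an adjacency list storing one target id plus a weight per edge), not measurements:

```python
def graph_bytes(nodes, avg_degree, bytes_per_edge=16, bytes_per_node=64):
    """Rough adjacency-list footprint: fixed per-node overhead plus
    (target id, weight) storage for each edge."""
    return nodes * bytes_per_node + nodes * avg_degree * bytes_per_edge

full   = graph_bytes(1_000_000, 100)   # the raw graph from the thread
pruned = graph_bytes(100_000, 20)      # after the proposed pruning

print(f"full:   {full / 1e9:.1f} GB")   # full:   1.7 GB
print(f"pruned: {pruned / 1e6:.0f} MB") # pruned: 38 MB
```

Under these assumptions the pruned graph structure itself is tens of megabytes; the <3GB figure leaves plenty of headroom for node attributes and working copies.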

    • PaulHoule 4 years ago

      Yeah, you might even get away with using neo4j at that scale, which has an API people like even if it doesn't handle bigger graphs well.

      • nikonyrh 4 years ago

        "it doesn't handle bigger graphs well."

        I'd like to hear more. I used it for a prototype several years ago and was quite impressed with the query language, but also with the performance.

      • craggyjaggyOP 4 years ago

        I'll have a look at Neo4j, thank you :)

icsa 4 years ago

How many nodes and edges are in your graphs?

10,000 nodes is an upper limit for most graph visualization tools that I have used.

  • PaulHoule 4 years ago

    I've been doing a project that involves making cyberphysical objects "three-sided cards" and constellations of those cards... This one takes up at least 1/3 of the wall of my office:

    https://gen5.info/$/XQ*42RXF-TLY:$B.8/

    I am getting pulled into making "constellations" that tell a story or describe some domain (say, games related to Killer 7, weapons used by Ukraine and Russia, characters that Tara Strong voiced, ...). I can put a bunch up on the wall in front of a network made of ribbons (I need to start cutting the cards because it really doesn't look good with uniform cards), but I want to make the "cyber" end better.

    Simultaneously, I want to give people a direct-manipulation experience like touching physical objects, but I want to escape the limitations of the physical world. That has me thinking about WebGL, ARKit, and similar things... And the question of how you make a graph visualization system based on 2022 graphics tech is on my mind -- it seems like if you really maxed out the GPU you could visualize big graphs.

  • craggyjaggyOP 4 years ago

    The total is about a million nodes, though I'm fairly sure the most interesting data is a subset of about 50,000 nodes. Each node has on the order of 100 edges. Since that's still a lot I'll have to rethink my plan I'm afraid.
