SEqlite – Minimal Stack Exchange Data Dump in SQLite Format

seqlite.puny.engineering

28 points by JasonPunyon 2 years ago · 12 comments

panqueca 2 years ago

I liked the idea. I think SQLite is a very powerful way of storing and sharing data, but this site is only about the Stack Exchange network.

Do you know if there is a place that hosts SQLite dumps from other websites? I recently dumped the whole Hacker News API and got thinking about it.

  • JasonPunyonOP 2 years ago

    This site is on a Cloudflare R2 bucket because (and only because) they have free egress. While not datacenter sized, some of these files are large. Just opening up your credit card to 10 cents a gigabyte would be a bad time anywhere else.

  • 0cf8612b2e1e 2 years ago

    Datasette is built to serve SQLite databases, for querying purposes rather than bulk download.

black_puppydog 2 years ago

It's always a good reminder of how small useful data really is. None of these files is "need a datacenter" sized, and yet they contain just about any question you ever wanted to ask about, plus some answers...

0cf8612b2e1e 2 years ago

I built one of these myself that I keep on my laptop. Never had a real need to use it, but glad I have it.

I keep meaning to do the same thing with Wikipedia, although the Wikipedia dumps are so inscrutably named and seemingly undocumented that it seems the organization does not want me to pursue the idea.

  • giantrobot 2 years ago

    I've had the same problem with Fandom née Wikia dumps. Just gigabytes of XML with questionable adherence to schemas. Fandom also has a ton of custom-to-Fandom tags which are a further pain to handle.

    Pulling useful content out of the dumps has been an exercise in frustration. I'm sure I could figure something out if I had a bunch of time to dedicate to the effort.

    If I just had SQLite dumps, they'd be trivial to work with and I'd be much happier with them.

    • 0cf8612b2e1e 2 years ago

      Those sites are so ad-infested, I am amazed they offer dumps of the content at all. Now I am similarly interested in pursuing this idea, but possibly with the exact same amount of tolerance for pain that you have reported.

      • giantrobot 2 years ago

        The ad cancer was part of my original motivation for downloading dumps. I've now found that a lot of Fandom wikis have links to dumps you can't actually download because they're on S3 buckets that require keys or have download limits. It's infuriating, and I think maybe I'm lucky I grabbed a few dumps when I did.

        Fandom is usually the first example I think of whenever I hear the word "enshittification". First Wikia ate all the independent wikis because they offered free/managed MediaWiki hosting. Then they slowly started making Wikia worse until the full Fandomization. Now the site is literally unusable without an ad blocker, and all of that GFDL content on the site is locked behind obfuscation and incompetence. I desperately miss the old Wookieepedia and Memory Alpha.

roozbeh18 2 years ago

Can someone tell me how this is collected into SQLite?

  • wolfgang42 2 years ago

    I wrote a blog post a while back about reading these dumps: https://search.feep.dev/blog/post/2021-09-04-stackexchange

    Presumably they have a script that does something similar to that process, and then writes the resulting data into a predefined table structure.
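
    As a rough illustration of that process, a script along these lines could stream one of the dump's XML files (each record is a single <row .../> element) and write its attributes into a predefined SQLite table. This is a minimal Python sketch; the file name, table name, and column choices below are assumptions, not the site's actual schema.

      import sqlite3
      import xml.etree.ElementTree as ET

      con = sqlite3.connect("stackexchange.db")  # assumed output path
      con.execute(
          "CREATE TABLE IF NOT EXISTS users ("
          "id INTEGER PRIMARY KEY, display_name TEXT, "
          "reputation INTEGER, creation_date TEXT)"
      )

      # iterparse streams the file, so memory stays flat on multi-gigabyte dumps
      for _, elem in ET.iterparse("Users.xml", events=("end",)):
          if elem.tag == "row":
              con.execute(
                  "INSERT OR REPLACE INTO users VALUES (?, ?, ?, ?)",
                  (
                      int(elem.attrib["Id"]),
                      elem.attrib.get("DisplayName"),
                      int(elem.attrib.get("Reputation", 0)),
                      elem.attrib.get("CreationDate"),
                  ),
              )
              elem.clear()  # drop the parsed element to keep memory bounded

      con.commit()
      con.close()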

    • JasonPunyonOP 2 years ago

      Nice post!

      Yep, my process is similar. It goes...

        - decompress (users|posts)  
        - split into batches of 10,000  
        - xsltproc the batch into sql statements  
        - pipe the batches of statements into sqlite in parallel using flocks for coordination
      
      On my M1 Max it takes about 40 minutes for the whole network. Then I compress each database with brotli, which takes about 5 hours.
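
      As a rough Python analog of the batching and flock-style coordination described above (the actual pipeline shells out to xsltproc and sqlite3; the database path, lock file, and posts columns here are illustrative assumptions), batches of pre-parsed rows are inserted by a small pool of worker processes, with an exclusive file lock ensuring only one writer touches the database at a time:

        import fcntl
        import sqlite3
        from itertools import islice
        from multiprocessing import Pool

        DB_PATH = "stackexchange.db"    # assumed output database
        LOCK_PATH = DB_PATH + ".lock"   # advisory lock file, in the spirit of flock(1)
        BATCH_SIZE = 10_000             # matches the batch size mentioned above

        def batches(rows, size=BATCH_SIZE):
            """Yield lists of up to `size` parameter tuples."""
            it = iter(rows)
            while chunk := list(islice(it, size)):
                yield chunk

        def write_batch(chunk):
            """Insert one batch while holding an exclusive lock, so only one
            worker writes to the SQLite file at a time."""
            with open(LOCK_PATH, "w") as lock:
                fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until we own the lock
                con = sqlite3.connect(DB_PATH)    # the posts table is assumed to exist
                con.executemany(
                    "INSERT INTO posts (id, post_type_id, score, title, body) "
                    "VALUES (?, ?, ?, ?, ?)",     # column list is an assumption
                    chunk,
                )
                con.commit()
                con.close()
                # the advisory lock is released when the file handle closes

        def load(rows, workers=4):
            """Fan the batches out to the worker pool."""
            with Pool(workers) as pool:
                pool.map(write_batch, batches(rows))

      A sketch like this would run under an `if __name__ == "__main__":` guard so the pool can spawn its workers; fcntl.flock provides the same Unix advisory locking the flock coordination above relies on.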
