Help preserve the internet with ArchiveTeam's Warrior
selfhostedheaven.com

Yes! Archiving is important. We have already seen so much online history go down the drain or survive only by accident.
Large institutions like the Internet Archive are doing an admirable job, but there is a lot of content that they cannot and will not cover. So we will definitely (also) need volunteer-based archival for the foreseeable future.
18TB drives are ~$300 apiece right now; go buy one and help our collective memory!
ArchiveTeam sends its archives to the Internet Archive, but the two are not affiliated. I don't think you confused the two, but I mention this every time just in case.
The Warrior is a small Docker image that downloads files via your ISP connection and forwards them to the AT servers. No need for large drives.
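If you want to try it, running the container looks roughly like this (a sketch from memory; the image name and port may have changed, so check the ArchiveTeam wiki before copying):

    # Run the Warrior in Docker; the web dashboard is served on port 8001.
    # Image name and port are from memory -- verify against the wiki first.
    docker run -d \
      --name archiveteam-warrior \
      --restart unless-stopped \
      -p 8001:8001 \
      atdr.meo.ws/archiveteam/warrior-dockerfile
    # Then open http://localhost:8001 to pick a project and set a nickname.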
For my personal use, I have a home server install of https://github.com/ArchiveBox/ArchiveBox and for that one you may want to get some storage, though I prefer to host its data on the SSD for performance reasons (my archive grows approx. 5000 items or 150GB per year). It's like a private Internet Archive on your home network.
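If anyone wants to try ArchiveBox, my setup boiled down to roughly this (a sketch; the data directory and port are whatever you pick, and the CLI may have changed since I installed it, so check the project README):

    # Rough ArchiveBox setup sketch.
    pip install archivebox
    mkdir ~/archivebox-data && cd ~/archivebox-data
    archivebox init                          # creates the index and config here
    archivebox add 'https://example.com/some-article'
    archivebox server 0.0.0.0:8000           # browse the archive in a web UI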
Thanks, it's always good to point that out.
There's a surprising number of tools that can submit data to the Internet Archive (and get data back from there). Even wget can produce WARC archive files.
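For example, something like this (the item identifier and filenames are made up; the `ia` CLI comes from the internetarchive Python package and needs `ia configure` first):

    # Crawl a site and write a WARC alongside the normal download.
    wget --warc-file=example-site --mirror https://example.com/

    # Upload the result to an Internet Archive item with the 'ia' CLI
    # (pip install internetarchive; run 'ia configure' once for credentials;
    # the identifier below is hypothetical).
    ia upload my-example-warc-item example-site.warc.gz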
While the warrior downloads content via your line (a bit like a residential proxy network), I do think it's important that we decentralize the storage as well.
Just without the crypto mafia/drug traders/investors.
AFAIK you can use IPFS (& clusters[0]) without relying on the crypto parts of that ecosystem. That ought to fit rather well with the use case.
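As a rough sketch of what that could look like with plain IPFS and no tokens involved (the directory name is just an example):

    # Add an archive directory to a local IPFS node (requires a running daemon).
    # 'ipfs add' pins the content by default so it won't be garbage-collected.
    ipfs add -r ./warc-archives/          # prints a content ID (CID)

    # Other volunteers can pin the same CID to replicate the data:
    ipfs pin add <CID>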
Yes, there are some really interesting projects, also in the ML replicability space.
One really nice approach is the DAT project [1]. The protocol [2] looks pretty sensible and useful. Unfortunately, the tooling has been in such a state of permanent flux (i.e. perpetual deprecation) that I've never bothered to invest much time.
The last time I tried to do anything with or for the Archive Team, it was mostly a "just watch us work" sort of deal.
The tools couldn't be built without additional knowledge that wasn't published anywhere, because there had been drift between what was published and what was actually working, and those changes never got folded back in. And there were multiple versions and variants of the tools, with different teams using different versions or variants.
And once you built the tools, you couldn't get your Warrior into the list to be used, although you could always run your systems separately.
It's not like you could sign up for a SETI@Home type initiative and just let your equipment run.
I understand why they work this way. It's a very insular crowd, and new people and resources seem to disappear as quickly as they showed up.
So, they let you watch.
If you stay around long enough (months? years?), then they might let you start participating. But I wasn't willing to wait that long.
I kind of wonder how we can make it searchable again. Is this included in this archiving effort?
In any case, wonderful work.
There is a standard set of tooling for indexing archives: CDX files. [1]
They index WARC archives and can be used to quickly find records. You can build on top of this (and some systems do) to make a proper search front-end.
But in general, these archives are NOT geared towards full-blown search because it would be pretty expensive to keep the indexes in hot cache. Plus you would need to deal with historical versions of records, which is not normally done in search UX.
[1] https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem#CD...
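As a rough illustration (the indexer name is from the pywb/webrecorder ecosystem; the exact fields in the output depend on which tool you use):

    # Build a CDXJ index for a WARC file
    # (pip install cdxj-indexer; check its README for current options).
    cdxj-indexer example-site.warc.gz > example-site.cdxj

    # Each line maps a canonicalized URL + timestamp to the record's
    # offset/length inside the WARC, so lookups don't need a full scan:
    grep 'example.com/some-article' example-site.cdxj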
Ah, is the WARC format the reason it's called 'Warrior'? It seems like a very strange name for an archival program.
ArchiveTeam seems very guerrilla in their operations.
I always imagined the Warrior as a camo-faced archivist operating under cover of darkness, preserving data even in the most hostile Yahoo-occupied territory.
Thank you for that information!
The Internet Archive has a huge noise-to-signal ratio, very much in favour of noise. I admire the effort and regularly make use of the quality archives. However, I wonder if, much like Bitcoin, tremendous amounts of energy and resources are being put towards very little of value.
I disagree; the unfiltered, high-noise nature is what makes it valuable. Curation is a bias.
If someone wants to dive into any topic in the archive 30 years from now they will have access to everything, not access to what some of us deem 'worthy' of curating.
I agree that it makes it harder to find things but I also see the value of IA as a time capsule.
Yes, curation is very valuable, but it needs to be a layer on top of an uncurated source.
I enjoy using Open Library to re-read obscure middle-grade books from the 1950s to the 1990s, and there are some obscure DOS games I want to revisit. It's hard to find what I want sometimes, but only having access to curated lists would change it from "hard" to "impossible" in many cases.
Tools to separate signal from noise will also get better in the future. You can imagine that in 100 years time, using a super duper AI search engine will perform far better than whatever some human decided to categorize stuff as today.
Storing data is cheap and gets cheaper all the time. This isn't a great comparison, but the Internet Archive's 2019 revenue is listed as $36.7 million on Wikipedia (https://en.m.wikipedia.org/wiki/Internet_Archive).
Hard to compare Bitcoin directly, but its market cap was around $1 billion in 2013 and cleared $1 trillion for the first time a little over a year ago.
I get that this article is about people using their personal computers to help archive things, but I don't think the Internet Archive is ever going to use resources even remotely as aggressively as cryptocurrencies do, unless they somehow turn all their archiving into a cryptocurrency.
Value is really hard to predict, but as someone who does a lot of research in archives, I can say there is no such thing as too much information. Especially if you want the views of several parties or organizations. In anthropology and history research, this work (archiving) can be of tremendous value.
Usually it's hard to say whether something is valuable now; only time can tell.
Just to point this out: on a technical level, the Internet Archive has very (!) little overhead.
Crawled data is de-duplicated at the request level, and response payloads can be individually gzipped, on top of per-archive-file compression. [1]
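You can poke at this yourself with the warcio CLI (a sketch; the filename is just an example):

    # Inspect a WARC with the warcio CLI (pip install warcio).
    # Each record in a .warc.gz is its own gzip member, so readers can
    # seek directly to a record's offset without decompressing the rest.
    warcio index example-site.warc.gz
    # Duplicate captures show up as 'revisit' records that reference the
    # original response instead of storing the payload again.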
> tremendous amounts of energy and resources are being put towards very little of value.
I doubt it takes tremendous energy or resources. What percentage of the overall internet's energy/resources is used by the IA? An insignificant, minuscule amount.
The real problem with the IA is that they are constantly attacked by institutions, corporations, etc. to get content removed.
I think the real problem is a bit deeper: unorganized raw data itself is of very low value, but it becomes much more valuable when humans process, categorize, and interpret it through a higher-level system of reason. We're doing a lot of the former but not the latter: we have so much data but have no idea what it all means as a whole.
Libraries aren't just "a bunch of books piled up on shelves"; they're a historical invention, built and perfected over centuries, in which books are extensively coded and catalogued via a complex hierarchical system. We are dealing with far more data than in the past (not just books but posts and comments from all over the world, as well as new kinds of media such as images and videos), and we also have conceptual and technological inventions that previous librarians didn't have access to (hyperlinks, databases, graph theory, machine learning, etc.), so the current state of data management begs for a major overhaul. (For example, the best we currently have for querying and searching massive amounts of data is Google, and it is incredibly primitive! And even then we lament that its quality has decreased in favor of SEO-maximizing content.) So much raw data is created every day, and we seem to fail to understand and interpret almost all of it; I see this as one of the major historical crises we face today. Instead of just storing data, we must find radically new methodologies and tools to search, filter, and explore it, and that poses a philosophical problem (of semiotics, linguistics, and hermeneutics) as well as a technological one.
Can you provide some details on this? I'm curious how noise and signal are defined and measured in this case.
I disagree with the OP. This is historical data and includes all kinds of interesting content. Even if severely uninteresting today, it may still be really valuable 40 years from now as part of research into colloquial language, design, trends, the influence of events, etc.
Same reason why notes taken by random people 250 years ago are really valuable to historians today, even if it's just a to-do list.
I would argue that archive.org and saving the legacy of the internet are a far more important use of energy than making up imaginary digital-currency pyramid schemes.
I've been having fun with this post all day, but now I kind of need to know: Can you give examples of noise on the Archive?
Unlike the Archive, the "value" of Bitcoin can be measured: Today's market cap of BTC is $839.5B
>Today's market cap of BTC is $839.5B
Or zero... it depends on who wants to exchange it for real stuff.
Well, if you happen to have some bitcoins that you are willing to sell to me for less than their "market value" today, then please, get in touch with me...
The same goes for any other money/not-money out there... If anyone has gold/silver/diamonds that they want to get rid of for a price lower than the market value, then again, please get in touch with me...
If you think you can compare the trust people have in gold to the trust they have in Bitcoin, you're in a massive bubble.
Make it an easily installable/runnable Windows application and it will spread like wildfire.
If only it were that easy. To make distributed archiving as high-quality as possible, you need environments that are as reproducible as possible, which is why the "official" way of participating is to run a virtual machine instead of running directly on the host.
Not sure why this 3rd party is the submission site rather than the official page, which is this: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
Has a couple of different installation methods as well.
Yep, using VirtualBox is a rather easy way to get the Warrior running!
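For headless servers, importing the appliance from the command line is roughly this (the .ova filename is a placeholder; grab the current one from the wiki, and the imported VM name may differ, so check with `VBoxManage list vms`):

    # Import the Warrior appliance and start it without a GUI.
    # 'archiveteam-warrior.ova' and the VM name are placeholders.
    VBoxManage import archiveteam-warrior.ova
    VBoxManage list vms
    VBoxManage startvm "archiveteam-warrior" --type headless
    # With the appliance's default port forwarding, the dashboard should
    # then be reachable at http://localhost:8001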
Why even require that? If the data in question is available over HTTP, it should be as easy as opening a page from the relevant origin in a browser tab, optionally opening a second tab for a "Warrior Dashboard", then invoking a bookmarklet on the former to slurp up data by XHR, etc. (If it's necessary to cross origins as the thing roves around, the dashboard can alert you to this while it continues doing what it can with the first origin. Just have the human return to the dashboard from time to time and repeat the second step to run as many in parallel as they want.)
Full archival to the standards required by the Internet Archive means capturing full, unmodified headers and unmodified content. That tends not to work well with modern browsers; Chrome and Firefox both fail at this currently. Someone is looking into a kind of modified Firefox to help with this, but that's just not how this system works today. Archive.org does have an API of sorts where you say "hey, archive this URL" and a little worker on the backend goes and does it.
What Archive Team does is on a much more massive scale, like SETI@home levels of scraping data across the internet. At almost every point we have had to build custom tools to make sure the output meets the needs of our archival efforts.
> full archival to the standards required by the Internet Archive means capturing full, unmodified headers
Sure, this would not be a solution for the Wayback Machine, but would be adequate[1][2] for lots of non-Wayback collections (of the sort that Archive Team is associated with).
Similar: github.com/InternetArchive/warcprox
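For anyone who hasn't used it: warcprox is a WARC-writing MITM proxy, so whatever you browse or crawl through it gets recorded. Roughly (flag names are from memory, verify with warcprox --help):

    # Install and run warcprox, then point a browser or crawler at it as
    # an HTTP(S) proxy; responses get written out as WARC records.
    # Flags below are from memory -- double-check 'warcprox --help'.
    pip install warcprox
    warcprox --port 8000 --dir ./warcs --gzip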
That would be awesome, do you think you could write that?
I'd definitely be interested in working on getting as close as possible if the grant money were to appear.
The Warrior is great for the community effort, but I wish someone would put some work into a modern local site archiver. HTTrack just doesn't cut it anymore.
Oh jeez yeah. I've been going through https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-... the last few days and I've concluded that none of 'em are appropriate for someone with my level of software ineptitude.
For local archiving, I've been working on my own solution that's simply a background process running in your systray on Windows: https://irchiver.com/
There are some philosophical differences from ArchiveBox: 1) I'm more about automatically archiving every web page rather than the curation approach; 2) I prefer full-resolution screenshots over the actual source of the web page, so you save what you actually saw (it works with dynamic pages, pages behind logins, etc.); 3) I think full-text search is a key part of an archive, so I've implemented that.
wget --recursive --page-requisites --adjust-extension --convert-links --no-parent https://YOURWEBPAGEHEREX.com
NO "--convert-links" if you want a "pure" non local browsable copy.
Yes yes fine, and then I get throttled to 2 bytes/sec by the server. So I did some user-agent hijinks and set my delay to like 5000msec and that helped for a while, but my machine crashed and when I went to resume the task I was throttled again.
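For what it's worth, the flags I'd reach for in that situation look roughly like this (the user-agent string and delay are arbitrary examples):

    # Politer mirroring: wait between requests, retry a few times, and
    # send a browser-like user agent. Values here are just examples.
    wget --mirror --page-requisites --adjust-extension --no-parent \
         --wait=5 --random-wait --tries=3 \
         --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
         https://YOURWEBPAGEHEREX.com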
>but my machine crashed
Maybe it's not the servers that are throttling you, then ;)
wget will exhaust all available RAM on a long enough crawl.
No, I crawled many multi-gigabyte sites with my Raspberry Pi 2 for days.
I've had memory exhaustion (on a 4GB system) after I think about 600GB in a single crawl. Splitting it into multiple crawls is of course better.
That was a site specifically set up to handle large collections of files, though.
There's github.com/ArchiveTeam/grab-site, but unfortunately it's not maintained very well.
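Basic usage is still pretty simple if you want to try it (from memory; install steps and dashboard details are in the README):

    # Start a recursive crawl that writes WARC files as it goes.
    # (Install per the grab-site README, usually into its own virtualenv.)
    grab-site 'https://example.com/'

    # Optional: run the web dashboard to watch and tweak running crawls.
    gs-server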
How likely is it that you end up downloading child porn on their behalf? In other words, how well-curated or specific is the list of download jobs your node gets assigned? If it's something like "just grab everything from this blog platform", I guess the chances are not zero.
I think you would be more likely to win the lottery without playing.
That type of content has long since moved from the clearnet to the darknet. I would be extremely surprised if that type of content could still be found on the clearnet. But I could be wrong.
However, if you're in the US, loli hentai is going to be a risk and a legal headache for sure: https://www.shouselaw.com/ca/blog/is-loli-illegal-in-the-uni...
As far as I'm aware, maybe excepting Australia (?), in the rest of the world that type of content is not something they'll classify as child pornography; you'll just get a few sketchy looks.
>That type of content has long moved from clearnet to the darknet.
A fraction of it.
>I would be inexplicably surprised if that type of content can be found on the clearnet
That kind of content is a single internet search away.
My experience has shown that list to be extremely well-curated. See https://wiki.archiveteam.org/#Warrior-based_projects for the current list.
Though if you join the Reddit archival project, all bets may be off. But that's not ArchiveTeam's fault, I guess.