How to archive a website that's shutting down soon
blog.marcua.net
Why not use grab-site?
This is great! It looks like you can even grab WARCs using `wget`: https://golangexample.com/put-a-web-archive-warc-on-an-s3-bu...
I'm curious: what sort of fidelity have you seen in `grab-site`'s rewritten static asset URLs? Having to fix URLs that weren't properly rewritten ended up taking me the most time.
grab-site doesn't rewrite URLs. It captures the full HTTP request and response for each asset and stores them as-is for archival purposes. The quality of any mutations or transformations is governed by the tooling used to consume the generated WARCs.
More detail on the WARC format can be found below:
https://www.loc.gov/preservation/digital/formats/fdd/fdd0002...
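To make the "stored as-is" point concrete, here is a rough sketch of what a single WARC `response` record looks like on disk. This is not grab-site's code; the function name `build_warc_response_record` and the sample response are made up for illustration, and a real crawler writes more headers (payload digests, IP address, and so on) and usually gzips each record.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, raw_http_response: bytes) -> bytes:
    """Wrap an unmodified HTTP response in a minimal WARC/1.0 'response' record.

    Hypothetical helper for illustration only; real tools emit more headers.
    """
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        # Length of the record block, i.e. the raw HTTP bytes that follow.
        f"Content-Length: {len(raw_http_response)}",
    ]
    header_block = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The HTTP bytes are appended untouched; each record ends with two CRLFs.
    return header_block + raw_http_response + b"\r\n\r\n"


# Made-up captured response; note the root-relative asset URL in the body.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b'<img src="/static/logo.png">'
)
record = build_warc_response_record("https://example.com/", raw)
```

The `/static/logo.png` reference stays exactly as the server sent it. Whether it resolves correctly later depends on the replay tool reading the WARC, not on the capture step.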
This is really helpful! Thank you so much!
The easiest and simplest self-hosted option is ArchiveBox: https://archivebox.io/
Thank you! I wish I had found this sooner to try it out. Similar question to the other one I asked on this thread: what sort of fidelity have you seen in `archivebox`'s rewritten static asset URLs? Having to fix URLs that weren't properly rewritten ended up taking me the most time.