
If you are in certain internet circles, you may have seen people getting excited by the recent announcement that the Unfiction forum archive is now online. (You can find the forums back at forums.unfiction.com after all these years.) When Alternate Reality Games (ARGs) started taking off, Unfiction rose as a rallying place. The forums that Unfiction provided are an important time capsule of the era. They cover not just the games and puzzles, but the attitude and philosophy that players brought into the experiences. These games had no formal rules. The community brought their own structure and order. Alas… the ongoing treadmill of website maintenance and security upgrades combined with a fatal server crash resulted in the forums dropping offline without a viable way to revive them.

About a year ago, in January of 2024, Laura Hall approached Sean Stacey and me with an idea. Sean is the owner of Unfiction. I was a moderator, had run a half-dozen ARG wikis back in the day, and had been working on scripts to freeze a copy of MediaWiki into a static capture as a way to step off that upgrade treadmill and move down to cheaper hosting. She wanted to know if we could get the forum archives back online. There was some discussion about how best to make that happen, as well as some life-gets-in-the-way ball-dropping, but eventually we worked out that I'd try to get the forums running long enough to take a static snapshot of all the pages. And that's what I did. Eventually.

This blog post will serve as a loose recounting of the technical details behind that capture. It includes the steps and missteps. Along the way, I only took loose notes. A lot of this is from memory. You can blame anything conflicting or deviating from reality on faulty memory. There are two parts to the process: getting the forums running followed by taking the snapshot.

Running an Outdated phpBB

The initial work started in May of 2024. This wasn’t a full-time project. I have a full-time job. I also help run a nonprofit that produces monthly events. I spend a couple nights a week with a Vespa club. I have a century-old house that needs maintenance. Not to mention a certain amount of distraction by shiny things, at the expense of longer ongoing projects. I probably worked on the Unforums a couple of hours at a time every few weeks.

In order to take a static snapshot of the forums, I had to get them running, at least temporarily. This wasn't an easy task. The version of Ubuntu of the era (the one we think the server ran on) was 10.04. I forget how I managed to find or make an AMI of this fifteen-year-old operating system, but getting it running in the AWS cloud was a no-go: "ClientError: Unsupported kernel version 2.6.32-38-server". It was just too old for the hypervisor. I switched to running on a local VM, but quickly realized that the package repository is long dead, returning 404 errors. So. If I have no packages, then I'm going to have to compile the server software from source. And if I'm compiling from source, the specific Linux distribution and kernel version shouldn't matter much for these specific services. Let's start with something modern and work backward in versions where required.

I spun up Amazon Linux 2023, thinking I'd try everything on there as-is, with a modern MySQL, modern Apache, and modern PHP. It would be a miracle if it all magically worked, but it would also be a massive time-saver. The database backup mostly loaded fine. There were some UTF-8 / code-page issues. I forget where exactly I solved them: the database config, the PHP config, the phpBB code, or a combination. Apache was just fine, since it only acts as a conduit between the outside world and PHP. The PHP interpreter was absolutely and horribly angry at the ancient phpBB codebase. With all the customizations the Unforum had, upgrading to a modern phpBB seemed like a much more difficult task than downgrading the PHP version. (I think Sean had done some earlier research and found that some of the plugins and/or themes simply were no longer compatible nor available for the latest phpBB versions.)
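
For what it's worth, the import step itself boiled down to something like the following. This is a sketch, not the exact commands: the database name, dump filename, and character-set choices are all placeholders, and the real encoding fix may have lived in the database config, the PHP config, or the phpBB code.

    # Illustrative only: create the database and load the old dump, pinning the
    # connection character set so the import doesn't silently re-encode things.
    mysql -u root -p -e "CREATE DATABASE unfiction CHARACTER SET utf8 COLLATE utf8_general_ci"
    mysql -u root -p --default-character-set=latin1 unfiction < unforums_backup.sql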

I lined up the Ubuntu release dates against the PHP release dates, and PHP 5 seemed the most likely match for the era. By comparison, we're on PHP 8 today. Version 8 has added a lot of modern programming concepts (proper classes, named arguments, and type checking) and has also removed some earlier language features, including iterating through lists with the each() function, which was a pretty common PHP practice back in the day.
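
As a tiny illustration of the sort of thing that breaks, here is the classic PHP 5-era iteration idiom, run from the shell; it works under PHP 5 but dies under PHP 8, where each() no longer exists (the array contents are obviously made up):

    # Classic PHP 5 pattern; on PHP 8 this fails with "Call to undefined function each()".
    php -r '$a = array("one" => 1, "two" => 2); while (list($k, $v) = each($a)) { echo "$k=$v\n"; }'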

I ended up manually compiling PHP 5, including MySQL support. There were also some required module dependencies that the forums needed, such as libxml and ImageMagick (for post attachments). I then manually loaded the PHP module into the Apache config. The communication between the two used a plain CGI connection rather than FastCGI; a single-threaded web spider doesn't need the high transactions-per-second that FastCGI provides. I then had to lock down external access. I'll let you guess how many CVEs there are against an ancient version of PHP. I also set the hostname and faked a DNS entry for forums.unfiction.com pointing to 127.0.0.1.
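
I didn't keep the exact build recipe, but it was approximately the following. The configure flags, install prefix, and apxs path are from memory and should be treated as a sketch rather than a known-good formula.

    # Build PHP 5 as an Apache module with MySQL and libxml support (flags approximate).
    ./configure --with-apxs2=/usr/bin/apxs --with-mysql --with-libxml-dir=/usr --prefix=/opt/php5
    make && sudo make install

    # Fake the DNS entry so forums.unfiction.com resolves to the local web server.
    echo "127.0.0.1 forums.unfiction.com" | sudo tee -a /etc/hosts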

The Unforum code wasn’t stock phpBB. There were a lot of customizations as well as a custom theme template. I had to make a couple of tweaks for PHP 5, making me wonder if this was originally running on PHP 4. Fortunately, there were not too many tweaks. This is also where I might have made some UTF-8 related tweaks to correlate with the database.

I then hacked the phpBB permissions model and template. If the bulletin board recognizes the incoming connection as a bot, such as Google making a capture of the website, it assigns a “bot” permission to the session. This prevents it from getting to the login pages, making posts, and similar operations. I forced it to assume every connection was always a bot.

I also had to make some template changes. The things that come to mind are:

  • Changed the donation box to describe the website restoration project.
  • Changed the rotating tips box to text marking the snapshot date.
  • Disabled the sticky announcements. (“Help save Unfiction!”)
  • Removed various links to create an account, log in, preferences, edit profile, private messages, etc.
  • Probably some other tweaks I can no longer remember.

Some file attachments were missing from the backup we had. This might have been because the database dump and the filesystem backup were taken at different times, or it could have been corruption in the backups. We actually combined two different backups and found that some attachments were present in one but not the other, and vice-versa. The union of both backups covered most of the file attachments, but there were still some referenced in the database that didn't exist in either backup. Stock phpBB kicked back a 404 error for missing attachments. I wanted something a little more friendly and descriptive of the situation, so I hacked the PHP code to present a visible apology that the file isn't present in the archives instead of an error page.
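
Merging the two attachment sets was conceptually simple; it was something along these lines, with made-up directory names:

    # Union of the attachment directories from the two partial backups.
    # Directory names are hypothetical; files present in both come from the first.
    rsync -a backup_one/files/ merged/files/
    rsync -a --ignore-existing backup_two/files/ merged/files/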

Finally, I got the site in a state that we all agreed looked good enough to encase in amber.

Taking a Static HTML Snapshot of an Outdated phpBB

I use wget for a lot of web-client tasks, including taking snapshots of pages and sites. It has a very robust set of “mirror” features. It can spider its way through a collection of pages. It can parse a page it’s downloaded and also retrieve its dependencies (images, CSS, JavaScript). When saving pages to disk with specific filenames (such as adding an “html” extension or using Windows-friendly filenames), it will rewrite any links that point to those pages. Overall, wget is a great Swiss Army knife.

The command I used to capture the site looked like this:

wget \
    --user $MY_USERNAME \
    --password $MY_PASSWORD \
    --mirror \
    --page-requisites \
    --recursive \
    --no-parent \
    --quota=0 \
    --no-verbose \
    --adjust-extension \
    --convert-links \
    --restrict-file-names=windows \
    http://forums.unfiction.com

Pro Tip: run your capture in a tmux session, so that it can continue running in the background, even after you disconnect, and you can later pop in and check up on it.
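
If you haven't used tmux before, the workflow is only a few commands:

    tmux new -s capture      # start a named session, then launch wget inside it
    # press Ctrl-b then d to detach, leaving the capture running
    tmux attach -t capture   # later: reattach to check on progress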

I kicked off the first capture and crossed my fingers. I checked back every day, since I wasn't sure how long it would take. Unfortunately, it never ended. There was some trail you could take through the website that constructed URLs with an extra slash between the folder and the filename. For example, forums/viewpost.php would eventually become forums//viewpost.php and then, on a later pass, forums///viewpost.php, and so on. Each time through, it added another slash. Although the web server doesn't care and normalizes multiple slashes into a single one, wget, acting as a client web browser, treats these as uniquely separate pages. I casually poked around the site, looking for where this extra slash was being constructed, but ultimately hacked a fix into the Apache server using mod_rewrite to detect multiple slashes and force a redirect to a single slash. That made wget happy with the situation.
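
I no longer have the exact rule, but it was in the spirit of the snippet below, appended to the Apache config (the config path is a guess). It collapses one doubled slash per request via a redirect, and since wget follows redirects, repeated doublings eventually settle on the canonical single-slash URL.

    # Sketch of the mod_rewrite fix; collapses "//" in the request path, one pair per redirect.
    cat <<'EOF' | sudo tee -a /etc/httpd/conf/httpd.conf
    RewriteEngine On
    RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
    RewriteRule . %1/%2 [R=301,L]
    EOF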

So I kicked off another capture and crossed my fingers. This one also never ended. I could pop into the tmux session and look at the URLs that wget was capturing and found it was just spidering its way off into eternity through calendar.php. You see, the Unforums had a calendar plugin. When I was actively a user and moderator, I don’t think I ever looked at the calendar, but I think the theory was you could put game launches or expected incremental updates in there. Some ARGs had a very strict schedule, like an “update Tuesday” that players would look forward to. The problem was that the calendar part of the website just let you keep clicking on previous/next month — presumably out to infinity. And that is exactly what the wget spider was doing.

After some discussion with Sean and Laura, I removed the calendar plugin completely. It was easier than trying to hack in limits and none of us could think of any strong value the calendar provided. At the same time that I hacked out the calendar, I noted that the “bot” session profile also excluded user profile pages. Since I figured these were a fun peek into the personalities of the citizens of the bulletin board, I hacked the permissions model to allow bots to look at profile pages.

So I kicked off another capture and crossed my fingers. After several days, my 100 GB capture volume ran out of disk space. Who could have guessed I'd be collecting more than 100 GB of HTML files? I attached a much larger EBS volume to the instance.
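
For anyone repeating this, the volume swap itself was just the usual format-and-mount routine; the device name and mount point below are whatever your instance hands you, not necessarily what I used.

    # Hypothetical device and mount point for the new, larger EBS volume.
    sudo mkfs -t xfs /dev/xvdf
    sudo mkdir -p /mnt/capture
    sudo mount /dev/xvdf /mnt/capture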

So I kicked off another capture and crossed my fingers. This one actually ran to completion. It took about 5 days.

My copy/paste of the text to replace the donation box missed a later email in the chain, where we discussed tweaks to the format/style of the site: the forums are "Unforums" and not "unForums." We'd all overlooked my copypasta error when proofing the site, just before starting the capture. I didn't want to kick off another 5-day capture for such a small change (yet a small change spread across 1,267,025 files). To be safe, I duplicated the entire capture folder and performed a little rewrite-in-place command line magic, involving grep | xargs | sed, to make the edits directly. I also updated the corresponding PHP in case we needed to redo a capture in the future.
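
The rewrite-in-place was roughly this; the search and replacement strings stand in for the actual donation-box text that needed fixing:

    # Duplicate the capture, then fix the offending string in every file that contains it.
    cp -a capture capture-fixed
    cd capture-fixed
    grep -rlZ 'unForums' . | xargs -0 sed -i 's/unForums/Unforums/g'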

Finally, as a validation step, I copied the entire static archive to an S3 bucket configured for static hosting. I also shut down the web server to confirm that external resources weren’t accidentally being referenced somehow. This looked good.
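
That validation pass was essentially a one-liner; the bucket name here is made up:

    # Push the static tree to an S3 bucket configured for static website hosting.
    aws s3 sync ./forums.unfiction.com/ s3://unfiction-archive-staging/ --delete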

From there, I transferred the files to Sean’s hosting provider and also downloaded a copy onto a USB SSD. Sean pointed the old forums subdomain to his new static hosting.

The final stats are:

  • 134 GB of content
    • …which can be compressed to 14 GB of content (as .tar.bz2).
    • There is presumably a ton of repeated content such as header, footer, and assorted other template text that can be easily compressed due to the repetition.
  • 1,304,691 total files (1,267,025 HTML pages + 37,666 other files)
  • 1,045,728 posts.
  • About five days for each capture. I might have improved on this by making wget more aggressive and/or using a more powerful (but therefore more expensive) EC2 instance. I might have also increased speed by binding Apache to only 127.0.0.1 and dropping the basic auth. I’m not sure if wget reuses the connection between pages or if it has to re-handshake authentication each request. The caching provided by FastCGI may have also increased speed.

And while I’m dropping stats, I’ll end with this little chart I generated from the database of posts over time. I’ll leave it as an exercise for the reader to determine which ARGs corresponded to which spikes.
