Archiving Websites

This post was originally published on Cooper Hewitt Labs on Jan 06, 2012

I would imagine that just about any organization out there will eventually amass a collection of legacy web properties. I know we have! Be it a microsite from 1998 or some fantastic ( at the time ) forum that has now been declared “dead”, it’s a problem. The big question is what to do with them.

There are a few technical problems at work here. First, there is a feeling of permanence on the Internet that is hard to ignore. You want these legacy sites to live on in some form. archive.org is a pretty good system for looking back at your main website, but it’s a moving target, constantly being updated with each iteration of your site. I’m talking more about preserving old web outliers: those exhibition micro-sites and one-off contest sites you might have produced years ago.

The next issue is that in order for these sites to live on, you need to provide some level of maintenance for them. Nearly every website these days has a database running the show, so in order for these sites to work, they need to have an open connection to that database. This means you need to continually update the application code, and do crazy things like upgrade to MySQL 5, 6, 7 and so on. What a drag!

Scrape The Site

One option we have been using here at Cooper-Hewitt is called web scraping. This is a pretty common technique that essentially creates a non-dynamic, static version of any website. There are several ways of scraping a site, one of the simplest being the wget program.

wget is a pretty simple program that comes installed on most Linux distributions. You can also install it on your Mac using Homebrew. Here is a sample command line call using wget.
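The original command is not preserved in this copy of the post, so what follows is only a hedged sketch of a typical mirroring call, not the exact one used. It is written as a small Python helper that assembles the wget arguments. The flags are standard wget options, but the selection of flags, the `build_wget_command` helper, and the `scraped` output folder are assumptions made for illustration.

```python
import shlex


def build_wget_command(url, dest_dir="scraped"):
    """Return a wget argument list that mirrors `url` into `dest_dir` (hypothetical helper)."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch the images, CSS and scripts each page needs
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # never climb above the starting directory
        "--directory-prefix", dest_dir,  # assumed output folder
        url,
    ]


if __name__ == "__main__":
    # Site URL taken from this post; swap in your own.
    cmd = build_wget_command("http://campana.cooperhewitt.org/")
    print(shlex.join(cmd))
    # To actually run the download: subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to check the flags before pointing it at a live site.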

wget works pretty well, but it’s not really the ideal tool for the job. All it does is download web content. It’s great for downloading files to your Linux server ( a nice way to install WordPress on a new Linux box ), but it doesn’t do much else.

For scraping our sites, we chose to go with a pretty simple tool called httrack ( thanks to Geoff Barker at Powerhouse ). This program ( available as a command line tool for Mac ) does the same thing wget does, with some added bells and whistles. The main bell is that it rewrites all of the internal hyperlinks in the site so that the archived site can be hosted on just about any domain name.
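To make the link rewriting idea concrete, here is a minimal Python sketch of what "rewriting internal hyperlinks" means: an absolute link that points at the same host is turned into a relative path, so it no longer cares which domain serves it. This is not how httrack is implemented, only an illustration. The `relativize` function and the `/images/logo.png` and `/press/2008/index.html` paths are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def relativize(link, page_url):
    """Turn an absolute same-host link into a path relative to the page that contains it.

    Links to other hosts (and links that are already relative) are returned unchanged.
    This sketch drops any query string or fragment from rewritten links.
    """
    target, page = urlsplit(link), urlsplit(page_url)
    if target.netloc != page.netloc:
        return link
    page_dir = posixpath.dirname(page.path) or "/"
    return posixpath.relpath(target.path or "/", start=page_dir)


if __name__ == "__main__":
    page = "http://campana.cooperhewitt.org/press/2008/index.html"  # hypothetical page
    print(relativize("http://campana.cooperhewitt.org/about.html", page))
    print(relativize("http://campana.cooperhewitt.org/images/logo.png", page))
```

Once every internal link is relative, the whole folder can be moved to a new domain or sub-folder without touching the HTML again.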

Hosting A Scraped Site

Once you have scraped a site, it probably makes sense to move it somewhere for safekeeping. We had lots of sites on lots of domains, produced over the years with different methodologies, and keeping them scattered like that didn’t really make sense. So, we decided to create archive.cooperhewitt.org and place each scraped site as a sub-folder of this domain.

Initially I thought it would be really nice to host these static sites on Amazon’s S3. I know it’s possible to do this, but I found that many of the pages wouldn’t load correctly. I’m still interested in S3 as an option for this as it’s sort of the perfect hardware for the job ( is it really hardware? ) but instead I chose to spin up a micro instance on EC2 and host the sites there.

Here’s an example of one of our scraped sites — http://archive.cooperhewitt.org/campana

301s

It’s pretty standard practice on the web to create 301 redirects for sites you are moving to a new domain. I was able to do this pretty easily using an .htaccess file and the following commands.

This allows you to browse the site by going to the original URL at http://campana.cooperhewitt.org or any of its permalinks like http://campana.cooperhewitt.org/about.html
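The .htaccess commands themselves are not preserved in this copy of the post. As a rough sketch of what such a redirect involves, the Python below renders a set of mod_rewrite rules of the kind commonly used for this, and computes where an old URL would end up. The rules are an assumption, not the ones actually deployed, and `htaccess_rules` and `redirect_target` are hypothetical helpers.

```python
from urllib.parse import urlsplit

ARCHIVE_BASE = "http://archive.cooperhewitt.org"


def htaccess_rules(old_host, folder):
    """Render mod_rewrite lines that 301 every request for old_host into the archive folder."""
    escaped = old_host.replace(".", r"\.")  # dots are regex metacharacters in RewriteCond
    return "\n".join([
        "RewriteEngine On",
        f"RewriteCond %{{HTTP_HOST}} ^{escaped}$ [NC]",
        f"RewriteRule ^(.*)$ {ARCHIVE_BASE}/{folder}/$1 [R=301,L]",
    ])


def redirect_target(old_url, folder):
    """Return the archive URL that an old URL would be redirected to under those rules."""
    path = urlsplit(old_url).path.lstrip("/")
    return f"{ARCHIVE_BASE}/{folder}/{path}"


if __name__ == "__main__":
    print(htaccess_rules("campana.cooperhewitt.org", "campana"))
    print(redirect_target("http://campana.cooperhewitt.org/about.html", "campana"))
```

The host condition keeps the rule from firing on requests that already arrive at the archive domain, which would otherwise redirect in a loop.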

The Downsides

As with anything, there are downsides to using this technique. The main one is that there is no more interactivity. If your website had a commenting feature built in, it won’t work anymore. If it ran off a CMS like WordPress, you won’t be able to log in and make edits to your content. Everything is now static HTML, forever. Also, httrack won’t do it all. It hiccups on some types of URLs depending on the underlying structure/technology. I found this to be a small problem with things like rollover images and dynamic hyperlinks ( especially links with question marks in them ). But most of these issues can be resolved with a little cleanup.
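One way to scope that cleanup is to list every link in the scraped pages that still carries a query string. The sketch below is a hypothetical helper, not part of httrack. It uses a simple regular expression rather than a full HTML parser, so treat its output as a to-do list to check by hand.

```python
import re
from pathlib import Path

# Matches href/src attribute values that contain a "?" (a query string).
QUERY_LINK = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']*\?[^"\']*)["\']', re.IGNORECASE)


def query_links(html):
    """Return the links in an HTML string that still contain a query string."""
    return QUERY_LINK.findall(html)


def scan_site(root):
    """Yield (file, link) pairs for every query-string link under a scraped site folder."""
    for page in sorted(Path(root).rglob("*.html")):
        text = page.read_text(encoding="utf-8", errors="replace")
        for link in query_links(text):
            yield page, link


if __name__ == "__main__":
    # "campana" is an assumed local folder holding the scraped site.
    for page, link in scan_site("campana"):
        print(f"{page}: {link}")
```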

One Final Step

Since you are scraping the site and turning it into static HTML, it makes sense to also make a real archive of your original site files and any attached database. I simply copied all the files in our /var/www directory to an external hard drive and did a mass MySQL dump to the same drive. If I ever really need to resurrect one of the sites, I have everything I need sitting on a shelf in cold storage.
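A minimal sketch of that last step, assuming the external drive is mounted at a made-up path and that mysqldump can find its credentials on its own (for example in an option file). The `backup` and `mysqldump_command` helpers are hypothetical. Only `/var/www` comes from the post.

```python
import shutil
import subprocess
from pathlib import Path


def mysqldump_command(dump_file):
    """Build a mysqldump call that writes every database to one SQL file."""
    return ["mysqldump", "--all-databases", f"--result-file={dump_file}"]


def backup(web_root, drive):
    """Copy the site files and a full MySQL dump onto the external drive."""
    drive = Path(drive)
    shutil.copytree(web_root, drive / "www")  # raises if drive/www already exists
    subprocess.run(mysqldump_command(drive / "all-databases.sql"), check=True)


if __name__ == "__main__":
    # The mount point below is an assumption; print the dump command for review.
    print(" ".join(mysqldump_command("/mnt/archive-drive/all-databases.sql")))
    # To run the real thing: backup("/var/www", "/mnt/archive-drive")
```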

3 Comments

  1. Michal Migurski January 24, 2012 at 8:59 am

    URLs with CGI parameters are going to mess with any flat-file backup system. One possibility with the wget approach would be to get all those GET requests into something like a SQLite file, including their full paths + queries and response headers. Should then be trivial to serve *that* from a purpose-built HTTP server. A bit more complicated, but may serve as a backup plan for filesystem-hostile sites.

    1. Micah Walter January 24, 2012 at 7:27 pm

      Michal,
      Nice idea. I hadn’t really thought of using SQLite — Ideally I think it would be great if I could serve all of this off an S3 bucket as that seems like the most stable/long term type place to put stuff like this… but there are a few issues that I still need to research… ( htaccess 301s etc )
      Micah

  2. Lynda Schmitz Fuhrig January 27, 2012 at 3:30 pm

    Hi Micah,
    Nice to see your posting about website archiving. We have been tackling this issue since the late 1990s at the Smithsonian Institution Archives through a variety of ways. We are currently using Heritrix (an open-source crawler the Internet Archive developed with the Nordic National Libraries) to archive the numerous websites of the Smithsonian museums and offices. The crawler output is a WARC (Web ARChive), which is a container file of a “simple sequence of content blocks.” The WARCs can be viewed in Wayback (the open source version of the Wayback Machine) for the look and feel of the site. WARC, which is an international standard, is appealing for a variety of reasons including metadata and manageability. See here for some blog postings we have done on this issue: http://siarchives.si.edu/search/sia_search/website%20archiving%20lynda
    Lynda Schmitz Fuhrig
    Smithsonian Institution Archives
