It seems a long time since the idea of renaming my employer was first mentioned, but it's finally happened, and they've got a new website and everything. We still have the old company mugs, sadly.
I'd been helping out in renaming things. It's surprising just how deep the hooks of a name go: email addresses, documentation, licences, even mentions in code too. A more startlingly obvious case is that of a website. And as there's now a new website, we don't really need to maintain the old one anymore. Just taking everything down would be a shame as there's lots of content on there that people have found useful in the past.
WordPress migration
Like many sites, it used WordPress. One option for migrating away from WordPress is to export the content as an XML file. Other hosts may be able to import that file for you directly, or you can try converting it into a useful format yourself, e.g. if you're using a static site generator to build a new site.
We just wanted to archive the site, rather than move to a new site and reuse the old content, which is a different task.
There are several WordPress plugins that are supposed to convert your site into a static one. However, neither of the two I tried was successful; one just froze without seemingly doing anything, while the other only pulled a couple of pages from the site.
Gathering the site content with wget
In the end, I resorted to wget, which did an admirably good job after a few attempts. wget is more often known as a command-line download tool, but is much more powerful than a simple downloader. The downside with this is that there are options present that you might not even know about until you try this kind of site archiving, hit a problem and then discover which wget option you should have used to fix it. The upside is that wget probably does feature the option you require.
After lots of trial and error, what I ended up with was (deep breath):
wget --page-requisites --convert-links --adjust-extension --mirror --span-hosts --domains=blog.scraperwiki.com,scraperwiki.com --exclude-domains beta.scraperwiki.com,classic.scraperwiki.com,media.scraperwiki.com,mot.scraperwiki.com,newsreader.scraperwiki.com,premium.scraperwiki.com,status.scraperwiki.com,x.scraperwiki.com scraperwiki.com
Let's look at what each option does:
- --page-requisites collects the other files needed to render an HTML page, e.g. images and CSS.
- --convert-links converts links in the retrieved documents to ones that will display correctly in the local mirror. This all happens at the end of a collection; if you check files while the collection is in progress, the links won't yet have changed.
- --adjust-extension gives HTML files an .html file extension if they don't have one (--convert-links adheres to these modified filenames too, so if an HTML file has a modified name, then the link to it will be correct).
- --mirror is an alias for a series of options that facilitate mirroring a site.
- --span-hosts enables wget to move across different hostnames.
- --domains and --exclude-domains list domains that wget should or shouldn't retrieve content from.
Later the files from our multiple domains got moved into one directory for simplicity. I think you could have done this using --no-host-directories.
Running this command will then proceed to collect the content from the sites you've allowed it to crawl.
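As a minimal sketch, the same command could be assembled in Python, which makes the long domain lists easier to edit and review. The names build_wget_command and flatten_hosts are hypothetical, invented for this illustration. The --no-host-directories flag was not part of the command actually used above, so treat that branch as an untested assumption. The excluded list is shortened here for readability.

```python
import shlex


def build_wget_command(start_url, allowed, excluded, flatten_hosts=False):
    """Return the wget argument list for mirroring a site (hypothetical helper)."""
    command = [
        "wget",
        "--page-requisites",   # images, CSS and other files a page needs
        "--convert-links",     # rewrite links for the local mirror at the end
        "--adjust-extension",  # add .html where it is missing
        "--mirror",            # alias for the recursive mirroring options
        "--span-hosts",        # allow crawling across hostnames
    ]
    if allowed:
        command.append("--domains=" + ",".join(allowed))
    if excluded:
        command.append("--exclude-domains=" + ",".join(excluded))
    if flatten_hosts:
        # Assumption: not used in the original crawl, only suggested afterwards.
        command.append("--no-host-directories")
    command.append(start_url)
    return command


command = build_wget_command(
    "scraperwiki.com",
    allowed=["blog.scraperwiki.com", "scraperwiki.com"],
    excluded=["beta.scraperwiki.com", "classic.scraperwiki.com"],
)
# Print the command rather than running it, so it can be checked first.
print(shlex.join(command))
```

Once the printed command looks right, the list could be passed to subprocess.run(command, check=True) to start the crawl.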
Hosting for free
If you have a free GitHub account and are competent enough with the basics of git, note that you can host static sites, also for free, on GitHub Pages. Introducing git version control is a whole other post, but there are plenty of online tutorials. To do something like this, don't feel overwhelmed by the different commands. You only really need the basics, i.e. knowing how to create a new repository, add files to it, make commits and push them to a remote repository.
Another nice feature of preparing your static site in a version controlled repository is that when you encounter things that need fixing, you can try them and always be able to restore to an earlier version, should you need to.
Problems I had to solve
Broken links
Initially, the links to images or other pages were broken; the --convert-links option fixed that.
Unrestricted crawling
wget was crawling scraperwiki.com domains other than the ones we wanted and collecting a lot of unnecessary content. All I needed was the main site, so I specified the domains to exclude.
Removing dynamic features in the static site version
Some features of the site, as expected, no longer functioned after collecting the pages. For example, the comment and contact forms relied on processing by the WordPress server, so they were still displayed on the static site but didn't work.
Removing this content is possible in a couple of ways. You could process the HTML with the blunt instrument of find/replace, or use the finer tool of a proper parser to remove the unwanted elements, for instance using Python and lxml.
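To give a rough idea of the parser approach, here is a sketch that drops every form element from a page. It deliberately swaps lxml for html.parser from Python's standard library so that it runs without extra packages; an lxml version would look different. FormStripper and strip_forms are hypothetical names, and the sketch assumes forms are properly closed with a matching end tag.

```python
from html.parser import HTMLParser


class FormStripper(HTMLParser):
    """Re-emit an HTML document, skipping <form> elements and their contents."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.form_depth = 0  # greater than 0 while inside a <form>

    def _emit(self, text):
        if self.form_depth == 0:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        else:
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "form":
            self.form_depth = max(0, self.form_depth - 1)
        else:
            self._emit(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing form is ignored.
        if tag != "form":
            self._emit(self.get_starttag_text())

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_forms(html):
    """Return html with all form elements removed (hypothetical helper)."""
    parser = FormStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<div><p>Hi &amp; bye</p><form action="/x"><input name="a"></form><p>After</p></div>'
print(strip_forms(page))
```

The output is only as faithful as the re-emitting handlers, so it's worth checking the result with git diff before committing it.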
Easier still is just hiding the broken elements. Finding an appropriate selector for those elements, and then adding a CSS rule for that selector containing display: none; (perhaps with !important if required) will hide them. This also preserves the HTML as it was without mangling it.
It probably won't affect you, but it's worth noting that automated bots may well still be trying to trigger elements "hidden" in this way. They're still on the page, just not displayed to normal users of the site (unless they tweak the CSS). We had an odd issue possibly caused by a redirect from the old site: bots were presumably attempting to submit to a form on our new site, even though they were visiting the archive of the old site. (It was flagged by the form provider asking us to approve submissions from that URL.)
Query strings in filenames
This was caused by WordPress serving resources such as CSS and JS with version query strings in the URL.
When wget retrieved the files initially, it retained the ? and the trailing part of the name, e.g. jetpack.css?ver=4.1.1. The problem is that the links look like this too. The "?" gets interpreted by clients as the start of a query string, and not as part of a static filename. In a static version of the site, the query string won't work: the browser requests the file without the ? and passes a (now) useless query string. Our copy of the file actually has a filename of jetpack.css?ver=4.1.1, not jetpack.css. We don't want the client to request jetpack.css with a query string of ?ver=4.1.1, but to actually request a static file with the question mark in its name.
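The split between path and query string can be seen with Python's urllib.parse. This is only an illustration of how a client reads such a link, using the jetpack.css link from above; percent-encoding the "?" is shown as one way to keep it inside the path, not as a description of what wget does internally.

```python
from urllib.parse import quote, unquote, urlsplit

link = "wp-content/plugins/jetpack/css/jetpack.css?ver=4.1.1"

# A client splits the link at the "?": only the path names a file.
parts = urlsplit(link)
print(parts.path)   # the file that actually gets requested
print(parts.query)  # passed along, but meaningless to a static host

# Percent-encoding the "?" keeps it inside the path instead.
encoded = quote(link, safe="/=")
print(encoded)

# Now there is no query string, and decoding the path gives the
# on-disk filename with the question mark in it.
encoded_parts = urlsplit(encoded)
print(repr(encoded_parts.query))
print(unquote(encoded_parts.path))
```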
I'm not entirely sure what fixed this; I didn't notice at first. From some point onwards, the ? ended up getting encoded as %3F which then gets correctly requested as a filename with a question mark in it. Not particularly clean, but it worked. It may have just been using a recent version of wget that solved this. This was nice as the alternative was a horrendous find/replace task using commands like:
find . -type f -exec grep -Iq . {} \; -and -exec sed -i 's!wp-content/plugins/jetpack/css/jetpack.css?ver=4.1.1!wp-content/plugins/jetpack/css/jetpack.css!g' {} \;
to fix up the content by hand for every CSS and JS file. (You could optimise by having sed perform all the replacements in one command, rather than executing the same find repeatedly, but I was verifying the changes by eye after processing each filename; git diff is useful here.) That equally worked, but was much more work to check.
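For comparison, a single-pass version of that find/replace could be sketched in Python with one regular expression covering every CSS and JS link, rather than one command per filename. This is not what was run on the site; strip_version_queries is a hypothetical helper, and the pattern assumes the query strings always have the form ?ver= followed by a version. The saved files would presumably also need renaming to match the rewritten links.

```python
import re

# A "?ver=..." suffix directly after a .css or .js filename.
VERSION_QUERY = re.compile(r"(\.(?:css|js))\?ver=[0-9A-Za-z.\-]+")


def strip_version_queries(text):
    """Remove version query strings from CSS and JS links in text."""
    return VERSION_QUERY.sub(r"\1", text)


sample = '<link href="wp-content/plugins/jetpack/css/jetpack.css?ver=4.1.1">'
print(strip_version_queries(sample))
```

Applying this to each text file in the mirror and reviewing the result with git diff would still be sensible, since a broad pattern can match more than intended.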
Searching now, by far the simplest fix would be to use a WordPress plugin to remove the query strings before archiving. There are a couple out there.
srcset images
Some of our images were responsive images listed in srcset attributes of img elements. Until recently, wget didn't handle this, but from wget 1.18, it supports srcset images just as it does images in src attributes of img elements. To clarify, it both correctly retrieves the images and updates the links. (I assume only when using --convert-links.)
Links in option values
As part of the site's archive navigation, there were links in the values of option elements. These weren't updated. I just hid those elements as they didn't seem a critical part of the site.
WordPress emoji code
Recent versions of WordPress feature code to add emoji support for certain browsers. This embedded JavaScript is in the HTML and was still pulling from the old WordPress site because it contains an absolute URL within it. Saving the file locally and correcting the URL is non-trivial as a relative URL depends on how deep in the local mirror you are. The simplest solution was disabling this feature, by adding the "Disable Emojis" plugin to the site, then recrawling.
It's not essential to fix this, but doing so means the static site is not making requests for files that no longer exist once you move away from WordPress.
Fixing up incorrect index.html URLs
Lots of URLs on the original site ended with /, rather than /some_page.html. wget instead saved these pages with a filename of index.html. Fixing these up was a pain. In terms of making the site work, this strictly wasn't necessary. GitHub Pages will respond to a request for foo/ with the index.html in the foo directory, so any existing links to the site would still work. However, fixing this does ensure anyone sharing links to our pages uses the same URL as the pages always had.
I'm not even completely sure that this was entirely fixed correctly; there are just too many URLs to verify, but it appeared to have the desired behaviour. You can see the commit message for the code I used.
Essentially, it used lots of regular expressions to substitute:
- foo/index.html to foo/;
- foo/index.html#bar to foo#bar;
- foo/baz.html to foo/baz;
- foo/baz.html#foobar to foo/baz#foobar.
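The four substitutions listed above can be sketched with three regular expressions. This is a reconstruction from the list, not the code from the commit; clean_link is a hypothetical name. It works on a single link value, assumes the link is local to the mirror (external links ending in .html would need to be left alone), and doesn't handle a bare index.html with no leading directory.

```python
import re

INDEX_WITH_FRAGMENT = re.compile(r"/index\.html(?=#)")
INDEX_PLAIN = re.compile(r"/index\.html$")
HTML_SUFFIX = re.compile(r"\.html(?=#|$)")


def clean_link(href):
    """Rewrite one local link to match the original site's URLs."""
    href = INDEX_WITH_FRAGMENT.sub("", href)  # foo/index.html#bar to foo#bar
    href = INDEX_PLAIN.sub("/", href)         # foo/index.html to foo/
    href = HTML_SUFFIX.sub("", href)          # foo/baz.html to foo/baz
    return href


for example in [
    "foo/index.html",
    "foo/index.html#bar",
    "foo/baz.html",
    "foo/baz.html#foobar",
]:
    print(example, "to", clean_link(example))
```

The order matters: the index.html cases run first, so that the general .html rule doesn't turn foo/index.html into foo/index.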
There are a few other tweaks I did to clean up URL links too. They are detailed in the other commit messages in the repository.
Some other redirect problems
For us, some redirects we had set up on the old site caused a couple of cases where files ended up in the wrong place or with the wrong name compared with the original site. We had to be careful to fix up relative URLs here. We also had multiple copies of certain files. These weren't difficult to fix by hand as there were only a few to modify.
Finally, we had to correctly set up redirects to the new site in our server configuration, and then everything was done.
Moving on
After much work, the static version's now working well. Everything's pretty much the same on the site aside from the interactive forms. It was much more of a task to get done than I'd imagined. I think it was around half a day spent trying out wget options and another couple of days spent on resolving all the other issues.
The advantages are that we don't need to maintain a WordPress installation to run the site and the site's getting hosted for free.
Good luck if you're tackling the same problem! Hopefully some of the tips here might help you with your migration.