Say you have built a brand-new website and want to migrate your existing pages to the new content management system (CMS). It may be obvious that you export from the source CMS and import into the target one, but what happens to the pages between those steps? This blog post is about the T in ETL (extract, transform, load).
Any maintainable website larger than a few pages is stored in a smarter way than as one HTML file per page, so you can’t just shove the pages into the next website. Every CMS stores pages differently, because there are countless ways to make pages look consistent across a site. A page from one CMS therefore never simply fits into another. That is why you cannot do a website migration without content transformations.
What transformations are needed for a website migration?
This largely depends on your source system, your target system and your business rules. Below I list some important examples that I have encountered in years of web migration experience.
1. Field mapping
As mentioned, a decent website is not stored as a bunch of HTML files. One way in which CMSes help you structure your content is by providing several fields per page. Every content type (like a news article or landing page) has a specific set of fields defined. If there’s one type of mapping you will need for any web migration, it is a field mapping. Even if target pages have the ‘same’ fields as the source pages, going from title to page-title is still a mapping, although the two fields are functionally the same. An organization might also decide, for instance, that an ‘introduction’ field must be added, or that an existing one is no longer necessary. Business rules must define exactly how the source fields should be split, merged or filled. Field mapping thus ranges from mappings that are mandatory in any migration to ones that are completely project-specific.
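To make this concrete, here is a minimal sketch of a field mapping in Python (not in Xill). The field names and the "first paragraph becomes the introduction" rule are hypothetical examples of a business rule, not something a particular CMS prescribes.

```python
def split_intro(body):
    """Hypothetical business rule: the first paragraph becomes the introduction."""
    intro, _, rest = body.partition("\n\n")
    return intro.strip(), rest.strip()


# Target field name -> function that builds its value from a source page.
# All field names here are made up for illustration.
FIELD_MAPPING = {
    "page-title": lambda src: src["title"].strip(),        # simple rename
    "introduction": lambda src: split_intro(src["body"])[0],  # split off
    "body": lambda src: split_intro(src["body"])[1],          # remainder
}


def map_fields(source_page):
    """Apply every mapping rule to one source page (a plain dict)."""
    return {target: build(source_page) for target, build in FIELD_MAPPING.items()}


source = {"title": " Our new office ", "body": "We moved.\n\nThe full story follows."}
target = map_fields(source)
```

Keeping the rules in one table like this makes it easy to review them with the content owner before running the migration.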
2. Content type mappings
Content type mapping is closely related to field mapping. Sometimes content types are merged or structured differently in the target system. Some CMSes have only flat content types, while others have a content type hierarchy that can be quite complex.
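A content type mapping can be sketched as a lookup table as well. The example below assumes a flat source CMS and a hierarchical target CMS; all type names are invented, and the press release type is merged into news purely as an illustration.

```python
# Flat source content type -> path in a (hypothetical) target type hierarchy.
CONTENT_TYPE_MAPPING = {
    "news": ("article", "news"),
    "press_release": ("article", "news"),  # merged into the news type
    "blog": ("article", "blog"),
    "landing": ("landing-page",),
}


def map_content_type(source_type):
    """Return the target type path, failing loudly for unmapped source types."""
    try:
        return "/".join(CONTENT_TYPE_MAPPING[source_type])
    except KeyError:
        raise ValueError(f"No target content type defined for {source_type!r}") from None
```

Raising an error for unmapped types, rather than silently picking a default, surfaces forgotten content types during the first test round.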
3. Page chunks overhaul
Some simple CMSes have one ‘body’ field that holds almost all page content and can be edited with a built-in WYSIWYG editor. Most of the time there is also an option to edit the HTML directly, because these editors often cannot create the intended HTML automatically. For a website with multiple contributors, this scenario guarantees inconsistencies between pages, which are amplified over the years. How do you move your old, messy pages to a much better (and better configured) CMS without rewriting every single page? With Xill software and (built-in) HTML Tidy, it is no big deal. You can break up page chunks on set HTML tags, make valid HTML out of them with Tidy and insert them into the import XML, which the target CMS uses to create nice image blocks, photo album blocks, quote blocks, paragraph blocks and whatnot. It feels almost magical.
You will likely encounter some unexpected HTML tags and maybe even some highly corrupt code that Tidy cannot process. Hence you will need a few test rounds, and the content owner and the person performing the migration will need to align several times. This usually reveals some previously unknown problems with the pages, which is good, because it hands you free insights into other possible improvements.
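The chunking step can be illustrated with a small sketch. It uses Python's standard html.parser module instead of Xill and HTML Tidy, and it assumes the markup has already been cleaned up so that tags are balanced. The tag-to-block rules and block type names are hypothetical.

```python
import html
from html.parser import HTMLParser

# Hypothetical rule set: top-level source tag -> target block type.
BLOCK_TYPES = {"p": "paragraph", "blockquote": "quote", "img": "image"}
VOID_TAGS = {"img", "br", "hr"}  # tags that never get a closing tag


class ChunkSplitter(HTMLParser):
    """Collects each top-level element of a body field as a (type, html) chunk."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # finished (block_type, html) pairs
        self.parts = []   # raw HTML pieces of the chunk being built
        self.depth = 0    # nesting depth inside the current chunk
        self.kind = None  # block type of the current chunk

    def _close_chunk(self):
        self.chunks.append((self.kind, "".join(self.parts)))
        self.parts, self.kind = [], None

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.kind = BLOCK_TYPES.get(tag, "unknown")
        self.parts.append(self.get_starttag_text())
        if tag in VOID_TAGS:
            if self.depth == 0:
                self._close_chunk()
        else:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return  # void or stray end tag: nothing to close
        self.parts.append(f"</{tag}>")
        self.depth -= 1
        if self.depth == 0:
            self._close_chunk()

    def handle_data(self, data):
        if self.depth > 0:
            # The parser unescapes text, so escape it again for valid output.
            self.parts.append(html.escape(data, quote=False))
        elif data.strip():
            self.chunks.append(("unknown", data.strip()))  # loose top-level text


def split_body(html_body):
    """Split a body field into typed chunks, one per top-level element."""
    splitter = ChunkSplitter()
    splitter.feed(html_body)
    splitter.close()
    if splitter.parts:  # unbalanced markup: keep it, flagged for review
        splitter.chunks.append(("unknown", "".join(splitter.parts)))
    return splitter.chunks


body = '<p>Hello <b>world</b></p>\n<blockquote><p>Quote</p></blockquote><img src="a.jpg">'
chunks = split_body(body)
# chunks[0] is ("paragraph", "<p>Hello <b>world</b></p>")
```

Anything that ends up as an "unknown" chunk is a good candidate for the test rounds with the content owner.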
4. Intermediate data format(s)
A format has to be chosen for storing the pages and their metadata between source and target. First, because Xillio has connectors for many systems, we export to a universal data model in MongoDB. HTML files, image files and other binaries can be stored on disk under sensible names, such as the source ID plus the file extension. After that, you often need a transformation to a specific transportable format as well, because the content has to go through an HTTPS connection and/or the importer needs a specific form of XML to work with. With Xill IDE, these transformations are relatively easy to perform too. Base64-encoding a file (and therefore every file) takes only one line of script code.
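As an illustration of that last step, the sketch below Base64-encodes a file and wraps it in a small import XML document using only the Python standard library. The element and attribute names are hypothetical; a real importer defines its own schema.

```python
import base64
import xml.etree.ElementTree as ET
from pathlib import Path


def file_to_base64(path):
    """Read a binary file and return its content as Base64 text."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def build_import_xml(page, binary_path):
    """Wrap page metadata and one encoded binary in a made-up import format."""
    root = ET.Element("page", {"source-id": page["source_id"]})
    ET.SubElement(root, "title").text = page["title"]
    binary = ET.SubElement(
        root, "binary", {"name": Path(binary_path).name, "encoding": "base64"}
    )
    binary.text = file_to_base64(binary_path)
    return ET.tostring(root, encoding="unicode")
```

Because Base64 output is plain ASCII, the encoded file can travel inside the XML without further escaping.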
How to automate these transformations?
There are many libraries in Java and other programming languages that can help you do parts of these transformations. Xillio software unifies virtually all of those that you will ever need for web migrations, and makes them very easy to use! Xillio includes:
- XML and HTML functions like XPath
- Web navigation automation
- String manipulation functions like regular expressions (regexp)
- Database connectors for MongoDB, MySQL and more
- Date/time functions
- Encode/decode functions for XML and Base64 encoding
- Excel read/write functions
- File read/write functions
- Math functions
Are these transformations enough?
No, probably not. I may have missed some mandatory types of transformation, and chances are (almost 100%) that it is a great idea to improve, fix, upgrade and/or enrich your content as well. Read about optional transformations in my next blog post!