NMa: Enabling effective search through metadata enrichment
by Corné van Leuveren, on Jan 17, 2014 1:04:00 PM
The NMa (now called the Authority for Consumers & Markets; the Dutch competition regulation authority) aimed to consolidate four websites by migrating them into one global site. The goal was to align the look & feel, simplify the administration process, and greatly improve the functionality of the website. We were contracted by the NMa to advise on the overall migration strategy and execute the technical migration.
Migrating from Tridion R5 to Tridion 2009
These were the four websites in scope:
|VervoerKamer.nl||Transport regulation site|
|EnergieKamer.nl||Energy regulation site|
Three of the four NMa websites ran on a Tridion R5 platform, and the fourth consisted of static HTML pages that were not contained within the existing CMS. To achieve these goals, the CMS was upgraded from Tridion R5 to Tridion 2009. Because no direct upgrade path could also handle the merger of the four sites, this complicated the project.
The first step in this project was a content inventory of the NMa sites, carried out on two sides: front-end and back-end.
- The front-end inventory extracted the navigational structure, content, images, and binary files from the web-accessible site. During this process the four sites were scraped and all the content stored locally for future use.
- Following the front-end scrape, the content was also extracted from the back-end of the CMS to gather extra metadata and taxonomy, and to inventory the individual components from which the content pages were composed.
After this inventory was completed, an analysis was performed to compare the similarities and differences between the existing content schemas and the schemas devised for the new website, and to generate the required business rules for mapping the old content into these schemas.
Internally, the NMa created the sitemap of the new website and began the process of fitting content from the four separate sites into this new sitemap. In parallel with this process, the mapping between schema fields was refined and additional business rules defined.
The new site contained many new metadata fields so further mappings were created to convert the old values from four different sites into one standard list of values. The mapping process uncovered deficiencies in the functional and technical specifications that, since it was quite early in the process, were handled with little impact on the planning and other activities.
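As an illustration of this kind of mapping, the sketch below renames legacy fields and converts legacy values into one standard list. It is a minimal example under stated assumptions, not the project's actual rules: the field names (`sector_nl`, `titel`), the contents of `SECTOR_MAP`, and the functions `normalize_value` and `map_record` are all hypothetical.

```python
# Hypothetical lookup from legacy sector labels to one standard list of values.
SECTOR_MAP = {
    "energie": "Energy",
    "energy": "Energy",
    "vervoer": "Transport",
    "transport": "Transport",
}


def normalize_value(raw, mapping, default=None):
    """Return the standard value for a legacy value, or `default` if unmapped."""
    return mapping.get(raw.strip().lower(), default)


def map_record(record, field_map, value_maps):
    """Rename legacy fields and normalize their values.

    Returns (new_record, unmapped). `unmapped` lists the (field, value) pairs
    that no rule covered, so that gaps in the rules surface early.
    """
    new_record = {}
    unmapped = []
    for old_field, value in record.items():
        new_field = field_map.get(old_field)
        if new_field is None:
            unmapped.append((old_field, value))
            continue
        mapping = value_maps.get(new_field)
        if mapping is not None:
            normalized = normalize_value(value, mapping)
            if normalized is None:
                unmapped.append((old_field, value))
                continue
            value = normalized
        new_record[new_field] = value
    return new_record, unmapped
```

Collecting the unmapped pairs instead of dropping them silently is one way to expose missing business rules while they are still cheap to fix.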
Finding legal decisions
The NMa site hosts legal decisions on mergers and market regulation, which are a key area of user focus. Quickly and easily locating decisions by number, name, industry sector, or outcome is critical to the user experience.
Enhancing these search results required a massive metadata enrichment process, as much of the existing information was incorrect, inconsistent, or simply nonexistent. The enrichment was performed in the following manner:
- Each decision has an attached PDF document containing the legal text and description of the decision. Through a regex-heavy automated process additional metadata, such as dates and the parties involved in the decision, was extracted.
- In cases where no metadata already existed, the context in which the PDF was located (such as the page or the URL) identified various metadata elements, such as the industry sector.
- Internally, the NMa has information systems which keep track of cases, decisions, and documents. They provided additional metadata that could be tied back to the individual decisions being migrated.
- Many decisions already contained some metadata and through a complex cleaning process, this metadata was normalized to a set of predefined values.
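The first two enrichment steps above could look roughly like the following sketch. It assumes the PDF text has already been extracted to a string, that dates are written out in Dutch (for example "3 maart 2009"), and that a case number follows the word "zaak". These patterns, the `SECTOR_BY_HOST` table, and the function `extract_metadata` are illustrative assumptions, not the rules used in the project.

```python
import re
from datetime import date
from urllib.parse import urlparse

DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4, "mei": 5, "juni": 6,
    "juli": 7, "augustus": 8, "september": 9, "oktober": 10, "november": 11,
    "december": 12,
}

# Assumed date notation: day, Dutch month name, four-digit year.
DATE_RE = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(DUTCH_MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)
# Assumed case-number notation: "zaak 1234" or "zaaknummer: 1234".
CASE_RE = re.compile(r"\bzaak(?:nummer)?\s*:?\s*(\d{3,6})\b", re.IGNORECASE)

# Context fallback: the site a PDF was found on implies its industry sector.
SECTOR_BY_HOST = {"energiekamer.nl": "Energy", "vervoerkamer.nl": "Transport"}


def extract_metadata(text, source_url=None):
    """Extract date, case number, and sector from decision text and its URL."""
    meta = {}
    match = DATE_RE.search(text)
    if match:
        day, month, year = match.groups()
        try:
            meta["date"] = date(int(year), DUTCH_MONTHS[month.lower()], int(day))
        except ValueError:
            pass  # impossible date such as "31 februari"; leave the field empty
    match = CASE_RE.search(text)
    if match:
        meta["case_number"] = match.group(1)
    if source_url:
        host = (urlparse(source_url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        sector = SECTOR_BY_HOST.get(host)
        if sector:
            meta["sector"] = sector
    return meta
```

In practice a decision text contains several dates, so a real extractor would need rules for choosing the right one; this sketch simply takes the first match.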
Cleaning and converting
While the four sites contained similar content types (such as news, decisions, and press releases), without strong editorial guidelines the content structure had evolved over time and the editors had used this flexibility creatively. The automated cleaning and transformation process reorganized this content back into the proper content type definitions and corrected the page markup to XHTML. Additionally, the NMa’s PDF downloads were converted to the PDF/A archival format to ensure longevity.
Using a custom-built importer, the cleaned and prepared content was collected and inserted into Tridion. This included the content schemas, images, and documents. Tridion dynamically generates the navigation and breadcrumbs based upon the content structure in the CMS, so correct organization of the folders and pages within Tridion is critical.
Further, Tridion uses a blueprinting system to make content available in multiple languages, so some specific content items were imported into shared content areas, to be used and referred to by both Dutch and English pages. The migrations were performed in several environments, from test to staging, and finally production.
Internal link resolution
The last step to be performed after the content was inserted was the link resolution. Two types of internal links were found during the migration: the internal links within a single website and the previously external links between the four sites.
Automatic link reconstruction
For example, a news item on NManet.nl might contain a link to a decision about energy on EnergieKamer.nl. These links were converted, and additional information was stored with them so that the custom importer could reconstruct them after the migration.
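One way to picture this reconstruction is a lookup table from legacy URLs to new locations, applied to every `href` in the migrated markup. The sketch below is an assumption-laden illustration: the `LEGACY_HOSTS` set contains only the three site names mentioned in this article, the new paths are invented, and `rewrite_links` handles absolute links only (relative links within a single site would additionally need the base URL of the page they came from).

```python
import re
from urllib.parse import urlparse

# Only the legacy hosts named in this article; the set is incomplete.
LEGACY_HOSTS = {"nmanet.nl", "energiekamer.nl", "vervoerkamer.nl"}

HREF_RE = re.compile(r'href="([^"]+)"')


def normalize_url(url):
    """Reduce a URL to (host without 'www.', path without trailing slash)."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path.rstrip("/") or "/"


def rewrite_links(markup, url_to_new_path):
    """Rewrite links to legacy sites using a (host, path) -> new path table.

    Returns (new_markup, unresolved), where `unresolved` lists legacy links
    that have no entry in the table and were left unchanged.
    """
    unresolved = []

    def replace(match):
        url = match.group(1)
        key = normalize_url(url)
        if key[0] not in LEGACY_HOSTS:
            return match.group(0)  # external or relative link: leave as-is
        new_path = url_to_new_path.get(key)
        if new_path is None:
            unresolved.append(url)
            return match.group(0)
        return 'href="%s"' % new_path

    return HREF_RE.sub(replace, markup), unresolved
```

Returning the unresolved links separately lets them be reviewed by hand rather than silently breaking after go-live.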
Additionally, new functionality on the site provides the ability for users to see whether the decision they are currently viewing is the latest for the relevant case. This functionality required that the importer build the relationships between different decisions in the CMS.
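The relationship between decisions can be sketched as grouping decisions by case and pointing each one at the most recent decision in its group. The record layout (`id`, `case_id`, `date`) and the function `link_latest_decisions` are hypothetical; how these relationships were actually stored in Tridion is not described here.

```python
from collections import defaultdict
from datetime import date


def link_latest_decisions(decisions):
    """Map each decision id to the id of the newest decision in the same case."""
    by_case = defaultdict(list)
    for decision in decisions:
        by_case[decision["case_id"]].append(decision)

    latest_for = {}
    for items in by_case.values():
        newest = max(items, key=lambda d: d["date"])
        for decision in items:
            latest_for[decision["id"]] = newest["id"]
    return latest_for


# Example with made-up identifiers and dates.
decisions = [
    {"id": "d1", "case_id": "c1", "date": date(2008, 5, 1)},
    {"id": "d2", "case_id": "c1", "date": date(2009, 2, 10)},
    {"id": "d3", "case_id": "c2", "date": date(2007, 11, 3)},
]
latest = link_latest_decisions(decisions)
```

A decision is the latest for its case exactly when it maps to its own id, which is the check the page would need in order to tell the user whether a newer decision exists.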
Result 1: Brief content freeze
During the automated migration process, the four legacy websites ran as normal with no interference and continued to publish content regularly. There was a very brief content freeze while the final migration was executed and the new website went live.
Result 2: Improved legal search
This new website provided the restructured and cleaned content from the four source sites. The website’s end users benefited from a new unified platform and greatly improved search functionality. They could now quickly and accurately find the most recent information relating to the legal decisions made by the NMa, whether it was for a specific company, in reference to a specific law, or related to one specific industry.
Result 3: Better and cheaper platform
Lastly, by consolidating to a single new Tridion site, the NMa simplified its site structure, aligned its content organization, and reduced its software & infrastructure costs.