What is Content ETL?

ETL: Extraction, Transformation and Load

ETL stands for Extract, Transform, and Load. It is a concept from computing that is traditionally applied to structured data. It refers to the processes used to unify data from different structured databases (such as ECM, WCM, DM systems) into another database, often a data warehouse.

The concept of Content ETL was introduced more recently and applies only to unstructured data, as stored in file systems and in ECM, WCM, and DM systems. Content ETL refers to the processes used to exchange data between multiple content repositories in a simple way. The ETL (Extract, Transform and Load) process for unstructured data is much more difficult than for structured data, for a number of reasons.

The challenges of Content ETL

With unstructured data, the metadata is often unavailable or unreliable. A lot of information is contained in the documents themselves but is not made explicitly available to users through metadata. This makes it difficult to automate decisions, such as whether to delete documents or to add metadata that would benefit the target system.

Finally, harmonizing and unifying unstructured data, for example in an integration, migration, or conversion process, is much more difficult than with structured data.

To harmonize unstructured data, it is necessary to validate the data, to complement it and, where necessary, to correct it. This draws on information from different sources.

For instance, you can look at what information is implicitly available for a document: where the document is stored, what is known about the processes that are used, and what an analysis of the document's content reveals.
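
As an illustration only, the idea of complementing missing metadata from implicit information could be sketched in Python as follows. Everything specific here is an assumption made for the example and not part of any particular product: the function name infer_metadata, the folder convention in which the first folder names the department, the rule that file names starting with "invoice" are invoices, and the choice to read the first ISO-style date in the text as the document date.

```python
import re
from pathlib import PurePosixPath

# Assumption: dates in the content appear in ISO form (YYYY-MM-DD).
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def infer_metadata(path, text, existing=None):
    """Complement missing metadata from the storage location and the content.

    Values already present in `existing` are never overwritten.
    """
    metadata = dict(existing or {})
    location = PurePosixPath(path)

    # Storage location: assume the first folder names the department.
    folders = [part for part in location.parts[:-1] if part != "/"]
    if "department" not in metadata and folders:
        metadata["department"] = folders[0]

    # Process knowledge: hypothetical rule that invoice files are named "invoice...".
    if "doc_type" not in metadata and location.name.startswith("invoice"):
        metadata["doc_type"] = "invoice"

    # Content analysis: use the first date found in the text, if any.
    if "document_date" not in metadata:
        match = DATE_PATTERN.search(text)
        if match:
            metadata["document_date"] = match.group(0)

    return metadata
```

Called with the path "/finance/2019/invoice-0042.txt", the text "Invoice dated 2019-03-14 for services." and the existing metadata {"title": "Invoice 42"}, this sketch keeps the title and adds the department "finance", the document type "invoice", and the document date "2019-03-14". In practice such rules are specific to each repository, and the inferred values would still need to be validated.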