Manual versus automatic content classification
by Ernst van Rheenen, on Mar 17, 2016 4:02:34 PM
In January 2016 a Dutch TV program reported that 5,600 medical records of Dutch patients had been prepared for digitization (stripped of staples and paper clips) by Belgian prisoners. According to the party concerned, the prisoners had signed a confidentiality agreement, but having this group handle the preparation and scanning of the papers still raised questions.
For organizations that digitize and archive documents and files, the scanning street is just the beginning of a complex process. Manual preparation at the start of the scanning street is still commonplace, and the rest of the process leaves room for improvement as well. After the first step (the actual scanning of the documents), classification and metadata are added to the documents, often manually.
Whether it is done at the end of the scanning street or as part of content enrichment, such as a migration of file shares to a new ECM environment (e.g. SharePoint), manual document classification is time-consuming and inconsistent. Although librarians and information specialists are highly skilled, it is difficult for a team to classify content consistently and unambiguously, even when following a standard template. Give the same set of documents to different specialists and they will classify them differently.
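One common way to put a number on this kind of discrepancy is Cohen's kappa, which measures how much two people agree beyond what chance alone would produce. The sketch below is only an illustration: the function name, the two specialists and their labels are made up for this example and do not come from a real project.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two people who labelled the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of documents on which both people chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, given how often each person uses each label.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both people used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels assigned by two specialists to the same six documents.
specialist_1 = ["invoice", "contract", "invoice", "letter", "contract", "invoice"]
specialist_2 = ["invoice", "contract", "letter", "letter", "invoice", "invoice"]

kappa = cohen_kappa(specialist_1, specialist_2)
print(f"kappa = {kappa:.2f}")
```

A kappa of 1 means perfect agreement and 0 means no more agreement than chance. In this made-up example the two specialists agree on four of the six documents, which gives a kappa of roughly 0.48.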
Nowadays content classification is not always a task for specialists; it is often performed by people within the organization at the moment they add a document. Often they do not see the importance of proper classification, and because they have had no training, metadata is assigned inconsistently and the quality of the classification suffers.
This is of course not a new problem, but it is becoming more acute: first, because the volume of documents and information is growing significantly, and second, because the risks and costs are rising when your document management turns out not to comply with laws and regulations.
A proven way to classify documents is to use intelligent tooling. Multiple vendors offer solutions for automatic classification. Most of these solutions, however, classify based on the form or layout of documents and on the keywords found in them. Xillio's approach goes a step further and assigns labels based on the grammar, spelling, choice of words and repetition used in the document.
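To make the difference concrete, here is a minimal sketch of the simpler, keyword-based style of classification, plus one example of a text feature that looks at repetition rather than at specific keywords. It is not Xillio's implementation or any vendor's product: the labels, keyword lists and function names are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical labels and keyword lists, chosen only for illustration.
KEYWORDS = {
    "invoice": {"invoice", "payment", "vat", "amount"},
    "contract": {"agreement", "party", "clause", "term"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def classify_by_keywords(text, keywords=KEYWORDS):
    """Return the label whose keywords occur most often, or None if none occur."""
    counts = Counter(tokenize(text))
    scores = {label: sum(counts[w] for w in words) for label, words in keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def repetition_ratio(text):
    """Share of words that repeat an earlier word (0.0 means no repetition)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 1 - len(set(tokens)) / len(tokens)

doc = "Invoice 2016-001: payment of the amount due, including VAT."
label = classify_by_keywords(doc)
print(label, repetition_ratio(doc))
```

A keyword rule like this fails as soon as a document uses different wording, which is why features based on writing style, such as the repetition ratio above, can add information that keywords alone miss.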
In addition, Xillio's solution works on any set of documents from any content system. Most often these are network drives or file shares, but they can also be ECM, DM and DAM systems, custom-built legacy systems, or systems that are only partly in use.
And the winner is…
Manual classification is subjective and therefore inconsistent, but not necessarily worse or less accurate than automatic classification. Nowadays, however, there are better ways to classify your documents than doing it by hand: solutions that offer the same or even higher accuracy and guarantee a huge improvement in production speed.
We recently did a project to automatically add metadata to OpenText Content Server content. Read this case study of automatic classification of OpenText Content.