Modernizing Document Data Extraction with AI

  • April 16, 2021

Extracting data from documents has evolved significantly since the OCR days of the 1990s. Template-based approaches have been replaced by systems guided by artificial intelligence (AI) and natural language processing (NLP), offering intelligent data extraction from complex unstructured documents. Intelligent Data Extraction (IDE) is typically a component of an overall Intelligent Automation (IA) strategy that combines various processes to give the organization a complete, automated end-to-end solution. Following the steps outlined below can simplify the journey to a successful IDE implementation within a more comprehensive Intelligent Automation solution.

Most IDE solutions support various inputs — such as multiple languages, handwriting, signature validation, barcodes, free-form and tabular data, and numerous image formats. Using low-code GUI-based applications, the non-data scientist can build extraction processes for basic inputs.

For more complex documents, your planning process will include an in-depth review of the documents and their attributes, determining the fields to extract, and deciding on the downstream processes that follow extraction. Other common planning tasks include defining processes for exception handling, Validation, and exporting the data, including inserting the extracted data into downstream systems.

A successful IDE implementation requires an understanding of these tasks and planning for their execution.

The extraction process is broken down into three main phases: Read, Refine, and Apply. Each phase includes a series of tasks, and some tasks span more than one phase, making the delineation between the phases a bit fuzzy, but the tasks remain the same. These tasks are commonly automated within the IDE system but can be configured to enhance the data capture process:

1. Document Import — Documents flow to the IDE platform from multiple source systems through various methods, including different automation options, workflow queues, managed folders, or direct feeds.

2. Image Enhancement — Not all documents will be clean PDFs or images. Document images, including faxes and pictures, may be captured at lower resolutions making the extraction process less accurate. Your IDE solution should automatically evaluate the image resolution and enhance it as needed. You may need to plan for an exception process for low-quality or heavily marked-up images.
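The resolution gate described above can be sketched as a simple decision function. This is an illustrative sketch only: the `ImageMeta` fields, DPI threshold, and pixel cutoff are assumptions, not any particular IDE product's API.

```python
# Sketch of an image-quality gate ahead of extraction. Thresholds and the
# ImageMeta structure are illustrative assumptions for this example.
from dataclasses import dataclass

MIN_DPI = 300             # a common OCR recommendation; tune per workload
MIN_PIXELS = 1_000_000    # below this, upscaling is unlikely to help

@dataclass
class ImageMeta:
    dpi: int
    width: int
    height: int

def quality_action(meta: ImageMeta) -> str:
    """Decide whether a page image can be processed as-is, should be
    enhanced first, or belongs in an exception queue."""
    if meta.dpi >= MIN_DPI:
        return "process"
    if meta.width * meta.height >= MIN_PIXELS:
        return "enhance"      # resolution is low, but enough pixels to upsample
    return "exception"        # too degraded for reliable recognition

print(quality_action(ImageMeta(dpi=300, width=2550, height=3300)))  # process
print(quality_action(ImageMeta(dpi=150, width=1700, height=2200)))  # enhance
print(quality_action(ImageMeta(dpi=96, width=640, height=480)))     # exception
```

A real platform would inspect the image file directly; the point here is that the enhance-versus-exception decision is a plannable rule, not an afterthought.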

3. Document Classification — Your IDE platform works across multiple use cases with a wide variety of documents. Incoming documents are classified based on page attributes. Plan a clear document hierarchy to delineate your use cases, and define document attributes to allow for better classification. Identify unique characteristics for each document type to assist in the classification process.
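Classifying on unique page characteristics can be illustrated with a toy keyword-based classifier. The document types and keyword signatures below are invented for the sketch; production systems use far richer page attributes and ML models.

```python
# Toy keyword-signature classifier showing how unique characteristics per
# document type drive classification. Types and keywords are invented.
def classify(text: str) -> str:
    signatures = {
        "invoice":        ["invoice number", "amount due", "remit to"],
        "purchase_order": ["purchase order", "po number", "ship to"],
        "claim_form":     ["claim number", "policyholder", "date of loss"],
    }
    text = text.lower()
    scores = {doc: sum(kw in text for kw in kws)
              for doc, kws in signatures.items()}
    best = max(scores, key=scores.get)
    # A document matching no signature is routed for review, not guessed at.
    return best if scores[best] > 0 else "unclassified"

print(classify("Invoice Number: 4711  Amount Due: $120.00"))  # invoice
```

The "unclassified" fallback mirrors the exception handling a real hierarchy needs for documents that fit no defined type.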

4. Character / Word Recognition — Your IDE platform supports multiple languages, handwriting, barcodes, and word recognition in both key-value pairs and unstructured text. Take advantage of this by defining the characteristics of the data to extract from the document. The exact location of a data element is no longer required, but defining these characteristics, including the use of regular expressions (regex), is a key component and will require significant planning. Your IDE platform may help by automating some of the discovery process, but working with your stakeholders and SMEs is recommended. Each extracted data element is assigned a confidence score, which can be aggregated into an overall document confidence score. Defining confidence-level cut-offs will determine the quality of the recognition process and the level of effort to plan for in Validation.
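Defining field characteristics with regex, and rolling element scores up to a document score, can be sketched as follows. The field names, patterns, and the flat 0.95 confidence value are assumptions for illustration; real platforms compute per-element confidence from the recognition engine itself.

```python
# Illustrative regex-driven field extraction with a stand-in confidence
# heuristic. Field names and patterns are assumptions for this sketch.
import re

FIELDS = {
    "invoice_number": r"Invoice\s*(?:No\.?|Number)[:\s]+([\w-]+)",
    "total":          r"Total[:\s]+\$?([\d,]+\.\d{2})",
    "date":           r"Date[:\s]+(\d{2}/\d{2}/\d{4})",
}

def extract(text: str) -> dict:
    out = {}
    for field, pattern in FIELDS.items():
        m = re.search(pattern, text, re.IGNORECASE)
        # Real engines score each element; a fixed value stands in here.
        out[field] = ({"value": m.group(1), "confidence": 0.95} if m
                      else {"value": None, "confidence": 0.0})
    return out

def document_confidence(fields: dict) -> float:
    """Aggregate element scores into an overall document score."""
    return sum(f["confidence"] for f in fields.values()) / len(fields)

doc = "Invoice Number: INV-88  Date: 04/16/2021  Total: $1,234.56"
fields = extract(doc)
print(fields["invoice_number"]["value"])   # INV-88
print(document_confidence(fields))         # 0.95
```

A document score below your planned cut-off is what sends the document to Validation rather than straight through.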

5. Recognition — This step includes technologies such as Optical Character Recognition (OCR), Intelligent Character Recognition (ICR), Optical Mark Recognition (OMR), Natural Language Processing (NLP), and Natural Language Understanding (NLU). Barcode recognition combined with database lookups, pre-set rules, and translations allows the IDE system to both extract the data and supplement it with additional information.

6. Validation — Does the candidate document have the required data elements? Are there specific validation rules to follow? Mistakes can happen: required elements can be missing, characters can be misread, and words can be ignored. Validation is essential to obtain accurate results. Validation rules include check digits, length checks, format checks, cross-totaling calculations, value comparisons, MDM matching, and data lookups. Failing a Validation rule should route the document to an exception queue.
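Three of the rule types named above, check digits, length checks, and format checks, can be sketched in plain Python. Using the Luhn algorithm for the check digit and a date format for the format check are assumptions chosen for illustration; your rules will follow your own data.

```python
# Sketches of three Validation rule types. The Luhn check digit and the
# specific length/format expectations are illustrative assumptions.
import re

def luhn_valid(number: str) -> bool:
    """Check-digit validation via the Luhn algorithm (used by card numbers)."""
    digits = [int(d) for d in number][::-1]
    total = sum(digits[0::2])
    total += sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def length_ok(value: str, expected: int) -> bool:
    """Length check: the field must have exactly the expected width."""
    return len(value) == expected

def format_ok(value: str) -> bool:
    """Format check: here, a MM/DD/YYYY date."""
    return re.fullmatch(r"\d{2}/\d{2}/\d{4}", value) is not None

print(luhn_valid("79927398713"))   # True — a classic valid Luhn number
print(format_ok("04/16/2021"))     # True
print(format_ok("2021-04-16"))     # False — would route to an exception queue
```

Each failed rule maps to a routing decision: the document leaves the happy path and enters the exception queue.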

7. Routing — Defining routing rules is a crucial component of a successful IDE implementation. What does ‘straight through processing’ look like? Are different document types routed differently? When exceptions occur, what happens? Routing should be managed, allowing for document flow based on a combination of factors such as document type, confidence score, anomalies, load balancing, follow-the-sun and more.
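Routing on a combination of document type, confidence, and validation outcome can be expressed as a small rule function. The queue names and the 0.90 straight-through threshold are illustrative assumptions, not recommendations.

```python
# Sketch of combined type/confidence/validation routing. Queue names and
# thresholds are illustrative assumptions.
def route(doc_type: str, confidence: float, passed_validation: bool) -> str:
    if not passed_validation:
        return "exception_queue"
    if confidence >= 0.90:
        return "straight_through"           # no human touch needed
    if doc_type == "claim_form":
        return "claims_verification_queue"  # sensitive types get specialists
    return "verification_queue"

print(route("invoice", 0.97, True))     # straight_through
print(route("invoice", 0.80, True))     # verification_queue
print(route("claim_form", 0.80, True))  # claims_verification_queue
print(route("invoice", 0.97, False))    # exception_queue
```

Answering the questions in the step above amounts to filling in this decision table for every document type you handle.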

8. Verification — The main goal of an IDE system is to reduce or eliminate manual processes. However, when Validations fail, exceptions occur, or the IDE platform assigns a low confidence score, the IDE system will route the document to a validation queue for human intervention. Defining what that intervention will look like, including the user interface, will improve the process. Once a document is verified, does it continue processing or exit the system? What happens when a document can’t be verified?

9. Export — The IDE system is commonly part of an overall IA solution. Both the original document and the extracted data will be exported to external systems — such as a content repository, an Intelligent Automation (IA) interface, various file formats, or databases. Planning these connections will ensure your IDE system is fully integrated into the overall solution.
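A minimal export payload might look like the JSON record below. The record shape and field names are assumptions for the sketch; your actual schema will be dictated by the downstream systems you integrate with.

```python
# Minimal export sketch: serializing a verified extraction result for a
# downstream system. The payload shape is an illustrative assumption.
import json

record = {
    "document_id": "doc-001",
    "document_type": "invoice",
    "fields": {"invoice_number": "INV-88", "total": "1234.56"},
    "confidence": 0.95,
}

payload = json.dumps(record, indent=2)
print(payload)
```

The same record could just as easily be written to a database row or handed to an IA workflow; the planning task is agreeing on this contract early.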

Other items to consider include:

  • Throughput KPIs at the page/document level — When combined with document volumes, KPIs will help you plan out the necessary infrastructure for the IDE platform.
  • Platform location — Whether your IDE platform is cloud-based or local is a foundational decision with far-reaching implications. Cloud offerings can be more flexible when scaling your data capture operation, but security and governance concerns may come into play.
  • Data enhancement — Options exist both for Validation and for adding external data to your process. Understanding where and how data enhancement occurs will result in a more complete solution.

NTT DATA Services has deep experience in these processes and stands ready to work with you to create a successful IDE implementation. Check out our intelligent data and automation solutions.


Harry Goldman

Harry Goldman brings an extensive technology career in database design and BI applications to NTT DATA. He has worked as a Data Scientist for several organizations, including IBM. Harry is a member of our Data as an Asset Practice, an innovative organization enabling clients to gain new business insights through data science. His interest in AI was further enhanced at Northwestern University, where he received a master’s degree in Predictive Analytics.
