What Is Intelligent Document Processing?

Most organizations have experienced the drudgery of scanning stacks of invoices, receipts, contracts, HR documents, and more in the quest to “go paperless.” In the past, you might have dealt with optical character recognition (OCR) tools that produced less-than-ideal results, forcing you to go back to the original hard copies to figure out the contents of the scanned documents. The technology simply wasn’t there yet. Today it’s a different story.

Intelligent Document Processing (IDP) is a term that describes a combination of technologies that convert an assortment of documents and files into integrated smart data for your business needs.

Today, IDP combines computer vision, artificial intelligence (AI), natural language processing (NLP), advanced pattern matching, and machine learning (ML) to extract, analyze, sort, classify, and store huge volumes of data with little to no human intervention. Data projects that were impractical just a few years ago can now be handled quickly and with high accuracy. We’ll look at the stages of the intelligent document process, but first, let’s review the different data types.

What Are the Types of Data?

Structured Data

Structured data is organized so that its contents are easy to recognize. This data type is found in databases and spreadsheets, where column headers or other labels, such as XML tags, identify what each value represents.

Semi-Structured Data

Semi-structured data comes from documents containing the same data but not in the same format or layout. For example, think of how the invoices from your vendors can look different, even though they all present the same information.
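
To make the invoice example concrete, here is a minimal Python sketch of mapping differently labeled fields onto one common set of names. The label variants, the canonical field names, and the two sample vendors are all hypothetical, chosen only for illustration; a real IDP system would typically learn or configure these mappings per vendor.

```python
# Hypothetical label variants; not taken from any real vendor or product.
LABEL_ALIASES = {
    "invoice_number": {"invoice #", "invoice no.", "inv number"},
    "total": {"total due", "amount due", "balance"},
}

def normalize_fields(raw):
    """Map each vendor-specific label to a canonical field name.

    Labels that match no known alias are dropped from the result.
    """
    normalized = {}
    for label, value in raw.items():
        key = label.strip().lower()
        for canonical, aliases in LABEL_ALIASES.items():
            if key in aliases:
                normalized[canonical] = value
                break
    return normalized

# Two invoices carrying the same information under different labels.
vendor_a = {"Invoice #": "1001", "Total Due": "$7.99"}
vendor_b = {"Inv Number": "A-77", "Balance": "$12.50"}

print(normalize_fields(vendor_a))
print(normalize_fields(vendor_b))
```

After normalization, both invoices expose the same two field names, so downstream steps can treat them alike.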

Unstructured Data

Unstructured data is essentially free-form. Much of the important day-to-day communication with our internal and external customers is unstructured: emails, text messages, transcribed phone messages, content typed into text entry fields on a web form, and social media posts. Many word-processing files are also unstructured because they don’t have precise metadata describing the data (although they might include basic metadata like title, keywords, author, and category).

What is Metadata?

The most basic definition of metadata is “data about the data.” Metadata is key to classifying and sorting information. In HTML, XML, and other markup languages, developers can precisely define contents by surrounding them with tags. For instance, an XML document might contain the following pieces of information:

<vendor>Cat Things, Inc.</vendor>
<item>Purple Catnip Octopus</item>
<price>$7.99</price>

You (or an AI robot) can immediately discern these three data items. However, as we just mentioned, most Word documents will not contain metadata like this. They might have tabular data, but the column and row labels might be inconsistent from document to document and vendor to vendor. So, keeping all this in mind, let’s look at how the IDP process resolves problems and manages all data types.
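
As a rough illustration of why tagged data is easy for software to read, the sketch below parses the three fields from the example with Python's standard xml.etree.ElementTree module. It assumes well-formed XML (closing tags written as </vendor>) and adds a hypothetical <line_item> wrapper element, because an XML document needs a single root; neither detail comes from a specific IDP product.

```python
import xml.etree.ElementTree as ET

# <line_item> is a hypothetical wrapper; the three child tags mirror the example.
XML_SNIPPET = """
<line_item>
  <vendor>Cat Things, Inc.</vendor>
  <item>Purple Catnip Octopus</item>
  <price>$7.99</price>
</line_item>
"""

def read_tagged_fields(xml_text):
    """Return a {tag: text} mapping for each child of the root element."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}

fields = read_tagged_fields(XML_SNIPPET)
print(fields)
```

No guessing is needed here: each value arrives already labeled by its tag.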

[Figure: Intelligent Document Processing steps]

IDP systems utilize a series of steps to process all three data types to enhance your agency’s operations.

1. Document Capture

The first step is scanning the physical documents to convert them to digital format. You’ll notice that we didn’t simply say paper. Document capture can include the digital acquisition of information on microfiche and microfilm, too.

2. Image Processing

Next, computer vision algorithms analyze and process these new digital files to discover the data:

  • Optical character recognition (OCR) recognizes typed text and converts it to digital text.
  • Intelligent character recognition (ICR) handles the challenging task of deciphering handwriting into text.
  • Optical mark recognition (OMR) looks for check marks, filled circles and boxes, and other content indicators on surveys, tests, and scan sheets.
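
Of the three techniques above, optical mark recognition is the simplest to sketch. The Python example below treats a checkbox as a small grid of pixels (1 for dark, 0 for light) and calls it marked when enough of the grid is dark. The grid representation, the 0.5 threshold, and the function name are assumptions made for illustration; production OMR works on real scanned images and must also handle skew, noise, and alignment.

```python
def is_marked(region, threshold=0.5):
    """Decide whether a checkbox region counts as filled in.

    region: 2D list of pixels, 1 = dark and 0 = light (assumed already binarized).
    threshold: minimum fraction of dark pixels to count as a mark (assumed value).
    """
    total = sum(len(row) for row in region)
    dark = sum(sum(row) for row in region)
    return total > 0 and dark / total >= threshold

# A mostly dark box (filled in) and a box with a single stray speck (left blank).
filled_box = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
]
blank_box = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

print(is_marked(filled_box), is_marked(blank_box))
```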

3. Automated Integration

Automated integration runs in parallel with document capture and image processing. Here, software tools ingest Word documents, PDFs, and other text files so they are ready for further analysis.

4. Natural Language Processing (NLP)

After all the content is in digital format, NLP looks for targeted words, phrases, sentences, and paragraphs. Sophisticated algorithms can:

  • Detect and tag words with their parts of speech (noun, verb, adjective, etc.)
  • Perform sentiment analysis on longer text passages (positive/negative, for/against)
  • Look for specific named entities (people, organizations, locations, codes, monetary units, and many others)
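
To give a feel for the last item, named entity extraction, here is a deliberately simple Python sketch that uses regular expressions to pull monetary amounts and company names out of a sentence. This is a pattern-matching stand-in, not how NLP engines actually work (they generally rely on trained statistical models), and the two patterns and the sample sentence are assumptions for illustration only.

```python
import re

# Dollar amounts such as $7.99 or $1,250.00 (assumed US-style formatting).
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")
# Capitalized words followed by a common company suffix (assumed short list).
ORG = re.compile(r"\b(?:[A-Z][a-z]+,? )+(?:Inc|LLC|Corp)\.")

def extract_entities(text):
    """Return the monetary amounts and organization names found in text."""
    return {
        "MONEY": MONEY.findall(text),
        "ORG": ORG.findall(text),
    }

sample = "Cat Things, Inc. billed $7.99 for one Purple Catnip Octopus."
print(extract_entities(sample))
```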

5. Data Classification

During the classification stage, machine learning algorithms learn to recognize different types of documents and then categorize them automatically:

  • Visual Classification: An algorithm looks for a logo, QR code or barcode, specific image, or even the overall document layout to visually classify the document.
  • Text Classification: Using predefined rules and more advanced tactics like NLP, an algorithm finds keywords and text patterns and can even learn to assess emotional content (e.g., whether a document is a complaint or a positive review).
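
The rule-based end of text classification can be sketched in a few lines of Python. The example below scores a document against keyword lists and picks the best-matching category. The categories and keywords are hypothetical, and a trained machine learning classifier would replace this simple counting in a real system.

```python
import re

# Hypothetical categories and keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    "invoice": {"invoice", "amount", "due", "remit"},
    "complaint": {"complaint", "refund", "disappointed", "broken"},
}

def classify(text):
    """Return the category whose keywords appear most often, or "unknown"."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        label: len(words & keywords)
        for label, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify("Invoice 1001: amount due $7.99"))
print(classify("I am disappointed, the toy arrived broken. Please refund me."))
print(classify("Hello there"))
```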

6. Data Validation

Data validation refers to comparing the extracted and analyzed data with content in specialized databases and dictionaries (lexicons). Data that doesn’t match expectations can be flagged for review by a human.
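
As one possible illustration, the Python sketch below checks an extracted vendor name against a small list of known vendors. An exact match passes; a near match (for instance, an OCR misread) is returned as a suggestion but flagged for human review; anything else is flagged with no suggestion. The vendor list, the 0.8 similarity cutoff, and the function name are assumptions, and the fuzzy matching comes from Python's standard difflib module rather than from any particular IDP product.

```python
from difflib import get_close_matches

# Hypothetical lexicon of approved vendor names.
KNOWN_VENDORS = ["Cat Things, Inc.", "Dog Supplies LLC"]

def validate_vendor(extracted, cutoff=0.8):
    """Return (best_match_or_None, needs_review) for an extracted vendor name."""
    if extracted in KNOWN_VENDORS:
        return extracted, False  # exact match: no review needed
    matches = get_close_matches(extracted, KNOWN_VENDORS, n=1, cutoff=cutoff)
    # Near match or no match: send to a person either way.
    return (matches[0] if matches else None), True

print(validate_vendor("Cat Things, Inc."))
print(validate_vendor("Cat Thlngs, Inc."))  # OCR confused "i" with "l"
print(validate_vendor("Zebra Foods"))
```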

7. Integration

During integration, labeled data and its metadata are linked to the human-readable documents, usually the original PDFs, word processing files, images, and scanned documents.
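
One simple way to picture this linkage is a record that stores the extracted fields alongside a pointer to the source file. The Python sketch below is only an illustration: the record layout, field names, and file path are hypothetical, and real systems usually write such records into a document management system or ERP rather than printing JSON.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ProcessedDocument:
    """Extracted data plus a link back to the human-readable original."""
    source_file: str          # path to the original scan or PDF (hypothetical)
    doc_type: str             # label assigned during classification
    fields: dict              # extracted values keyed by field name
    needs_review: bool = False  # set when validation flags the document

record = ProcessedDocument(
    source_file="scans/invoice_1001.pdf",
    doc_type="invoice",
    fields={"vendor": "Cat Things, Inc.", "price": "$7.99"},
)

# Serialize the record so another system could store or index it.
payload = json.dumps(asdict(record), indent=2)
print(payload)
```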

IDP in Action: Automated Accounts Payable

Coppell, TX, is a city of 42,000 residents within the Dallas-Fort Worth metroplex. Previously, three staff members of the city’s engineering department were spending three days per week processing invoices. To free up staff members for more valuable work and eliminate data entry errors, city leadership consulted with our experts to implement an IDP solution powered by ABBYY FlexiCapture. ABBYY FlexiCapture communicates seamlessly with Tyler Technologies Enterprise ERP, cutting payment processing time from days to hours.

Benefits of Intelligent Document Processing

IDP allows agencies to visualize the flow of work through process stages, observe any bottlenecks and delays, and adjust as needed. As a result, your employees spend less time doing repetitive work that can generate errors, and your organization gains efficiencies and improves quality control. In addition, a system like ABBYY FlexiCapture can create alerts to ensure compliance with rules and help you identify more opportunities for improvement.

There will always be paper, but it doesn’t have to slow down your digital transformation!