Capture all Instances of a Data Element

Randy Sussner
Mar 11, 2022 8:03:42 AM

A common use case in the capture space is this: capture all instances of [insert the data element name here] from the document. For example, I want all phone numbers, regardless of where they appear in the document and regardless of the document's format. ABBYY FlexiCapture is very good at creating solutions for well-formatted, i.e. predictable, documents and capturing data from them. But sometimes the format of a document is not known before it enters the system. In this case, we are still interested in capturing the data element (or elements) from the OCRed text and assigning them as properties. The solution can then use the captured data as searchable metadata in a content management system, and as a lookup parameter in internal systems to retrieve additional data.

ABBYY FlexiLayout is used to create a document definition that supports semi-structured and unstructured documents. Capturing all the data elements that occur in the document can be done in a few relatively easy steps:

  • Create a FlexiLayout with search elements defined to locate all occurrences of the data element
  • Refine the FlexiLayout to use a regular expression to identify all data elements
    • Refining the search in this manner increases the accuracy in capturing the desired results
  • Create a document definition based on the FlexiLayout
  • Start capturing data!

 

Create a FlexiLayout

The first step is to create a FlexiLayout and provide a sample document to match against. For this exercise, we’ll be using a simple email that contains a phone number in the text as well as multiple phone numbers in the footer.

FlexiLayout_Email

Now let's add the following elements to the FlexiLayout:

  • Repeating group – as we want to capture all the phone numbers in the document
  • Character string – data element that captures the phone number values
    • Specify a simple regular expression for a phone number (note that a phone number can be represented by a regular expression in multiple ways)
    • FlexiLayout_Regex
  • Add a block to capture the data
    • This is how the extracted data elements are defined as data fields in the FlexiCapture document definition
    • Make sure to check ‘Has repeating instances’
    • Assign the block to the phone number character string
    • FlexiLayout_BlockProperties
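
To make the idea of "a simple regular expression for a phone number" concrete, here is a minimal sketch in Python. It is an illustration only: the pattern uses Python `re` syntax, not FlexiLayout's own regular expression notation, and the pattern, the function name `find_phone_numbers`, and the sample text are all assumptions rather than anything taken from the FlexiLayout shown here. Like the repeating group, it collects every match instead of stopping at the first one.

```python
import re
from typing import List

# Hypothetical pattern for numbers such as 555-123-4567, (555) 123-4567,
# or 555.123.4567. This is Python `re` syntax, not FlexiLayout notation.
PHONE_PATTERN = re.compile(
    r"(?<!\w)"                        # not preceded by a letter or digit
    r"(?:\(\d{3}\)\s?|\d{3}[-.\s])"   # area code: "(555) " or "555-"
    r"\d{3}[-.\s]\d{4}"               # exchange and line number
    r"(?!\w)"                         # not followed by a letter or digit
)


def find_phone_numbers(text: str) -> List[str]:
    """Return every phone-number-like match, in reading order."""
    return PHONE_PATTERN.findall(text)


# Made-up sample text: one number in the body, two in a footer line.
sample = (
    "Call me at 555-123-4567 about the invoice.\n"
    "Office: (555) 987-6543 | Fax: 555.222.3333"
)
print(find_phone_numbers(sample))
# ['555-123-4567', '(555) 987-6543', '555.222.3333']
```

A stricter pattern misses unusual formats and a looser one picks up false positives, which is the same trade-off you face when writing the expression for the character string element.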

Overall, the FlexiLayout structure should look as follows:

FlexiLayout_Structure

Export the FlexiLayout so we can create a document definition in FlexiCapture based on this layout.

 

FlexiCapture Document Definition

Create a new document definition based on the FlexiLayout created in the previous step:

  • Document definition type: Semi-structured or unstructured documents (FL)
  • Specify a test image
  • Specify the FlexiLayout created in the previous step

Once the document definition is created, you can see that the data element (PhoneNumber in our case) appears as a multi-entry field.

FlexiCapture_DocDef

 

The final step is to test the document definition. This can be done directly in the document definition editor by selecting Testing → Run Test. The test below shows that FlexiCapture found three instances of a phone number in the test image. Note that one phone number was not captured because it contains letters (TEAM). This could be remedied by altering the regular expression in the FlexiLayout.

FlexiCapture_ExtractTest
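
As a rough illustration of how the expression could be relaxed to catch a number that contains letters, the sketch below allows the final group to mix digits and letters. Again, this is Python `re` syntax rather than FlexiLayout notation, and the pattern, the helper `to_digits`, and the sample number `555-123-TEAM` are hypothetical; the actual number in the test email is not reproduced here. The helper maps letters to digits using the standard telephone keypad, which may be useful if the captured value is later used as a lookup key.

```python
import re

# Hypothetical relaxed pattern: the last group may mix digits and letters,
# so a vanity number such as 555-123-TEAM also matches.
VANITY_PATTERN = re.compile(
    r"(?<!\w)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s][0-9A-Za-z]{4}(?!\w)"
)

# Standard telephone keypad letter groups.
KEYPAD = {"ABC": "2", "DEF": "3", "GHI": "4", "JKL": "5",
          "MNO": "6", "PQRS": "7", "TUV": "8", "WXYZ": "9"}
LETTER_TO_DIGIT = {ch: d for letters, d in KEYPAD.items() for ch in letters}


def to_digits(number: str) -> str:
    """Map keypad letters to digits and drop punctuation and spaces."""
    out = []
    for ch in number.upper():
        if ch.isdigit():
            out.append(ch)
        elif ch in LETTER_TO_DIGIT:
            out.append(LETTER_TO_DIGIT[ch])
    return "".join(out)


# Made-up sample text with one ordinary and one vanity number.
sample = "Main: 555-123-4567, vanity line: 555-123-TEAM"
matches = VANITY_PATTERN.findall(sample)
print(matches)                          # ['555-123-4567', '555-123-TEAM']
print([to_digits(m) for m in matches])  # ['5551234567', '5551238326']
```

Allowing letters widens the net, so a pattern like this can also match strings that are not phone numbers; whether that is acceptable depends on the documents being processed.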

Summary

FlexiLayout can be used to extract any number of instances of a data element from a document, provided the field can be represented by a regular expression. It can capture all the values from the text and assign them to a repeating group block. Finally, once a document definition in FlexiCapture incorporates the FlexiLayout, it can be used to produce not only an OCRed version of the document but also all the captured instances of the data element.

About TEAM IM

TEAM IM is a global solution company that advises, develops, implements, supports, and manages enterprise grade information management and content management systems. For more than twenty years, TEAM IM has helped our clients through our offices in Australia, New Zealand, Europe and the United States get the most out of their investment in technology. Whether our clients are large government agencies or corporations, construction firms, accounting firms, heavy industry, or smaller organizations, we strive to deliver demonstrable business benefits and generate real ROI and efficiencies for our clients.

Our products and services offer solutions to digitize, automate and modernize your operations. TEAM IM strives to create multi-year, multiple outcome, outstanding return based relationships with our customers. As we plan to support any solution we deliver, we take care to design for long term, future proof solutions. We work with best-in-class technology partners that we have carefully selected to ensure we can deliver on our multi-year, multiple outcome promises.

Our products and solutions encompass Advisory Services, Business Process Automation, Optimization, Content Platforms and Content Services.  We are also a leader in Mobile App/Field Services software development, focusing on building Digital Workplaces with industry-specific solutions for the Construction and Accounting Services sectors—with more sector-specific solutions on the way.

The most important thing to know about TEAM IM is that, after more than twenty years, we are still passionate about achieving outstanding outcomes for you, our clients. 
