Effective Use of OCR
As an administrator working on extraction rules, you are strongly advised to limit document samples to the pages you intend to extract data from. When you upload documents to build extraction rules, every page is read and converted to text. On larger documents of tens or hundreds of pages, this process can be resource intensive: it can take a long time, and it may degrade the general performance of your capture environment for other users. If you must upload large file sets, do so sparingly, or do so in off hours so as not to impact other processing happening on your system. In most cases, a single page or a subset of pages is far more efficient and makes rule building much faster. Most TIFF and PDF document editors (like Adobe Acrobat) include features to extract a single page or a range of pages from a file. Alternatively, if you have access to the paper copy, scan only the pages of interest when planning your rule builds.
Targeted OCR, or OCR that looks at a specific area or characteristic of a page to capture data, is the most efficient means of extracting text from a page. When defining extraction rules, each extracted field may be isolated to a specific page. In this way, GlobalCapture can execute extraction rules very quickly, since it doesn't need to wait for all pages of a document to be read by the OCR engine. It's helpful to identify which pages data may appear on in a document so rules can be built accordingly. To improve the system's per-page processing speed, OCR only the pages you intend to extract text from.
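As a planning aid, the idea of isolating fields to specific pages can be sketched in a few lines of Python. This is only an illustration, not a GlobalCapture API: the function `pages_to_ocr`, the field names, and the 40-page document are all hypothetical. It simply collects the distinct pages that at least one field reads from, which is the set of pages worth sending to OCR.

```python
def pages_to_ocr(field_pages, page_count):
    """Return the sorted page numbers that at least one field reads from.

    field_pages maps a (hypothetical) field name to the 1-based pages
    where its value may appear; page_count is the document length.
    """
    wanted = set()
    for field, pages in field_pages.items():
        for page in pages:
            if not 1 <= page <= page_count:
                raise ValueError(f"{field}: page {page} is outside 1..{page_count}")
            wanted.add(page)
    return sorted(wanted)


# Hypothetical invoice template: three fields that live on pages 1 and 2
# of a 40-page file.
fields = {"InvoiceNumber": [1], "InvoiceDate": [1], "Total": [2]}
targets = pages_to_ocr(fields, page_count=40)
print(targets)  # [1, 2]
print(f"{len(targets)} of 40 pages need OCR")
```

In this made-up case only 2 of the 40 pages would need to be read, which is the saving that page-isolated fields are meant to deliver.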
Full Page OCR
As the name implies, Full Page OCR performs an image-to-text conversion on an entire page. It's important to understand when and how to use this feature to ensure optimal performance. Unless a customer wishes to perform text-based searching (Content Search or Find In File), a full-text PDF is generally not necessary. Regardless of the type of file being stored (TIFF, PDF, etc.), GlobalSearch can output a PDF when requested. As such, for many customers, conversion to PDF is an unnecessary step that slows per-page processing. There are valid use cases where Full Page OCR is required: for example, customers whose compliance requirements mandate that all documents be stored in PDF/A, or any customer who does wish to use the content-based search features native to GlobalSearch mentioned above.
If you do wish to use Full Page OCR (implemented in GlobalCapture through the Convert Node), consider the following:
- All pages present in a document are processed. Documents that have cover sheets and/or separators are no exception. If processing speed is important, it may be more efficient to perform Full Page OCR after separation and cover page deletion. Actual speed is situational and depends on many factors, which makes it difficult to estimate performance blindly. In general, performing OCR on a standard 8.5 x 11 inch text page takes between 0.9 and 2 seconds. This model is most efficient for smaller document sets handled in batches. Consider a 200-page PDF made up of single-page documents, each preceded by a cover sheet that indicates separation and provides index data for the page that follows it. In this example, performing Full Page OCR after separation, indexing, and cover page deletion would eliminate the OCR step on 100 of the 200 pages, which would have a meaningful impact on performance.
- Placing a Convert Node directly in front of a Classify Node can improve performance. When Convert immediately precedes Classify, the Classify Node reuses the OCR results from the Convert step, so in applicable workflows OCR runs only once. Put simply, when processing a single-page file and extracting text zones to corresponding fields, Convert → Classify will OCR the page once, while Classify → Convert will OCR the page twice. Note that the number of OCR steps in a workflow can impact a customer's available page count. It's a good idea to minimize OCR steps wherever possible for both performance and licensing purposes.
- EDocs (Word, Excel) are natively text searchable. If you store these files natively, they will open in PDF format and will be text searchable without any conversion. Files of this type may always be output in either native or PDF format.
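The arithmetic behind the considerations above can be sketched in Python. This is a rough estimator under stated assumptions, not a model of GlobalCapture internals: the function names are hypothetical, the 0.9 to 2 seconds per page range is the general figure quoted above, and the pass-counting rule only encodes the behavior described here (Classify reuses OCR results when Convert immediately precedes it, and runs its own OCR otherwise).

```python
# Per-page OCR time range (seconds) quoted above for a standard text page.
SECONDS_PER_PAGE = (0.9, 2.0)


def ocr_time_range(pages, passes=1):
    """Return (low, high) estimated OCR seconds for `pages` pages."""
    low, high = SECONDS_PER_PAGE
    return (pages * passes * low, pages * passes * high)


def ocr_passes(nodes):
    """Count OCR passes for an ordered list of node names.

    Assumption: Convert always runs OCR; Classify runs OCR unless the
    node immediately before it is Convert. Other nodes run no OCR.
    """
    passes = 0
    previous = None
    for node in nodes:
        if node == "Convert":
            passes += 1
        elif node == "Classify" and previous != "Convert":
            passes += 1
        previous = node
    return passes


# 200-page PDF: OCR everything up front vs. after cover sheets are removed.
for label, pages in (("before separation", 200), ("after cover page deletion", 100)):
    low, high = ocr_time_range(pages)
    print(f"{label}: {pages} pages, roughly {low:.0f}-{high:.0f} s")

# Node order decides how many times each page is read.
print(ocr_passes(["Convert", "Classify"]))  # 1
print(ocr_passes(["Classify", "Convert"]))  # 2
```

With these figures, removing the 100 cover sheets first halves the estimated OCR time, and choosing Convert → Classify over Classify → Convert halves the number of OCR passes per page. Real timings will vary with the factors noted above.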