Text Extraction Profile Settings

Extraction profiles allow you to fine-tune the OCR process for documents that are unique or difficult to read, especially when the default settings don't produce accurate results. By creating an OCR Text Extraction Profile, you can apply customized settings to specific workflows, nodes within a workflow, or individual zones within a template—improving extraction accuracy without affecting your global OCR defaults.

Recognition Settings

Basic Settings

Name

The profile name must be distinct and should clearly reflect its primary purpose to ensure easy identification when applying it within a workflow, node, or zone.

If I am creating a profile to apply a specific PDF/A Compliance conversion, I might name my profile: Convert to PDF_A_3U.

OCR Mode

Set your OCR mode by balancing Speed and Accuracy.

Best - Best mode delivers the highest level of OCR accuracy by performing a more thorough and detailed analysis of the document. This setting is ideal for processing complex layouts, poor-quality scans, or documents where precision is critical. While it provides the most accurate results, it may take longer to complete compared to other modes.
Fast - Fast mode prioritizes speed, making it the optimal choice for high-volume document processing where quick results are more important than perfect accuracy. While this mode may be less precise than others, it significantly reduces processing time, making it suitable for simple, clean documents.
Normal - Normal mode offers a balanced approach to OCR processing, providing a reliable combination of speed and accuracy. Suitable for most standard documents, this mode is designed to handle everyday OCR tasks efficiently without compromising too much on quality or performance.

Language Extraction

The OCR uses trained data for each language to accurately recognize words and sentences. Specify the language of the documents being OCR’d.

Page Segmentation Mode

Page Segmentation Mode refers to how the OCR engine analyzes the layout of a page to identify and separate different regions like text blocks, images, columns, or lines before recognizing the text. Choosing the right Page Segmentation Mode improves OCR accuracy by helping the engine understand the page structure better.

Auto - Analyzes the image and attempts to detect and segment the content as a full page of text—recognizing multiple words, lines, and paragraphs. Once segmented, it performs OCR on the text and returns the extracted results.
Single Column - Useful when working with documents that contain columns of data such as receipts, invoices, spreadsheets etc. This organizes the data in rows of data.

The data is returned row-wise:

Burger Combo	6.99
Extra Cheese	0.75
Soda (Large)	1.99

Instead of being returned:

Burger Combo

Extra Cheese

Soda

6.99

0.75

1.99

Single Block Vert Text - This mode behaves similar to Single Column but for document that have the text aligned vertically, such as being rotated.

Single Block - Used to extracting large blocks of uniform text. Best when there is a simple page structure and a consistent font throughout. This is useful if scanning pages of a book or novel, newspaper articles, etc.

Single Line - Used when looking at only a single like of uniform text from an image. It ignores any multi-line structures, columns, and other layout elements. This might be useful if you were attempt to read a license plate from an image.

Single Word - Used when looking for a single word of uniform text from an image with no gaps or spaces. This might be useful for identifying the single word in an image like extracting text from a logo or label.

Single Char - Used when looking for a single character of text in an image.

Sparse Text - Designed for use on documents with sparse text and irregular layouts. Extracts as much text as possible from an image and returns the text. This method does not respect order/grouping, but only the text. This is useful for documents such as business cards.

Raw Line - This mode bypasses all segmentation and OCR preprocessing techniques and treats the image as a single, raw line of text. This mode should only be used when exhausting all other options.

Page Pre-Processing

Enable the checkbox next to each option in page pre-processing to enable it for the profile.

Deskew

If a scanned page is crooked, it can confuse the text recognition software and make it less accurate. To get better results, make sure the text lines are straight and horizontal before running OCR.

Despeckle

Despeckle is a process that clears up unwanted dots or "noise" from an image. It helps make images clearer, especially those poorly scanned.

Grayscale

Grayscale turns a color image into shades of gray, removing all color information while preserving brightness, reducing the image size. This is useful to reduce storage space and load images faster.

Invert

Inverting an image flips the colors—light parts become dark, and dark parts become light. This helps improve text recognition when the text is light-colored on a dark background.

Conversion Options

Correct Orientation

The purpose of page orientation is to control the direction a page is displayed. This will automatically rotate all pages of the document so the text on the page is oriented correctly.

PDF/A Compliance Mode

PDF/A compliance means a PDF file is saved in a special format designed for long-term archiving. It ensures the document looks the same in the future by embedding all fonts and avoiding features that might not work later, like audio, video, or external links.

Standard	Based on PDF Version	Conformance Level	Focus / Features	Accessibility (Tagging & Structure)	Supports Modern PDF Features (Layers, Transparency, JPEG2000)	Can Embed Other Files
PDF/A-1a	PDF 1.4	Level A (a)	Long-term archiving + full accessibility (tagged structure)	Yes	No	No
PDF/A-1b	PDF 1.4	Level B (b)	Long-term archiving + visual fidelity only	No	No	No
PDF/A-2a	PDF 1.7	Level A (a)	Archiving + accessibility + newer PDF features	Yes	Yes	Yes
PDF/A-2b	PDF 1.7	Level B (b)	Archiving + newer PDF features	No	Yes	Yes
PDF/A-2u	PDF 1.7	Level U (u)	Archiving + Unicode text enforcement	Partial (Unicode focus)	Yes	Yes
PDF/A-3a	PDF 1.7	Level A (a)	PDF/A-2a + allows embedding any file format	Yes	Yes	Yes
PDF/A-3b	PDF 1.7	Level B (b)	PDF/A-2b + allows embedding any file format	No	Yes	Yes
PDF/A-3u	PDF 1.7	Level U (u)	PDF/A-2u + allows embedding any file format	Partial (Unicode focus)	Yes	Yes
PDF/UA-1	Based on PDF 1.7+	Accessibility standard	Ensures PDF accessibility for users with disabilities	Yes (mandatory tagging & structure)	Yes	No
PDF/X-1a:2001	PDF 1.3	Print standard	Fixed-layout for reliable printing, CMYK color only	No	No (no transparency or layers)	No
PDF/X-1a	PDF 1.3	Print standard	Similar to PDF/X-1a:2001, fixed color and print-ready files	No	No	No
PDF/X-3	PDF 1.3	Print standard	Like PDF/X-1a but allows color management with ICC profiles	No	No	No
ZUGFeRD	Based on PDF/A-3	Hybrid Invoice Format	Combines PDF/A-3 with embedded structured XML invoice data	Yes (inherits from PDF/A-3a/b/u)	Yes	Yes (embedded XML invoice data)

V_#_# - These refer to the different versions of PDF.
Level A (a) = Full accessibility and logical structure tagging.
Level B (b) = Visual appearance preservation only.
Level U (u) = Unicode text compliance (ensures proper text encoding).
PDF/A-3 variants allow embedding of arbitrary file formats (e.g., XML, CAD files), which is unique to this version.
PDF/UA-1: Focused entirely on accessibility compliance to meet legal and usability requirements for people with disabilities.
PDF/X standards: Focused on reliable printing workflows, ensuring colors and layout are predictable for printers.
PDF/X-1a and PDF/X-1a:2001 disallow features like transparency and require CMYK color space for print consistency.
PDF/X-3 allows color management workflows using ICC profiles, useful for advanced printing needs.
ZUGFeRD is a hybrid format combining a human-readable PDF invoice with machine-readable XML data embedded inside. Built on PDF/A-3, it inherits long-term archiving, accessibility, and embedding features. Widely used in electronic invoicing in Europe for easier automated processing while preserving readable invoices.

Binarization

Otsu Binarization

Otsu binarization is a method used in OCR to convert a grayscale image into a black-and-white (binary) image. Otsu’s method automatically finds the best threshold to separate the foreground (text) from the background. It looks at the brightness values (shades of gray) in an image and picks a point that splits them into two groups: dark and light.

Example: Imagine a scanned document where the text is dark gray and the paper is light gray. Otsu binarization picks the perfect brightness level to decide what counts as "black" (text) and what counts as "white" (background).

Local Region Width and Local Region Height refers to the size of the neighborhood or window around each pixel that is used to calculate the Otsu threshold for that particular pixel. Instead of applying a single, global threshold to the entire image, adaptive thresholding dynamically determines a threshold for each pixel based on the characteristics of its surrounding local region. The threshold has a minimum value of 16 pixels.
- Larger window sizes can be effective for images with slowly varying backgrounds, while smaller window sizes are suitable for images with rapidly changing background intensities or smaller objects.
- Images with fine details or smaller objects benefit from smaller window sizes, while those with larger objects may require larger window sizes to capture sufficient information.
Horizontal Smoothing and Vertical Smoothing can be used to improve the results. This is especially true for noisy images. Applying smoothing can reduce the random noise on an image making it easier for the OCR to distinguish between the foreground (text) and the background. This value has a minimum of 0, no smoothing.
Sensitivity fine tunes the results by determining how aggressively foreground pixels are detected. Adjust the sensitivity between 0 and 1. Zero is the least sensitive where fewer pixels are classified as foreground (text) to 1 where more pixels are classified as foreground (text).

Sauvola Binarization

Sauvola binarization is an advanced method used in OCR to convert grayscale images to black-and-white, especially when the image has uneven lighting, shadows, or textured backgrounds—scenarios where simpler methods like Otsu often fail. This method decides locally, for each small region of the image, whether a pixel should be black (text) or white (background). It adapts to changing lighting or shading across the page.

Window Half Width sets the size of the local window used to determine foreground (text) from the background. This value has a minimum of 2 pixels. There is no optimal size for this window as it depends on the characteristics of the image. Setting an appropriate windows is often done with trial and error.
Threshold this has to do with determining of the pixel is a foreground or background pixel based on the pixels around it. value is between 0 and 1
Tiled is used to increase the efficiency of processing. The image is divided up into smaller, overlapping tiles and the binarization is applied to each tile, reducing the size of the data being processed at a given time. When enabled, additional options become available:
- Horizontal Tile Count sets the number of tiles in each row. There must be at least 1 tile per row.
- Vertical Tile Count sets the number of tiles in each column. There must be at least 1 tile per column.
Add Border adds padding around images edges so the binarization can be applied to all the pixels on the page, including the ones at the edges of the image.

Set Default Profile

To automatically apply your extraction profile during specific processes, set it as the default. Only one profile can be designated as the default at a time. Assigning a new default will replace the previously selected profile.

Zonal OCR

Sets the default profile for OCR actions. This is the default profile used in the Classify node and by individual zone when no specific profile is selected.

Conversion

Sets the default profile for PDF Conversion actions. This is the default profile used in the Convert node when no specific profile is selected.

Use caution when setting a profile as the default, as any existing processes configured to use the default profile will now use the newly assigned one.