Text Extraction Profile Settings
Extraction profiles allow you to fine-tune the OCR process for documents that are unique or difficult to read, especially when the default settings don't produce accurate results. By creating an OCR Text Extraction Profile, you can apply customized settings to specific workflows, nodes within a workflow, or individual zones within a template—improving extraction accuracy without affecting your global OCR defaults.
Recognition Settings
Basic Settings

Name and Initial Settings
Name
The profile name must be distinct and should clearly reflect its primary purpose to ensure easy identification when applying it within a workflow, node, or zone.
If I am creating a profile to apply a specific PDF/A Compliance conversion, I might name my profile: Convert to PDF_A_3U.
OCR Mode
Set your OCR mode by balancing Speed and Accuracy.
Best - Best mode delivers the highest level of OCR accuracy by performing a more thorough and detailed analysis of the document. This setting is ideal for processing complex layouts, poor-quality scans, or documents where precision is critical. While it provides the most accurate results, it may take longer to complete compared to other modes.
Fast - Fast mode prioritizes speed, making it the optimal choice for high-volume document processing where quick results are more important than perfect accuracy. While this mode may be less precise than others, it significantly reduces processing time, making it suitable for simple, clean documents.
Normal - Normal mode offers a balanced approach to OCR processing, providing a reliable combination of speed and accuracy. Suitable for most standard documents, this mode is designed to handle everyday OCR tasks efficiently without compromising too much on quality or performance.
Language Extraction
The OCR uses trained data for each language to accurately recognize words and sentences. Specify the language of the documents being OCR’d.
Page Segmentation Mode
Page Segmentation Mode refers to how the OCR engine analyzes the layout of a page to identify and separate different regions like text blocks, images, columns, or lines before recognizing the text. Choosing the right Page Segmentation Mode improves OCR accuracy by helping the engine understand the page structure better.
Auto - Analyzes the image and attempts to detect and segment the content as a full page of text—recognizing multiple words, lines, and paragraphs. Once segmented, it performs OCR on the text and returns the extracted results.
Single Column - Useful when working with documents that contain columns of data such as receipts, invoices, spreadsheets etc. This organizes the data in rows of data.

Receipt
The data is returned row-wise:
Burger Combo | 6.99 |
Extra Cheese | 0.75 |
Soda (Large) | 1.99 |
Instead of being returned:
Burger Combo
Extra Cheese
Soda
6.99
0.75
1.99
Single Block Vert Text - This mode behaves similar to Single Column but for document that have the text aligned vertically, such as being rotated.

Vertical Text Image
Single Block - Used to extracting large blocks of uniform text. Best when there is a simple page structure and a consistent font throughout. This is useful if scanning pages of a book or novel, newspaper articles, etc.

Block of Text from a Novel
Single Line - Used when looking at only a single like of uniform text from an image. It ignores any multi-line structures, columns, and other layout elements. This might be useful if you were attempt to read a license plate from an image.

Single Block Text
Single Word - Used when looking for a single word of uniform text from an image with no gaps or spaces. This might be useful for identifying the single word in an image like extracting text from a logo or label.

Vendor Logo
Single Char - Used when looking for a single character of text in an image.
Sparse Text - Designed for use on documents with sparse text and irregular layouts. Extracts as much text as possible from an image and returns the text. This method does not respect order/grouping, but only the text. This is useful for documents such as business cards.

Business Card
Raw Line - This mode bypasses all segmentation and OCR preprocessing techniques and treats the image as a single, raw line of text. This mode should only be used when exhausting all other options.
Page Pre-Processing

Page Pre-Processing
Enable the checkbox next to each option in page pre-processing to enable it for the profile.
Deskew
If a scanned page is crooked, it can confuse the text recognition software and make it less accurate. To get better results, make sure the text lines are straight and horizontal before running OCR.
Despeckle
Despeckle is a process that clears up unwanted dots or "noise" from an image. It helps make images clearer, especially those poorly scanned.
Grayscale
Grayscale turns a color image into shades of gray, removing all color information while preserving brightness, reducing the image size. This is useful to reduce storage space and load images faster.
Invert
Inverting an image flips the colors—light parts become dark, and dark parts become light. This helps improve text recognition when the text is light-colored on a dark background.
Conversion Options

Conversion Options
Correct Orientation
The purpose of page orientation is to control the direction a page is displayed. This will automatically rotate all pages of the document so the text on the page is oriented correctly.
PDF/A Compliance Mode
PDF/A compliance means a PDF file is saved in a special format designed for long-term archiving. It ensures the document looks the same in the future by embedding all fonts and avoiding features that might not work later, like audio, video, or external links.
Standard | Based on PDF Version | Conformance Level | Focus / Features | Accessibility (Tagging & Structure) | Supports Modern PDF Features (Layers, Transparency, JPEG2000) | Can Embed Other Files |
---|---|---|---|---|---|---|
PDF/A-1a | PDF 1.4 | Level A (a) | Long-term archiving + full accessibility (tagged structure) | Yes | No | No |
PDF/A-1b | PDF 1.4 | Level B (b) | Long-term archiving + visual fidelity only | No | No | No |
PDF/A-2a | PDF 1.7 | Level A (a) | Archiving + accessibility + newer PDF features | Yes | Yes | Yes |
PDF/A-2b | PDF 1.7 | Level B (b) | Archiving + newer PDF features | No | Yes | Yes |
PDF/A-2u | PDF 1.7 | Level U (u) | Archiving + Unicode text enforcement | Partial (Unicode focus) | Yes | Yes |
PDF/A-3a | PDF 1.7 | Level A (a) | PDF/A-2a + allows embedding any file format | Yes | Yes | Yes |
PDF/A-3b | PDF 1.7 | Level B (b) | PDF/A-2b + allows embedding any file format | No | Yes | Yes |
PDF/A-3u | PDF 1.7 | Level U (u) | PDF/A-2u + allows embedding any file format | Partial (Unicode focus) | Yes | Yes |
PDF/UA-1 | Based on PDF 1.7+ | Accessibility standard | Ensures PDF accessibility for users with disabilities | Yes (mandatory tagging & structure) | Yes | No |
PDF/X-1a:2001 | PDF 1.3 | Print standard | Fixed-layout for reliable printing, CMYK color only | No | No (no transparency or layers) | No |
PDF/X-1a | PDF 1.3 | Print standard | Similar to PDF/X-1a:2001, fixed color and print-ready files | No | No | No |
PDF/X-3 | PDF 1.3 | Print standard | Like PDF/X-1a but allows color management with ICC profiles | No | No | No |
ZUGFeRD | Based on PDF/A-3 | Hybrid Invoice Format | Combines PDF/A-3 with embedded structured XML invoice data | Yes (inherits from PDF/A-3a/b/u) | Yes | Yes (embedded XML invoice data) |
V_#_# - These refer to the different versions of PDF.
Level A (a) = Full accessibility and logical structure tagging.
Level B (b) = Visual appearance preservation only.
Level U (u) = Unicode text compliance (ensures proper text encoding).
PDF/A-3 variants allow embedding of arbitrary file formats (e.g., XML, CAD files), which is unique to this version.
PDF/UA-1: Focused entirely on accessibility compliance to meet legal and usability requirements for people with disabilities.
PDF/X standards: Focused on reliable printing workflows, ensuring colors and layout are predictable for printers.
PDF/X-1a and PDF/X-1a:2001 disallow features like transparency and require CMYK color space for print consistency.
PDF/X-3 allows color management workflows using ICC profiles, useful for advanced printing needs.
ZUGFeRD is a hybrid format combining a human-readable PDF invoice with machine-readable XML data embedded inside. Built on PDF/A-3, it inherits long-term archiving, accessibility, and embedding features. Widely used in electronic invoicing in Europe for easier automated processing while preserving readable invoices.
Binarization
Otsu Binarization

Otsu Binarization
Otsu binarization is a method used in OCR to convert a grayscale image into a black-and-white (binary) image. Otsu’s method automatically finds the best threshold to separate the foreground (text) from the background. It looks at the brightness values (shades of gray) in an image and picks a point that splits them into two groups: dark and light.
Example: Imagine a scanned document where the text is dark gray and the paper is light gray. Otsu binarization picks the perfect brightness level to decide what counts as "black" (text) and what counts as "white" (background).
Local Region Width and Local Region Height refers to the size of the neighborhood or window around each pixel that is used to calculate the Otsu threshold for that particular pixel. Instead of applying a single, global threshold to the entire image, adaptive thresholding dynamically determines a threshold for each pixel based on the characteristics of its surrounding local region. The threshold has a minimum value of 16 pixels.
Larger window sizes can be effective for images with slowly varying backgrounds, while smaller window sizes are suitable for images with rapidly changing background intensities or smaller objects.
Images with fine details or smaller objects benefit from smaller window sizes, while those with larger objects may require larger window sizes to capture sufficient information.
Horizontal Smoothing and Vertical Smoothing can be used to improve the results. This is especially true for noisy images. Applying smoothing can reduce the random noise on an image making it easier for the OCR to distinguish between the foreground (text) and the background. This value has a minimum of 0, no smoothing.
Sensitivity fine tunes the results by determining how aggressively foreground pixels are detected. Adjust the sensitivity between 0 and 1. Zero is the least sensitive where fewer pixels are classified as foreground (text) to 1 where more pixels are classified as foreground (text).
Sauvola Binarization

Sauvola Binarization
Sauvola binarization is an advanced method used in OCR to convert grayscale images to black-and-white, especially when the image has uneven lighting, shadows, or textured backgrounds—scenarios where simpler methods like Otsu often fail. This method decides locally, for each small region of the image, whether a pixel should be black (text) or white (background). It adapts to changing lighting or shading across the page.
Window Half Width sets the size of the local window used to determine foreground (text) from the background. This value has a minimum of 2 pixels. There is no optimal size for this window as it depends on the characteristics of the image. Setting an appropriate windows is often done with trial and error.
Threshold this has to do with determining of the pixel is a foreground or background pixel based on the pixels around it. value is between 0 and 1
Tiled is used to increase the efficiency of processing. The image is divided up into smaller, overlapping tiles and the binarization is applied to each tile, reducing the size of the data being processed at a given time. When enabled, additional options become available:
Horizontal Tile Count sets the number of tiles in each row. There must be at least 1 tile per row.
Vertical Tile Count sets the number of tiles in each column. There must be at least 1 tile per column.
Add Border adds padding around images edges so the binarization can be applied to all the pixels on the page, including the ones at the edges of the image.
Set Default Profile

Set Default Profile
To automatically apply your extraction profile during specific processes, set it as the default. Only one profile can be designated as the default at a time. Assigning a new default will replace the previously selected profile.
Zonal OCR
Sets the default profile for OCR actions. This is the default profile used in the Classify node and by individual zone when no specific profile is selected.
Conversion
Sets the default profile for PDF Conversion actions. This is the default profile used in the Convert node when no specific profile is selected.
Use caution when setting a profile as the default, as any existing processes configured to use the default profile will now use the newly assigned one.