Nuance

Asian OCR Support

The OmniPage Capture SDK supports Simplified and Traditional Chinese, Japanese (Hiragana, Katakana and Kanji), and Korean (Hangul and Hanja).

Asian OCR is a separate kit from the Professional OCR and Professional Recognition kits, and it can be installed and run alongside them on the same developer machine. A single end-user application can use both Western and Asian OCR. The Asian OCR API is the same as that of the Capture SDK 12 Western language kits, except for a prefix on function names.

Asian OCR Specifications

OCR Engine System Software and Data
The SDK includes all OCR engine system software and data required to use the OCR engine. This includes, but is not limited to:

  • Dictionaries
  • Shape Recognition Tools and Data

Supported Languages
The OCR engine supports the following languages and character sets:
Japanese (Shift-JIS)
Simplified Chinese (GB-2312 character set)
Traditional Chinese (BIG5 character set)
Korean (KSC)

Image Modes
Black and white, Grayscale and Color

Image Input
Scanner, image file, and memory (one strip at a time, for both grayscale and color images)

Output file formats
Single page and multi-page Text, XML, RTF, Excel, Searchable PDF.

Font Information
Simplified Chinese: Hei, Song, Kai, SimSun, SimHei
Traditional Chinese: MingLiu, Gothic
Korean: Batang, Myeongjo, Gothic
Japanese: Mincho, Gothic

Text Detection
Horizontal and vertical text layout
Full and half width spacing
Japanese Ruby (Hiragana/Katakana (8pt), Kanji (9pt), Latin (7pt))

Character and Document Structure

Each language also supports characters in the ASCII character set. The default output representation is Unicode, with conversion functions for SJIS, GB, Big5 and KSC.
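To illustrate what converting Unicode output to these legacy character sets involves, here is a minimal sketch that uses Python's built-in codecs. It does not call the SDK's own conversion functions, whose names are not given here. The `convert` helper and the `LEGACY_CODECS` table are hypothetical, and mapping KSC to the EUC-KR encoding is an assumption.

```python
# Illustrative only: Python's standard codecs stand in for the SDK's
# conversion functions, which are not documented in this datasheet.
LEGACY_CODECS = {
    "SJIS": "shift_jis",  # Japanese
    "GB": "gb2312",       # Simplified Chinese
    "Big5": "big5",       # Traditional Chinese
    "KSC": "euc_kr",      # Korean (assumes KS C 5601 in its EUC-KR encoding)
}


def convert(text: str, charset: str) -> bytes:
    """Encode Unicode text into one of the legacy character sets."""
    return text.encode(LEGACY_CODECS[charset])


if __name__ == "__main__":
    samples = {
        "SJIS": "日本語 OCR",
        "GB": "简体中文 OCR",
        "Big5": "繁體中文 OCR",
        "KSC": "한국어 OCR",
    }
    for charset, text in samples.items():
        data = convert(text, charset)
        # Decoding with the same codec should recover the original text.
        assert data.decode(LEGACY_CODECS[charset]) == text
        print(charset, len(text), "characters ->", len(data), "bytes")
```

In each of these encodings the ASCII characters stay one byte wide while the CJK characters take two, which is why the byte count differs from the character count.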

The OCR engine automatically identifies the following components of a document and can output this information for use by other software products.

Character Identities (including punctuation and special characters):

  • Numbers
  • Spaces
  • Character Bounding Boxes
  • Confidence Vectors
  • Position Coordinates
  • Font size for characters

Word Information (applies to English and Korean output only):

  • Word bounding boxes
  • Position Coordinates
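As a rough picture of how a consuming application might hold the character-level results and derive word-level boxes from them, here is a minimal sketch. All names (`Box`, `CharResult`, `split_words`) are hypothetical and are not part of the SDK API. The confidence vector is simplified to a single score, and a word box is assumed to be the union of its character boxes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page pixel coordinates (hypothetical)."""
    left: int
    top: int
    right: int
    bottom: int

    def union(self, other: "Box") -> "Box":
        """Smallest box that encloses both boxes."""
        return Box(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass(frozen=True)
class CharResult:
    """One recognized character (hypothetical record layout)."""
    text: str
    box: Box
    confidence: float    # simplification: one score, not a full vector
    font_size_pt: float


def split_words(chars: List[CharResult]) -> List[Tuple[str, Box]]:
    """Group characters into words at spaces; return (word, bounding box)."""
    words: List[Tuple[str, Box]] = []
    current: List[CharResult] = []
    for ch in chars + [CharResult(" ", Box(0, 0, 0, 0), 1.0, 0.0)]:
        if ch.text.isspace():
            if current:
                box = current[0].box
                for c in current[1:]:
                    box = box.union(c.box)
                words.append(("".join(c.text for c in current), box))
                current = []
        else:
            current.append(ch)
    return words


if __name__ == "__main__":
    line = [
        CharResult("H", Box(0, 0, 8, 10), 0.98, 10.0),
        CharResult("i", Box(9, 1, 12, 10), 0.95, 10.0),
        CharResult(" ", Box(13, 0, 16, 10), 0.99, 10.0),
        CharResult("o", Box(17, 3, 24, 11), 0.91, 10.0),
        CharResult("k", Box(25, 0, 31, 10), 0.97, 10.0),
    ]
    for word, box in split_words(line):
        print(word, box)
```

Splitting at spaces only makes sense for scripts that separate words with spaces, which is consistent with word information applying to English and Korean output only.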

Document Structure Information:

  • Position coordinates for the document structural elements
  • Paragraph Boundaries
  • Text Region Boundaries
  • Vertical or Horizontal Text Identification
  • Text Columns
  • Headers and Footers
  • Tables
  • Read Order
  • Automatic Page Segmentation
  • Picture/Image Bounding Boxes
© 2008 Nuance Communications, Inc. All rights reserved.