RU2582064C1 - Methods and systems for efficient automatic recognition of symbols using decision forests

Info

Publication number: RU2582064C1
Application number: RU2014150945/08A
Authority: RU
Inventors: Юрий Георгиевич Чулинин; Олег Евгеньевич Сенкевич
Original assignee: Общество с ограниченной ответственностью "Аби Девелопмент"
Priority date: 2014-12-16
Filing date: 2014-12-16
Publication date: 2016-04-20

Abstract

FIELD: optics.

SUBSTANCE: invention relates to optical character recognition. The proposed system includes machine-code instructions that, when executed by a processor, control an optical character recognition system to process a scanned document image containing text by identifying the symbol images in the scanned document image. Identification is performed for each page of the document and for each symbol image on the page. The method includes identifying a set of suitable pattern data structures for the symbol image using a decision forest. The method uses the suitable pattern data structures to determine a suitable set of graphemes and uses the identified set of graphemes to select a symbol code that corresponds to the symbol image. The method includes preparing a processed document containing the symbol codes that correspond to the symbol images from the scanned document image, and storing the processed document in one or more storage devices and memories.

EFFECT: optimising optical character recognition owing to use of decision forests.

20 cl, 66 dwg

Description

FIELD OF TECHNOLOGY

This application relates to the automatic processing of scanned document images and other text-containing images and, in particular, to methods and systems for efficiently converting symbol images obtained from scanned documents into the code combinations of the corresponding symbols using a plurality of symbol pattern clusters and a decision forest.

BACKGROUND

Printed, typewritten, and handwritten documents have long been used to record and store information. Despite current trends toward paperless offices, printed documents continue to be widely used in commercial organizations, institutions, and homes. With the development of modern computer systems, the creation, storage, retrieval, and transmission of electronic documents has become, alongside the continuing use of printed documents, an extremely efficient and cost-effective alternative way of recording and storing information. Because of the overwhelming advantages in efficiency and cost provided by modern means of storing and transmitting electronic documents, printed documents are routinely converted into electronic documents by various methods and systems. These include converting printed documents into digital scanned-document images using electronic optical-mechanical scanning devices, digital cameras, and other devices and systems, and then automatically processing the scanned-document images to produce electronic documents encoded according to one or more electronic-document encoding standards. For example, a desktop scanner and modern optical character recognition (“OCR”) programs now make it possible to use a personal computer to convert a printed document into a corresponding electronic document that can be viewed and edited with a text editor.

Although modern OCR systems have advanced to the point where they can automatically convert into electronic documents complex printed documents that include images, frames, border lines, and other non-text elements as well as text symbols of many common alphabetic languages, the problem of converting printed documents containing Chinese and Japanese characters or Korean morpho-syllabic blocks remains unsolved.

SUMMARY OF THE INVENTION

The present invention relates to methods and systems for recognizing symbols corresponding to symbol images obtained from a scanned document image or another text-containing image, including symbols corresponding to Chinese or Japanese characters or Korean morpho-syllabic blocks, as well as symbols of other languages that use a large number of signs for writing and printing. In one embodiment, the methods and systems described herein carry out an initial processing stage on one or more scanned images to identify a set of graphemes that most likely correspond to the images of all symbols occurring in the scanned document image. Graphemes are selected for a symbol image on the basis of accumulated votes obtained from symbol patterns that have been identified, using one or more decision forests, as likely related to the symbol image.
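The vote accumulation mentioned above can be illustrated with a minimal Python sketch. It assumes the decision forests have already returned the candidate patterns; the function name `select_graphemes`, the `(pattern_id, graphemes)` pair layout, and the rule of one vote per pattern per grapheme are hypothetical choices made for illustration and are not taken from the patent.

```python
from collections import Counter

def select_graphemes(candidate_patterns, top_n=3):
    """Accumulate grapheme votes over candidate patterns and return the top_n graphemes.

    candidate_patterns: iterable of (pattern_id, [grapheme, ...]) pairs.
    This shape is an assumption; the candidates are presumed to come from
    one or more decision forests evaluated elsewhere.
    """
    votes = Counter()
    for _pattern_id, graphemes in candidate_patterns:
        for grapheme in graphemes:
            votes[grapheme] += 1  # assumed rule: one vote per pattern per grapheme
    # Rank by descending vote count; break ties alphabetically for determinism.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [grapheme for grapheme, _count in ranked[:top_n]]

# Hypothetical candidates for a single symbol image.
candidates = [("p1", ["A", "B"]), ("p2", ["A"]), ("p3", ["B", "C"]), ("p4", ["A", "C"])]
best = select_graphemes(candidates, top_n=2)
```

A weighted vote (for example, weighting each pattern by how well it matches the symbol image) would fit the same structure by replacing the increment with a weight.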

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-B show a printed document.

FIG. 2 shows a conventional desktop scanner and a personal computer that are used together to convert printed documents into electronic documents that can be stored on storage devices and/or in electronic memory.

FIG. 3 shows the operation of the optical components of the desktop scanner shown in FIG. 2.

FIG. 4 shows a general architectural diagram of various types of computers and other processor-controlled devices.

FIG. 5 shows a digital representation of a scanned document.

FIG. 6 shows a hypothetical symbol set.

FIGS. 7A-C show various aspects of symbol sets for natural languages.

FIGS. 8A-B show parameters and parameter values computed for symbol images.

FIG. 9 shows a table of parameter values computed for all symbols in the example set shown in FIG. 6.

FIG. 10 shows a three-dimensional plot of the symbols in the example set shown in FIG. 6, in which each dimension represents the values of one of three different parameters.

FIGS. 11A-B show the symbols contained in each of the clusters represented by points in the three-dimensional space shown in FIG. 10.

FIG. 12A shows a separate parameter that can be used in combination with the three parameters corresponding to the dimensions of the three-dimensional parameter space shown in FIG. 10 to fully recognize each of the symbols in cluster 8.

FIG. 12B shows the value of the additional parameter for each symbol in cluster 8 and should be read with reference to FIG. 12A.

FIG. 13A shows an additional parameter used to characterize symbol images.

FIG. 13B shows the complete set of parameter values for the hypothetical symbol set shown in FIG. 6.

FIGS. 14A-17C show the construction of four decision trees from four subsets of the hypothetical symbol set shown in FIG. 6.

FIGS. 18A-B show the classification of symbol images using a decision forest.

FIG. 19 shows a small text-containing image that has been initially processed by an OCR system to produce a grid of symbol windows (1900), each of which contains a symbol image.

FIG. 20 shows a general approach to processing the grid of symbol windows shown in FIG. 19.

FIG. 21 shows a first approach to implementing the “process” routine ((2004) in FIG. 20).

FIGS. 22A-B show a second embodiment of the “process” routine ((2004) in FIG. 20).

FIG. 23 shows a third embodiment of the “process” routine discussed in the previous subsection, using the same illustrations and pseudocode conventions as in the previous subsection.

FIG. 24 shows data structures that support clustering and preprocessing in one embodiment of an OCR system that includes the third embodiment of the “process” routine.

FIGS. 25A-H show preprocessing of a symbol image using the data structures discussed above with reference to FIG. 24.

FIG. 26 shows the relationships among symbols, symbol images, patterns, and graphemes.

FIG. 27 shows the process of converting symbol images into symbol codes.

FIGS. 28A-G show symbol processing in a multi-cluster data structure.

FIGS. 29A-D show the use of a decision forest to identify candidate parameter data structures for an input symbol image.

FIGS. 30A-D show, by means of flowcharts, one embodiment of a document processing method based on a multi-cluster data structure, in which a decision forest is used to select suitable pattern data structures for each symbol image.

DESCRIPTION OF PREFERRED EMBODIMENTS

This document relates to methods and systems for recognizing symbols corresponding to symbol images obtained from a scanned document image. In one embodiment of the methods and systems discussed herein, an initial processing stage applied to one or more scanned images determines the frequency with which each grapheme in a set of graphemes is associated with the symbol patterns that correspond to the symbol images occurring in the scanned document image or images. For each symbol pattern, a number is computed that characterizes the frequency of associations with the graphemes referenced by the symbol pattern; the symbol patterns in each pattern cluster are then sorted by these numbers. The resulting ordering of the patterns is used in the next stage of optical character recognition, in which symbol images are associated with one or more graphemes or symbol codes.
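As a rough illustration of the sorting step, the following Python sketch orders the patterns of each cluster by a per-pattern number. The passage does not say how that number is computed, so the sketch assumes it is the sum of the observed frequencies of the graphemes a pattern references; the names `order_clusters`, `pattern_graphemes`, and `grapheme_freq` are hypothetical.

```python
def order_clusters(clusters, pattern_graphemes, grapheme_freq):
    """Sort the patterns of every cluster by descending frequency score.

    clusters:          {cluster_id: [pattern_id, ...]}
    pattern_graphemes: {pattern_id: [grapheme, ...]} graphemes referenced by a pattern
    grapheme_freq:     {grapheme: count} frequencies gathered in the initial stage
    The scoring rule (sum of grapheme frequencies) is an assumption.
    """
    def score(pattern_id):
        return sum(grapheme_freq.get(g, 0) for g in pattern_graphemes[pattern_id])

    # sorted() is stable, so patterns with equal scores keep their original order.
    return {
        cluster_id: sorted(patterns, key=score, reverse=True)
        for cluster_id, patterns in clusters.items()
    }

# Hypothetical data: one cluster with three patterns.
clusters = {"c1": ["p1", "p2", "p3"]}
pattern_graphemes = {"p1": ["x"], "p2": ["y", "z"], "p3": ["z"]}
grapheme_freq = {"x": 1, "y": 2, "z": 5}
ordered = order_clusters(clusters, pattern_graphemes, grapheme_freq)
```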

Images of scanned documents and electronic documents

FIGS. 1A-B show a printed document. FIG. 1A shows the original document with Japanese text. The printed document (100) includes a photograph (102) and five different text-containing regions (104)-(108) that include Japanese characters. This document is used as an example in the discussion of the method and systems for determining meaning to which this application relates. Japanese text may be written left to right in horizontal rows, as English text is written, but may alternatively be written top to bottom in vertical columns. For example, region (107) clearly contains vertically written text, while text block (108) contains horizontally written text. FIG. 1B shows the printed document of FIG. 1A translated into Russian.

Printed documents can be converted into digital scanned-document images by various means, including electronic optical-mechanical scanning devices and digital cameras. FIG. 2 shows a conventional desktop scanner and a personal computer that are used together to convert printed documents into electronic documents that can be stored on storage devices and/or in electronic memory. The desktop scanning device (202) includes a transparent glass (204) onto which a document (206) is placed face down. Starting a scan produces a digitized image of the scanned document, which can be transferred to a personal computer (“PC”) (208) for storage on a storage device. A program for displaying scanned documents can render the digitized scanned-document image on the screen (210) of the PC display device (212).

FIG. 3 shows the operation of the optical components of the desktop scanner shown in FIG. 2. The optical components of this CCD scanner are located beneath the transparent glass (204). A movable bright light source (302) illuminates a portion of the document being scanned (304), from which light is reflected downward. This light is reflected from a movable mirror (306) onto a stationary mirror (308), which reflects the light onto an array of CCD elements (310) that generate electrical signals proportional to the intensity of the light falling on each of them. Color scanners may include three separate rows or arrays of CCD elements with red, green, and blue filters. The movable bright light source and mirror travel together along the document to produce the scanned-document image. Another type of scanner, which uses a contact image sensor, is called a CIS scanner. In a CIS scanner, the document is illuminated by moving colored light-emitting diodes (LEDs), and the reflected LED light is captured by an array of photodiodes that moves together with the colored LEDs.

FIG. 4 shows a general architectural diagram of various types of computers and other processor-controlled devices. The high-level architectural diagram may describe a modern computer system, such as the PC shown in FIG. 2, in which programs for displaying scanned-document images and optical character recognition programs are stored on storage devices for transfer to electronic memory and execution by one or more processors, which turns the computer system into a specialized optical character recognition system. The computer system contains one or more central processing units (“CPUs”) (402-405); one or more electronic memory modules (408) connected to the CPUs by a CPU/memory-subsystem bus (410) or multiple buses; and a first bridge (412) that connects the CPU/memory-subsystem bus (410) with additional buses (414) and (416) or other high-speed interconnects, including multiple high-speed serial links. These buses or serial links, in turn, connect the CPUs and memory with specialized processors, such as a graphics processor (418), and with one or more additional bridges (420), which are interconnected with high-speed serial links or with multiple controllers (422-427), such as controller (427), that provide access to various types of storage devices (428), electronic displays, input devices, and other such components, subcomponents, and computing resources.

FIG. 5 shows a digital representation of a scanned document. In FIG. 5, a small circular portion (502) of the image of the example printed document (504) is shown magnified (506). The corresponding portion of the digitized scanned-document image (508) is also shown in FIG. 5. A digitized scanned document includes data that represents a two-dimensional array of pixel values. In representation (508), each cell of the grid beneath the symbols, such as cell (509), represents a square matrix of pixels. A small portion of the grid (510) is shown at even higher magnification ((512) in FIG. 5), at which individual pixels appear as matrix elements, such as matrix element (514). At this level of magnification the edges of the symbols appear jagged, since a pixel is the smallest unit of detail that can be used to emit light of a specified brightness. In a digitized scanned-document file, each pixel is represented by a fixed number of bits, and the pixels are encoded sequentially. The file header contains information about the type of pixel encoding, the dimensions of the scanned image, and other information that allows a program for displaying digitized scanned documents to extract the pixel encodings and issue commands to a display device or printer to reproduce a two-dimensional image of the original document from them. Digitized scanned documents represented as monochrome grayscale images commonly use 8-bit or 16-bit pixel encodings, while color scanned images may devote 24 or more bits to each pixel, depending on the color-encoding standard. For example, the widely used RGB standard represents the intensities of red, green, and blue with three 8-bit values encoded in a 24-bit value. Thus, a digitized scanned image essentially represents a document in the same way that digital photographs represent visual scenes. Each encoded pixel carries information about the light brightness in a tiny region of the image and, for color images, information about color as well. A digitized scanned-document image contains no information about the meaning of the encoded pixels, such as the fact that a small two-dimensional region of adjacent pixels represents a text symbol. Image fragments corresponding to symbol images can be processed to produce a bitmap of the symbol image, in which bits with value “1” correspond to the symbol and bits with value “0” correspond to the background. Bitmaps are convenient for representing both the extracted symbol images and the patterns that an OCR system uses to recognize particular symbols.
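The pixel encodings described above can be made concrete with a short Python sketch. It shows one way to pack three 8-bit intensities into a 24-bit value and one way to derive a 1/0 symbol bitmap from 8-bit grayscale pixels. The byte order (red in the high byte), the fixed threshold of 128, and the convention that dark pixels belong to the symbol are assumptions made for illustration; the function names are hypothetical.

```python
def pack_rgb(red, green, blue):
    """Pack three 8-bit intensities into one 24-bit value (red in the high byte, by assumption)."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each intensity must fit in 8 bits")
    return (red << 16) | (green << 8) | blue

def to_bitmap(gray_rows, threshold=128):
    """Convert rows of 8-bit grayscale pixels into a bitmap of 1 (symbol) and 0 (background).

    Assumes dark ink on a light page, so pixels below the threshold map to 1.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]

# A hypothetical 3x3 grayscale fragment containing a dark diagonal stroke.
fragment = [
    [10, 240, 250],
    [245, 20, 235],
    [250, 230, 15],
]
bitmap = to_bitmap(fragment)
```

A fixed threshold is the simplest possible choice; practical systems often pick the threshold adaptively from the image histogram.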

In contrast, a typical electronic document created with a text editor contains various types of line-drawing commands, references to image representations such as digitized photographs, and digitally encoded text characters. One of the most commonly used standards for encoding text characters is the Unicode standard. The Unicode standard typically uses 8-bit bytes to encode ASCII (American Standard Code for Information Interchange) characters and 16-bit words to encode the characters and signs of many languages, including Japanese, Chinese, and other non-alphabetic written languages. Most of the computational work performed by an OCR program consists of recognizing images of text characters extracted from the digitized image of a scanned document and converting those symbol images into the corresponding Unicode code points. Storing Unicode text characters clearly requires far less space than storing bitmaps of the same characters. In addition, Unicode text characters can be rendered in different fonts and processed by all the means available in text editors, whereas the digitized image of a scanned document can be modified only with specialized image-editing programs.
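The difference in storage can be made concrete with a small sketch. It assumes the UTF-8 and UTF-16 encoding forms of Unicode and a hypothetical 32 x 32 one-bit symbol bitmap; the bitmap size is an assumption chosen only for illustration.

```python
# Bytes needed to store one encoded character versus one symbol bitmap.
ascii_bytes = len("A".encode("utf-8"))      # ASCII character as one 8-bit byte
cjk_bytes = len("日".encode("utf-16-le"))    # CJK character as one 16-bit word
bitmap_bytes = (32 * 32) // 8               # hypothetical 32x32 bitmap, 1 bit per pixel

print(ascii_bytes, cjk_bytes, bitmap_bytes)
```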

In the initial stage of converting the image of a scanned document into an electronic document, the printed document (for example, the document (100) shown in FIG. 1) is analyzed to identify the various regions within it. In many cases, the regions can be logically organized as a hierarchical acyclic tree consisting of a root node that represents the document as a whole, intermediate nodes that represent regions containing smaller regions, and leaf nodes that represent the smallest regions. The tree representing the document includes a root node corresponding to the entire document and six leaf nodes, each of which corresponds to one identified region. Regions can be identified by applying a variety of techniques to the image, including various types of statistical analysis of the distribution of pixels or pixel values. For example, in a color document, a photograph can be distinguished by the greater variation in color within the photograph region and by more frequent changes in pixel brightness than occur in text-containing regions.
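Such a region tree can be sketched as a simple recursive data structure. The class name `Region`, its fields, and the region labels below are hypothetical; the sketch only mirrors the shape described above, namely a root for the whole document with six leaf regions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """A node of the document region tree (hypothetical structure)."""
    label: str
    children: List["Region"] = field(default_factory=list)

    def leaves(self):
        """Return the smallest regions, i.e. the nodes without children."""
        if not self.children:
            return [self]
        result = []
        for child in self.children:
            result.extend(child.leaves())
        return result


# Root node for the whole document with six leaf regions.
document = Region("document", [Region(f"region {i}") for i in range(1, 7)])
print([leaf.label for leaf in document.leaves()])
```

Intermediate nodes would simply be `Region` objects that themselves have children, so the same `leaves` traversal applies to deeper trees.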

Once the initial analysis has identified the various regions in the image of the scanned document, the regions that are likely to contain text are further processed by OCR routines to identify text characters and convert them into Unicode or into any other character-encoding standard. Before the OCR routines process a text-containing region, the initial orientation of the region is determined, so that the OCR routines can efficiently apply various pattern-matching methods to identify text characters. It should be noted that images in documents may not be properly aligned within the image of the scanned document because of an error in positioning the document on the scanner or other image-forming device, because of a non-standard orientation of the text-containing regions, or for other reasons. The text-containing regions are then divided into image fragments that contain individual characters or symbols; these fragments are scaled and oriented as a whole, and the symbol images are centered within them to facilitate the subsequent automatic recognition of the characters that correspond to the symbol images.
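The centering step alone can be sketched as follows. This minimal example assumes a binary bitmap, ignores scaling and rotation, and centers the bounding box of the symbol in a square window; the function name `center_in_window` and the square window are assumptions made for illustration.

```python
def center_in_window(bitmap, size):
    """Crop the symbol to its bounding box and center it in a size x size window."""
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    cols = [c for c in range(len(bitmap[0])) if any(row[c] for row in bitmap)]
    window = [[0] * size for _ in range(size)]
    if not rows:
        return window  # empty fragment: nothing to center
    top, left = rows[0], cols[0]
    height = rows[-1] - top + 1
    width = cols[-1] - left + 1
    if height > size or width > size:
        raise ValueError("symbol does not fit; it would have to be scaled first")
    off_r = (size - height) // 2
    off_c = (size - width) // 2
    for r in range(height):
        for c in range(width):
            window[off_r + r][off_c + c] = bitmap[top + r][left + c]
    return window


# A symbol stuck in the upper-left corner of its fragment.
fragment = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
centered = center_in_window(fragment, 4)
for row in centered:
    print(row)
```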

An example of OCR methods and systems

To make the discussion of various optical character recognition methods concrete, a character set for a hypothetical language is used as an example. FIG. 6 shows this hypothetical character set: 48 different symbols located in 48 rectangular regions, such as rectangular region (602). In the right-hand corner of each rectangular region, a numerical index, or symbol code, is shown inscribed in a circle; for example, the index or code “1” (604) corresponds to the first symbol (606), shown in rectangular region (602). This example was chosen to demonstrate the operation of both currently existing OCR methods and systems and the new methods and systems described herein. In practice, hieroglyphic written languages, including Chinese and Japanese, may use tens of thousands of different symbols in printing and writing.

FIGS. 7A-B illustrate various aspects of character sets for natural languages. In FIG. 7A, a column (704) shows different written forms of the eighth symbol (702) of the set shown in FIG. 6, as they occur in different text styles. Many natural languages use a variety of text styles as well as alternative ways of writing each symbol.

FIG. 7B illustrates different approaches to recognizing the symbols of a natural language. In FIG. 7B, a particular natural-language symbol is represented by node (710) in diagram (712). A particular symbol may have many different general written or printed forms. For the purposes of optical character recognition, each of these general forms is represented as a separate grapheme. In some cases, a particular symbol may comprise two or more graphemes. For example, Chinese characters may comprise a combination of two or more graphemes, each of which also occurs in other characters. The Korean language is in fact alphabet-based, but it uses Korean morpho-syllabic blocks that contain a number of alphabetic characters in different positions. A Korean morpho-syllabic block may therefore be a higher-level symbol composed of multiple grapheme components. For the symbol (710) shown in FIG. 7B, there are six different graphemes (714)-(719). In addition, each grapheme has one or more different printed or written forms, each of which is represented by a corresponding pattern. In FIG. 7B, each of the graphemes (714) and (716) has two alternative forms, represented by patterns (720) and (721) and patterns (723) and (724), respectively. Each of the graphemes (715) and (717)-(719) is associated with a single pattern, (722) and (725)-(727), respectively. For example, the eighth symbol of the example set shown in FIG. 6 may be associated with three graphemes, the first of which corresponds to forms (702), (724), (725), and (726), the second to forms (728) and (730), and the third to form (732). In this example, the first grapheme comprises forms that use straight horizontal elements, the second comprises forms that use horizontal elements and short vertical elements on the right side, and the third comprises forms that include curved (rather than straight) elements. In an alternative embodiment, all forms of the eighth symbol, (702), (728), (724), (732), (725), (726), and (730), can be represented as patterns associated with a single grapheme for the eighth symbol. To a certain extent, the choice of graphemes is arbitrary. Some hieroglyphic languages have many thousands of different graphemes. Patterns can be regarded as alternative representations, or images, of a symbol, and they can be represented as sets of “parameter - parameter value” pairs, as described below.

Although the relationships between symbols, graphemes, and patterns shown in FIG. 7B are strictly hierarchical, with each grapheme associated with one particular parent symbol, actual relationships need not be so simply structured. FIG. 7C shows a somewhat more complex set of relationships, in which two symbols (730) and (732) are both parents of two different graphemes (734) and (736). As another example, the following English-language symbols may all be associated with a ring-shaped grapheme: the lowercase letter “o”, the uppercase letter “O”, the digit “0”, and the degree symbol “°”. The relationships can alternatively be represented as graphs or networks. In some cases, graphemes (rather than, or in addition to, symbols) may appear at the highest levels of the chosen representation of the relationships. In essence, the identification of the symbols and graphemes, the choice of patterns for a particular language, and the definition of the relationships among them are all to a large extent arbitrary.
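A many-to-many relationship of this kind can be sketched with two plain mappings. The identifiers below (`symbol_730`, `grapheme_734`, `ring`, and so on) are hypothetical labels derived from the reference numerals and the ring-shaped example in the text; the sketch illustrates the graph structure only and is not the data structure used by the described system.

```python
from collections import defaultdict

# Each symbol lists the graphemes it may be written with (hypothetical labels).
symbol_to_graphemes = {
    "symbol_730": ["grapheme_734", "grapheme_736"],
    "symbol_732": ["grapheme_734", "grapheme_736"],
    "o": ["ring"],
    "O": ["ring"],
    "0": ["ring"],
    "°": ["ring"],
}

# Invert the mapping: a grapheme may have several parent symbols.
grapheme_to_symbols = defaultdict(list)
for symbol, graphemes in symbol_to_graphemes.items():
    for grapheme in graphemes:
        grapheme_to_symbols[grapheme].append(symbol)

print(sorted(grapheme_to_symbols["ring"]))
print(sorted(grapheme_to_symbols["grapheme_734"]))
```

Because a recognized grapheme can map back to several symbols, a later stage still has to choose among the candidate symbol codes.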

FIGS. 8A-B illustrate parameters and parameter values calculated for symbol images. It should be noted that the phrase “symbol image” may describe a printed, handwritten, or displayed symbol or grapheme. In the following example, parameters and parameter values are discussed with respect to symbol images, but in the context of a real language, parameters and parameter values are often used to characterize and represent images of graphemes. FIG. 8A shows a rectangular symbol image (802), extracted from a text-containing image, that corresponds to the 22nd symbol of the example set shown in FIG. 6. FIG. 8B shows a rectangular symbol image (804), extracted from a text-containing image, that corresponds to the 48th symbol of the example set shown in FIG. 6. In printing and writing the hypothetical language that corresponds to the example character set, the symbols are placed in the middle of rectangular regions. When this is not the case, OCR systems perform an initial image-processing step that changes the orientation, scale, and position of the extracted symbol images relative to the background region in order to normalize them for subsequent processing stages.

FIG. 8A illustrates three different parameters that an OCR system can use to characterize symbols. Note that the symbol image area, or symbol window, is characterized by the vertical size of the symbol window (806), abbreviated “vw”, and the horizontal size of the symbol window (808), abbreviated “hw”. The first parameter is the longest continuous horizontal line segment in the symbol image, denoted “h” (810). This is the longest sequence of adjacent dark pixels against the essentially white background of the symbol window. The second parameter is the longest continuous vertical line segment in the symbol image (812), denoted “v” in the ratios used below. The third parameter is the ratio of the number of pixels belonging to the symbol image to the total number of pixels in the symbol window, expressed as a percentage; in this example, it is the percentage of black pixels in the essentially white symbol window. In all three cases, the parameter values can be calculated directly once a bitmap of the symbol window has been created. FIG. 8B illustrates two additional parameters. The first is the number of internal horizontal white stripes in the symbol image; the symbol image shown in FIG. 8B has one internal horizontal white stripe (816). The second is the number of internal vertical white stripes in the symbol image. The 48th symbol of the set, represented by the image in the symbol window (804) shown in FIG. 8B, has one internal vertical white stripe (818). The number of internal horizontal white stripes is denoted “hs”, and the number of internal vertical white stripes is denoted “vs”.
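The three parameters of FIG. 8A can be computed directly from a symbol bitmap, as the following sketch suggests. It assumes a binary bitmap in which 1 marks a dark pixel; the function names and the dictionary keys are assumptions, and the stripe counts of FIG. 8B are left out because their exact definition depends on the figure.

```python
def longest_run(line):
    """Length of the longest sequence of adjacent dark (1) pixels in a line."""
    best = current = 0
    for pixel in line:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best


def basic_features(bitmap):
    """Compute h, v and the percentage of dark pixels for a symbol window."""
    vw = len(bitmap)        # vertical size of the symbol window
    hw = len(bitmap[0])     # horizontal size of the symbol window
    h = max(longest_run(row) for row in bitmap)
    v = max(longest_run(column) for column in zip(*bitmap))
    b = 100.0 * sum(map(sum, bitmap)) / (vw * hw)
    return {"h": h, "v": v, "b": b, "h/hw": h / hw, "v/vw": v / vw}


# A small L-shaped symbol in a 4 x 5 window (illustrative data only).
window = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
]
features = basic_features(window)
print(features)
```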

FIG. 9 shows a table of parameter values calculated for all symbols of the example set shown in FIG. 6. Each row of the table (902) shown in FIG. 9 contains the parameter values calculated for a particular symbol. The parameters include: (1) the ratio of the longest continuous horizontal line segment to the horizontal size of the symbol window, $\frac{h}{hw}$ (904); (2) the ratio of the longest continuous vertical line segment to the vertical size of the symbol window, $\frac{v}{vw}$ (906); (3) the percentage of the total area occupied by the symbol image, that is, the black area, b (908); (4) the number of internal vertical stripes, vs (910); (5) the number of internal horizontal stripes, hs (912); (6) the total number of internal vertical and horizontal stripes, vs+hs (914); and (7) the ratio of the longest continuous vertical segment to the longest continuous horizontal segment, $\frac{v}{h}$ (916). As expected, in the first row (920) of the table (902) shown in FIG. 9, where the first symbol of the set ((606) in FIG. 6) is a vertical bar, the numerical value of the parameter $\frac{v}{vw}$, 0.6, is significantly larger than that of the parameter $\frac{h}{hw}$, 0.2. The symbol (606) occupies only 12 percent of the entire symbol window (602). The symbol (606) has neither internal horizontal nor internal vertical white stripes, so the values of the parameters vs, hs, and vs+hs are 0. The ratio $\frac{v}{h}$ is equal to 3. Because the example symbols have a relatively simple block structure, the values of each parameter vary relatively little across the table (902).
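The value 3 in the last column can be checked against the two window-relative ratios in the same row. The check below assumes a square symbol window, that is, $vw = hw$; this is an assumption made here and is not stated in the table.

$$
\frac{v}{h} \;=\; \frac{v/vw}{h/hw}\cdot\frac{vw}{hw} \;=\; \frac{0.6}{0.2}\cdot\frac{vw}{hw} \;=\; 3\cdot\frac{vw}{hw} \;=\; 3 \quad \text{when } vw = hw .
$$

For a non-square window, the same identity shows that the two ratios alone would not determine $\frac{v}{h}$ without also knowing $\frac{vw}{hw}$.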

Although the values of each of the parameters discussed above with reference to FIG. 9 differ relatively little across the 48 example symbols, just three parameters are enough to divide these symbols into 18 parts, or clusters. FIG. 10 shows a three-dimensional plot of the symbols of the example set shown in FIG. 6, in which each dimension represents the values of one of three different parameters. In FIG. 10, the first horizontal axis (1002) represents the parameter $\frac{v}{h}$ ((916) in FIG. 9), the second horizontal axis (1004) represents the parameter vs+hs ((914) in FIG. 9), and the third, vertical axis (1006) represents the parameter b ((908) in FIG. 9). The plot contains 18 distinct points (such as the plotted point (1008)), each shown as a small black disk with a vertical projection onto the horizontal plane that passes through axes (1002) and (1004); the projection is drawn as a vertical dashed line, such as the vertical dashed line (1010) connecting the point (1008) with its projection (1012) onto the horizontal plane. The codes, or sequence numbers, of the symbols that correspond to a given point are shown in brackets to the right of that point. For example, symbols 14, 20, and 37 (1014) correspond to a single point (1016) with coordinates (1, 0, 0.32) relative to axes (1002), (1004), and (1006). Each point is associated with the number of a part, or cluster, which is shown in a small rectangle to the left of the point. For example, the point (1016) is associated with cluster number “14” (1018). FIGS. 11A-B show the symbols contained in each of the clusters represented by the points of the three-dimensional space shown in FIG. 10. An examination of the symbols that make up these clusters, or parts, readily shows that the three parameters used to distribute the symbols in the three-dimensional space of FIG. 10 effectively partition the 48 example symbols into sets of related symbols.
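Partitioning symbols by a tuple of parameter values can be sketched as a simple grouping step. The feature values for symbols 14, 20, and 37 and for symbol 1 are the ones quoted in the text (with b written as a fraction); the function name, the dictionary layout, and the use of exact value matching are assumptions for illustration, and the remaining symbols of FIG. 9 are not reproduced here.

```python
from collections import defaultdict


def cluster_by_features(feature_table, keys=("v/h", "vs+hs", "b")):
    """Group symbol codes that share the same values of the chosen parameters."""
    clusters = defaultdict(list)
    for code, features in feature_table.items():
        point = tuple(features[key] for key in keys)
        clusters[point].append(code)
    return dict(clusters)


# Only the values quoted in the text; other rows of the table are omitted.
feature_table = {
    1: {"v/h": 3, "vs+hs": 0, "b": 0.12},
    14: {"v/h": 1, "vs+hs": 0, "b": 0.32},
    20: {"v/h": 1, "vs+hs": 0, "b": 0.32},
    37: {"v/h": 1, "vs+hs": 0, "b": 0.32},
}
clusters = cluster_by_features(feature_table)
print(clusters)
```

With measured rather than tabulated values, the parameters would normally be quantized before grouping, since exact equality of floating-point values is fragile.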

Additional parameters can be used to uniquely recognize each symbol within each cluster or partition. Consider, for example, cluster 8 (1102), shown in FIG. 11A. This cluster includes four L-shaped corner symbols that differ in rotation angle and have symbol codes 26, 32, 38, and 44, as well as a T-shaped symbol with code 43 and a cross-shaped symbol with code 45. FIG. 12A shows an additional parameter that can be used, in combination with the three parameters corresponding to the dimensions of the three-dimensional parameter space shown in FIG. 10, to fully distinguish each of the symbols in cluster 8. This parameter is referred to below as “p”. As shown in FIG. 12A, the symbol window (1202) is divided into four quadrants: Q1 (1204), Q2 (1205), Q3 (1206), and Q4 (1207). The area occupied by the symbol image within each quadrant is then computed and is shown next to the quadrant. For example, in quadrant Q1 (1204), the symbol image occupies 13.5 units of area (1210). These area values are then assigned to the variables Q1, Q2, Q3, and Q4. Thus, in the example shown in FIG. 12A, Q1 is assigned the value 13.5, Q2 the value 0, Q3 the value 18, and Q4 the value 13.5. The value of the new parameter p is then computed according to the small fragment of pseudocode (1212) shown below the symbol window in FIG. 12A. For example, when all four variables Q1, Q2, Q3, and Q4 have the same value, p is assigned the value 0 (1214), indicating that the four quadrants of the symbol window contain equal areas of the symbol image. FIG. 12B, which should be read with reference to FIG. 12A, shows the value of the additional parameter for each symbol in cluster 8. As the parameter values in FIG. 12B show, the new parameter has a different value for each of the six symbols in cluster 8. In other words, the combination of the three parameters used to create the three-dimensional plot of FIG. 10 and the additional parameter of FIG. 12A uniquely identifies all of the symbols in cluster 8.
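The quadrant-area step described above can be sketched as follows. This is only an illustration: the bitmap representation, the function name `quadrant_areas`, and the assignment of Q1 to Q4 to the upper-left, upper-right, lower-left, and lower-right quadrants are assumptions, and the rule that derives p from the four areas (pseudocode (1212) in FIG. 12A) is not reproduced here.

```python
from typing import List, Tuple


def quadrant_areas(bitmap: List[List[int]]) -> Tuple[float, float, float, float]:
    """Return the symbol area falling in each quadrant of a binary symbol window.

    Assumed (hypothetical) layout: Q1 upper-left, Q2 upper-right,
    Q3 lower-left, Q4 lower-right. Each nonzero pixel counts as one unit of area.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    mid_r, mid_c = rows // 2, cols // 2
    areas = [0.0, 0.0, 0.0, 0.0]
    for r, row in enumerate(bitmap):
        for c, pixel in enumerate(row):
            if pixel:
                # Lower half adds 2 to the index, right half adds 1.
                index = (2 if r >= mid_r else 0) + (1 if c >= mid_c else 0)
                areas[index] += 1.0
    return (areas[0], areas[1], areas[2], areas[3])


# An L-shaped symbol in a 4x4 window (illustrative data, not taken from FIG. 12A).
l_shape = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
q1, q2, q3, q4 = quadrant_areas(l_shape)
```

As in the FIG. 12A example, an L-shaped symbol leaves one quadrant empty and concentrates the largest area in the quadrant containing its corner.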

FIG. 13A shows an additional parameter used to characterize symbol images, referred to below as the “longest segment” parameter. This parameter indicates the direction of the longest line segment in the symbol image. When the longest segment is vertical (1302), the parameter has the value 0 (1304). When the longest segment is horizontal (1306), the value is 1 (1308). When the longest segment is diagonal (1310), the value is 2 (1312). Finally, when the symbol has no such segment, as in the case of the symbol shown in cell (1314) of FIG. 13A, the value is 3 (1316).
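The encoding of the longest-segment parameter can be written down compactly. The sketch below assumes that the lengths of the longest vertical, horizontal, and diagonal segments have already been measured by some other routine; the input dictionary, the names, and the tie-breaking behavior are hypothetical and only the four numeric codes come from the description.

```python
from enum import IntEnum
from typing import Dict


class LongestSegment(IntEnum):
    """Numeric codes for the longest-segment parameter, as listed for FIG. 13A."""
    VERTICAL = 0
    HORIZONTAL = 1
    DIAGONAL = 2
    NONE = 3


_BY_NAME = {
    "vertical": LongestSegment.VERTICAL,
    "horizontal": LongestSegment.HORIZONTAL,
    "diagonal": LongestSegment.DIAGONAL,
}


def longest_segment_code(lengths: Dict[str, float]) -> LongestSegment:
    """Map measured segment lengths (hypothetical input) to the parameter value.

    Ties are resolved in favor of the first direction in the dictionary,
    which is an arbitrary choice made for this sketch.
    """
    positive = {name: n for name, n in lengths.items() if n > 0}
    if not positive:
        return LongestSegment.NONE
    return _BY_NAME[max(positive, key=positive.get)]
```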

FIG. 13B shows the complete set of parameter values for the hypothetical symbol set shown in FIG. 6. The table in FIG. 13B is identical to table (902), previously shown in FIG. 9, except for the final two columns (1320) and (1322), which contain the values of the parameter p, described above with reference to FIG. 12A, and of the longest-segment parameter, discussed above with reference to FIG. 13A.

FIGS. 14A-17C illustrate the construction of four decision trees from four subsets of the hypothetical symbol set shown in FIG. 6. FIG. 14A shows 36 symbols selected from the 48 symbols of the hypothetical symbol set of FIG. 6. This first subset is used to build a decision tree that classifies an input symbol as the symbol in the subset of FIG. 14A that it most closely matches. FIG. 14B shows a table of parameter values for the 36 symbols of the subset shown in FIG. 14A. These parameter values are used to build the decision tree. FIG. 14C shows the decision tree constructed for the subset of FIG. 14A from the parameter values shown in FIG. 14B. Note that the table of parameter values shown in FIG. 14B is a subset of the complete table of parameter values shown in FIG. 13B.

There are many different types of decision trees. The decision trees constructed in the example of FIGS. 14A-17C are binary decision trees. Each node of a decision tree, such as node (1402) in FIG. 14C, decides which of two lower-level nodes, (1404) and (1406), to move to during traversal of the tree, based on the set of parameter values computed for a symbol image. The decision tree is built from the subset of symbols shown in FIG. 14A and the parameter values shown in FIG. 14B. As a first step in constructing the decision tree, the values of each of the nine parameters, computed for each symbol image in the subset, are examined in order to identify the parameter, together with a numerical threshold, that most evenly divides the subset of symbol images into two parts. For the subset of symbols shown in FIG. 14A, the parameter $\frac{v}{v w}$ is a suitable parameter for the first test, represented by node (1402), the root node of the decision tree. With a threshold of 0.33, the test of whether the value of $\frac{v}{v w}$ for a particular symbol image is less than or equal to 0.33 divides the 36 symbol images of the subset into a first part of 18 symbols (1408) and a second part of 18 symbols (1410). At each step of tree construction, a test is sought that most evenly divides the set of symbols arriving at a node into two output parts. This process is carried out recursively, node by node and level by level, to build the entire decision tree (1400). The root node (1402) constitutes the first level of the decision tree. The second level includes nodes (1404) and (1406). Both of these nodes contain a test based on the parameter p, described with reference to FIG. 12A. For node (1404), the test threshold is 11; for node (1406), the threshold is 4. The terminal nodes of the tree (its “leaves”) are drawn as circles, such as terminal node (1412) in FIG. 14C. Terminal nodes represent individual symbols from the subset of symbols used to build the decision tree. There are 36 terminal nodes, corresponding to the 36 symbols in the subset shown in FIG. 14A.
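The recursive construction described above, in which each node receives the parameter and threshold that most evenly divide its symbols, might be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce FIG. 14C: the `Leaf` and `Node` classes, the function names, and the choice of candidate thresholds (the observed parameter values) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

Features = Dict[str, float]
Sample = Tuple[int, Features]  # (symbol code, parameter values)


@dataclass
class Leaf:
    codes: List[int]


@dataclass
class Node:
    param: str
    threshold: float
    left: "Tree"   # followed when value <= threshold
    right: "Tree"  # followed otherwise


Tree = Union[Leaf, Node]


def best_even_split(samples: List[Sample]) -> Optional[Tuple[str, float]]:
    """Return the (parameter, threshold) whose test splits the samples most evenly."""
    best: Optional[Tuple[int, str, float]] = None
    for param in samples[0][1]:
        values = sorted({features[param] for _, features in samples})
        # Skipping the largest value guarantees that both sides are non-empty.
        for threshold in values[:-1]:
            n_left = sum(1 for _, f in samples if f[param] <= threshold)
            imbalance = abs(2 * n_left - len(samples))
            if best is None or imbalance < best[0]:
                best = (imbalance, param, threshold)
    return None if best is None else (best[1], best[2])


def build_tree(samples: List[Sample]) -> Tree:
    """Recursively split until each leaf holds a single symbol (or cannot be split)."""
    if len(samples) <= 1:
        return Leaf([code for code, _ in samples])
    split = best_even_split(samples)
    if split is None:  # all parameter values identical: symbols are indistinguishable
        return Leaf([code for code, _ in samples])
    param, threshold = split
    left = [s for s in samples if s[1][param] <= threshold]
    right = [s for s in samples if s[1][param] > threshold]
    return Node(param, threshold, build_tree(left), build_tree(right))
```

With 36 distinguishable symbols, a perfectly balanced sequence of splits of this kind yields leaves at the sixth and seventh levels, which agrees with the shape described for decision tree (1400).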

Again, there are many different types of decision trees and many methods for constructing them. In this example, each node contains a simple test that compares the value of one parameter with a threshold. In other types of decision trees, the tests may be more complex and may involve several parameters. In certain types of decision trees, the parameters used in the tests are selected from random subsets of the parameters in order to introduce randomness across the series of decision trees built to characterize a particular set of symbols or other objects. Parameters and thresholds may be chosen to minimize or maximize an objective function, including objective functions based on the information-theoretic concepts of entropy and information gain. As discussed further below, multiple decision trees built for a particular set of symbol images or other objects, with a different subset of the objects used to build each tree, can together form a decision forest, which often provides more reliable and accurate classification. A considerable amount of research and development has been devoted to decision trees, decision forests, and random decision forests, resulting in detailed characterizations of them and of their application in particular problem domains. Although simple binary decision trees are used in this example, any of a wide variety of types of decision trees and decision forests can be used to process symbol images in OCR systems in accordance with this document.
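For readers unfamiliar with the objective functions mentioned above, the following sketch shows the standard definitions of Shannon entropy and information gain for a binary split, together with the selection of a random subset of parameters. The function names are illustrative, and nothing here is claimed to be the criterion used for the trees in FIGS. 14A-17C, which are described as using the most even split.

```python
import math
import random
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy, in bits, of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent: Sequence[int], left: Sequence[int], right: Sequence[int]) -> float:
    """Reduction in entropy obtained by splitting `parent` into `left` and `right`."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted


def random_parameter_subset(params: Sequence[str], k: int, rng: random.Random) -> List[str]:
    """Pick k distinct parameters to consider at a node, as in random decision forests."""
    return rng.sample(list(params), k)
```

A split that separates two equally frequent classes perfectly has an information gain of one bit, while a split that leaves the class mix unchanged on both sides has a gain of zero.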

In FIG. 14C, decision tree (1400) has five complete levels of nodes (1414)-(1418). Most of the nodes at the sixth level (1420) are terminal (leaf) nodes, but a few sixth-level nodes, (1422)-(1425), each have a pair of terminal nodes at the seventh level. As described further below, an arbitrary symbol image can be classified as corresponding to one of the 36 symbols of the subset shown in FIG. 14A by computing the numerical parameter values for the symbol image and then traversing the tree from the root node to a terminal node, using these parameter values and the tests at each node. FIGS. 15A-C illustrate the construction of a second decision tree (1500) from a second subset of 36 symbols, shown in FIG. 15A, selected from the 48 symbols of the hypothetical symbol set of FIG. 6. Similarly, FIGS. 16A-C illustrate the construction of a third decision tree (1600) from a third subset of symbols selected from the 48 symbols of the hypothetical symbol set of FIG. 6. Finally, FIGS. 17A-C illustrate the construction of a fourth decision tree (1700) from a fourth subset of symbols selected from the 48 hypothetical symbols of FIG. 6. In this example, each of the four decision trees (1400), (1500), (1600), and (1700) is built from a different subset of 36 symbols selected from the 48 symbols of the hypothetical symbol set of FIG. 6. This is, however, a somewhat artificial example. In general, when a forest of decision trees is built from a training set, some relatively large fraction of the objects in the training set is selected at random, with replacement, to build each tree. When 90% of the objects are randomly selected to build each tree, the average overlap between the training objects used to build any particular pair of trees can be expected to exceed 90%, since each random selection is made from the entire set of objects. Nevertheless, this example provides a simple illustration of the construction and use of a decision forest.
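The general sampling scheme mentioned above, in which each tree receives its own random draw from the whole training set, could look like the sketch below. The function name, the fixed seed, and the literal reading of “with replacement” as repeated independent draws are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def draw_training_subsets(
    objects: Sequence[int], n_trees: int, fraction: float = 0.9, seed: int = 0
) -> List[List[int]]:
    """Draw one training sample per tree from the full set of objects.

    Each sample has round(fraction * len(objects)) entries and is drawn with
    replacement, so the same object may appear more than once in a sample.
    """
    rng = random.Random(seed)
    size = int(round(fraction * len(objects)))
    return [rng.choices(objects, k=size) for _ in range(n_trees)]


# Four samples of 36 draws each from 48 symbol codes, mirroring the sizes in the example.
subsets = draw_training_subsets(list(range(1, 49)), n_trees=4, fraction=0.75)
```

Unlike the hand-picked subsets of FIGS. 14A-17A, samples drawn this way are chosen independently for each tree from the entire set.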

FIGS. 18A-B illustrate the classification of symbol images using a decision forest. FIG. 18A shows three symbol images (1802), (1804), and (1806), together with the numerical values (1808), (1810), and (1812), respectively, computed for these symbol images for the nine parameters discussed above. Symbol image (1802) exactly matches the image of symbol 22 of the hypothetical symbol set shown in FIG. 6. The parameters (1808) computed for this symbol image are therefore identical to those shown in row (1319) of FIG. 13B. Symbol image (1804) is similar to symbol image (1802), but the relative widths of its segments have been reduced, producing a slightly thinner symbol. Symbol image (1806) has a thinner horizontal segment and two inward-facing extensions at the tops of the vertical elements of the symbol. The parameter values (1810) computed for symbol image (1804) and the parameter values (1812) computed for symbol image (1806) do not match any of the rows of parameter values shown in FIG. 13B, because symbol images (1804) and (1806) do not match any of the symbol images of the hypothetical symbol set shown in FIG. 6.

FIG. 18B illustrates the use of a small decision forest comprising decision trees (1400), (1500), (1600), and (1700), shown in FIGS. 14C, 15C, 16C, and 17C, respectively. Each line of text to the right of the symbol images in FIG. 18B represents a traversal of one of the four decision trees. For example, text line (1820) in FIG. 18B represents a traversal of decision tree (1400) of FIG. 14C, referred to as “tree 1”, using the parameters (1808) computed for symbol image (1802). The first, or root, node ((1402) in FIG. 14C) of decision tree (1400) performs the test $\frac{v}{v w} \leq 0.33$. The computed value of $\frac{v}{v w}$ for symbol image (1802) is shown in FIG. 18A as 0.6. Because 0.6 exceeds 0.33, the test at node (1402) fails and traversal proceeds to node (1406), that is, to the right, as indicated by the letter “R” (1822) in text line (1820). Node (1406) performs the test p ≤ 4. The computed value of the parameter p for symbol image (1802) is shown in FIG. 18A as 1. The test at node (1406) therefore succeeds, and the left downward arrow (1430) is followed, as indicated by the letter “L” (1824) in FIG. 18B. Node (1432) performs the test b < 0.2. The value of the parameter b for symbol image (1802) in FIG. 18A is 0.28. The test at node (1432) therefore fails, and the right downward arrow (1434) is followed, as indicated by the letter “R” (1826) in FIG. 18B. Node (1436) performs the test p < 1. Because the value of p for symbol image (1802) is 1, the test at node (1436) fails, and the right downward arrow (1438) is followed, as indicated by the letter “R” (1828) in FIG. 18B. Node (1440) contains the test p < 4. Because the value of p for symbol image (1802) is 1, this test succeeds, and the left downward arrow (1442) is followed, as indicated by the letter “L” (1830) in FIG. 18B. This leads to terminal node (1444), which represents the symbol with symbol code 22. Text lines (1832)-(1834) represent traversals of the second through fourth decision trees (1500), (1600), and (1700), which produce symbol codes 37, 22, and 22. The symbol code produced most often by the four traversals is symbol code 22 ((1836) in FIG. 18B), which is the result of applying parameters (1808) to the decision forest comprising decision trees (1400), (1500), (1600), and (1700). Note that the subset of symbols used to build the second decision tree, which selected symbol code 37, does not include symbol 22. It is therefore not surprising that the symbol code produced by traversing this tree is not symbol code 22. Applying the parameters (1810) of symbol image (1804) to the decision forest gives an ambiguous result: either symbol code 45 or symbol code 22 (1838). Applying the parameters (1812) computed for symbol image (1806) results in the selection of symbol code 22 (1840).
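The traversal and voting just described can be sketched as follows. The nested-tuple tree representation and the function names are assumptions made for illustration. The fragment `TREE_1_PATH` encodes only the five tests named in the text for decision tree (1400); every branch not described in the text is filled with the placeholder `UNKNOWN`, and the key `"v/vw"` is simply a label for the first parameter.

```python
from collections import Counter
from typing import Dict, List, Tuple, Union

# A leaf is a symbol code; an internal node is (parameter, operator, threshold, left, right).
Tree = Union[int, Tuple[str, str, float, "Tree", "Tree"]]

UNKNOWN = -1  # placeholder for subtrees that the text does not describe


def classify(tree: Tree, params: Dict[str, float]) -> Tuple[int, str]:
    """Walk from the root to a leaf; return the symbol code and the L/R path taken."""
    path = ""
    node = tree
    while not isinstance(node, int):
        name, op, threshold, left, right = node
        value = params[name]
        passed = value <= threshold if op == "<=" else value < threshold
        path += "L" if passed else "R"  # a successful test goes left
        node = left if passed else right
    return node, path


def forest_vote(trees: List[Tree], params: Dict[str, float]) -> List[int]:
    """Return the most frequently produced symbol code(s); several codes mean a tie."""
    counts = Counter(classify(tree, params)[0] for tree in trees)
    top = max(counts.values())
    return sorted(code for code, n in counts.items() if n == top)


# Nodes (1402), (1406), (1432), (1436), (1440) and leaf (1444), as named in the text.
TREE_1_PATH: Tree = (
    "v/vw", "<=", 0.33, UNKNOWN,
    ("p", "<=", 4, ("b", "<", 0.2, UNKNOWN,
                    ("p", "<", 1, UNKNOWN,
                     ("p", "<", 4, 22, UNKNOWN))), UNKNOWN),
)

# Parameter values quoted in the text for symbol image (1802).
params_1802 = {"v/vw": 0.6, "p": 1, "b": 0.28}
code, path = classify(TREE_1_PATH, params_1802)
```

With the values quoted for symbol image (1802), the walk takes the branches R, L, R, R, L and ends at symbol code 22, matching text line (1820). Returning every code that shares the highest count keeps an ambiguous outcome, such as the one reported for symbol image (1804), visible to the caller.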

In this simple example, all three variations of the U-shaped symbol 22, (1802), (1804), and (1806), were classified by the decision forest comprising decision trees (1400), (1500), (1600), and (1700) either as having symbol code 22 or as having one of symbol codes 22 and 45. However, images of U-shaped symbols that deviate further from the canonical image shown in FIG. 6 for symbol code 22 would likely produce more varied results. The forest of four decision trees in this example is not very robust, for several reasons. One reason is that most of the tests in the four decision trees involve only a small fraction of the parameters, so that the trees rely mainly on a few of the nine parameters. In addition, the numerical values of the parameters in the example do not necessarily correlate with the degree of visual similarity between the symbol images that produce them. For example, the numerical values of the parameter p, discussed with reference to FIG. 12A, do not numerically reflect the degree of similarity between symbols. Another problem is that each decision tree was built from only 75% of the symbols, and these 75% were not chosen at random. As a result, there is considerably less overlap among the symbol sets used to build the trees than is generally used to obtain a robust decision forest. Finally, a practical decision forest would most likely include a larger number of decision trees built from subsets of a much larger set of better-distributed parameters.

FIG. 19 shows a small text-containing image that has been initially processed by an OCR system to produce a grid of symbol windows (1900), each containing a symbol image. For clarity, FIG. 19 shows the grid of symbol windows (1900) without the symbol images. A vertical index i (1902) and a horizontal index j (1904) are used to index the symbol windows. To simplify the example discussed below, the discussion refers to symbols and symbol images rather than to graphemes. The example assumes a one-to-one correspondence between the symbols, graphemes, and patterns used to identify symbol images in symbol windows. In addition to the grid of symbol windows (1900), FIG. 19 shows an array, or matrix, of patterns (1906), each cell of which, such as cell (1908), contains a pattern. Patterns are sets of parameter/parameter-value pairs, where the parameters are chosen to allow unambiguous recognition of symbol images, as described above with reference to FIGS. 8A-12B. FIG. 19 also shows an array of parameters (1910), represented as a set of pairs of curly braces, such as the pair (1912). Each pair of curly braces represents a function that computes the value of a parameter for a symbol image.
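One way to picture the two arrays of FIG. 19 is as a collection of parameter functions and a collection of patterns holding parameter/value pairs. The sketch below is only an illustration of that data layout: the two parameter functions `fill_ratio` and `aspect_ratio` are hypothetical stand-ins, not parameters defined in this description, and the exact-match comparison with a tolerance is likewise an assumption.

```python
from typing import Callable, Dict, List

SymbolImage = List[List[int]]            # binary bitmap of one symbol window
ParameterFn = Callable[[SymbolImage], float]
Pattern = Dict[str, float]               # parameter name -> parameter value


def fill_ratio(image: SymbolImage) -> float:
    """Hypothetical parameter: fraction of the window covered by the symbol."""
    total = sum(len(row) for row in image)
    return sum(1 for row in image for pixel in row if pixel) / total


def aspect_ratio(image: SymbolImage) -> float:
    """Hypothetical parameter: window width divided by window height."""
    return len(image[0]) / len(image)


# The array of parameters (1910): each entry computes one value for a symbol image.
PARAMETERS: Dict[str, ParameterFn] = {
    "fill_ratio": fill_ratio,
    "aspect_ratio": aspect_ratio,
}


def evaluate_parameters(image: SymbolImage) -> Pattern:
    """Apply every parameter function to the image."""
    return {name: fn(image) for name, fn in PARAMETERS.items()}


def matches(pattern: Pattern, values: Pattern, tolerance: float = 1e-9) -> bool:
    """True when every parameter/value pair of the pattern agrees with the computed values."""
    return all(abs(values[name] - value) <= tolerance for name, value in pattern.items())
```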

FIG. 20 shows the general approach to processing the grid of symbol windows shown in FIG. 19. At the highest level, processing can be viewed as a nested for loop (2002) in which the subroutine "process" (2004) is called to analyze each symbol window (2006) and produce the corresponding symbol code (2008). In other words, in the pseudocode example the grid of symbol windows is the two-dimensional array "page_of_text", and the OCR system produces the two-dimensional array of symbol codes "processed_text" from the two-dimensional array of symbol windows "page_of_text". In FIG. 20, curved arrows, such as curved arrow (2010), show the order in which the first row of the two-dimensional array, or grid, of symbol windows (1900) is processed, and horizontal arrows, such as arrow (2012), show the processing of the subsequent rows, carried out in the for loop (2002). The grid of symbol windows (1900) is thus traversed in this order, and each symbol window in the grid is processed separately to produce the corresponding symbol code.
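The nested loop described above can be sketched as follows. The names `page_of_text`, `processed_text`, and `process` come from the description; the function `process_page` and the use of `ord` as a stand-in for the "process" subroutine are assumptions made so that the sketch runs.

```python
from typing import Callable, List, TypeVar

Cell = TypeVar("Cell")


def process_page(page_of_text: List[List[Cell]],
                 process: Callable[[Cell], int]) -> List[List[int]]:
    """Apply `process` to every symbol window, row by row, left to right."""
    processed_text: List[List[int]] = []
    for row in page_of_text:              # vertical index over rows
        processed_row: List[int] = []
        for cell in row:                  # horizontal index over windows in a row
            processed_row.append(process(cell))
        processed_text.append(processed_row)
    return processed_text


# Toy usage: single characters stand in for symbol images, and ord()
# stands in for the "process" subroutine.
toy_page = [["a", "b"], ["c", "d"]]
codes = process_page(toy_page, ord)
print(codes)
```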

FIG. 21 shows a first approach to implementing the subroutine "process" ((2004) in FIG. 20). The symbol image contained in the symbol window (2102) is the input parameter of the subroutine "process". The subroutine computes the values of eight different parameters p1-p8, which are used in this example to characterize symbol images, by calling the subroutine "parameterize" (2104) eight times, as shown in FIG. 21. The subroutine "parameterize" takes as arguments the symbol image and an integer that indicates which parameter to compute, and returns the computed value. The parameter values are stored in the array of parameter values "p_values". Then, as shown by curved arrows such as curved arrow (2106), the subroutine "process" iterates over all patterns (2108) that correspond to the symbols of the language, comparing the parameter values computed for the symbol image and stored in the array "p_values" with the precomputed parameter values of each pattern, as illustrated by the comparison operation (2110) in FIG. 21. The pattern whose parameters most closely match the parameters computed for the symbol image is selected as the matching pattern, and the symbol code that corresponds to this pattern is the return value of the subroutine "process". Pseudocode (2112) for this first embodiment of the subroutine "process" is shown in FIG. 21. In the first for loop (2114), the parameter values are computed for the input symbol s. Then, in the nested for loops headed by the outer for loop (2116), each pattern in the array, or vector, of patterns (2108) is examined in the order indicated by curved arrows such as curved arrow (2106). In the inner for loop (2118), the subroutine "compare" is called to compare each computed parameter value of the symbol image with the corresponding precomputed parameter value of the pattern, and the cumulative comparison result is accumulated in the local variable t. The maximum accumulated comparison value is stored in the local variable score, and the index of the pattern that most closely matches the symbol image is stored in the variable p (2120). The symbol code associated with pattern p is returned as the result of the subroutine "process" (2120).
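A minimal Python sketch of this exhaustive search is given below. It follows the structure described for FIG. 21 (compute the parameter values, then score every pattern and keep the best), but the `Pattern` class, the similarity formula in `compare`, and the toy parameters are assumptions; the source does not specify how "compare" is computed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Pattern:
    code: int                    # symbol code returned when this pattern wins
    values: Sequence[float]      # precomputed parameter values


def compare(a: float, b: float) -> float:
    """Hypothetical similarity: 1.0 for equal values, approaching 0 as they diverge."""
    return 1.0 / (1.0 + abs(a - b))


def process(image, params: List[Callable], patterns: List[Pattern]) -> int:
    if not patterns:
        raise ValueError("at least one pattern is required")
    # First loop: compute the R parameter values for the symbol image.
    p_values = [func(image) for func in params]
    best_score, best = float("-inf"), patterns[0]
    for pattern in patterns:                         # outer loop: P patterns
        t = 0.0
        for value, ref in zip(p_values, pattern.values):   # inner loop: R comparisons
            t += compare(value, ref)
        if t > best_score:
            best_score, best = t, pattern
    return best.code


# Toy usage with two assumed parameters: ink-pixel count and row count.
params = [lambda img: sum(map(sum, img)), lambda img: len(img)]
patterns = [Pattern(ord("A"), [3, 2]), Pattern(ord("B"), [1, 5])]
result = process([[1, 0], [1, 1]], params, patterns)
print(result)
```

Every call performs R comparisons for each of the P patterns, which is the source of the PR factor in the complexity estimate.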

Finally, FIG. 21 shows a rough estimate of the computational complexity of the first embodiment of the subroutine "process" (2122). The number of symbol windows in the text-containing image is N = i × j. In the current example, N = 357. Of course, the number of symbol images to be processed depends on the type of document and the number of images in it, as well as on the language and other factors. Typically, however, N ranges from several tens to several hundreds per document image. The number of patterns against which symbol images are compared is denoted P. For many alphabetic languages, including most European languages, the number of patterns can be relatively small, reflecting the relatively small set of symbols in the alphabet. For languages such as Chinese, Japanese, and Korean, however, the number of patterns can range from tens of thousands to hundreds of thousands, so for these languages P is much larger than N. The number of parameters used to characterize each symbol image and each pattern is denoted R. The overall computational complexity is therefore estimated as NPR. The factor N comes from the outer nested for loops shown in FIG. 20. The factors PR come from the nested for loops (2116) and (2118) of the embodiment of the subroutine "process" (2112) shown in FIG. 21. In other words, the subroutine "process" is called once for each of the N symbol images, and each call performs R comparisons against each of the P patterns. In this analysis, the initial computation of the parameter values is treated as a constant. The embodiment shown in FIG. 21 can be improved in various ways. For example, only a subset of the parameters, sufficient to match the symbol image unambiguously to a particular pattern, needs to be compared. On average, this may require $\frac{R}{r}$ parameter comparisons rather than R. In addition, instead of comparing each symbol image with every pattern, a relatively high threshold can be set for the match value, and the sequential search through the patterns stops once this threshold is exceeded. In that case, the number of patterns compared with each symbol image is $\frac{P}{p}$ rather than P. But even with such improvements, the computational complexity is still determined by the product NPR, and in particular by its largest factor.
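To make the estimate concrete, the arithmetic can be worked through with illustrative numbers. N = 357 and R = 8 are taken from the example above; P = 10,000 (the low end of "tens of thousands") and p = r = 2 are assumptions chosen only for this illustration.

```latex
\[
N \cdot P \cdot R = 357 \times 10\,000 \times 8 = 28\,560\,000,
\qquad
N \cdot \frac{P}{p} \cdot \frac{R}{r} = \frac{NPR}{p\,r}
  = \frac{28\,560\,000}{2 \times 2} = 7\,140\,000 .
\]
```

Under these assumed values the two improvements together remove only a constant factor of pr = 4, so the count of comparisons still grows in proportion to N, P, and R.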

FIGS. 22A-B show a second embodiment of the subroutine "process" ((2004) in FIG. 20). In the second embodiment, the symbol image (2202) is again the input parameter of the subroutine "process". In this embodiment, however, the patterns are grouped into clusters, such as the clusters discussed earlier with reference to FIGS. 11A-B. The subroutine "process" computes a number of parameter values (2204) sufficient to identify the best-matching cluster while iterating over the clusters of patterns (2206). Thus, a relatively simple comparison operation (2208) is first used to select the best-matching cluster of patterns. The patterns (2210) in the selected cluster (2211) are then traversed using a second, fairly simple comparison operation (2212), which uses some additional parameter values (2214) needed to identify the best-matching pattern among the relatively small number of patterns (2210) in the cluster. Pseudocode for the second embodiment of the subroutine "process" is shown in FIG. 22B. In the first nested for loop (2220), the best-matching cluster of patterns is selected, and in the second nested for loop (2222), the best-matching pattern within the selected cluster is identified. The initial set of parameters used to select the best cluster is computed in the for loop (2224), and the additional parameters needed to select a pattern within the selected cluster are computed in the for loop (2226). FIG. 22B also gives an approximate estimate of the computational complexity of the second embodiment of the subroutine "process" (2230). As indicated, the estimated computational complexity of the second embodiment is:

N(CR₁ + P′R₂), where:

number of symbols per page = N;

number of clusters = C;

number of patterns per cluster = P′;

number of initial parameters = R₁;

number of additional parameters = R₂.

Because P′ is generally much smaller than P, and C is smaller still, the computational complexity of the second embodiment of the subroutine "process" is quite favorable compared with that of the first embodiment.
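The two-stage search of the second embodiment can be sketched as below. The sketch assumes that each cluster is summarized by a representative vector of the R₁ initial parameter values (called `centroid` here) and that closeness is a negated sum of absolute differences; neither detail is given in the source, and the parameter values are passed in precomputed rather than calculated inside the subroutine.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Pattern:
    code: int
    values: Sequence[float]      # R1 initial values followed by R2 additional values


@dataclass
class Cluster:
    centroid: Sequence[float]    # hypothetical representative of the R1 initial values
    patterns: List[Pattern]


def closeness(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical score: larger (closer to zero) means more similar."""
    return -sum(abs(x - y) for x, y in zip(a, b))


def process(p_values: Sequence[float], clusters: List[Cluster], r1: int) -> int:
    # Stage 1: pick the best cluster using only the first r1 parameter values
    # (C clusters times R1 comparisons).
    best_cluster = max(clusters, key=lambda c: closeness(p_values[:r1], c.centroid))
    # Stage 2: pick the best pattern inside that cluster using the additional
    # values (P' patterns times R2 comparisons).
    best = max(best_cluster.patterns,
               key=lambda p: closeness(p_values[r1:], p.values[r1:]))
    return best.code


clusters = [
    Cluster([0.0], [Pattern(65, [0.0, 1.0]), Pattern(66, [0.0, 5.0])]),
    Cluster([10.0], [Pattern(67, [10.0, 1.0])]),
]
result = process([1.0, 4.0], clusters, r1=1)
print(result)
```

Only the patterns of the selected cluster are ever examined in stage 2, which is where the saving over the exhaustive search comes from.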

Another approach to improving the first embodiment of the subroutine "process", discussed above with reference to FIG. 21, is to sort the patterns in the vector, or array, of patterns so that the most likely patterns, those corresponding to the most frequently occurring symbols, are considered at the very start of the traversal. When the search for a matching pattern stops as soon as a pattern is found whose comparison result exceeds some threshold, and when the patterns are sorted by a frequency of use that reflects the probability of symbols occurring in the text-containing image being processed, the computational complexity drops considerably. However, the frequency of occurrence of symbols in a particular text-containing image can vary widely depending on the type of document or page that was scanned to produce the image, and it is unknown before the OCR system processes the image. A sort order that significantly reduces computational complexity for one type of document can significantly increase it for another. For example, a general statistical analysis of various types of text documents in a given language, including novels, advertisements, textbooks, and similar documents, can yield a general ordering of patterns by symbol frequency. In some documents and texts from specialized fields, however, the symbol frequencies can be entirely different. For such documents, the most frequently occurring symbols may end up near the end of the vector, or matrix, of patterns sorted by general symbol frequency.

The second embodiment of the subroutine "process", discussed above with reference to FIGS. 22A-B, generally yields a significant reduction in computational complexity and a corresponding increase in processing speed. As a rule, far fewer comparisons are needed to find the matching pattern for each symbol image. However, the second embodiment has a potentially serious problem: if the first nested for loop, which selects the cluster, chooses wrongly, the subroutine "process" cannot find the correct matching symbol. The correct symbol then lies in a different cluster, which is never examined in the second nested for loop. Although the example symbol sets and clusters presented above are relatively simple, as are the parameters used to characterize them, for languages such as Chinese and Japanese the task is more complex and error-prone because of imperfect printing, document damage, and the various kinds of errors that can arise during scanning and in the early stages of the OCR process. The probability of selecting the wrong cluster under real conditions is therefore very high.
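The effect of pattern ordering on a thresholded search, as discussed above for frequency-sorted patterns, can be sketched as follows. The single-value patterns, the similarity formula, and the ordering `general_order` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def search(value: float,
           ordered_patterns: List[Tuple[str, float]],
           threshold: float) -> Tuple[Optional[str], int]:
    """Return the first pattern whose similarity exceeds the threshold,
    together with the number of comparisons performed."""
    comparisons = 0
    for symbol, ref in ordered_patterns:
        comparisons += 1
        similarity = 1.0 / (1.0 + abs(value - ref))   # hypothetical measure
        if similarity > threshold:
            return symbol, comparisons                # early termination
    return None, comparisons


# Patterns sorted by an assumed general frequency of occurrence.
general_order = [("e", 1.0), ("t", 2.0), ("x", 9.0)]

common = search(1.0, general_order, threshold=0.9)   # frequent symbol, found first
rare = search(9.0, general_order, threshold=0.9)     # infrequent symbol, found last
print(common, rare)
```

A document in which "x" dominates would pay the full traversal cost for most symbol images under this ordering, which is the weakness of a fixed, general frequency sort.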

FIG. 23 shows a third embodiment of the subroutine "process" discussed in the previous subsection, using the same illustration and pseudocode conventions as in the previous subsection. As shown in FIG. 23, the third embodiment of the subroutine "process" uses an additional data structure (2302), referred to as "votes". The votes data structure holds an integer value for each pattern. On initialization, it contains zeros for all patterns. Then, in a first preprocessing stage, represented by the doubly nested for loop (2304) in FIG. 23, a new set of clusters is allocated for each symbol image in the text document (1900), and the patterns in the clusters are ordered according to the votes accumulated in the data structure "votes". In other words, the patterns are ordered within the newly allocated set, or list, of clusters so that the patterns most likely to match the current symbol image are encountered first when the patterns are traversed. The values of a set of comparison parameters computed for the current symbol image are compared with the parameter values of each pattern, and votes are cast for those patterns whose similarity to the symbol image, according to the comparison, exceeds a set threshold. In some embodiments, the clusters can also be sorted within the set of clusters by the cumulative similarity of their patterns to the symbol image.
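One possible reading of the vote-collection stage is sketched below: every symbol image on the page casts a vote for each pattern it resembles closely enough, and the patterns inside each cluster are then reordered by vote count. The function names `collect_votes` and `order_clusters`, the similarity measure, and the toy data are assumptions; the source describes the stage only at the level of FIG. 23.

```python
from typing import Dict, List, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical similarity in (0, 1]; 1.0 means identical parameter values."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


def collect_votes(page_values: List[List[Sequence[float]]],
                  patterns: Dict[str, Sequence[float]],
                  threshold: float) -> Dict[str, int]:
    votes = {name: 0 for name in patterns}        # initialized to zero
    for row in page_values:                       # doubly nested loop over the page
        for p_values in row:
            for name, ref in patterns.items():
                if similarity(p_values, ref) > threshold:
                    votes[name] += 1              # cast a vote for a similar pattern
    return votes


def order_clusters(clusters: List[List[str]],
                   votes: Dict[str, int]) -> List[List[str]]:
    """Sort the patterns of each cluster so that the most-voted come first."""
    return [sorted(c, key=lambda n: votes[n], reverse=True) for c in clusters]


patterns = {"a": [1.0], "b": [5.0], "c": [9.0]}
page = [[[5.0], [5.1]], [[9.0], [5.0]]]           # parameter values per symbol window
votes = collect_votes(page, patterns, threshold=0.8)
ordered = order_clusters([["a", "b", "c"]], votes)
print(votes, ordered)
```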

After the preprocessing stage carried out by the nested for loops (2304), each symbol image is processed by the third embodiment of the subroutine "process". Pseudocode for the third embodiment of the subroutine "process" (2310) is shown in FIG. 23. In this embodiment, the subroutine "process" takes as input the symbol image and the set of clusters prepared for the symbol image during preprocessing and stored in the array NxtLvlClusters, and returns a pointer to a list of potentially matching patterns. In the first for loop (2312), the parameter values used to identify patterns matching the received symbol image are computed. In the second, outer for loop (2314), the clusters are considered in turn until the list of potentially matching patterns is full. In other words, once the maximum number of potentially matching patterns has been found, this outer for loop terminates. In the inner for loop (2316), the function "similar" is called for each pattern in the cluster to determine whether the pattern is similar enough to the symbol image to be added to the list of potentially matching patterns. When the list of potentially matching patterns is full, the inner for loop also terminates. FIG. 23 gives an estimate of the computational complexity of the third embodiment of the subroutine "process" (2320). Because both the outer and inner for loops (2314) and (2316) terminate once a sufficient number of potentially matching patterns has been found, and because the vectors, or lists, of patterns in each cluster are sorted by frequency of occurrence in the document being processed, the third embodiment requires only a relatively small number of comparisons compared with the second embodiment, as indicated by the fraction $\frac{1}{d}$ (2322). Of course, there is a penalty for the initial preprocessing, represented by the coefficient "e" (2344). However, as noted above, the number of symbol images to be processed, N, is generally much smaller than P or P′ for languages such as Chinese, Japanese, and Korean, and therefore the third embodiment of the subroutine "process" provides a significant reduction in computational complexity compared with both the first and the second embodiments discussed above. More importantly, the third embodiment of the subroutine "process" is guaranteed to examine all clusters until some maximum number of potentially matching symbols has been found. When the similarity threshold for clusters is relatively low and the similarity threshold for patterns is relatively high, it is very likely that the list of potentially matching symbols returned by the subroutine "process" includes the symbol that most closely matches the input symbol image.
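The candidate-collecting loop of the third embodiment can be sketched as follows. The name `nxt_lvl_clusters` mirrors the NxtLvlClusters array and `similar` mirrors the function named in the description; the distance test inside `similar`, the `max_candidates` limit, and the toy data are assumptions, and the parameter values are passed in precomputed for brevity.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[str, Sequence[float]]    # (symbol, precomputed parameter values)


def similar(p_values: Sequence[float], ref: Sequence[float],
            threshold: float) -> bool:
    """Hypothetical test: total absolute difference within the threshold."""
    return sum(abs(x - y) for x, y in zip(p_values, ref)) <= threshold


def process(p_values: Sequence[float],
            nxt_lvl_clusters: List[List[Pattern]],
            max_candidates: int,
            threshold: float) -> List[str]:
    candidates: List[str] = []
    for cluster in nxt_lvl_clusters:                 # outer loop over clusters
        if len(candidates) >= max_candidates:
            break                                    # list is full: stop early
        for symbol, ref in cluster:                  # inner loop over sorted patterns
            if len(candidates) >= max_candidates:
                break
            if similar(p_values, ref, threshold):
                candidates.append(symbol)
    return candidates


# Patterns inside each cluster are assumed to be pre-sorted by votes.
clusters = [
    [("b", [5.0]), ("a", [1.0])],
    [("c", [5.5]), ("d", [5.2])],
]
two = process([5.1], clusters, max_candidates=2, threshold=0.5)
three = process([5.1], clusters, max_candidates=3, threshold=0.5)
print(two, three)
```

Unlike the second embodiment, no cluster is ruled out in advance: the search moves on to the next cluster whenever the candidate list is not yet full.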

The information discussed above, including the third embodiment shown in FIG. 23, provides a basis for describing the particular aspect of this generalized third embodiment to which this document is directed. It should be clearly understood that the embodiments described above are generalized embodiments and that a particular OCR system may use any of a large number of possible alternative embodiments.

This document is directed to the control logic and data structures in an OCR system that can be used both for clustering patterns and in the preprocessing stage discussed above, during which the graphemes in patterns can be sorted by frequency of occurrence in a scanned image containing text, or in a set of scanned images. This control logic and these data structures are used in the OCR system to implement the preprocessing/clustering stages, during which a fixed set of parameters is associated with each cluster and is used when comparing a symbol image with the patterns contained in the cluster. Clusters can be used in various local operations or at different stages of the complex task of optical character recognition, and the specific parameters (as well as the number of parameters) used to compare a symbol image with the patterns in a cluster may differ between local operations and between stages, and often differ between clusters. In an alternative embodiment, clusters are not used.

FIG. 24 shows data structures that support clustering and preprocessing in one embodiment of an OCR system that includes the third embodiment of the “process” subroutine described above. The first data structure is an array, or vector (2402), called “votes”. In the described embodiment, the “votes” array contains one element for each grapheme of the language. The “votes” array is indexed by integer grapheme codes. In other words, each grapheme is assigned a unique integer identifier, or grapheme code, which is used as an index into the “votes” array. As shown in FIG. 24, the “votes” array may contain n elements, where n is the number of graphemes in the language, and the grapheme codes increase monotonically from 0 to n. Of course, the “votes” data structure can alternatively be implemented as a sparse array when grapheme codes do not increase monotonically, as a list, or using other types of data structures.

FIG. 24 also shows a second data structure (2404), which is an array of instances of the “parameter” class. As with the “votes” data structure, the “parameters” array can alternatively be implemented using different data structures, including lists, sparse arrays, and other data structures. In the embodiment currently under discussion, the “parameters” array includes p entries, or elements, indexed by the monotonically increasing numbers 0, 1, 2, ..., p. Each instance of the “parameter” class represents one of the various parameters used to characterize symbol images and patterns, as described above.

FIG. 24 further shows a cluster data structure (2406), which represents a cluster, or set, of patterns. The cluster data structure includes a “clusterParameters” array (2408) containing the parameters that are used, at a given point in time, to characterize the patterns in the cluster and to characterize symbol images for comparison with those patterns. Each element of the “clusterParameters” array contains an index into the “parameters” array (2404). Because the cluster refers to parameters by index into the “parameters” array (2404), the specific parameters, as well as the number of parameters used for comparison, can easily be changed or reconfigured, so that the cluster configuration can be changed efficiently for different local operations or stages. The cluster data structure also includes an integer “number” (2410), which indicates the number of parameter indices contained in the “clusterParameters” array. The cluster data structure additionally contains a floating-point (or double-precision) value “cutoff” (2412), a threshold used to evaluate whether the patterns in the cluster match a symbol image. Finally, the cluster data structure (2406) contains a number of pattern data structures (2414)-(2422). Pattern data structures are discussed below.
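The three data structures of FIG. 24 can be pictured with the following sketch. It is an illustration under stated assumptions rather than the layout used in any particular embodiment: the Python names (`Parameter`, `Cluster`, `make_votes`), the bitmap representation of an image, and the three sample parameters are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: a normalized symbol image is a bitmap, 1 = ink, 0 = background.
Image = List[List[int]]


@dataclass
class Parameter:
    """One entry of the "parameters" array (2404)."""
    name: str
    parameterize: Callable[[Image], float]


@dataclass
class Cluster:
    """Cluster data structure (2406)."""
    cluster_parameters: List[int]   # indices into the "parameters" array (2408)
    cutoff: float                   # threshold weight (2412)
    patterns: List[object] = field(default_factory=list)   # (2414)-(2422)

    @property
    def number(self) -> int:
        """Count of parameter indices held by the cluster (2410)."""
        return len(self.cluster_parameters)


def make_votes(n_graphemes: int) -> List[int]:
    """The "votes" array (2402): one counter per grapheme code."""
    return [0] * n_graphemes


# Hypothetical sample parameters, chosen only to make the sketch concrete.
parameters = [
    Parameter("ink_pixels", lambda img: float(sum(map(sum, img)))),
    Parameter("height", lambda img: float(len(img))),
    Parameter("width", lambda img: float(len(img[0]) if img else 0)),
]
cluster = Cluster(cluster_parameters=[0, 2], cutoff=4.0)
votes = make_votes(5)
```

Storing indices rather than the parameters themselves means that reconfiguring a cluster for another stage only requires replacing the short `cluster_parameters` list.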

FIGS. 25A-H illustrate preprocessing of a symbol image using the data structures discussed above with reference to FIG. 24. FIG. 25A shows the “votes” data structure (2402), discussed above with reference to FIG. 24, together with a single pattern data structure (2502) selected from the patterns contained in the cluster data structure (2406), also discussed above with reference to FIG. 24. Each pattern data structure contains a pattern number (2504) and a set of parameter values (2505) computed for the pattern using the parameters referenced by the indices in the “clusterParameters” array (2408) of the cluster data structure (2406). As noted above, it is important to remember that symbol images are scaled, rotated, and transformed to create normalized symbol images, which facilitates parameter-based comparison of symbol images with patterns. The pattern data structure further contains an integer (2506) indicating the number of indices in the pattern data structure, followed by the indices themselves (2508). These indices are associated with the different possible weights that can be computed when a symbol image is compared with the pattern. In one embodiment, the pattern data structure may contain as many indices as there are possible computed weights, with each index containing an integer index and the computed weight associated with it. Other embodiments are possible. The weight is computed when the parameters are computed for a symbol image and the parameter values of the symbol image are compared with the pattern represented by the pattern data structure. The larger the weight, the less closely the symbol image matches the pattern. This weight is used to select the corresponding index from the indices, and the selected index in turn determines how many of the graphemes corresponding to the pattern receive a vote during preprocessing. Each pattern data structure includes an integer indicating the number of graphemes corresponding to the pattern (2510), followed by the codes of all graphemes in the set corresponding to the pattern (2512). In many embodiments, these grapheme codes are sorted by similarity, or closeness, to the pattern in descending order of similarity.
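A pattern data structure with the fields listed above might be sketched as follows. The class name `Pattern`, the field names, and the choice to store each index entry as a `(weight, position)` pair are assumptions for illustration; the text allows other encodings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pattern:
    """Sketch of a pattern data structure (2502)."""
    number: int                        # pattern number (2504)
    parameter_values: List[float]      # values precomputed for the pattern (2505)
    indices: List[Tuple[float, int]]   # (weight, position in `graphemes`) pairs (2508)
    graphemes: List[int]               # grapheme codes, most similar first (2512)

    def __post_init__(self) -> None:
        # Every index entry must point at an existing grapheme code.
        for _, position in self.indices:
            if not 0 <= position < len(self.graphemes):
                raise ValueError("index entry points outside the grapheme list")

    @property
    def num_indices(self) -> int:      # integer (2506)
        return len(self.indices)

    @property
    def num_graphemes(self) -> int:    # integer (2510)
        return len(self.graphemes)


# Hypothetical values, chosen only to show the shape of the data.
example = Pattern(
    number=17,
    parameter_values=[3.0, 5.0, 2.0],
    indices=[(1.0, 0), (3.0, 1), (6.0, 3)],
    graphemes=[4, 2, 5, 1],
)
```

In Python the two counts follow from the list lengths, so they are exposed as properties rather than stored separately as in the figure.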

FIGS. 25B-H illustrate preprocessing of a single symbol image selected from a scanned image containing text. In the example shown in FIGS. 25B-H, the symbol image (2514) is a character from an Asian language. FIG. 25B also shows the “parameters” array (2404), discussed above with reference to FIG. 24, and a small portion of the cluster data structure (2406) that includes the “clusterParameters” array (2408) and the integer “number” (2410).

As shown in FIG. 25C, for each of the “number” parameters whose indices are included in the “clusterParameters” array (2408), the parameter index (2516) is retrieved from the “clusterParameters” array (2408) and used to access an instance of the “parameter” class (2518) in the “parameters” array (2404). The “parameterize” method of that instance of the “parameter” class (2518) is called to produce a parameter value (2520), which is then stored in a local variable (2522). FIG. 25C shows computation of the first parameter value for the symbol image. FIG. 25D shows computation of the second parameter value for the symbol image. Once all “number” instances of the “parameter” class have been invoked to produce the “number” parameter values for the symbol image, a list or array of symbol-image parameter values (2524) has been formed, as shown in FIG. 25E.
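The lookup-and-call sequence of FIGS. 25C-E amounts to one pass over the cluster's parameter indices. In the sketch below, each parameter is reduced to a plain function standing in for the “parameterize” method; the function name `parameterize_image`, the bitmap image type, and the three sample parameters are assumptions.

```python
from typing import Callable, List, Sequence

# Assumption: a normalized symbol image is a bitmap, 1 = ink, 0 = background.
Image = List[List[int]]
ParameterFn = Callable[[Image], float]


def parameterize_image(
    image: Image,
    parameters: Sequence[ParameterFn],
    cluster_parameters: Sequence[int],
) -> List[float]:
    """Compute one value per parameter index listed for the cluster."""
    values: List[float] = []
    for parameter_index in cluster_parameters:    # index (2516)
        parameter = parameters[parameter_index]   # instance (2518)
        values.append(parameter(image))           # value (2520)
    return values                                 # list of values (2524)


# Hypothetical parameters: ink pixel count, height, width.
sample_parameters: List[ParameterFn] = [
    lambda img: float(sum(map(sum, img))),
    lambda img: float(len(img)),
    lambda img: float(len(img[0]) if img else 0),
]
sample_image: Image = [[1, 0, 1], [0, 1, 0]]
sample_values = parameterize_image(sample_image, sample_parameters, [1, 0])
```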

Then, as shown in FIG. 25F, the corresponding parameter value for the symbol image is subtracted from each precomputed parameter value for the pattern represented by the pattern data structure, producing a series of computed values, one for each parameter. For example, as shown in FIG. 25F, the first parameter value (2522) computed for the symbol image is subtracted from the first parameter value (2526) stored in the pattern data structure (2502) to produce an intermediate value (2528). Similarly, the remaining parameter values for the symbol image (2534)-(2538) are subtracted from the remaining precomputed parameter values for the pattern (2530)-(2533) to produce additional intermediate values (2540)-(2544). The absolute values of these intermediate values (2528) and (2540)-(2544) are summed (2546) to produce the weight (2548), which numerically represents the parameter-based comparison of the symbol image with the pattern represented by the pattern data structure (2502). Here again, the larger the computed weight, the less the symbol image resembles the pattern, since the weight is the accumulated difference between the parameter values for the symbol image and those for the pattern.
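The weight computation described above is a sum of absolute differences between corresponding parameter values. A minimal sketch, in which the function name `compute_weight` and the length check are assumptions:

```python
from typing import Sequence


def compute_weight(
    pattern_values: Sequence[float],
    image_values: Sequence[float],
) -> float:
    """Sum of absolute differences between pattern and image parameter values.

    A larger result means the symbol image is less similar to the pattern.
    """
    if len(pattern_values) != len(image_values):
        raise ValueError("pattern and image must use the same parameters")
    return sum(abs(p - s) for p, s in zip(pattern_values, image_values))


# Differences are 2.0, -1.0 and 0.0, so the weight is 2.0 + 1.0 + 0.0.
sample_weight = compute_weight([3.0, 5.0, 2.0], [1.0, 6.0, 2.0])
```

An identical image and pattern give a weight of zero, the best possible match under this measure.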

As shown in FIG. 25G, when the computed weight (2548) exceeds the cutoff weight “cutoff” (2412) for the cluster, preprocessing of the symbol image with respect to the pattern under consideration, represented by the pattern data structure (2502), terminates (2550). Otherwise, preprocessing of the symbol image results in votes for one or more of the graphemes corresponding to the pattern represented by the pattern data structure (2552).

FIG. 25H shows the case in which the computed weight, representing the comparison of the symbol image with the pattern represented by the pattern data structure, is less than or equal to the cutoff weight for the cluster. In this case, the computed weight (2548) is used to select an index (2554) from the set of indices (2508). As described above, each of the indices (2508) may contain an index and an associated weight, allowing one particular index (2554) to be selected by the computed weight (2548); from the selected index, an index into the grapheme codes is extracted. This extracted index (2556) points to a specific grapheme code (2558) in the set of grapheme codes (2512) stored in the pattern data structure to represent the graphemes that correspond to the pattern. Then, for every grapheme code from the first grapheme code (2560) through the grapheme code (2558) pointed to by the extracted index (2556), the corresponding element of the “votes” data structure (2402) is incremented, as shown by arrows (such as arrow (2562)) emanating from the elements that contain the grapheme codes between elements (2560) and (2558), inclusive.
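The cutoff test and the voting step of FIGS. 25G-H can be sketched as below. The text does not spell out how the computed weight selects an index, so the selection rule here is an assumption: index entries are `(weight, position)` pairs sorted by ascending weight, the first entry whose weight is at least the computed weight is chosen, and the last entry is used as a fallback. The function name `vote` is also hypothetical.

```python
from typing import List, Sequence, Tuple


def vote(
    weight: float,
    cutoff: float,
    indices: Sequence[Tuple[float, int]],
    graphemes: Sequence[int],
    votes: List[int],
) -> bool:
    """Cast votes for the leading graphemes of a pattern; return False if cut off."""
    if weight > cutoff:
        return False                      # pattern too dissimilar, no votes (2550)
    # Assumed rule: first entry whose weight bound covers the computed weight.
    position = indices[-1][1]
    for bound, candidate in indices:
        if weight <= bound:
            position = candidate
            break
    # Vote for every grapheme from the first one through the selected position.
    for code in graphemes[: position + 1]:
        votes[code] += 1
    return True


sample_votes = [0] * 6
accepted = vote(
    weight=2.5,
    cutoff=6.0,
    indices=[(1.0, 0), (3.0, 1), (6.0, 3)],
    graphemes=[4, 2, 5, 1],
    votes=sample_votes,
)
```

With these hypothetical values the weight 2.5 selects the second entry, so only the two graphemes listed first for the pattern (codes 4 and 2) receive a vote.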

Thus, if the computed weight for the comparison of the symbol image with the pattern does not exceed the cutoff value, the symbol image is similar enough to the pattern for at least some of the graphemes corresponding to the pattern to receive a vote during preprocessing. The graphemes that are sufficiently similar to the symbol image are selected on the basis of the computed weight, using an index selected from the indices (2508) according to that weight. The elements of the “votes” data structure corresponding to these graphemes are then incremented to reflect the number of votes cast for these graphemes during preprocessing of the symbol image.

There are many alternative approaches to implementing the preprocessing stage and to constructing the data structures described above. For example, instead of a single cutoff weight for the entire cluster, a cutoff weight can be used for each individual pattern, in which case the cutoff weight can be included in the pattern data structure. As another example, the indices stored in a pattern may be instances of classes containing lists of grapheme codes, rather than indices pointing into an ordered list of grapheme codes as in the described embodiment. Many other alternative embodiments are also possible. For example, the “vote” subroutine can take a pointer to the “params” array as its second argument, and in the for loop on lines 43-44 the parameter values can be computed only if they have not already been computed while processing the symbol image for other clusters. In other embodiments, different types of weight computation and of comparison between a symbol image and a pattern can be used. In some cases, a larger weight may indicate greater similarity between the symbol image and the pattern, in contrast to the example above, in which a larger weight indicated lesser similarity. In some OCR systems, real-valued coefficients can be associated with graphemes, allowing fractional votes and votes greater than 1. In some OCR systems, graphemes, patterns, and/or clusters can be sorted on the basis of the votes accumulated during preprocessing to make subsequent symbol recognition more efficient. In certain embodiments, the cluster data structure may include only a number of pattern data structures or references to pattern data structures, with the cutoff weight and the patterns associated with the cluster specified in the control logic rather than stored in the cluster data structure.
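One of the alternatives mentioned above, computing a parameter value only if it was not already computed for another cluster, can be sketched with a per-image cache. This is an illustration only: the function name `cached_parameter_values`, the dictionary cache keyed by parameter index, and the sample parameters are assumptions and do not reproduce the “vote” subroutine referred to in the text.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: a normalized symbol image is a bitmap, 1 = ink, 0 = background.
Image = List[List[int]]
ParameterFn = Callable[[Image], float]


def cached_parameter_values(
    image: Image,
    parameters: Sequence[ParameterFn],
    cluster_parameters: Sequence[int],
    cache: Dict[int, float],
) -> List[float]:
    """Return the cluster's parameter values, reusing values already in `cache`.

    The cache must be created fresh for each symbol image and shared
    across all clusters processed for that image.
    """
    values: List[float] = []
    for parameter_index in cluster_parameters:
        if parameter_index not in cache:
            cache[parameter_index] = parameters[parameter_index](image)
        values.append(cache[parameter_index])
    return values


# Count how often each hypothetical parameter is actually evaluated.
evaluations: List[int] = []


def make_counted(index: int, fn: ParameterFn) -> ParameterFn:
    def counted(img: Image) -> float:
        evaluations.append(index)
        return fn(img)
    return counted


counted_parameters = [
    make_counted(0, lambda img: float(sum(map(sum, img)))),
    make_counted(1, lambda img: float(len(img))),
    make_counted(2, lambda img: float(len(img[0]) if img else 0)),
]
image_cache: Dict[int, float] = {}
bitmap: Image = [[1, 0, 1], [0, 1, 0]]
first = cached_parameter_values(bitmap, counted_parameters, [0, 1], image_cache)
second = cached_parameter_values(bitmap, counted_parameters, [1, 2], image_cache)
```

In this run the two clusters share parameter 1, so it is evaluated once and three evaluations are performed in total instead of four.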

FIG. 26 shows the relationships among symbols, symbol images, patterns, and graphemes. In FIG. 26, the first disk (2602) represents the set of symbols, or symbol codes; the second disk (2604) represents the much larger set of possible symbol images; the third disk (2606) represents the set of pattern data structures with which symbol images are compared during symbol-image processing; and the fourth disk (2608) represents the set of graphemes of the language that includes the symbols, or symbol codes, of set (2602). As indicated by dots, such as dot (2610), which represent members of the sets, and by arrows, such as arrow (1612), which represent mappings between members of one set and members of another, there is generally a one-to-many mapping from symbols to graphemes, a one-to-many mapping from symbols to symbol images, a one-to-many mapping from symbol images to pattern data structures, and a one-to-many mapping from graphemes to patterns. The methods and systems described in this document use decision forests to determine efficiently which patterns correspond to a symbol image, and use weight-based voting to match patterns with graphemes.

FIG. 27 shows a process for converting symbol images into symbol codes. OCR systems in accordance with the current document use this process to replace symbol images with symbol codes, or other types of symbol identifiers, in documents. The process generally begins with one or more document images (2702). A document image is processed to identify small image fragments, each of which contains a single symbol of the language. The initially processed document (2704) obtained from the document image is shown as containing small image fragments, such as image fragment (2705), each containing the image of one symbol. Of course, the initially processed document may be an array of references to image fragments within the document image, or any of many other types of processed document. The initially processed document (2704) is then further processed to produce a table of processed symbol images (2706). In this table, each record, or row, such as record (2708), contains an indication (2710) of the image fragment in the initially processed document to which the record corresponds, followed by an array or set of grapheme codes (2712)-(2716) that most probably correspond to the symbol image in that image fragment. Each record ends with a terminator code “0” (2717) that marks the end of the record. Of course, there are many other possible ways of encoding a table of processed symbol images in addition to the approach used for the example table (2706) shown in FIG. 27. Finally, the list of grapheme codes corresponding to a particular symbol image is used in later stages of symbol-image processing, together with the pattern data structures that cast votes for those grapheme codes, to obtain a specific symbol code for each symbol image in the initially processed document (2704). These symbol codes (2720) can then be stored along with the non-symbol portions of the document image, subjected to additional separate processing, such as machine translation into another natural language, or used in many other ways, depending on the OCR system and application.
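The record layout described for FIG. 27 (a fragment reference, then grapheme codes, then a terminating “0”) can be sketched as follows. This is a minimal illustration only: the flat integer list and the names `encode_record` and `decode_table` are assumptions made for the example, and the text itself notes that many other encodings are possible.

```python
from typing import List, Tuple

TERMINATOR = 0  # end-of-record code, as in the FIG. 27 example


def encode_record(fragment_ref: int, grapheme_codes: List[int]) -> List[int]:
    """Build one record: fragment reference, grapheme codes, terminator."""
    if TERMINATOR in grapheme_codes:
        # Assumption: code 0 is reserved so it can act as the terminator.
        raise ValueError("grapheme code 0 is reserved as the terminator")
    return [fragment_ref, *grapheme_codes, TERMINATOR]


def decode_table(flat: List[int]) -> List[Tuple[int, List[int]]]:
    """Split a flat sequence of records back into (fragment, codes) pairs."""
    records = []
    i = 0
    while i < len(flat):
        fragment_ref = flat[i]  # first value of a record is positional
        i += 1
        codes = []
        while flat[i] != TERMINATOR:
            codes.append(flat[i])
            i += 1
        i += 1  # skip the terminator
        records.append((fragment_ref, codes))
    return records
```

Because the fragment reference is read positionally, a reference value of 0 does not collide with the terminator in this sketch.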

As discussed above with reference to FIGS. 24 and 25A-H, the pattern and cluster data structures provide the basis for a voting mechanism in which processing a symbol image produces votes for graphemes that are likely to be associated with, or correspond to, the symbol image. The OCR systems currently described generally use a relatively large number of cluster data structures, each associated with hundreds of pattern data structures. Processing with a multi-cluster data structure is described below with reference to FIGS. 28A-G.

FIG. 28A shows the data and data structures used in an example of symbol processing with a multi-cluster data structure. The illustration conventions used in FIG. 28A are also used in subsequent FIGS. 28B-G. FIG. 28A again shows many of the data structures from FIG. 24 and FIGS. 25A-G. These include a parameters array (2802) (shown as array (2404) in FIG. 24), a votes array (2804) (shown as votes array (2402) in FIG. 24), three cluster data structures (2806)-(2808), each identical to the cluster data structure (2406) shown in FIG. 24, and an array of calculated parameter values (2810), similar to the array of calculated parameter values (2524) in FIG. 25F. Note that the data structures are assigned their appropriate values at initialization, whereas all elements of the votes array are set to zero. Each cluster data structure, such as cluster data structure (2806), includes a parameters array (2812), similar to the “clusterParameters” array (2408) shown in FIG. 24, a number value (2814) and a cutoff value (2816), similar to the number and cutoff values (2410) and (2412) shown in cluster (2406) in FIG. 24, and several pattern data structures, such as pattern data structure (2818), identical to the pattern data structure (2502) in FIG. 25A. In addition, FIG. 28A shows a data structure (2820) that represents a scanned image of a document page. This data structure is a two-dimensional array in which each cell, such as cell (2822), corresponds to a symbol image. FIG. 28A also shows a variable (2824) that contains the symbol image currently being processed.
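One possible in-memory layout for the structures listed for FIG. 28A is sketched below. All class and field names are hypothetical; the internal fields of a pattern (parameter values, an index segment, a grapheme-code segment) are inferred from the processing steps in this description rather than copied from FIGS. 24 and 25A, and the number value is stored without interpretation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pattern:
    parameter_values: List[float]  # pattern's own values for the cluster's parameters
    indexes: List[int]             # index segment: positions in grapheme_codes
    grapheme_codes: List[int]      # grapheme-code segment


@dataclass
class Cluster:
    parameters: List[int]          # (2812): positions in the global parameters array
    number: int                    # (2814): stored as given, not interpreted here
    cutoff: float                  # (2816): weight threshold for voting
    patterns: List[Pattern] = field(default_factory=list)


@dataclass
class MultiClusterState:
    parameters: List[Callable[[object], float]]  # (2802): parameter functions
    clusters: List[Cluster]                      # (2806)-(2808)
    num_graphemes: int                           # assumed size of the votes array
    votes: List[int] = field(init=False)                # (2804)
    calculated_values: List[float] = field(init=False)  # (2810)

    def __post_init__(self) -> None:
        # The votes array starts with all elements set to zero.
        self.votes = [0] * self.num_graphemes
        self.calculated_values = [0.0] * len(self.parameters)
```

Here a cluster contains its patterns directly; as the description notes, an implementation could equally hold references to patterns stored elsewhere.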

The data structures shown in FIG. 28A can be implemented in various ways, using a variety of programming languages and data storage technologies. The data structures may include additional data and data substructures. For example, in one embodiment, each pattern data structure in each cluster is referenced from a sorted array of references in the cluster data structure. In other embodiments, each pattern data structure in each cluster is associated with a numeric order, which allows the pattern data structures to be traversed in a defined sequence. In some implementations, a cluster data structure may contain the pattern data structures, while in others it may reference them. In most implementations, the data structures can be dynamically expanded or contracted to accommodate changes in the OCR methods in which they are used. Thus, although the term “array” is used to describe the votes data structure, it may be implemented using data structures other than simple arrays, provided they allow elements to be indexed as in an array.

The text data structure (2820) represents a document page that is to be processed by the multi-cluster OCR document-processing method in order to convert an original scanned document containing symbol images into an equivalent electronic document containing symbol codes. The terms “document,” “page,” and “symbol image” may have different meanings depending on context. In this example, a document consists of several pages, and each page includes many symbol images. Of course, the same or a similar multi-cluster OCR document-processing method can be used for many different types of documents, whether or not they comprise pages containing one or more symbol images.

In the first step, shown in FIG. 28B, the current-symbol-image variable (2824) is loaded with, or set to reference, the first symbol image (2826), as shown by the curved arrow (2827). Then, as shown in FIG. 28C, each parameter-computing function or routine in the parameters array (2802) is applied to the current symbol image in variable (2824) to produce a corresponding parameter value, which is written to the array of calculated parameter values (2810), as indicated by arrows (2828)-(2831) and ellipses (2832)-(2833). Thus, in one implementation, the array of calculated parameter values (2810) contains a numeric value, calculated for the current symbol image, for each of the parameters represented in array (2802) by functions or references to functions. Calculation of parameter values was described earlier with reference to FIGS. 19C-D.
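The step shown in FIG. 28C, applying every function in the parameters array to the current symbol image, can be sketched as below. The three example parameter functions and the bitmap representation (rows of 0/1 pixels) are assumptions chosen for illustration; the actual parameters are those described with reference to FIGS. 19C-D.

```python
from typing import Callable, List, Sequence

# Assumption: a symbol image is a bitmap given as rows of 0/1 pixel values.
Image = Sequence[Sequence[int]]


def black_pixel_count(img: Image) -> float:
    """Hypothetical parameter: number of set pixels."""
    return float(sum(sum(row) for row in img))


def height(img: Image) -> float:
    """Hypothetical parameter: number of pixel rows."""
    return float(len(img))


def width(img: Image) -> float:
    """Hypothetical parameter: number of pixel columns."""
    return float(len(img[0])) if img else 0.0


def compute_parameter_values(
    image: Image, parameters: List[Callable[[Image], float]]
) -> List[float]:
    """Apply each parameter function to the image, in array order."""
    return [param(image) for param in parameters]
```

The resulting list plays the role of the array of calculated parameter values: element i holds the value of the i-th parameter function for the current image.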

Next, as shown in FIG. 28D, the first pattern data structure (2834) in the first cluster data structure (2806) is selected. The parameter values associated with the first pattern data structure are used, together with the corresponding parameter values from the array of calculated values (2810), to compute a weight W (2835), as described above with reference to FIG. 25F. The selection of the first pattern data structure is illustrated by the curved arrow (2837). Note that the parameters array of the cluster data structure ((2812) in FIG. 28A) is used to index into the array of calculated parameter values. As described above with reference to FIG. 25G, the computed weight is then compared (2836) with the cutoff weight ((2816) in FIG. 28A). This comparison determines whether the pattern data structure (2834) in the first cluster data structure (2806) may vote for graphemes, as described above with reference to FIG. 25H. In the present example, as shown in FIG. 28E, the computed weight (2835) is less than the cutoff weight, and so the votes produced by the first pattern data structure (2834) of the first cluster data structure (2806) are accumulated in the votes array (2804). As described above with reference to FIG. 19H, the computed weight (2835) is used to select an index from the set of indexes (2838) in the first pattern data structure (2834), and the content of the selected element serves as a pointer (2839) to a grapheme code in the first pattern data structure (2840). Votes are cast for all graphemes corresponding to the grapheme codes in the grapheme-code segment of the first pattern data structure, from the first code up to and including the grapheme code pointed to by the index selected from the index segment of the pattern data structure. In FIG. 28E, the votes cast by the first pattern data structure (2834) of the first cluster data structure (2806) are shown by curved arrows (2842)-(2846). Blank entries in the votes array ((2804) in FIG. 28A) represent zero values (0). The initial voting by the first pattern data structure (2834) of the first cluster data structure (2806) raises the accumulated vote counts (2847)-(2851) from 0 to 1 for those graphemes in the votes array whose codes were selected from the first pattern data structure. In alternative embodiments, voting may add values other than 1 to the elements of the votes array.
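The weight, cutoff, and voting steps of FIGS. 28D-E can be sketched as follows. Two details are assumptions made only so the example runs, since they are defined with FIGS. 25F-H rather than in this passage: the weight is taken to be a sum of absolute differences between the pattern's parameter values and the calculated values, and the index is selected by truncating the weight to an integer and clamping it to the index segment.

```python
from typing import List


def compute_weight(
    pattern_values: List[float],
    cluster_parameters: List[int],
    calculated_values: List[float],
) -> float:
    """Assumed weight: sum of absolute differences (smaller = closer match).

    cluster_parameters indexes into calculated_values, as in the text.
    """
    return sum(
        abs(calculated_values[p] - v)
        for p, v in zip(cluster_parameters, pattern_values)
    )


def vote(
    votes: List[int],
    weight: float,
    cutoff: float,
    indexes: List[int],
    grapheme_codes: List[int],
) -> bool:
    """Cast votes if the weight is below the cutoff; return True if voted."""
    if weight >= cutoff:
        return False
    # Assumed selection rule: integer part of the weight, clamped to range.
    slot = min(int(weight), len(indexes) - 1)
    last = indexes[slot]  # position of the last grapheme code to vote for
    for code in grapheme_codes[: last + 1]:
        votes[code] += 1  # votes array is indexed by grapheme code
    return True
```

With these assumptions, a closer match (smaller weight) selects an earlier index and therefore votes for fewer, presumably more specific, grapheme codes.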

The process described with reference to FIGS. 28C-E is repeated for each pattern data structure in each cluster data structure to produce the complete grapheme vote tally stored in the votes array (2804). FIG. 28F shows the final vote counts after every pattern data structure in every cluster data structure has been traversed, as indicated by the serpentine arrow (2865) that continues from the curved arrow (2837) showing selection of the first pattern data structure in the first cluster. This completes processing of the first symbol image selected from the text page (2820). Next, as shown in FIG. 28G, the votes accumulated in the votes array for the first symbol image are used to prepare a sorted list of the grapheme codes that most frequently matched the symbol image during the processing described above with reference to FIGS. 28B-F, according to the votes accumulated for the grapheme codes in the votes array. In FIG. 28G, the votes array (2804) is shown at the top of the figure. Each cell in the votes array contains a vote count, and the cell indexes are grapheme codes. The votes and grapheme-code indexes are then sorted in descending order of vote count, producing a sorted array (2867) in which each cell contains a grapheme code and in which the indexes, increasing monotonically from left to right, order the grapheme codes by the number of votes they received during the processing described with reference to FIGS. 28B-F. For example, grapheme code “9” (2866) received the greatest number of votes, 16, and therefore occupies the first position (2868) in the sorted array of grapheme codes (2867). The sorted array (2867) is then truncated to produce a truncated sorted array of grapheme codes (2869). The truncated sorted array contains the sorted list of grapheme codes that received votes during the processing described above with reference to FIGS. 28B-F. In that processing, only 14 grapheme codes received votes, so the truncated sorted array of grapheme codes (2869) contains only 14 elements. These are the first 14 elements of the sorted array of grapheme codes (2867). The remaining elements of the sorted array (2867), which follow the fourteenth element with index 13, contain grapheme codes that received no votes. Next, as indicated by the curved arrow (2870), the truncated array of grapheme codes is included in the first entry (2871) of the table of processed symbol images (2872). Each entry of the table of processed symbol images includes a field (shown as the first column (2873)) indicating the number or order of the symbol within the text data structure ((2820) in FIG. 28A), a field containing the number of grapheme codes that received votes during processing of the symbol (the second column (2874) of the table), and the sorted truncated array of grapheme codes (the third column (2875) of the table of processed symbol images (2872)).
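The sort-and-truncate step of FIG. 28G can be sketched in a few lines. The function names and the tuple layout of a table entry are illustrative assumptions; ties between equal vote counts are broken here by ascending grapheme code, which the description does not specify.

```python
from typing import List, Tuple


def sorted_truncated_codes(votes: List[int]) -> List[int]:
    """Grapheme codes that received votes, in descending order of votes.

    The votes list is indexed by grapheme code. sorted() is stable, so
    codes with equal counts keep ascending code order (an assumption).
    """
    ranked = sorted(range(len(votes)), key=lambda code: -votes[code])
    return [code for code in ranked if votes[code] > 0]  # truncate zero-vote tail


def table_entry(symbol_number: int, votes: List[int]) -> Tuple[int, int, List[int]]:
    """One table row: symbol number, count of voted codes, sorted codes."""
    codes = sorted_truncated_codes(votes)
    return (symbol_number, len(codes), codes)
```

For a votes array in which only codes 9, 1, and 3 received 16, 3, and 1 votes respectively, the truncated array is `[9, 1, 3]` and the count field is 3.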

In some embodiments, rather than using a table of processed symbol images, the sorted truncated array of grapheme codes is immediately passed to an additional symbol-detection algorithm that produces the symbol code for the symbol image. That symbol code can then be placed immediately into a processed page corresponding to the page containing the symbol images processed by the method described above with reference to FIGS. 28A-J. In the present implementation, however, the sorted truncated arrays of grapheme codes for the symbol images are accumulated in the table of processed symbol images for each page of the document. The sorted truncated arrays are then used, together with the pattern data structures, in a second stage that converts the symbol images of the document page into symbols, which are placed into the processed document page. In either case, the sorted truncated array of grapheme codes is the result of the initial multi-cluster processing, which identifies the set of grapheme codes most likely to be associated with a symbol image within the document page image. In the present embodiment, all grapheme codes that received votes are included in the sorted truncated arrays. In alternative embodiments, only grapheme codes whose vote counts exceed a given threshold are included.

Once processing of the first symbol image extracted from the document page is complete and the corresponding entry has been created in the table of processed symbol images, the second symbol image is selected from the text data structure ((2820) in FIG. 28A) and placed in variable (2824). Parameter values are then calculated for this next symbol and stored in the array of calculated parameter values (2810). The second symbol image is then processed using the method described above with reference to FIGS. 28C-F, producing a new set of accumulated votes in the votes array (2804) for the second symbol image. The accumulated votes for the second symbol image (2804) are sorted by vote count to produce a sorted array of graphemes, which is then truncated to produce a truncated sorted array of graphemes. This truncated array is included as the second entry of the table of processed symbol images (2872). The same processing is carried out for each subsequent symbol image in the text data structure ((2820) in FIG. 28A). Cluster-based processing of the symbol images of a text document thus produces a table of processed symbol images (2872) containing one entry for each symbol image of the processed document. Each entry of the table represents an initial set of graphemes potentially corresponding to the symbol image. The set of potentially corresponding graphemes is placed in a sorted truncated array of grapheme codes, sorted in descending order of accumulated votes, so that the grapheme codes with the most votes come first. The set of potentially corresponding grapheme codes, represented as a sorted truncated array of grapheme codes, can subsequently be used, together with the pattern data structures that voted for the graphemes in the array, by an additional symbol-recognition algorithm to determine the best symbol code for the symbol image for which the sorted truncated array was produced during the processing described above with reference to FIGS. 28B-F.
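The per-page loop described above can be sketched as below. The voting itself is passed in as a callable named `cast_votes`, a hypothetical stand-in for the traversal of all clusters and patterns. Starting each symbol with a fresh, zeroed votes array is an assumption: the description says a new set of accumulated votes is formed for each symbol image but does not spell out how the array is reset.

```python
from typing import Callable, List, Sequence, Tuple

Entry = Tuple[int, int, List[int]]  # (symbol number, code count, sorted codes)


def process_page(
    symbol_images: Sequence[object],
    num_graphemes: int,
    cast_votes: Callable[[object, List[int]], None],
) -> List[Entry]:
    """Build the table of processed symbol images for one page."""
    table: List[Entry] = []
    for number, image in enumerate(symbol_images):
        votes = [0] * num_graphemes  # assumed reset before each symbol
        cast_votes(image, votes)     # all clusters and patterns vote here
        ranked = sorted(range(num_graphemes), key=lambda code: -votes[code])
        codes = [code for code in ranked if votes[code] > 0]
        table.append((number, len(codes), codes))
    return table
```

The table produced this way holds one entry per symbol image, ready for the second-stage recognition algorithm mentioned in the text.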

In the multi-cluster-data-structure approach to symbol processing described above with reference to FIGS. 28A-G, the candidate graphemes selected for each processed symbol image reflect the inferences that can be drawn from the enormous body of accumulated pattern-matching information for the symbol set of a natural language that is contained in the pattern data structures and cluster data structures. However, when a natural language is described by many tens or hundreds of cluster data structures, each containing many hundreds of pattern data structures, the overhead of processing a symbol image is clearly high. Although modern computer systems have fast processors, and a machine often has at least several processor cores, the computational overhead of the symbol-image processing methodology described above with reference to FIGS. 24-28G can be large enough to render many optical character recognition tasks practically infeasible or impractical. The current document introduces decision-forest-based selection of symbol-image pattern data structures in order to significantly increase the computational efficiency of the general multi-cluster-data-structure approach to optical character recognition.

FIGS. 29A-D illustrate the use of a decision forest to identify suitable pattern data structures for processing input symbol images. In FIG. 29A, the input symbol image (2902) is parameterized by computing a numeric value for each of a set of different parameters, and the values are stored in an array of numeric parameter values (2904). These stored parameter values, or subsets of them, are then used to traverse the numerous decision trees in the decision forest (2906), each of which yields one or more suitable pattern data structures (2908)-(2912) corresponding to the input symbol image (2902). In many cases, each decision tree yields a single suitable pattern data structure. In alternative embodiments, a decision tree may select several suitable pattern data structures for the input symbol image. The decision trees are similar to the decision schemes described above with reference to FIGS. 14C, 15C, 16C and 17C. However, unlike the decision trees described earlier, the decision trees in the decision forest (2906) select not symbol codes but identifiers that refer to pattern data structures, as described in more detail below.
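The traversal of FIG. 29A can be sketched in a few lines. The node layout, the threshold comparison at internal nodes, and all names below are assumptions made for illustration; the description only states that parameter values drive the traversal and that leaves yield pattern identifiers rather than symbol codes.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    # Internal node: compare params[param_index] against threshold (assumed test).
    param_index: int = -1
    threshold: float = 0.0
    left: Optional["Node"] = None    # taken when value <= threshold
    right: Optional["Node"] = None   # taken when value > threshold
    # Leaf node: identifiers of pattern data structures, not symbol codes.
    pattern_ids: Optional[List[int]] = None


def traverse(tree: Node, params: Sequence[float]) -> List[int]:
    """Walk one decision tree from the root to a leaf."""
    node = tree
    while node.pattern_ids is None:
        value = params[node.param_index]
        node = node.left if value <= node.threshold else node.right
    return node.pattern_ids


def query_forest(forest: Sequence[Node], params: Sequence[float]) -> List[int]:
    """Collect the pattern identifiers selected by every tree, without repeats."""
    selected: List[int] = []
    for tree in forest:
        for pattern_id in traverse(tree, params):
            if pattern_id not in selected:
                selected.append(pattern_id)
    return selected


# Two hypothetical trees over a two-element parameter array.
tree_a = Node(0, 0.5, Node(pattern_ids=[3]), Node(pattern_ids=[7]))
tree_b = Node(1, 2.0, Node(pattern_ids=[7, 9]), Node(pattern_ids=[1]))
selected = query_forest([tree_a, tree_b], params=[0.8, 1.5])
```

The second tree returns two identifiers from one leaf, which corresponds to the alternative embodiment in which a tree selects several pattern data structures.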

Thus, the decision forest (2906) is used to select a set of suitable pattern data structures, each of which is used independently of the others, in the context of one or more cluster data structures, to vote for grapheme codes that are likely to correspond to the input symbol image. With the decision forest (2906), the lengthy and computationally inefficient process described above with reference to FIGS. 28A-G, in which every pattern data structure in every cluster data structure votes for grapheme codes, is transformed so that only the suitable pattern data structures identified by the decision forest vote for the input symbol image. The decision forest analyzes the parameter values generated from the input symbol image to identify a set of suitable pattern data structures that are likely to produce weights below the cutoff value in the cluster data structures that include them, thereby eliminating the need to compute weights above the cutoff value, which would not lead to votes for grapheme codes. In addition, the suitable pattern data structures identified by the decision forest are the ones most likely to cast meaningful votes for grapheme codes for the input symbol image.
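The saving can be made concrete by counting weight computations per symbol image. The figures below are purely illustrative assumptions (100 clusters of 300 patterns each, and three forest-selected patterns spread over a handful of clusters); they are not measurements from the described system.

```python
from typing import Dict, List, Sequence


def exhaustive_weight_evals(patterns_per_cluster: Sequence[int]) -> int:
    """Every pattern in every cluster is weighed against the symbol image."""
    return sum(patterns_per_cluster)


def forest_weight_evals(selected: Sequence[int],
                        clusters_of_pattern: Dict[int, List[int]]) -> int:
    """Only forest-selected patterns are weighed, once per associated cluster."""
    return sum(len(clusters_of_pattern[p]) for p in selected)


# Illustrative sizes only: 100 clusters with 300 patterns each.
patterns_per_cluster = [300] * 100
# Hypothetical associations of three selected patterns with clusters.
clusters_of_pattern = {7: [0, 4], 9: [4], 21: [1, 2, 3]}

exhaustive = exhaustive_weight_evals(patterns_per_cluster)
with_forest = forest_weight_evals([7, 9, 21], clusters_of_pattern)
```

Under these assumed sizes the exhaustive scheme performs 30000 weight computations per symbol image, while the forest-based scheme performs 6, in addition to the cost of traversing the trees.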

FIG. 29B shows the relationships among the decision trees in the decision forest ((2906) in FIG. 29A), the pattern data structures, and the cluster data structures. As shown in FIG. 29B, the decision forest ((2906) in FIG. 29A) consists of several separate decision trees (2914)-(2917). The leaf nodes of the decision trees are shown in FIG. 29B in a horizontal layer (2920), and each leaf node references at least one pattern data structure. In FIG. 29B, the pattern data structures are shown as a wide middle layer (2922) of column-like objects, such as column-like object (2924), each representing an individual pattern data structure. One or more cluster data structures may include or reference a given pattern data structure. In FIG. 29B, the cluster data structures are shown at the lowest level (2926). For example, pattern data structure (2928) is included in, or associated with, three different cluster data structures, including cluster data structures (2930) and (2932), as indicated by reference arrows (2934)-(2936). Note, with reference to FIG. 29B, that a decision tree may include leaf nodes that reference pattern data structures associated with several cluster data structures, since a given pattern data structure may be associated with several cluster data structures. In some embodiments, a leaf node of a decision tree may reference several pattern data structures rather than a single pattern data structure as in FIG. 29B.

FIG. 29C shows the process of voting for grapheme codes by the suitable pattern data structures identified by the decision forest ((2906) in FIG. 29A). As shown in FIG. 29C, the suitable pattern data structures (2908)-(2912) identified by the decision forest may be associated with several cluster data structures. For example, pattern data structure (2908) is associated with cluster data structure (2940) and with cluster data structure (2942). In the context of cluster data structure (2940), pattern data structure (2908) casts three votes for grapheme codes, shown by arrows (2944)-(2946). In the context of cluster data structure (2942), pattern data structure (2908) casts three votes for grapheme codes, shown by arrows (2948)-(2950). Of course, a particular pattern data structure may cast a different number of votes in one cluster data structure than it casts in another. Similarly, a pattern data structure in the context of one cluster data structure may vote for graphemes different from those it votes for in the context of another cluster data structure. Thus, instead of allowing every pattern data structure in every cluster data structure to vote for graphemes, the decision-forest-based method uses the decision forest to select a set of suitable pattern data structures (2908)-(2912), each of which votes in the context of the cluster data structures with which it is associated or in which it is contained.
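The context-dependent voting of FIG. 29C can be sketched as a tally keyed by grapheme. The sketch deliberately skips the weight and cutoff test and assumes that the graphemes a pattern votes for in each cluster are already known; the identifiers `P1`, `C1` and so on, and the dictionary layout, are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def accumulate_votes(selected_patterns: Sequence[str],
                     clusters_of_pattern: Dict[str, List[str]],
                     votes_in_context: Dict[Tuple[str, str], List[str]]) -> Counter:
    """Let each selected pattern vote once per cluster it is associated with."""
    tally: Counter = Counter()
    for pattern_id in selected_patterns:
        for cluster_id in clusters_of_pattern[pattern_id]:
            # The same pattern may vote for different graphemes per cluster.
            for grapheme in votes_in_context.get((pattern_id, cluster_id), []):
                tally[grapheme] += 1
    return tally


clusters_of_pattern = {"P1": ["C1", "C2"], "P2": ["C2"]}
votes_in_context = {
    ("P1", "C1"): ["a", "o", "e"],   # three votes in the first cluster
    ("P1", "C2"): ["a", "d", "q"],   # three different votes in the second
    ("P2", "C2"): ["a", "o"],
}
tally = accumulate_votes(["P1", "P2"], clusters_of_pattern, votes_in_context)
```

Pattern `P1` contributes to the tally twice, once per associated cluster, and with a different set of graphemes each time.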

FIG. 29D shows one embodiment of the decision-forest-based symbol image processing method discussed above with reference to FIGS. 29A-C. A single decision tree (2960) is shown at the top of FIG. 29D. Each leaf node, such as leaf node (2962), contains at least one pattern identifier (2964). In some embodiments, each leaf node contains only one pattern identifier, while in alternative embodiments a leaf node of a decision tree may contain several identifiers referring to different patterns. The pattern identifier is used as an index (as shown by arrow (2966)) into the pattern index (2968). The pattern index is an array of pointers to lists of cluster identifiers. For example, a pattern identifier that indexes the first element of the pattern index (2970) refers, through the pattern index, to a list of three cluster identifiers (2972)-(2974). These cluster identifiers identify the clusters that contain, or are associated with, the pattern data structures identified by the pattern identifier (2964). In the case of leaf node (2962) of decision tree (2960), the single pattern identifier (2964) contained in the leaf node refers to a list of two cluster identifiers (2976) and (2978). The first cluster identifier is used as an index into the cluster index (2982), as shown by arrow (2980). The indexed entry of the cluster index contains a pointer or reference to the cluster data structure (2984). The cluster data structure contains a local pattern data structure index (2986). The pattern identifier (2964) is used to index an entry (2988) in the local pattern data structure index (2986); that entry contains a reference to the instance of the pattern data structure (2990) that corresponds to pattern identifier (2964) and is contained in, or associated with, cluster data structure (2984). Many additional types of data structures and access methods can be used to locate the suitable pattern data structures, contained in cluster data structures, that are associated with each leaf node of each decision tree. In some cases, a given pattern data structure may be represented by several entries in the pattern index (2968) and may therefore be associated with different lists of cluster data structures, so that one decision tree may specify voting by the pattern data structure in the context of cluster data structures different from those specified by another decision tree.
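The chain of lookups in FIG. 29D (pattern identifier, pattern index, cluster identifier list, cluster index, local pattern index, pattern instance) can be sketched as below. Dictionaries stand in for the arrays of the description, and the class and field names are assumptions introduced for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Pattern:
    pattern_id: int
    graphemes: List[str]


@dataclass
class Cluster:
    cluster_id: int
    # Local index: pattern identifier -> this cluster's instance of the pattern.
    local_pattern_index: Dict[int, Pattern] = field(default_factory=dict)


def resolve(pattern_id: int,
            pattern_index: Dict[int, List[int]],
            cluster_index: Dict[int, Cluster]) -> List[Tuple[Cluster, Pattern]]:
    """Follow a leaf's pattern identifier to every (cluster, pattern instance) pair."""
    resolved = []
    for cluster_id in pattern_index[pattern_id]:        # list of cluster identifiers
        cluster = cluster_index[cluster_id]             # cluster data structure
        instance = cluster.local_pattern_index[pattern_id]
        resolved.append((cluster, instance))
    return resolved


c0 = Cluster(0, {5: Pattern(5, ["a", "o"])})
c1 = Cluster(1, {5: Pattern(5, ["a", "d"]), 6: Pattern(6, ["b"])})
cluster_index = {0: c0, 1: c1}
pattern_index = {5: [0, 1], 6: [1]}

resolved = resolve(5, pattern_index, cluster_index)
```

Pattern identifier 5 resolves to two instances, one per cluster, and each instance can carry its own grapheme list.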

FIGS. 30A-D show, using flowcharts, one embodiment of a document processing method based on the multi-cluster data structure that uses a decision forest to select suitable pattern data structures for each symbol image. FIG. 30A shows the highest level of the OCR document processing method. In step (3002), a document is received and a data structure corresponding to the processed document (PD) is initialized; it will store the symbol codes generated during document processing for the corresponding symbol images of the received document. Then, in step (3004), the data structures described above with reference to FIGS. 24-25A are initialized in preparation for document processing. Next, in the for loop of steps (3006)-(3012), each page of the document is processed in order to replace, in the processed document PD, the symbol images of the received document with symbol codes or other computed symbol representations. In the first step of the outer for loop (3006)-(3012), the table of processed symbol images described above with reference to FIGS. 27 and 28G is cleared and reinitialized. Then, in the inner for loop of steps (3008)-(3011), each symbol of the current page is processed by calling the subroutine "process symbol image" (3009). When the nested for loops of steps (3006)-(3012) complete, step (3014) frees the memory occupied by the data structures used in OCR processing of the received document and returns the processed document PD.
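The control flow of FIG. 30A can be sketched as a pair of nested loops. The two callbacks are stand-ins for the subroutines of FIGS. 30C and 30D; their signatures, the per-page table keyed by symbol position, and the point at which the page-level callback is invoked are assumptions of this sketch.

```python
from typing import Callable, Dict, List, Sequence


def process_document(
    pages: Sequence[Sequence[str]],
    process_symbol_image: Callable[[str], List[str]],
    process_page: Callable[[Sequence[str], Dict[int, List[str]]], str],
) -> List[str]:
    processed_document: List[str] = []          # step 3002: initialize PD
    # Step 3004 (initializing shared data structures) is omitted here.
    for page in pages:                          # outer loop over pages
        table: Dict[int, List[str]] = {}        # table cleared for each page
        for i, symbol_image in enumerate(page): # inner loop over symbol images
            table[i] = process_symbol_image(symbol_image)
        processed_document.append(process_page(page, table))
    return processed_document                   # step 3014: return PD


# Toy stand-ins: a "symbol image" is a lowercase letter, recognized as uppercase.
def toy_process_symbol_image(symbol_image: str) -> List[str]:
    return [symbol_image.upper()]


def toy_process_page(page: Sequence[str], table: Dict[int, List[str]]) -> str:
    return "".join(table[i][0] for i in range(len(page)))


result = process_document([["a", "b"], ["c"]],
                          toy_process_symbol_image, toy_process_page)
```

Memory release in step (3014) has no counterpart in this sketch because Python reclaims the per-page table automatically.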

FIG. 30B shows the subroutine "initialize data structures" called in step (3004) of FIG. 30A. In step (3015), memory is allocated for the decision forest and the forest is initialized. In general, the decision forest is based on an independent analysis of large volumes of text in the natural language for which the decision trees are built. In step (3016), memory is allocated for the parameter array ((2002) in FIG. 20A) and the array is initialized. In step (3018), memory is allocated for the vote array ((2004) in FIG. 20A). In step (3020), memory is allocated and initialized for the cluster data structures ((2006)-(2008) in FIG. 20A) and for the data structures contained in or referenced by the cluster data structures, including the local parameter arrays ((2012) in FIG. 20A) and the pattern data structures ((2018) in FIG. 20A). As mentioned above, each cluster data structure may include a set of references to parameters described in the global parameter array ((2002) in FIG. 20A), as well as pattern data structures and a cutoff weight that differ from those of other cluster data structures. Each cluster data structure specializes in recognizing a subset or family of the symbols of a particular language or set of related languages. Finally, in step (3022), memory is allocated for the table of processed symbol images ((2706) in FIG. 27). In addition, memory for various other variables and arrays is allocated, statically or dynamically, and initialized.
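One possible shape for the structures initialized in FIG. 30B is sketched below. All class and field names, the pixel-grid image type, and the three example parameter functions are assumptions; the description only fixes the roles (global parameter array, vote array, clusters with parameter references, cutoff weight and patterns, and the table of processed symbol images).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Image = List[List[int]]  # hypothetical image type: rows of 0/1 pixels


@dataclass
class PatternData:
    parameter_values: List[float]   # stored parameter values of the pattern
    grapheme_codes: List[int]       # graphemes this pattern may vote for


@dataclass
class ClusterData:
    parameter_refs: List[int]       # indexes into the global parameter array
    cutoff_weight: float            # cutoff specific to this cluster
    patterns: Dict[int, PatternData] = field(default_factory=dict)


@dataclass
class OcrState:
    forest: list                                 # decision trees, opaque here
    parameters: List[Callable[[Image], float]]   # global parameter array
    votes: List[int]                             # one counter per grapheme code
    clusters: List[ClusterData]
    processed_symbols: Dict[int, List[int]] = field(default_factory=dict)


def initialize_data_structures(forest, parameters, clusters, num_graphemes):
    return OcrState(forest=list(forest), parameters=list(parameters),
                    votes=[0] * num_graphemes, clusters=list(clusters))


def height(img: Image) -> float:
    return float(len(img))


def width(img: Image) -> float:
    return float(len(img[0]))


def ink(img: Image) -> float:
    return float(sum(map(sum, img)))


cluster = ClusterData(parameter_refs=[0, 2], cutoff_weight=3.0,
                      patterns={5: PatternData([2.0, 3.0], [65, 79])})
state = initialize_data_structures([], [height, width, ink], [cluster], 4)
image = [[0, 1, 1], [1, 1, 0]]
# A cluster evaluates only the parameters it references.
local_values = [state.parameters[i](image) for i in cluster.parameter_refs]
```

Storing parameter references per cluster lets each cluster use its own subset of the shared parameter functions, which matches the statement that clusters differ in parameters, patterns and cutoff weight.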

FIG. 30C shows a flowchart of the subroutine "process symbol image" called in step (3009) of FIG. 30A. In step (3024), the subroutine computes values for all parameters represented by parameter-value functions, or by references to such functions, in the parameter array (2008). In step (3026), the computed parameter values, or subsets of them, are input to the decision trees that together make up the decision forest ((2906) in FIG. 29A). Then, in the nested for loops of steps (3027)-(3037), the subroutine processes the symbol image as described above with reference to FIGS. 29A-D. The outer for loop of steps (3027)-(3037) is executed for each pattern data structure returned by the decision forest. The first inner for loop, of steps (3028)-(3036), iterates over all pattern data structures within the current cluster data structure. In step (3029), the current values of the loop variables are used to identify the next pattern data structure contained in, or associated with, the cluster data structure. In step (3030), a weight W is computed for the current pattern data structure from the pattern parameter values contained in the pattern data structure and the corresponding parameter values computed for the current symbol image, as described above with reference to FIG. 28D. If the computed weight is greater than the cutoff weight of the cluster data structure, as determined in step (3031), processing of the current pattern data structure stops and control passes to step (3036), described below. Otherwise, in step (3032), the computed weight is used to select an index, from the index segment of the pattern data structure, that points to a grapheme code in the pattern data structure. Then, in the innermost for loop of steps (3033)-(3035), a vote is added to the vote array cell indicated by each grapheme code in the pattern data structure, from the first grapheme code through the grapheme code whose index was selected in step (3032). In the described embodiment, each vote adds 1 to the number of votes accumulated for a grapheme. In alternative embodiments, votes may be real numbers from some range, for example [0.0, 1.0]. In other alternative embodiments, integers from some range may be used as vote values that characterize the degree of similarity between the grapheme and the pattern represented by the pattern data structure. In step (3040), as described above with reference to FIG. 28G, the codes of the graphemes that received votes during processing of the current symbol image are sorted to form a sorted truncated array of grapheme codes ((2869) in FIG. 28G). Then, in step (3042), the sorted truncated array of grapheme codes is added to the entry of the table of processed symbol images ((2072) in FIG. 28G) that corresponds to the current symbol image.
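The weight, cutoff and voting steps for a single pattern data structure can be sketched as follows. Several details are assumptions made only for illustration: the weight is taken to be a sum of absolute parameter differences (the actual computation is the one described with reference to FIG. 28D), the index segment is modeled as a list indexed by the truncated weight, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class PatternData:
    parameter_values: List[float]
    grapheme_codes: List[int]       # ordered from most to least likely
    # Assumed index segment: index_by_weight[w] is the position of the last
    # grapheme that receives a vote when the truncated weight equals w.
    index_by_weight: List[int]


def compute_weight(pattern: PatternData, symbol_params: Sequence[float]) -> float:
    # Assumed metric: sum of absolute differences; lower means more similar.
    return sum(abs(p - s) for p, s in zip(pattern.parameter_values, symbol_params))


def cast_votes(pattern: PatternData, symbol_params: Sequence[float],
               cutoff: float, votes: Dict[int, int]) -> bool:
    """Add votes for one pattern; return False if the cutoff rejected it."""
    weight = compute_weight(pattern, symbol_params)
    if weight > cutoff:
        return False                 # too dissimilar: the pattern casts no votes
    slot = min(int(weight), len(pattern.index_by_weight) - 1)
    last = pattern.index_by_weight[slot]
    # Vote for graphemes from the first through the selected index.
    for code in pattern.grapheme_codes[: last + 1]:
        votes[code] = votes.get(code, 0) + 1   # each vote adds 1
    return True


pattern = PatternData([10.0, 4.0], [65, 79, 81], index_by_weight=[2, 1, 0])
votes: Dict[int, int] = {}
accepted = cast_votes(pattern, [10.0, 5.0], cutoff=2.0, votes=votes)
rejected = cast_votes(pattern, [20.0, 4.0], cutoff=2.0, votes=votes)
```

In this example the first call computes a weight of 1.0, passes the cutoff, and votes for the first two graphemes; the second call computes a weight of 10.0 and is rejected without changing the vote counts.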

FIG. 30D shows, as a flowchart, the subroutine "process page" called in step (3011) of FIG. 30A. In step (3044), a new processed-page data structure is initialized; it will store symbol codes or other symbol representations for subsequent transfer to the processed document PD. Then, in the for loop of steps (3046)-(3050), each symbol image within the image of a page of the text-containing document received in step (3002) of FIG. 30A is processed. In step (3047), the entry for the currently considered symbol image is retrieved from the table of processed symbol images. In step (3048), this entry is used, together with the pattern data structures that cast votes for the graphemes in the entry, to determine the symbol that best matches the symbol image, using the graphemes whose codes were associated with the symbol image during its processing by the subroutine "process symbol image" shown in FIG. 30C. There are many ways to determine the symbol code that best matches a symbol image. In step (3049), the symbol, or a code representing it, is placed in the processed-page data structure at the position corresponding to the location of the symbol image on the page of the received document currently being processed. After all symbol images have been processed, the contents of the processed-page data structure are placed in the processed document PD in step (3052), after which the memory occupied by the processed-page data structure is freed in step (3054).
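A deliberately simple version of the page-level step is sketched below. Because the description leaves the final selection method open, the rule used here (take the top-voted grapheme and map it to a character) is only one assumed possibility, and the position keys, the grapheme-to-character table, and the `"?"` placeholder are hypothetical.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # hypothetical (line, column) position on the page


def process_page(symbol_positions: Dict[int, Position],
                 processed_symbol_table: Dict[int, List[int]],
                 grapheme_to_char: Dict[int, str]) -> Dict[Position, str]:
    """Place one character per symbol image into a processed-page structure."""
    page: Dict[Position, str] = {}                  # new processed page
    for symbol_id, position in symbol_positions.items():
        candidates = processed_symbol_table.get(symbol_id, [])
        if candidates:
            # Assumed rule: the grapheme with the most votes wins.
            page[position] = grapheme_to_char[candidates[0]]
        else:
            page[position] = "?"                    # no grapheme received votes
    return page


positions = {0: (0, 0), 1: (0, 1)}
table = {0: [7, 3], 1: []}          # sorted truncated grapheme arrays per symbol
chars = {7: "A", 3: "O"}
page = process_page(positions, table, chars)
```

A fuller implementation could also consult the pattern data structures that cast the votes, as the description indicates, rather than relying on vote order alone.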

Although the present invention has been described in terms of particular embodiments, it is not intended to be limited to those embodiments. Modifications within the spirit of the invention will be apparent to those skilled in the art. For example, any of many possible implementations of the data structures and methods used for preprocessing in accordance with the generalized third embodiment of the OCR system described above can be obtained by varying any of various design and implementation parameters, including data structures, control structures, modular organization, programming language, the underlying operating system and hardware, and many other such design and implementation parameters.

It should be understood that the above description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Modifications to these embodiments will be apparent to those skilled in the art, and the general principles presented here can be applied to other embodiments without departing from the spirit or scope of the description. Thus, the present description is not limited to the embodiments presented here, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An optical character recognition system comprising:
one or more processors;
one or more memory modules;
one or more storage devices; and
machine-code instructions stored in one or more of the one or more storage devices which, when executed by one or more of the one or more processors, control the optical character recognition system to process a scanned image of a text-containing document by:
identifying symbol images in the scanned image of the text-containing document;
for each page of the document,
for each symbol image on the page,
identifying a set of suitable reference data structures for the symbol image using a decision forest,
using the suitable reference data structures to determine a set of suitable graphemes, and
using the identified set of suitable graphemes to select a symbol code that corresponds to the symbol image; and
preparing a processed document containing the symbol codes that correspond to the symbol images in the scanned image of the document, and storing the processed document in one or more of the one or more storage devices and memory modules.

2. The optical character recognition system according to claim 1, wherein using the suitable reference data structures to determine the set of suitable graphemes further comprises:
evaluating each candidate reference data structure, in the context of one or more cluster data structures, for the ability to vote for graphemes;
voting, by the reference data structures evaluated as able to vote, for one or more graphemes;
identifying, as suitable graphemes for recognizing the symbol image, those graphemes that receive votes; and
sorting the identified graphemes into a truncated sorted array of graphemes, which represents the set of suitable graphemes.

3. The optical character recognition system according to claim 2, wherein the truncated sorted array of graphemes contains the identified graphemes whose number of votes exceeds a threshold number of votes.
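
The identification, sorting and truncation steps recited in claims 2 and 3 can be sketched as follows. This is an illustrative sketch only: the function name `truncated_sorted_graphemes`, the dictionary of vote totals, the optional `max_len` limit and the alphabetical tie-break are assumptions not stated in the claims.

```python
def truncated_sorted_graphemes(votes, threshold=0, max_len=None):
    """Return grapheme codes whose vote totals exceed `threshold`, most-voted first.

    votes     -- dict mapping grapheme code -> accumulated vote total
    threshold -- graphemes must have strictly more votes than this value
    max_len   -- optional cap on the array length (hypothetical parameter)
    """
    voted = [(code, total) for code, total in votes.items() if total > threshold]
    # Sort by descending vote total; ties are broken by code so the
    # result is deterministic (an assumption, not part of the claims).
    voted.sort(key=lambda item: (-item[1], item[0]))
    codes = [code for code, _ in voted]
    return codes if max_len is None else codes[:max_len]
```

With the default threshold of zero, the result contains exactly the graphemes that received at least one vote, which corresponds to claim 2; a positive threshold corresponds to claim 3.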

4. The optical character recognition system of claim 2, further comprising a set of data structures stored in one or more of the one or more storage devices, the set of data structures including:
a decision forest;
a vote data structure in which accumulated voting results are stored, each result being associated with a grapheme code, and the total number of results in the vote data structure corresponding to the number of suitable graphemes;
reference data structures, each of which represents a symbol pattern and contains an ordered set of parameter values, an ordered set of indices, and an ordered set of grapheme codes; and
two or more cluster data structures, each cluster data structure comprising an ordered set of parameter references and either a number of reference data structures or a number of references to reference data structures.

5. The optical character recognition system according to claim 4, further comprising an ordered set of functions that generate parameter values from symbol images.

6. The optical character recognition system according to claim 5, wherein identifying, as suitable graphemes for recognizing the symbol image, those graphemes that receive votes further comprises:
initializing the vote data structure; and
adding a vote value to the element of the vote data structure corresponding to the grapheme for which a reference data structure votes.
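
For orientation, the data structures recited in claims 4 to 6 might be laid out as in the following sketch. The class and field names (`Pattern`, `Cluster`, `VoteTable`, `param_refs`, and so on) are hypothetical, the decision forest is omitted, and plain Python lists and dictionaries stand in for whatever storage layout an actual implementation uses.

```python
from dataclasses import dataclass, field


@dataclass
class Pattern:
    """Reference data structure for one symbol pattern (hypothetical layout)."""
    params: list           # ordered set of parameter values
    indices: list          # ordered set of indices into grapheme_codes
    grapheme_codes: list   # ordered set of grapheme codes


@dataclass
class Cluster:
    """Cluster data structure (hypothetical layout)."""
    param_refs: list                              # ordered references to parameters
    patterns: list = field(default_factory=list)  # patterns, or references to them


@dataclass
class VoteTable:
    """Vote data structure: one accumulated total per grapheme code."""
    totals: dict = field(default_factory=dict)

    def reset(self, grapheme_codes):
        # Initialization step of claim 6: every total starts at zero.
        self.totals = {code: 0 for code in grapheme_codes}

    def add(self, code, value=1):
        # Accumulation step of claim 6: add a vote value for one grapheme.
        self.totals[code] += value
```

Keeping the vote table separate from the patterns lets it be reset once per symbol image while the patterns and clusters remain read-only.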

7. The optical character recognition system according to claim 4, wherein identifying, using the decision forest, the set of suitable reference data structures for the symbol image further comprises:
computing a set of parameter values for the symbol image;
inputting the set of parameter values, or a subset of the set of parameter values, to each decision tree in the decision forest; and
identifying, as the set of suitable reference data structures, the reference data structures that are each determined by a terminal node of a decision tree.
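
The forest lookup recited in claim 7 can be sketched as below. The claim does not specify how the nodes of a tree test the parameter values, so the threshold comparison used here, along with the names `Node`, `leaf_pattern` and `candidate_patterns` and the use of integer pattern identifiers, is an assumption for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Decision-tree node. A node with pattern_id set is a terminal node."""
    index: int = 0                     # which parameter value to test
    threshold: float = 0.0             # assumed test: params[index] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    pattern_id: Optional[int] = None   # identifies a reference data structure


def leaf_pattern(tree: Node, params: Sequence[float]) -> int:
    """Walk one tree from the root to a terminal node and return its pattern."""
    node = tree
    while node.pattern_id is None:
        node = node.left if params[node.index] <= node.threshold else node.right
    return node.pattern_id


def candidate_patterns(forest, params):
    """Feed the parameter values to every tree and collect the patterns reached."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(leaf_pattern(tree, params) for tree in forest))
```

Each tree contributes at most one candidate, so the size of the candidate set is bounded by the number of trees in the forest.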

8. The optical character recognition system according to claim 4, wherein evaluating each candidate reference data structure, in the context of one or more cluster data structures, for the ability to vote for graphemes further comprises:
comparing the parameter values in the set of parameter values of each identified reference data structure with the parameter values in the ordered set of parameter values associated with the cluster, to obtain a weight value by summing the absolute values of the differences between each parameter value in the set of parameter values of the reference data structure and the corresponding parameter value in the ordered set of parameter values associated with the cluster; and
when the weight value is less than a cutoff value, determining that the reference data structure is able to vote for graphemes.
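
The weight computation of claim 8 is a sum of absolute differences compared against a cutoff. A minimal sketch follows; the function names `weight` and `can_vote` and the use of plain lists for the two parameter sets are assumptions.

```python
def weight(pattern_params, cluster_params):
    """Sum of absolute differences between corresponding parameter values."""
    if len(pattern_params) != len(cluster_params):
        raise ValueError("parameter sets must have the same length")
    return sum(abs(p - c) for p, c in zip(pattern_params, cluster_params))


def can_vote(pattern_params, cluster_params, cutoff):
    """A reference data structure may vote when its weight is below the cutoff."""
    return weight(pattern_params, cluster_params) < cutoff
```

For example, `weight([1, 4, 2], [2, 2, 2])` is 1 + 2 + 0 = 3, so the pattern may vote with a cutoff of 4 but not with a cutoff of 3, because the comparison is strict.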

9. The optical character recognition system according to claim 8, wherein voting for one or more graphemes by the suitable reference data structures that were evaluated as able to vote for graphemes further comprises:
using the obtained weight value to select an index from the ordered set of indices in the reference data structure;
using the selected index to select a grapheme code from the ordered set of grapheme codes in the reference data structure; and
for each grapheme code in the ordered set of grapheme codes in the reference data structure, from the first grapheme code through the selected grapheme code,
indexing the vote data structure by the grapheme code to locate the accumulated voting result for the grapheme corresponding to the grapheme code, and adding a value to that accumulated voting result.
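
The voting step of claim 9 can be sketched as follows. The claim does not say how the weight value selects an index, so this sketch assumes that the weight is rounded down and clamped to a position in the ordered set of indices; that mapping, the function name `cast_votes`, and the default vote value of 1 are all assumptions.

```python
def cast_votes(weight, indices, grapheme_codes, votes, vote_value=1):
    """Let one reference data structure vote for its leading grapheme codes.

    weight         -- non-negative weight value obtained for the pattern
    indices        -- ordered set of indices into grapheme_codes
    grapheme_codes -- ordered set of grapheme codes of the pattern
    votes          -- dict mapping grapheme code -> accumulated vote total
    """
    # Assumption: the integer part of the weight, clamped to the last
    # position, selects an entry of `indices`.
    slot = min(int(weight), len(indices) - 1)
    last = indices[slot]
    # Vote for every grapheme code from the first through the selected one.
    for code in grapheme_codes[: last + 1]:
        votes[code] = votes.get(code, 0) + vote_value
    return votes
```

Under this assumed mapping, the contents of `indices` decide how many of a pattern's grapheme codes receive a vote for a given weight.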

10. A method implemented in an optical character recognition system having one or more processors, one or more memory modules, one or more storage devices, and machine-code instructions that are stored in one or more of the one or more storage devices and executed by one or more of the one or more processors, the method comprising:
identifying symbol images in a scanned image of a text-containing document;
for each page of the document,
for each symbol image on the page,
identifying a set of suitable reference data structures for the symbol image using a decision forest,
using the suitable reference data structures to determine a set of suitable graphemes, and
using the identified set of suitable graphemes to select a symbol code that corresponds to the symbol image; and
preparing a processed document containing the symbol codes that correspond to the symbol images in the scanned image of the document, and storing the processed document in one or more of the one or more storage devices and memory modules.

11. The method according to claim 10, wherein using the suitable reference data structures to determine the set of suitable graphemes further comprises:
evaluating each candidate reference data structure, in the context of one or more cluster data structures, for the ability to vote for graphemes;
voting, by the reference data structures evaluated as able to vote, for one or more graphemes;
identifying, as suitable graphemes for recognizing the symbol image, those graphemes that receive votes; and
sorting the identified graphemes into a truncated sorted array of graphemes, which represents the set of suitable graphemes.

12. The method according to claim 11, wherein the truncated sorted array of graphemes contains the identified graphemes whose number of votes exceeds a threshold number of votes.

13. The method of claim 11, further comprising a set of data structures stored in one or more of the one or more storage devices, the set of data structures including:
a decision forest;
a vote data structure in which accumulated voting results are stored, each result being associated with a grapheme code, and the total number of results in the vote data structure corresponding to the number of grapheme codes;
reference data structures, each of which represents a symbol pattern and contains an ordered set of parameter values, an ordered set of indices, and an ordered set of grapheme codes; and
two or more cluster data structures, each cluster data structure comprising an ordered set of parameter references and either a number of reference data structures or a number of references to reference data structures.

14. The method of claim 13, further comprising an ordered set of functions that generate parameter values from symbol images.

15. The method according to claim 14, wherein identifying, as suitable graphemes for recognizing the symbol image, those graphemes that receive votes further comprises:
initializing the vote data structure; and
adding a vote value to the element of the vote data structure corresponding to the grapheme for which a reference data structure votes.

16. The method according to claim 14, wherein identifying, using the decision forest, the set of suitable reference data structures for the symbol image further comprises:
computing a set of parameter values for the symbol image;
inputting the set of parameter values, or a subset of the set of parameter values, to each decision tree in the decision forest; and
identifying, as the set of suitable reference data structures, the reference data structures that are each determined by a terminal node of a decision tree.

17. The method according to claim 14, wherein evaluating each candidate reference data structure, in the context of one or more cluster data structures, for the ability to vote for graphemes further comprises:
comparing the parameter values in the set of parameter values of each identified reference data structure with the parameter values in the ordered set of parameter values associated with the cluster, to obtain a weight value by summing the absolute values of the differences between each parameter value in the set of parameter values of the reference data structure and the corresponding parameter value in the ordered set of parameter values associated with the cluster; and
when the weight value is less than a cutoff value, determining that the reference data structure is able to vote for graphemes.

18. The method according to claim 17, wherein voting for one or more graphemes by a suitable reference data structure further comprises:
using the obtained weight value to select an index from the ordered set of indices in the reference data structure;
using the selected index to select a grapheme code from the ordered set of grapheme codes in the reference data structure; and
for each grapheme code in the ordered set of grapheme codes in the reference data structure, from the first grapheme code through the selected grapheme code,
indexing the vote data structure by the grapheme code to locate the accumulated voting result for the grapheme corresponding to the grapheme code, and adding a value to that accumulated voting result.

19. Machine-code instructions stored in a physical storage device that, when executed by one or more processors of an optical character recognition system, control the optical character recognition system to perform:
identifying symbol images in a scanned image of a text-containing document;
for each page of the document,
for each symbol image on the page,
identifying a set of suitable reference data structures for the symbol image using a decision forest,
using the suitable reference data structures to determine a set of suitable graphemes, and
using the identified set of suitable graphemes to select a symbol code that corresponds to the symbol image; and
preparing a processed document containing the symbol codes that correspond to the symbol images in the scanned image of the document, and storing the processed document in one or more of one or more storage devices and memory modules.

20. The machine-code instructions stored in a physical storage device according to claim 19, wherein using the suitable reference data structures to determine the set of suitable graphemes further comprises:
evaluating each candidate reference data structure, in the context of one or more cluster data structures, for the ability to vote for graphemes;
voting, by the reference data structures evaluated as able to vote, for one or more graphemes;
identifying, as suitable graphemes for recognizing the symbol image, those graphemes that receive votes; and
sorting the identified graphemes into a truncated sorted array of graphemes, which represents the set of suitable graphemes.