RU2295154C1 - Method for recognizing text information from graphic file with usage of dictionaries and additional data

Info

Publication number: RU2295154C1
Application number: RU2005118673/09A
Authority: RU
Inventors: Константин Владимирович Анисимович (RU); Владимир Юрьевич Рыбкин (RU); Александр Львович Шамис (RU)
Original assignee: "Аби Софтвер Лтд."
Priority date: 2005-06-16
Filing date: 2005-06-16
Publication date: 2007-03-10
Also published as: RU2005118673A

Abstract

FIELD: technology for recognizing text information from a graphic file.

SUBSTANCE: according to the method, the order in which additional information is consulted is set in advance, and a quality estimate is assigned to each type of additional information. Different variants of dividing the images of the selected lines into fragments are constructed, and a linear division graph is built for each line fragment. Images of graphic elements are recognized using a classifier, and an estimate is assigned to each recognition variant. A transition is made from the recognition variants of graphic elements to variants of alphabet symbols. For each chain connecting the initial and final vertices, chains are built that correspond to all recognition variants of the graphic elements and all variants of transition from recognized graphic elements to alphabet symbols. The resulting variants are ranked in order of decreasing recognition quality estimate and processed using information about the position of uppercase and lowercase letters. If recognition of a graphic element yields more than one symbol variant, the variants are processed using the types of additional information in succession and/or, when necessary, all types simultaneously. A quality estimate is assigned to each resulting variant, symbol variants with an estimate below a predetermined value are discarded, the remaining variants are sorted by pairwise comparison, and spaces erroneously recognized at earlier stages are additionally corrected.

EFFECT: increased accuracy and increased noise immunity of text recognition.

9 cl, 2 dwg

Description

The invention relates to pattern recognition in graphic images, and in particular to recognition of text in an image of a document in electronic form.

Methods are known for pre-processing graphic images that divide the image into regions presumed to be paragraphs, lines, words and individual characters.

The invention relates to processing character images and to recognizing groups of characters, words, groups of words, etc. using additional information.

Terms used.

Recognition variant - one of several versions of the result of interpreting graphemes and composing a word; one of several versions of the linear division of an image fragment into words and characters; the result of recognizing the fragments (arcs) of a linear division graph (hereinafter GLD) and interpreting the graphemes.

Template - a type of word-level language construct. Examples: a word made of Russian letters, a number, a telephone number, a URL.

Translation - the transition from the graphic image of a character (a grapheme) to a character.

Text fragment - the object at whose level most of the processing is carried out: building variants in accordance with the templates, evaluating them, and choosing the best variant.

Differential comparator - a mechanism for evaluating the recognition variants of fragments. It is a set of rules that perform both pairwise comparison of variants and integral evaluation.

Dictionaries of several kinds, the rules of the document's language, and the subject matter (literary text, scientific article, fill-in form, etc.) can be used as additional information. Using additional information improves recognition accuracy when it is necessary to:

- choose the most correct of several variants of the linear division of a line fragment into characters,

- choose the most correct variant of recognition of a character image (grapheme) by several classifiers,

- choose the most correct of the several letters or ligatures that the selected grapheme may denote,

- analyze a large section of recognized text (for example, a line) and, where possible, correct the recognition results (for example, replace a comma with a period if it stands at the end of a sentence, or merge the two French words «d'» and «Alembert» into one).

A method of recognizing text characters using additional information is known.

Patent RU No. 2234734 discloses a method for processing an image of characters and a text fragment that comprises several successive stages. Additional information (mainly spatial and parametric) that becomes available at each stage is collected for later use when the same fragment is analyzed again.

That method uses only a limited number of types of additional information. It does not use information from additional sources, information based on features of the language, or certain other external information. The order in which the different types of information are consulted is not taken into account.

The technical result consists in increasing the accuracy and the noise immunity of text recognition.

The method described above, like other known methods, does not achieve the required level of recognition accuracy.

The claimed technical result is achieved by using language rules and additional rules.

In the claimed method, after the image presumed to contain text has been divided into fragments presumed to contain characters, character recognition is performed. Recognition yields one or more character variants for each grapheme. The characters are then combined into groups presumed to form words. All possible words are considered, obtained as combinations of all possible variants of dividing the image into characters and of the recognition variants of the constituent fragments. Groups of characters are analyzed, including together with one or more neighboring groups on one or both sides. Additional information of several types is applied to the words sequentially from several sources, to the extent needed for accurate recognition of the word, balancing the level of recognition reliability achieved against the amount of computing resources used.

The sequence of analysis is as follows.

The list of additional information and the order in which it is consulted are set in advance. The list and the order of consultation are as follows:

1. Information on the points at which the line is divided into characters.

2. The recognition quality of the graphic element.

3. A dictionary.

4. A dictionary of possible word parts, for example trigrams.

5. Rules arising from the standard data templates used.

6. Rules arising from the position of the word within the line and/or paragraph.

7. Rules arising from features of the document language.

8. Rules arising from the document type.

9. Additional rules for handling rarely encountered cases.

Only some of the listed types of additional information, rather than all of them, may be used.

A quality estimate is also assigned in advance to each type of additional information.

All possible variants are determined for dividing the image fragments presumed to be lines of text into fragments presumed to be images of individual words, using reliably recognized spaces.

For each line fragment presumed to be a word, a linear division graph (GLD) is built that describes the variants of dividing the fragment into graphic elements corresponding to character images (graphemes).

The images of the graphic elements are recognized using at least one classifier, and an estimate is assigned to each recognition variant of a graphic element (grapheme).

A transition is made from the recognition variants of the graphemes that make up the word to alphabet characters, using the variants with the highest character recognition estimates.

Among all GLD chains leading from the initial vertex to the final one, one or more chains of recognized characters are selected (marked) that correspond to the selected recognition variants of the graphic elements (graphemes).

All resulting variants of a character group are divided into four kinds according to the following features:

- all characters are uppercase letters,

- all characters are lowercase letters,

- the first character is an uppercase letter and the rest are lowercase,

- a variant selected on the basis of the estimates of the transitions made from the recognized grapheme to characters, using the first type of additional information.

If recognition of a graphic element yields more than one character variant, the variants are processed by bringing in the subsequent types of additional information one after another, in the predetermined order.
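The sequential use of information sources can be sketched as follows. This is a minimal illustration and not the patented implementation: the function `apply_sources_in_order`, the toy dictionary, and the rule that a source rejecting every candidate is ignored are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical source type: a name plus a filter that returns the
# candidates the source accepts.
Source = Tuple[str, Callable[[List[str]], List[str]]]

def apply_sources_in_order(candidates: List[str],
                           sources: List[Source]) -> Tuple[List[str], List[str]]:
    """Consult sources in their preset order until one candidate remains.

    Returns the surviving candidates and the names of the sources consulted.
    Assumption: a source that rejects every candidate is ignored.
    """
    used = []
    for name, keep in sources:
        if len(candidates) <= 1:
            break  # nothing left to disambiguate, save the remaining work
        used.append(name)
        filtered = keep(candidates)
        if filtered:
            candidates = filtered
    return candidates, used

DICTIONARY = {"cat"}  # toy dictionary, for illustration only
SOURCES = [
    ("dictionary", lambda ws: [w for w in ws if w in DICTIONARY]),
    ("letters only", lambda ws: [w for w in ws if w.isalpha()]),
]

result, used = apply_sources_in_order(["c4t", "cat"], SOURCES)
```

Here the dictionary alone settles the choice, so the second source is never consulted, which reflects the idea of using only as much additional information as is needed.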

Spaces erroneously recognized at the previous steps are additionally corrected.

If necessary, new rules and restrictions are added for data types. Data types are divided into simple and composite; a composite type is formed either by joining two or more simple types or as any combination of simple and composite types.

The restrictions (rules) include, among other things, the use of one or more types of templates.

Description of the text recognition process using additional information.

Processing with additional information can be loosely described as three processing passes over a single text fragment.

Using a character recognition tool, the graphic images of the text lines are divided into fragments, which are processed one at a time from left to right. For each line fragment, a symbolic linear division graph (GLD) is built that describes the variants of dividing the fragment into graphemes, and the grapheme images are recognized using classifiers of at least two different types.

The first processing pass over the GLD is then performed using additional information.

The first processing pass consists of the following:

1. Chains of arcs leading from the initial vertex to the final one are selected on the GLD, and for each arc one of its recognition variants (a letter) is selected. Using additional information, several chains of arcs are built in order of decreasing total recognition quality.

2. The resulting chains of arcs are processed using templates that describe the kinds of words that may occur in the text. Since one grapheme can denote several letters, a word variant may contain several variants for some letters. Templates reduce the number of possible variants to some extent (for example, the word template keeps only variants made of letters of the language, and the number template only variants made of digits), but not completely.

3. All generated variants are evaluated by their total quality score, and a certain number of the best variants are sorted differentially on the basis of pairwise comparison.

4. Several of the best variants are selected from the sorted list and passed on to the second processing pass.
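Step 2 of the list above, in which templates narrow the per-letter variants, can be illustrated with a small sketch. The template table, the Latin-only alphabet and the function name `expand_with_template` are assumptions chosen for the example; real templates would also carry a grammar.

```python
import string
from itertools import product

# Hypothetical templates: name -> characters the template allows.
TEMPLATES = {
    "word": set(string.ascii_letters),
    "number": set(string.digits),
}

def expand_with_template(letter_sets, allowed):
    """Return every string that picks one allowed character per position.

    letter_sets holds, for each grapheme, the letters it may denote.
    An empty list means the template cannot match the fragment at all.
    """
    narrowed = [[c for c in options if c in allowed] for options in letter_sets]
    if any(not options for options in narrowed):
        return []
    return ["".join(chars) for chars in product(*narrowed)]

# Three graphemes, each with look-alike readings.
letter_sets = [["l", "I", "1"], ["O", "0"], ["O", "0"]]
words = expand_with_template(letter_sets, TEMPLATES["word"])
numbers = expand_with_template(letter_sets, TEMPLATES["number"])
```

The word template still leaves two variants ("lOO" and "IOO"), which matches the remark that templates reduce the number of variants but not completely; the number template leaves only "100".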

The following sources are used to evaluate and compare variants:

- The recognition quality of the graphemes of each variant (path quality).

- Conformance to the grammar of one of the templates (template quality). The stricter the grammar and the more frequently the template is used, the more preference it is given.

- The results of the dictionary check. The score of a dictionary word increases with word length and depends on the style and complexity of the word.

- Geometric information. Adjacent letters in a word must be positioned relative to one another in a specified way and must be consistent in height.

- Information on the linear division points. Certain pairs of letters may touch each other.

- Rules written to handle certain special cases.
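The two-stage ranking (a shortlist by total score, then a differential sort by pairwise comparison) might look roughly like this. The `Variant` fields, the additive form of `integral_score`, and the single dictionary-first rule in `compare` are assumptions for illustration; the text does not give the actual rule set or weights.

```python
from dataclasses import dataclass
from functools import cmp_to_key

@dataclass
class Variant:
    text: str
    path_quality: float      # recognition quality of the graphemes
    template_quality: float  # conformance to a template grammar
    in_dictionary: bool      # result of the dictionary check

def integral_score(v: Variant) -> float:
    # Assumption: the components are simply added together.
    return v.path_quality + v.template_quality

def compare(a: Variant, b: Variant) -> int:
    """Hypothetical pairwise rules; a negative result ranks a before b."""
    if a.in_dictionary != b.in_dictionary:
        return -1 if a.in_dictionary else 1
    diff = integral_score(b) - integral_score(a)
    return (diff > 0) - (diff < 0)

def select_best(variants, shortlist=3, keep=2):
    """Shortlist by integral score, then sort the shortlist pairwise."""
    top = sorted(variants, key=integral_score, reverse=True)[:shortlist]
    return sorted(top, key=cmp_to_key(compare))[:keep]

variants = [
    Variant("c1ock", 9.0, 0.5, False),
    Variant("clock", 8.5, 0.8, True),
    Variant("dock", 7.0, 0.8, True),
    Variant("cl0ck", 5.0, 0.1, False),
]
best = select_best(variants)
```

In this toy data "c1ock" has the highest integral score, but the pairwise rule moves the dictionary words "clock" and "dock" ahead of it, which is the kind of reordering a differential comparator is meant to allow.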

During the first processing pass there is as yet no information about the text to the right of the fragment being processed (the right-hand fragment), and there may not be enough statistics to compute the exact line height. On the other hand, the first pass has additional information in the form of the GLD and the grapheme images, which are destroyed once the fragment has been processed and are unavailable during the second pass.

For this reason the final variant is not chosen during the first processing pass (ambiguities of the "lowercase-uppercase" kind in letters and of the "letter-digit" kind in identifiers often remain).

The second processing pass.

Once all fragments of a line have been processed, the second processing pass begins. By this stage all statistics on letter height have been collected, so final hypotheses are put forward and final conclusions are drawn about the line height. At the same time, the placement of uppercase letters is finally chosen and the reliability of the translation variants is assessed. The line fragments are processed in order from last to first.

During the second processing pass, all character variants are divided into four groups:

- all uppercase,

- all lowercase,

- first letter uppercase, and

- with the placement of lowercase and uppercase characters chosen on the basis of the estimates of the available translations.
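As a small illustration of the four groups, the sketch below derives the three fixed capitalisation variants from a word and takes the fourth as an input. The function name `case_groups` and the sample word are hypothetical; how the translation-based variant is actually scored is not shown.

```python
def case_groups(word: str, translated: str) -> dict:
    """Return the four capitalisation variants of one word.

    `translated` is the spelling whose per-letter case follows the
    best-scoring translations; it is supplied by the caller here,
    not computed.
    """
    return {
        "all_upper": word.upper(),
        "all_lower": word.lower(),
        "first_upper": word[:1].upper() + word[1:].lower(),
        "by_translation": translated,
    }

groups = case_groups("iphone", "iPhone")
# Several groups may coincide, so keep each distinct spelling once.
distinct = list(dict.fromkeys(groups.values()))
```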

If ambiguous translations remain, they are also resolved according to the following general rule: only the translations that give the highest possible score are kept.

Ambiguities are resolved by applying the following methods in sequence:

1. By geometric characteristics.

2. By the ratio of the heights of lowercase and uppercase characters recorded in the template.

3. Taking into account the translation variants of the fragment to the left.

4. By rules that minimize the number of switches from letters to digits in alphanumeric words.

5. If none of the preceding methods gives a result, the translation variant is chosen at random from those available.
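A sketch of this fallback chain, with method 4 spelled out, is given below. The names `disambiguate` and `fewest_switches` are hypothetical, counting switches in both directions (letter to digit and digit to letter) is a simplification, and carrying a partially narrowed option list from one method to the next is an assumption.

```python
import random

def fewest_switches(options):
    """Keep the options with the fewest letter/digit switches (method 4).

    Simplification: switches are counted in both directions.
    """
    def switches(word):
        return sum(a.isdigit() != b.isdigit() for a, b in zip(word, word[1:]))
    best = min(switches(w) for w in options)
    return [w for w in options if switches(w) == best]

def disambiguate(options, resolvers, rng=None):
    """Apply resolvers in order; fall back to a random pick (method 5)."""
    rng = rng or random.Random()
    for resolve in resolvers:
        narrowed = resolve(options)
        if len(narrowed) == 1:
            return narrowed[0]
        if narrowed:
            options = narrowed  # assumption: keep the partial narrowing
    return rng.choice(options)

choice = disambiguate(["A1O5", "A105"], [fewest_switches])
```

"A1O5" switches between letters and digits three times, "A105" only once, so the rule selects "A105" without reaching the random fallback.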

After the ambiguities have been resolved, the variants are re-evaluated and differentially sorted. The evaluation and sorting take into account the words both to the left and to the right of the word being processed. For the word in the rightmost position, several variants of right-hand word fragments are generated, corresponding to different hypotheses about the line height. For each word, evaluation and sorting are carried out with several right-hand fragments, and several of the best variants are kept to serve as right-hand fragments for the next word in order.

The third processing pass.

Once variants have been chosen for the entire line, they are corrected. Since a whole recognized line is available at this stage, the correction draws on "syntactic" information: the beginning or end of a sentence, whether the next or previous word is in lowercase or uppercase letters, and so on.

At the correction stage, the placement of spaces is refined by analyzing whether the spaces are of equal width and how the gap widths are distributed; erroneously detached punctuators (punctuation marks, etc.) and the digit one in numbers are reattached; errors such as a period replaced by a comma are corrected; and the words […] are corrected to the Roman numerals "II" and "III", respectively.

Building paths on the GLD.

Paths on the GLD are built by one of the following methods.

The first method builds the single best-quality path. It uses the standard best-path algorithm for a directed acyclic graph.
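A best-path search on a directed acyclic graph can be done with one dynamic-programming sweep in topological order. The sketch below assumes a representation not given in the text: vertices are division points numbered left to right, each arc is a tuple `(start, end, label, quality)` with `start < end`, and qualities are integer penalties where higher is better.

```python
def best_path(n_vertices, arcs):
    """Return (total_quality, labels) of the best path from vertex 0 to
    the last vertex, or None if the last vertex is unreachable.

    Because every arc goes from a lower to a higher vertex number,
    processing arcs by ascending start vertex is a topological order.
    """
    best = [None] * n_vertices
    best[0] = (0, [])
    for start, end, label, quality in sorted(arcs, key=lambda a: a[0]):
        if best[start] is None:
            continue  # start vertex cannot be reached from vertex 0
        total = best[start][0] + quality
        if best[end] is None or total > best[end][0]:
            best[end] = (total, best[start][1] + [label])
    return best[-1]

# Toy graph: the first two segments read either as "r" + "n" or as "m".
ARCS = [
    (0, 1, "r", -4),
    (1, 2, "n", -5),
    (0, 2, "m", -3),
    (2, 3, "a", -1),
]
result = best_path(4, ARCS)
```

With these hypothetical penalties the path "m", "a" (total -4) beats "r", "n", "a" (total -10).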

The second method (the "generator") enumerates all paths on the GLD, starting from the best and continuing in order of decreasing quality. A separate method for building the first (best) path is needed because, on the one hand, the first method builds it much faster than the second (the "generator") and, on the other hand, the first (best) path often turns out to be clearly the most suitable.

Both methods take into account the set of graphemes that are permitted in the chains of GLD arcs. In addition, the second method (the "generator") can be restricted to building only chains of arcs that pass a trigram admissibility test or, for example, a dictionary check, which makes it possible to increase the search depth by pruning obviously wrong paths early.
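One way to enumerate paths from best to worst with early pruning is a best-first search whose priority is the quality so far plus the best quality still achievable. This is a swapped-in textbook technique, not necessarily the one the patent uses. The arc format `(start, end, label, quality)` with `start < end`, the name `paths_best_first`, and the `prefix_ok` callback are assumptions.

```python
import heapq
from itertools import count

def paths_best_first(n_vertices, arcs, prefix_ok=lambda labels: True):
    """Yield (quality, labels) for paths 0 -> last vertex, best first.

    prefix_ok(labels) may reject a partial path (for example by a
    trigram test), so that none of its extensions is explored.
    """
    last = n_vertices - 1
    out = {}
    for arc in arcs:
        out.setdefault(arc[0], []).append(arc)
    # rest[v]: best quality achievable from v to the last vertex.
    rest = [None] * n_vertices
    rest[last] = 0
    for v in range(last - 1, -1, -1):
        totals = [q + rest[e] for _, e, _, q in out.get(v, [])
                  if rest[e] is not None]
        rest[v] = max(totals) if totals else None
    if rest[0] is None:
        return
    tie = count()  # tie-breaker so the heap never compares label tuples
    heap = [(-rest[0], next(tie), 0, 0, ())]
    while heap:
        _, _, v, so_far, labels = heapq.heappop(heap)
        if v == last:
            yield so_far, list(labels)
            continue
        for _, end, label, q in out.get(v, []):
            extended = labels + (label,)
            if rest[end] is None or not prefix_ok(extended):
                continue  # dead end or pruned prefix
            bound = so_far + q + rest[end]
            heapq.heappush(heap, (-bound, next(tie), end, so_far + q, extended))

ARCS = [(0, 1, "r", -4), (1, 2, "n", -5), (0, 2, "m", -3), (2, 3, "a", -1)]
ALLOWED_TRIGRAMS = {"man", "mat"}  # toy table, for illustration only

def trigram_ok(labels):
    return len(labels) < 3 or "".join(labels[-3:]) in ALLOWED_TRIGRAMS

all_paths = list(paths_best_first(4, ARCS))
pruned_paths = list(paths_best_first(4, ARCS, trigram_ok))
```

Because the generator is lazy, a caller can stop after the first few paths; with the trigram test the "r", "n", "a" path is cut off as soon as its third label is added.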

Variants of recognition results and the method of resolving them.

The main result of applying additional information is a set of word variants. Since the translation of a grapheme into a letter is ambiguous, a set of letter variants is stored for each grapheme. As additional information is brought in (dictionary, line geometry, the fragments to the left and right), the number of variants gradually decreases. During the first processing pass, the number of variants is reduced by applying templates that take grammar and the dictionary into account. The number of suitable variants is narrowed further during the second processing pass.

Recognition of similar graphemes.

There are pairs of very similar graphemes: "r"-"г", "п"-"n", "6"-"б", and so on (in each pair one character is Cyrillic and the other is a Latin letter or a digit). To recognize them, an additional translation is performed. Each grapheme of a similar pair is translated into the letters corresponding to both graphemes of the pair, but some of the translations are marked as additional.

For example, the graphemes "г" and "r" have the following translations:

"г"→"_Г", main,

"г"→"_г", main,

"г"→"r", additional,

"r"→"_Г", additional,

"r"→"_г", additional,

"r"→"r", main.

If neither the dictionary nor the grammar gives grounds for an unambiguous choice of translation, the main translation is chosen.
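The translation table for this pair and the fall-back to main translations can be sketched as below. The data layout, the function name `choose_translations` and the `accepted` parameter (the letters that context allows at this position) are assumptions; the leading underscore used in the notation above is omitted from the letters.

```python
# Grapheme -> list of (letter, kind) translations, following the example.
TRANSLATIONS = {
    "г": [("Г", "main"), ("г", "main"), ("r", "additional")],
    "r": [("Г", "additional"), ("г", "additional"), ("r", "main")],
}

def choose_translations(grapheme, accepted=None):
    """Return the letters to keep for a grapheme.

    If the context (dictionary or grammar) accepts exactly one of the
    possible letters, that letter is chosen; otherwise only the main
    translations are kept.
    """
    options = TRANSLATIONS[grapheme]
    if accepted is not None:
        allowed = [letter for letter, _ in options if letter in accepted]
        if len(allowed) == 1:
            return allowed
    return [letter for letter, kind in options if kind == "main"]

no_context = choose_translations("r")
cyrillic_context = choose_translations("r", accepted={"г", "о"})
```

Without context the Latin grapheme "r" stays "r"; when the context accepts only the Cyrillic "г", the additional translation is used instead.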

Generalized graphemes.

To recognize characters that have split into parts, paired substitutions in the linear division graph are used. The generalized graphemes "||" and "|||" are introduced for two- and three-element graphemes. These graphemes are translated into all two- or three-element letters, but all of these translations are treated as additional. The final choice of translations is made using additional information.

Assignment of recognition quality estimates when additional information is used.

Integral and differential estimates are used. The integral estimate consists of the base quality estimate along the chain of arcs of the linear division graph, the template quality, and the additional quality. The differential estimate is used in pairwise comparison.

Base path quality estimate.

The base path quality estimate is defined as the sum of the grapheme recognition quality estimates over all graphemes along the chain of arcs.

Template quality.

Template quality consists of two components. One component evaluates how well the variant matches the grammar of the template. The other evaluates whether the word is present in the dictionary.

Additional quality.

Additional quality is assigned according to an additional list of rules. The main sources of additional corrections are geometric parameters that fit poorly with the grapheme or with the fragment to the left of the word being processed, lowercase versus uppercase letters, and agreement with the language of the fragments to the left and right.
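A minimal Python sketch of how the three parts of the integral estimate could be put together. The description only says that the integral estimate "consists of" these parts and that the base estimate is a sum; combining the parts by plain addition, as well as the class and field names, is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    grapheme_scores: List[float]      # recognition estimates along the chain of arcs
    template_grammar: float = 0.0     # match with the grammar of the template
    template_dictionary: float = 0.0  # presence of the word in the dictionary
    additional: float = 0.0           # corrections from the additional rule list


def base_quality(v: Variant) -> float:
    # Stated in the text: the sum of grapheme estimates along the chain of arcs.
    return sum(v.grapheme_scores)


def template_quality(v: Variant) -> float:
    # Two components, per the text; adding them is an assumption.
    return v.template_grammar + v.template_dictionary


def integral_quality(v: Variant) -> float:
    # Assumption: the three parts are combined by plain addition.
    return base_quality(v) + template_quality(v) + v.additional
```

A negative `additional` value can model a penalty, for example for geometry that fits the grapheme poorly.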

BRIEF DESCRIPTION OF THE DRAWINGS.

Fig. 1 shows the main stages (steps) of recognizing characters, line fragments, and lines using additional information and comparison means.

Fig. 2 shows the stage of selecting the final variant of the recognized character string.

DESCRIPTION OF THE REFERENCE NUMERALS IN THE FIGURES.

1 - operation (stage) of building paths (chains) in the linear division graph from the set (group) of graphemes.

2 - operation (stage) of building variants of the character group according to the templates.

3 - operation (stage) of overall evaluation of the resulting variants.

4 - operation (stage) of differential sorting of the variants.

5 - operation (stage) of final resolution of translation ambiguity.

6 - operation (stage) of selecting the chain of best variants for the line.

7 - paths (chains) in the linear division graph.

8 - variants.

9 - sorted list of variants.

10 - rules of the document language.

11 - templates.

12 - comparison means.

13 - list of variants.

DESCRIPTION OF THE PRINCIPLE OF OPERATION.

The essence of the claimed method is illustrated in Figs. 1 and 2.

After the image presumably containing text has been split into fragments presumably containing characters, character recognition is performed for all variants of the split. Recognition yields one or more character variants for each character image (grapheme). The characters are then combined into groups that presumably form words.

The essence of the claimed method is that all possible words are considered, obtained as combinations of all possible variants of splitting the image and of recognizing the constituent character images. Groups of characters are analyzed, including together with one or more neighboring groups on one or both sides. Additional information of several kinds, drawn from several sources, is applied to the words sequentially, to the extent needed to recognize the word accurately.

An analysis is performed that includes at least the following stages.

The list of additional information and the order in which it is brought in are specified in advance. The list of additional information and the order of access are chosen from the following list.

3. Dictionary of complete words.

5. Rules determined by the standard data formats used.

6. Rules determined by location within the line and/or paragraph.

If using some of the listed kinds of additional information gives a sufficiently reliable result, the remaining information (next in the list) is not brought in.
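The sequential use of additional information with early stopping can be sketched in Python as follows. This is an illustration only: the description does not define "sufficiently reliable", so the sketch assumes it means that a single variant remains, and it assumes that a source rejecting every variant is ignored. The names `Source` and `narrow_variants` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A source of additional information, modelled as a filter over word variants.
Source = Callable[[List[str]], List[str]]


def narrow_variants(variants: List[str], sources: Sequence[Source]) -> Tuple[List[str], int]:
    """Apply sources in the predetermined order; return (variants, sources used)."""
    used = 0
    for source in sources:
        if len(variants) <= 1:  # assumed meaning of "sufficiently reliable"
            break               # later sources in the list are not consulted
        narrowed = source(variants)
        used += 1
        if narrowed:            # assumption: a source that rejects all is ignored
            variants = narrowed
    return variants, used
```

For example, with a format filter followed by a dictionary filter, the variants `["cat", "c4t", "eat"]` are reduced to `["cat"]` after two sources, and a third source in the list is never called.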

All possible variants are determined for splitting the image regions presumed to be lines of text into fragments presumed to be images of individual words, using reliably recognized spaces.

For each line fragment, a linear division graph is built that describes the variants of splitting the fragment into graphic elements corresponding to graphemes.

The resulting images of graphic elements are recognized using one or more classifiers, and each recognition variant of a graphic element (grapheme) is assigned an estimate.

The transition is made from grapheme recognition variants to alphabet characters.

The following procedure of at least three steps is performed.

First step. For every chain of the linear division graph leading from the start vertex to the end vertex, one or more chains of recognized characters are built, corresponding to the recognition variants of the graphic elements (graphemes) and to the variants of transition from recognized graphemes to alphabet characters. The resulting variants are ranked in decreasing order of the recognition quality estimate.
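The first step can be sketched in Python under simplifying assumptions. The graph is modelled as a dictionary of arcs between numbered division points, every arc is assumed to run left to right (from a smaller to a larger point number), each arc carries a list of (character, estimate) recognition variants, and the estimate of a chain is taken to be the sum of its arc estimates. The names `Arcs`, `paths` and `ranked_variants` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Arc = Tuple[int, int]
# arcs[(i, j)]: recognition variants (character, estimate) for the image
# between division points i and j; assumed i < j for every arc.
Arcs = Dict[Arc, List[Tuple[str, float]]]


def paths(arcs: Arcs, start: int, end: int) -> List[List[Arc]]:
    """Enumerate all chains of arcs leading from start to end."""
    if start == end:
        return [[]]
    result = []
    for (a, b) in sorted(arcs):
        if a == start and b <= end:
            for tail in paths(arcs, b, end):
                result.append([(a, b)] + tail)
    return result


def ranked_variants(arcs: Arcs, start: int, end: int) -> List[Tuple[str, float]]:
    """Build every character chain and rank by decreasing summed estimate."""
    variants = []
    for path in paths(arcs, start, end):
        # One variant per combination of recognition variants along the path.
        for combo in product(*(arcs[arc] for arc in path)):
            text = "".join(ch for ch, _ in combo)
            score = sum(s for _, s in combo)
            variants.append((text, score))
    variants.sort(key=lambda v: v[1], reverse=True)
    return variants
```

For a fragment that may be read either as one element "m" or as two elements "r" and "n", the sketch produces both chains and places the one with the higher estimate first.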

Second step. All resulting variants of the character group are processed using rules on the possible placement of lowercase and uppercase characters (letters). These rules fall into four kinds according to the following criteria:

- the variant chosen on the basis of the estimate of the transitions made from the recognized grapheme to characters, using the first kind of additional information.

If more than one character variant remains after recognition of a graphic element, the variants are processed by bringing in the subsequent kinds of additional information one after another, in the predetermined order, and/or, if necessary, by bringing in all kinds of additional information simultaneously. Each resulting variant is assigned a quality estimate. Character variants whose estimate is below a predetermined value are discarded. The remaining variants are sorted using pairwise comparison.
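Discarding by threshold and sorting by pairwise comparison can be sketched in Python with `functools.cmp_to_key`. The variant layout (text, estimate, in-dictionary flag) and the comparison rule `prefer_dictionary_then_score` are hypothetical; the description does not specify how two variants are compared.

```python
from functools import cmp_to_key
from typing import Callable, List, Tuple

# (text, quality estimate, whether the word was found in the dictionary)
Variant = Tuple[str, float, bool]


def prefer_dictionary_then_score(a: Variant, b: Variant) -> int:
    """Hypothetical pairwise rule: negative result means a is the better variant."""
    if a[2] != b[2]:
        return -1 if a[2] else 1  # a dictionary word beats a non-dictionary one
    return (b[1] > a[1]) - (b[1] < a[1])  # otherwise the higher estimate wins


def filter_and_sort(
    variants: List[Variant],
    threshold: float,
    compare: Callable[[Variant, Variant], int],
) -> List[Variant]:
    # Variants with an estimate below the predetermined value are discarded.
    kept = [v for v in variants if v[1] >= threshold]
    # The rest are ordered by pairwise comparison, best first.
    return sorted(kept, key=cmp_to_key(compare))
```

A pairwise comparator, unlike a single numeric key, lets the ordering depend on how two particular variants differ, which is one plausible reading of the "differential" estimate mentioned earlier.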

Third step. Additional correction is made for spaces that were recognized erroneously in the previous steps:

- attaching elements that were erroneously separated in the previous steps,

- separating elements that were erroneously attached in the previous steps.

The rules determined by the features of the document language may include phonetic and/or lexical and/or semantic rules, among others.

During re-evaluation and pairwise sorting, several variants of the words to the right are generated for the rightmost word, corresponding to different hypotheses (for example, about the line height); for each word, evaluation and sorting are carried out with several right-hand contexts, and several of the best variants are retained and then used as additional information for the next word in order (by position).

If necessary, new rules and constraints are added and/or existing ones are edited.

The means for adding new rules and constraints may include introducing rules for data types; data types are divided into simple and composite, and composite types are formed as a combination of two or more simple types or of any combination of simple and composite types.

A data type is specified by at least the following characteristics:

- a list of characters permitted in words, and/or

- an additional rule restricting the list of characters, and/or

- a list of punctuators permitted for use, and/or

- grammar rules for frequently occurring words or word fragments.
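The characteristics listed above could be represented roughly as follows. This Python sketch is an assumption-laden illustration: it treats a composite type as a concatenation of non-empty parts, which is only one possible reading of "combination", and it omits grammar rules for frequent words. The class names `SimpleType` and `CompositeType` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Union


@dataclass
class SimpleType:
    allowed_chars: Set[str]                                   # permitted characters
    allowed_punctuators: Set[str] = field(default_factory=set)
    extra_rule: Optional[Callable[[str], bool]] = None        # additional restriction

    def accepts(self, word: str) -> bool:
        ok = all(c in self.allowed_chars or c in self.allowed_punctuators for c in word)
        return ok and (self.extra_rule is None or self.extra_rule(word))


@dataclass
class CompositeType:
    # Assumption: a composite type is a concatenation of its parts, in order.
    parts: List[Union["SimpleType", "CompositeType"]]

    def accepts(self, word: str) -> bool:
        return self._match(word, 0)

    def _match(self, word: str, index: int) -> bool:
        if index == len(self.parts):
            return word == ""
        part = self.parts[index]
        # Try every non-empty prefix for this part, then match the remainder.
        return any(
            part.accepts(word[:i]) and self._match(word[i:], index + 1)
            for i in range(1, len(word) + 1)
        )
```

For instance, a "word + number" type built from a letters type followed by a digits type accepts "room12" but rejects "12room" and "room".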

Constraints include the use of one or more of the following template types, among others:

- bilingual word,

- bilingual word with digits,

- dictionary identifier,

- abbreviation,

- number,

- Roman numeral,

- number with a suffix (ordinal number),

- number with a prefix,

- word made up of punctuators,

- word + number,

- word with a number inside,

- word with brackets,

- telephone number,

- URL template,

- file name together with full location information,

- regular expression template,

- auxiliary template.

Below is a more detailed description of some of the listed kinds of templates whose meaning is not obvious from the name.

Regular expression template. A way of describing, by means of regular expressions, the words that may occur in the text.

A regular expression is a composite formalized data format. It consists of data of simple types.

A word is considered preferred if it matches the description given by the regular expression.

Data of a simple type is a set of words collected in a special dictionary. The special dictionary, which is limited in size, indicates which words are more likely in the text and which are unlikely. A word is considered preferred if it is in the dictionary.

Regular expressions are a way of describing "preferred" words formally. For example, consider the following regular expression:

("кр"|"мон")"ах"

This regular expression means that, for a word to be considered preferred, it must begin with "кр" or "мон", followed by "ах". That is, only two words are preferred: "крах" ("crash") and "монах" ("monk").

(99999)|(999999)

This regular expression means that a number five or six digits long is considered preferred.
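For readers more familiar with conventional regular expression engines, the two examples above can be approximated with Python's `re` module. The mapping is an assumption: the quoted literals of the description are written as plain literals, and each `9` is read as "any digit" (`\d`), which the description implies but does not state. The function name `is_preferred` is hypothetical.

```python
import re

# ("кр"|"мон")"ах": the word starts with "кр" or "мон", followed by "ах".
KR_MON_AX = re.compile(r"(кр|мон)ах")

# (99999)|(999999): a number five or six digits long (9 read as any digit).
FIVE_OR_SIX_DIGITS = re.compile(r"\d{5}|\d{6}")


def is_preferred(word: str, pattern: re.Pattern) -> bool:
    """A word is preferred if the whole word matches the expression."""
    return pattern.fullmatch(word) is not None
```

`fullmatch` is used rather than `search` because the description speaks of the word as a whole matching the expression, not of a substring.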

Other examples.

s?t matches "sat" and "set".

s*d matches "sad" and "started".

w[io]n matches "win" and "won".

[r-t]ight matches "right" and "sight".

m[!a]st matches "mist" and "most", but not "mast".

t[!a-m]ck matches "tock" and "tuck", but not "tack" or "tick".

fe{2}d matches "feed", but not "fed".

fe{1,}d matches "fed" and "feed".

10{1,3} matches "10", "100", and "1000".
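The notation in these examples differs from conventional regular expressions in a few places: `?` stands for any single character, `*` for any run of characters, and `[!...]` for a negated character set. A minimal Python sketch of a converter to the `re` syntax follows. It is an illustration that covers only the constructs appearing in the examples above; the function names are hypothetical, and other characters are passed through unescaped.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Convert the wildcard notation used in the examples to a Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "?":
            out.append(".")   # any single character
        elif ch == "*":
            out.append(".*")  # any run of characters
        elif ch == "[" and pattern[i + 1 : i + 2] == "!":
            out.append("[^")  # negated character set
            i += 1            # skip the "!"
        else:
            out.append(ch)    # sets, ranges and {n,m} counts carry over as-is
        i += 1
    return re.compile("".join(out))


def matches(pattern: str, word: str) -> bool:
    return wildcard_to_regex(pattern).fullmatch(word) is not None
```

For example, `matches("t[!a-m]ck", "tock")` holds while `matches("t[!a-m]ck", "tick")` does not, in line with the list above.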

Auxiliary template. A template used when no other template fits. For example, if the word "Юг12Хъ" occurs in the middle of a Russian text, none of the contextual templates can identify it. In this case the auxiliary template is used.
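The fallback behaviour can be sketched in Python as a chain of templates tried in order, with the auxiliary template as the last resort. The two templates shown and their deliberately loose regular expressions are hypothetical stand-ins for the much richer template set of the description.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified stand-ins for two of the listed template types.
TEMPLATES: List[Tuple[str, re.Pattern]] = [
    ("number", re.compile(r"\d+")),
    ("roman numeral", re.compile(r"[IVXLCDM]+")),  # loose: does not validate order
]


def classify(word: str) -> str:
    """Return the name of the first template that fits, else "auxiliary"."""
    for name, pattern in TEMPLATES:
        if pattern.fullmatch(word):
            return name
    return "auxiliary"  # no other template fits the word
```

Here `classify("Юг12Хъ")` returns "auxiliary", since neither of the two templates accepts the word.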

Claims

1. A method for recognizing text information from a graphic file, comprising obtaining the graphic file from a scanning device or otherwise, segmenting the image, and recognizing text characters, characterized in that: an order of access to additional information is specified in advance, the additional information including the following kinds: information about the points at which a line is divided into characters, and/or the recognition quality of a graphic element, and/or a dictionary, and/or a dictionary of possible word parts, and/or rules determined by the standard data templates or regular expressions used, and/or rules determined by the location of the word within the line and/or paragraph, and/or rules determined by the features of the document language, and/or rules determined by the document type, and/or additional rules for handling rare cases; a quality estimate is assigned in advance to each kind of additional information; various variants are built for splitting the image of the selected lines into fragments presumably containing images of individual words, using reliably recognized spaces; for each line fragment, a linear division graph is built that describes the variants of dividing the fragment into graphic elements presumably containing character images; the images of the graphic elements are recognized using one or more classifiers, and each recognition variant of a graphic element is assigned an estimate; a transition is made from grapheme recognition variants to alphabet character variants; and at least the following steps are performed: first step: for each chain connecting the start and end vertices, chains are built corresponding to all recognition variants of the graphemes and all variants of transition from recognized graphemes to alphabet characters, and the resulting variants are ranked in decreasing order of the recognition quality estimate; second step: all resulting variants of the character group are processed using information on the placement of uppercase and lowercase letters; if more than one character variant remains after recognition of a graphic element, the variants are processed by bringing in the subsequent kinds of additional information one after another, in the predetermined order, and/or, if necessary, by bringing in all of these kinds of additional information simultaneously; each resulting variant is assigned a quality estimate; character variants whose estimate is below a predetermined value are discarded; and the resulting variants are sorted using pairwise comparison; third step: additional correction is made for spaces that were recognized erroneously in the previous steps, namely attaching elements erroneously separated in the previous steps and separating elements erroneously attached in the previous steps.

2. The method according to claim 1, characterized in that the rules determined by the features of the document language include phonetic and/or lexical and/or semantic rules, among others.

3. The method according to claim 1, characterized in that, in the second step, the information on the possible placement of uppercase and lowercase letters includes at least four kinds according to the following criteria: all characters are uppercase; all characters are lowercase; the first character is uppercase and the rest are lowercase; the variant chosen on the basis of the estimate of the transitions made from the recognized grapheme to characters, using the first kind of additional information.

4. The method according to claim 1, characterized in that a dictionary of possible word fragments that exist in a natural language is used.

5. The method according to claim 4, characterized in that each combination of possible fragments of words is provided with an estimate of the probability of use in the text.

6. The method according to claim 4, characterized in that, to evaluate a word, templates are used that differ in the composition and types of their constituent characters: a bilingual word, and/or a bilingual word with digits, and/or a dictionary identifier, and/or an abbreviation, and/or a number, and/or a Roman numeral, and/or a number with a suffix (ordinal number), and/or a number with a prefix, and/or a word made up of punctuators, and/or a word + number, and/or a word with a number inside, and/or a word with brackets, and/or a telephone number, and/or a URL template, and/or a file name together with full location information, and/or a regular expression template, and/or an auxiliary template.

7. The method according to claim 1, characterized in that it contains means for adding new rules and restrictions, including the introduction of rules for data types, which are divided into simple and composite.

8. The method according to claim 7, characterized in that composite data types are formed as a combination of at least two simple data types or of any combination of simple and composite data types.

9. The method according to claim 7, in which the data type is specified by at least the following characteristics: a list of characters permitted in words, and/or an additional rule restricting the list of characters, and/or a list of punctuators permitted for use, and/or grammar rules for frequently occurring words or word fragments.