JPH09138840A - Character recognition device

Info

Publication number: JPH09138840A
Application number: JP7321180A
Authority: JP
Inventors: Sayori Shimohata; さより下畑
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1995-11-15
Filing date: 1995-11-15
Publication date: 1997-05-27

Abstract

PROBLEM TO BE SOLVED: To correct and complement parts of a recognition result that were lost through erroneous recognition or the like, by referring to an index table, comparing the appearance probabilities of the plural candidate characters in a character recognition result, and ranking them by certainty. SOLUTION: An index table generation part 15 of a processor 14 reads text input from an input part 3, expresses numerically the kinds of characters that may follow a given character (or character string), and generates a tree-structured index table 8. A knowledge processing part 4 refers to the index table 8 and judges the validity of the candidate character strings that make up the character recognition result. The index table 8 stores information including the appearance probability that a preselected character appears next after a given character. The knowledge processing part 4 refers to the index table 8, compares the appearance probabilities of the plural candidate characters stored for each character in the character recognition result, and ranks the candidate characters by certainty.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a character recognition device suited to knowledge processing and the like for complementing correct characters that have been lost through erroneous segmentation or erroneous recognition in natural language processing devices such as morphological analysis devices, character recognition devices, and speech recognition devices.

[0002]

[Prior art] When documents written in natural language are input to an information processing device and turned into a database, characters are read optically and subjected to recognition processing. During this character recognition processing, the recognition result must be evaluated to eliminate misreadings. The following technique has been developed to perform such post-processing automatically (Information Processing Society of Japan, Research Report Vol. 95, No. 68 (95-NL-107)). It describes a method for detecting and correcting misread characters in the character recognition result produced by an OCR (optical character reader). From a list of words that can appear in the character string to be recognized and information on whether words can be connected to one another, the method creates a character index that records, for each character, the ID (identification code) of each word in which the character appears, the position of the character within the word, and the length of the word. It then matches the recognition result, which carries multiple possibilities (candidate characters), against the character index, extracts word candidates, judges the validity of the candidate characters as a sequence of words, and determines the first candidate.

[0003]

[Problems to be solved by the invention] The conventional character recognition device described above still has the following problems. Because the above method performs recognition word by word, it is very effective for documents in which only a limited set of character strings appears, such as address strings. However, adequate processing cannot be expected when word boundaries are unclear or when a candidate character string is not in the word dictionary. Furthermore, although the word dictionary used for recognition is prepared in advance, it is not easy to provide functions that let the user freely control how the dictionary is applied, such as adding new candidate words or changing the priority of candidate words depending on the document.

[0004]

[Means for solving the problems] The present invention adopts the following structures to solve the above problems. <Structure 1> The device comprises an input unit for inputting a character recognition result; an index table storing information that includes the appearance probability that a preselected character appears next after a given character; and a knowledge processing unit that, for the plurality of candidate characters listed in the character recognition result for each character, refers to the index table, compares the appearance probabilities, and ranks the candidate characters by likelihood.

[0005] <Explanation> The character recognition result is input from, for example, an optical character recognition device. During character recognition processing, as many candidate characters as possible are listed as the recognition result for each character. If the appearance probability that one character appears next after another is known, the candidates can be ranked by likelihood according to their relationship with the immediately preceding character. This makes it possible to correct and complement parts of the recognition result lost through erroneous recognition or erroneous segmentation. If the index table contains character sequences that have a high probability of appearing in the text, automatic post-processing of the character recognition result becomes more reliable. The appearance probability need not be a number that directly expresses the probability; any information that indicates indirectly, by means of a number, the likelihood that a character appears after a given character will do. The term "sequence of characters" is used because appearance probabilities are obtained uniformly, regardless of whether a run of characters forms a word.
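
As a rough illustration of ranking candidates by their relationship with the preceding character, the sketch below orders candidates using a looked-up appearance probability. The table `appearance_prob`, its values, and the function `rank_candidates` are hypothetical names and numbers chosen for illustration; they do not come from the patent.

```python
# Hypothetical table: (previous character, next character) -> appearance probability.
# The values are made up for illustration.
appearance_prob = {
    ("オ", "ン"): 0.8,
    ("オ", "ソ"): 0.1,
}

def rank_candidates(prev_char, candidates):
    """Return candidates ordered from most to least likely to follow prev_char."""
    def score(candidate):
        # Pairs missing from the table are treated as probability 0 (an assumption).
        return appearance_prob.get((prev_char, candidate), 0.0)
    return sorted(candidates, key=score, reverse=True)

ranked = rank_candidates("オ", ["ソ", "ン", "シ"])
print(ranked)
```

Any numeric stand-in for the probability could be used as the sort key, since only the relative order of the scores matters here.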

[0006] <Structure 2> The index table is constructed by extracting the character sequences contained in a text of the same kind as the text to be recognized, and quantifying the appearance probability that a character contained in that text appears next after another character.

[0007] <Explanation> A text of the same kind as the text to be recognized is highly likely to use similar expressions and words. Therefore, creating an index table that exploits the character-ordering properties of such a text and performing knowledge processing with it improves the accuracy of character recognition. Because this method operates on character sequences, it can be used without regard to word boundaries or to whether a word in use is unknown.

[0008] <Structure 3> The index table is a tree-structured table of the character sequences contained in the text, in which common characters are merged into trunk nodes and the different characters that follow them are placed in separate branch nodes, so that following the nodes yields the consecutive character strings that may follow a given character.

[0009] <Explanation> When many kinds of character sequences are to be quantified and represented in relation to other characters, representing them as a tree allows the search for each character to be carried out quickly. All characters used in the text appear on the trunk; sequences that share the same preceding character share a trunk and branch out progressively where the following characters differ. If there is no following character, the branch extends no further.
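
The idea of following nodes to list the character strings that may come after a given character can be sketched with nested dictionaries. The `tree` literal and the function `continuations` are assumptions made for this illustration, and the tree holds only a small hand-written subset of branches.

```python
# Hypothetical nested-dict tree: each key is a character, each value its branches.
tree = {"オ": {"ン": {"ラ": {}}}, "ン": {"ラ": {}}, "ラ": {}}

def continuations(node, prefix=""):
    """Yield every character string reachable by following branches from node."""
    for ch, child in node.items():
        yield prefix + ch
        # A node with no children ends the branch, so the recursion stops there.
        yield from continuations(child, prefix + ch)

after_o = list(continuations(tree["オ"]))
print(after_o)
```

Each lookup step is a single dictionary access per character, which is one way to get the fast per-character search that the tree structure is meant to provide.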

[0010]

[Embodiments of the invention] Embodiments of the present invention are described below using specific examples. <Specific Example 1> FIG. 1 is a functional block diagram of the character recognition device of the present invention. The device in the figure comprises an input unit 3 that accepts a character recognition result 1, a knowledge processing unit 4, and a storage device 5. The character recognition result 1, read and processed by an OCR (optical character reader) or the like (not shown), is input to the device through the input unit 3. The knowledge processing unit 4 ranks the candidate characters in the character recognition result, as described later, and produces an output 9. This ranking refers to the index table 8. As described later, the index table is generated in advance from a text 6 of the same kind as the text to be recognized and is stored in the storage device 5.

[0011] FIG. 2 is a block diagram showing a concrete device configuration of the invention. The illustrated system has an input/output device 11, a processing device 14, and a storage device 5. The input/output device 11 handles text input, display of results, and so on, and has an input unit 3 and an output unit 13. The input unit 3 inputs the text used to create the index table and the character recognition result. It may be, for example, a keyboard, or a device that accesses a text file stored in a computer's storage device. The output unit 13 displays processing results and the like, and consists of, for example, a display or a printer.

[0012] The storage device 5 stores texts, the processing results of each stage, and so on. It contains an input file 18 that stores the input text, a recognition result file 19 that temporarily stores the character recognition result, the index table 8, and an output file 21 that stores the result of knowledge processing. The processing device 14 has a general configuration including an arithmetic unit, memory, and a control unit and, following the processing procedure described later, creates the index table 8 and performs knowledge processing on the character recognition result. The processing device 14 has an index table creation unit 15 and a knowledge processing unit 4. The index table creation unit 15 creates the index table 8 from the input text.

[0013] The knowledge processing unit 4 refers to the index table 8 and judges the validity of the candidate character strings that make up the character recognition result. The index table 8 expresses numerically the kinds of characters that may follow a given character (or character string). In addition to the following characters, their appearance counts and appearance probabilities may be recorded. The index table creation unit 15 reads the text and creates a tree-structured index. The input text may be written in any form: sentences, clauses, a list of phrases, the output of morphological analysis, and so on. The unit of a character string is also arbitrary; the text may be split by character count or at special characters, or the whole text may be treated as a single character string. As for content, however, when knowledge processing is applied to a character recognition result, the quality of the index improves the more the text contains the same words and expressions as the processing target, so the input text should preferably consist of documents or technical terms from the same field.

[0014] FIG. 3 shows an example of a tree-structured index. The index is a tree of arbitrary depth with an arbitrary character as its top node, built by merging common leading parts into single nodes; each node records a following character. Consequently, the nodes immediately below the top node record every kind of character that appeared in the input text, and a path traced from the top node through the nodes represents a consecutive character string in the text. For example, when the text 「オンライン」 ("online") is input, the index is as shown in FIG. 3. Each node records a character together with its appearance count. The "＊" at the top node denotes an arbitrary character, that is, every character in the text. The appearance count of "＊" is the total number of characters in the text, which is 5 in this example. "＠" indicates that the node is the end of a consecutive character string pattern.

[0015] After an arbitrary character, that is, at the first node level, are the characters 「オ」, 「ン」, 「ラ」, and 「イ」; 「オ」, 「ラ」, and 「イ」 each have an appearance count of 1, and 「ン」 has an appearance count of 2. This means that 「オ」, 「ラ」, and 「イ」 each appeared once in the text and 「ン」 appeared twice. At the second level, following 「オ」, there is a single 「ン」 with an appearance count of 1. This means that 「ン」 appeared once after 「オ」, that is, 「オン」 appeared consecutively once. 「オンラ」, 「オンライ」, and 「オンライン」 likewise each appear consecutively once. The appearance probability is given by "appearance count of the following character ÷ appearance count of the preceding character"; in other words, the child node's appearance count is divided by the parent node's appearance count.
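
The counts and the child-count ÷ parent-count rule described above can be reproduced with a small sketch. It assumes the index is built by inserting every suffix of the text (which is consistent with every character of the text appearing directly below the top node); the names `Node`, `build_index`, and `appearance_probability` are hypothetical, and the "＠" end marker is not modeled.

```python
class Node:
    """One tree node: an appearance count plus child nodes keyed by character."""
    def __init__(self):
        self.count = 0
        self.children = {}

def build_index(text, root=None):
    """Insert every suffix of text, merging shared leading characters."""
    if root is None:
        root = Node()          # plays the role of the "＊" top node
    for start in range(len(text)):
        root.count += 1        # the top node counts every character of the text
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root

def appearance_probability(parent, ch):
    """Child node's appearance count divided by the parent node's count."""
    return parent.children[ch].count / parent.count

root = build_index("オンライン")
print(root.count)                                          # total characters in the text
print(appearance_probability(root, "ン"))                  # "ン" after any character
print(appearance_probability(root.children["オ"], "ン"))   # "ン" after "オ"
```

Passing an existing `root` back into `build_index` adds further text to the same tree, which is one possible way to accumulate counts over several texts.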

[0016] The probability that 「ン」 appears after the arbitrary character ＊ is 2/5, or 0.4, whereas the probability that 「ン」 appears after 「オ」, 「ライ」, or 「イ」 is 1 in the example in the figure, because the parent node and the child node each have a count of 1. That is, while the appearance probability of 「ン」 in the text as a whole is 0.4, 「ン」 always appears after 「オ」, 「ライ」, and 「イ」. The index table 8 records the information of this tree-structured index in table form.

[0017] FIG. 4 illustrates the contents of the index table when 「オンライン」 is input. The table records the node level D1 of each node and, as per-character information, the node number D2, the character D6, the appearance count D4, and the appearance probability D5. For example, 「オ」 at node level 1 has node number "1", appearance count "1", and appearance probability "0.2". Even when new text is input, the index table can be updated simply by incrementing the appearance count of each character in the node sequences containing the characters that appear in the text and recalculating the appearance probabilities.
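
The table form and the count-then-recalculate update can be sketched as follows. This is a minimal illustration that keys each node by the character string leading to it (the empty string standing in for the top node) rather than by node number, since the numbering scheme of FIG. 4 is not fully specified in the text. `add_text` and `table_rows` are hypothetical names, and the second text 「ライン」 is an invented example.

```python
from collections import Counter

def add_text(counts, text):
    """Count every contiguous character string in text; "" stands for the top node."""
    for start in range(len(text)):
        counts[""] += 1
        for end in range(start + 1, len(text) + 1):
            counts[text[start:end]] += 1

def table_rows(counts):
    """Yield (node level, character, appearance count, appearance probability)."""
    for path in sorted(counts, key=len):
        if path:  # skip the top node itself
            parent = path[:-1]
            yield len(path), path[-1], counts[path], counts[path] / counts[parent]

counts = Counter()
add_text(counts, "オンライン")
first = {(lvl, ch): (n, p) for lvl, ch, n, p in table_rows(counts) if lvl == 1}

add_text(counts, "ライン")  # new text: counts go up, probabilities are recomputed
updated = {(lvl, ch): (n, p) for lvl, ch, n, p in table_rows(counts) if lvl == 1}
print(first[(1, "オ")], updated[(1, "オ")])
```

Because probabilities are derived from counts on demand, adding text only requires incrementing counts; nothing else has to be rebuilt.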

[0018] Next, knowledge processing using the index table is described. FIG. 5 is an operation flowchart of the knowledge processing of the invention. Knowledge processing first reads the recognition result file 19 input from the input unit 3 (step S1). The character recognition result is assumed to have one or more candidate characters for each character position i (i = 1, 2, ..., n). For example, a recognition result of 5 characters would include roughly 1 to 3 candidate characters for each character. A state set S[i] is associated with each character position i. S[i] represents the positions of the nodes matched by the candidate characters of the (i-1)-th character; its initial state is 0. When comparison and collation is performed for character position i, the child nodes following the nodes recorded in S[i] are the processing targets (steps S2 to S5). For each candidate character j (j = 1, 2, ..., m) at each character position i (i = 1, 2, ..., n), comparison and collation against the characters of the target nodes in the index table is performed (step S6), and point addition is performed for matching candidate characters (step S8). Among the candidate characters at each character position i, the one with the most points has the highest confidence as the recognition result (steps S6 to S8). Finally, the comparison and collation results for each character position are sorted into point order (step S9), and the processing result is written to the output file (step S10).

[0019] FIG. 6 is an operation flowchart of the comparison and collation processing shown in step S6 of FIG. 5. Here, every candidate character j at character position i is compared and collated against the target nodes of the index table (steps S1, S2). The target nodes are given by the node state set S[i]. If the candidate characters of the (i-1)-th character matched no node, the state set S[i] is 0, so comparison and collation is performed on the child nodes of node number 0, that is, on the characters of the nodes at node level 1. If there was a match, S[i] contains 0 and the node numbers of the nodes that the candidate characters matched, so comparison and collation is performed on the nodes at node level 1 and on the child nodes of the nodes matched by the candidate characters of the (i-1)-th character. If candidate character j at character position i matches a target node in the index table, point addition is performed for j (step S4), and the position information of that node is added to S[i+1] (step S5). If j is the last candidate character, the processing ends. Otherwise, j is incremented by one (step S7) and the same processing is repeated for the next candidate character (steps S2 to S7).
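
One way to read the state-set procedure is sketched below. It is an interpretation, not the patent's own implementation: nodes are identified by the character string leading to them instead of by node number, the top node (state 0) is always kept in the state set, and a candidate that matches several target nodes receives the sum of their appearance probabilities. `build_counts` and `score_candidates` are hypothetical names, and the candidates for the second position are invented.

```python
from collections import Counter

def build_counts(text):
    """Appearance count of every contiguous string; "" stands for the top node."""
    counts = Counter()
    for start in range(len(text)):
        counts[""] += 1
        for end in range(start + 1, len(text) + 1):
            counts[text[start:end]] += 1
    return counts

def score_candidates(counts, positions):
    """positions: one list of candidate characters per character position."""
    state = {""}                       # S[1]: only the top node
    results = []
    for candidates in positions:
        points = {c: 0.0 for c in candidates}
        next_state = {""}              # the top node is always a target
        for c in candidates:
            for path in state:
                child = path + c
                if child in counts:    # candidate matches a child of a state node
                    points[c] += counts[child] / counts[path]
                    next_state.add(child)
        results.append(points)
        state = next_state             # becomes S[i+1]
    return results

counts = build_counts("オンライン")
lattice = [["才", "オ", "千"], ["ソ", "ン"]]  # second position is hypothetical
scores = score_candidates(counts, lattice)
print(scores)
```

In this run 「オ」 is the only first-position candidate that matches, so the second position is checked both against the top-level nodes and against the children of 「オ」.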

[0020] For the point addition in step S4, possible schemes include adding a fixed score to the matching candidate character, or computing and adding points according to the appearance count of the matched node or the depth of its node level. Here, the appearance probability is used as the score: the appearance probability of the node matched by the candidate character is added as points, and the candidates are ordered from the highest score down. When scores are equal, the candidate character ranked higher in the original recognition result is placed higher.

[0021] FIG. 7 shows an example of a recognition result. The concrete flow of knowledge processing is now described using the index table of FIG. 4 and the example of FIG. 7. First, the recognition result is read in step S1 of FIG. 5, and the length of the input character string, "5", is set in n (step S2 of FIG. 5). Next, i is set to "1" and S[i] is set to 0 (steps S3 and S4 of FIG. 5). The number of candidate characters for the i-th (= 1st) character, "3", is set in m (step S5 of FIG. 5), and processing proceeds to comparison and collation. In the comparison and collation of step S6 of FIG. 5, j = 1 is set first (step S1 of FIG. 6). j indicates the position of the candidate currently being collated among the candidate characters of the i-th character. First, the index table is searched using the j-th (= 1st) candidate character of the i-th (= 1st) character, 「才」, as the key (step S2 of FIG. 6). At this point the state set S[i] is {0}, so the nodes following "0" in the index table, that is, the nodes with node numbers "1" to "4", are the search targets. None of these nodes matches 「才」 (step S3 of FIG. 6), so j = 1 + 1 = 2 is set (step S7 of FIG. 6) and the nodes "1" to "4" of the index table are searched using the j-th (= 2nd) candidate character 「オ」 as the key (step S2 of FIG. 6). Here, 「オ」 at node number "1" matches (step S3 within step S6). The appearance probability "0.2" is added as points to the candidate character 「オ」 (step S4 of FIG. 6), and the matched node number {1} is added to the state set S[i+1], that is, S[2] (step S5 of FIG. 6). Next, j = 2 + 1 = 3 is set (step S7 of FIG. 6), and the nodes "1" to "4" of the index table are searched using the j-th (= 3rd) candidate character 「千」 as the key. None of these nodes matches 「千」 (step S3 of FIG. 6). Since j = m at this point, the processing ends and control returns to step S7 of FIG. 5.

[0022] Next, i = 1 + 1 = 2 is set (step S8 of FIG. 5), and comparison and collation is performed for the candidate characters of the i-th (= 2nd) character. Since S[2] is {0, 1}, the nodes to be compared are the child nodes following "0" and "1", that is, the nodes with node numbers "1" to "4" and "11". The same processing is then carried out through 「ン」, the third candidate character of the fifth character.

[0023] FIG. 8 shows the result of the comparison and collation processing. In this way, the candidate characters have each been ranked by likelihood. Finally, the candidate characters are reordered from the most points down (step S9 of FIG. 5), and the result is output to the output file (step S10 of FIG. 5). FIG. 9 shows the contents of the output file after knowledge processing has finished. This result is used in post-processing of the character recognition result.

[0024] As described above, according to the invention, an index table describing the properties of the character-string sequences that appear in a user-specified text can be created, and character recognition processing can be performed based on that table. The index table takes arbitrary character strings as its unit and describes, in a single format, everything from patterns of one character following another to patterns of consecutive character strings following a given character. By changing the input text or adding character strings to be prioritized to the index table, an index table suited to the conditions can easily be created. Knowledge processing using this table ranks the likelihood of candidate characters by giving preference to candidates that match character sequences likely to appear consecutively. Therefore, even when a candidate character string does not exist in the index table as-is, character recognition processing can be performed by making partial use of the index table. Moreover, the application conditions can easily be changed to suit the recognition target, by switching the index table according to content or by weighting the scores of the character sequences of important words.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of the character recognition device of the present invention.

FIG. 2 is a block diagram showing a concrete device configuration.

FIG. 3 is an explanatory diagram of an example of a tree-structured index.

FIG. 4 is an explanatory diagram of an example of an index table.

FIG. 5 is an operation flowchart of knowledge processing.

FIG. 6 is an operation flowchart of comparison processing.

FIG. 7 is an explanatory diagram of an example of a recognition result.

FIG. 8 is an explanatory diagram of the points from comparison and collation processing.

FIG. 9 is an explanatory diagram of the contents of the output file after knowledge processing has finished.

[Explanation of symbols]

1 Character recognition result
3 Input unit
4 Knowledge processing unit
5 Storage device
8 Index table

Claims

[Claims]

1. A character recognition device comprising: an input unit for inputting a character recognition result; an index table storing information including an appearance probability that a preselected character appears next after a given character; and a knowledge processing unit that, for a plurality of candidate characters listed in the character recognition result for each character, refers to the index table, compares the appearance probabilities, and ranks the candidate characters by likelihood.

2. The character recognition device according to claim 1, wherein the index table is constructed by extracting the character sequences contained in a text of the same kind as the text to be recognized and quantifying the appearance probability that a character contained in that text appears next after another character.

3. The character recognition device according to claim 1 or 2, wherein the index table is a tree-structured table of the character sequences contained in a text, in which common characters are merged into trunk nodes and the different characters that follow them are placed in separate branch nodes, and which is configured so that following the nodes indicates the consecutive character strings that may follow a given character.