JPH09138836A - Correcting system of character recognizing result

Info

Publication number: JPH09138836A
Application number: JP7318627A
Authority: JP
Inventors: Hideyuki Isoyama; 秀幸磯山
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1995-11-13
Filing date: 1995-11-13
Publication date: 1997-05-27

Abstract

PROBLEM TO BE SOLVED: To make the correction rules needed for correcting character recognition results easy to extend and maintain, and to prevent erroneous corrections. SOLUTION: The candidate character strings obtained as the character recognition result for an input character string are arranged from first rank to n-th rank to form a candidate character matrix 201, and this matrix 201 is passed to a regular expression matching unit 106. The matching unit 106 has a correction rule file 109, kept outside the program, that stores various correction rules suited to various character string notations. It selects from the file 109 a correction rule that fits the input matrix 201 and picks characters from the matrix 201 to build the candidate character string 601 that best fits the selected rule. Then, in accordance with the selected correction rule, a character string replacing unit 107 replaces a character string within the candidate character string 601 with a more correct one, and a character replacing unit 108 replaces individual characters within the resulting candidate character string 602 with more correct ones.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to an improvement in a system for correcting character recognition results produced by an optical character reader (hereinafter "OCR") or the like.

[0002]

[Prior art] When the character recognition result for a character string written on a form is corrected, the approach differs greatly between character strings that form words, such as place names in addresses or personal names, and character strings that do not form words, such as address numbers or telephone numbers. For a string that forms words, the strings expected to be entered can be registered in a word dictionary in advance, so recognition errors can be corrected by matching the recognition result against the word dictionary. For a string that does not form words, such word matching is not possible. Instead, restrictions on character type and replacements of easily misrecognized characters are turned into an algorithm as correction rules and written directly into the program source code.

[0003]

[Problem to be solved by the invention] When recognition results for strings that do not form words are corrected in this way, it would be desirable to write correction rules that fit each of the many notations used for address numbers. In practice, however, writing such a wide variety of correction rules is difficult, so only simple rules can be built into the program source code.

[0004] Further, because the correction rules are built into the program's algorithm, adding or changing a correction rule requires changing the program itself. The system therefore cannot respond flexibly to new entry patterns or to changes in the entry format of the form.

[0005] Moreover, although techniques existed for replacing similar characters that are easily misrecognized, they did not test whether the character string as a whole is valid. Consequently, increasing the number of character types that can be corrected risks replacing even characters that were correct, and correction errors occur frequently.

[0006] Accordingly, an object of the present invention is to provide a method and a system for correcting character recognition results in which the correction rules needed for the correction are easy to extend and maintain, and which do not introduce correction errors.

[0007]

[Means for solving the problem] In the present invention, a correction rule file storing various correction rules suited to various kinds of character strings is provided outside the computer program that describes the algorithm for correcting character recognition results. When the character recognition result correction process runs according to the algorithm in the program, the process refers to the correction rule file, selects a correction rule that fits the character recognition result, and uses it. Consequently, even when the form layout or the entry method changes, correction rules can be added or changed without changing the program source itself, which makes functional extension and maintenance easy.

[0008] The correction rule file is preferably a text file so that rules can easily be added or changed with a text editor.

[0009] In addition, to reduce erroneous corrections and improve correction accuracy, the following arrangements can be added to the present invention.

[0010] (1) Each correction rule contains a regular expression pattern that describes, in a prescribed regular expression notation, the character strings it fits. When the correction process searches the file for a correction rule that fits the input character recognition result, it decides whether each rule fits by checking whether that rule's regular expression pattern matches the input result. This makes it easy to select a fitting correction rule.

[0011] (2) The character recognition result is supplied to the correction process in the form of a candidate character matrix, in which the first-rank through n-th-rank candidate character strings obtained by character recognition are arranged. When one or more candidate character strings matching the regular expression pattern of a particular correction rule can be picked out of this matrix, the correction process selects that rule as one that fits the character recognition result. The process then chooses, from among those candidate strings, the one composed of the highest-ranked candidate characters, and applies the correction based on the selected rule to it. Thus, even when none of the first-rank through n-th-rank candidate strings fits any correction rule as it stands, the candidate characters are recombined to build a candidate string that fits a particular rule and is closer to the correct answer, and the correction is applied to that string. Correction accuracy therefore improves.

[0012] (3) Each correction rule contains replacement rules to be applied to character strings or characters in a character recognition result that it fits. Based on the replacement rules in the selected correction rule, the correction process replaces a character string or character in the recognition result with another that is closer to the correct answer. Thus, even when the recognition result does not contain the correct character, replacement of strings or characters makes it more likely that the correct character is obtained.

[0013] (4) When the correction rule file contains more than one fitting correction rule, the correction process corrects the recognition result with each of them and outputs a plurality of corrected character strings. These are passed to an optimum candidate determination process, which determines the optimum candidate among them.

[0014]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings.

[0015] FIG. 1 is a block diagram showing one embodiment of the character recognition result correction system of the present invention.

[0016] This embodiment comprises a character segmentation unit 102, a character recognition unit 103, a character recognition result correction unit 104, and an optimum candidate determination unit 105, and is in practice implemented by a programmed computer.

[0017] The character segmentation unit 102 cuts an image 101 of a character string (a document line), input from an image scanner (not shown) or the like, into per-character images, and outputs each of them as a character pattern to the character recognition unit 103. The character recognition unit 103 compares each character pattern received from the character segmentation unit 102 with standard character patterns prepared in advance (not shown). For each input character pattern it generates a plurality of candidate characters together with an evaluation value for each candidate (a value expressing how close the input character pattern is to the candidate character pattern, that is, their degree of similarity), and outputs these as character recognition result information to the character recognition result correction unit 104.

[0018] The character recognition result correction unit 104 corrects the character recognition result information from the character recognition unit 103 based on correction rules prepared in advance (FIG. 4 shows an example). Its correction results are output to the optimum candidate determination unit 105. Based on those results, the optimum candidate determination unit 105 determines the character string to be used as the final reading result and outputs it to a display unit (not shown).

[0019] The correction rules used by the character recognition result correction unit 104 are written in a file 109 (hereinafter the "correction rule file") placed outside the program of the unit 104, and are loaded from the correction rule file 109 into the computer's main memory (not shown) when the program starts. When address numbers are to be recognized, for example, the correction rule file 109 stores a wide variety of correction rules, each suited to one of the many notations used for address numbers.

[0020] The correction rule file 109 is a text file so that rules can easily be added or changed with a text editor. Each correction rule in the file 109 contains two things: a description of the notation of the character string expected to be entered, written in a notation called a regular expression (described later; FIG. 5 shows an example), and the correction processing that is executed only when the notation of the character recognition result fits that regular expression. The correction rules are described in detail later.

[0021] Next, the processing of this embodiment is described in more detail using a concrete example.

[0022] For example, suppose a document line reading "２丁目１７―１" is recognized. The character segmentation unit 102 cuts it into the per-character images "２", "丁", "目", "１", "７", "―", and "１", and the character recognition unit 103 recognizes each character pattern and outputs a character recognition result 201 such as that shown in FIG. 2. This character recognition result 201 is called a candidate character matrix. In the figure, the horizontal direction gives the character positions that make up the string, and the vertical direction gives the rank of the candidate characters at each position (the rank by degree of similarity determined from the evaluation values; for example, the first through third ranks are kept), so that the candidate characters for each position of the input string are laid out as a matrix. This candidate character matrix 201 is input to the character recognition result correction unit 104.
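As a rough illustration, the candidate character matrix can be held as a list of columns, one per character position, each ordered by rank. In the Python sketch below, only the first two columns (ユ/ヌ/エ and Ｔ/了/ナ) come from the description. The remaining columns and the names `CANDIDATE_MATRIX` and `rank_string` are assumptions, since FIG. 2 is not reproduced here. The sixth column uses "一", the character named in the character replacement rule of paragraph [0036].

```python
# Candidate character matrix: one inner list per character position,
# candidates ordered from first rank to third rank.
# Columns 3 to 7 are invented for illustration only.
CANDIDATE_MATRIX = [
    ["ユ", "ヌ", "エ"],
    ["Ｔ", "了", "ナ"],
    ["目", "日", "月"],
    ["１", "ｌ", "ｉ"],
    ["ク", "７", "ワ"],
    ["一", "―", "ニ"],
    ["ノ", "１", "ｌ"],
]


def rank_string(matrix, rank):
    """Join the candidates of one rank (0 = first rank) into a string."""
    return "".join(column[rank] for column in matrix)


print(rank_string(CANDIDATE_MATRIX, 0))  # the first-rank reading
```

Storing columns rather than rows keeps all candidates for one character position together, which is the unit the matching step works on.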

[0023] FIG. 3 is a functional block diagram showing the detailed configuration of the character recognition result correction unit 104.

[0024] The character recognition result correction unit 104 comprises a regular expression matching unit 106, a character string replacing unit 107, and a character replacing unit 108.

[0025] First, the regular expression matching unit 106 selects the correction rules held in memory one at a time, in order, and checks whether each fits the candidate character matrix 201. If the selected rule does not fit the matrix 201, the unit selects the next rule and checks again. If no fitting rule is found after all rules have been checked, the correction processing ends. If a fitting rule is found, the unit requests the character string replacing unit 107 to perform the correction based on that rule, and also goes on to check the next rule to see whether any other rule fits.
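The control flow of this paragraph can be sketched as a loop that tries every rule and collects one corrected result per fitting rule. The function name `collect_corrections` and the callbacks `fits` and `apply_rule` are hypothetical stand-ins for the matching and replacement steps; the patent does not name such functions.

```python
def collect_corrections(matrix, rules, fits, apply_rule):
    """Check every rule in order and correct with each one that fits.

    An empty result means no rule fitted, i.e. the matrix is left for
    manual correction by an operator.
    """
    corrected = []
    for rule in rules:
        if fits(rule, matrix):
            corrected.append(apply_rule(rule, matrix))
    return corrected


# Toy stand-ins: a "rule" is just the set of first characters it accepts.
toy_rules = [{"ユ", "２"}, {"Ａ"}, {"ユ"}]
toy_matrix = [["ユ", "ヌ", "エ"]]
hits = collect_corrections(
    toy_matrix,
    toy_rules,
    fits=lambda rule, m: m[0][0] in rule,
    apply_rule=lambda rule, m: m[0][0],
)
print(len(hits))  # number of rules that fitted the toy matrix
```

The loop does not stop at the first fitting rule, matching the description that the check continues in case another rule also fits.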

[0026] FIG. 4 shows an example of a correction rule 400 that fits the candidate character matrix 201 of FIG. 2. As FIG. 4 shows, the correction rule 400 consists of a regular expression pattern 301, a character string replacement rule 302, and a character replacement rule 303. All correction rules have this structure. Of these, the regular expression pattern 301 is used to check whether the rule fits the candidate character matrix 201.

[0027] The regular expression pattern 301 expresses the pattern of the character strings it fits, using a notation called a regular expression. FIG. 5 shows the rules of this notation. For example, the symbol ^ denotes the start of the string; the symbol $ denotes the end of the string; a set of characters enclosed in [] gives the candidate characters acceptable at the corresponding character position; numbers enclosed in {} specify how many times the immediately preceding character may be repeated; and so on.

[0028] With these rules in mind, the regular expression pattern 301 of FIG. 4 reads as follows. The character set for the first character is [１-９ノユク], so the correction rule 400 can fit when the first character is a digit "１" to "９" or one of the katakana "ノ", "ユ", or "ク". The second character set is [０-９ロノユク], so the rule can fit when the second character is a digit "０" to "９" or one of the katakana "ロ", "ノ", "ユ", or "ク". However, {0,2} follows the second character set, so this set may repeat from zero to two times. That is, the second character may be absent, or the set may continue through the third character. The character sets for the remaining characters are interpreted in the same way.
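The reading of the first three character sets can be checked against plain strings with Python's `re` module. This is only a sketch: the rest of pattern 301 in FIG. 4 is not reproduced in the text, so the name `PREFIX_301` and the sample strings are assumptions, and full-width digits are used because that is how the description prints them.

```python
import re

# Only the first three character sets are quoted in the description;
# the remainder of pattern 301 (FIG. 4) is not reproduced here.
PREFIX_301 = re.compile(r"^[１-９ノユク][０-９ロノユク]{0,2}[丁Ｔ了]")

samples = ["ユＴ目", "１２丁目", "１２３丁目", "０丁目"]
results = {s: bool(PREFIX_301.match(s)) for s in samples}
print(results)
```

The first three samples fit with zero, one, and two repetitions of the second character set. The last does not fit, because "０" is not in the first character set.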

[0029] Whether such a regular expression pattern 301 fits the candidate character matrix 201 is checked as follows.

[0030] First, the first-rank candidate "ユ" for the first character position in the matrix 201 is compared, with top priority, against the first character set [１-９ノユク] of the pattern 301. Because "ユ" is in that set, the first position is judged able to fit, and the comparison with the lower-ranked candidates "ヌ" and "エ" is skipped. "ユ" is then fixed as the candidate character that the pattern 301 fits.

[0031] Next, the second character set [０-９ロノユク] of the pattern 301 is compared with the first-rank candidate "Ｔ" for the second position in the matrix 201. They do not match, so the second-rank candidate "了" is compared; that does not match either, so the third-rank candidate "ナ" is compared. None of the candidates is in the character set. However, the second character set may repeat zero to two times, so here the repetition count is taken to be zero, and the third character set of the pattern 301 is compared in the same way with the candidates for the second position in the matrix 201. The first-rank candidate "Ｔ" is in the third character set [丁、Ｔ、了], so the position is judged able to fit, and "Ｔ" is fixed as the fitting candidate character.

[0032] Comparing the subsequent characters in the same way shows that the regular expression pattern 301 can fit every character position in the matrix 201, and the fitting candidate character string is determined to be "ユＴ目１クーノ" 601, as shown in FIG. 6. This string 601 is passed to the character string replacing unit 107.

[0033] If, during this comparison, any character position in the candidate matrix 201 fails to match the corresponding character set in the regular expression pattern 301, the correction rule 400 is judged not to fit the candidate matrix 201, and the same comparison is repeated for the next correction rule. If no correction rule fits, the correction processing ends as described above, the candidate character matrix 201 is displayed as it is on the computer display, and the rest is left to manual correction by an operator.
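The comparison procedure of paragraphs [0030] to [0033] can be sketched as a small recursive matcher that walks the pattern elements and the matrix columns together, always trying the highest-ranked candidate first. This is one possible reading, not the patented implementation. The function `match_matrix`, the tuple encoding of the pattern, pattern elements 4 to 7, and matrix columns 3 to 7 are all assumptions. The backtracking step (falling back to the next pattern element when a repeat leads nowhere) is an addition that the text does not spell out.

```python
DIGITS = "０１２３４５６７８９"

# (allowed characters, minimum repeats, maximum repeats) per pattern element.
# Elements 4 to 7 are assumptions; only the first three are quoted in the text.
PATTERN = [
    ("１２３４５６７８９ノユク", 1, 1),
    (DIGITS + "ロノユク", 0, 2),
    ("丁Ｔ了", 1, 1),
    ("目日", 1, 1),
    (DIGITS + "ロノユク", 1, 2),
    ("一―", 1, 1),
    (DIGITS + "ロノユク", 1, 2),
]

# Columns ordered by rank; columns 3 to 7 are invented for illustration.
MATRIX = [
    ["ユ", "ヌ", "エ"],
    ["Ｔ", "了", "ナ"],
    ["目", "日", "月"],
    ["１", "ｌ", "ｉ"],
    ["ク", "７", "ワ"],
    ["一", "―", "ニ"],
    ["ノ", "１", "ｌ"],
]


def match_matrix(pattern, matrix):
    """Return a candidate string that fits the pattern, or None."""

    def walk(elem, used, col):
        if elem == len(pattern):
            # Pattern exhausted: succeed only if every column was consumed.
            return "" if col == len(matrix) else None
        allowed, lo, hi = pattern[elem]
        if used < hi and col < len(matrix):
            # Take the highest-ranked candidate that is in the character set.
            ch = next((c for c in matrix[col] if c in allowed), None)
            if ch is not None:
                rest = walk(elem, used + 1, col + 1)
                if rest is not None:
                    return ch + rest
        if used >= lo:
            # Enough repeats used: try the next element on the same column.
            return walk(elem + 1, 0, col)
        return None

    return walk(0, 0, 0)


print(match_matrix(PATTERN, MATRIX))
```

With these assumed inputs, the second pattern element is used zero times, as in paragraph [0031], and the matcher returns `None` when some column has no candidate that fits, which corresponds to moving on to the next rule.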

[0034] When a fitting correction rule 400 is found, as in the example above, the character string replacing unit 107 then replaces the input character string based on the character string replacement rule 302 in that rule. The replacement rule 302 is generally written in the form s/A/B/, which means that a string A, composed of characters given by character sets enclosed in [], is replaced with the string B. The illustrated replacement rule 302 therefore indicates that the strings "丁目", "丁日", "Ｔ目", "Ｔ日", "了目", and "了日" are all replaced with the string "丁目".

[0035] Accordingly, when the character string replacing unit 107 receives the string "ユＴ目１クーノ" 601, it replaces "Ｔ目" with "丁目" based on the replacement rule 302, changing the string 601 to "ユ丁目１クーノ" 602, as shown in FIG. 6. The changed string 602 is passed to the character replacing unit 108.
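A string replacement rule in the s/A/B/ form can be applied with `re.sub`. In this sketch, the rule text `s/[丁Ｔ了][目日]/丁目/` is inferred from the six strings listed in paragraph [0034], not copied from FIG. 4, and the helper `apply_string_rule` is a hypothetical name. The parser is deliberately naive and does not handle a "/" inside A or B.

```python
import re


def apply_string_rule(rule, text):
    """Apply a rule written as s/A/B/, where A is a regular expression."""
    parts = rule.split("/")
    if len(parts) != 4 or parts[0] != "s":
        raise ValueError("expected a rule of the form s/A/B/")
    _, pattern, replacement, _ = parts
    return re.sub(pattern, replacement, text)


# Rule text inferred from the description; the input uses "一" as the
# sixth character, following the character replacement rule in [0036].
RULE_302 = "s/[丁Ｔ了][目日]/丁目/"
print(apply_string_rule(RULE_302, "ユＴ目１ク一ノ"))
```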

[0036] The character replacing unit 108 replaces individual characters in the input string with the correct characters, based on the character replacement rule 303 in the correction rule 400. The character replacement rule 303 is generally written in the form y/abc/xyz/, which means that the character "a" is replaced with "x", "b" with "y", and "c" with "z". Accordingly, when the character replacing unit 108 receives the string "ユ丁目１クーノ" 602, it replaces "ユ" with "２", "ク" with "７", "一" with "―", and "ノ" with "１" based on the character replacement rule 303, yielding the final corrected string "２丁目１７―１" 603, as shown in FIG. 6. This string 603 is output to the optimum candidate determination unit 105.
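The y/abc/xyz/ form is a one-to-one character mapping, which Python expresses with `str.maketrans` and `str.translate`. The rule text below is assembled from the four replacements named in this paragraph; the helper name `apply_char_rule` and the simple "/"-splitting parser are assumptions.

```python
def apply_char_rule(rule, text):
    """Apply a rule written as y/abc/xyz/: a->x, b->y, c->z."""
    parts = rule.split("/")
    if len(parts) != 4 or parts[0] != "y":
        raise ValueError("expected a rule of the form y/abc/xyz/")
    _, source, target, _ = parts
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    return text.translate(str.maketrans(source, target))


# Built from the replacements listed in the description.
RULE_303 = "y/ユク一ノ/２７―１/"
print(apply_char_rule(RULE_303, "ユ丁目１ク一ノ"))
```

Characters that do not appear in the source list, such as "丁", "目", and the already correct "１", pass through unchanged.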

[0037] This completes the correction of the character recognition result based on one fitting correction rule 400. After this, the matching unit 106 continues, as described above, to check whether any other correction rule fits. If no other fitting rule is found, the final corrected string "２丁目１７―１" 603, based on the sole fitting rule 400, is displayed on the display as the optimum candidate by the optimum candidate determination unit 105.

[0038] If, on the other hand, a fitting rule other than the correction rule 400 is found, the series of correction steps described above is repeated based on the newly found rule, and its final corrected result is passed to the optimum candidate determination unit 105. In this case, the optimum candidate determination unit 105 receives a plurality of corrected results based on a plurality of correction rules and determines the optimum one among them. To make this determination, the following processing is also performed.

[0039] Each time the character replacing unit 108 obtains a final corrected string, it assigns that string a score indicating its degree of similarity to the input string. The score is based on the candidate characters in the final corrected string that matched the regular expression pattern and on the evaluation values that the character recognition unit 103 assigned to those candidate characters. More specifically, in the final corrected string "２丁目１７―１" 603 used in the example above, all seven characters are candidate characters that matched the regular expression pattern 301. By contrast, referring to FIG. 5, consider a final corrected string produced by a correction rule whose regular expression pattern contains a character set such as [.]* (meaning that any characters may follow, in any number). The portion of that string corresponding to [.] does not consist of characters matched against the pattern, so the number of matched candidate characters is smaller than the total number of characters in the corrected string. In this way, the characters that matched the regular expression pattern are determined for each final corrected string. Next, the evaluation value assigned by the character recognition unit 103 is obtained for each of these matched candidate characters, and the score of each final corrected string is found by summing the evaluation values of its matched candidate characters. The final corrected string is passed to the optimum candidate determination unit 105 together with this score.

[0040] For each of the final corrected strings it receives, the optimum candidate determination unit 105 divides the score by the total number of characters in the string to obtain an average score per character, and sorts the final corrected strings in descending order of that average. It then selects the final corrected string with the highest score as the optimum candidate, displays it on the display, and writes it to the correction rule file 109 outside the program.
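The scoring of paragraphs [0039] and [0040] amounts to summing the evaluation values of the pattern-matched characters, dividing by the total length of the corrected string, and taking the maximum. In the sketch below, the evaluation values are invented, higher values are assumed to mean greater similarity, and the names `average_score` and `pick_optimum` are hypothetical.

```python
def average_score(text, matched_scores):
    """Sum of evaluation values of matched characters, per character of text."""
    return sum(matched_scores) / len(text)


def pick_optimum(candidates):
    """candidates: list of (corrected_text, evaluation values of matched chars)."""
    return max(candidates, key=lambda c: average_score(c[0], c[1]))


# Invented evaluation values; higher is assumed to mean more similar.
candidates = [
    # Every one of the 7 characters matched the rule's pattern.
    ("２丁目１７―１", [90, 80, 95, 90, 70, 85, 80]),
    # Only 3 characters matched; the rest fell under a wildcard such as [.]*.
    ("２丁目１ク一ノ", [90, 80, 95]),
]
best_text, _ = pick_optimum(candidates)
print(best_text)
```

Dividing by the full string length rather than by the number of matched characters penalizes rules that leave much of the string to a wildcard, which is the effect the description attributes to the per-character average.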

【００４１】尚、最適候補を選択する方法は上記のよう
な得点に基づくものに限られるわけでなく、他にも種々
の方法が採用し得る。例えば、正規表現パターンに一致
した候補文字の個数に基づいて決定する方法などがあ
る。どの方法がベストかは、入力文字列の種類（例え
ば、住所番地か、電話番号か、など）によっても異なる
ため、幾つのかの方法を用意しておき、場合に応じて適
宜選択するようにしてもよい。The method of selecting the optimum candidate is not limited to the score-based method described above; various other methods can be adopted. One example is to decide based on the number of candidate characters that matched the regular expression pattern. Which method works best depends on the type of input character string (for example, a street address number or a telephone number), so several methods may be prepared and selected as appropriate for each case.
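
One way to prepare several selection methods and choose among them is a lookup table of scoring functions, sketched below. The method names, the dictionary layout of each candidate, and the sample values are all assumptions for illustration; the specification does not prescribe how the methods are organised.

```python
def by_average_score(candidate):
    """Average evaluation score per character of the corrected string."""
    return candidate["score"] / len(candidate["text"])


def by_match_count(candidate):
    """Number of candidate characters that matched the regular expression pattern."""
    return candidate["matched"]


# Hypothetical registry of selection methods; a caller picks one per input type.
METHODS = {
    "average_score": by_average_score,
    "match_count": by_match_count,
}


def select_best(candidates, method):
    """Return the text of the candidate that the chosen method ranks highest."""
    return max(candidates, key=METHODS[method])["text"]


candidates = [
    {"text": "17-1", "score": 360, "matched": 3},
    {"text": "17-10", "score": 400, "matched": 5},
]
by_score = select_best(candidates, "average_score")  # averages 90 vs 80
by_count = select_best(candidates, "match_count")    # counts 3 vs 5
```

With this sample data the two methods pick different strings, which shows why the choice of method may need to depend on the kind of input.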

【００４２】以上説明した実施形態によれば、次の利点
が得られる。According to the embodiment described above, the following advantages can be obtained.

【００４３】（１）プログラム外部に修正規則を蓄積し
た修正規則ファイルを設けることとしたので、帳票の様
式や記入方法の変更があっても、プログラムソース自体
を変更することなく、修正規則を追加・変更できるた
め、機能拡張及び保守が容易になる。(1) Because a correction rule file storing the correction rules is provided outside the program, correction rules can be added or changed without changing the program source itself even when the form layout or entry method changes, which makes functional extension and maintenance easy.

【００４４】（２）また、文字列の置き換え及び文字の
置き換えによって、文字認識結果の候補文字列の中に正
解文字が含まれなくとも、正解文字列を作成することが
できるため、文字認識の正解率が向上する。(2) In addition, by replacing character strings and characters, the correct character string can be produced even when the correct characters are not included among the candidate character strings of the character recognition result, so the accuracy rate of character recognition improves.

【００４５】（３）更には、修正の対象となる文字列の
形式を正規表現で記述するため、多彩な表記が可能とな
り、夫々の正規表現パターンにマッチした時にのみ、文
字列・文字の置き換えをするため、修正誤りの発生を抑
えられる。よって、文字認識の正解率が向上する。(3) Furthermore, because the format of the character strings to be corrected is described with regular expressions, a wide variety of notations can be expressed, and because character strings and characters are replaced only when the respective regular expression pattern matches, erroneous corrections are suppressed. The accuracy rate of character recognition therefore improves.
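
Advantage (3) can be illustrated with a small sketch in which replacements are applied only when the whole string matches a rule's regular expression pattern. The `Rule` class, its fields, and the sample rule are hypothetical; the actual rule syntax of the correction rule file is not reproduced here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A hypothetical correction rule: a pattern plus two kinds of replacement."""
    pattern: str
    string_replacements: list = field(default_factory=list)  # (old, new) pairs
    char_replacements: dict = field(default_factory=dict)    # single char -> single char


def apply_rule(rule, text):
    """Apply the rule's replacements only if the whole text matches its pattern."""
    if re.fullmatch(rule.pattern, text) is None:
        return text  # no match: leave the text untouched to avoid erroneous correction
    for old, new in rule.string_replacements:
        text = text.replace(old, new)            # character string replacement
    return text.translate(str.maketrans(rule.char_replacements))  # character replacement


# Hypothetical rule: a number, a hyphen or a long-vowel mark misread as a hyphen,
# a number, and an optional trailing "号".
rule = Rule(r"\d+[-ー]\d+号?", [("号", "")], {"ー": "-"})

corrected = apply_rule(rule, "17ー1号")   # matches, becomes "17-1"
untouched = apply_rule(rule, "コーヒー")  # does not match, stays "コーヒー"
```

Because the second input does not match the pattern, its long-vowel marks are left alone, whereas an unconditional replacement would have corrupted it.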

【００４６】なお、上記内容はあくまで本発明に係る一
実施形態に関するものであって、本発明が上記内容のみ
に限定されることを意味するものでないのは勿論であ
る。The above description concerns only one embodiment of the present invention and, needless to say, does not mean that the present invention is limited to that description.

【００４７】[0047]

【発明の効果】以上説明したように、本発明によれば、
文字認識結果の修正に必要な修正規則の機能拡張や保守
が容易で、しかも修正誤りが発生しにくい。[Effects of the Invention] As described above, according to the present invention, the correction rules needed to correct character recognition results are easy to extend and maintain, and erroneous corrections are unlikely to occur.

[Brief description of the drawings]

【図１】本発明の一実施形態に係る文字認識結果修正方
式を示すブロック図。FIG. 1 is a block diagram showing a character recognition result correction system according to an embodiment of the present invention.

【図２】同実施形態における文字認識結果の候補文字マ
トリクスを示す図。FIG. 2 is a diagram showing a candidate character matrix of character recognition results in the same embodiment.

【図３】同実施形態における文字認識結果修正部を示す
機能ブロック図。FIG. 3 is a functional block diagram showing a character recognition result correction unit in the same embodiment.

【図４】同実施形態で用いる修正規則の一例を示す説明
図。FIG. 4 is an explanatory diagram showing an example of a correction rule used in the same embodiment.

【図５】同実施形態で用いる修正規則を記述するための
正規表現を示す説明図。FIG. 5 is an explanatory view showing a regular expression for describing a modification rule used in the same embodiment.

【図６】文字認識結果修正部における候補文字マトリク
スから最適候補文字列が得られるまでの過程を示した説
明図。FIG. 6 is an explanatory diagram showing a process of obtaining an optimum candidate character string from a candidate character matrix in a character recognition result correction unit.

[Explanation of symbols]

101 character string image (文字列画像)
102 character cutout unit (文字切出し部)
103 character recognition unit (文字認識部)
104 character recognition result correction unit (文字認識結果修正部)
105 optimum candidate determination unit (最適候補決定部)
106 regular expression matching unit (正規表現マッチング部)
107 character string replacement unit (文字列置き換え部)
108 character replacement unit (文字置き換え部)
109 correction rule file (修正規則ファイル)
301 regular expression pattern (正規表現パターン)
302, 303 character string replacement rules (文字列置き換え用規則)

Claims

[Claims]

1. A system, implemented on a programmed computer, for correcting a character recognition result for an input character string using correction rules prepared in advance, the system comprising: a character recognition result correction unit that operates according to an algorithm described in the program of the computer; and a correction rule file that is prepared in advance outside the program and stores a plurality of correction rules, wherein the character recognition result correction unit selects, from the correction rule file, a correction rule that matches the character recognition result, and corrects the character recognition result based on the selected correction rule.

2. The correction system according to claim 1, wherein the correction rule file is a text file.

3. The correction system according to claim 1, wherein each of the correction rules includes a regular expression pattern that describes, in a predetermined regular expression, the notation of the character strings it matches, and the character recognition result correction unit includes a regular expression matching unit that determines, based on the regular expression pattern of each correction rule, whether that correction rule matches the character recognition result.

4. The correction system according to claim 1, wherein the character recognition result is a candidate character matrix in which the first to n-th candidate characters obtained as a result of character recognition for each character in the input character string are arranged by character position and rank, and the character recognition result correction unit includes: means for determining whether a candidate character string that matches the regular expression pattern of each correction rule can be selected from the candidate character matrix, and for selecting, as the matching correction rule, a correction rule having a regular expression pattern for which one or more matching candidate character strings can be selected; means for selecting, from among the one or more matching candidate character strings, the candidate character string consisting of the highest-ranked candidate characters; and means for correcting the selected candidate character string based on the selected correction rule.

5. The correction system according to claim 1, wherein each of the correction rules includes a replacement rule to be executed on a character string or character contained in a matching character recognition result, and the character recognition result correction unit includes a character string/character replacement unit that, based on the replacement rule included in the selected correction rule, replaces a character string or character contained in the character recognition result with another character string or character.

6. The correction system according to claim 1, wherein, when a plurality of matching correction rules exist in the correction rule file, the character recognition result correction unit corrects the character recognition result based on each of the matching correction rules and outputs a plurality of corrected character strings, the system further comprising an optimum candidate determination unit that receives the plurality of corrected character strings output from the character recognition result correction unit and determines an optimum candidate from among them.

7. A method, performed by a programmed computer, for correcting a character recognition result for an input character string using correction rules prepared in advance, the method comprising: a step of inputting the character recognition result; a step of selecting a correction rule that matches the input character recognition result by referring to a correction rule file that is prepared in advance outside the program of the computer and stores a plurality of correction rules; and a step of correcting the input character recognition result based on the selected correction rule.