JPH11120293A

JPH11120293A - Character recognition/correction system

Info

Publication number: JPH11120293A
Application number: JP9283280A
Authority: JP
Inventors: Yasunao Isaki; 保直伊崎
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1997-10-16
Filing date: 1997-10-16
Publication date: 1999-04-30
Anticipated expiration: 2017-10-16
Also published as: JP3452774B2; CN1140878C; CN1215201A; KR19990036515A; KR100412317B1

Abstract

PROBLEM TO BE SOLVED: To precisely recognize low-quality character strings that are entered on various slips at irregular character intervals or by irregular entry methods. SOLUTION: A first matching process between an input character string 101 and a specific character standard pattern dictionary 107 extracts specific characters or specific character strings from the input character string 101. Next, a group of candidate words that belong to a specific category and that may be located in the regions of the input character string 101 preceding or following each extracted specific character or specific character string is extracted from a specific character dictionary 110 and a knowledge dictionary 111 linked to it. Then, for each candidate word in that group, a second matching process using a standard pattern dictionary 113 is performed on the corresponding region of the input character string 101 according to information about the candidate word, thereby recognizing the characters that constitute the input character string 101.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a technique for recognizing low-quality character strings, namely character strings entered on the various slips in everyday use that are written at irregular character intervals or by irregular entry methods, so that adjacent characters may touch one another or a single character may be split apart.

[0002]

2. Description of the Related Art

OCR (optical character reader) devices, which read image data and convert it into character code data, have come to be used for a wide range of business tasks as their field of application has expanded. Each task uses a different form, and both the character strings entered on those forms and the people who enter them have become increasingly varied.

[0003] Conventional OCR forms use character entry boxes printed one per character, and particularly large boxes are used where kanji are to be entered. This makes it easier for the OCR device to detect the entered characters one at a time, and it prompts the writer to keep each character from touching its neighbor.

[0004] With such forms, entering just two or three addresses or names already amounts to dozens of characters, so a large form is required, which is costly. The forms also burden the writer, who must write each character inside its own box.

[0005] As the application field of OCR expands, there is a growing need for a character recognition/correction technique that allows kanji character strings to be written on a small form, such as an ordinary slip, without being constrained by character boxes, that recognizes them with an accuracy adequate for practical use, and that allows unreadable characters to be corrected efficiently.

[0006] In a typical conventional character recognition method, the entered characters are detected and cut out one at a time while referring to a file called a definition body, which stores the coordinate positions on the form of the character boxes in which the character string to be recognized is entered. Recognition processing is then executed on each cut-out character, and a candidate character group is output as the recognition result.

[0007] Recognition of a cut-out character is executed, for example, as follows. First, characters written by many unspecified writers according to a predetermined format are collected, feature amounts that depend on the recognition method are extracted from these characters, and standard patterns are created by a statistical technique (for example, a clustering technique). A standard pattern dictionary is then created from the standard patterns for each target character type.

[0008] A standard pattern is created, for example, as an average pattern obtained by averaging the collected character patterns. More specifically, the average pattern is represented by an average feature amount obtained by computing the mean of the feature amounts of the collected characters.
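The averaging described above can be sketched as follows. This is a minimal illustration, assuming each collected character has already been reduced to a fixed-length numeric feature vector; the function name `build_template` and the sample values are hypothetical and do not come from the patent.

```python
from statistics import fmean


def build_template(feature_vectors):
    """Average equal-length feature vectors into one standard pattern."""
    if not feature_vectors:
        raise ValueError("at least one sample is required")
    dim = len(feature_vectors[0])
    if any(len(vec) != dim for vec in feature_vectors):
        raise ValueError("all feature vectors must have the same length")
    # zip(*...) groups the i-th component of every sample together.
    return [fmean(component) for component in zip(*feature_vectors)]


# Two hypothetical samples of the same character type.
samples = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]]
template = build_template(samples)
print(template)  # [1.0, 3.0, 5.0]
```

A multiple-template dictionary would hold several such averages per character type, for instance one per cluster of similar writing styles.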

[0009] In handwritten character recognition, character shapes vary greatly from writer to writer, so multiple standard patterns are created for each character type. A single standard pattern is usually called a template, and a dictionary built from multiple standard patterns per character type is called a multiple-template dictionary.

[0010] Character recognition is executed using the standard pattern dictionary or the multiple-template dictionary described above. Specifically, a feature amount is extracted from one character cut out from the input form, and a similarity or a distance (Euclidean distance, Mahalanobis distance, or the like) is calculated between this feature amount and the feature amount of each template (standard pattern) in the standard pattern dictionary (or multiple-template dictionary). The character type categories to which the templates belong, up to a predetermined rank (for example, eighth) in descending order of similarity or ascending order of distance, are then output as a candidate character group.
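The ranking step above can be sketched as follows, using Euclidean distance. This is an illustration under stated assumptions: the dictionary is modeled as a list of `(category, feature_vector)` pairs, and removing duplicate categories from the output is a choice made here, not something the text specifies. All names are hypothetical.

```python
import math


def candidate_categories(feature, templates, top_k=8):
    """Return the categories of the top_k nearest templates, nearest first."""
    ranked = sorted(templates, key=lambda t: math.dist(feature, t[1]))
    candidates = []
    for category, _ in ranked[:top_k]:
        # With several templates per character type the same category can
        # appear more than once; keep only its best-ranked occurrence.
        if category not in candidates:
            candidates.append(category)
    return candidates


templates = [
    ("A", [0.0, 0.0]),
    ("B", [1.0, 1.0]),
    ("A", [0.2, 0.1]),  # second template for the same character type
    ("C", [5.0, 5.0]),
]
print(candidate_categories([0.1, 0.1], templates, top_k=3))  # ['A', 'B']
```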

[0011] When the characters to be recognized represent an address or a name, knowledge processing using address words or name words is generally applied to the candidate character groups. More specifically, the candidate character groups for the individual entry positions are first combined across all entry positions, and a group of candidate character strings is output.

[0012] Next, each candidate character string in this group is compared against the address dictionary or name dictionary used for knowledge processing, to check whether each word character string in the dictionary is present in that candidate character string.

[0013] A score is then assigned to the candidate character string according to the comparison result and, for example, the rank of each candidate character that makes up the string. After this has been done for all candidate character strings, the candidate character string with the highest score is output as the knowledge processing result.
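The conventional knowledge processing described above (combine per-position candidates, look for dictionary words, score, pick the best) can be sketched as follows. The scoring rule here, a penalty equal to the sum of candidate ranks plus a bonus per matched dictionary character, is purely an assumption for illustration; the text does not give a formula. Function and parameter names are hypothetical.

```python
from itertools import product


def best_candidate_string(candidate_groups, dictionary, word_bonus=10):
    """Pick the highest-scoring combination of per-position candidates.

    candidate_groups[i] lists the candidate characters for entry position i,
    best candidate first. The scoring rule is an illustrative assumption.
    """
    best_text, best_score = None, None
    for ranks in product(*(range(len(group)) for group in candidate_groups)):
        text = "".join(group[r] for group, r in zip(candidate_groups, ranks))
        # Penalize low-ranked characters, reward dictionary words found in the string.
        score = -sum(ranks)
        score += word_bonus * sum(len(word) for word in dictionary if word in text)
        if best_score is None or score > best_score:
            best_text, best_score = text, score
    return best_text, best_score


groups = [["c", "e"], ["o", "a"], ["t", "l"]]
print(best_candidate_string(groups, ["cat", "dog"]))  # ('cat', 29)
```

Note that the number of combinations grows exponentially with the number of entry positions, and the approach presupposes that the number of positions is known, which is the limitation the document goes on to discuss.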

[0014] A known example of prior art relating to such knowledge processing is disclosed in Japanese Unexamined Patent Publication No. Sho 61-107486. When kanji character strings such as addresses and names written at free pitch are to be recognized, as with entries on ordinary slips, adjacent characters very commonly touch each other. Conversely, some kanji are written with their parts separated, such as the left-hand radical (hen) and the right-hand component (tsukuri).

[0015] Consequently, with the conventional character recognition method in which entered characters are detected, cut out, and recognized one at a time, it is difficult to judge which extent constitutes a single character, and difficult to achieve a recognition accuracy adequate for practical use.

[0016] Furthermore, if the individual characters cannot be recognized correctly, it may not even be possible to determine how many characters were entered. Conventional knowledge processing, which presupposes that the number of characters making up a word is fixed, is therefore limited in how far it can improve recognition accuracy.

[0017] In addition, particularly in the recognition of address place names, when a higher-level word (for example, Tokyo-to or Osaka-fu) cannot be recognized by knowledge processing, the lower-level words generally have not been knowledge-processed at that stage either. To correct the address place name, every character must therefore be corrected in turn, starting from the first.

[0018] As a first prior art for recognizing free-pitch character strings as described above, the technique disclosed in Japanese Examined Patent Publication No. Hei 8-23875, "Word Reading Method", is known. In this first prior art, the candidate character string obtained as the recognition result is collated with a word dictionary by DP matching or the like, a word with many matching characters is selected, the non-matching portion is cut out again, and recognition is performed again on the re-cut character string.
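As a rough illustration of what collating a candidate string with a word dictionary by DP matching can look like, the sketch below uses Levenshtein edit distance, one common dynamic-programming string comparison. This is an assumption made for illustration; the cited publication may define its DP matching differently. The romanized words are hypothetical sample data.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def closest_word(candidate, words):
    """Return the dictionary word with the smallest edit distance."""
    return min(words, key=lambda w: edit_distance(candidate, w))


print(edit_distance("kitten", "sitting"))                 # 3
print(closest_word("tokyp", ["tokyo", "osaka", "kyoto"]))  # tokyo
```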

[0019] As a second prior art for recognizing free-pitch character strings, the technique disclosed in Japanese Unexamined Patent Publication No. Sho 63-136291, "Word Reading Method", is known. In this second prior art, recognition is executed using a standard pattern dictionary that holds, as standard patterns, partial patterns representing the hen and tsukuri parts of characters. A character string is generated from the hen and tsukuri of each character in the candidate character string, and that string is matched against a word dictionary.

[0020] As a third prior art for recognizing free-pitch character strings, the technique disclosed in Japanese Unexamined Patent Publication No. Hei 8-171614, "Character String Reading Device", is known. In this third prior art, the possibility that an expected character string exists is verified in cases such as when a character is skipped because the correct character is not included in the candidate character string, or when multiple reading candidates arise because of character candidates that compete with the correct character. Several ways of implementing the verification means are disclosed.

[0021] However, when recognition is considered for the kind of character strings we write in daily life, that is, low-quality character strings in which adjacent characters frequently touch, character width varies greatly from character to character, and strokes are often smudged or faint, the first to third prior arts have the following problems.

[0022] First, in the first prior art it is undetermined which characters of the candidate character string are given priority, and all characters in the candidate character string are treated equally. Depending on the initial character cutout positions, only entirely inappropriate words may therefore be selected as candidates.

[0023] Next, the second prior art has a problem in processing regions where adjacent characters touch each other. Further, although the third prior art describes several methods of implementing the verification means, all of them rely on combinations of character candidates, so their verification performance depends heavily on the result of the initial character cutout.

[0024] An object of the present invention is to recognize low-quality character strings accurately by focusing on specific characters.

[0025]

Means for Solving the Problems

The present invention presupposes a character recognition/correction method for recognizing the characters that constitute an input character string entered in an entry field having a predetermined category, a character recognition device having equivalent functions, or a computer-readable recording medium.

[0026] In the present invention, a first matching process is first executed between the input character string and a first recognition dictionary (specific character standard pattern dictionary 107), whereby specific characters or specific character strings are extracted from the input character string. More specifically, standard patterns corresponding to the specific characters or specific character strings are stored in the first recognition dictionary, and the first matching process is executed between the pattern of the input character string and each standard pattern in the first recognition dictionary. The specific characters or specific character strings are, for example, those that appear frequently in the predetermined category, or those that can be recognized with high accuracy.

[0027] Next, a group of candidate words that belong to the predetermined category (for example, address character strings) and that may be located in the regions of the input character string before or after each extracted specific character or specific character string is extracted from a category-specific word dictionary (specific character dictionary 110, knowledge dictionary 111).

[0028] Then, for each candidate word belonging to the extracted candidate word group, a second matching process using a second recognition dictionary (standard pattern dictionary 113) is executed, based on information about that candidate word, on the region of the input character string in which the candidate word would be located, whereby the characters constituting the input character string are recognized. More specifically, the second recognition dictionary stores standard patterns corresponding to characters or character strings related to the candidate words in the candidate word group. For each candidate word, and for the region of the input character string in which that candidate word would be located, the second matching process is executed between the pattern of the candidate word and each standard pattern in the second recognition dictionary, based on the information about the candidate word. The information about each candidate word is, for example, its number of characters. The second recognition dictionary may be configured to include the first recognition dictionary.
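The text above says that information such as a candidate word's character count guides the second matching. One very simple way such a count could drive re-segmentation is sketched below. It assumes equal-width character cells between the two bounding positions, which is an illustrative simplification and not the method claimed in the document; `split_region` and the pixel coordinates are hypothetical.

```python
def split_region(x_start, x_end, n_chars):
    """Split the horizontal span [x_start, x_end) into n_chars equal cells."""
    if n_chars <= 0:
        raise ValueError("n_chars must be positive")
    if x_end <= x_start:
        raise ValueError("x_end must be greater than x_start")
    width = (x_end - x_start) / n_chars
    return [(x_start + i * width, x_start + (i + 1) * width)
            for i in range(n_chars)]


# A 60-pixel-wide region hypothesized to contain a 3-character word.
print(split_region(100, 160, 3))
# [(100.0, 120.0), (120.0, 140.0), (140.0, 160.0)]
```

Each candidate word of a different length would yield a different segmentation of the same region, and the matching degree of each hypothesis could then be compared.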

[0029] With this configuration, the specific characters or specific character strings in the input character string are recognized first, with priority; the candidate words before and after them are hypothesized on the basis of that recognition result; and the characters constituting the input character string are then re-recognized using the information about those candidate words. This makes it possible to recognize with high accuracy the characters of input character strings written at irregular intervals and by irregular entry methods, such as those entered on the various forms (slips) in everyday use.

[0030] In the configuration described above, the invention can be arranged so that the recognition result for the characters constituting the input character string is displayed side by side with the input character string; the user designates a desired region on the displayed input character string and corrects the character or character string corresponding to that region; and, based on information about the correct character or correct character string supplied by the correction, the candidate word group extraction and the second matching process are executed again so that the characters constituting the input character string are recognized again. In this case, the invention can be arranged so that multiple candidate recognition results for the desired region are displayed in response to the designation of that region on the displayed input character string.

[0031] With this character correction technique, correcting only a specific character or character string causes the other unrecognizable portions to be corrected automatically. In addition, in the configuration described above, the invention can be arranged so that, for each candidate word, words that are notational variants of it are output as new candidate words belonging to the candidate word group.

[0032] This control of notational variation makes it possible to cope flexibly with various ways of writing.

[0033]

Embodiments of the Invention

Embodiments of the present invention are described in detail below with reference to the drawings.

Configuration and Outline of Operation of the Embodiment

FIG. 1 is a configuration diagram of an embodiment of the present invention.

[0034] First, the character cutout unit 103 uses the entry field definition 104, which defines information on the positions of the entry fields on the form, to cut out characters one at a time, in order from the beginning, from the input character string 101 that was written on the form and is read from the image memory 102.

[0035] Next, the feature extracting unit 105 extracts a feature amount from the cut-out character. The matching unit 106 then performs a matching process between the feature amount of the cut-out character and the feature amount of each specific character standard pattern in the specific character standard pattern dictionary 107. It outputs to the candidate character string buffer 108, as candidate specific characters for the cut-out character, the character type categories of the specific characters to which the specific character standard patterns belong, up to a predetermined rank in descending order of matching degree.

[0036] This series of specific character recognition steps by the character cutout unit 103, the feature extracting unit 105, and the matching unit 106 is executed for every character that the character cutout unit 103 cuts out in order from the beginning of the input character string 101. As a result, the candidate character string buffer 108 holds the candidate specific characters for each character, arranged in an order corresponding to the order of the characters cut out from the input character string 101.

[0037] The candidate word search unit 109 extracts, from the candidate specific character string obtained in the candidate character string buffer 108, every pair consisting of two adjacent specific characters (a specific character set), and searches whether each specific character set is registered in the specific character dictionary 110.

[0038] When a specific character set is registered in the specific character dictionary 110, the candidate word search unit 109 searches the record in the knowledge dictionary 111 that is linked to the registered record for the group of words sandwiched between the two specific characters of that set, and holds the retrieved word group in the candidate word buffer 112 as a candidate word group.
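The lookup described above can be sketched as follows. The data model is an assumption made for illustration: the candidate buffer is reduced to one detected specific character per position, the specific character dictionary maps a pair of characters to a record key, and the knowledge dictionary maps that key to the words that can appear between the pair. The actual record layouts of dictionaries 110 and 111 are not given in this section, and all names and sample entries are hypothetical.

```python
def search_candidate_words(detected, specific_char_dict, knowledge_dict):
    """Map each registered pair of adjacent specific characters to its words.

    detected: specific characters in their order of appearance in the input.
    """
    result = {}
    for left, right in zip(detected, detected[1:]):
        record_id = specific_char_dict.get((left, right))
        if record_id is None:
            continue  # this pair is not registered; no candidate words
        result[(left, right)] = knowledge_dict.get(record_id, [])
    return result


# Hypothetical address example: a ward name lies between 都 (to) and 区 (ku).
specific_char_dict = {("都", "区"): "tokyo_wards"}
knowledge_dict = {"tokyo_wards": ["新宿", "港", "渋谷"]}  # Shinjuku, Minato, Shibuya

print(search_candidate_words(["都", "区", "町"], specific_char_dict, knowledge_dict))
# {('都', '区'): ['新宿', '港', '渋谷']}
```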

[0039] For each specific character set extracted from the candidate character string buffer 108, the candidate word search unit 109 extracts the corresponding candidate word group and holds it in the candidate word buffer 112.

[0040] As a result, the candidate word buffer 112 holds one or more candidate word groups for each specific character set, and ultimately a collection of candidate word groups covering multiple specific character sets.

[0041] The candidate words belonging to the candidate word group obtained in the candidate word buffer 112 for one specific character set are read out in turn, and the following series of processes is executed for each of them.

[0042] First, the character cutout unit 103 uses the information about the candidate word output from the candidate word buffer 112 to cut out again, from the input character string 101 read from the image memory 102, the character string in the region sandwiched between the two specific characters of the specific character set to which that candidate word belongs.

[0043] The feature extracting unit 105 extracts a feature amount from the re-cut character string. The matching unit 106 then performs a matching process between the feature amount of the re-cut character string and the feature amount of each standard pattern in the standard pattern dictionary 113, which is the second dictionary. It holds in the candidate character string buffer 108, as the candidate recognition result group for that candidate word, the categories of the character strings to which the standard patterns belong, up to a predetermined rank in descending order of matching degree.

[0044] This series of re-recognition steps by the character cutout unit 103, the feature extracting unit 105, and the matching unit 106 is executed for each candidate word in the candidate word group obtained in the candidate word buffer 112 for the one specific character set, so that a candidate recognition result group up to the predetermined rank is obtained in the candidate character string buffer 108 for each candidate word.

[0045] The matching unit 106 then selects, from among all the candidate recognition result groups up to the predetermined rank obtained in the candidate character string buffer 108 for the candidate words belonging to the one specific character set, the most appropriate and reliable recognition result, specifically the candidate recognition result with the highest matching degree. It outputs that result to the knowledge processing unit 114 as the recognition result for the portion sandwiched between the two specific characters of the specific character set.
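The selection described above, taking the single result with the highest matching degree across all candidate words of one specific character set, can be sketched as follows. The dictionary-of-lists layout and the numeric matching degrees are assumptions for illustration only; the romanized words are hypothetical sample data.

```python
def select_best(results_by_word):
    """Return (candidate_word, recognized_text, degree) with the highest degree.

    results_by_word maps each candidate word to its ranked list of
    (recognized_text, matching_degree) pairs. Returns None if there are none.
    """
    best = None
    for word, results in results_by_word.items():
        for text, degree in results:
            if best is None or degree > best[2]:
                best = (word, text, degree)
    return best


results = {
    "shinjuku": [("shinjuku", 0.91), ("shingu", 0.62)],
    "shibuya": [("shibuya", 0.40)],
}
print(select_best(results))  # ('shinjuku', 'shinjuku', 0.91)
```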

[0046] The series of re-recognition steps performed by the character cutout unit 103, the feature extracting unit 105, and the matching unit 106 for each candidate word in the candidate word group of one specific character set is executed for every specific character set registered in the candidate word buffer 112. As a result, the recognition results corresponding to each character region sandwiched between the two specific characters of each specific character set are output to the knowledge processing unit 114.

[0047] The knowledge processing unit 114 applies knowledge processing, using the entry field definition 104 and the knowledge dictionary 111, to the recognition results corresponding to the character regions sandwiched between the two specific characters of each specific character set. It thereby determines the final recognition result for the whole character region made up of those character regions and outputs it to the recognition result buffer 115.

[0048] In the series of recognition processes described above, reject (unrecognizable) information is attached to any character or character string portion that never satisfied the recognition conditions. In that case, the recognition result obtained in the recognition result buffer 115 is displayed on the display unit 117 via the interface unit 116. Referring to the recognition result shown on the display unit 117, the user can correct the unrecognizable characters or character strings from the input unit 118, which consists of a mouse, a keyboard, and the like.

[0049] The user need only designate, from the input unit 118, a particular correct character within the unrecognizable character or character string. Information about that correct character is then output from the interface unit 116 to the correct character buffer 119 and the area coordinate buffer 120.

[0050] The candidate word search unit 109 treats the information about the correct character obtained in the correct character buffer 119 as specific character information and executes the candidate word search described above using the specific character dictionary 110 and the knowledge dictionary 111, so that the unrecognizable characters can be correctly re-recognized. The character cutout unit 103 can also cut out characters correctly by obtaining from the area coordinate buffer 120 the cutout position of the correct character designated by the user.

【００５１】以上のように、本実施の形態では、帳票中
の各記入フィールドに記入される住所、氏名、品名等の
入力文字列１０１に対し、各フィールド毎に出現頻度が
高い文字或いは特定の文字／文字列に着目することで、
知識辞書１１１が保持する単語情報と、階層構造を有す
る住所等の文字列の場合は各文字領域毎の接続情報を用
いて、上記特定文字に挟まれた文字領域の候補単語を選
択することができる。更に、本実施の形態では、その候
補単語の情報を用いて、入力文字列１０１から上記特定
文字に挟まれた文字領域の抽出とその文字領域に対する
再認識処理が実行されることにより、隣接文字間で接
触、分離が多く発生する書き方で記入された文字列を、
高い認識精度で認識することができる。本発明の実施の形態の詳細動作図２〜図４は、図１に示される構成を有する本発明の実
施の形態が実現する全体制御を示す動作フローチャート
である。＜特定文字の認識処理＞まず、文字切り出し部１０３
が、帳票の記入フィールド位置に関する情報を定義した
記入フィールド定義１０４を用いて、イメージメモリ１
０２から２値化画像データとして読み出された、帳票に
記入された入力文字列１０１中の先頭から順に１文字ず
つを切り出す（図２のステップ２０１）。As described above, in this embodiment, for an input character string 101 such as an address, a name, or a product name entered in each entry field of a form, attention is paid to characters with a high appearance frequency in each field, or to specific characters or character strings. Using the word information held in the knowledge dictionary 111 and, for character strings with a hierarchical structure such as addresses, the connection information for each character region, candidate words for the character region sandwiched between the specific characters can be selected. Furthermore, in this embodiment, the candidate word information is used to extract the character region sandwiched between the specific characters from the input character string 101 and to perform re-recognition on that region, so that a character string written in a style in which adjacent characters often touch or separate can be recognized with high accuracy.

Detailed Operation of the Embodiment of the Present Invention

FIGS. 2 to 4 are operation flowcharts showing the overall control realized by the embodiment of the present invention having the configuration shown in FIG. 1.

＜Specific Character Recognition Processing＞

First, the character cutout unit 103 uses the entry field definition 104, which defines information on the entry field positions of the form, to cut out characters one at a time, starting from the beginning, from the input character string 101 written on the form and read out from the image memory 102 as binarized image data (step 201 in FIG. 2).

【００５２】図５は、文字切り出し部１０３が使用する
記入フィールド定義１０４のデータフォーマット例を示
す図である。例えば、帳票上にフィールド１、２が配置
されており、この２つのフィールドに記入された文字列
が認識される場合、記入フィールド定義１０４は、以下
のようにして決定される。FIG. 5 is a diagram showing an example of the data format of the entry field definition 104 used by the character cutout unit 103. For example, when fields 1 and 2 are arranged on a form and character strings entered in these two fields are recognized, the entry field definition 104 is determined as follows.

【００５３】まず、帳票の上部が座標原点とされ、横方
向にｘ軸、縦方向にｙ軸がそれぞれ定義され、フィール
ド１、２のそれぞれについて、そのフィールドの左上端
の位置の座標（フィールド原点座標）と、ｘ軸方向のフ
ィールド幅及びｙ軸方向のフィールド高さとからなるフ
ィールドの大きさデータが、図５(a) に示されるように
定義される。長さの単位は、ミリメートル又はインチで
ある。First, the top of the form is taken as the coordinate origin, with the x-axis defined in the horizontal direction and the y-axis in the vertical direction. For each of fields 1 and 2, the coordinates of the upper-left corner of the field (field origin coordinates) and the field size data, consisting of the field width in the x-axis direction and the field height in the y-axis direction, are defined as shown in FIG. 5(a). The unit of length is millimeters or inches.

【００５４】次に、フィールド１、２のそれぞれについ
て、各フィールドにどのような種別の文字列が記入され
るかを示すフィールド種別が定義される。これらの情報
が、図５(b) に示される表形式で、記入フィールド定義
１０４として特には図示しない記憶装置に保持される。Next, for each of fields 1 and 2, a field type is defined that indicates what kind of character string is entered in the field. This information is held, in the table format shown in FIG. 5(b), as the entry field definition 104 in a storage device (not shown).
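As a rough illustration of the table described above, the entry field definition could be held as one record per field. This is only a sketch: the class name `EntryField`, its attribute names, the sample coordinates, and the `to_pixels` helper with its `dots_per_unit` parameter are all assumptions introduced here, not names from the patent.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EntryField:
    """One row of a hypothetical entry field definition table (cf. FIG. 5(b))."""
    field_id: int
    origin_x: float   # x of the field's upper-left corner, from the form origin
    origin_y: float   # y of the field's upper-left corner, from the form origin
    width: float      # field width along the x-axis
    height: float     # field height along the y-axis
    field_type: str   # kind of string expected, e.g. "address" or "name"

    def to_pixels(self, dots_per_unit: float) -> Tuple[int, int, int, int]:
        """Convert the field rectangle to (left, top, right, bottom) in pixels."""
        left = round(self.origin_x * dots_per_unit)
        top = round(self.origin_y * dots_per_unit)
        right = round((self.origin_x + self.width) * dots_per_unit)
        bottom = round((self.origin_y + self.height) * dots_per_unit)
        return left, top, right, bottom


# Illustrative values only (lengths in millimeters).
ENTRY_FIELD_DEFINITION = [
    EntryField(1, 20.0, 30.0, 120.0, 10.0, "address"),
    EntryField(2, 20.0, 50.0, 80.0, 10.0, "name"),
]
```

Keeping lengths in physical units and converting to pixels only when the image is read lets the same definition serve scans at different resolutions.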

【００５５】文字切り出し部１０３は、上述の記入フィ
ールド定義１０４を用いることによって、イメージメモ
リ１０２から読み出されたイメージデータ上で各フィー
ルド毎の文字領域を決定した後、その文字領域内のイメ
ージデータに対して、図６に示される動作フローチャー
トによって示される文字切り出し制御を実行する。By using the entry field definition 104 described above, the character cutout unit 103 determines the character region of each field on the image data read from the image memory 102, and then executes, on the image data in that character region, the character cutout control shown in the operation flowchart of FIG. 6.

【００５６】ここで、図８(a) に示されるように、記入
フィールド定義１０４から抽出される対象領域のフィー
ルド原点座標を（ｘ₀，ｙ₀）、ｘ軸方向のフィールド
幅をｄｘ、ｙ軸方向のフィールド高さをｄｙとする。Here, as shown in FIG. 8(a), let the field origin coordinates of the target region extracted from the entry field definition 104 be $(x_0, y_0)$, the field width in the x-axis direction be dx, and the field height in the y-axis direction be dy.

【００５７】まず文字切り出し部１０３は、ｘ軸方向の
各走査ライン毎に、黒画素数を累算することにより、各
ｙ座標位置毎のｘ軸方向の黒画素の出現頻度を示す水平
ヒストグラムを、図８(b) に示されるように算出する
（図６のステップ６０１）。First, the character cutout unit 103 accumulates the number of black pixels along each scanning line in the x-axis direction, and thereby calculates a horizontal histogram, which gives the frequency of black pixels in the x-axis direction at each y-coordinate, as shown in FIG. 8(b) (step 601 in FIG. 6).

【００５８】次に、図８(b) に示されるように、文字切
り出し部１０３は、上記水平ヒストグラム上をその上方
及び下方のそれぞれから走査し、最初に頻度値Ｃを超え
る位置α及びβを算出し、更にそれらから算出される値
α−βを、その対象領域における文字列高さｈとする
（ステップ６０２）。Next, as shown in FIG. 8(b), the character cutout unit 103 scans the horizontal histogram from the top and from the bottom, finds the positions α and β at which the frequency value C is first exceeded, and takes the value α − β calculated from them as the character string height h in the target region (step 602).

【００５９】次に、文字切り出し部１０３は、ｙ軸方向
の各走査ライン毎に、黒画素数を累算することにより、
各ｘ座標位置毎のｙ軸方向の黒画素の出現頻度を示す垂
直ヒストグラムを図８(c) に示されるように算出する
（図６のステップ６０３）。Next, the character cutout unit 103 accumulates the number of black pixels along each scanning line in the y-axis direction, and thereby calculates a vertical histogram, which gives the frequency of black pixels in the y-axis direction at each x-coordinate, as shown in FIG. 8(c) (step 603 in FIG. 6).

【００６０】続いて、図８(c) に示されるように、文字
切り出し部１０３は、上記垂直ヒストグラム上をその左
から走査し、頻度値がしきい値ｄ以下からしきい値ｄ以
上に変化する点ｘ₁，ｘ₃，ｘ₅，・・・（ｘ_2n-1：ｎ
＝１，２，・・・）を切り出し候補位置として算出し、
また、頻度値がしきい値ｄ以上からしきい値ｄ以下に変
化する点ｘ₂，ｘ₄，ｘ₆，・・・（ｘ_2m：ｍ＝１，
２，・・・）もやはり切り出し候補位置として算出する
（ステップ６０４）。Subsequently, as shown in FIG. 8(c), the character cutout unit 103 scans the vertical histogram from the left and calculates, as cutout candidate positions, the points $x_1, x_3, x_5, \ldots$ ($x_{2n-1}$; $n = 1, 2, \ldots$) at which the frequency value changes from at or below the threshold d to at or above it. It likewise calculates, as cutout candidate positions, the points $x_2, x_4, x_6, \ldots$ ($x_{2m}$; $m = 1, 2, \ldots$) at which the frequency value changes from at or above the threshold d to at or below it (step 604).

【００６１】次に、文字切り出し部１０３は、下記条件
式を満たす領域［ｘ_2m，ｘ_2n-1］を算出し、それを文字
切り出し結果とする（ステップ６０５）。Next, the character cutout unit 103 calculates the regions $[x_{2m}, x_{2n-1}]$ that satisfy the following conditional expression and takes them as the character cutout result (step 605).

【００６２】[0062]

【数１】ｈ−ｔ₁≦ｘ_2m−ｘ_2n-1≦ｈ＋ｔ₂ （ｍ＝１，２，３，・・・），（ｎ＝１，２，３，・・
・）ここで、ｈは前述したステップ６０２において算出され
た文字列高さ、ｔ₁及びｔ₂は入力文字列１０１の学習
サンプルによって決まるパラメータである。図８(c) の
例では、下記３つの領域が文字切り出し結果として算出
される。［ｘ₁，ｘ₂］［ｘ₃，ｘ₄］［ｘ₅，ｘ₈］文字切り出し部１０３は、ステップ６０５の処理の結
果、下記条件式を満たす領域が残ったか否かを判定する
（ステップ６０６）。[Equation 1]
$$h - t_1 \le x_{2m} - x_{2n-1} \le h + t_2 \quad (m = 1, 2, 3, \ldots;\ n = 1, 2, 3, \ldots)$$
Here, h is the character string height calculated in step 602 described above, and $t_1$ and $t_2$ are parameters determined from training samples of the input character string 101. In the example of FIG. 8(c), the following three regions are calculated as the character cutout result: $[x_1, x_2]$, $[x_3, x_4]$, and $[x_5, x_8]$. After step 605, the character cutout unit 103 determines whether any region satisfying the following conditional expression remains (step 606).

【００６３】[0063]

【数２】ｘ_2l−ｘ_2l-1＞ｈ＋ｔ₂ （ｌ＝１，２，３，・・・）ステップ６０６の判定がＮＯならば、文字切り出し部１
０３は、その制御処理を終了する。[Equation 2]
$$x_{2l} - x_{2l-1} > h + t_2 \quad (l = 1, 2, 3, \ldots)$$
If the determination in step 606 is NO, the character cutout unit 103 ends the control processing.

【００６４】ステップ６０６の判定がＹＥＳであるなら
ば、文字切り出し部１０３は、領域［ｘ_2l-1，ｘ_2l］に
おいて、ステップ６０３で算出された垂直ヒストグラム
の頻度値がしきい値ｄより大きい所定値以下で、かつ、
下記条件式を満たす値ｋを算出する。If the determination in step 606 is YES, the character cutout unit 103 calculates, for the region $[x_{2l-1}, x_{2l}]$, a value k such that the frequency value of the vertical histogram calculated in step 603 is at or below a predetermined value larger than the threshold d and the following conditional expression is satisfied.

【００６５】[0065]

【数３】ｈ≒（ｘ_2l−ｘ_2l-1）／ｋこの結果、領域［ｘ_2l-1，ｘ_2l］をｋ分割した各位置を
文字切り出し位置として算出する（以上、ステップ６０
７）。図８(d) の例においては、ｌ＝１、ｋ＝２とな
り、領域［ｘ₁，ｘ₂］を２分割した位置ｘ′が文字切
り出し位置として算出される。[Equation 3]
$$h \approx (x_{2l} - x_{2l-1}) / k$$
As a result, each position that divides the region $[x_{2l-1}, x_{2l}]$ into k parts is calculated as a character cutout position (step 607). In the example of FIG. 8(d), l = 1 and k = 2, and the position x′ that divides the region $[x_1, x_2]$ in two is calculated as the character cutout position.

【００６６】その後、文字切り出し部１０３は、その制
御処理を終了する。以上説明した図６の動作フローチャ
ートは、文字切り出し部１０３が、文字数が予め与えら
れていないフィールドに対して実行する文字切り出し処
理に対応するものである。After that, the character cutout unit 103 ends the control processing. The operation flowchart of FIG. 6 described above corresponds to the character cutout processing executed by the character cutout unit 103 for a field for which the number of characters has not been given in advance.
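The flow of FIG. 6 (steps 601 to 607) can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the patented implementation: the image is a list of rows of 0/1 pixels, a column counts as part of a run when its histogram value exceeds d, step 605 is approximated by greedily pairing each rising edge with the first falling edge that gives a width close to h, and an over-wide run is split at equal divisions without the additional valley check on the histogram. All function and parameter names are hypothetical.

```python
def row_histogram(img):
    """Step 601: number of black pixels on each horizontal scanning line."""
    return [sum(row) for row in img]


def column_histogram(img):
    """Step 603: number of black pixels on each vertical scanning line."""
    return [sum(col) for col in zip(*img)]


def string_height(img, c):
    """Step 602: distance between the first rows, seen from the top and
    from the bottom, whose frequency exceeds c."""
    rows = [y for y, v in enumerate(row_histogram(img)) if v > c]
    return rows[-1] - rows[0] if rows else 0


def edges(hist, d):
    """Step 604: rising edges (odd-numbered points) and falling edges
    (even-numbered points) of the histogram relative to threshold d."""
    rises, falls, inside = [], [], False
    for x, v in enumerate(hist):
        if v > d and not inside:
            rises.append(x)
            inside = True
        elif v <= d and inside:
            falls.append(x)
            inside = False
    if inside:                      # run touches the right border
        falls.append(len(hist))
    return rises, falls


def cut_characters(img, c, d, t1, t2):
    """Return a list of (left, right) character regions."""
    h = string_height(img, c)
    if h == 0:
        return []
    rises, falls = edges(column_histogram(img), d)
    regions, i = [], 0
    while i < len(rises):
        for j in range(i, len(falls)):
            # Step 605: accept a region whose width is close to h (Equation 1).
            if h - t1 <= falls[j] - rises[i] <= h + t2:
                regions.append((rises[i], falls[j]))
                i = j + 1
                break
        else:
            width = falls[i] - rises[i]
            if width > h + t2:
                # Steps 606-607: touching characters, split into k parts
                # with h approximately equal to width / k (Equation 3).
                k = max(2, round(width / h))
                bounds = [rises[i] + round(width * p / k) for p in range(k + 1)]
                regions.extend(zip(bounds, bounds[1:]))
            else:
                # Narrow run that could not be merged: kept as is (assumption).
                regions.append((rises[i], falls[i]))
            i += 1
    return regions
```

The greedy pairing is what lets a character written as separate strokes be recovered as one region, in the same way that FIG. 8(c) yields $[x_5, x_8]$ rather than two narrow regions.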

【００６７】これに対して、候補単語バッファ１１２か
ら読み出される候補単語の情報に基づいて再認識処理が
実行される場合のように、文字切り出し部１０３が、文
字切り出しの対象となる領域とその領域内の文字数が予
め与えられているフィールドに対して文字切り出し処理
を実行する場合もある。On the other hand, the character cutout unit 103 may also perform character cutout on a field for which the region to be cut and the number of characters in that region are given in advance, as when re-recognition is executed based on the candidate word information read from the candidate word buffer 112.

【００６８】この場合には、文字切り出し部１０３は、
図６のステップ６０５〜６０７の処理群の代わりに、図
７のステップ７０１の処理を実行する。即ち、文字切り
出しの対象となる領域の左端のｘ座標がｘ_s、右端のｘ
座標がｘ_t、上記領域内の文字数がｎとして与えられた
ときに、文字切り出し部１０３は、図６のステップ６０
３で算出された垂直ヒストグラムの頻度値が所定値以下
で、かつ、下記条件式を満たす値Ｘ_nに近い隣接間隔を
有する位置を文字切り出し位置として算出する。In this case, the character cutout unit 103 executes step 701 in FIG. 7 in place of steps 605 to 607 in FIG. 6. That is, given the x-coordinate $x_s$ of the left end of the region to be cut, the x-coordinate $x_t$ of its right end, and the number of characters n in the region, the character cutout unit 103 calculates, as character cutout positions, positions at which the frequency value of the vertical histogram calculated in step 603 of FIG. 6 is at or below a predetermined value and whose spacing from the adjacent position is close to the value $X_n$ that satisfies the following expression.

【００６９】[0069]

【数４】（ｘ_t−ｘ_s）／ｎ＝Ｘ_n 具体的には、隣接する２つの文字切り出し位置をｘ_i，
ｘ_i+1（ｉ＝１，２，・・・、ｘ_s≦ｘ_i，ｘ_i+1≦ｘ
_t）としたときに、文字切り出し部１０３は、下記条件
式を満たす文字切り出し位置ｘ_i（ｘ_i≠ｘ_s，ｘ_t）
を算出する。[Equation 4]
$$(x_t - x_s) / n = X_n$$
Specifically, letting two adjacent character cutout positions be $x_i$ and $x_{i+1}$ ($i = 1, 2, \ldots$; $x_s \le x_i$, $x_{i+1} \le x_t$), the character cutout unit 103 calculates the character cutout positions $x_i$ ($x_i \ne x_s, x_t$) that satisfy the following conditional expression.

【００７０】[0070]

【数５】Ｘ_n−ｔ₅≦ｘ_i+1−ｘ_i≦Ｘ_n＋ｔ₆ ここで、ｔ₅及びｔ₆は入力文字列１０１の学習サンプ
ルによって決まるパラメータである。[Equation 5]
$$X_n - t_5 \le x_{i+1} - x_i \le X_n + t_6$$
Here, $t_5$ and $t_6$ are parameters determined from training samples of the input character string 101.
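When the region and the character count are known, step 701 might be sketched as below. The sketch assumes a greedy left-to-right search: each new cut is the low-histogram column closest to one pitch $X_n$ after the previous cut, within the tolerance of Equation 5. The fallback to the ideal position when no valley is found, and the names `cut_with_known_count` and `valley_max`, are assumptions made for illustration; the spacing of the last character is not re-checked here.

```python
def cut_with_known_count(col_hist, x_s, x_t, n, valley_max, t5, t6):
    """Return the cut positions [x_s, ..., x_t] for n characters.

    col_hist   -- vertical histogram (black pixels per column)
    valley_max -- a column is a cut candidate if its frequency is <= this
    t5, t6     -- tolerances of Equation 5 around the pitch X_n
    """
    pitch = (x_t - x_s) / n                    # X_n of Equation 4
    cuts = [x_s]
    for _ in range(n - 1):
        ideal = cuts[-1] + pitch
        lo, hi = ideal - t5, ideal + t6        # Equation 5 window
        candidates = [
            x for x in range(x_s + 1, x_t)
            if lo <= x <= hi and col_hist[x] <= valley_max
        ]
        if candidates:
            cuts.append(min(candidates, key=lambda x: abs(x - ideal)))
        else:
            cuts.append(round(ideal))          # fallback: an assumption
    cuts.append(x_t)
    return cuts
```

Because the number of characters comes from the candidate word, this variant can separate touching characters even where the histogram never drops below the threshold d.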

【００７１】以上説明した文字切り出し部１０３による
文字切り出し処理の後、特徴抽出部１０５が、その切り
出された１文字から、認識のための特徴量である特徴ベ
クトルを抽出する（図２のステップ２０２）。After the character cutout processing by the character cutout unit 103 described above, the feature extraction unit 105 extracts, from each cut-out character, a feature vector, which is the feature quantity used for recognition (step 202 in FIG. 2).

【００７２】具体的には、特徴抽出部１０５は、例えば
以下の一連の処理によって特徴ベクトルを抽出する。即
ちまず、特徴抽出部１０５は、切り出された文字のイメ
ージデータから文字輪郭画素を抽出する。Specifically, the feature extracting unit 105 extracts a feature vector by, for example, the following series of processes. That is, first, the feature extraction unit 105 extracts a character contour pixel from the image data of the cut out character.

【００７３】次に、特徴抽出部１０５は、その切り出さ
れた領域を複数の分割領域に分割する。更に、特徴抽出
部１０５は、各分割領域につき、その分割領域内の輪郭
画素毎に方向成分（例えば、縦方向、横方向、左斜め方
向、右斜め方向の４方向成分）を抽出し、その分割領域
内の全輪郭画素の方向成分を集計することによりその分
割領域内の各方向成分毎の集計値を算出し、それらを各
方向成分に対応する要素値として有する部分特徴ベクト
ルを算出する。Next, the feature extraction unit 105 divides the cut-out region into a plurality of divided regions. For each divided region, the feature extraction unit 105 then extracts a directional component for every contour pixel in that region (for example, one of four directional components: vertical, horizontal, left diagonal, and right diagonal), totals the directional components of all contour pixels in the region to obtain a total for each direction, and calculates a partial feature vector whose elements are those per-direction totals.

【００７４】最後に、特徴抽出部１０５は、全ての分割
領域の部分特徴ベクトルの各要素を統合することによ
り、特徴ベクトルを抽出する。上述のようにして特徴抽
出部１０５が切り出された文字の特徴ベクトルを抽出し
た後に、マッチング部１０６が、その切り出された文字
の特徴ベクトルと、特定文字標準パターン辞書１０７内
の各特定文字標準パターンの特徴ベクトルとの間のマッ
チング処理を実行し（図２のステップ２０３）、マッチ
ング度が高い順に所定順位までの各特定文字標準パター
ンが属する各特定文字の字種カテゴリーを、上記切り出
された文字に対する候補特定文字群として候補文字列バ
ッファ１０８に出力する（図２のステップ２０４）。Finally, the feature extraction unit 105 extracts the feature vector by combining the elements of the partial feature vectors of all the divided regions. After the feature extraction unit 105 has extracted the feature vector of the cut-out character in this way, the matching unit 106 performs matching between that feature vector and the feature vector of each specific character standard pattern in the specific character standard pattern dictionary 107 (step 203 in FIG. 2). It then outputs to the candidate character string buffer 108, as the candidate specific character group for the cut-out character, the character type categories of the specific characters to which the specific character standard patterns belong, in descending order of matching degree down to a predetermined rank (step 204 in FIG. 2).
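The feature extraction outlined in the preceding paragraphs (contour pixels, divided regions, per-direction totals, concatenation) could look roughly like the sketch below. The patent does not say how a contour pixel or its direction is determined, so both rules here are assumptions: a contour pixel is a black pixel with at least one white or out-of-image 4-neighbor, and it contributes to a direction when a neighboring pixel along that direction is also a contour pixel. All names are hypothetical.

```python
# Neighbor offsets (dy, dx) that define each of the four directions.
DIRECTIONS = {
    "horizontal": ((0, -1), (0, 1)),
    "vertical": ((-1, 0), (1, 0)),
    "left_diagonal": ((-1, -1), (1, 1)),
    "right_diagonal": ((1, -1), (-1, 1)),
}


def is_black(img, y, x):
    """True if (y, x) lies inside the image and is a black pixel."""
    return 0 <= y < len(img) and 0 <= x < len(img[0]) and img[y][x] == 1


def contour_pixels(img):
    """Black pixels with at least one white or out-of-image 4-neighbor."""
    four = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return {
        (y, x)
        for y, row in enumerate(img)
        for x, v in enumerate(row)
        if v == 1 and any(not is_black(img, y + dy, x + dx) for dy, dx in four)
    }


def feature_vector(img, zones_y, zones_x):
    """Concatenate the per-zone, per-direction contour pixel counts."""
    contour = contour_pixels(img)
    height, width = len(img), len(img[0])
    names = list(DIRECTIONS)
    vec = [0] * (zones_y * zones_x * len(names))
    for y, x in contour:
        # Index of the divided region that contains this pixel.
        zone = (y * zones_y // height) * zones_x + (x * zones_x // width)
        for d, name in enumerate(names):
            if any((y + dy, x + dx) in contour for dy, dx in DIRECTIONS[name]):
                vec[zone * len(names) + d] += 1
    return vec
```

With a grid of `zones_y` by `zones_x` regions the vector has `4 * zones_y * zones_x` elements, so a finer grid captures more positional detail at the cost of a longer vector.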

【００７５】より具体的には、マッチング部１０６が、
その切り出された文字の特徴ベクトルと、特定文字標準
パターン辞書１０７内の各特定文字標準パターンの特徴
ベクトルとの間で、例えば距離（ユークリッド距離、マ
ハラノビス距離等）を計算する。そして、マッチング部
１０６は、距離が小さい順に所定順位（ｎ位）までの各
特定文字標準パターンが属する各特定文字の字種カテゴ
リーを、上述の切り出された特定文字に対する候補特定
文字群として候補文字列バッファ１０８に出力する。More specifically, the matching unit 106 calculates, for example, a distance (Euclidean distance, Mahalanobis distance, or the like) between the feature vector of the cut-out character and the feature vector of each specific character standard pattern in the specific character standard pattern dictionary 107. The matching unit 106 then outputs to the candidate character string buffer 108, as the candidate specific character group for the cut-out character, the character type categories of the specific characters to which the specific character standard patterns belong, in ascending order of distance down to a predetermined rank (n-th).
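A minimal sketch of this distance-based ranking is shown below, using the Euclidean distance. It also applies the reject rule with threshold $T_1$ that the next paragraph describes. The function name, the representation of the dictionary as a list of `(category, vector)` pairs, and the choice to return an empty list on rejection are assumptions for illustration.

```python
import math


def rank_candidates(feature, templates, n, reject_threshold):
    """Return up to n character categories, nearest template first.

    templates        -- list of (category, feature_vector) standard patterns;
                        a category may have several templates
    reject_threshold -- T_1: if even the best distance exceeds it, reject
    """
    scored = sorted((math.dist(feature, vec), cat) for cat, vec in templates)
    if not scored or scored[0][0] > reject_threshold:
        return []                      # rejected (unrecognizable)
    ranked = []
    for _, cat in scored:
        if cat not in ranked:          # keep each category once
            ranked.append(cat)
        if len(ranked) == n:
            break
    return ranked
```

Swapping `math.dist` for a Mahalanobis distance would require per-category covariance information, which this sketch does not model.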

【００７６】なお、１位の特定文字標準パターンの距離
が所定のしきい値Ｔ₁より大きい場合は、その切り出さ
れた文字にはリジェクト（認識不能）情報が付加され
る。ここで、上述の特定文字標準パターン辞書１０７に
ついて、具体例を挙げて説明する。If the distance to the first-ranked specific character standard pattern is greater than a predetermined threshold $T_1$, reject (unrecognizable) information is added to the cut-out character. The specific character standard pattern dictionary 107 is described below with a concrete example.

【００７７】今、入力文字列１０１が住所文字列である
場合を考える。本実施の形態では、最初は、例えば住所
文字列において、その階層構造の区切りを示す出現頻度
が高い、「都」「道」「府」「県」「市」「区」「郡」
「町」「村」「字」「大字」等の１文字又は２文字から
なる特定文字のみが認識されればよい。また、住所文字
列においては、「東」「西」「南」「北」等の特定文字
も出現頻度が高い。Now, consider the case where the input character string 101 is an address character string. In this embodiment, it is initially sufficient to recognize only the specific characters of one or two characters that mark the boundaries of the hierarchical structure and appear frequently in address strings, such as 「都」, 「道」, 「府」, 「県」, 「市」, 「区」, 「郡」, 「町」, 「村」, 「字」, and 「大字」. Specific characters such as 「東」 (east), 「西」 (west), 「南」 (south), and 「北」 (north) also appear frequently in address strings.

【００７８】このため、本実施の形態では、これらの特
定文字の認識精度を高めるために、これらの特定文字の
標準パターンのみから構成され辞書容量の小さな特定文
字標準パターン辞書１０７が使用される。For this reason, in this embodiment, in order to improve the recognition accuracy of these specific characters, a specific character standard pattern dictionary 107 having only a standard pattern of these specific characters and having a small dictionary capacity is used.

【００７９】このような特定文字標準パターン辞書１０
７が標準パターン辞書１１３とは別に用意されることに
より、認識処理速度を短縮し、かつ認識精度を高めるこ
とが可能となる。Preparing such a specific character standard pattern dictionary 107 separately from the standard pattern dictionary 113 makes it possible to shorten the recognition processing time and to improve recognition accuracy.

【００８０】なお、特定文字標準パターン辞書１０７が
標準パターン辞書１１３と同じ辞書として構成され、特
定文字の認識精度を高めるために、各特定文字毎に多く
のテンプレート（標準パターン）が記憶されるように構
成されてもよい。Alternatively, the specific character standard pattern dictionary 107 may be configured as the same dictionary as the standard pattern dictionary 113, with many templates (standard patterns) stored for each specific character in order to improve the recognition accuracy for the specific characters.

【００８１】一方、入力文字列１０１が氏名文字列であ
る場合には、住所文字列のように区切りとなる文字は存
在しないが、出現頻度の高い文字種は存在する。例え
ば、名字に使用される文字は出現頻度において上位５０
０位までの文字種が８２％程度をカバーしているため、
上位Ｎ文字で特定文字標準パターン辞書１０７が作成さ
れるように構成することができる。On the other hand, when the input character string 101 is a name character string, there is no delimiter character of the kind found in address strings, but some character types do appear frequently. For example, among the characters used in surnames, the 500 most frequent character types cover about 82% of occurrences, so the specific character standard pattern dictionary 107 can be built from the top N characters.

【００８２】或いは、標準パターン辞書１１３から選択
的にＮ字種のみが特定文字の認識に使用されるように構
成されてもよい。そして、特定文字辞書１１０は、上述
の特定の字種に対応するように構成される。Alternatively, the configuration may be such that only the N character types are selectively used from the standard pattern dictionary 113 for recognition of a specific character. The specific character dictionary 110 is configured to correspond to the specific character type described above.

【００８３】また、出現頻度によって字種を選択するの
ではなく、認識しやすい文字を多数の実データから統計
的に決定し、それらの決定された字種を選択するように
構成してもよい。Alternatively, instead of selecting character types by appearance frequency, characters that are easy to recognize may be determined statistically from a large amount of real data, and the character types so determined may be selected.

【００８４】文字切り出し部１０３、特徴抽出部１０
５、及びマッチング部１０６による上記一連の特定文字
認識処理は、文字切り出し部１０３が入力文字列１０１
の先頭から順に切り出した文字毎に実行される（図２の
ステップ２０５→２０２の繰り返し）。この結果、候補
文字列バッファ１０８には、入力文字列１０１から切り
出された文字の並び順に対応する並び順で、各文字毎の
候補特定文字群が保持される。＜特定文字間領域の候補単語の検索とその領域での再認
識処理＞候補単語検索部１０９は、候補文字列バッファ
１０８に得られた候補特定文字群の集合の中から隣接す
る任意の２つの特定文字からなる組（特定文字組）を全
て抽出し、それぞれの特定文字組が特定文字辞書１１０
に登録されているか否かを検索する。そして、候補単語
検索部１０９は、１組の特定文字組が特定文字辞書１１
０に登録されている場合、その登録レコードにリンクす
る知識辞書１１１中のレコードから、その特定文字組を
構成する２つの特定文字により挟まれる単語群を検索
し、その検索された単語群を候補単語群として候補単語
バッファ１１２に保持する（以上、図３のステップ２０
６）。The series of specific character recognition processes described above, performed by the character cutout unit 103, the feature extraction unit 105, and the matching unit 106, is executed for each character that the character cutout unit 103 cuts out in order from the beginning of the input character string 101 (repetition of steps 205 → 202 in FIG. 2). As a result, the candidate character string buffer 108 holds the candidate specific character group for each character, in an order corresponding to the order of the characters cut out from the input character string 101.

＜Search for Candidate Words in Regions Between Specific Characters and Re-recognition in Those Regions＞

The candidate word search unit 109 extracts, from the set of candidate specific character groups held in the candidate character string buffer 108, every set consisting of two adjacent specific characters (a specific character set), and checks whether each specific character set is registered in the specific character dictionary 110. When a specific character set is registered in the specific character dictionary 110, the candidate word search unit 109 retrieves, from the records in the knowledge dictionary 111 linked to that registered record, the group of words sandwiched between the two specific characters of the set, and holds the retrieved word group in the candidate word buffer 112 as a candidate word group (step 206 in FIG. 3).

【００８５】今、入力文字列１０１が住所文字列である
場合を考える。なお、住所文字列以外の氏名文字列、品
名文字列等については、階層構造を持たないため、階層
構造に関する部分を除いて住所文字列の場合と同様に実
現できる。Now, consider a case where the input character string 101 is an address character string. It should be noted that the name character string, the product name character string, and the like other than the address character string do not have a hierarchical structure, and thus can be realized in the same manner as the case of the address character string except for the part related to the hierarchical structure.

【００８６】住所辞書である知識辞書１１１の構造は、
例えば図１０に示されるように、住所の階層構造に従っ
て、レベル１：都道府県、レベル２：市区郡、レベル
３：町村、・・・というように分割されて、それぞれの
階層に属する単語が格納されている。The knowledge dictionary 111, which here is an address dictionary, is structured, for example, as shown in FIG. 10: following the hierarchical structure of addresses, it is divided into level 1 (prefectures: 都道府県), level 2 (cities, wards, and counties: 市区郡), level 3 (towns and villages: 町村), and so on, and the words belonging to each level are stored.

【００８７】一方、特定文字辞書１１０には、図１１に
示されるように、「文字１」と「文字２」という２つの
特定文字からなる特定文字組に対応するレコード毎に、
その特定文字組を構成する２つの特定文字により挟まれ
る単語群が格納されている知識辞書１１１上のレコード
の集合を示すための、ポインタ情報とそのポインタから
始まるデータ数情報とからなるデータ組が格納されてい
る。このデータ組としては、図１１に示されるように複
数組指定することができ、特定文字辞書１１０の各特定
文字組毎のレコードには、図１１に示されるように、上
記ポインタ情報とデータ数情報のデータ組の数に対応す
るポインタ数情報Ｎも記憶される。On the other hand, as shown in FIG. 11, for each record corresponding to a specific character set consisting of two specific characters, "character 1" and "character 2", the specific character dictionary 110 stores data sets, each consisting of pointer information and information on the number of data items starting at that pointer. Together these indicate the set of records in the knowledge dictionary 111 that hold the words sandwiched between the two specific characters of the set. As shown in FIG. 11, more than one such data set can be specified, and the record for each specific character set in the specific character dictionary 110 also stores pointer count information N, which corresponds to the number of data sets of pointer information and data count information.

【００８８】図１２の例では、特定文字辞書１１０内
の、空白文字と「県」という２つの特定文字からなる特
定文字組に対応するレコードには、図１０に示される知
識辞書１１１内のレベル１領域内の単語「青森」から始
まるｎ₁個のレコードと、同じくレベル１領域内の単語
「神奈川」から始まるｎ₂個のレコードをそれぞれ示す
データ組（ポインタ情報とデータ数情報）と、ポインタ
数Ｎ＝２が登録されている。In the example of FIG. 12, the record in the specific character dictionary 110 for the specific character set consisting of a blank character and 「県」 registers data sets (pointer information and data count information) that indicate, respectively, $n_1$ records starting at the word 「青森」 in the level 1 area of the knowledge dictionary 111 shown in FIG. 10 and $n_2$ records starting at the word 「神奈川」 in the same level 1 area, together with the pointer count N = 2.

【００８９】また図１３の例では、特定文字辞書１１０
内の、「都」と「区」という２つの特定文字からなる特
定文字組に対応するレコードには、図１０に示される知
識辞書１１１内のレベル２領域内の単語「千代田」から
始まるｎ₃個のレコードと、ポインタ数Ｎ＝１が登録さ
れている。In the example of FIG. 13, the specific character dictionary 110
In the record corresponding to the specific character set consisting of two specific characters “To” and “ku”, n ₃ starting from the word “Chiyoda” in the level 2 area in the knowledge dictionary 111 shown in FIG. Records and the number of pointers N = 1 are registered.

【００９０】また、住所は通常、「・・・丁目・・・番
地・・・方」という書き方で終わるが、このような特定
文字「丁目」「番地」「番」「方」「号」に挟まれた領
域には、単語ではなく数字が記入される場合が多い。こ
のような場合には、図１４に示されるように、特定文字
辞書１１０内の、上記特定文字からなる特定文字組に対
応するレコードには、前述したようんポインタ情報とデ
ータ数情報とかなるデータ組ではなく、「（数字）＊
ｎ」というような記号が設定される。候補単語検索部１
０９は、特定文字辞書１１０から上述したような記号が
設定されているレコードを検索した場合には、上述のよ
うな特定文字に挟まれた領域には数字が連続して記入さ
れていることを検出し、その旨を示す検出結果を候補単
語バッファ１１２に書き込む。Addresses usually end in the form 「・・・丁目・・・番地・・・方」, and the region sandwiched between specific characters such as 「丁目」, 「番地」, 「番」, 「方」, and 「号」 often contains digits rather than a word. In such a case, as shown in FIG. 14, the record in the specific character dictionary 110 for a specific character set made up of these specific characters holds a symbol such as 「（数字）＊ｎ」 ("(digit) * n") instead of the data sets of pointer information and data count information described above. When the candidate word search unit 109 retrieves a record holding such a symbol from the specific character dictionary 110, it detects that digits are written consecutively in the region sandwiched between those specific characters and writes a detection result to that effect to the candidate word buffer 112.
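The link between the two dictionaries described in the preceding paragraphs might be modeled as in the sketch below. It is an illustration only: the knowledge dictionary is reduced to a flat list, the blank character is represented by an empty string, the sample words and counts are placeholders rather than the contents of FIGS. 10 to 14, and the names `KNOWLEDGE`, `SPECIFIC`, `DIGITS`, and `candidate_words` are all hypothetical.

```python
# Hypothetical marker for "digits are written here" (cf. the symbol in FIG. 14).
DIGITS = "(digits)*n"

# Flat stand-in for the knowledge dictionary: words stored consecutively.
KNOWLEDGE = ["青森", "秋田", "岩手", "神奈川", "埼玉", "千代田", "中央", "港"]

# Specific character dictionary: (character 1, character 2) -> either a list
# of (pointer, count) data sets, whose length plays the role of N, or DIGITS.
SPECIFIC = {
    ("", "県"): [(0, 3), (3, 2)],     # N = 2 data sets
    ("都", "区"): [(5, 3)],           # N = 1 data set
    ("丁目", "番地"): DIGITS,
}


def candidate_words(first, second):
    """Return the candidate words between two specific characters."""
    entry = SPECIFIC.get((first, second))
    if entry is None:
        return []                     # pair not registered
    if entry == DIGITS:
        return [DIGITS]               # caller should expect consecutive digits
    words = []
    for pointer, count in entry:
        words.extend(KNOWLEDGE[pointer:pointer + count])
    return words
```

Storing a pointer and a count instead of the words themselves lets several specific character sets share the same stretch of the knowledge dictionary without duplicating it.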

【００９１】更に、例えば図１５に示されるような特定
文字辞書１１０及び知識辞書１１１の構成も可能であ
る。即ち、図１５の例では、特定文字辞書１１０内の、
空白文字と「川」という２つの特定文字からなる特定文
字組に対応するレコードには、知識辞書１１１内の４文
字の単語「神奈川県」を指すポインタ情報及びデータ数
＝１と、知識辞書１１１内の２文字の単語「神奈」を指
すポインタ情報及びデータ数＝１が設定される。Further, a configuration of the specific character dictionary 110 and the knowledge dictionary 111 such as that shown in FIG. 15 is also possible. In the example of FIG. 15, the record in the specific character dictionary 110 for the specific character set consisting of a blank character and 「川」 holds pointer information and a data count of 1 pointing to the four-character word 「神奈川県」 in the knowledge dictionary 111, and pointer information and a data count of 1 pointing to the two-character word 「神奈」 in the knowledge dictionary 111.

【００９２】また特定文字辞書１１０内の、「川」と
「中」という２つの特定文字からなる特定文字組に対応
するレコードには、知識辞書１１１内の２文字の単語
「崎市」を指すポインタ情報及びデータ数＝１が設定さ
れる。The record in the specific character dictionary 110 for the specific character set consisting of 「川」 and 「中」 holds pointer information and a data count of 1 pointing to the two-character word 「崎市」 in the knowledge dictionary 111.

【００９３】更に特定文字辞書１１０内の、「中」と
「中」という２つの特定文字からなる特定文字組に対応
するレコードには、知識辞書１１１内の５文字の単語
「原区上小田」を指すポインタ情報及びデータ数＝１が
設定される。Further, the record in the specific character dictionary 110 for the specific character set consisting of 「中」 and 「中」 holds pointer information and a data count of 1 pointing to the five-character word 「原区上小田」 in the knowledge dictionary 111.

【００９４】このように、住所文字列に高い頻度で出現
する特定文字及び単語に対応する情報を、特定文字辞書
１１０と知識辞書１１１に記憶させることも可能であ
る。次に、図１６に示されるように、特定文字辞書１１
０内の、「区」と住所の終わりを示す特定文字の２つの
特定文字からなる特定文字組に対応するレコードに、知
識辞書１１１内の単語「丸の内」がリンクしている場合
に、表示のゆらぎとして、「丸の内」ではなく「丸ノ
内」という文字列が記入される可能性がある。このよう
な場合に、知識辞書１１１に全ての表記のゆらぎに対応
する単語を記憶させるのは無駄である。In this way, information on specific characters and words that appear frequently in address strings can also be stored in the specific character dictionary 110 and the knowledge dictionary 111. Next, as shown in FIG. 16, suppose that the word 「丸の内」 in the knowledge dictionary 111 is linked to the record in the specific character dictionary 110 for the specific character set consisting of 「区」 and the specific character indicating the end of the address. Because of notation variation, the string 「丸ノ内」 may be written instead of 「丸の内」. Storing words for every such notation variant in the knowledge dictionary 111 would be wasteful.

【００９５】そこで、本実施の形態では、特定文字辞書
１１０からリンクする知識辞書１１１内の単語の検索時
に、図９の動作フローチャートで示される表記のゆれに
対処するための制御動作が実行される。Therefore, in the present embodiment, when searching for a word in the knowledge dictionary 111 linked from the specific character dictionary 110, a control operation for coping with the fluctuation of the notation shown in the operation flowchart of FIG. 9 is executed. .

【００９６】まず、候補単語検索部１０９は、１組の特
定文字組に対し、特定文字辞書１１０及び知識辞書１１
１をここまで説明してきた規則に従って検索し、その結
果検索された単語群を現在処理中の特定文字組に対応す
る候補単語群として候補単語バッファ１１２に書き込む
（図９のステップ９０１）。このステップ９０１は、図
３のステップ２０６の一部である。First, the candidate word search unit 109 searches the specific character dictionary 110 and the knowledge dictionary 111 for one specific character set according to the rules described so far, and writes the retrieved word group to the candidate word buffer 112 as the candidate word group for the specific character set currently being processed (step 901 in FIG. 9). Step 901 is part of step 206 in FIG. 3.

【００９７】次に、図３のステップ２０６の一部とし
て、候補単語検索部１０９は、１組の特定文字組に対し
て候補単語バッファ１１２に得られた候補単語群のそれ
ぞれに対して、図９のステップ９０２〜９１０で示され
る一連の処理を繰り返し実行する。Next, as part of step 206 in FIG. 3, the candidate word search unit 109 repeatedly executes the series of processes shown in steps 902 to 910 of FIG. 9 for each word of the candidate word group obtained in the candidate word buffer 112 for one specific character set.

【００９８】即ち、候補単語検索部１０９は、検出した
単語を構成する文字にひらがなが存在する場合に、その
ひらがなをカタカナに変更し、その結果得られる単語
を、現在処理中の特定文字組に対応する他の候補単語と
して、候補単語バッファ１１２に書き込む（図９のステ
ップ９０２→９０３）。That is, when there are hiragana in the characters constituting the detected word, the candidate word search unit 109 changes the hiragana to katakana, and converts the resulting word into the specific character set currently being processed. It is written into the candidate word buffer 112 as another corresponding candidate word (steps 902 → 903 in FIG. 9).

【００９９】次に、候補単語検索部１０９は、検出した
単語を構成する文字にカタカナが存在する場合に、その
カタカナをひらがなに変更し、その結果得られる単語
を、現在処理中の特定文字組に対応する他の候補単語と
して、候補単語バッファ１１２に書き込む（図９のステ
ップ９０４→９０５）。Next, if there are katakana in the characters constituting the detected word, the candidate word search unit 109 changes the katakana to hiragana, and replaces the resulting word with the specific character set currently being processed. Is written into the candidate word buffer 112 as another candidate word corresponding to (step 904 → 905 in FIG. 9).

【０１００】次に、候補単語検索部１０９は、検出した
単語を構成する文字に漢数字が存在する場合に、その漢
数字をアラビア数字に変更し、その結果得られる単語
を、現在処理中の特定文字組に対応する他の候補単語と
して、候補単語バッファ１１２に書き込む（図９のステ
ップ９０６→９０７）。Next, the candidate word search unit 109 changes the Kanji numerals to Arabic numerals when the characters constituting the detected word include Kanji numerals, and replaces the resulting word with the currently processed word. It is written into the candidate word buffer 112 as another candidate word corresponding to the specific character set (steps 906 → 907 in FIG. 9).

【０１０１】次に、候補単語検索部１０９は、検出した
単語を構成する文字にアラビア数字が存在する場合に、
そのアラビア数字を漢数字に変更し、その結果得られる
単語を、現在処理中の特定文字組に対応する他の候補単
語として、候補単語バッファ１１２に書き込む（図９の
ステップ９０８→９０９）。Next, when the characters making up the detected word include Arabic numerals, the candidate word search unit 109 changes those Arabic numerals to kanji numerals and writes the resulting word to the candidate word buffer 112 as another candidate word for the specific character set currently being processed (steps 908 → 909 in FIG. 9).

【０１０２】最後に候補単語検索部１０９は、検出した
単語を構成する文字に省略可能文字（例えば「溝ノ口」
が「溝口」と省略されたときの「ノ」）が存在する場合
に、その省略可能文字を省略して得られる文字列を、現
在処理中の特定文字組に対応する他の候補単語として、
候補単語バッファ１１２に書き込む（図９のステップ９
０８→９０９）。Finally, when the characters making up the detected word include an omissible character (for example, the 「ノ」 in 「溝ノ口」, which may be shortened to 「溝口」), the candidate word search unit 109 writes the character string obtained by omitting that character to the candidate word buffer 112 as another candidate word for the specific character set currently being processed (steps 908 → 909 in FIG. 9).

【０１０３】候補単語検索部１０９は、１組の特定文字
組に対して候補単語バッファ１１２にまだ表記のゆらぎ
に対する制御処理を実行していない候補単語群がある場
合には、上述の図９のステップ９０２〜９１０で示され
る一連の処理を繰り返し実行する（図９のステップ９１
１→９０２〜９１０→９１１の繰り返し）。If the candidate word buffer 112 still holds candidate words for the specific character set on which the notation variation processing has not yet been performed, the candidate word search unit 109 repeats the series of processes shown in steps 902 to 910 of FIG. 9 (repetition of steps 911 → 902 to 910 → 911 in FIG. 9).

【０１０４】上述のようにして、１組の特定文字組に対
して候補単語バッファ１１２に得られた候補単語群に対
して、表記のゆらぎに対する制御が実現される。以上の
ようにして、候補文字列バッファ１０８から選択された
１組の特定文字組に対して候補単語バッファ１１２に候
補単語群が得られる。As described above, the control for the fluctuation of the notation is realized for the candidate word group obtained in the candidate word buffer 112 for one specific character set. As described above, a candidate word group is obtained in the candidate word buffer 112 for one specific character set selected from the candidate character string buffer 108.
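The notation variation handling of FIG. 9 can be sketched as a function that derives extra candidate words from one dictionary word. The sketch makes simplifying assumptions: numerals are converted digit by digit (so forms such as 「十」 are not handled), Arabic numerals are ASCII digits, and the set of omissible characters is a small illustrative guess. The names `notation_variants`, `OMISSIBLE`, and the conversion tables are hypothetical.

```python
# Hiragana U+3041..U+3096 map to katakana U+30A1..U+30F6 (offset 0x60).
HIRA_TO_KATA = {chr(c): chr(c + 0x60) for c in range(0x3041, 0x3097)}
KATA_TO_HIRA = {kata: hira for hira, kata in HIRA_TO_KATA.items()}
KANJI_TO_ARABIC = dict(zip("〇一二三四五六七八九", "0123456789"))
ARABIC_TO_KANJI = {arabic: kanji for kanji, arabic in KANJI_TO_ARABIC.items()}
OMISSIBLE = set("ノのケヶ")   # illustrative assumption


def convert(word, table):
    """Replace every character found in table; leave the others untouched."""
    return "".join(table.get(ch, ch) for ch in word)


def notation_variants(word):
    """Return the additional candidate words derived from one word."""
    derived = (
        convert(word, HIRA_TO_KATA),                        # hiragana -> katakana
        convert(word, KATA_TO_HIRA),                        # katakana -> hiragana
        convert(word, KANJI_TO_ARABIC),                     # kanji -> Arabic numerals
        convert(word, ARABIC_TO_KANJI),                     # Arabic -> kanji numerals
        "".join(ch for ch in word if ch not in OMISSIBLE),  # drop omissible chars
    )
    variants = []
    for candidate in derived:
        if candidate != word and candidate not in variants:
            variants.append(candidate)
    return variants
```

Generating variants at search time keeps the knowledge dictionary small, which is the motivation the text gives for not storing every spelling.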

[0105] Suppose now that, for example, the input character string 101 shown in FIG. 17 has been entered. The specific character recognition process of steps 201 to 205 in FIG. 2 described above recognizes area 1701 as the specific character 「都」 (to) and area 1702 as the specific character 「区」 (ku).

[0106] For this recognition result, in step 206 of FIG. 3 described above, the candidate word search unit 109 detects in the specific character dictionary 110 the record of the specific character set consisting of a blank character and the specific character 「都」. From the entry in the knowledge dictionary 111 linked to that registered record, it retrieves the single word 「東京」 (Tokyo) sandwiched between the two specific characters constituting that set, and holds the retrieved word in the candidate word buffer 112 as the candidate word group corresponding to the specific character set consisting of the blank character and 「都」. In this case the candidate word group for this specific character set contains one word, and, as shown in FIG. 18, the candidate word 「東京」 has two characters.

[0107] In addition, in step 206 of FIG. 3 executed for the second time, after the determination in step 211 of FIG. 3 described later, the candidate word search unit 109 detects in the specific character dictionary 110 the record of the specific character set consisting of the specific characters 「都」 and 「区」. From the entries in the knowledge dictionary 111 shown in FIG. 10 that are linked to that registered record, it retrieves the 23 words 「千代田」 (Chiyoda), 「中央」 (Chuo), 「港」 (Minato), ... that are sandwiched between the two specific characters constituting that set, and holds the retrieved words in the candidate word buffer 112 as the candidate word group corresponding to that specific character set. In this case the candidate word group contains 23 words, and, as shown in FIG. 19, each candidate word has three, two, or one characters.
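The lookup in the two examples above amounts to a table keyed by a pair of adjacent specific characters. The sketch below is a minimal stand-in, not the patent's data layout: `KNOWLEDGE`, `candidate_words`, and the use of `None` for the blank character are assumptions for illustration, and the ward list is truncated to the three words the text names.

```python
from typing import Optional

# Hypothetical, truncated stand-in for the specific character dictionary 110
# and the knowledge dictionary 111 linked to it. The key is
# (left specific character, right specific character); None is the blank.
KNOWLEDGE: dict[tuple[Optional[str], str], list[str]] = {
    (None, "都"): ["東京"],
    ("都", "区"): ["千代田", "中央", "港"],  # the patent's example has 23 entries
}


def candidate_words(left: Optional[str], right: str) -> list[tuple[str, int]]:
    """Return (candidate word, character count) pairs for one character pair."""
    return [(word, len(word)) for word in KNOWLEDGE.get((left, right), [])]


if __name__ == "__main__":
    print(candidate_words(None, "都"))
    print(candidate_words("都", "区"))
```

Returning the character count with each word reflects how the later segmentation step uses that count to decide how many pieces to cut the area into.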

[0108] After a candidate word group has been obtained in the candidate word buffer 112 for the one specific character set selected from the candidate character string buffer 108, the character cutout unit 103, the feature extraction unit 105, and the matching unit 106 execute, for each candidate word belonging to that group, the series of re-recognition processes of steps 207 to 211 in FIG. 3, thereby extracting a group of candidate recognition results down to a predetermined rank for each candidate word.

[0109] First, in the input character string 101 read from the image memory 102, the character cutout unit 103 uses the information about the candidate word output from the candidate word buffer 112 to cut out again the character string in the area sandwiched between the two specific characters that constitute the specific character set to which the candidate word belongs (step 207 in FIG. 3).

[0110] In this case, if the candidate word has two characters, like 「東京」 shown in FIG. 18 or 「中央」 shown in FIG. 19, the character cutout unit 103 follows the operation flowcharts shown in steps 601 to 604 of FIG. 6 and step 701 of FIG. 7 described above, divides the area to be cut out into two (n = 2 in Equation 3 described above), and determines the cutout position of each character.

[0111] If the candidate word has three characters, like 「千代田」 shown in FIG. 19, the character cutout unit 103 divides the area to be cut out into three (n = 3 in Equation 3 described above) and determines the cutout position of each character.

[0112] Furthermore, if the candidate word has one character, like 「港」 shown in FIG. 19, the character cutout unit 103 assumes that only one character exists in the area to be cut out (n = 1 in Equation 3 described above).
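The effect of choosing n from the candidate word's character count can be illustrated with a simple splitter. Equation 3 is defined earlier in the document and is not reproduced in this section, so the sketch assumes a plain equal-width division along the writing direction; the actual equation may weight the boundaries differently. The name `split_region` and the pixel-coordinate interface are hypothetical.

```python
def split_region(x_start: int, x_end: int, n: int) -> list[tuple[int, int]]:
    """Split the span [x_start, x_end) into n equal-width character slots.

    Assumption: equal-width division. This stands in for Equation 3, which
    is not shown in this section and may differ.
    """
    if n < 1 or x_end <= x_start:
        raise ValueError("need n >= 1 and a non-empty region")
    width = x_end - x_start
    # n + 1 boundaries, rounded to whole pixels.
    bounds = [x_start + round(width * k / n) for k in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    print(split_region(0, 90, 3))   # three slots for a word like 千代田
    print(split_region(0, 90, 1))   # one slot for a word like 港
```

Because the same area is re-segmented once per candidate word, a three-character hypothesis and a one-character hypothesis each get their own cut positions before matching.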

[0113] Next, the feature extraction unit 105 extracts a feature vector, as described above, from each character of the re-cut character string (step 208 in FIG. 3). The matching unit 106 then executes, for each character, a matching process between the feature vector of that character and the feature vector of each standard pattern in the standard pattern dictionary 113, which is the second dictionary (step 209 in FIG. 3), and outputs to the candidate character string buffer 108, as the candidate character group for that character, the character categories to which the standard patterns belong, in descending order of matching degree down to a predetermined rank (step 210 in FIG. 3).

[0114] More specifically, the matching unit 106 calculates, for example, a distance (Euclidean distance, Mahalanobis distance, or the like) between the feature vector of the character and the feature vector of each standard pattern in the standard pattern dictionary 113. The matching unit 106 then outputs to the candidate character string buffer 108, as the candidate character group for that character, the character categories to which the standard patterns belong, in ascending order of distance down to a predetermined rank (n-th place).
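The ranking step just described can be sketched with the Euclidean distance, one of the two metrics the text names. The function name `top_n_candidates`, the `(category, vector)` layout of the standard patterns, and the toy two-dimensional vectors are assumptions for illustration; real feature vectors would come from the feature extraction unit.

```python
import math
from typing import Sequence


def top_n_candidates(
    feature: Sequence[float],
    standard_patterns: Sequence[tuple[str, Sequence[float]]],
    n: int,
) -> list[tuple[str, float]]:
    """Return the n closest (category, distance) pairs, nearest first.

    Uses the Euclidean distance; a Mahalanobis distance would need the
    covariance of each category and is not sketched here.
    """
    scored = sorted(
        (math.dist(feature, vector), category)
        for category, vector in standard_patterns
    )
    return [(category, distance) for distance, category in scored[:n]]


if __name__ == "__main__":
    patterns = [("東", (0.0, 0.0)), ("束", (3.0, 4.0)), ("京", (6.0, 8.0))]
    print(top_n_candidates((0.0, 0.0), patterns, 2))
```

Keeping the distance next to each category is deliberate: the later stage averages these per-character distances over a whole candidate string.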

[0115] Once a candidate character group, in ascending order of distance down to a predetermined rank, has been obtained in the candidate character string buffer 108 for each character of the character string re-cut by the character cutout unit 103, the series of processes of steps 207 to 210 is repeated for the other candidate words belonging to the candidate word group obtained in the candidate word buffer 112 for the one specific character set.

[0116] When, for every candidate word belonging to the candidate word group obtained in the candidate word buffer 112 for one specific character set, a candidate character group down to a predetermined rank has been obtained in the candidate character string buffer 108 for each constituent character, the matching unit 106 generates, for each candidate word, a candidate character string group by taking every combination of the candidate characters at each character position, and calculates the average distance of each candidate character string in that group by the following equation (step 212 in FIG. 3).

[0117]

[Equation 6]

$$\frac{D_1 + D_2 + \cdots + D_m}{m}$$

Here, $m$ is the number of characters in the target candidate word, and $D_i$ ($1 \le i \le m$) is the distance of the candidate character selected at the $i$-th character position of the target candidate word to form the target candidate character string.

[0118] The matching unit 106 then selects, from the candidate character string groups generated for all candidate words of one specific character set, a predetermined number (P) of candidate character strings in ascending order of average distance, and outputs them to the knowledge processing unit 114 as the recognition result for the character area sandwiched between the two specific characters constituting that specific character set.
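The combination, averaging, and selection steps described above can be sketched as follows. The function name `best_strings` and the nested-list input layout are hypothetical, and the distances in the example are invented purely to make the ordering visible.

```python
from itertools import product

# One candidate word is represented as a list of character positions; each
# position holds the (character, distance) candidates kept for it.
Position = list[tuple[str, float]]


def best_strings(words: list[list[Position]], p: int) -> list[tuple[str, float]]:
    """Pool every candidate string of every candidate word and keep the P
    strings with the smallest average distance (D_1 + ... + D_m) / m."""
    scored = []
    for positions in words:
        if not positions:
            continue  # a word with no character positions yields no strings
        for combo in product(*positions):
            text = "".join(ch for ch, _ in combo)
            average = sum(d for _, d in combo) / len(combo)
            scored.append((average, text))
    scored.sort()
    return [(text, average) for average, text in scored[:p]]


if __name__ == "__main__":
    two_chars = [[("東", 1.0), ("束", 2.0)], [("京", 1.0), ("長", 3.0)]]
    one_char = [[("港", 4.0)]]
    print(best_strings([two_chars, one_char], p=2))
```

Dividing by the word length is what lets strings from one-, two-, and three-character hypotheses compete on the same scale.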

[0119] When the recognition result for the character area sandwiched between the two specific characters constituting one specific character set has been obtained in this way, processing returns from step 213 to step 206 in FIG. 3.

[0120] Then, from the set of candidate specific characters obtained in the candidate character string buffer 108 by the specific character recognition process of steps 201 to 205 in FIG. 2 described above, another specific character set consisting of another arbitrary pair of adjacent specific characters is extracted, and the series of control processes of steps 206 to 212 in FIG. 3 is executed again for that set to compute the recognition result for the character area sandwiched between its two specific characters. This operation is repeated for each specific character set (the loop of steps 213 → 206-212 → 213 in FIG. 3).

[0121] The knowledge processing unit 114 applies knowledge processing, using the entry field definition 104 and the knowledge dictionary 111, to the recognition results for the character areas sandwiched between the two specific characters of each specific character set, determines the final recognition result for the entire character area made up of those character areas, and outputs it to the recognition result buffer 115 (step 214 in FIG. 4).

[0122] The series of control processes described above, from step 201 in FIG. 2 to step 214 in FIG. 4, is repeated for each entry field position on the form, so that the final recognition result for each entry field is determined (the loop from step 215 in FIG. 4 back to step 201 in FIG. 2).

[0123] In the series of recognition processes described above, reject (unrecognizable) information is attached to any character or character-string portion that never satisfied the recognition conditions. In that case, the recognition result obtained in the recognition result buffer 115 is displayed on the display unit 117 via the interface unit 116. Looking at the recognition result displayed on the display unit 117, the user can correct the unrecognizable character or character string from the input unit 118, which consists of a mouse, a keyboard, and the like.

[0124] The user only has to specify, from the input unit 118, a particular correct character within the unrecognizable character or character string; information about that correct character is then output from the interface unit 116 to the correct character buffer 119 and the area coordinate buffer 120.

[0125] In the example of FIG. 21, the image 2101 of the target character string is displayed on the display unit 117 in parallel with the recognition result 2102. When the user points to a specific area 2103 on the image 2101 with the mouse or the like of the input unit 118, the corresponding recognition result character 2104 is highlighted or shown in reverse video. When the user then enters the correct character 「都」 from the keyboard or the like of the input unit 118, information about the correct character 「都」 is output from the interface unit 116 to the correct character buffer 119 and the area coordinate buffer 120. Naturally, if the user points to the area on the image 2101 corresponding to, for example, 「東京」, and corrects the corresponding recognition result 「束長」 to 「東京」, information about the correct characters 「東京」 is output from the interface unit 116 to the correct character buffer 119 and the area coordinate buffer 120.

[0126] The candidate word search unit 109 treats the information about the correct character obtained in the correct character buffer 119 as specific-character information and executes the candidate word search process using the specific character dictionary 110 and the knowledge dictionary 111 described above, so that unrecognizable characters can be re-recognized correctly. In addition, the character cutout unit 103 obtains from the area coordinate buffer 120 the cutout position of the correct character designated by the user, and can thereby cut out the characters correctly.

[0127] In the example of FIG. 22, the image of the target character string is displayed on the display unit 117 in parallel with the recognition result 2202. When the user points to a specific area 2201 on that image with the mouse or the like of the input unit 118, the corresponding recognition result character 2203 is highlighted or shown in reverse video, and recognition result candidates 2204 are displayed at the indicated position. When the user then selects the correct character 「都」 with the keyboard or the like of the input unit 118, information about the correct character 「都」 is output from the interface unit 116 to the correct character buffer 119 and the area coordinate buffer 120. The recognition result candidates 2204 displayed at the indicated position can be configured to appear in order of the appearance frequency of the displayed characters, in the order determined by the hierarchical structure when the string has one (as an address character string does), or simply in character code order.

[0128] Following the example of FIG. 22, as shown in FIG. 23, the same correction process is applied to the indicated position 2301 and the corresponding recognition result position 2302, which makes it possible to re-recognize the character string 2303 correctly.

[0129] Regarding the re-recognition process for each character area sandwiched between the two specific characters constituting each specific character set, steps 207 to 212 in FIG. 3 described above are configured so that re-recognition is performed individually for each character constituting one candidate word, and the recognition result for that candidate word is finally output.

[0130] In this case, efficient re-recognition is achieved by limiting the character types that the matching unit 106 searches for in the standard pattern dictionary 113 to the character types of the category to which the candidate word belongs.

[0131] Alternatively, feature vector extraction and the matching process by the matching unit 106 may be performed on the entire character area sandwiched between two specific characters. In this case, the standard pattern dictionary 113 holds feature vectors of standard patterns in which each word, such as 「川崎」 (Kawasaki), 「横浜」 (Yokohama), 「横須賀」 (Yokosuka), ..., is treated as a single pattern, and the matching unit 106 matches the feature vector obtained by treating an entire candidate word as one pattern against the feature vector of each standard pattern in the standard pattern dictionary 113.

[0132] In this case, efficient re-recognition is achieved by limiting the word group that the matching unit 106 searches for in the standard pattern dictionary 113 to the word group of the category to which the candidate word belongs.

[0133] More specifically, in recognizing an address character string, for example, efficient re-recognition is achieved by limiting the word group that the matching unit 106 searches for in the standard pattern dictionary 113 to the word group constituting the hierarchical level to which the candidate word belongs.

[0134] For example, as shown in FIG. 20, in the re-recognition of the area sandwiched between the two specific characters 「県」 (ken, prefecture) and 「市」 (shi, city), the standard pattern dictionary 113 can be limited to the word group representing cities, such as 「川崎」, 「横浜」, 「横須賀」, and so on.

[0135] Furthermore, in recognizing an address character string, for example, when a recognition result for an upper level has already been obtained, still more efficient re-recognition is achieved by limiting the word group that the matching unit 106 searches for in the standard pattern dictionary 113 to the words that belong to that upper-level recognition result and constitute the lower level to which the candidate word belongs.

[0136] For example, if the level-1 recognition result of an address character string is 「青森」 (Aomori), the level-2 standard patterns can be limited to the word group representing the cities belonging to 「青森県」 (Aomori Prefecture), rather than all the words that can appear between the two specific characters 「県」 and 「市」.
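The narrowing of the level-2 vocabulary by an already recognized level-1 result can be sketched as a filtered lookup. The table `CITIES_BY_PREFECTURE` and the function `city_vocabulary` are hypothetical and list only a few cities per prefecture; they stand in for the knowledge dictionary and standard pattern dictionary and are not their actual layout.

```python
from typing import Optional

# Hypothetical, truncated hierarchy: level 1 (prefecture) -> level 2 (cities).
CITIES_BY_PREFECTURE: dict[str, list[str]] = {
    "青森": ["青森", "弘前", "八戸"],
    "神奈川": ["川崎", "横浜", "横須賀"],
}


def city_vocabulary(prefecture: Optional[str] = None) -> list[str]:
    """Return the city words to match between 「県」 and 「市」.

    With a known level-1 result only that prefecture's cities are returned;
    otherwise every city in the table is a possible match.
    """
    if prefecture in CITIES_BY_PREFECTURE:
        return list(CITIES_BY_PREFECTURE[prefecture])
    vocabulary: list[str] = []
    for cities in CITIES_BY_PREFECTURE.values():
        for city in cities:
            if city not in vocabulary:  # keep first occurrence, preserve order
                vocabulary.append(city)
    return vocabulary


if __name__ == "__main__":
    print(city_vocabulary("青森"))
    print(len(city_vocabulary()))
```

A smaller vocabulary means fewer standard patterns to compare against each area, which is the source of the efficiency gain the text describes.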

[0137] Conversely, in recognizing an address character string, for example, when a recognition result for a lower level has already been obtained, the word group that the matching unit 106 searches for in the standard pattern dictionary 113 can be limited to the words that the lower-level recognition result belongs to and that constitute the upper level to which the candidate word belongs, so that an unrecognizable state at the upper level can also be rescued.

Supplement regarding a recording medium on which a program for realizing the present embodiment is recorded

The present invention can also be configured as a computer-readable recording medium that, when used by a computer, causes the computer to perform the same functions as those realized by the components of the embodiment of the present invention described above.

[0138] In this case, as shown in FIG. 24, a program that realizes the various functions of the embodiment of the present invention is loaded, from a portable recording medium 2402 such as a floppy disk, CD-ROM disc, optical disc, or removable hard disk, or via a network line 2403, into the memory (RAM, hard disk, or the like) 2405 in the main body 2404 of the computer 2401, and executed.

[0139]

[Effects of the Invention] According to the character recognition technique of the present invention, a specific character or specific character string in the input character string is recognized first, candidate words before and after it are hypothesized on the basis of that recognition result, and the characters constituting the input character string are then re-recognized using information about those candidate words. This makes it possible to recognize with high accuracy the characters of an input character string written at irregular intervals or by an irregular entry method, as is commonly seen on various forms (slips).

[0140] According to the character correction technique of the present invention, correcting only a specific character or character string makes it possible to correct the other unrecognizable portions automatically. According to the notation-fluctuation control technique of the present invention, it is possible to cope flexibly with various entry methods.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of an embodiment of the present invention.

FIG. 2 is an overall control operation flowchart (part 1) of the embodiment of the present invention.

FIG. 3 is an overall control operation flowchart (part 2) of the embodiment of the present invention.

FIG. 4 is an overall control operation flowchart (part 3) of the embodiment of the present invention.

FIG. 5 is a diagram showing an example of the data format of the entry field definition.

FIG. 6 is a control operation flowchart (part 1) of the character cutout unit.

FIG. 7 is a control operation flowchart (part 2) of the character cutout unit.

FIG. 8 is an explanatory diagram of the control operation of the character cutout unit.

FIG. 9 is a control operation flowchart for notation fluctuation.

FIG. 10 is a structural diagram of the knowledge dictionary (addresses).

FIG. 11 is a structural diagram of the specific character dictionary.

FIG. 12 is a diagram showing a structural example (part 1) of the specific character dictionary 110.

FIG. 13 is a diagram showing a structural example (part 2) of the specific character dictionary 110.

FIG. 14 is a diagram showing a structural example (part 3) of the specific character dictionary 110.

FIG. 15 is a diagram showing a structural example (part 4) of the specific character dictionary 110.

FIG. 16 is an explanatory diagram of the notation-fluctuation control operation.

FIG. 17 is an explanatory diagram (part 1) of the operation of the candidate word search unit.

FIG. 18 is an explanatory diagram (part 2) of the operation of the candidate word search unit.

FIG. 19 is an explanatory diagram (part 3) of the operation of the candidate word search unit.

FIG. 20 is an explanatory diagram of the character string detection/recognition operation using the standard pattern dictionary.

FIG. 21 is an explanatory diagram (part 1) of the operation of the input unit and the display unit.

FIG. 22 is an explanatory diagram (part 2) of the operation of the input unit and the display unit.

FIG. 23 is an explanatory diagram (part 3) of the operation of the input unit and the display unit.

FIG. 24 is an explanatory diagram of a recording medium on which a program for realizing the present embodiment is recorded.

[Explanation of symbols]

101 Input character string
102 Image memory
103 Character cutout unit
104 Entry field definition
105 Feature extraction unit
106 Matching unit
107 Specific character standard pattern dictionary
108 Candidate character string buffer
109 Candidate word search unit
110 Specific character dictionary
111 Knowledge dictionary
112 Candidate word buffer
113 Standard pattern dictionary
114 Knowledge processing unit
115 Recognition result buffer
116 Interface unit
117 Display unit
118 Input unit
119 Correct character buffer
120 Area coordinate buffer

Claims

[Claims]

1. A character recognition method for recognizing the characters constituting an input character string entered in an entry field having a predetermined category, the method comprising: extracting a specific character or specific character string from the input character string by executing a first matching process between the input character string and a first recognition dictionary; extracting, from a category-specific word dictionary, a candidate word group that belongs to the predetermined category and may be located in the areas of the input character string before or after each specific character or specific character string extracted from the input character string; and recognizing the characters constituting the input character string by executing, for each candidate word belonging to the extracted candidate word group and on the basis of information about that candidate word, a second matching process using a second recognition dictionary on the area of the input character string where that candidate word may be located.

2. The character recognition method according to claim 1, wherein standard patterns corresponding to the specific character or specific character string are stored in the first recognition dictionary, and the specific character or specific character string is extracted from the input character string by executing the first matching process between the pattern of the input character string and each standard pattern in the first recognition dictionary.

3. The character recognition method according to claim 1, wherein standard patterns corresponding to the characters or character strings of the candidate words belonging to the candidate word group are stored in the second recognition dictionary, and the characters constituting the input character string are recognized by executing, for each candidate word belonging to the candidate word group and on the basis of information about that candidate word, the second matching process between the pattern of each area of the input character string where that candidate word is located and the standard patterns in the second recognition dictionary.

4. The character recognition method according to claim 1, wherein information on the number of characters of each candidate word is used as the information about that candidate word.

5. The character recognition method according to claim 1, wherein the second recognition dictionary, which includes the first recognition dictionary, is used as the first recognition dictionary.

6. The character recognition method according to claim 1, wherein a specific character or specific character string that appears frequently in the predetermined category is extracted from the input character string by executing the first matching process between the input character string and the first recognition dictionary.

7. The character recognition method according to claim 1, wherein a specific character or specific character string having high recognition accuracy is extracted from the input character string by executing the first matching process between the input character string and the first recognition dictionary.

8. A character correction method using the character recognition method according to claim 1, comprising: displaying the recognition results of the characters constituting the input character string in parallel with the input character string; designating a desired area on the displayed input character string and correcting the character or character string corresponding to that area; and re-recognizing the characters constituting the input character string by executing the candidate word group extraction process and the second matching process again on the basis of information about the correct character or correct character string given by the correction.

9. The character correction method according to claim 8, further comprising displaying a plurality of candidate recognition results for the desired area in response to the designation of that area on the displayed input character string.

10. The character recognition method or character correction method according to claim 1, further comprising outputting, for each candidate word, a word that is a notational variant of that candidate word as a new candidate word belonging to the candidate word group.

11. A character recognition device for recognizing the characters constituting an input character string entered in an entry field having a predetermined category, comprising: specific character / specific character string extracting means for extracting a specific character or specific character string from the input character string by executing a first matching process between the input character string and a first recognition dictionary; candidate word group extraction means for extracting, from a category-based word dictionary, a candidate word group that belongs to the predetermined category and may be located in an area of the input character string before or after each specific character or specific character string extracted from the input character string; and input character string recognizing means for recognizing the characters constituting the input character string by executing, for each candidate word belonging to the candidate word group, a second matching process using a second recognition dictionary on the area of the input character string where that candidate word may be located, based on information about the candidate word.

12. A computer-readable recording medium storing a program that, when read and used by a computer, causes the computer to perform: a function of extracting a specific character or specific character string from an input character string entered in an entry field having a predetermined category, by executing a first matching process between the input character string and a first recognition dictionary; a function of extracting, from a category-based word dictionary, a candidate word group that belongs to the predetermined category and may be located in an area of the input character string before or after each specific character or specific character string extracted from the input character string; and a function of recognizing the characters constituting the input character string by executing, for each candidate word belonging to the extracted candidate word group, a second matching process using a second recognition dictionary on the area of the input character string where that candidate word may be located, based on information about the candidate word.
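
For illustration only, the three functions recited in claims 11 and 12 can be sketched in Python as follows. This is a toy sketch under loud assumptions, not the claimed implementation: the invention matches character image patterns against pattern dictionaries, whereas the sketch stands in already-digitized noisy text for the input image, exact substring search for the first matching process, and `difflib.SequenceMatcher` similarity for the second matching process. The names `FIRST_DICTIONARY`, `CATEGORY_WORDS`, `extract_specific_strings` and `recognize`, and all dictionary contents, are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical stand-ins for the dictionaries named in the claims.
# First recognition dictionary: specific strings expected in an address field.
FIRST_DICTIONARY = ["City", "Ward"]
# Category-based word dictionary: words that may precede each specific string.
CATEGORY_WORDS = {
    "City": ["Kawasaki", "Yokohama", "Kamakura"],
    "Ward": ["Nakahara", "Takatsu", "Miyamae"],
}


def extract_specific_strings(text):
    """First matching (toy version): locate every specific string in the input.

    Returns (start, end, string) tuples sorted by position.
    """
    hits = []
    for s in FIRST_DICTIONARY:
        pos = text.find(s)
        while pos != -1:
            hits.append((pos, pos + len(s), s))
            pos = text.find(s, pos + 1)
    return sorted(hits)


def recognize(text):
    """Candidate extraction and second matching (toy version).

    For the area before each specific string, look up the candidate words
    linked to that string and keep the candidate most similar to the area.
    """
    result = []
    area_start = 0
    for start, end, specific in extract_specific_strings(text):
        area = text[area_start:start].strip()
        candidates = CATEGORY_WORDS[specific]
        best = max(candidates,
                   key=lambda w: SequenceMatcher(None, area, w).ratio())
        result.append((best, specific))
        area_start = end
    return result


if __name__ == "__main__":
    # "Kawasak1" and "Nakahera" imitate low-quality handwriting.
    print(recognize("Kawasak1 City Nakahera Ward"))
```

In this sketch the misread areas "Kawasak1" and "Nakahera" resolve to "Kawasaki" and "Nakahara" because only words linked to the detected specific strings are tried, which mirrors the narrowing effect that the candidate word group extraction is intended to provide.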