JPH09231318A

JPH09231318A - Character recognizing device

Info

Publication number: JPH09231318A
Application number: JP8037852A
Authority: JP
Inventors: Shiori Ooaku; 志緒理大阿久
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1996-02-26
Filing date: 1996-02-26
Publication date: 1997-09-05

Abstract

PROBLEM TO BE SOLVED: To provide dictionary retrieval capable of absorbing the variation of uppercase letter/lowercase letter without increasing the size of a collating dictionary. SOLUTION: The character recognizing device recognizes a character string from picture information, collates a candidate lattice storing each recognized character and its candidate characters with a word dictionary by a language processing part 106 and outputs a finally determined character string. The language processing part 106 is provided with an English candidate lattice preparing part 108 for preparing an English candidate lattice capable of storing candidate characters from the 1st candidate up to the lowest candidate by maximum word length in order to collate an English word with a dictionary and an alphabetical uppercase letter/lowercase letter adding part 109 for adding an uppercase letter or a lowercase letter to an English candidate lattice so as to provide both the uppercase and lowercase letters of the character in a pair.

Description

Detailed Description of the Invention

【０００１】[0001]

【発明が属する技術分野】本発明は日本語・英語が混在
した文書を対象とした文字認識装置に関し，より詳細に
は，認識した文字列に対する各候補文字を含む候補ラテ
ィスを対象に言語処理し，該言語処理により最終的に決
定した文字列を出力する文字認識装置に関する。BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a character recognition device for a document in which Japanese and English are mixed, and more specifically, it performs language processing on a candidate lattice including each candidate character for a recognized character string. The present invention relates to a character recognition device that outputs a character string finally determined by the language processing.

【０００２】[0002]

【従来の技術】従来における一般的な文字認識装置は，
認識された文字列を，より尤もらしい文とするために認
識後の後処理を行うための言語処理部を備えている。こ
の言語処理部は，認識結果の候補ラティスに対し，単語
辞書を用いて候補となる単語を検索し，確定するもので
ある。そして，この処理の結果，認識結果の第１候補文
字に不具合があれば，さらに第２候補文字以降の文字と
入れ換えるなどの処理を実行する。2. Description of the Related Art A conventional general character recognition device is
The language processing unit is provided for post-processing post-recognition to make the recognized character string a more plausible sentence. The language processing unit searches the candidate lattice of the recognition result by using a word dictionary and determines the candidate word, and determines it. Then, as a result of this processing, if there is a defect in the first candidate character of the recognition result, processing such as replacing with the character after the second candidate character is executed.

【０００３】また，機械翻訳などの分野における辞書検
索装置に関連技術が，特開昭６１−７０６６２号公報や
特開昭６２−８２４６６号公報に開示されている。ここ
では入力文字列中の大文字を小文字に変換する手段を設
け，通常の単語検索で該当する単語がなかった場合，大
文字を小文字に変換した文字列によって単語辞書を検索
している。Further, a technique related to a dictionary search device in the field of machine translation or the like is disclosed in Japanese Patent Laid-Open Nos. 61-70662 and 62-82466. Here, a means for converting uppercase letters into lowercase letters in the input character string is provided, and when there is no corresponding word in a normal word search, the word dictionary is searched by the character string with the uppercase letters converted to lowercase letters.

【０００４】[0004]

【発明が解決しようとする課題】しかしながら，上記に
示されるような従来の装置にあっては，以下に示すよう
な英文字列の場合，英単語辞書と照合する際に問題点が
あった。However, in the conventional apparatus as described above, the following English character strings have a problem in collating with the English word dictionary.

【０００５】第１に，たとえば「ｗｉｎｄｏｗ」，「Ｗ
ＩＮＤＯＷ」，「Ｗｉｎｄｏｗ」というように，大文字
のみ，小文字のみ，あるいは大文字・小文字の混在時な
ど，同じ単語であっても書き手や使用場面に応じて，表
記方法が様々である。First, for example, "window", "W"
There are various notation methods even for the same word, such as only uppercase letters, lowercase letters, or a mixture of uppercase letters and lowercase letters, such as "INDOW" and "Windows".

【０００６】第２に，すべての単語が複数の表記を許容
しているとは限らず，大文字表記のみでしか記述しない
語，たとえば「ＩＮＳ」は「ｉｎｓ」とは記述しないよ
うな略語などもある。Secondly, not all words allow plural notations, and words that are described only in capital letters, such as abbreviations in which "INS" is not described as "ins", etc. is there.

【０００７】第３に，地名・人名など固有名詞は，一般
的に小文字のみの表記で記述しない。たとえば地名「Ｅ
ｎｇｌａｎｄ」は「ｅｎｇｌａｎｄ」とは記述しない
し，人名「Ｊｏｈｎ」は「ｊｏｈｎ」とは記述しない。Thirdly, proper names such as place names and personal names are generally not written in lower case only. For example, the place name "E
"ngland" is not described as "england", and the personal name "John" is not described as "john".

【０００８】そこで，これら３つに示されるような表記
のゆれをすべて記述した単語辞書を作成し，候補ラティ
スと照合する方法が考えられるが，この方法では，照合
用の辞書メモリの容量が大きくなるため，メモリコスト
の増大および辞書検索速度の低下を招来させるという問
題点があった。Therefore, a method of creating a word dictionary that describes all the variations of the notations as shown in these three and collating with a candidate lattice can be considered. However, this method requires a large dictionary memory capacity for collation. Therefore, there is a problem that the memory cost increases and the dictionary search speed decreases.

【０００９】本発明は，上記に鑑みてなされたものであ
って，照合用の辞書サイズを大きくすることなく，大文
字・小文字のゆれを吸収した辞書検索を実現することを
目的とする。The present invention has been made in view of the above, and it is an object of the present invention to realize a dictionary search that absorbs fluctuations in uppercase and lowercase letters without increasing the dictionary size for collation.

【００１０】[0010]

【課題を解決するための手段】上記の目的を達成するた
めに，請求項１に係る文字認識装置にあっては，画像情
報から文字列を認識し，言語処理手段により認識結果文
字とその候補文字を収めた候補ラティスと単語辞書との
照合を行って，最終決定した文字列を出力する文字認識
装置において，前記言語処理手段が，英単語の辞書照合
のために第１候補から下位までの候補文字を最大単語長
収めることが可能な英字用候補を作成する英字用候補ラ
ティス作成手段と，候補文字に英字があれば，その文字
の大文字・小文字の両方が対となるように前記英字用候
補ラティス内に大文字あるいは小文字を追加する英大文
字・小文字追加手段とを備えたものである。In order to achieve the above object, in the character recognition device according to claim 1, a character string is recognized from image information, and the recognition result character and its candidate are recognized by the language processing means. In a character recognition device for collating a candidate lattice containing characters with a word dictionary and outputting a finally determined character string, the language processing means includes first to lower ranks for collating English words in a dictionary. If a candidate character has an alphabetic character, the candidate lattice creating means for creating an alphabetic character candidate capable of storing the candidate character in the maximum word length, and if the candidate character has an alphabetic character, both the uppercase and lowercase letters of the character are paired. It is provided with means for adding uppercase / lowercase letters in the candidate lattice.

【００１１】すなわち，言語処理手段において候補ラテ
ィスと辞書との照合を行う際，候補文字に英字が存在す
れば，その文字の大文字・小文字の両方が対となるよう
に候補ラティス内に大文字あるいは小文字を追加するこ
とにより，大文字・小文字の両方が必ず候補文字となる
ので，実際の原稿中の表記記述が英単語辞書になくても
単語照合を行うことが可能となる。That is, when the candidate lattice is matched with the dictionary in the language processing means, if there is an alphabetic character in the candidate character, uppercase letters or lowercase letters are included in the candidate lattice so that both uppercase letters and lowercase letters of the letters become a pair. By adding "," both uppercase and lowercase letters are always candidate characters, so it is possible to perform word matching even if the written description in the actual manuscript is not in the English word dictionary.

【００１２】このため，英単語辞書を表記の違いによっ
てエントリを増やすなどの措置を取る必要もなくなり，
不必要に辞書サイズを大きくすることもなくなる。ま
た，大文字・小文字を意識せずに英単語の辞書検索が行
えるので，英単語専用の辞書照合のアルゴリズムを新た
に用意する必要もなくなる。Therefore, it is not necessary to take measures such as increasing the number of entries in the English word dictionary depending on the notation,
There is no need to increase the dictionary size unnecessarily. In addition, since the dictionary search for English words can be performed without being aware of uppercase and lowercase letters, it is not necessary to newly prepare a dictionary matching algorithm dedicated to English words.

【００１３】また，請求項２に係る文字認識装置にあっ
ては，前記言語処理手段は，候補文字が追加された候補
ラティスと単語辞書との照合の結果，選択された単語の
表記を各認識文字の候補順位に基づいて，大文字・小文
字の修正を行う英大文字・小文字修正手段をさらに備え
たものである。Further, in the character recognition device according to the second aspect, the language processing means recognizes each notation of the selected word as a result of the matching between the candidate lattice to which the candidate character is added and the word dictionary. It further comprises English uppercase / lowercase correction means for correcting uppercase / lowercase based on the order of candidate characters.

【００１４】すなわち，候補文字が追加された候補ラテ
ィスと単語辞書との照合の結果，該照合された英単語表
記を構成している英字の大文字・小文字を，認識文字の
候補順位に基づいての置き換えるため，実際の原稿に記
述された単語表記に近づけられ，その認識精度が向上す
る。That is, as a result of collation between the candidate lattice to which the candidate character is added and the word dictionary, the uppercase and lowercase letters of the English letters constituting the collated English word notation are determined based on the candidate rank of the recognized character. Because of the replacement, the word notation described in the actual manuscript is approximated, and the recognition accuracy is improved.

【００１５】また，請求項３に係る文字認識装置にあっ
ては，前記言語処理手段は，候補文字が追加された候補
ラティスと単語辞書との照合の結果，選択された単語の
表記がすべて大文字である場合，各認識文字の候補順位
に基づいて評価対象とすべき単語かどうかを判定する英
単語判定手段をさらに備えたものである。Further, in the character recognition device according to the third aspect, the language processing means, as a result of collating the candidate lattice to which the candidate character has been added with the word dictionary, all the notation of the selected word is capitalized. In this case, an English word determination means for determining whether or not the word should be evaluated based on the candidate rank of each recognized character is further provided.

【００１６】すなわち，候補文字が追加された候補ラテ
ィスと単語辞書との照合の結果，選択された単語の表記
がすべて大文字である場合は，各認識文字の候補順位に
基づいて評価対象とすべき単語であるか否かを判断する
ことにより，たとえば略語などに対し，不必要な単語を
評価対象から削除することができるので，処理精度が向
上し，また，言語解析の誤りも減少し，認識精度が向上
する。That is, as a result of collating the candidate lattice to which the candidate character is added with the word dictionary, if the selected word is in all capital letters, it should be evaluated based on the candidate rank of each recognized character. By determining whether or not it is a word, unnecessary words such as abbreviations can be deleted from the evaluation target, which improves processing accuracy and reduces linguistic analysis errors. Accuracy is improved.

【００１７】また，請求項４に係る文字認識装置にあっ
ては，前記言語処理手段は，候補文字が追加された候補
ラティスと単語辞書との照合の結果，選択された単語の
表記が大文字・小文字の混在語である場合，各認識文字
の候補順位に基づいて評価対象とすべき単語かどうかを
判定する英単語判定手段をさらに備えたものである。Further, in the character recognition device according to the fourth aspect, the language processing means, as a result of collating the candidate lattice to which the candidate character is added with the word dictionary, the selected word is capitalized. In the case of mixed words in lower case, the apparatus further comprises English word determination means for determining whether or not the word should be evaluated based on the candidate rank of each recognized character.

【００１８】すなわち，候補文字が追加された候補ラテ
ィスと単語辞書との照合の結果，選択された単語の表記
が大文字・小文字の混在語である場合は，各認識文字の
候補順位に基づいて評価対象とすべき単語であるか否か
を判断することにより，たとえば地名・人名などに対
し，不必要な単語を評価対象から削除することができる
ので，処理精度が向上し，また，言語解析の誤りも減少
し，認識精度が向上する。That is, as a result of matching the candidate lattice to which the candidate character is added with the word dictionary, if the selected word is a mixed word of uppercase and lowercase letters, it is evaluated based on the candidate rank of each recognized character. By determining whether or not a word should be a target, unnecessary words can be deleted from the evaluation target, such as a place name and a person's name, so that processing accuracy is improved and language analysis is performed. Errors are reduced and recognition accuracy is improved.

【００１９】また，請求項５に係る文字認識装置にあっ
ては，前記言語処理手段は，単語表記の大文字・小文字
のゆれをすべて記述することなく，１つの単語に対し１
つの表記で記述された英単語辞書により候補ラティスと
単語辞書とを照合し，候補単語を得る単語照合手段をさ
らに備えたものである。In addition, in the character recognition device according to the fifth aspect, the language processing means does not describe all upper and lower case fluctuations of the word notation, and the language processing means does not include one for each word.
It further comprises a word collating means for collating the candidate lattice and the word dictionary with an English word dictionary described in one notation to obtain a candidate word.

【００２０】すなわち，単語表記の大文字・小文字のゆ
れをすべて記述することなく，１単語につき１表記で記
述された英単語辞書を有することにより，単語辞書のサ
イズが大きくならず，しかも辞書エントリを複数作成す
る手間を省き，かつエントリ作成の作業量を減少させ
る。That is, by having an English word dictionary in which each word is described in one notation without describing all upper and lower case fluctuations in the word notation, the size of the word dictionary is not increased, and the dictionary entries are This saves the effort of creating multiple entries and reduces the work of creating entries.

【００２１】また，請求項６に係る文字認識装置にあっ
ては，前記言語処理手段は，単語辞書照合アルゴリズム
が日本語・英語文字列のいずれにおいても共有である。In the character recognition device according to the sixth aspect, the language processing means has a common word dictionary matching algorithm for both Japanese and English character strings.

【００２２】すなわち，単語辞書照合アルゴリズムが日
本語・英語文字列のいずれにおいても共有し，英単語専
用の辞書照合のアルゴリズムを新た用意する必要がなく
なるので，日本語と英語との混在文書を扱うＯＣＲなど
において，処理の簡略化・高速化が実現する。That is, since the word dictionary matching algorithm is shared in both Japanese and English character strings and it is not necessary to prepare a new dictionary matching algorithm dedicated to English words, a mixed document of Japanese and English is handled. In OCR, etc., simplification and speeding up of processing are realized.

【００２３】[0023]

【発明の実施の形態】以下，本発明の実施の形態を添付
図面を参照して説明する。BEST MODE FOR CARRYING OUT THE INVENTION Embodiments of the present invention will be described below with reference to the accompanying drawings.

【００２４】〔実施の形態〕（実施の形態の構成）図１は，実施の形態に係る文字認
識装置の全体構成を示すブロック図である。図におい
て，この文字認識装置は，認識対象のイメージ画像を入
力する画像入力部１０１と，イメージ画像を行単位で抽
出する行切り出し部１０２と，行切り出し部１０２で抽
出された行部分の文字情報を抽出する文字切り出し部１
０３と，文字切り出し部１０３で切り出された文字情報
を認識する文字認識処理部１０４と，文字認識処理部１
０４で認識された文字を組み合わせ・選択する文字組み
合わせ選択処理部１０５と，後述するように認識処理後
の各候補文字を含む候補ラティスを対象に言語処理を行
う言語処理手段としての言語処理部（後処理部）１０６
と，言語処理部（後処理部）１０６で処理された文字情
報を確定し文字列として出力する確定文字列出力部１０
７とから構成されている。[Embodiment] (Configuration of Embodiment) FIG. 1 is a block diagram showing an overall configuration of a character recognition apparatus according to an embodiment. In the figure, the character recognition device includes an image input unit 101 for inputting an image image to be recognized, a line cutout unit 102 for extracting an image image in units of lines, and character information of line parts extracted by the line cutout unit 102. Character extraction unit 1 to extract
03, a character recognition processing unit 104 that recognizes the character information cut out by the character cutting unit 103, and a character recognition processing unit 1
A character combination selection processing unit 105 for combining and selecting the characters recognized in 04, and a language processing unit (language processing unit for performing language processing on a candidate lattice including each candidate character after recognition processing as described later ( Post-processing unit) 106
And a fixed character string output unit 10 for fixing the character information processed by the language processing unit (post-processing unit) 106 and outputting it as a character string.
7 is comprised.

【００２５】また，言語処理部（後処理部）１０６は，
通常の候補ラティスとは別に，英単語の辞書照合のため
の英字用候補ラティスを作成する英字用候補ラティス作
成手段としての英字用候補ラティス作成部１０８と，候
補文字に英字が存在したときに，該文字の大文字・小文
字の両方が対となるように候補ラティス内に大文字ある
いは小文字を追加する英大文字・小文字追加手段として
の英大文字・小文字追加部１０９と，候補ラティスと単
語辞書を照合し，候補単語を得る単語照合手段としての
単語照合部１１０と，辞書照合した単語の表記が小文字
ではない語に対し，認識文字の候補順位の差により文字
の確度を決定し，候補単語とするかを判定する英単語判
定手段としての英単語判定部１１１と，選択された単語
の表記を各認識文字の候補順位に基づいて大文字・小文
字の修正を実行する英大文字・小文字修正手段としての
英大文字・小文字修正部１１２と，各認識文字の候補順
位に基づいて評価対象とすべき単語であるか否かを判断
する候補単語評価部１１３とから構成されている。The language processing unit (post-processing unit) 106 is
In addition to the normal candidate lattice, an English character candidate lattice creating unit 108 as an English character candidate lattice creating means for creating an English character candidate lattice for dictionary matching of English words, and when an English character exists in the candidate character, The candidate lattice and the word dictionary are collated with an uppercase / lowercase addition unit 109 as an uppercase / lowercase addition means for adding uppercase or lowercase letters in the candidate lattice so that both uppercase and lowercase letters of the character are paired, A word matching unit 110 as a word matching means for obtaining a candidate word, and for a word whose dictionary matching is not in lowercase, determine the accuracy of the character based on the difference in the candidate rank of the recognized character to determine whether it is a candidate word. An English word determination unit 111 as an English word determination unit for determining and an upper case / lower case correction of the notation of the selected word are performed based on the candidate rank of each recognized character. It is composed of an English upper / lower case correction unit 112 as an upper / lower case correction unit, and a candidate word evaluation unit 113 which determines whether or not the word should be evaluated based on the candidate rank of each recognized character. There is.

【００２６】（実施の形態の動作）次に，以上のように
構成された文字認識装置の動作について説明する。ま
ず，画像入力部１０１でイメージ画像を入力し，このイ
メージ画像の文字認識を行い，認識文字とその候補文字
を収めた候補ラティスを作成する。この通常候補ラティ
スの例を図２に示す。(Operation of Embodiment) Next, the operation of the character recognition device configured as described above will be described. First, the image input unit 101 inputs an image image, character recognition of this image image is performed, and a candidate lattice containing the recognized character and its candidate character is created. An example of this normal candidate lattice is shown in FIG.

【００２７】以下，図３に示すフローチャートを用いて
言語処理部（後処理部）１０６の動作を説明する。図に
おいて，まず，対象先頭の文字位置に英字があるか否か
を判断する（Ｓ３０１）。なお，ここでは第１候補文字
だけではなく，下位の候補文字についてもチェックす
る。このステップＳ３０１において，先頭の文字位置に
英字があると判断した場合には，英字用候補ラティス作
成部１０８により処理動作を実行し（Ｓ３０２），さら
に英大文字・小文字追加部１０９による処理動作を実行
する。The operation of the language processing unit (post-processing unit) 106 will be described below with reference to the flowchart shown in FIG. In the figure, first, it is determined whether or not there is an alphabetic character at the character position at the beginning of the target (S301). Here, not only the first candidate character but also the lower candidate characters are checked. If it is determined in this step S301 that there is an alphabetic character at the beginning character position, the alphabetic character candidate lattice creating unit 108 executes the processing operation (S302), and further the English uppercase / lowercase character adding unit 109 executes the processing operation. To do.

【００２８】一方，ステップＳ３０１において，先頭の
文字位置に英字がないと判断した場合，あるいは上記ス
テップＳ３０３の英大文字・小文字追加部１０９による
処理動作が行われた後，通常の単語照合部１１０による
処理を実行する（Ｓ３０４）。以下，英字用候補ラティ
ス作成部１０８，英大文字・小文字追加部１０９，単語
照合部１１０の各動作を詳述する。On the other hand, when it is determined in step S301 that there is no alphabetic character at the leading character position, or after the processing operation by the uppercase / lowercase addition unit 109 in step S303 is performed, the normal word matching unit 110 is used. The process is executed (S304). The operations of the English character candidate lattice creation unit 108, the English uppercase / lowercase addition unit 109, and the word matching unit 110 will be described in detail below.

【００２９】英字用候補ラティス作成部１０８では，通
常の候補ラティスとは別に，英単語の辞書照合のために
第１候補から下位までの候補文字を最大単語長分収める
ことができる英字用候補ラティスを作成する。また，候
補文字内に英字があれば英字用候補ラティスに候補文字
を一式コピーしてゆき，候補内に英字がなくたった時点
で処理を抜ける。この例を図４に示す。その後，英大文
字・小文字追加部１０９による処理動作に移行する。In addition to the normal candidate lattice, the English character candidate lattice creating unit 108 can store the candidate characters from the first candidate to the lower candidate for the maximum word length in order to match the dictionary of English words. To create. Also, if there is an alphabetic character in the candidate character, a set of candidate characters is copied to the English character candidate lattice, and the process ends when there are no more alphabetic characters in the candidate. This example is shown in FIG. After that, the processing shifts to the processing operation by the English uppercase / lowercase addition unit 109.

【００３０】英大文字・小文字追加部１０９では，英字
用候補ラティスに対し，第１候補から下位までの候補文
字内に英字の大文字・小文字が両方とも存在するか否か
をチェックする。存在しない場合は，図５に示す例のよ
うに，候補順位の最後尾の位置に不足文字を追加する。
ここでは，ｎとＮ，Ｄとｄ，というように文字形状に関
係なく強制的に大文字・小文字を対にして追加する。The English upper / lower case addition unit 109 checks whether or not both upper and lower case letters of the alphabet exist in the candidate characters from the first candidate to the lower order with respect to the alphabet candidate lattice. If it does not exist, the missing character is added to the last position of the candidate rank as in the example shown in FIG.
Here, upper and lower case letters are forcibly added regardless of the character shape, such as n and N and D and d.

【００３１】単語照合部１１０では，候補ラティスと単
語辞書とを照合し，候補単語を得る。英字用候補ラティ
スが存在する場合には，英字用候補ラティスと英単語辞
書とを，一方，存在しない場合には，通常の候補ラティ
スと日本語辞書とを照合させる。The word collating unit 110 collates the candidate lattice with the word dictionary to obtain candidate words. If the English character candidate lattice exists, the English character candidate lattice is compared with the English word dictionary. On the other hand, if the English character candidate lattice does not exist, the normal candidate lattice and the Japanese dictionary are matched.

【００３２】なお，照合のアルゴリズムは，Ａ＊アルゴ
リズムなどの一般的に知られている手法を用いてもよ
い。また，単語辞書は日本語と英語の２セット使用して
もよいし，あるいは２つをマージ（ｍｅｒｇｅ）した１
辞書としてもよい。As the matching algorithm, a generally known method such as the A * algorithm may be used. Also, the word dictionary may use two sets of Japanese and English, or merge two (1).
It may be a dictionary.

【００３３】この英単語辞書のエントリ例を図６に示
す。一般の単語は小文字表記，略語などは大文字表記，
地名・人名などは先頭文字のみ大文字表記とする。この
ように１単語は１表記のみをエントリとする。そして，
検索した単語が英単語であれば英単語判定部１１１へ，
そうでなければ候補単語としてセットし，候補単語評価
部１１３でコスト算出などの評価を行う。FIG. 6 shows an example of entries in this English word dictionary. General words are in lower case, abbreviations are in upper case,
Only the first letter of a place name or person's name is capitalized. In this way, one word has only one notation as an entry. And
If the searched word is an English word, go to the English word determination unit 111,
Otherwise, the candidate word is set as a candidate word, and the candidate word evaluation unit 113 performs evaluation such as cost calculation.

【００３４】すなわち，上記ステップ３０４で単語照合
処理がなされ，その候補単語が英単語であるか否かを判
断する（Ｓ３０５）。ここで候補単語が英単語であると
判断した場合，さらに表記が小文字ではないか否かを判
断する（Ｓ３０６）。そして，表記が小文字ではないと
判断した場合に，英単語判定部１１１による処理動作
（Ｓ３０７）に移行する。That is, word matching processing is performed in step 304, and it is determined whether the candidate word is an English word (S305). If it is determined that the candidate word is an English word, it is further determined whether the notation is lowercase (S306). Then, when it is determined that the notation is not lowercase, the process proceeds to the processing operation (S307) by the English word determination unit 111.

【００３５】英単語判定部１１１は，図７に示すよう
に，辞書照合した単語の表記が小文字ではない語に対
し，認識文字の候補順位の差により文字の確度を決定
し，候補単語とすべきであるか否かを判定する。全英文
字表記の語，たとえば「ＩＮＳ」などに対しては，表記
中の文字に小文字の方が大文字よりも候補順が先である
文字が１つでもあれば，その単語は無効とし，セットし
ない。As shown in FIG. 7, the English word determination unit 111 determines the accuracy of a character based on the difference in the candidate character candidate rank for words whose dictionary collation is not lowercase, and regards it as a candidate word. Determine if it should. For a word in all alphabetic characters, such as "INS", if there is at least one character whose lowercase letter is lower in candidate order than uppercase letter, the word is invalid and set. do not do.

【００３６】また，先頭文字が大文字，２文字目以降が
小文字である語，たとえば「Ｉｒａｎ」などに対して
は，先頭の文字のみを評価し，小文字の方が大文字より
も候補順位が先であれば，その単語は無効とし，セット
しない。すなわち，上記英単語判定部１１１による判定
結果がＯＫであるかを判断する（Ｓ３０８）。ＯＫであ
ると判断された場合，これらに該当しない語は，大文字
・小文字修正部１１２に入力され，大文字・小文字修正
部１１２による処理動作（Ｓ３０９）に移行する。ま
た，上記ステップ３０８において，ＯＫではないと判断
した場合，その単語は削除する（Ｓ３１０）。For a word in which the first character is uppercase and the second and subsequent letters are lowercase, for example, "Iran", etc., only the first character is evaluated, and lowercase letters come first before uppercase letters. If so, the word is invalid and is not set. That is, it is determined whether the determination result by the English word determination unit 111 is OK (S308). If it is determined to be OK, words that do not correspond to these are input to the upper / lower case correction unit 112, and the processing operation by the upper / lower case correction unit 112 is performed (S309). If it is determined in step 308 that the word is not OK, the word is deleted (S310).

【００３７】大文字・小文字修正部１１２は，図８に示
すように，英単語表記の各文字を候補ラティス内の各候
補文字を検索し，候補順位が先の大文字あるいは小文字
に置き換える。As shown in FIG. 8, the upper / lower case correction unit 112 searches each character in the English word for each candidate character in the candidate lattice, and replaces the candidate order with the upper case or lower case.

【００３８】（実施の形態の効果）次に，以上説明した
実施の形態が奏する効果について列記する。第１に，言
語処理部１０６において候補ラティスと辞書との照合を
行う際，候補文字に英字が存在すれば，その文字の大文
字・小文字の両方が対となるように候補ラティス内に大
文字あるいは小文字を追加することにより，大文字・小
文字の両方が必ず候補文字となるので，実際の原稿中の
表記記述が英単語辞書になくても単語照合を行うことが
可能となる。(Effects of the Embodiment) Next, the effects of the above-described embodiment will be listed. First, in the language processing unit 106, when matching a candidate lattice and a dictionary, if there is an alphabetic character in the candidate character, uppercase letters or lowercase letters are included in the candidate lattice so that both uppercase and lowercase letters of the candidate character form a pair. By adding "," both uppercase and lowercase letters are always candidate characters, so it is possible to perform word matching even if the written description in the actual manuscript is not in the English word dictionary.

【００３９】このため，英単語辞書を表記の違いによっ
てエントリを増やすなどの措置を取る必要もなくなり，
不必要に辞書サイズを大きくすることもなくなる。ま
た，大文字・小文字を意識せずに英単語の辞書検索が行
えるので，英単語専用の辞書照合のアルゴリズムを新た
に用意する必要もなくなる。Therefore, there is no need to take measures such as increasing the number of entries depending on the notation of the English word dictionary.
There is no need to increase the dictionary size unnecessarily. In addition, since the dictionary search for English words can be performed without being aware of uppercase and lowercase letters, it is not necessary to newly prepare a dictionary matching algorithm dedicated to English words.

【００４０】第２に，候補文字が追加された候補ラティ
スと単語辞書との照合の結果，該照合された英単語表記
を構成している英字の大文字・小文字の置き換えを，認
識文字の候補順位に基づいて行うため，実際の原稿に記
述された単語表記に近づけられ，その認識精度が向上す
る。Secondly, as a result of the matching between the candidate lattice to which the candidate character is added and the word dictionary, the uppercase and lowercase letters of the alphabetic characters constituting the matched English word notation are replaced by the candidate rank of the recognized character. Since it is based on, the word notation described in the actual manuscript can be approximated and the recognition accuracy thereof can be improved.

【００４１】第３に，候補文字が追加された候補ラティ
スと単語辞書との照合の結果，選択された単語の表記が
すべて大文字である場合は，各認識文字の候補順位に基
づいて候補単語評価部１１３で評価対象とすべき単語で
あるか否かを判断することにより，たとえば略語などに
対し，不必要な単語を候補単語評価部１１３の評価対象
から削除することができるので，候補単語評価部１１３
の処理精度が向上する。また，これにより言語解析の誤
りも減少させることができるので，認識精度が向上す
る。Thirdly, as a result of matching the candidate lattice to which the candidate character has been added with the word dictionary, if the selected word is in all capital letters, the candidate word is evaluated based on the candidate rank of each recognized character. By determining whether or not the word should be the evaluation target in the unit 113, unnecessary words can be deleted from the evaluation target in the candidate word evaluation unit 113, for example, for abbreviations. Part 113
Processing accuracy is improved. In addition, this also reduces errors in linguistic analysis, which improves recognition accuracy.

【００４２】第４に，候補文字が追加された候補ラティ
スと単語辞書との照合の結果，選択された単語の表記が
大文字・小文字の混在語である場合は，各認識文字の候
補順位に基づいて候補単語評価部１１３で評価対象とす
べき単語であるか否かを判断することにより，たとえば
地名・人名などに対し，不必要な単語を候補単語評価部
１１３の評価対象から削除することができるので，候補
単語評価部１１３の処理精度が向上する。また，これに
より言語解析の誤りも減少させることができるので，認
識精度が向上する。Fourthly, as a result of collating the candidate lattice to which the candidate character has been added with the word dictionary, if the selected word is a mixed word of uppercase and lowercase letters, it is based on the candidate rank of each recognized character. By determining whether or not the word should be an evaluation target by the candidate word evaluation unit 113, unnecessary words can be deleted from the evaluation target of the candidate word evaluation unit 113 with respect to, for example, a place name or a person's name. Therefore, the processing accuracy of the candidate word evaluation unit 113 is improved. In addition, this also reduces errors in linguistic analysis, which improves recognition accuracy.

【００４３】第５に，単語表記の大文字・小文字のゆれ
をすべて記述することなく，１単語につき１表記で記述
された英単語辞書を有することにより，単語辞書のサイ
ズが大きくならず，しかも辞書エントリを複数作成する
手間を省くことができ，かつエントリ作成の作業量を減
少させることができる。Fifth, by having an English word dictionary in which each word is described in one notation without describing all upper and lower case fluctuations in the word notation, the size of the word dictionary is not increased and the dictionary It is possible to save the trouble of creating a plurality of entries and reduce the work of creating an entry.

【００４４】第６に，単語辞書照合アルゴリズムが日本
語・英語文字列のいずれにおいても共有し，英単語専用
の辞書照合のアルゴリズムを新たに用意にしないため，
日本語と英語との混在文書を扱うＯＣＲ（光学式文字読
取装置）にとって，処理の簡略化・高速化を図ることが
できる。Sixth, the word dictionary matching algorithm is shared in both Japanese and English character strings, and a new dictionary matching algorithm dedicated to English words is not prepared.
For an OCR (optical character reader) that handles a mixed document of Japanese and English, it is possible to simplify and speed up the process.

【００４５】[0045]

【発明の効果】以上説明したように，本発明に係る文字
認識装置（請求項１）によれば，言語処理手段において
候補ラティスと辞書との照合を行う際，候補文字に英字
が存在すれば，その文字の大文字・小文字の両方が対と
なるように候補ラティス内に大文字あるいは小文字を追
加することにより，大文字・小文字の両方が必ず候補文
字となるため，実際の原稿中の表記記述が英単語辞書に
なくても単語照合を行うことができる。As described above, according to the character recognition device of the present invention (claim 1), when the candidate lattice is collated with the dictionary in the language processing means, if the candidate character has an alphabetic character. , By adding uppercase or lowercase letters in the candidate lattice so that both uppercase and lowercase letters become a pair, both uppercase and lowercase letters will always be candidate letters. Word matching can be performed even if it is not in the word dictionary.

【００４６】したがって，英単語辞書を表記の違いによ
ってエントリを増やすなどの措置が不要となるため，不
必要に辞書サイズを大きくすることもなくなる。また，
大文字・小文字を意識せずに英単語の辞書検索が行える
ので，英単語専用の辞書照合のアルゴリズムを新たに用
意する必要もなくなる。Therefore, it is not necessary to increase the size of the dictionary because it is not necessary to increase the number of entries in the English dictionary. Also,
The dictionary search for English words can be done without regard to uppercase and lowercase letters, so there is no need to prepare a new dictionary matching algorithm for English words.

【００４７】また，本発明に係る文字認識装置（請求項
２）によれば，候補文字が追加された候補ラティスと単
語辞書との照合の結果，該照合された英単語表記を構成
している英字の大文字・小文字を，認識文字の候補順位
に基づいての置き換えるため，実際の原稿に記述された
単語表記に近づけることができ，その認識精度が向上す
る。Further, according to the character recognition device of the present invention (claim 2), the collated English word notation is formed as a result of collation of the candidate lattice to which the candidate character is added and the word dictionary. Since uppercase and lowercase letters are replaced based on the candidate rank of the recognized characters, the word notation described in the actual manuscript can be approximated, and the recognition accuracy is improved.

【００４８】また，本発明に係る文字認識装置（請求項
３）によれば，候補文字が追加された候補ラティスと単
語辞書との照合の結果，選択された単語の表記がすべて
大文字である場合は，各認識文字の候補順位に基づいて
評価対象とすべき単語であるか否かを判断することによ
り，たとえば略語などに対し，不必要な単語を評価対象
から削除することができるため，処理精度を向上させる
ことができ，また，言語解析の誤りも減少し，認識精度
が向上する。Further, according to the character recognition device of the present invention (claim 3), when the selected word is all capitalized as a result of matching between the candidate lattice to which the candidate character is added and the word dictionary. Determines whether or not it is a word to be evaluated based on the candidate rank of each recognized character, so that unnecessary words can be deleted from the evaluation target for abbreviations, for example. The accuracy can be improved, linguistic analysis errors can be reduced, and the recognition accuracy can be improved.

【００４９】また，本発明に係る文字認識装置（請求項
４）によれば，候補文字が追加された候補ラティスと単
語辞書との照合の結果，選択された単語の表記が大文字
・小文字の混在語である場合は，各認識文字の候補順位
に基づいて評価対象とすべき単語であるか否かを判断す
ることにより，たとえば地名・人名などに対し，不必要
な単語を評価対象から削除することができるため，処理
精度が向上し，また，言語解析の誤りも減少し，認識精
度が向上する。Further, according to the character recognition device of the present invention (claim 4), as a result of matching between the candidate lattice to which the candidate character is added and the word dictionary, the notation of the selected word is mixed in upper case and lower case. If the word is a word, unnecessary words are deleted from the evaluation target, for example, for a place name or a person's name, by determining whether or not the word should be evaluated based on the candidate rank of each recognized character. As a result, processing accuracy is improved, linguistic analysis errors are reduced, and recognition accuracy is improved.

【００５０】また，本発明に係る文字認識装置（請求項
５）によれば，単語表記の大文字・小文字のゆれをすべ
て記述することなく，１単語につき１表記で記述された
英単語辞書を有するため，単語辞書のサイズが大きくな
らず，しかも辞書エントリを複数作成する手間を省き，
かつエントリ作成の作業量を減少させることができる。Further, according to the character recognition device of the present invention (claim 5), it has an English word dictionary in which each word is described in one notation without describing all the case fluctuations of the word notation. Therefore, the size of the word dictionary does not become large, and the effort of creating multiple dictionary entries is saved.
In addition, it is possible to reduce the amount of work for creating an entry.

【００５１】また，本発明に係る文字認識装置（請求項
６）によれば，単語辞書照合アルゴリズムが日本語・英
語文字列のいずれにおいても共有し，英単語専用の辞書
照合のアルゴリズムを新た用意する必要がなくなるた
め，日本語と英語との混在文書を扱うＯＣＲなどにおい
て，処理の簡略化・高速化を図ることができる。According to the character recognition device of the present invention (claim 6), the word dictionary matching algorithm is shared for both Japanese and English character strings, and a new dictionary matching algorithm dedicated to English words is newly prepared. Since it is not necessary to do so, it is possible to simplify and speed up the processing in OCR that handles mixed documents of Japanese and English.

[Brief description of drawings]

【図１】実施の形態に係る文字認識装置の全体構成を示
すブロック図である。FIG. 1 is a block diagram showing an overall configuration of a character recognition device according to an embodiment.

【図２】実施の形態に係る通常の候補ラティス例を示す
説明図である。FIG. 2 is an explanatory diagram showing an example of a normal candidate lattice according to the embodiment.

【図３】図１における言語処理部（後処理部）の処理動
作を示すフローチャートである。FIG. 3 is a flowchart showing a processing operation of a language processing unit (post-processing unit) in FIG.

【図４】実施の形態に係る英字用の候補ラティス例を示
す説明図である。FIG. 4 is an explanatory diagram showing an example of a candidate lattice for English characters according to the embodiment.

【図５】実施の形態に係る英大文字・小文字追加後にお
ける英字用の候補ラティス例を示す説明図である。FIG. 5 is an explanatory diagram showing an example of a candidate lattice for English characters after addition of uppercase and lowercase English letters according to the embodiment.

【図６】実施の形態に係る英単語辞書エントリ例を示す
説明図である。FIG. 6 is an explanatory diagram showing an example of an English word dictionary entry according to the embodiment.

【図７】実施の形態に係る英単語判定部の事例を示す説
明図である。FIG. 7 is an explanatory diagram showing an example of an English word determination unit according to the embodiment.

【図８】実施の形態に係る大文字・小文字修正部の事例
を示す説明図である。FIG. 8 is an explanatory diagram showing an example of an uppercase / lowercase correction unit according to the embodiment.

[Explanation of symbols]

１０１画像入力部１０４文字認識処理部１０６言語処理部（後処理部）１０７確定文字列出力部１０８英字用候補ラティス作成部１０９英大文字・小文字追加部１１０単語照合部１１１英単語判定部１１２英大文字・小文字修正部１１３候補単語評価部 101 image input unit 104 character recognition processing unit 106 language processing unit (post-processing unit) 107 fixed character string output unit 108 candidate character lattice creation unit for English characters 109 uppercase / lowercase addition unit 110 word matching unit 111 English word determination unit 112 English capitalization・ Lower case correction part 113 Candidate word evaluation part

Claims

[Claims]

1. A character recognition for recognizing a character string from image information, collating a recognition result character and a candidate lattice containing the candidate character with a word dictionary by a language processing means, and outputting a finally determined character string. In the apparatus, the language processing means creates an English character candidate lattice creation means for creating an English character candidate capable of containing a maximum of the candidate characters from the first candidate to the lower order for dictionary matching of English words, and a candidate. If a character has an alphabetic character, the uppercase and lowercase letters are added to the candidate lattice for English characters so that both uppercase and lowercase letters of the character are paired, and an uppercase and lowercase letter adding means is provided. Character recognizer.

2. The language processing means corrects the case of the notation of the selected word based on the candidate rank of each recognized character as a result of matching the candidate lattice to which the candidate character is added with the word dictionary. The character recognition device according to claim 1, further comprising an uppercase / lowercase correction means for performing the above.

3. The language processing means, based on a candidate rank of each recognized character, when the selected word is all capitalized as a result of matching between the candidate lattice to which the candidate character is added and the word dictionary. The character recognition device according to claim 1, further comprising an English word determination means for determining whether or not the word is to be evaluated.

4. The language processing means, as a result of collating the candidate lattice to which the candidate character is added with the word dictionary, when the selected word is a mixed word in uppercase and lowercase, a candidate for each recognized character. The character recognition device according to claim 1, further comprising an English word determination means for determining whether or not the word should be evaluated based on the rank.

5. The language processing means collates a candidate lattice with a word dictionary by an English word dictionary described in one notation for one word without describing all uppercase and lowercase fluctuations in the word notation. 5. Further, a word collating means for obtaining a candidate word is further provided.
Character recognition device according to any one of the above.

6. The character recognition device according to claim 1, wherein the language processing means shares a word dictionary matching algorithm for both Japanese and English character strings.