JP2011018108A

JP2011018108A - Device and program for correction of recognized character string

Info

Publication number: JP2011018108A
Application number: JP2009160631A
Authority: JP
Inventors: Keiji Ishimori; 圭二石森
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2009-07-07
Filing date: 2009-07-07
Publication date: 2011-01-27

Abstract

PROBLEM TO BE SOLVED: To provide a recognized character string correction device capable of correcting, with high accuracy, a character string recognized from an imaged document, and a program for correcting the recognized character string.

SOLUTION: The device includes: a correction information storage unit 31 that stores regular character string information used in the target document, and similar pattern information consisting, for each similar pattern of characters, of the corresponding character information group and one representative character information selected from that group; a character string comparison unit 35 that compares (a) character string information obtained by converting each character of the regular character string information that belongs to the character information group of a similar pattern into the representative character information of that pattern with (b) character string information obtained by converting the acquired recognized character string information into representative character information in the same way; and a character string correction unit 36 that corrects the acquired recognized character string information by replacing it with the regular character string information when the comparison shows that the two converted character strings are identical.

Description

The present invention relates to a recognized character string correction device and a recognized character string correction program that correct character strings misrecognized during character recognition of an imaged document into the accurate character strings.

Conventionally, devices such as OCR systems and text readers are known as techniques for recognizing characters from an imaged document by computer. Some of the functions of these devices are implemented in commercially available software and are widely used to convert newspaper articles, various specifications, books, and the like into text.

Character recognition using these techniques can misrecognize characters, so processing that corrects such errors is required.

Techniques for correcting recognized characters are described in Patent Document 1 and Patent Document 2.

Patent Document 1 describes a technique that corrects single-character errors by matching a target term against a dictionary built on a character code system created by contraction, in which each subset of similar characters that are likely to be misrecognized is represented by fewer characters.

Patent Document 2 describes a technique that recognizes characters term by term and, by matching against a term dictionary, outputs term candidates together with scores indicating how likely each candidate is. For each term in the term dictionary, a discrimination degree is obtained from its similarity to the other terms, and these degrees are collected in a term discrimination table. The score of each term candidate is then corrected according to its discrimination degree, so that candidates can be presented in descending order of expected correctness.

Patent Document 1: JP-A-4-361393
Patent Document 2: JP-A-4-68484

Using the techniques described in Patent Document 1 or Patent Document 2 raises the likelihood that a misrecognized character string can be corrected to the accurate one. However, for documents in a limited field, for example forms and invoices in which only a limited set of character strings is used, more accurate correction has been desired. In particular, it has been desired to distinguish and correct pairs of characters that are difficult for a computer to tell apart, such as “ソ” and “ン”, or “ツ” and “シ”.

Accordingly, an object of the present invention is to provide a recognized character string correction device and a recognized character string correction program capable of correcting, with high accuracy, a character string recognized from an imaged document.

To solve the above problem, the recognized character string correction device of the present invention corrects misrecognized characters in recognized character string information obtained by analyzing a document given as image information, and comprises: a correction information storage unit that stores preset regular character string information used in the document, and similar pattern information consisting, for each similar pattern of characters, of the corresponding character information group and one representative character information selected from that group; a recognized character string information acquisition unit that acquires the recognized character string information; a recognized character string conversion unit that converts, among the character information making up the acquired recognized character string information, any character information found in the character information group of one of the stored similar patterns into the representative character information of that pattern; a character string comparison unit that compares character string information obtained by converting, among the character information making up the stored regular character string information, any character information found in the character information group of one of the similar patterns into the representative character information of that pattern, with the character string information into which the recognized character string conversion unit converted the recognized character string information; and a character string correction unit that, when the comparison shows that the converted regular character string information and the converted recognized character string information are identical, corrects the recognized character string information acquired by the recognized character string information acquisition unit by replacing it with the regular character string information stored in the correction information storage unit.

In this recognized character string correction device, the recognized character string conversion unit may further convert any lowercase letters contained in the character information making up the recognized character string information into uppercase letters. In that case, the character string comparison unit compares character string information obtained from the regular character string information stored in the correction information storage unit, by converting any character information found in the character information group of one of the similar patterns into the representative character information of that pattern and converting any lowercase letters into uppercase letters, with the character string information into which the recognized character string conversion unit converted the recognized character string information.
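The optional uppercase conversion can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the half-width group `1I|!` stands in for similar pattern D of the embodiment, and the function name `normalize` and the sample strings `ITEM1` and `item!` are hypothetical. The text does not state the order of the two conversions; this sketch uppercases first so that a lowercase letter can still reach its similar-pattern group.

```python
# Hypothetical similar-pattern table: each character maps to its representative.
# The group "1I|!" mirrors similar pattern D, using half-width characters.
CHAR_MAP = {ch: "1" for ch in "1I|!"}


def normalize(text: str) -> str:
    """Uppercase the text, then replace similar characters by their representative."""
    upper = text.upper()  # lowercase -> uppercase, applied to both sides
    return "".join(CHAR_MAP.get(ch, ch) for ch in upper)


regular = "ITEM1"     # regular (expected) string, assumed for illustration
recognized = "item!"  # hypothetical OCR output with case and glyph errors

# Both sides are normalized the same way before they are compared.
if normalize(recognized) == normalize(regular):
    recognized = regular  # replace the OCR output with the regular string

print(recognized)
```

If the similar-character mapping ran before the uppercase step, the lowercase `i` would miss the group and the comparison would fail, which is why this sketch chooses the order it does.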

The correction information storage unit of this recognized character string correction device may further store regular converted character string information, obtained by converting, among the character information making up the regular character string information, any character information found in the character information group of one of the similar patterns into the representative character information of that pattern. In that case, the character string comparison unit compares the regular converted character string information stored in the correction information storage unit with the character string information into which the recognized character string conversion unit converted the recognized character string information.
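Storing the converted regular strings in advance might look like the sketch below. It assumes a plain dictionary as the storage, uses only similar patterns A and B, and borrows two regular strings from the embodiment; the names `build_index` and `correct` are hypothetical, and the handling of collisions and of unmatched input is an assumption that the text does not cover.

```python
# Similar patterns A and B from the embodiment: character -> representative.
CHAR_MAP = {"ソ": "ン", "ン": "ン", "ツ": "シ", "シ": "シ"}


def to_representative(text: str) -> str:
    """Replace every character that belongs to a similar pattern."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def build_index(regular_strings):
    """Convert each regular string once and remember which original it came from."""
    index = {}
    for regular in regular_strings:
        # setdefault keeps the first string if two convert to the same key
        # (collision handling is an assumption of this sketch).
        index.setdefault(to_representative(regular), regular)
    return index


# Built once, ahead of time, instead of converting on every comparison.
INDEX = build_index(["ソフトウェア", "ミドルウェア"])


def correct(recognized: str) -> str:
    # Unmatched input is returned unchanged (also an assumption).
    return INDEX.get(to_representative(recognized), recognized)


print(correct("ンフトウェア"))
```

The trade-off is extra storage for the converted strings in exchange for skipping the conversion of every regular string each time a recognized string is checked.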

The recognized character string correction program of the present invention causes a recognized character string correction device, which corrects misrecognized characters in recognized character string information obtained by analyzing a document given as image information, to execute: a function of storing preset regular character string information used in the document, and similar pattern information consisting, for each similar pattern of characters, of the corresponding character information group and one representative character information selected from that group; a function of acquiring the recognized character string information; a function of converting, among the character information making up the acquired recognized character string information, any character information found in the character information group of one of the stored similar patterns into the representative character information of that pattern; a function of comparing character string information obtained by converting, among the character information making up the stored regular character string information, any character information found in the character information group of one of the similar patterns into the representative character information of that pattern, with the character string information into which the recognized character string information was converted; and a function of correcting the acquired recognized character string information by replacing it with the regular character string information when the comparison shows that the converted regular character string information and the converted recognized character string information are identical.

According to the recognized character string correction device and the recognized character string correction program of the present invention, a character string recognized from an imaged document can be corrected with high accuracy.

FIG. 1 is a block diagram showing the configuration of a character recognition system using a recognized character string correction device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the recognized character string correction device according to the embodiment.
FIG. 3 is an explanatory diagram showing an example of the regular character string information stored in the correction information storage unit of the recognized character string correction device according to the embodiment.
FIG. 4 is an explanatory diagram showing an example of the similar pattern information stored in the correction information storage unit of the recognized character string correction device according to the embodiment.
FIG. 5 is an explanatory diagram illustrating the processing executed by the recognized character string conversion unit, the regular character string conversion unit, the character string comparison unit, and the character string correction unit of the recognized character string correction device according to the embodiment.

An embodiment of a character recognition system using the recognized character string correction device of the present invention will be described with reference to the drawings.

〈Configuration of Character Recognition System According to One Embodiment〉
A character recognition system 1 according to the present embodiment recognizes characters from a printed document and generates a text file. As shown in FIG. 1, it includes an image reading device 10, a character recognition device 20, and a recognized character string correction device 30.

The image reading device 10 is a scanner or the like, and includes a reading control unit 11 that reads the document from which a text file is to be generated as image data, and an image data storage unit 12 that stores the read image data.

The character recognition device 20 is an OCR device or the like, and includes a character recognition control unit 21 that recognizes character information by analyzing the image data stored in the image data storage unit 12 of the image reading device 10, and a recognized character string information storage unit 22 that stores recognized character string information, that is, character string information made up of the recognized character information.

The recognized character string correction device 30 includes a correction information storage unit 31, a recognized character string information acquisition unit 32, a recognized character string conversion unit 33, a regular character string conversion unit 34, a character string comparison unit 35, a character string correction unit 36, a text file generation unit 37, and a text file storage unit 38.

The correction information storage unit 31 stores preset regular character string information used in the document, and similar pattern information consisting, for each similar pattern of characters, of the corresponding character information group and one representative character information selected from that group.

The recognized character string information acquisition unit 32 acquires the recognized character string information stored in the recognized character string information storage unit 22 of the character recognition device 20.

The recognized character string conversion unit 33 determines, for each piece of character information making up the recognized character string information acquired by the recognized character string information acquisition unit 32, whether it is in the character information group of any similar pattern stored in the correction information storage unit 31, and converts any character information found in such a group into the representative character information of that similar pattern.

The regular character string conversion unit 34 determines, for each piece of character information making up the regular character string information stored in the correction information storage unit 31, whether it is in the character information group of any similar pattern, and converts any character information found in such a group into the representative character information of that similar pattern.

The character string comparison unit 35 compares the character string information into which the recognized character string conversion unit 33 converted the recognized character string information with the character string information into which the regular character string conversion unit 34 converted the regular character string information, and determines whether the two are identical.

When the comparison by the character string comparison unit 35 shows that the converted recognized character string information and the converted regular character string information are identical, the character string correction unit 36 corrects the recognized character string information acquired by the recognized character string information acquisition unit 32 by replacing it with the corresponding regular character string information stored in the correction information storage unit 31.

The text file generation unit 37 generates a text file made up of the character string information corrected by the character string correction unit 36.

The text file storage unit 38 stores the text file generated by the text file generation unit 37.

〈Operation of Character Recognition System According to One Embodiment〉
Next, the processing performed in the character recognition system 1 according to the present embodiment when a text file is generated from a printed document X and stored will be described.

First, document X, the printed document from which a text file is to be generated, is read as image data by the reading control unit 11 of the image reading device 10 and stored in the image data storage unit 12.

Next, the character recognition control unit 21 of the character recognition device 20 analyzes the image data stored in the image data storage unit 12 and recognizes character information. The character string information made up of the recognized character information is stored in the recognized character string information storage unit 22.

Next, the recognized character string correction device 30 performs processing to correct misrecognized characters in the recognized character string information. This correction processing is described with reference to the flowchart of FIG. 2.

The correction information storage unit 31 of the recognized character string correction device 30 stores preset regular character string information used in the document, and similar pattern information consisting, for each similar pattern of characters, of the corresponding character information group and one representative character information selected from that group.

FIG. 3 shows an example of the regular character string information used in the document. In the present embodiment, “ソフトウェア”, “ソリューション１”, and “ミドルウェア” are stored as the regular character string information used in the document.

FIG. 4 shows an example of the similar pattern information. These similar patterns were created by focusing on similar characters that computers have difficulty distinguishing. In the present embodiment, the similar pattern information stores, for each of the seven similar patterns A to G, the corresponding character information group and the one representative character information selected from that group.

For example, FIG. 4 stores the following character information groups of similar characters, each with the representative character information selected from the group:

Similar pattern A: group “ソ”, “ン”; representative “ン”
Similar pattern B: group “ツ”, “シ”; representative “シ”
Similar pattern C: group “ー”, “―”; representative “―”
Similar pattern D: group “１”, “Ｉ”, “｜”, “！”; representative “１”
Similar pattern E: group “；”, “：”; representative “：”
Similar pattern F: group “．”, “，”, “。”; representative “．”
Similar pattern G: group “＋”, “十”; representative “＋”
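One way to hold the similar pattern information of FIG. 4 in memory is sketched below. The dictionary layout and the names `SIMILAR_PATTERNS` and `build_char_map` are assumptions made for illustration; the text only specifies that each pattern has a character group and one representative chosen from that group.

```python
# Similar-pattern table modeled on FIG. 4:
# pattern name -> (characters in the group, representative character).
SIMILAR_PATTERNS = {
    "A": ("ソン", "ン"),
    "B": ("ツシ", "シ"),
    "C": ("ー―", "―"),
    "D": ("１Ｉ｜！", "１"),
    "E": ("；：", "："),
    "F": ("．，。", "．"),
    "G": ("＋十", "＋"),
}


def build_char_map(patterns):
    """Flatten the pattern table into a character -> representative lookup."""
    char_map = {}
    for name, (group, representative) in patterns.items():
        # The representative must be selected from within its own group.
        if representative not in group:
            raise ValueError(f"pattern {name}: representative not in group")
        for ch in group:
            # A character in two groups would make the conversion ambiguous.
            if ch in char_map:
                raise ValueError(f"pattern {name}: {ch!r} is already assigned")
            char_map[ch] = representative
    return char_map


CHAR_MAP = build_char_map(SIMILAR_PATTERNS)
print(len(CHAR_MAP))
```

A flat lookup of this kind turns the "is this character in any similar pattern?" check into a single dictionary access per character.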

With this regular character string information and similar pattern information stored in the correction information storage unit 31, when the processing to correct misrecognized characters starts, the recognized character string information acquisition unit 32 first acquires the recognized character string information stored in the recognized character string information storage unit 22 of the character recognition device 20 (S1). Here, it is assumed that “ンリュ−ツョソＩ” is acquired as the recognized character string information.

Next, the recognized character string conversion unit 33 determines whether each piece of character information “ン”, “リ”, “ュ”, “−”, “ツ”, “ョ”, “ソ”, “Ｉ” making up the recognized character string information “ンリュ−ツョソＩ” acquired by the recognized character string information acquisition unit 32 is in the character information group of any similar pattern stored in the correction information storage unit 31. Any character information found in such a group is converted into the representative character information of the corresponding similar pattern, as shown at 51 in FIG. 5 (S2).

Here, the character information “ン” is first determined to be in the character information group of similar pattern A and is converted into the representative character information “ン” of similar pattern A. Because the character being examined and the representative character are the same, the character information does not actually change.

The character information “−” is determined to be in the character information group of similar pattern C and is converted into the representative character information “−” of similar pattern C. Here too, the character being examined and the representative character are the same, so the character information does not actually change.

The character information “ツ” is determined to be in the character information group of similar pattern B and is converted into the representative character information “シ” of similar pattern B.

The character information “ソ” is determined to be in the character information group of similar pattern A and is converted into the representative character information “ン” of similar pattern A.

The character information “Ｉ” is determined to be in the character information group of similar pattern D and is converted into the representative character information “１” of similar pattern D.

As a result of converting each piece of character information in this way, the recognized character string information “ンリュ−ツョソＩ” is converted into “ンリュ−ション１”.
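The per-character conversion of step S2 can be sketched as below. The function name `to_representative` is hypothetical. The dash glyphs are written with escapes because they look alike and appear in more than one form in the text; listing all three in one group with U+2212 as the representative is an assumption made so that the sketch follows the worked example.

```python
# Groups needed for the worked example: (characters in group, representative).
GROUPS = [
    ("ソン", "ン"),                      # similar pattern A
    ("ツシ", "シ"),                      # similar pattern B
    ("\u30fc\u2015\u2212", "\u2212"),   # dash glyphs; set and representative assumed
    ("１Ｉ｜！", "１"),                  # similar pattern D
]
CHAR_MAP = {ch: rep for group, rep in GROUPS for ch in group}


def to_representative(text: str) -> str:
    """Replace each character found in a group by that group's representative."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


recognized = "ンリュ\u2212ツョソＩ"  # the misrecognized string of the example
converted = to_representative(recognized)
print(converted)
```

Characters such as “リ”, “ュ”, and “ョ” belong to no group and pass through unchanged, while “ツ”, “ソ”, and “Ｉ” are replaced by their representatives.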

Next, the regular character string conversion unit 34 determines whether each piece of character information making up the regular character string information stored in the correction information storage unit 31 is in the character information group of any similar pattern, and any character information found in such a group is converted into the representative character information of the corresponding similar pattern (S3).

Here, among the regular character string information stored as shown in FIG. 3, it is first determined whether each piece of character information “ソ”, “フ”, “ト”, “ウ”, “ェ”, “ア” making up the first character string information “ソフトウェア” is in the character information group of any similar pattern stored in the correction information storage unit 31.

Each piece of character information is examined in the same way as for the recognized character string information described above, and as a result the first regular character string information “ソフトウェア” is converted into “ンフトウェア”.

Next, the character string comparison unit 35 compares the character string information into which the recognized character string information was converted in step S2 with the character string information into which the first regular character string information was converted in step S3, to determine whether they are identical (S4).

In this case, the character string information “ンリュ−ション１” obtained in step S2 and the character string information “ンフトウェア” obtained in step S3 are not identical (“NO” in S5), so processing moves on to the second regular character string information and continues (S6).

Returning to step S3, the regular character string conversion unit 34 determines, for each of the characters 「ソ」, 「リ」, 「ュ」, 「ー」, 「シ」, 「ョ」, 「ン」, and 「１」 constituting the second regular character string information 「ソリューション１」 ("Solution 1"), whether it is in the character information group of any similar pattern stored in the correction information storage unit 31.

Each character is examined in the same way as for the recognized character string information described above, and as a result the second regular character string information 「ソリューション１」 is converted into 「ンリュ−ション１」, as shown at 52 in FIG. 5.

Next, the character string comparison unit 35 compares the character string information obtained by converting the recognized character string information in step S2 with the character string information obtained by converting the second regular character string information in step S3, to determine whether the two are identical (S4).

In this case, the comparison shows that the character string information 「ンリュ−ション１」 converted in step S2 and the character string information 「ンリュ−ション１」 converted in step S3 are identical ("YES" in S5).

When the character string information converted in step S2 and the character string information converted in step S3 are determined to be identical, the character string correction unit 36 corrects the recognized character string information 「ンリュ−ツョソＩ」 acquired by the recognized character string information acquisition unit 32 by replacing it with the matching second regular character string information 「ソリューション１」, as shown at 53 in FIG. 5 (S7).
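Steps S2 through S7 can be sketched together as a single loop over the regular character strings. The mapping below is a hypothetical stand-in for the similar pattern information of FIG. 4, inferred from the worked example; the function names, and the choice to return the recognized string unchanged when nothing matches, are assumptions made for this illustration.

```python
from typing import Iterable

# Hypothetical member -> representative mapping (inferred, not FIG. 4).
# "\u2212" is the minus-sign glyph used as representative of "ー".
TO_REPRESENTATIVE = {"ソ": "ン", "ツ": "シ", "ー": "\u2212", "Ｉ": "１"}


def to_representative(text: str) -> str:
    """Convert characters in a similar-pattern group to the representative."""
    return "".join(TO_REPRESENTATIVE.get(ch, ch) for ch in text)


def correct(recognized: str, regular_strings: Iterable[str]) -> str:
    """Return the first regular string whose converted form equals the
    converted form of the recognized string."""
    converted = to_representative(recognized)        # S2
    for regular in regular_strings:                  # S3, then S6 on "NO"
        if to_representative(regular) == converted:  # S4, S5
            return regular                           # S7
    # Assumption: with no match, the recognized string is left as it is.
    return recognized


regular_strings = ["ソフトウェア", "ソリューション１"]
corrected = correct("ンリュ\u2212ツョソＩ", regular_strings)
```

Here `corrected` is 「ソリューション１」: the first regular string converts to 「ンフトウェア」 and is rejected, while the second converts to the same string as the recognized input.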

Then, the text file generation unit 37 generates a text file composed of the character string information corrected by the character string correction unit 36 (S8), and the text file is stored in the text file storage unit 38 (S9).

According to the embodiment described above, for documents in a limited field where the character information used is determined to some extent in advance, character strings recognized from a printed document can be corrected with high accuracy, making it possible to generate an accurate text file.

Although the present invention has been described above by way of its best mode, the description and drawings that form part of this disclosure should not be understood as limiting the invention. Various alternative embodiments, examples, and operational techniques will be apparent to those skilled in the art from this disclosure.

For example, this embodiment describes the case in which, when the character string comparison unit 35 performs the comparison, each item of regular character string information is converted and compared in turn every time correction processing is carried out. Alternatively, the character string information obtained by converting each item of regular character string information on the basis of the similar pattern information may be stored in the correction information storage unit 31 in advance and used for the comparison.
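A minimal sketch of this variation is given below: the converted forms of the regular strings are computed once and stored, so that correction no longer reconverts them each time. The use of a dictionary lookup in place of sequential comparison is a design choice of this sketch, not something stated in the text, and the mapping and all names are hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical member -> representative mapping (inferred, not FIG. 4).
TO_REPRESENTATIVE = {"ソ": "ン", "ツ": "シ", "ー": "\u2212", "Ｉ": "１"}


def to_representative(text: str) -> str:
    """Convert characters in a similar-pattern group to the representative."""
    return "".join(TO_REPRESENTATIVE.get(ch, ch) for ch in text)


def build_index(regular_strings: Iterable[str]) -> Dict[str, str]:
    """Precompute converted form -> regular string, done once in advance."""
    index: Dict[str, str] = {}
    for regular in regular_strings:
        # setdefault keeps the earliest entry if two regular strings collide,
        # which mirrors the order of the sequential comparison.
        index.setdefault(to_representative(regular), regular)
    return index


def correct_with_index(recognized: str, index: Dict[str, str]) -> str:
    """Look up the converted recognized string; unchanged if not found."""
    return index.get(to_representative(recognized), recognized)


INDEX = build_index(["ソフトウェア", "ソリューション１"])
```

Calling `correct_with_index("ンリュ\u2212ツョソＩ", INDEX)` then yields 「ソリューション１」 without converting any regular string at correction time.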

Furthermore, when the recognized character string conversion unit 33 and the regular character string conversion unit 34 perform conversion on the basis of similar pattern information stored as shown in FIG. 4, characters that have both a full-size and a small form, such as 「ツ」 and 「ッ」, may be unified to the full-size form before conversion. This reduces the amount of information that must be stored in the similar pattern information and also allows accurate correction when the size of a character has been misrecognized, so that correction accuracy is improved.
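The size unification described here can be sketched as a preprocessing step applied to both the recognized and the regular strings before the similar-pattern conversion. The list of small katakana covered below is an assumption of this sketch (the text names only 「ツ」 and 「ッ」), and `unify_size` is a hypothetical name.

```python
# Small katakana and their full-size counterparts (assumed coverage).
SMALL_TO_FULL = str.maketrans("ァィゥェォッャュョヮ", "アイウエオツヤユヨワ")


def unify_size(text: str) -> str:
    """Replace small katakana with their full-size forms."""
    return text.translate(SMALL_TO_FULL)


# A string whose small characters were misrecognized as full-size ones
# becomes equal to the regular string once both are unified.
regular = unify_size("ソリューション")
misread = unify_size("ソリユーシヨン")
```

After unification both `regular` and `misread` are 「ソリユーシヨン」, so a size misrecognition no longer prevents a match, and the similar pattern groups need to list only the full-size forms.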

The present invention naturally includes various embodiments and the like that are not described here. Accordingly, the technical scope of the present invention is defined only by the matters specifying the invention in the claims that are reasonable in light of the above description.

DESCRIPTION OF SYMBOLS
1 ... Character recognition system
10 ... Image reading device
11 ... Reading control unit
12 ... Image data storage unit
20 ... Character recognition device
21 ... Character recognition control unit
22 ... Recognized character string information storage unit
30 ... Recognized character string correction device
31 ... Correction information storage unit
32 ... Recognized character string information acquisition unit
33 ... Recognized character string conversion unit
34 ... Regular character string conversion unit
35 ... Character string comparison unit
36 ... Character string correction unit
37 ... Text file generation unit
38 ... Text file storage unit

Claims

A recognized character string correction device that corrects misrecognized characters in recognized character string information obtained by analyzing a document on the basis of image information, the device comprising:
a correction information storage unit that stores preset regular character string information used in the document, and similar pattern information composed of, for each similar pattern of characters, the corresponding character information group and one representative character information selected from that character information group;
a recognized character string information acquisition unit that acquires the recognized character string information;
a recognized character string conversion unit that converts, among the character information constituting the recognized character string information acquired by the recognized character string information acquisition unit, character information that is in the character information group of any similar pattern stored in the correction information storage unit into the representative character information of that similar pattern;
a character string comparison unit that compares character string information, obtained by converting character information that is in the character information group of any similar pattern, among the character information constituting the regular character string information stored in the correction information storage unit, into the representative character information of that similar pattern, with the character string information obtained by converting the recognized character string information in the recognized character string conversion unit; and
a character string correction unit that, when the comparison by the character string comparison unit shows that the character string information converted from the regular character string information and the character string information converted from the recognized character string information are identical, corrects the recognized character string information acquired by the recognized character string information acquisition unit by replacing it with the regular character string information stored in the correction information storage unit.

The recognized character string correction device according to claim 1, wherein
the recognized character string conversion unit further converts any small (lowercase) characters included in the character information constituting the recognized character string information into large (uppercase) characters, and
the character string comparison unit compares character string information, obtained by converting character information that is in the character information group of any similar pattern, among the character information constituting the regular character string information stored in the correction information storage unit, into the representative character information of that similar pattern and by converting any small (lowercase) characters included in that character information into large (uppercase) characters, with the character string information obtained by converting the recognized character string information in the recognized character string conversion unit.

The recognized character string correction device according to claim 1 or 2, wherein
the correction information storage unit further stores regular converted character string information obtained by converting character information that is in the character information group of any similar pattern, among the character information constituting the regular character string information, into the representative character information of that similar pattern, and
the character string comparison unit compares the regular converted character string information stored in the correction information storage unit with the character string information obtained by converting the recognized character string information in the recognized character string conversion unit.

A recognized character string correction program that causes a recognized character string correction device, which corrects misrecognized characters in recognized character string information obtained by analyzing a document on the basis of image information, to execute:
a function of storing preset regular character string information used in the document, and similar pattern information composed of, for each similar pattern of characters, the corresponding character information group and one representative character information selected from that character information group;
a function of acquiring the recognized character string information;
a function of converting, among the character information constituting the acquired recognized character string information, character information that is in the character information group of any similar pattern into the representative character information of that similar pattern;
a function of comparing character string information, obtained by converting character information that is in the character information group of any similar pattern, among the character information constituting the stored regular character string information, into the representative character information of that similar pattern, with the character string information obtained by converting the recognized character string information; and
a function of correcting the acquired recognized character string information, when the comparison shows that the character string information converted from the regular character string information and the character string information converted from the recognized character string information are identical, by replacing it with the regular character string information.