JP2020154776A

JP2020154776A - False recognition character table, false recognition character table creation method, character string searching device, character string searching method and character string searching program

Info

Publication number: JP2020154776A
Application number: JP2019053038A
Authority: JP
Inventors: Kiyotaka Miyai; Kiyotaka Kasubuchi; Akiko Yoshida; Kazuhiro Kitamura; Manri Terada; Mitsunori Umehara
Original assignee: Screen Holdings Co Ltd
Current assignee: Screen Holdings Co Ltd
Priority date: 2019-03-20
Filing date: 2019-03-20
Publication date: 2020-09-24
Anticipated expiration: 2039-03-20
Also published as: JP7257204B2

Abstract

To enable character string searches over text data obtained as an OCR result to be performed appropriately and at low cost, even when the printing form of the OCR target image differs, without requiring visual confirmation or continuous dictionary updates.

SOLUTION: Character images, in which OCR target characters are printed while varying the printing form identified by the printer, the font, and the type of paper, are read by a scanner and subjected to OCR processing to recognize the characters. Based on the result of the OCR processing, a misrecognition character table is created in which each misrecognized OCR target character is associated not only with the misrecognition character but also with the printing form (printer, font, and type of paper) of the character image. When a character string is searched for in text data obtained as an OCR result, each character of the input character string that is registered in the misrecognition character table is replaced with a wildcard before the search is performed.

SELECTED DRAWING: Figure 4

Description

The present invention relates to a misrecognition character table created by collecting characters that are easily misrecognized in optical character recognition (hereinafter "OCR"), a method for creating the table, a character string search device that uses the table, and the like.

Various methods have conventionally been proposed to improve the accuracy of OCR (Optical Character Recognition). For example, a known method applies processing such as distortion correction and removal of noise (background tint patterns, dust, and the like) to an image to be read for OCR, in order to extract the text portion accurately (see, for example, Patent Document 3). Methods that improve OCR accuracy through machine learning on a large amount of training data, and methods that create a dictionary of words prone to error in OCR (hereinafter "misrecognition word dictionary"), have also been considered.

In connection with the misrecognition character table and character string search device disclosed in the present application, Patent Document 1 describes a document image search system that searches the OCR results of document images with an input search string, using a misrecognition character list that manages groups of characters that are easily misrecognized in OCR (see paragraph [0024] and elsewhere). When the input search string yields no result, this system replaces part of the string with a wildcard and searches again; when the search string containing the wildcard also yields no result, it replaces some characters in the search string with other misrecognition characters based on the misrecognition character list and searches again (see paragraphs [0068], [0071], and elsewhere). Patent Document 2 describes systems and methods for improving OCR recognition accuracy using a learning set generated from OCR results (for example, for each character identified by the OCR module, the mean and variance of the imagelets identified as corresponding to that character) (see paragraphs [0014] to [0020] and elsewhere).

Patent Document 1: Japanese Unexamined Patent Publication No. 2004-213091
Patent Document 2: Japanese Translation of PCT Application Publication No. 2013-509664
Patent Document 3: Japanese Unexamined Patent Publication No. 2005-196563

Systems have been developed that search text data obtained by OCR using keywords (input character strings serving as search terms). To obtain the desired search results in such a system, the usual approach is to have a person visually check the OCR text data, correct the recognition errors, and use the corrected text as the search target.

However, visually checking all of the OCR text data is extremely costly. If a recognition error is left uncorrected in the OCR result, such a system cannot return correct search results. When the misrecognition word dictionary described above is created, the dictionary must be updated every time a new term appears, so a maintenance cost is incurred continuously. Furthermore, even for the same character, the OCR recognition result may differ depending on the printer, font, and so on used, and it is difficult to accommodate that difference in a conventional misrecognition word dictionary or misrecognition character list.

It is therefore desirable to provide a method and a device that allow the above search to be performed appropriately and at low cost, without visual confirmation or continuous dictionary updates, even when the printer, font, and so on used to print the image subjected to OCR for generating the text data differ.

A first aspect of the present invention is a method for creating a misrecognition character table, comprising:
a printing step of printing each OCR target character, which is a character that can be a recognition target of an OCR device, in a plurality of different print forms, thereby producing a recording medium on which the OCR target characters are printed as character images;
an OCR step of reading the character images printed in the printing step with the OCR device, recognizing the characters, and generating character codes as the recognition result;
a misrecognition determination step of comparing the codes of the OCR target characters printed in the printing step with the character codes obtained as the recognition result of the OCR step, thereby determining, for each printed OCR target character, whether it was misrecognized in the OCR step; and
a data generation step of, when the misrecognition determination step determines that the image of any of the OCR target characters was misrecognized in the OCR step, generating table-format misrecognition character association data that associates the character obtained as the recognition result of the OCR step with that character as a misrecognition character, and generating table-format print form association data that associates with that character the print form in which its image was printed.
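The misrecognition determination and data generation steps of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the publication: the names `PrintForm` and `build_misrecognition_table`, the use of dictionaries of sets as the "table format", and the sample data are all hypothetical, and the printing and OCR steps are represented only by their results.

```python
from collections import defaultdict
from typing import NamedTuple


class PrintForm(NamedTuple):
    """Hypothetical record of a print form: printer, font, and paper type."""
    printer: str
    font: str
    paper: str


def build_misrecognition_table(samples):
    """Build the two association tables from OCR trial results.

    samples: iterable of (printed_char, recognized_char, PrintForm) tuples,
    one per character image that was printed and then read by the OCR device.
    """
    misrecognized_as = defaultdict(set)  # misrecognition character association data
    print_forms = defaultdict(set)       # print form association data
    for printed, recognized, form in samples:
        # Misrecognition determination: compare the printed and recognized codes.
        if printed != recognized:
            misrecognized_as[printed].add(recognized)
            print_forms[printed].add(form)
    return dict(misrecognized_as), dict(print_forms)


# Invented sample data for illustration only.
form_a = PrintForm("printer-A", "Gothic", "plain")
form_b = PrintForm("printer-B", "Mincho", "coated")
samples = [
    ("l", "1", form_a),  # misrecognized under form A
    ("l", "l", form_b),  # recognized correctly under form B
    ("O", "0", form_a),  # misrecognized under form A
    ("a", "a", form_a),  # recognized correctly
]
chars, forms = build_misrecognition_table(samples)
```

Only characters that were misrecognized at least once appear in either table, and each one carries the set of print forms under which the misrecognition occurred.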

A second aspect of the present invention is the first aspect, wherein
the print form association data includes information identifying at least one of the printing device, the font, and the recording medium used in printing the image of the character misrecognized in the OCR step.

A third aspect of the present invention is the second aspect, wherein
the print form association data includes information identifying the type of paper used as the recording medium in printing the image of the character misrecognized in the OCR step, and
the information identifying the type of paper includes information from which the susceptibility to bleeding of the ink used in printing the image of the character can be identified.

A fourth aspect of the present invention is a misrecognition character table comprising:
misrecognition character association data that associates, with each misrecognition high-possibility character, the misrecognition character obtained as the recognition result when the image of that character was misrecognized by an OCR device, where a misrecognition high-possibility character is a character whose likelihood of being misrecognized by the OCR device, which reads a target image containing text from a recording medium on which the target image is printed and recognizes characters, is regarded as exceeding a predetermined allowable range; and
print form association data that associates, with each misrecognition high-possibility character, the print form of the image of that character when the character was misrecognized by the OCR device.

A fifth aspect of the present invention is the fourth aspect, wherein
the print form association data includes information identifying at least one of the printing device, the font, and the recording medium used in printing the image of the character misrecognized by the OCR device.

A sixth aspect of the present invention is the fifth aspect, wherein
the print form association data includes information identifying the type of paper used as the recording medium in printing the image of the character misrecognized by the OCR device, and
the information identifying the type of paper includes information from which the susceptibility to bleeding of the ink used in printing the image of the character can be identified.

A seventh aspect of the present invention is a character string search device whose search target is text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
the misrecognition character table according to any one of the fourth to sixth aspects;
a search term creation unit that, when any character in an externally supplied input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table, creates a search term by replacing the matching character in the input character string with a wildcard, and that, when no character in the input character string matches any of the registered misrecognition high-possibility characters and misrecognition characters, uses the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching the search term obtained by the search term creation unit.
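The search term creation unit and search unit of the seventh aspect might look like the following sketch. It assumes that the wildcard stands for exactly one arbitrary character and realizes it with the regular-expression metacharacter `.`; the publication does not specify this here. The function names, the `registered` set, and the sample text are hypothetical.

```python
import re


def make_search_term(query, registered):
    """Replace every registered character in the query with a one-character wildcard.

    registered: set containing both the misrecognition high-possibility
    characters and the misrecognition characters from the table.
    If no character is registered, the result matches the query literally.
    """
    return "".join("." if ch in registered else re.escape(ch) for ch in query)


def search(query, text, registered):
    """Return every substring of the OCR text that matches the search term."""
    pattern = make_search_term(query, registered)
    return [m.group(0) for m in re.finditer(pattern, text)]


# Invented sample data: "l" was misread as "1" in the OCR result.
registered = {"l", "1", "O", "0"}
ocr_text = "The mode1 number is printed on the back."
hits = search("model", ocr_text, registered)
```

Here the query "model" becomes the pattern `mode.`, so the misrecognized string "mode1" in the OCR text is still found.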

An eighth aspect of the present invention is the seventh aspect, wherein the search term creation unit:
creates a search term by replacing the matching character in the input character string with a wildcard when any character in the input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table and the print form associated with the matching character matches the print form of the target image; and
uses the input character string as the search term when a character in the input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table but the print form associated with the matching character does not match the print form of the target image.
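The print-form condition of the eighth aspect can be illustrated with a small variation in which a character is replaced by a wildcard only when its registered print forms include that of the target image. This is a sketch under assumptions: the mapping `print_forms` from each registered character to a set of print-form tuples, the one-character wildcard `.`, and all names and data are hypothetical.

```python
import re


def make_search_term(query, print_forms, target_form):
    """Wildcard a character only if it is registered for the target image's print form."""
    parts = []
    for ch in query:
        if target_form in print_forms.get(ch, ()):
            parts.append(".")            # registered and print form matches
        else:
            parts.append(re.escape(ch))  # unregistered, or print form differs
    return "".join(parts)


# Invented sample data: "l" and "1" are confused only under form A.
form_a = ("printer-A", "Gothic", "plain")
form_b = ("printer-B", "Mincho", "coated")
print_forms = {"l": {form_a}, "1": {form_a}}

term_a = make_search_term("model", print_forms, form_a)
term_b = make_search_term("model", print_forms, form_b)
```

For a target image printed in form A the term is `mode.`, while for form B the input string is used unchanged, which avoids widening the search where no misrecognition was observed.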

A ninth aspect of the present invention is a character string search device whose search target is text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
the misrecognition character table according to any one of the fourth to sixth aspects;
a search term creation unit that, when any character in an externally supplied input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table, uses the input character string as a search term and additionally creates a search term by replacing the matching character in the input character string with another character associated with the matching character by the misrecognition character table, and that, when no character in the input character string matches any of the registered misrecognition high-possibility characters and misrecognition characters, uses the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching any of the search terms obtained by the search term creation unit.
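The substitution-based search term creation of the ninth aspect can be sketched as below. Generating every combination of substitutions with `itertools.product` is an assumption made for illustration; the text above only states that matching characters are replaced with associated characters. The `substitutions` mapping, function name, and sample text are hypothetical.

```python
from itertools import product


def expand_search_terms(query, substitutions):
    """Return the input string followed by its substituted variants.

    substitutions: maps a registered character to the set of characters
    associated with it by the misrecognition character table.
    """
    # For each position: the original character first, then its alternatives.
    options = [[ch, *sorted(substitutions.get(ch, ()))] for ch in query]
    return ["".join(combo) for combo in product(*options)]


# Invented sample data for illustration only.
substitutions = {"l": {"1"}, "1": {"l"}, "O": {"0"}, "0": {"O"}}
terms = expand_search_terms("model", substitutions)

ocr_text = "The mode1 number is printed on the back."
hits = [t for t in terms if t in ocr_text]
```

Because only characters recorded in the table are tried, this search is narrower than a wildcard search, at the cost of missing misrecognitions that were never observed when the table was built.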

A tenth aspect of the present invention is the ninth aspect, wherein the search term creation unit:
uses the input character string as a search term and additionally creates a search term by replacing the matching character in the input character string with another character associated with the matching character by the misrecognition character table, when any character in the input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table and the print form associated with the matching character matches the print form of the target image; and
uses only the input character string as the search term when a character in the input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table but the print form associated with the matching character does not match the print form of the target image.

An eleventh aspect of the present invention is the ninth or tenth aspect, wherein
the search unit, when no character string matching any of the search terms obtained by the search term creation unit can be found in the text data obtained as the OCR result, and any character in the input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table, searches the text data obtained as the OCR result for a character string matching a search term obtained by replacing the matching character in the input character string with a wildcard.
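The two-stage behaviour of the eleventh aspect (substitution first, wildcard only as a fallback) can be sketched as follows. As before, this is an illustration under assumptions: the wildcard is taken to be one arbitrary character (regular-expression `.`), all substitution combinations are tried in the first stage, and the names and sample data are hypothetical.

```python
import re
from itertools import product


def search_with_fallback(query, text, substitutions):
    """Search with substituted terms; fall back to a wildcard term on no hit."""
    # Stage 1: the input string plus variants built from the table.
    options = [[ch, *sorted(substitutions.get(ch, ()))] for ch in query]
    terms = ["".join(combo) for combo in product(*options)]
    hits = [t for t in terms if t in text]
    if hits:
        return hits

    # Stage 2: only if the query contains a registered character,
    # replace each such character with a one-character wildcard.
    registered = set(substitutions)
    for alternatives in substitutions.values():
        registered |= set(alternatives)
    if not any(ch in registered for ch in query):
        return []
    pattern = "".join("." if ch in registered else re.escape(ch) for ch in query)
    return [m.group(0) for m in re.finditer(pattern, text)]


# Invented sample data for illustration only.
substitutions = {"l": {"1"}, "O": {"0"}}
stage1 = search_with_fallback("model", "mode1 A", substitutions)
stage2 = search_with_fallback("model", "mode| A", substitutions)
none = search_with_fallback("abc", "xyz", substitutions)
```

In the second call the OCR text contains "mode|", a misrecognition not recorded in the table, so the substituted terms fail and the wildcard term `mode.` finds it.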

A twelfth aspect of the present invention is a character string search method whose search target is text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the method comprising:
a search term creation step of, when any character in an externally supplied input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table according to any one of the fourth to sixth aspects, creating a search term by replacing the matching character in the input character string with a wildcard, and, when no character in the input character string matches any of the registered misrecognition high-possibility characters and misrecognition characters, using the input character string as the search term; and
a search step of searching the text data obtained as the OCR result for a character string matching the search term obtained in the search term creation step.

A thirteenth aspect of the present invention is a character string search method whose search target is text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the method comprising:
a search term creation step of, when any character in an externally supplied input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table according to any one of the fourth to sixth aspects, using the input character string as a search term and additionally creating a search term by replacing the matching character in the input character string with another character associated with the matching character by the misrecognition character table, and, when no character in the input character string matches any of the registered misrecognition high-possibility characters and misrecognition characters, using the input character string as the search term; and
a search step of searching the text data obtained as the OCR result for a character string matching any of the search terms obtained in the search term creation step.

Other aspects of the present invention are clear from the above aspects and from the description of the embodiments and their modifications given later, so their description is omitted.

According to the first aspect of the present invention, each OCR target character, that is, each character that can be a recognition target of the OCR device, is printed on a recording medium as a character image in a plurality of different print forms, after which the OCR device reads the character images from the recording medium and recognizes the characters. Misrecognition character association data is generated that associates each OCR target character misrecognized at this time with the misrecognition character obtained as its recognition result. Print form association data is also generated that associates each such misrecognized OCR target character with the print form in which it was printed. By using a misrecognition character table containing this misrecognition character association data and print form association data, OCR-related processing can respond appropriately to misrecognition by the OCR device even when the print form used to print the text image targeted by the OCR device differs, that is, even when the printing device, font, or paper type used for the printing differs. For example, a character string search over text data obtained as the OCR result of target images printed in a plurality of different print forms can be performed appropriately by using the table.

According to the second aspect of the present invention, the print form association data includes information identifying at least one of the printing device, the font, and the recording medium used in printing the image of the character misrecognized by the OCR device. Therefore, even when whichever of the printing device, font, and recording medium is identified by that information differs for the text image targeted by the OCR device, using a misrecognition character table that contains the print form association data provides the same effect as the first aspect in OCR-related processing.

According to the third aspect of the present invention, the print form association data includes information identifying the type of paper used as the recording medium in printing the image of the character misrecognized by the OCR device. Therefore, even when the susceptibility of the printing ink to bleeding differs because the type of paper used as the recording medium for printing the text image targeted by the OCR device differs, using a misrecognition character table that contains the print form association data provides the same effect as the first aspect in OCR-related processing.

The misrecognition character table according to the fourth aspect of the present invention corresponds to the misrecognition character table created by the method according to the first aspect. The fourth aspect therefore provides the same effect as the first aspect.

The misrecognition character table according to the fifth aspect of the present invention corresponds to the misrecognition character table created by the method according to the second aspect. The fifth aspect therefore provides the same effect as the second aspect.

The misrecognition character table according to the sixth aspect of the present invention corresponds to the misrecognition character table created by the method according to the third aspect. The sixth aspect therefore provides the same effect as the third aspect.

According to the seventh aspect of the present invention, when any character in an externally supplied input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table according to any one of the fourth to sixth aspects, a search term is created by replacing the matching character in the input character string with a wildcard, and the text data obtained as the OCR result is searched for a character string matching that search term. Search omissions can thus be suppressed even when the text data obtained as the OCR result contains misrecognized characters. A character is registered in the misrecognition character table used to create such a wildcard search term whenever the OCR device misrecognizes it in an image printed in any of a plurality of different print forms. Search omissions can therefore be reliably suppressed even when the print form of the OCR target image used to create the text data to be searched differs. In addition, when the misrecognition character table is used in the character string search device, work such as visually checking the text data when the OCR device creates the text data to be searched, and updating a misrecognition word dictionary, becomes unnecessary, and the cost of that work is eliminated.

According to the eighth aspect of the present invention, when any character in an externally supplied input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table and the print form associated with the matching character matches the print form of the OCR target image, a search term is created by replacing the matching character in the input character string with a wildcard. When a character in the input character string matches any of the registered characters but the print form associated with the matching character does not match the print form of the OCR target image, the input character string is used as the search term and no wildcard is used. The output of superfluous or inappropriate search results is thereby suppressed while the same effect as the seventh aspect is obtained.

According to the ninth aspect of the present invention, when any character in an externally supplied input character string matches any of the misrecognition high-possibility characters and misrecognition characters registered in the misrecognition character table, the input character string is used as a search term, and a further search term is created by replacing the matching character in the input character string with another character associated with the matching character by the misrecognition character table. The same effect as the seventh aspect is thereby obtained. However, the seventh aspect is advantageous in terms of suppressing search omissions, whereas the ninth aspect is advantageous in terms of suppressing the output of superfluous or inappropriate search results.

According to the tenth aspect of the present invention, when any character in an externally supplied input character string matches either a character with a high possibility of misrecognition or a misrecognized character registered in the misrecognition character table, and the print form associated with the matching character matches the print form of the OCR target image, the input character string is used as a search term, and a further search term is created by replacing the matching character in the input character string with the other character that the misrecognition character table associates with it. If a character in the input character string matches one of the registered characters but the print form associated with it does not match the print form of the OCR target image, only the input character string is used as a search term. This achieves the same effect as the ninth aspect of the present invention while further suppressing the output of superfluous or inappropriate search results.
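
The search-term creation of the ninth and tenth aspects might look roughly like the sketch below. The names `TableEntry` and `build_search_terms` and the tuple representation of a print form are hypothetical, and the choice to swap one occurrence of a matching character at a time is an assumption of this example; the specification does not prescribe these details.

```python
from typing import NamedTuple, Optional


class TableEntry(NamedTuple):
    target: str   # character with a high possibility of misrecognition
    misread: str  # misrecognized character associated with it
    printer: str
    font: str
    paper: str


def build_search_terms(query: str, table: list[TableEntry],
                       form: Optional[tuple[str, str, str]] = None) -> list[str]:
    """Return the input string plus variants with one risky character swapped.

    form=None ignores the print form (ninth aspect); otherwise only entries
    whose (printer, font, paper) equals `form` are applied (tenth aspect).
    """
    terms = [query]  # the input character string is always a search term
    for e in table:
        if form is not None and (e.printer, e.font, e.paper) != form:
            continue
        # A match on either registered character is swapped for the other one.
        for src, dst in ((e.target, e.misread), (e.misread, e.target)):
            for i, ch in enumerate(query):
                if ch == src:
                    variant = query[:i] + dst + query[i + 1:]
                    if variant not in terms:
                        terms.append(variant)
    return terms


table = [TableEntry("ソ", "ン", "P1", "Mincho", "S1")]
ninth = build_search_terms("リソース", table)
tenth_match = build_search_terms("リソース", table, ("P1", "Mincho", "S1"))
tenth_other = build_search_terms("リソース", table, ("P2", "Gothic", "S2"))
```

Because only registered substitutions are tried, this approach produces fewer candidate matches than a wildcard, which is consistent with the trade-off described for the seventh and ninth aspects.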

According to the eleventh aspect of the present invention, when no character string matching any of the search terms obtained by the search term creation unit can be found in the text data serving as the OCR result, and any character in the input character string matches either a character with a high possibility of misrecognition or a misrecognized character registered in the misrecognition character table, the text data is searched for a character string matching a search term obtained by replacing the matching character in the input character string with a wildcard. This achieves the same effect as the ninth or tenth aspect of the present invention while further suppressing search omissions.
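
The two-stage search of the eleventh aspect can be illustrated with the following sketch. The function name `search_with_fallback`, its parameters, and the use of a regular expression with `.` as the wildcard are assumptions made for this example only.

```python
import re


def search_with_fallback(query: str, terms: list[str],
                         risky_chars: set[str], ocr_text: str) -> list[str]:
    """Try the exact search terms first; fall back to a wildcard pattern."""
    # Stage 1: search with the terms produced by the search term creation step.
    hits = [t for t in terms if t in ocr_text]
    if hits:
        return hits
    # Stage 2 applies only if the input contains a registered character.
    if not any(ch in risky_chars for ch in query):
        return []
    pattern = "".join("." if ch in risky_chars else re.escape(ch) for ch in query)
    return [m.group(0) for m in re.finditer(pattern, ocr_text)]


risky = {"ソ", "ン"}                # characters registered in a hypothetical table
terms = ["リソース", "リンース"]    # input string plus its substituted variant
fallback_hits = search_with_fallback("リソース", terms, risky, "このリシースは")
exact_hits = search_with_fallback("リソース", terms, risky, "リンース")
```

Here the first call finds nothing with the exact terms and falls back to the wildcard pattern, while the second call succeeds in the first stage and never uses a wildcard.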

The effects of the other aspects of the present invention are clear from the description of the effects of the above aspects and of the embodiment and its variations described below, so their description is omitted.

FIG. 1 is a block diagram showing the configuration of a system used to create a misrecognition character table according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of a computer that functions as the misrecognition character table creation device in that system.
FIG. 3 is a flowchart showing the process for creating the misrecognition character table (misrecognition character table creation process).
FIG. 4 is a diagram showing an example of the misrecognition character table.
FIG. 5 is a diagram for explaining an example of a character string search device provided with the misrecognition character table.
FIG. 6 is a block diagram showing the hardware configuration of the character string search device shown in FIG. 5.
FIG. 7 is a diagram for explaining the search terms used in the character string search device shown in FIG. 5.
FIG. 8 is a flowchart showing an example of the character string search process in the character string search device shown in FIG. 5.
FIG. 9 is a flowchart showing another example of the character string search process in the character string search device shown in FIG. 5.
FIG. 10 is a flowchart showing an example of an OCR adjustment amount table creation process that uses the misrecognition character table.
FIG. 11 is a diagram showing an example of the OCR adjustment amount table created by the OCR adjustment amount table creation process shown in FIG. 10.
FIG. 12 is a flowchart showing an OCR process (optical character recognition process) that uses the OCR adjustment amount table created by the process shown in FIG. 10.
FIG. 13 is a flowchart showing the character string search process in a first variation of the character string search device shown in FIG. 5.
FIG. 14 is a flowchart showing the character string search process in a second variation of the character string search device shown in FIG. 5.
FIG. 15 is a flowchart showing the character string search process in a third variation of the character string search device shown in FIG. 5.

Embodiments of the present invention are described below with reference to the accompanying drawings.

<1. Overall configuration>
FIG. 1 is a block diagram showing the configuration of a system used to create a misrecognition character table according to an embodiment of the present invention. The system includes a computer 10, a scanner 20, and first to third printers P1 to P3. The computer 10 and the first to third printers P1 to P3 are communicably connected by a LAN (Local Area Network), and the scanner 20 is connected to the computer 10. By executing a predetermined program, the computer 10 functions as a print control device, as an OCR device, and as a misrecognition character table creation device. The first to third printers P1 to P3 have resolutions that differ from one another, and each printer Pi (i = 1, 2, 3) is configured so that it can selectively use three types of paper, S1 to S3, as recording media. These three types of paper differ from one another in how easily ink bleeds on them. In the example shown in FIG. 1, the computer 10 can selectively use any of the three printers P1 to P3, and each of the printers P1 to P3 can selectively use any of the three types of paper S1 to S3. However, the number of selectable printers may be two, or four or more, and the number of paper types selectable on each printer may be two, or four or more. The types of paper that can be used may also differ from printer to printer.

FIG. 2 is a block diagram showing the hardware configuration of the computer 10. The computer 10 includes a main body 11, an auxiliary storage device 12, an optical disk drive 13, a display unit 14, a keyboard 15, and a mouse 16. The main body 11 includes a CPU 111, a memory 112, a first disk interface unit 113, a second disk interface unit 114, a display control unit 115, an input interface unit 116, and a network interface unit 117, which are connected to one another via a system bus. The auxiliary storage device 12 is connected to the first disk interface unit 113. The optical disk drive 13 is connected to the second disk interface unit 114. The display unit (display device) 14 is connected to the display control unit 115. The keyboard 15 and the mouse 16 are connected to the input interface unit 116. A network (LAN) 3 is connected to the network interface unit 117. The auxiliary storage device 12 is a magnetic disk device or the like. An optical disk 17 serving as a computer-readable recording medium, such as a DVD (Digital Versatile Disc) or a CD-ROM (Compact Disc Read Only Memory), is inserted into the optical disk drive 13. The display unit 14 is a liquid crystal display or the like and is used to display information desired by the user. The keyboard 15 and the mouse 16 are used by the user to input instructions to the computer 10.

The auxiliary storage device 12 stores a program 18 for creating the misrecognition character table (hereinafter "misrecognition character table creation program"). As shown in FIG. 4, the misrecognition character table collects characters that are likely to be misrecognized in OCR as characters with a high possibility of misrecognition and associates with each of them several pieces of information (identification information of the printer, the font, and so on) that specify the print form used to create the image to be subjected to OCR (hereinafter "OCR target image"); details are given later. The CPU 111 realizes the various functions for creating the misrecognition character table by reading the misrecognition character table creation program 18 stored in the auxiliary storage device 12 into the memory 112 and executing it. The memory 112 includes a RAM (Random Access Memory) and a ROM (Read Only Memory) and serves as a work area in which the CPU 111 executes the misrecognition character table creation program 18 stored in the auxiliary storage device 12. The misrecognition character table creation program 18 is provided stored on a computer-readable recording medium (non-transitory recording medium) such as the DVD mentioned above. That is, the user, for example, purchases the optical disk 17 as a recording medium of the misrecognition character table creation program 18, inserts it into the optical disk drive 13, reads the program 18 from the optical disk 17, and installs it in the auxiliary storage device 12. Alternatively, the misrecognition character table creation program 18 transmitted via a network such as the LAN 3 may be received by the network interface unit 117 and installed in the auxiliary storage device 12. When the CPU 111 executes the misrecognition character table creation program 18, two further functions are realized: a function of controlling the printers P1 to P3 to print the character images subject to OCR that are needed to create the misrecognition character table, and an OCR function of optically reading the printed character images and recognizing the characters.

<2. Misrecognition character table creation process>
FIG. 3 is a flowchart showing the process executed by the computer 10 on the basis of the misrecognition character table creation program in order to create the misrecognition character table according to the present embodiment (hereinafter "misrecognition character table creation process"). That is, in the system shown in FIG. 1, the CPU 111 in the computer 10 operates as described below in accordance with the misrecognition character table creation program. It is assumed that when the program is started, that is, at the start of the misrecognition character table creation process, the first to third printers P1 to P3 are all in the unused state, the papers S1 to S3 usable for printing on each printer Pi (i = 1, 2, 3) are all in the unused state, and all fonts usable on each printer Pi are also in the unused state.

First, the CPU 111 sets one of the unused printers among the first to third printers P1 to P3 in the system of FIG. 1 as the printer to be used (hereinafter "printer in use") Ps (step S101). Immediately before step S101 is executed for the first time after the start of the misrecognition character table creation process, all of the printers P1 to P3 are unused. Next, one of the unused fonts among the fonts usable on the printer in use Ps (normally several fonts such as "Mincho" and "Gothic" are available) is set as the font in use Fs (step S102). Then, one of the unused paper types among the three types of paper S1 to S3 usable on the printer in use Ps is set as the paper in use Ss (step S103).

After that, the printer in use Ps prints all of the OCR target characters as a character image on the paper in use Ss using the font in use Fs (step S104). Here, the OCR target characters are all of the characters to be recognized from an image read by the scanner 20, that is, all of the characters that can be recognized by the OCR device 80 realized by the scanner 20, the computer 10, and the misrecognition character table creation program executed on it.

Next, the image of the OCR target characters printed in step S104 is read by the scanner 20 as the target image (step S106), and each character in the target image is identified by pattern recognition to generate a character code, which is output as an OCR result character (step S108). The paper on which the target image to be read in step S106 is printed is assumed to be moved manually to the reading position of the scanner 20. Alternatively, a mechanism that moves the printed paper from the printer in use Ps to the scanner 20 may be provided and controlled by the computer 10.

After that, the following processing is performed to determine the characters to be registered in the misrecognition character table.
First, one of the OCR target characters and the OCR result character corresponding to it are selected for examination (step S110). Immediately after step S108 is executed, all of the OCR target characters and all of the OCR result characters are in the unexamined state.

Next, it is determined whether the two characters under examination match each other, that is, whether the code of the OCR target character under examination (hereinafter "OCR target character of interest") matches the code of the OCR result character under examination (hereinafter "OCR result character of interest") (step S112). If the two characters match, it is judged that no misrecognition has occurred and the process proceeds to step S116. If they differ, it is judged that misrecognition has occurred and the process proceeds to step S114.

When the process proceeds to step S114, the following items are associated with one another and registered in the misrecognition character table (hereinafter also simply "table") as described below, after which the process proceeds to step S116.
(1) The code of the OCR target character of interest is stored in the table, registering that character as a "character with a high possibility of misrecognition".
(2) The code of the OCR result character of interest is stored in the table, registering that character as a "misrecognized character".
(3) Data indicating the printer Ps, the font Fs, and the paper Ss used to print the target image is stored, registering the print form specified by the printer Ps, the font Fs, and the paper Ss.
For example, as shown in FIG. 4, suppose that "ソ" (the character romanized as "SO"), which is one of the OCR target characters, is recognized as "ン" (the character romanized as "N"), that is, the OCR result character of interest for the OCR target character of interest "ソ" is "ン". In that case, "ソ" is registered as a character with a high possibility of misrecognition (character 1), "ン" is registered in association with it as the misrecognized character (character 2), and the printer Ps, the font Fs, and the paper Ss that specify the print form of the target image containing the OCR target character "ソ" are also registered. In the example shown in FIG. 4, "printer P1" is registered as the printer Ps, "Mincho" as the font Fs, and "paper S1" as the paper Ss.
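
One possible in-memory representation of a table row, together with the comparison of step S112 and the registration of step S114, is sketched below. The class name `MisrecognitionEntry`, its field names, and the function `register_if_misrecognized` are hypothetical; the specification only states which items are associated with one another, not how they are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisrecognitionEntry:
    target_char: str  # (1) character with a high possibility of misrecognition
    result_char: str  # (2) misrecognized character
    printer: str      # (3) print form: printer, font, and paper type
    font: str
    paper: str


def register_if_misrecognized(table: list[MisrecognitionEntry],
                              target_char: str, result_char: str,
                              printer: str, font: str, paper: str) -> bool:
    """Append an entry only when the two character codes differ (S112, S114)."""
    if target_char == result_char:
        return False  # no misrecognition, nothing to register
    table.append(MisrecognitionEntry(target_char, result_char, printer, font, paper))
    return True


table: list[MisrecognitionEntry] = []
registered = register_if_misrecognized(table, "ソ", "ン", "P1", "Mincho", "S1")
skipped = register_if_misrecognized(table, "ア", "ア", "P1", "Mincho", "S1")
```

After these two calls the table holds a single row corresponding to the "ソ" and "ン" example described for FIG. 4.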

In step S116, it is determined whether any unexamined OCR target character remains. If one remains, the process returns to step S110. Steps S110 to S116 are then repeated until no unexamined OCR target character remains, at which point the process proceeds to step S118. At this point, it has been determined for every OCR target character in the target image of one print form whether it was misrecognized, and items (1) to (3) above have been registered in the misrecognition character table for each misrecognized character.

In step S118, it is determined whether all types of paper usable on the printer in use Ps (the three types S1 to S3 in the system of FIG. 1) have been used. If not all types have been used, all OCR target characters are returned to the unexamined state (step S119), and the process returns to step S103. Steps S103 to S119 are then repeated until all types of paper S1 to S3 have been used, at which point the process proceeds to step S120.

In step S120, it is determined whether all fonts usable on the printer in use Ps have been used. If not all fonts have been used, all OCR target characters are returned to the unexamined state and all types of paper usable on the printer in use Ps are returned to the unused state (step S121), and the process returns to step S102. Steps S102 to S121 are then repeated until all fonts have been used, at which point the process proceeds to step S122.

In step S122, it is determined whether all printers usable for creating the misrecognition character table (the three printers P1 to P3 in the system of FIG. 1) have been used. If not all printers have been used (that is, if any printer is still unused), all OCR target characters are returned to the unexamined state and all paper types and all fonts usable on the printer in use Ps are returned to the unused state (step S123), and the process returns to step S101. Steps S101 to S123 are then repeated until all printers have been used, at which point the misrecognition character table creation process ends.

When the misrecognition character table creation process ends in this way, all of the OCR target characters have been printed in every print form specified by a printer, a font, and a paper type; each character in the target images obtained by that printing has been identified by pattern recognition (OCR) and output as an OCR result; the occurrence of misrecognition has been determined from those OCR results; and the misrecognition character table has been created from the determination results. FIG. 4 shows an example of a misrecognition character table created in this way.
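
The overall control flow of FIG. 3 amounts to three nested loops over printers, fonts, and paper types, with a per-character comparison inside. The sketch below illustrates this under stated assumptions: `create_misrecognition_table` and its parameters are hypothetical names, and `print_scan_ocr` is a hypothetical callback standing in for the printing, scanning, and OCR of steps S104 to S108, which in the described system involve real hardware. The stub `fake_print_scan_ocr` exists only so that the example runs.

```python
def create_misrecognition_table(printers, fonts_by_printer, papers_by_printer,
                                target_chars, print_scan_ocr):
    """Collect (target, result, printer, font, paper) rows for every print form."""
    table = []
    for printer in printers:                          # S101, S122
        for font in fonts_by_printer[printer]:        # S102, S120
            for paper in papers_by_printer[printer]:  # S103, S118
                # S104 to S108: print all target characters, scan, and run OCR.
                results = print_scan_ocr(printer, font, paper, target_chars)
                for target, result in zip(target_chars, results):  # S110, S116
                    if target != result:                            # S112
                        table.append((target, result, printer, font, paper))  # S114
    return table


def fake_print_scan_ocr(printer, font, paper, chars):
    """Stub for illustration: misreads "ソ" as "ン" in one print form only."""
    if (printer, font, paper) == ("P1", "Mincho", "S1"):
        return ["ン" if c == "ソ" else c for c in chars]
    return list(chars)


demo_table = create_misrecognition_table(
    printers=["P1", "P2"],
    fonts_by_printer={"P1": ["Mincho", "Gothic"], "P2": ["Mincho"]},
    papers_by_printer={"P1": ["S1", "S2"], "P2": ["S1"]},
    target_chars="アソン",
    print_scan_ocr=fake_print_scan_ocr,
)
```

Keeping fonts and paper types in per-printer mappings reflects the remark that the usable paper types may differ between printers; the explicit "unused" and "unexamined" flags of the flowchart are replaced here by ordinary loop iteration.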

<3. Character string search device>
The misrecognition character table according to the present embodiment described above (see FIG. 4) can be used in a character string search device that takes as its search target text data consisting of characters recognized by an OCR device from an image containing text, and that looks in that text data for a character string matching a character string input from outside.

<3.1 Configuration>
FIG. 5 is a schematic diagram showing the configuration of a search system that includes such a character string search device. The search system includes a search terminal device 30 with which the user inputs a character string to be searched for, a character string search device 40 that looks in the text data serving as the search target for a character string matching the character string input at the search terminal device 30 as a search term (hereinafter "input character string"), and an OCR device 80 for creating that text data. The search terminal device 30 and the character string search device 40 are connected to the Internet 5 and can communicate with each other over it. The character string search device 40 and the OCR device 80 are communicably connected to each other by a LAN. Although the search system of FIG. 5 includes one search terminal device 30, it may include a plurality of search terminal devices.

The search terminal device 30 is realized by executing a predetermined program on a personal computer (hereinafter "PC"). Under that program, when the search terminal device 30 receives, through the user's input operation, a character string to be searched for in the text data Dtx in the character string search device 40 as an input character string, it sends the input character string to the character string search device 40 via the Internet 5, and then receives the search result based on that input character string from the character string search device 40 and displays it.

As shown in FIG. 5, the character string search device 40 includes a search processing device 45, an auxiliary storage device 46 connected to it, and a network attached storage (hereinafter "NAS") 48 configured to be accessible from the search processing device 45 and the OCR device 80 via the LAN 3. The search processing device 45 is realized by executing a character string search program Spg, described later, on a PC. The auxiliary storage device 46 and the NAS 48 are configured using magnetic disks or the like. The auxiliary storage device 46 stores the misrecognition character table Etbl created by the misrecognition character table creation process described above and the character string search program Spg, and the NAS 48 stores the text data Dtx serving as the search target.

The OCR device 80 includes an OCR processing device 85 and a scanner 86 connected to it. The OCR processing device 85 is realized using a PC. Under an OCR program, it reads an image containing text with the scanner 86 and identifies the characters contained in the image by pattern recognition, thereby generating text data as the OCR result. This text data is sent to the NAS 48 via the LAN 3 and stored there as the text data Dtx to be searched by the character string search device 40. The text data Dtx also contains information indicating the output conditions under which the OCR target image read by the scanner 86 was printed, that is, the print form specified by the printer, the font, and the paper type used to print that OCR target image. The OCR program used to realize the OCR processing device 85 is not particularly limited, and a known program can be used.

<3.2 Search processing device and character string search process>
FIG. 6 is a block diagram showing the configuration of the PC that serves as the hardware of the search processing device 45 in the character string search device 40 shown in FIG. 5 (the hardware configuration of the search processing device 45). This hardware configuration differs from that of the computer 10 in FIG. 2 in that it has an external auxiliary storage device 46 in place of the built-in auxiliary storage device 12. In all other respects it is the same as the hardware configuration of the computer 10 in FIG. 2, so identical parts are given identical reference numerals and their detailed description is omitted. In the character string search device 40 shown in FIG. 5, the auxiliary storage device 46 is external to the search processing device 45, but it may instead be built into the search processing device 45.

As described above, the auxiliary storage device 46 stores the misrecognition character table Etbl and the character string search program Spg. The CPU 111 in the search processing device 45 reads the character string search program Spg from the auxiliary storage device 46 into the memory 112 and executes it, thereby implementing the character string search processing described below. The character string search program Spg is provided on a computer-readable recording medium (a non-transitory recording medium) such as a DVD. For example, the user purchases an optical disk 17 as the recording medium of the character string search program Spg, inserts it into the optical disk drive 13, reads the program from the optical disk 17, and installs it in the auxiliary storage device 46. Alternatively, the character string search program Spg may be transmitted via the Internet 5, received by the network interface unit 117, and installed in the auxiliary storage device 46.

The search processing device 45 receives an input character string from the search terminal device 30 via the Internet 5 and, based on this input character string, searches the text data Dtx stored in the NAS 48 as the OCR result. In doing so it uses not only the input character string itself as a search term but also, as shown in FIG. 7, new search terms created from the input character string with the misrecognition character table Etbl (details are given below). Two examples of the character string search processing executed by the search processing device 45, shown in FIGS. 8 and 9, are described below. The search processing device 45 may be configured so that it can selectively execute either the processing of FIG. 8 or the processing of FIG. 9, with the user specifying by an input operation which of the two is executed.

<3.2.1 Example of character string search processing>
FIG. 8 is a flowchart showing one example of the character string search processing executed by the search processing device 45. When this processing is executed, the CPU 111 of the personal computer serving as the hardware of the search processing device 45 operates as follows based on the character string search program Spg.

When the character string search program Spg corresponding to the processing of FIG. 8 is started, the CPU 111 waits until it receives an input character string for the search from the search terminal device 30. On receiving the input character string (step S201), it selects one of the characters in the input character string that has not yet been examined as the character of interest (step S203). At the time the input character string is received, none of its characters has been examined.

Next, the CPU 111 determines whether the character of interest is registered in the misrecognition character table Etbl (step S204). When the misrecognition character table Etbl shown in FIG. 4 is used, it determines whether the character of interest is any of the characters registered as character 1 in the table Etbl (「ソ」, 「タ」, 「高」, ..., 「リ」, ...), the characters registered as character 2 (「ン」, 「タ」, 「▲高▼」 (hashigo-daka, the "ladder" form of 「高」), ..., 「ソ」, ...), or the characters registered as character 3 (「ク」, ..., 「ン」, ...). Here, for convenience, the variant of 「高」 known as hashigo-daka (the character assigned Unicode code point U+9AD9) is written as 「▲高▼」 (the same applies below). If the determination in step S204 finds that the character of interest is registered in the misrecognition character table Etbl, the processing proceeds to step S206; otherwise it proceeds to step S208.

In step S206, the character of interest is regarded as having a likelihood of being misrecognized by the OCR device 80 that exceeds the permissible range, and the character of interest in the input character string is replaced with a wildcard (「？」). The processing then proceeds to step S208.

In step S208, the CPU 111 determines whether any character in the input character string has not yet been examined. If so, the processing returns to step S203. Steps S203 to S208 are repeated until every character in the input character string has been examined, after which the processing proceeds to step S210. At this point, every character in the input character string that is registered in the misrecognition character table Etbl has been regarded as likely to be misrecognized by the OCR device 80 beyond the permissible range and has been replaced with a wildcard.

In step S210, the resulting input character string is used as the search term, and the text data Dtx stored in the NAS 48 as the search target is searched for character strings that match it.

Then, data representing the search result is sent to the search terminal device 30 via the Internet 5 so that the search result is displayed there (step S212). As a result, the search terminal device 30 displays, for example, the sentences or paragraphs of the text data Dtx that contain a character string matching the input character string as of step S210 (for example, the character string of search term 2 shown in FIG. 7), with the matching character string highlighted.

With this character string search processing, suppose, as shown in FIG. 7, that the input character string (search term 1) received from the search terminal device 30 is 「ベンチャー」 ("venture") (step S201). Because the character 「ン」 is registered in the misrecognition character table as shown in FIG. 4, step S210 searches the text data Dtx for character strings that match the input character string 「ベ？チャー」 as search term 2. Consequently, even if the character 「ン」 in 「ベンチャー」 was misrecognized and the word appears as 「ベソチャー」 in the text data Dtx, the sentences or paragraphs containing 「ベソチャー」 are displayed as search results, because 「ベソチャー」 matches the wildcard search term 「ベ？チャー」. The 「？」 in a search term stands for any single character (the same applies below).
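The wildcard search of FIG. 8 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the table excerpt `ETBL_ROWS`, the function names, and the use of a regular expression for matching are all assumptions made for the example.

```python
import re

# Hypothetical excerpt of the misrecognition character table Etbl: each row
# groups characters that the OCR device confused with one another.
ETBL_ROWS = [
    ("ソ", "ン"),
    ("高", "\u9ad9"),  # U+9AD9 is the "ladder" variant of 高
    ("リ", "ソ", "ン"),
]
REGISTERED = {ch for row in ETBL_ROWS for ch in row}


def to_wildcard_term(text, registered=REGISTERED, wildcard="？"):
    """Steps S203 to S208: replace every registered character with a wildcard."""
    return "".join(wildcard if ch in registered else ch for ch in text)


def search(term, corpus, wildcard="？"):
    """Step S210: match the term, treating the wildcard as any single character."""
    pattern = "".join("." if ch == wildcard else re.escape(ch) for ch in term)
    return [m.group(0) for m in re.finditer(pattern, corpus)]


term = to_wildcard_term("ベンチャー")
hits = search(term, "ベソチャー企業とベンチャー投資")
```

Here `term` is the wildcard search term for the input 「ベンチャー」, and `hits` lists both the correctly recognized and the misrecognized spelling found in the sample text.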

Similarly, suppose, as shown in FIG. 7, that the input character string (search term 1) received from the search terminal device 30 is 「高島」 (step S201). Because the character 「高」 is registered in the misrecognition character table as shown in FIG. 4, step S210 searches the text data Dtx for character strings that match the input character string 「？島」 as search term 2. Consequently, even if the character 「高」 in 「高島」 was misrecognized and the word appears as 「▲高▼島」 in the text data Dtx, the sentences or paragraphs containing 「▲高▼島」 are displayed as search results, because 「▲高▼島」 matches the wildcard search term 「？島」.

In this way, search omissions can be suppressed even when the search target is the text data Dtx obtained as an OCR result and contains characters misrecognized by the OCR device 80. Moreover, a character is registered in the misrecognition character table Etbl, which the processing of FIG. 8 uses to create the search term containing wildcards, whenever the OCR device misrecognizes that character in an image printed in any of various print forms that differ in printer, font, or paper type. Search omissions can therefore be reliably suppressed even when the print form of the OCR target images used to create the text data Dtx varies.

<3.2.2 Another example of character string search processing>
FIG. 9 is a flowchart showing another example of the character string search processing executed by the search processing device 45. When this processing is executed, the CPU 111 of the personal computer serving as the hardware of the search processing device 45 operates as follows based on the character string search program Spg. Steps in this example that are the same as in the processing of FIG. 8 are given the same step numbers, and their detailed description is omitted.

When the character string search program Spg corresponding to the processing of FIG. 9 is started, the CPU 111 waits until it receives an input character string for the search from the search terminal device 30. On receiving the input character string (step S201), it adds the input character string, as one search term, to the search term group to be used for the search (step S202). After the character string search program is started and immediately before step S202 is executed, the search term group contains no search terms.

Next, the CPU 111 selects one of the characters in the input character string that has not yet been examined as the character of interest (step S203) and determines whether the character of interest is registered in the misrecognition character table Etbl (step S204). If it is registered, the processing proceeds to step S220; if it is not, the processing proceeds to step S222.

In step S220, the character of interest is regarded as having a likelihood of being misrecognized by the OCR device 80 that exceeds the permissible range. For each search term in the search term group, the character of interest is replaced with another character that the misrecognition character table associates with it, and each search term created in this way is added to the search term group. The processing then proceeds to step S222.

In step S222, the CPU 111 determines whether any character in the input character string has not yet been examined. If so, the processing returns to step S203. Steps S203 to S222 are repeated until every character in the input character string has been examined, after which the processing proceeds to step S224. At this point, every character in the input character string that is registered in the misrecognition character table Etbl has been regarded as likely to be misrecognized by the OCR device 80 beyond the permissible range, and all of the search terms obtained by replacing at least one of those registered characters in the input character string with another character associated with it by the misrecognition character table Etbl have been created and added to the search term group.

In step S224, the text data Dtx stored in the NAS 48 as the search target is searched for character strings that match any of the search terms in the search term group.

Then, data representing the search result is sent to the search terminal device 30 via the Internet 5 so that the search result is displayed there (step S226). As a result, for example, for each search term in the search term group, the sentences or paragraphs of the text data Dtx that contain a character string matching that search term (for example, a character string of search term 3 shown in FIG. 7) are displayed with the matching character string highlighted.

With this character string search processing, suppose, as shown in FIG. 7, that the input character string (search term 1) received from the search terminal device 30 is 「ベンチャー」 (step S201). As shown in FIG. 4, the character 「ン」 is registered in the misrecognition character table Etbl, and the table associates 「ン」 with the character 「ソ」. Step S224 therefore searches the text data Dtx for character strings matching each of the two character strings 「ベンチャー」 and 「ベソチャー」, which FIG. 7 writes as “べ（ン｜ソ）チャー” (search term 3). Consequently, even if the character 「ン」 in 「ベンチャー」 was misrecognized and the word appears as 「ベソチャー」 in the text data Dtx, the sentences or paragraphs containing 「ベソチャー」 are displayed as search results, because 「ベソチャー」 matches one of the search terms in the search term group.
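The construction of the search term group in FIG. 9 can be sketched as follows. This is an illustrative sketch under stated assumptions: the mapping `CONFUSABLE` is a hypothetical excerpt of the table Etbl, and the function names and the plain substring test used for matching are not taken from the patent.

```python
# Hypothetical excerpt of Etbl: maps a registered character to the other
# characters that the table associates with it.
CONFUSABLE = {
    "ン": ["ソ"],
    "ソ": ["ン"],
    "高": ["\u9ad9"],  # U+9AD9 is the "ladder" variant of 高
    "\u9ad9": ["高"],
}


def expand_terms(text, confusable=CONFUSABLE):
    """Build the search term group for an input character string."""
    terms = [text]  # step S202: the input itself is the first search term
    for i, ch in enumerate(text):  # steps S203 to S222: visit each character once
        if ch not in confusable:  # step S204: not registered, nothing to add
            continue
        new_terms = []
        for term in terms:  # step S220: substitute in every term collected so far
            for alt in confusable[ch]:
                new_terms.append(term[:i] + alt + term[i + 1:])
        terms.extend(new_terms)
    return terms


def search_any(terms, corpus):
    """Step S224: return the search terms that occur in the corpus."""
    return [t for t in terms if t in corpus]


group = expand_terms("ベンチャー")
found = search_any(group, "ベソチャー企業の一覧")
```

Because every substitution is applied to all terms collected so far, an input with several registered characters yields every combination of original and substituted characters, which is what the description of step S222 requires.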

Similarly, suppose, as shown in FIG. 7, that the input character string (search term 1) received from the search terminal device 30 is 「高島」 (step S201). As shown in FIG. 4, the character 「高」 is registered in the misrecognition character table Etbl, and the table associates 「高」 with the character 「▲高▼」 (see the row with ID = 3 in FIG. 4). Step S224 therefore searches the text data Dtx for character strings matching each of the two character strings 「高島」 and 「▲高▼島」, which FIG. 7 writes as “（高｜▲高▼）島” (search term 3).

In this way, the processing of FIG. 9 also suppresses search omissions when the search target is the text data Dtx obtained as an OCR result and contains characters misrecognized by the OCR device 80, giving the same effect as the processing of FIG. 8. In the processing of FIG. 9, however, a character in the input character string that is registered in the misrecognition character table Etbl is not replaced with a wildcard but with the other characters that the table associates with it. The processing of FIG. 9 is therefore better than that of FIG. 8 at limiting inappropriate or superfluous search results. Conversely, the processing of FIG. 8 is better than that of FIG. 9 at suppressing search omissions.
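The trade-off between the two approaches can be seen in a small example. The sample text and the regular expressions below are assumptions chosen for illustration: the wildcard term 「？島」 also matches the unrelated word 「広島」, while the substituted terms match only the two spellings listed in the table.

```python
import re

# Sample OCR text containing an unrelated word (広島), the misrecognized
# spelling (U+9AD9 followed by 島), and the correct spelling (高島).
corpus = "広島と\u9ad9島と高島"

# FIG. 8 style: the wildcard stands for any single character.
wildcard_hits = re.findall(".島", corpus)

# FIG. 9 style: only the characters associated by the table are tried.
expanded_hits = re.findall("高島|\u9ad9島", corpus)
```

The wildcard search returns one extra, irrelevant hit here, but it would also find a misreading of 「高」 that the table does not list, which the substituted terms would miss.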

<4. OCR adjustment amount table and OCR processing>
The misrecognition character table according to this embodiment (see FIG. 4) can be used to create an OCR adjustment amount table for improving the accuracy of character recognition by the OCR device (hereinafter "OCR accuracy"). The OCR adjustment amount table associates each print form, specified by the printer, font, paper type, and so on used to print the OCR target image, with an appropriate adjustment amount for the processing that reads the OCR target image and recognizes characters (hereinafter "OCR processing"). It is configured, for example, as shown in FIG. 11.

Such an OCR adjustment amount table can be created by executing the OCR adjustment amount table creation processing shown in FIG. 10 under a predetermined program on the computer 10 configured as shown in FIG. 2. This processing is described below with reference to FIG. 10. At the start of the processing, all available printers are in the unused state, all paper types usable for printing on each available printer are in the unused state, and all fonts usable on each printer are in the unused state. In FIG. 10, the "OCR adjustment amount" is any one adjustment amount used in the image processing that the OCR device applies, for character recognition, to the image read by the scanner. Examples are the vertical adjustment amount (the amount of thickening or thinning in the vertical direction) and the horizontal adjustment amount (the amount of thickening or thinning in the horizontal direction) applied to the images of the characters in the read image (see FIG. 11).

In the OCR adjustment amount table creation processing of FIG. 10, one of the unused printers among those available for printing OCR target images (images containing text) is first set as the printer in use Ps (step S301). Next, one of the unused fonts among those usable on the printer Ps is set as the font in use Fs (step S302). Then, one of the unused paper types among those usable on the printer Ps is set as the paper in use Ss (step S303).

Next, the OCR adjustment amount is set to a predetermined minimum value (step S304).

Then, among the OCR target characters (all characters that may be subject to recognition by the OCR device), those registered in the misrecognition character table are taken as the OCR adjustment target characters, and one of the OCR adjustment target characters that has not yet been examined is selected as the character of interest (step S306). After the processing starts and immediately before step S306 is executed for the first time, none of the OCR adjustment target characters has been examined.

Next, the character of interest is printed by the printer Ps using the font Fs. An OCR device (for example, the OCR device 80 shown in FIG. 5) reads the printed character of interest as an image, determines the character by pattern recognition, and outputs that character (its code data) as the OCR result character (step S308).

Then, the character of interest (the OCR target character under examination) is compared with the corresponding OCR result character, and data indicating whether the two characters match is saved as the comparison result (step S310).

Next, it is determined whether any OCR adjustment target character has not yet been examined (step S312). If so, the processing returns to step S306. Steps S306 to S312 are repeated until every OCR adjustment target character has been examined, after which the processing proceeds to step S314.

In step S314, the OCR adjustment amount is increased by a predetermined adjustment unit amount, and it is then determined whether the OCR adjustment amount exceeds a predetermined maximum value (step S316). If it does not, all OCR adjustment target characters are reset to the unexamined state (step S317) and the processing returns to step S306. Steps S306 to S317 are repeated until the OCR adjustment amount exceeds the maximum value, after which the processing proceeds to step S318.

When the processing reaches step S318, the comparison results between each OCR adjustment target character and its OCR result character have been saved for every OCR adjustment amount from the minimum value to the maximum value at intervals of the adjustment unit amount. The best OCR adjustment amount is determined from these comparison results and registered in the OCR adjustment amount table in association with the print form specified by the printer Ps, the font Fs, and the paper Ss (step S318). Here, among the OCR adjustment amounts from the minimum value to the maximum value at intervals of the adjustment unit amount, the one for which the largest number of pairs of an OCR adjustment target character and its OCR result character match is regarded as the best OCR adjustment amount.
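The selection of the best OCR adjustment amount for one print form (steps S304 to S318) can be sketched as follows. The callable `ocr` is a hypothetical stand-in for printing a character, scanning it, and running OCR with a given adjustment amount; `fake_ocr` is a toy model invented for the example. The text does not say how ties are broken, so keeping the smallest amount on a tie is an assumption.

```python
def best_adjustment(chars, ocr, minimum, maximum, step):
    """Return the adjustment amount that yields the most matching characters.

    `ocr(ch, amount)` is a hypothetical callable returning the OCR result
    character for `ch` at the given adjustment amount.
    """
    best_amount, best_matches = None, -1
    amount = minimum  # step S304: start from the minimum value
    while amount <= maximum:  # step S316: stop once the maximum is exceeded
        # Steps S306 to S312: count target characters that are read correctly.
        matches = sum(1 for ch in chars if ocr(ch, amount) == ch)
        if matches > best_matches:  # assumption: first (smallest) amount wins ties
            best_amount, best_matches = amount, matches
        amount += step  # step S314: increase by the adjustment unit amount
    return best_amount  # step S318: value registered for the print form


def fake_ocr(ch, amount):
    """Toy OCR model used only to exercise the search above."""
    if ch == "ン":
        return "ン" if amount <= -1 else "ソ"
    if ch == "高":
        return "高" if amount >= -1 else "\u9ad9"
    return ch


best = best_adjustment(["ン", "高", "タ"], fake_ocr, minimum=-2, maximum=2, step=1)
```

In this toy model only the amount -1 reads all three characters correctly, so it is the value that would be registered for the print form.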

Then, it is determined whether all paper types usable on the printer Ps have been used (step S320). If not, all OCR adjustment target characters are reset to the unexamined state (step S321) and the processing returns to step S303. Steps S303 to S321 are repeated until all paper types have been used, after which the processing proceeds to step S322.

In step S322, it is determined whether all fonts usable on the printer Ps have been used. If not, all OCR adjustment target characters are reset to the unexamined state and all paper types usable on the printer Ps are reset to the unused state (step S323), and the processing returns to step S302. Steps S302 to S323 are repeated until all fonts have been used, after which the processing proceeds to step S324.

In step S324, it is determined whether all printers available for printing OCR target images have been used. If not, all OCR adjustment target characters are reset to the unexamined state and all paper types and all fonts usable on the printer Ps are reset to the unused state (step S325), and the processing returns to step S301. Steps S301 to S325 are repeated until all printers have been used, at which point the OCR adjustment amount table creation processing ends.
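The outer loops of FIG. 10 amount to visiting every combination of printer, font, and paper type. A minimal sketch follows, assuming a hypothetical callable `tune` that performs steps S304 to S318 for one print form; the dictionary layout and the font name "Mincho" are also assumptions made for the example.

```python
def build_adjustment_table(printers, tune):
    """Create an OCR adjustment amount table keyed by print form.

    `printers` maps a printer name to a pair (fonts, paper types) usable on it.
    `tune(printer, font, paper)` is a hypothetical callable that returns the
    best adjustment amount for one print form (steps S304 to S318).
    """
    table = {}
    for printer, (fonts, papers) in printers.items():  # steps S301 and S324
        for font in fonts:  # steps S302 and S322
            for paper in papers:  # steps S303 and S320
                table[(printer, font, paper)] = tune(printer, font, paper)
    return table


printers = {
    "P1": (["Gothic", "Mincho"], ["S1", "S2"]),
    "P2": (["Gothic"], ["S1"]),
}
# A constant tuner keeps the example self-contained.
table = build_adjustment_table(printers, lambda printer, font, paper: 0)
```

Nested loops make the resets of steps S321, S323, and S325 implicit, since each inner loop restarts from its full list on every outer iteration.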

The above describes the OCR adjustment amount table creation processing (FIG. 10) for a single OCR adjustment amount, but an OCR adjustment amount table can be created for other OCR adjustment amounts by the same processing. For example, executing the processing separately for the vertical adjustment amount and the horizontal adjustment amount described above and combining the results into one table yields an OCR adjustment amount table such as the one shown in FIG. 11.

As shown in FIG. 11, this OCR adjustment amount table gives, for each set of output conditions under which an OCR target image is printed, that is, for each print form specified by the printer, font, and paper type used to print the OCR target image, the OCR adjustment amounts with which the OCR device can recognize characters with high accuracy (in the example of FIG. 11, the vertical and horizontal adjustment amounts applied to the character images).

FIG. 12 is a flowchart showing the OCR processing in which an OCR device uses such an OCR adjustment amount table to recognize characters in an image and generate text data.

In this OCR processing, the output conditions under which the OCR target image was printed are first acquired, specifically the print form specified by the printer, font, and paper type used to print the OCR target image (step S402). Next, the OCR adjustment amount corresponding to these output conditions (print form) is acquired from the OCR adjustment amount table (step S404). That OCR adjustment amount is then set in the OCR device, and the OCR device executes OCR (step S406). That is, the OCR device reads the OCR target image, identifies the characters in it by pattern recognition, and generates text data.

Assuming the OCR adjustment amount table of FIG. 11 is used, when the output condition acquired in step S402 corresponds to the print form specified by printer P1, the Gothic font, and paper S1, this OCR process sets a thinning amount of "-1 [pix]" as the vertical adjustment amount and a thickening amount of "+1 [pix]" as the horizontal adjustment amount in the OCR device, and then executes OCR. Because an OCR adjustment amount appropriate to the output condition at the time of printing is set in the OCR device in this way, the OCR device can recognize characters with high accuracy.
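The lookup in steps S402 to S404 can be sketched as a dictionary keyed by print form. This is a minimal illustration, not the implementation of the embodiment: the names `PrintForm`, `OcrAdjustment`, and `lookup_adjustment` are hypothetical, only the single row quoted in the text (printer P1, Gothic, paper S1) is reproduced, and falling back to a zero adjustment for an unlisted print form is an assumption.

```python
from typing import NamedTuple


class PrintForm(NamedTuple):
    """Output condition: printer, font, and paper type (hypothetical type)."""
    printer: str
    font: str
    paper: str


class OcrAdjustment(NamedTuple):
    """Adjustment in pixels: negative = thinning, positive = thickening."""
    vertical_px: int
    horizontal_px: int


# Only this row is taken from the example in the text; other rows of FIG. 11
# are not reproduced here.
OCR_ADJUSTMENT_TABLE = {
    PrintForm("P1", "Gothic", "S1"): OcrAdjustment(vertical_px=-1, horizontal_px=+1),
}


def lookup_adjustment(form, table=OCR_ADJUSTMENT_TABLE,
                      default=OcrAdjustment(0, 0)):
    """Step S404: return the adjustment for the print form acquired in S402.

    Returning a zero adjustment for an unknown print form is an assumption
    of this sketch, not something the text specifies.
    """
    return table.get(form, default)


adjustment = lookup_adjustment(PrintForm("P1", "Gothic", "S1"))
print(adjustment)  # the values would then be set in the OCR device (step S406)
```

Setting the returned amounts in an actual OCR engine (step S406) depends on that engine's interface and is left out of the sketch.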

<5. Effects>
As described above, the misrecognition character table according to the present embodiment can be used in a character string search device, and can also be used to create the OCR adjustment amount table for determining the OCR adjustment amounts in an OCR device. This provides the following effects.

A character string search device using the misrecognition character table can search text data obtained as an OCR result for a character string with high accuracy, and suppress search omissions, without the OCR result being checked visually. It is therefore unnecessary to visually confirm all of the text data obtained as the OCR result, and the cost of such confirmation work is reduced.

Conventionally, it has also been difficult to search for an unknown word that is not listed in the misrecognition word dictionary described above, whereas a character string search device using the misrecognition character table can search for unknown words as well. That is, even when a character string is searched for in text data obtained as an OCR result, a search term is created by replacing each character of the input character string that is easily misrecognized by OCR with a wildcard, or with another character associated with that character in the misrecognition character table (step S206 in FIG. 8, step S220 in FIG. 9), so that unknown words can also be found.

Further, since the misrecognition word dictionary used conventionally is no longer needed, continuous updating of the dictionary is also unnecessary, and the maintenance cost of the OCR device is reduced.

Further, using the misrecognition character table improves the search accuracy of the character string search device even when the character recognition accuracy of the OCR device is not high.

Further, even when the text in the OCR target image contains variant characters (for example, "高" and "▲高▼"), using the misrecognition character table allows the text data obtained as the OCR result to be handled in the usual way.

Furthermore, using an OCR adjustment amount table created from the misrecognition character table in the OCR device improves the accuracy of character recognition by the OCR device.

<6. Modifications>
The present invention is not limited to the above embodiment, and various modifications can be made without departing from the scope of the invention. Modifications of the character string search device described above, which uses the misrecognition character table according to the above embodiment, are described below.

<6.1 First modification>
As described above, the character string search device shown in FIG. 5 performs either the character string search process shown in FIG. 8 or the one shown in FIG. 9, but a character string search process combining the two may be performed instead. That is, the combined process may combine a search in which the characters of the input character string that are registered in the misrecognition character table Etbl are replaced with wildcards, and a search in which those characters are replaced with the other characters associated with them by the misrecognition character table Etbl.

FIG. 13 is a flowchart showing an example of the character string search process in this modification. Steps S201 to S224 of the process in FIG. 13 are identical to steps S201 to S224 of the character string search process in FIG. 9, so their description is omitted. In the process of FIG. 13, when the search in which characters of the input character string registered in the misrecognition character table Etbl are replaced with the other characters associated with the character of interest by the table (see FIG. 9) yields no search result, a search is performed with the registered characters of the input character string replaced with wildcards.

That is, step S230 determines whether a character string matching at least one search term in the search term group created in steps S202 to S222 was found in the target text data Dtx. If no character string matching any search term in the group was found in the target text data Dtx, the process proceeds to step S232; if a character string matching at least one search term was found, the process proceeds to step S236.

In step S232, a wildcard search term is created by replacing with wildcards all characters of the input character string received in step S201 that are registered in the misrecognition character table Etbl.

Next, the search target text data Dtx stored in the NAS 48 is searched for character strings matching this wildcard search term (step S234), after which the process proceeds to step S236.

In step S236, data indicating the search result is sent to the search terminal device 30 via the Internet 5 so that the search result is displayed on the search terminal device 30. As a result, the search terminal device 30 displays, for example, the sentences or paragraphs of the search target text data Dtx that contain a character string matching any of the above search terms, with that character string highlighted.

In the above character string search process, when a character string matching at least one search term in the search term group is found in the target text data Dtx, the search using the wildcard search term is not performed (step S230). When no character string matching any search term in the group is found in the target text data Dtx, the search using the wildcard search term is performed. This character string search process therefore reliably suppresses search omissions while limiting the output of inappropriate or superfluous search results.
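The two-stage search of FIG. 13 can be sketched as follows. This is an illustrative sketch under several assumptions: the table contents in `ETBL` are hypothetical, a wildcard is modeled as a single-character regular-expression `.`, and the substitution stage expands every combination of replacements, which is only one possible reading of how the search term group is built.

```python
import re
from itertools import product

# Hypothetical misrecognition table: character -> characters OCR returned instead.
ETBL = {"l": ["1", "I"], "O": ["0"]}


def substitution_terms(query, etbl=ETBL):
    """Search term group: the input string plus variants in which registered
    characters are replaced by their associated characters."""
    choices = [[ch] + etbl.get(ch, []) for ch in query]
    return ["".join(combo) for combo in product(*choices)]


def wildcard_pattern(query, etbl=ETBL):
    """Wildcard search term (step S232): every registered character becomes
    a one-character wildcard; all other characters match literally."""
    parts = ["." if ch in etbl else re.escape(ch) for ch in query]
    return re.compile("".join(parts))


def search(query, text, etbl=ETBL):
    """Return the matched strings found in `text` for `query`."""
    # Substitution-based search with the search term group.
    hits = [term for term in substitution_terms(query, etbl) if term in text]
    if hits:  # step S230: at least one search term matched, so stop here
        return hits
    # Steps S232 to S234: fall back to the wildcard search term.
    return wildcard_pattern(query, etbl).findall(text)


print(search("value", "the va1ue column"))  # found by substitution
print(search("value", "the va7ue column"))  # found only by the wildcard fallback
```

Running the wildcard stage only when the substitution stage finds nothing is what keeps superfluous matches out of the result in the common case.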

<6.2 Second modification>
In the misrecognition character table Etbl according to the above embodiment, each registered character is associated with the print form (output condition) specified by the printer, font, and paper type used to print the image containing that character. The character string search processes shown in FIGS. 8, 9, and 13, however, include no processing related to the print form of the OCR target image. To limit the output of inappropriate or superfluous search results when searching the text data Dtx obtained as an OCR result, processing related to the print form of the OCR target image may be included in the character string search process executed by the character string search device 40 shown in FIG. 5.

FIG. 14 is a flowchart showing the character string search process in the second modification of the character string search device shown in FIG. 5, that is, the character string search process of FIG. 8 extended with processing related to the print form of the OCR target image.

This character string search process differs from that of FIG. 8 in that step S205 is inserted immediately before step S206. The other steps are the same as those of the character string search process of FIG. 8, and corresponding steps bear the same step numbers.

In this character string search process, when step S204 determines that the character of interest is registered in the misrecognition character table Etbl, it is determined, before step S206 is executed, whether the print form (output condition) associated with the character of interest by the misrecognition character table Etbl, as specified by the printer, font, and paper type, matches the print form of the original image of the search target text data (target text data) Dtx (step S205). Here, the original image of the target text data Dtx is the OCR target image from which the OCR device 80 generated the target text data Dtx.

If step S205 determines that the print form associated with the character of interest by the misrecognition character table Etbl matches the print form of the OCR target image (original image), the process proceeds to step S206. If it does not match, the process proceeds to step S208 without executing step S206. Consequently, even if the character of interest is registered in the misrecognition character table Etbl, it is not replaced with a wildcard in the input character string when its associated print form does not match that of the OCR target image. The remainder of this character string search process is the same as in FIG. 8, so its description is omitted.
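The print form check of step S205 can be sketched as a gate in front of the wildcard replacement. The table contents, the printer name "P2", and the font name "Mincho" are hypothetical values chosen for illustration, and modeling the table as a mapping from a character to a set of print forms is an assumption of this sketch.

```python
import re

# Hypothetical table: character -> print forms (printer, font, paper)
# under which that character was misrecognized.
ETBL_FORMS = {
    "l": {("P1", "Gothic", "S1")},
    "O": {("P2", "Mincho", "S1")},
}


def wildcard_pattern(query, source_form, etbl=ETBL_FORMS):
    """Build the search pattern for text whose original image was printed
    in `source_form`."""
    parts = []
    for ch in query:
        # Step S204 (registered?) and step S205 (print form matches?).
        if source_form in etbl.get(ch, ()):
            parts.append(".")            # step S206: replace with a wildcard
        else:
            parts.append(re.escape(ch))  # keep the character as-is
    return re.compile("".join(parts))


print(wildcard_pattern("Old", ("P1", "Gothic", "S1")).pattern)  # only "l" is replaced
print(wildcard_pattern("Old", ("P2", "Mincho", "S1")).pattern)  # only "O" is replaced
```

A character registered only for a different print form stays literal, which narrows the pattern and so reduces superfluous matches.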

<6.3 Third modification>
FIG. 15 is a flowchart showing the character string search process in the third modification of the character string search device shown in FIG. 5, that is, the character string search process of FIG. 9 extended with processing related to the print form of the OCR target image.

This character string search process differs from that of FIG. 9 in that step S205 is inserted immediately before step S220. The other steps are the same as those of the character string search process of FIG. 9, and corresponding steps bear the same step numbers.

In this character string search process as well, when step S204 determines that the character of interest is registered in the misrecognition character table Etbl, it is determined, before step S220 is executed, whether the print form (output condition) associated with the character of interest by the misrecognition character table Etbl, as specified by the printer, font, and paper type, matches that of the original image of the target text data Dtx, that is, of the OCR target image (step S205).

If step S205 determines that the print form associated with the character of interest by the misrecognition character table Etbl matches the print form of the OCR target image, the process proceeds to step S220. If it does not match, the process proceeds to step S222 without executing step S220. Consequently, even if the character of interest is registered in the misrecognition character table Etbl, no new search term is created by replacing it in the input character string with another character associated with it by the table when its associated print form does not match that of the OCR target image. The remainder of this character string search process is the same as in FIG. 9, so its description is omitted.
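For the substitution-based search, the same check restricts which replacement characters are used. In this sketch the table is modeled as rows of (target character, misrecognized character, print form); the row values, including "P2" and "Mincho", are hypothetical, and expanding all combinations of replacements is an assumption about how the search term group is formed.

```python
from itertools import product

# Hypothetical rows: (target character, character OCR returned, print form).
ROWS = [
    ("l", "1", ("P1", "Gothic", "S1")),
    ("l", "I", ("P2", "Mincho", "S1")),
]


def search_terms(query, source_form, rows=ROWS):
    """Return the input string plus variants that use only those
    replacements registered for the print form of the original image."""
    choices = []
    for ch in query:
        # Step S205: keep a replacement only if its print form matches.
        alternatives = [wrong for target, wrong, form in rows
                        if target == ch and form == source_form]
        choices.append([ch] + alternatives)  # step S220 when alternatives exist
    return ["".join(combo) for combo in product(*choices)]


print(search_terms("value", ("P1", "Gothic", "S1")))
print(search_terms("value", ("P3", "Gothic", "S9")))  # no match: input string only
```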

10 ... Computer
18 ... Misrecognition character table creation program
20 ... Scanner
30 ... Search terminal device
40 ... Character string search device
45 ... Search processing device
80 ... OCR device
85 ... OCR processing device
86 ... Scanner
Etbl ... Misrecognition character table
Dtx ... Text data (search target, OCR result)
Spg ... Character string search program

Claims

A method of creating a misrecognition character table, comprising:
a printing step of generating a recording medium on which OCR target characters, which are characters that can be recognized by an OCR device, are printed as character images by printing each of the OCR target characters in a plurality of different print forms;
an OCR step in which the character images printed in the printing step are read by the OCR device, characters are recognized, and character codes are generated as the recognition result;
a misrecognition determination step of determining, by collating the codes of the OCR target characters printed in the printing step with the character codes generated as the recognition result in the OCR step, whether each of the printed OCR target characters was misrecognized in the OCR step; and
a data generation step of, when the misrecognition determination step determines that the image of any of the OCR target characters was misrecognized in the OCR step, generating table-format misrecognition character association data that associates the character obtained as the recognition result in the OCR step with that character as a misrecognized character, and generating table-format print form association data that associates with that character the print form in which its image was printed.

The method of creating a misrecognition character table according to claim 1, wherein the print form association data includes information identifying at least one of the printing device, the font, and the recording medium used in printing the image of a character misrecognized in the OCR step.

The print form association data includes information for specifying the type of paper as a recording medium used in printing an image of characters erroneously recognized in the OCR step.
The method for creating a misrecognition character table according to claim 2, wherein the information for specifying the type of the paper includes information that can identify the easiness of bleeding of ink used in printing an image of the characters.

A misrecognition character table comprising:
misrecognition character association data that associates, with each misrecognition high possibility character, which is a character whose likelihood of being misrecognized by an OCR device that recognizes characters by reading a target image containing text from a recording medium on which the target image is printed is considered to exceed a predetermined allowable range, a misrecognized character, which is the character obtained as the recognition result when the image of that character is misrecognized by the OCR device; and
print form association data that associates, with each misrecognition high possibility character, the print form of the image of that character when the image was misrecognized by the OCR device.

The misrecognition character table according to claim 4, wherein the print form association data includes information identifying at least one of the printing device, the font, and the recording medium used in printing the image of the character misrecognized by the OCR device.

The print form association data includes information for specifying the type of paper as a recording medium used in printing the image of the character misrecognized by the OCR device.
The misrecognition character table according to claim 5, wherein the information for specifying the type of the paper includes information that can identify the easiness of bleeding of the ink used in printing the target image.

A character string search device that searches text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
the misrecognition character table according to any one of claims 4 to 6;
a search term creation unit that, when any character in an input character string given from outside matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table, creates a search term by replacing the matching character in the input character string with a wildcard, and, when no character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table, uses the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching the search term obtained by the search term creation unit.

The character string search device according to claim 7, wherein the search term creation unit:
creates a search term by replacing the matching character in the input character string with a wildcard when any character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table and the print form associated with the matching character matches the print form of the target image; and
uses the input character string as the search term when any character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table but the print form associated with the matching character does not match the print form of the target image.

A character string search device that searches text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
the misrecognition character table according to any one of claims 4 to 6;
a search term creation unit that, when any character in an input character string given from outside matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table, uses the input character string as a search term and also creates a search term by replacing the matching character in the input character string with another character associated with the matching character by the misrecognition character table, and, when no character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table, uses the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching any of the search terms obtained by the search term creation unit.

The character string search device according to claim 9, wherein the search term creation unit:
when any character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table and the print form associated with the matching character matches the print form of the target image, uses the input character string as a search term and also creates a search term by replacing the matching character in the input character string with another character associated with the matching character by the misrecognition character table; and
when any character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table but the print form associated with the matching character does not match the print form of the target image, uses only the input character string as the search term.

The character string search device according to claim 9 or 10, wherein, when the search unit cannot find in the text data obtained as the OCR result a character string matching any of the search terms obtained by the search term creation unit, and any character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table, the search unit searches the text data obtained as the OCR result for a character string matching a search term obtained by replacing the matching character in the input character string with a wildcard.

A character string search method for searching text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the method comprising:
a search term creation step of, when any character in an input character string given from outside matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table according to any one of claims 4 to 6, creating a search term by replacing the matching character in the input character string with a wildcard, and, when no character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table, using the input character string as the search term; and
a search step of searching the text data obtained as the OCR result for a character string matching the search term obtained in the search term creation step.

A character string search method for searching text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the method comprising:
a search term creation step of, when any character in an input character string given from outside matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table according to any one of claims 4 to 6, using the input character string as a search term and also creating a search term by replacing the matching character in the input character string with another character associated with the matching character by the misrecognition character table, and, when no character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table, using the input character string as the search term; and
a search step of searching the text data obtained as the OCR result for a character string matching any of the search terms obtained in the search term creation step.

A character string search program for searching text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the program causing a CPU of a computer to execute, using a memory:
a search term creation step of, when any character in an input character string given from outside matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table according to any one of claims 4 to 6, creating a search term by replacing the matching character in the input character string with a wildcard, and, when no character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table, using the input character string as the search term; and
a search step of searching the text data obtained as the OCR result for a character string matching the search term created in the search term creation step.

A character string search program for searching text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the program causing a CPU of a computer to execute, using a memory:
a search term creation step of, when any character in an input character string given from outside matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table according to any one of claims 4 to 6, using the input character string as a search term and also creating a search term by replacing the matching character in the input character string with another character associated with the matching character by the misrecognition character table, and, when no character in the input character string matches any of the misrecognition high possibility characters and misrecognized characters registered in the misrecognition character table, using the input character string as the search term; and
a search step of searching the text data obtained as the OCR result for a character string matching any of the search terms created in the search term creation step.