JP6484974B2

JP6484974B2 - Information processing apparatus, information processing system, and program

Info

Publication number: JP6484974B2
Application number: JP2014193541A
Authority: JP
Inventors: 祐大竹
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2014-09-24
Filing date: 2014-09-24
Publication date: 2019-03-20
Anticipated expiration: 2034-09-24
Also published as: JP2016066157A

Description

The present invention relates to an information processing apparatus, an information processing system, and a program.

OCR (Optical Character Recognition) technology, which recognizes characters in a document image read by a scanner or the like and converts them into document data, is conventionally known. Techniques for performing OCR processing on a document containing a plurality of languages have also been proposed. For example, Patent Document 1 describes a technique in which a document containing a plurality of languages is processed by OCR in a different language for each region of the document.

A technique for searching a plurality of OCR results for those that match a search condition entered by a user is also known (for example, Patent Document 2).

Patent Document 1: JP 2006-260115 A
Patent Document 2: JP 2004-213091 A

Documents containing a plurality of languages include, for example, documents in which one part is written in Japanese, another part in Korean, and yet another part in English, and documents in which Japanese, Korean, and English are mixed throughout. Because Japanese and Korean have similar character shapes, the accuracy of character recognition in OCR processing drops, particularly for the latter kind of document. When the accuracy of character recognition drops, the accuracy of searches over the OCR results drops as well.

An object of the present invention is to provide an information processing apparatus, an information processing system, and a program that can raise the accuracy of character recognition for a document containing a plurality of languages and improve the accuracy of searches over the character recognition results.

An information processing apparatus according to claim 1 of the present invention includes: data acquisition means for acquiring document image data obtained by reading, as an image, a document containing a plurality of mutually different character sets; document data generation means for, for each of the plurality of character sets, performing character recognition on the entire document image data using that character set and converting it into document data, thereby generating, for one piece of document image data, a number of pieces of document data corresponding to the number of character sets; document data storage means for storing the plurality of pieces of document data generated by the document data generation means; search character set determination means for determining the character set of input characters entered as a search condition; and search execution means for executing a search for the input characters entered as the search condition on the document data that, among the plurality of pieces of document data stored in the document data storage means, corresponds to the character set of the input characters determined by the search character set determination means.

In an information processing apparatus according to claim 2 of the present invention, in the configuration of claim 1, when the document contains a first character set and a second character set that differ from each other, the document data generation means performs character recognition on the entire document image data based on the first character set and converts it into document data to generate first document data, and performs character recognition on the entire document image data based on the second character set and converts it into document data to generate second document data.

In an information processing apparatus according to claim 3 of the present invention, in the configuration of claim 1, when the document contains a first character set and a second character set that differ from each other, the document data generation means performs character recognition on the entire document image data based on the first character set and a third character set, whose character shapes are dissimilar to each other, and converts it into document data to generate first document data, and performs character recognition on the entire document image data based on the second character set and a fourth character set, whose character shapes are dissimilar to each other, and converts it into document data to generate second document data.

An information processing apparatus according to claim 4 of the present invention, in the configuration of claim 1, further includes document data acquisition means for acquiring, from the document data storage means, the document data corresponding to the character set of the input characters entered as the search condition. When the character set of the input characters entered as the search condition is a fifth character set, the document data acquisition means acquires from the document data storage means at least one piece of document data that includes the fifth character set, and the search execution means executes the search for the input characters on each piece of acquired document data.

In an information processing apparatus according to claim 5 of the present invention, in the configuration of claim 1, the plurality of mutually different character sets are a plurality of mutually different languages.

An information processing system according to claim 6 of the present invention is an information processing system including a character conversion device and a search device. The character conversion device includes: data acquisition means for acquiring document image data obtained by reading, as an image, a document containing a plurality of mutually different character sets; document data generation means for, for each of the plurality of character sets, performing character recognition on the entire document image data using that character set and converting it into document data, thereby generating, for one piece of document image data, a number of pieces of document data corresponding to the number of character sets; document data storage means for storing the plurality of pieces of document data generated by the document data generation means; and transmission means for transmitting the document data to the search device. The search device includes: reception means for receiving the document data from the character conversion device; search character set determination means for determining the character set of input characters entered as a search condition; search execution means for executing a search for the input characters entered as the search condition on the document data that, among the plurality of pieces of document data stored in the document data storage means, corresponds to the character set of the input characters determined by the search character set determination means; and transmission means for transmitting the search result obtained by the search execution means to the user terminal of the user who entered the search condition.

A program according to claim 7 of the present invention causes a computer to function as: data acquisition means for acquiring document image data obtained by reading, as an image, a document containing a plurality of mutually different character sets; document data generation means for, for each of the plurality of character sets, performing character recognition on the entire document image data using that character set and converting it into document data, thereby generating, for one piece of document image data, a number of pieces of document data corresponding to the number of character sets; document data storage means for storing the plurality of pieces of document data generated by the document data generation means; search character set determination means for determining the character set of input characters entered as a search condition; and search execution means for executing a search for the input characters entered as the search condition on the document data that, among the plurality of pieces of document data stored in the document data storage means, corresponds to the character set of the input characters determined by the search character set determination means. This program may be stored in a computer-readable information storage medium such as a CD-ROM or a DVD-ROM.
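The search side of claims 1 and 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `STORED` records, and the use of Unicode character names to guess the character set of the query are all assumptions made for illustration. The description refers to a character code correspondence table for this purpose, but its contents are not given in this section.

```python
import unicodedata
from typing import List, Optional

# Hypothetical stored OCR results: each record notes which languages were used
# to produce it. The IDs and texts are illustrative only.
STORED = [
    {"ocr_id": "t1", "languages": ("Japanese", "English"), "text": "こんにちは OCR"},
    {"ocr_id": "t2", "languages": ("Korean", "English"), "text": "안녕하세요 OCR"},
]


def detect_language(query: str) -> Optional[str]:
    """Guess the character set of the query from Unicode character names.

    Assumption: the first character with a recognizable script decides.
    Kanji-only queries are ambiguous (Japanese or Chinese) and are not
    handled by this sketch, so they return None.
    """
    for ch in query:
        name = unicodedata.name(ch, "")
        if name.startswith("HANGUL"):
            return "Korean"
        if name.startswith(("HIRAGANA", "KATAKANA")):
            return "Japanese"
        if name.startswith("LATIN"):
            return "English"
    return None


def search(query: str, stored=STORED) -> List[str]:
    """Search only the document data whose languages include the query's language."""
    language = detect_language(query)
    if language is None:
        return []
    return [
        doc["ocr_id"]
        for doc in stored
        if language in doc["languages"] and query in doc["text"]
    ]
```

In this sketch a Korean query is matched only against the Korean/English result, while an English query is matched against every result that includes English, which mirrors the case in claim 4 where more than one piece of document data contains the character set of the query.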

According to the configurations of claims 1, 5, 6, and 7 of the present invention, the accuracy of character recognition for a document containing a plurality of languages is raised, and the accuracy of searches over the character recognition results is improved.

According to the configuration of claim 2 of the present invention, the document image data is converted into separate pieces of document data, one for each language contained in the document.

According to the configuration of claim 3 of the present invention, for each language contained in the document, the document image data is converted into one piece of document data based on two languages whose character shapes are dissimilar to each other.

According to the configuration of claim 4 of the present invention, when a plurality of pieces of document data are acquired, the search for the input characters is executed on each piece of document data.

FIG. 1 is an overall configuration diagram of an information processing system according to the present embodiment.
FIG. 2 is a block diagram showing the hardware configuration of the OCR server and the search server.
FIG. 3 is a functional block diagram of the OCR server.
FIG. 4 is a diagram showing an example of a table of dissimilar language groups.
FIG. 5 is an example of a selection screen on which a user selects languages.
FIG. 6 is an operation flowchart of the OCR execution language determination unit.
FIG. 7 is a table showing information on the registered dissimilar language groups.
FIG. 8 is an operation flowchart of the OCR execution unit.
FIG. 9 is a table showing information on OCR results.
FIG. 10 is a diagram schematically showing the relationship between an OCR execution operation and the OCR results.
FIG. 11 is a functional block diagram of the search server.
FIG. 12 is an operation flowchart of the search server.
FIG. 13 is a diagram showing an example of a character code correspondence table.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is an overall configuration diagram of an information processing system according to the present embodiment. As shown in FIG. 1, the information processing system 10 includes an information processing apparatus 100, a multifunction peripheral 400, and a user terminal 500. The information processing apparatus 100 includes an OCR server 200 (character conversion device) and a search server 300 (search device). The OCR server 200 and the search server 300 may be provided as separate devices, or their functions may be combined in a single device.

The information processing apparatus 100, the multifunction peripheral 400, and the user terminal 500 are connected to one another via a communication network such as a LAN (Local Area Network) or the Internet.

The multifunction peripheral 400 has a copy function, a FAX function, a scanner function, and the like. The present embodiment focuses on the scanner function, which reads a document as an image in response to a user operation. The multifunction peripheral 400 reads a document as an image by optically scanning an object (document) such as a sheet of paper, saves the resulting document image data in an internal storage unit, and transmits it to the OCR server 200. The multifunction peripheral 400 may also transmit document image data saved in the storage unit to the OCR server 200 in response to an instruction from the OCR server 200. A well-known configuration can be used for the multifunction peripheral 400.

The user terminal 500 is a terminal device used by a user and connected to the communication network. Based on user operations, the user terminal 500 performs data communication with the OCR server 200, the search server 300, and the multifunction peripheral 400 via the communication network. This allows the user, for example, to operate the user terminal 500 to instruct the OCR server 200 to execute OCR processing, to instruct the search server 300 to execute a search over the OCR results, and to obtain search results from the search server 300. The user can also operate the user terminal 500 to make or change various settings related to OCR processing. Software such as a browser or an e-mail client may be installed on the user terminal 500. The user terminal 500 is, for example, a personal computer or a portable information terminal such as a PDA (Personal Digital Assistant) or a smartphone. The number of user terminals 500 included in the information processing system 10 is not limited.

[Configuration of OCR server]
The OCR server 200 executes OCR processing when it receives document image data from the multifunction peripheral 400. FIG. 2 is a block diagram showing the hardware configuration of the OCR server 200. The OCR server 200 is a computer that includes a CPU 21, a memory 22, a storage unit 23, and a communication unit 24. These hardware elements are connected by a bus so that they can exchange data with one another. The communication unit 24 performs data communication with the search server 300, the multifunction peripheral 400, and the user terminal 500 via the communication network. The CPU 21 controls each part of the OCR server 200 and executes various kinds of information processing. The memory 22 holds various programs and data, and a work area for the CPU 21 is also reserved in the memory 22. The storage unit 23 stores various data. The storage unit 23 may be provided outside the OCR server 200.

FIG. 3 is a functional block diagram of the OCR server 200. As shown in FIG. 3, the OCR server 200 functionally includes a data reception unit 201, an OCR execution language determination unit 202, an OCR execution unit 203, and an OCR result transmission unit 204. These elements are realized by the CPU 21 executing a program stored in the memory 22 (see FIG. 2). The program may be installed on the OCR server 200 from a computer-readable information storage medium such as a CD-ROM, a DVD-ROM, or a memory card, or may be downloaded via a communication network such as the Internet. The storage unit 23 of the OCR server 200 includes a data storage unit 205, a language information storage unit 206, and an OCR result storage unit 207.

The data reception unit 201 (data acquisition means) receives the document image data transmitted from the multifunction peripheral 400 and stores it in the data storage unit 205. The data storage unit 205 temporarily holds document image data until OCR processing is executed. The data storage unit 205 is implemented, for example, as a queue and performs queuing.

The OCR execution language determination unit 202 determines the language used to execute OCR processing (hereinafter, the OCR execution language). For example, when OCR processing is executed in Japanese on Japanese document image data, the OCR execution language is Japanese.

It is known that the accuracy of character recognition in OCR processing drops when a plurality of languages (character sets) are mixed in a document. When the mixed languages do not have similar character shapes, the character shapes of each language are recognized as that language, so the probability of correct recognition is high. When the character shapes of the languages are similar, however, the probability of misrecognition rises and the character recognition accuracy may drop further. Note that characters may include symbols and numerals. The OCR server 200 according to the present embodiment therefore executes OCR processing based on language groups that each combine a plurality of languages whose character shapes are not similar to one another (hereinafter, dissimilar language groups). For example, Japanese, Korean, and Chinese have character shapes that are similar to one another, but none of them is similar to English. For this reason, "Japanese/English", "Korean/English", and "Chinese/English" are each set as a dissimilar language group. The dissimilar language groups are set in advance by, for example, an administrator of the OCR server 200 or a user of the user terminal 500.
Information on the dissimilar language groups that have been set is stored in the language information storage unit 206. FIG. 4 is a diagram showing an example of a table of dissimilar language groups. In the table, each dissimilar language group is given identification information (ID). As for languages other than those above, Thai, for example, does not resemble English, Japanese, Korean, or Chinese in character shape, so it can be set in a dissimilar language group with any of them. Hindi and Burmese, on the other hand, may be set as languages whose character shapes are similar.
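The dissimilar language group table can be pictured as a small lookup structure. The following Python sketch is an illustration under stated assumptions, not the format of FIG. 4: the class and function names are hypothetical, the IDs `JE` and `KE` appear later in the description, and the ID `CE` for "Chinese/English" is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DissimilarLanguageGroup:
    """One row of a dissimilar language group table (hypothetical layout)."""
    group_id: str
    languages: Tuple[str, ...]


# Groups named in the text. "CE" is an assumed ID for illustration.
GROUP_TABLE = [
    DissimilarLanguageGroup("JE", ("Japanese", "English")),
    DissimilarLanguageGroup("KE", ("Korean", "English")),
    DissimilarLanguageGroup("CE", ("Chinese", "English")),
]


def groups_containing(language: str, table=GROUP_TABLE) -> List[DissimilarLanguageGroup]:
    """Return every group whose member languages include the given language."""
    return [group for group in table if language in group.languages]
```

Because similar-looking languages never share a group in this table, looking up "Japanese" yields only the Japanese/English group, while "English" belongs to all three.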

The OCR execution language determination unit 202 extracts, from the dissimilar language groups stored in the language information storage unit 206, the dissimilar language groups corresponding to the languages selected by the user, and determines them as the OCR execution language group. FIG. 5 is an example of a selection screen on which the user selects languages. The selection screen may be displayed on the display unit of a user terminal 500 connected to the OCR server 200, or on an operation screen of the OCR server 200. On the selection screen, the user selects (checks) the desired languages. The operation flow of the OCR execution language determination unit 202 is described below using a specific example.

FIG. 6 is an operation flowchart of the OCR execution language determination unit 202. The example here assumes that the user has selected "Japanese", "Korean", and "English" on the selection screen of FIG. 5.

First, the OCR execution language determination unit 202 receives an OCR execution language setting instruction from the user terminal 500 (S101). For example, when the user selects "Japanese", "Korean", and "English" on the display screen of the user terminal 500 (see FIG. 5), data representing the selection (the setting instruction) is transmitted to the OCR server 200 and received by the OCR execution language determination unit 202. Next, the OCR execution language determination unit 202 takes one language from the selected languages (S102). Here, for example, it takes "Japanese".

Next, the OCR execution language determination unit 202 acquires, from the dissimilar language groups stored in the language information storage unit 206 (see FIG. 4), the dissimilar language group that contains the language taken (S103). Here, it acquires "Japanese/English" (ID: JE), the dissimilar language group that contains "Japanese". The OCR execution language determination unit 202 then registers the acquired dissimilar language group in the OCR execution language group (S104).

Next, the OCR execution language determination unit 202 determines whether the user selected any other language (S105), and if so, returns to S102 and repeats the above processing. Here, through these steps, the OCR execution language determination unit 202 takes "Korean" and acquires "Korean/English" (ID: KE), the dissimilar language group that contains "Korean". The languages selected by the user also include "English" (see FIG. 5), but because "English" is already contained in the dissimilar language groups above (ID: JE, KE), no new dissimilar language group is acquired for it. Information on the registered dissimilar language groups is stored in the language information storage unit 206. FIG. 7 is a table showing information on the registered dissimilar language groups.

When the user selected no other language (NO in S105), the OCR execution language determination unit 202 determines the registered dissimilar language groups (ID: JE, KE) as the OCR execution language group (S106).

In this way, the OCR execution language determination unit 202 determines the OCR execution language (the OCR execution language group).
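Steps S101 to S106 can be sketched in Python as follows. This is a minimal illustration, not the flow of FIG. 6 itself: the function name and the table contents are hypothetical (the ID `CE` is assumed), and the rule "skip a language that a registered group already covers" is one reading of the remark that no new group is acquired for "English".

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical dissimilar language group table keyed by group ID.
GROUP_TABLE: Dict[str, Tuple[str, ...]] = {
    "JE": ("Japanese", "English"),
    "KE": ("Korean", "English"),
    "CE": ("Chinese", "English"),  # assumed ID
}


def decide_ocr_execution_groups(
    selected_languages: Sequence[str],
    table: Dict[str, Tuple[str, ...]] = GROUP_TABLE,
) -> List[str]:
    """Return the IDs of the groups to run OCR with, in registration order."""
    registered: List[str] = []
    for language in selected_languages:  # S102, repeated via S105
        # Assumed skip rule: the language is already covered by a registered group.
        if any(language in table[group_id] for group_id in registered):
            continue
        for group_id, languages in table.items():  # S103: find groups with the language
            if language in languages and group_id not in registered:
                registered.append(group_id)  # S104: register the group
    return registered  # S106: the OCR execution language group


print(decide_ocr_execution_groups(["Japanese", "Korean", "English"]))  # ['JE', 'KE']
```

With the selection used in the text the sketch registers JE and KE. The result depends on the order in which languages are processed: if "English" were taken first, this sketch would register every group that contains English, a case the description does not address.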

Returning to FIG. 3, the OCR execution unit 203 (document data generation means) executes OCR processing using the OCR execution language group determined by the OCR execution language determination unit 202 and generates OCR results (document data). The OCR result transmission unit 204 transmits the OCR results to the search server 300. The operation flow of the OCR execution unit 203 is described below using a specific example.

FIG. 8 is an operation flowchart of the OCR execution unit 203. The example here assumes that "Japanese/English" and "Korean/English", shown in FIG. 7, have been determined as the OCR execution language group.

First, the data reception unit 201 receives document image data from the multifunction peripheral 400 (S201). The received document image data is referred to as "D1". Next, the OCR execution unit 203 takes one dissimilar language group from the OCR execution language group (S202). Here, for example, it takes the "Japanese/English" dissimilar language group.

Next, the OCR execution unit 203 executes OCR processing on the document image data D1 using the dissimilar language group taken (S203). Specifically, when the document image data D1 contains Japanese and English characters, the Japanese portions are converted into Japanese text and the English portions into English text. Because Japanese and English do not have similar character shapes, misconversions in which Japanese is converted into English or English into Japanese can be prevented. When the document image data D1 contains Japanese and Korean characters, the Japanese portions are converted into Japanese text, and the Korean portions are either not converted (resulting in an error) or converted into Japanese characters of similar shape. Since the Japanese portions are never converted into Korean, at least the Japanese portions are reliably converted into Japanese.

Next, the OCR execution unit 203 associates the OCR result (first OCR result; first document data) with information on the dissimilar language group used and saves them in the OCR result storage unit 207 (S204). FIG. 9 is a table showing the OCR result information saved in the OCR result storage unit 207. Here, this information is saved under OCR processing ID: t1.

Next, the OCR execution unit 203 determines whether the OCR execution language group contains another dissimilar language group (S205), and if so, returns to S202 and repeats the above processing.

Here, the OCR execution unit 203 takes the "Korean/English" dissimilar language group and executes OCR processing on the document image data D1 again, this time using "Korean/English" (S203). Specifically, when the document image data D1 contains Korean and English characters, the Korean portions are converted into Korean text and the English portions into English text. Because Korean and English do not have similar character shapes, misconversions in which Korean is converted into English or English into Korean can be prevented. When the document image data D1 contains Korean and Japanese characters, the Korean portions are converted into Korean text, and the Japanese portions are either not converted (resulting in an error) or converted into Korean characters of similar shape. That is, since the Korean portions are never converted into Japanese, at least the Korean portions are reliably converted into Korean.

Next, the OCR execution unit 203 stores the OCR result (second OCR result; second document data) in the OCR result storage unit 207 in association with information on the dissimilar language group that was used (S204). Here, this information is stored under OCR process ID: t2 (see FIG. 9). In FIG. 9, the same JOB_ID: JOB001 is registered for both OCR process IDs t1 and t2. This JOB_ID indicates a single operation of the OCR execution unit 203 (which, in the above example, includes two OCR passes). FIG. 10 schematically shows the relationship between an OCR execution operation and its OCR results. The JOB_ID also corresponds to the document image data. In the above example, two dissimilar language groups are registered, so OCR processing is performed twice (IDs: t1, t2) for one OCR execution. If three dissimilar language groups are registered, OCR processing is performed three times for one OCR execution.

The OCR execution unit 203 executes OCR processing as described above. That is, when a document contains a mixture of Japanese and Korean, for example, the OCR execution unit 203 first executes OCR processing on the entire document image data in Japanese and English, and then executes OCR processing on the same entire document image data in Korean and English. In the result of the first OCR pass, the Japanese portions are reliably converted to Japanese. In the result of the second OCR pass, the Korean portions are reliably converted to Korean. Even if the document contains characters that look similar in Japanese and Korean, the correct converted characters appear at least in the result of the OCR pass performed in the character's own language. Consequently, no character in the document is left incorrectly converted in every OCR result. In other words, every character in the document is accurately converted in at least one of the results. The accuracy of character recognition for a document containing multiple languages can therefore be improved.
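The loop over dissimilar language groups (S202 to S205) can be sketched as follows. This is a minimal illustration, not the implementation described in the specification: `OcrRecord`, `run_ocr_job`, and `fake_engine` are hypothetical names, and the OCR engine itself is replaced by a stand-in function so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class OcrRecord:
    """One row of a FIG. 9-style table (field names are assumptions)."""
    ocr_id: str                 # OCR process ID, e.g. "t1"
    job_id: str                 # one OCR execution, e.g. "JOB001"
    languages: Tuple[str, ...]  # dissimilar language group used for this pass
    text: str                   # recognized document data


def run_ocr_job(
    image: str,
    job_id: str,
    language_groups: Sequence[Sequence[str]],
    ocr_engine: Callable[[str, Sequence[str]], str],
    store: List[OcrRecord],
) -> List[OcrRecord]:
    """Run one OCR pass over the whole image per dissimilar language group."""
    for group in language_groups:
        text = ocr_engine(image, group)       # S203: OCR with this group
        ocr_id = f"t{len(store) + 1}"         # next free process ID
        store.append(OcrRecord(ocr_id, job_id, tuple(group), text))  # S204
    return store                              # S205: loop ends when no group remains


def fake_engine(image: str, languages: Sequence[str]) -> str:
    """Stand-in for a real OCR engine (hypothetical)."""
    return f"<{image} recognized with {'+'.join(languages)}>"


results = run_ocr_job(
    "D1",
    "JOB001",
    [("Japanese", "English"), ("Korean", "English")],
    fake_engine,
    [],
)
for record in results:
    print(record.ocr_id, record.job_id, record.languages)
```

With two registered groups the job yields two records (t1 and t2) sharing one JOB_ID; a third group would simply add a third pass.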

[Search server configuration]
The search server 300 performs a character string search on the OCR results generated by the OCR server 200, based on a search condition input by the user. FIG. 2 shows the hardware configuration of the search server 300. The search server 300 is a computer that includes a CPU 31, a memory 32, a storage unit 33, and a communication unit 34. These hardware elements are connected by a bus so that they can exchange data with one another. The communication unit 34 performs data communication with the OCR server 200 and the user terminal 500 via a communication network. The CPU 31 controls each unit of the search server 300 and executes various types of information processing. The memory 32 holds various programs and data. A work area for the CPU 31 is also secured in the memory 32. The storage unit 33 stores various data. The storage unit 33 may be provided outside the search server 300.

FIG. 11 is a functional block diagram of the search server 300. As shown in FIG. 11, the search server 300 functionally includes an OCR result receiving unit 301, an OCR information management unit 302, a search instruction receiving unit 303, a search language determination unit 304, a search processing unit 305, and a search result transmission unit 306. These elements are realized by the CPU 31 executing a program stored in the memory 32 (see FIG. 2). This program may be installed on the search server 300 from a computer-readable information storage medium such as a CD-ROM, DVD-ROM, or memory card, or may be downloaded via a communication network such as the Internet. The storage unit 33 of the search server 300 includes an OCR result storage unit 307, an OCR information storage unit 308, and a character information storage unit 309.

The OCR result receiving unit 301 receives OCR results from the OCR server 200 and stores them in the OCR result storage unit 307. The OCR result storage unit 307 temporarily holds OCR results until the search processing is executed. The OCR result storage unit 307 is implemented as a queue, for example, and performs queuing. Here, for example, the OCR results with IDs t1 to t4 (see FIG. 9) are received.

The OCR information management unit 302 stores the OCR results received by the OCR result receiving unit 301 in the OCR information storage unit 308. The information stored in the OCR information storage unit 308 is the same as the OCR result information stored in the OCR result storage unit 207 (see FIG. 9).

The search processing of the search server 300, and the elements that execute it, are described below with reference to the operation flow of FIG. 12. FIG. 12 is an operation flowchart of the search server 300.

The search instruction receiving unit 303 receives a search instruction from the user terminal 500 (S301). For example, the user inputs a desired search condition (search keyword) on the user terminal 500. When the search keyword is input, a search instruction containing the search keyword is transmitted from the user terminal 500 to the search server 300.

The search language determination unit 304 determines the language to be searched based on the information (search keyword) contained in the received search instruction (S302). Specifically, the search language determination unit 304 acquires the character codes of the characters in the search keyword, refers to a character code correspondence table, and determines the language corresponding to the acquired character codes. The character code correspondence table is registered in advance in the character information storage unit 309. FIG. 13 shows an example of the character code correspondence table. For example, when the start character code for the search keyword is “3040”, the end character code is “309F”, and the character type is “Hiragana”, the search language determination unit 304 determines that the search keyword is “Japanese”.
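A lookup of this kind can be sketched as a range table keyed by character code. Only the Hiragana row (3040 to 309F, Japanese) is taken from the description above; the other rows are assumptions filled in from the standard Unicode block ranges, and the names `CHAR_CODE_TABLE`, `language_of_char`, and `languages_of_keyword` are hypothetical.

```python
from typing import Optional, Set

# (start code, end code, character type, language)
CHAR_CODE_TABLE = [
    (0x3040, 0x309F, "Hiragana", "Japanese"),        # row given in the description
    (0x30A0, 0x30FF, "Katakana", "Japanese"),        # assumed: Unicode Katakana block
    (0xAC00, 0xD7AF, "Hangul", "Korean"),            # assumed: Unicode Hangul Syllables block
    (0x0041, 0x005A, "Latin uppercase", "English"),  # assumed: A-Z
    (0x0061, 0x007A, "Latin lowercase", "English"),  # assumed: a-z
]


def language_of_char(ch: str) -> Optional[str]:
    """Return the language whose code range contains ch, or None if no row matches."""
    code = ord(ch)
    for start, end, _char_type, language in CHAR_CODE_TABLE:
        if start <= code <= end:
            return language
    return None


def languages_of_keyword(keyword: str) -> Set[str]:
    """Collect the languages of all keyword characters that appear in the table."""
    found = set()
    for ch in keyword:
        language = language_of_char(ch)
        if language is not None:
            found.add(language)
    return found


print(languages_of_keyword("ひらがな"))
print(languages_of_keyword("한국어"))
```

Returning a set rather than a single value is a design choice of this sketch: it lets a mixed keyword select OCR results for more than one language, which the description does not rule out but also does not specify.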

The search processing unit 305 (search execution means) acquires, from all the OCR results stored in the OCR information storage unit 308 (see FIG. 9), one OCR result corresponding to the language determined by the search language determination unit 304 (S303), and executes a character string search on the acquired OCR result (S304). For example, if four OCR results with IDs t1 to t4 are stored, the search processing unit 305 acquires the OCR result with ID t1, whose OCR execution language group (dissimilar language group) includes “Japanese”, and executes a character string search on it. If there is another OCR result corresponding to the determined language (YES in S305), the process returns to S303 and the above processing is repeated. Here, the OCR result with ID t3, whose OCR execution language group also includes “Japanese”, is acquired and a character string search is executed on it.
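The selection and search steps (S303 to S305) can be sketched as a filter over the stored OCR results followed by a substring search. The record layout, the sample texts, and the second job ID `JOB002` are assumptions made for illustration; the description only states that t1 and t3 include “Japanese”.

```python
from typing import Dict, List

# Hypothetical contents of the OCR information storage unit (FIG. 9 style).
OCR_RESULTS: List[Dict] = [
    {"id": "t1", "job_id": "JOB001", "languages": ["Japanese", "English"],
     "text": "これはテストです OCR"},
    {"id": "t2", "job_id": "JOB001", "languages": ["Korean", "English"],
     "text": "OCR"},
    {"id": "t3", "job_id": "JOB002", "languages": ["Japanese", "English"],
     "text": "別の文書です"},
    {"id": "t4", "job_id": "JOB002", "languages": ["Korean", "English"],
     "text": "다른 문서"},
]


def search(keyword: str, language: str, results: List[Dict]) -> List[Dict]:
    """Search only the OCR results whose language group includes `language`."""
    hits = []
    for record in results:
        if language not in record["languages"]:   # S303: skip other groups
            continue
        position = record["text"].find(keyword)   # S304: character string search
        if position >= 0:
            hits.append({"id": record["id"],
                         "job_id": record["job_id"],
                         "position": position})
    return hits                                   # S305: loop covers all matches


print(search("テスト", "Japanese", OCR_RESULTS))
```

Because the filter runs before the string search, a Japanese keyword is never matched against a pass made with the Korean group, where Japanese characters may have been misrecognized.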

The search result transmission unit 306 transmits the search result to the user terminal 500 that sent the search instruction (S306). The search result may be the document image data containing the search keyword, or the text data after character conversion containing the search keyword. The display form of the portion matching the search keyword may also be changed in the document image data or text data. For example, the matching portion may be shown in reverse video, in color, in bold, or enclosed in a frame.
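For text data, changing the display form of the matching portion can be as simple as wrapping each occurrence in markers. The sketch below uses square brackets as a plain-text stand-in for a framed display; the function name `mark_matches` and the marker choice are assumptions, not part of the description.

```python
def mark_matches(text: str, keyword: str,
                 open_mark: str = "[", close_mark: str = "]") -> str:
    """Wrap every occurrence of keyword in text with the given markers."""
    if not keyword:
        return text  # nothing to mark for an empty keyword
    return text.replace(keyword, f"{open_mark}{keyword}{close_mark}")


print(mark_matches("これはテストです", "テスト"))
```

A real client would map the markers to bold, color, or reverse video in whatever rendering layer it uses.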

The search server 300 executes search processing as described above. With this configuration, the search is performed in the language indicated by the search condition, against results of OCR processing with high character recognition accuracy, so search accuracy can be improved.

[Modification]
The present embodiment is not limited to the above configuration. For example, the information processing system 10 may include a function for generating, as an OCR result, a document file in which character strings can be searched (for example, a searchable PDF), and a function for allowing the user terminal 500 to download the generated document file. To realize these functions, the information processing system 10 may include: means for generating the document file for each OCR execution language group; means for managing the OCR target document in association with the generated document file and the OCR execution language group; means for temporarily holding, when a search result document is displayed, the language information of the keyword used as the search condition; and means for causing, when download of the document file is instructed after the search result is displayed, the document file associated with the language information held by the holding means to be downloaded. This allows the user to download the appropriate searchable document without extra effort.
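The download variation can be sketched as a small registry that maps each OCR target document to one searchable file per OCR execution language group and remembers the language of the most recent search keyword. The class name `SearchableFileRegistry`, its methods, and the file names are all hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


class SearchableFileRegistry:
    """Associates documents with per-language-group files (illustrative only)."""

    def __init__(self) -> None:
        # document ID -> {language group -> file name}
        self._files: Dict[str, Dict[Tuple[str, ...], str]] = {}
        self._held_language: Optional[str] = None

    def register(self, document_id: str, language_group: Sequence[str],
                 file_name: str) -> None:
        """Record the file generated for one OCR execution language group."""
        self._files.setdefault(document_id, {})[tuple(language_group)] = file_name

    def hold_language(self, language: str) -> None:
        """Temporarily hold the language of the keyword used for the search."""
        self._held_language = language

    def file_for_download(self, document_id: str) -> Optional[str]:
        """Return the file whose language group includes the held language."""
        if self._held_language is None:
            return None
        for group, file_name in self._files.get(document_id, {}).items():
            if self._held_language in group:
                return file_name
        return None


registry = SearchableFileRegistry()
registry.register("JOB001", ("Japanese", "English"), "JOB001_ja_en.pdf")
registry.register("JOB001", ("Korean", "English"), "JOB001_ko_en.pdf")
registry.hold_language("Korean")
print(registry.file_for_download("JOB001"))
```

After a Korean keyword search, the download request resolves to the file produced by the Korean/English pass without the user having to choose between the generated files.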

The above description uses a plurality of mutually different “languages” as an example, but the information processing apparatus according to the present embodiment is not limited to “languages”. It is also applicable when the character types (character sets) differ from each other, for example when the document to be read contains “Hiragana” and “Katakana”, “Hiragana” and “Kanji”, or “Hiragana” and “Arabic numerals”. In addition to languages, character sets include character types such as hiragana, katakana, kanji, and Arabic numerals. Characters of the same character type or the same language that differ in typeface may also be treated as separate character sets. For example, for a document containing “Hiragana” (first character set) and “Katakana” (second character set), the OCR execution unit 203 performs OCR processing on the hiragana portions using Hiragana as the OCR execution language, and on the katakana portions using Katakana as the OCR execution language. As a result, the hiragana portions are reliably converted to hiragana, and the katakana portions are reliably converted to katakana.

As described above, the information processing apparatus 10 according to the present embodiment includes: a data receiving unit 201 that acquires document image data obtained by reading, as an image, a document containing a plurality of mutually different character sets; an OCR execution unit 203 that, for each of the plurality of character sets, performs character recognition on the entire document image data using that character set and converts it into document data, thereby generating, for one piece of document image data, a number of pieces of document data corresponding to the number of character sets; and a search processing unit 305 that executes a search for input characters, input as a search condition, in at least one piece of the document data generated by the OCR execution unit 203.

DESCRIPTION OF SYMBOLS: 10 information processing system, 100 information processing apparatus, 200 OCR server, 300 search server, 400 multifunction peripheral, 500 user terminal, 21, 31 CPU, 22, 32 memory, 23, 33 storage unit, 24, 34 communication unit, 201 data receiving unit, 202 OCR execution language determination unit, 203 OCR execution unit, 204 OCR result transmission unit, 205 data storage unit, 206 language information storage unit, 207 OCR result storage unit, 301 OCR result receiving unit, 302 OCR information management unit, 303 search instruction receiving unit, 304 search language determination unit, 305 search processing unit, 306 search result transmission unit, 307 OCR result storage unit, 308 OCR information storage unit, 309 character information storage unit.

Claims

Data acquisition means for acquiring document image data obtained by reading a document including a plurality of different character sets as an image;
Document data generation means for performing, for each of the plurality of character sets, character recognition on the entire document image data using that character set and converting it into document data, thereby generating, for one piece of document image data, a number of pieces of document data corresponding to the number of character sets;
Document data storage means for storing the plurality of pieces of document data generated by the document data generation means;
Search character set determination means for determining the character set of input characters input as a search condition;
Search execution means for executing a search for the input characters input as the search condition on, among the plurality of pieces of document data stored in the document data storage means, the document data corresponding to the character set of the input characters determined by the search character set determination means,
An information processing apparatus comprising:

If the document includes different first and second character sets,
The document data generation means recognizes the entire document image data based on the first character set and converts it into document data to generate first document data, and based on the second character set. Recognizing the entire document image data and converting it to document data to generate second document data;
The information processing apparatus according to claim 1.

If the document includes different first and second character sets,
The document data generation means performs character recognition on the entire document image data based on the first character set and a third character set whose character shapes are dissimilar to each other and converts it into document data to generate first document data, and performs character recognition on the entire document image data based on the second character set and a fourth character set whose character shapes are dissimilar to each other and converts it into document data to generate second document data;
The information processing apparatus according to claim 1.

Document data acquisition means for acquiring, from the document data storage means, the document data corresponding to the character set of the input characters input as the search condition is further provided,
When the character set of the input characters input as the search condition is a fifth character set,
The document data acquisition means acquires at least one piece of document data including the fifth character set from the document data storage means,
The search execution means executes the search for the input characters on each piece of the acquired document data,
The information processing apparatus according to claim 1.

The plurality of different character sets are a plurality of different languages.
The information processing apparatus according to claim 1.

An information processing system including a character conversion device and a search device,
The character conversion device includes:
Data acquisition means for acquiring document image data obtained by reading a document including a plurality of different character sets as an image;
Document data generation means for performing, for each of the plurality of character sets, character recognition on the entire document image data using that character set and converting it into document data, thereby generating, for one piece of document image data, a number of pieces of document data corresponding to the number of character sets;
Document data storage means for storing the plurality of pieces of document data generated by the document data generation means;
Transmitting means for transmitting the document data to the search device,
The search device includes:
Receiving means for receiving the document data from the character conversion device;
A search character set determination means for determining a character set of an input character input as a search condition;
Search execution means for executing a search for the input characters input as the search condition on, among the plurality of pieces of document data stored in the document data storage means, the document data corresponding to the character set of the input characters determined by the search character set determination means;
Transmitting means for transmitting a search result by the search execution means to a user terminal of a user who has input the search condition;
An information processing system comprising:

Data acquisition means for acquiring document image data obtained by reading a document including a plurality of different character sets as an image;
Document data generation means for performing, for each of the plurality of character sets, character recognition on the entire document image data using that character set and converting it into document data, thereby generating, for one piece of document image data, a number of pieces of document data corresponding to the number of character sets;
Document data storage means for storing the plurality of pieces of document data generated by the document data generation means;
Search character set determination means for determining the character set of input characters input as a search condition; and
Search execution means for executing a search for the input characters input as the search condition on, among the plurality of pieces of document data stored in the document data storage means, the document data corresponding to the character set of the input characters determined by the search character set determination means,
A program for causing a computer to function as each of the above means.