JP2008152502A

JP2008152502A - Document image retrieval device and program

Info

Publication number: JP2008152502A
Application number: JP2006339357A
Authority: JP
Inventors: Masahiro Wada; 正寛和田
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2006-12-18
Filing date: 2006-12-18
Publication date: 2008-07-03
Anticipated expiration: 2026-12-18
Also published as: JP4823049B2

Abstract

PROBLEM TO BE SOLVED: To shorten the index creation time for document image retrieval and to significantly reduce the retrieval processing load.
SOLUTION: A document image retrieval device performs an index creation process 11b, in which the text area of a document image transmitted from a scanner 20 is cut out and split into units delimited by the punctuation marks of the document, and the number of characters in each unit is measured and registered as an index. When document image data is retrieved, the number of characters between punctuation marks in the retrieval image data serving as the retrieval source is measured (11c), an index having the same number of characters as the measured number is detected (11d), and the document image data corresponding to the detected index is extracted from a memory unit 12 (11e). Information on the extracted document image data is displayed on a display input section 13 by a display control function 11a.
COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a document image search apparatus and program, and more specifically to a document image search apparatus that searches previously registered document image data for document image data corresponding to search document data, and to a program that realizes the functions of the apparatus.

Document image search apparatuses are known that search document image data registered in advance as search targets for document image data corresponding to search document data serving as the search source.
Conventionally, when such an apparatus registers document image data read from, for example, a scanner as a search target, it creates and registers as an index the text extracted from the document image data by character recognition. When the registered document image data is searched, the index is first consulted to check whether it contains the same text as the search document data.

One technique for generating and registering an index of the document image data to be searched in this way is the full-text search device of Patent Document 1.
To realize fast and accurate full-text search, the full-text search device of Patent Document 1 consults an index to find documents whose recognition candidate characters match a keyword, and also collates the shapes of the character images against the shape features of the characters that make up the keyword, thereby retrieving documents that satisfy the search condition.
Patent Document 1: JP 2001-175661 A

However, because the search device of Patent Document 1 searches by character recognition, the processing load of character recognition is large. That is, because the document image is character-recognized by OCR (Optical Character Recognition) and the result is registered as a search index, index generation takes time and the search processing load becomes large.

The present invention has been made in view of the above circumstances. Its object is to provide a document image search apparatus that shortens the index generation time for document image search and greatly reduces the search processing load, and a program that realizes the functions of the apparatus.

To solve the above problem, a first technical means of the present invention comprises: an input unit that inputs document image data to be held as a search target and search document data serving as the search source for searching the document image data; a character count measurement unit that recognizes punctuation marks in the document image data or the search document data input by the input unit and measures the number of characters between punctuation marks; a registration unit that registers, as an index, the character counts of the document image data measured by the character count measurement unit; and a search unit that searches for an index having the same character counts as the character counts between the punctuation marks of the search document data measured by the character count measurement unit.

A second technical means is characterized in that, in the first technical means, the character count measurement unit measures the character size in addition to the number of characters between punctuation marks, and the registration unit registers the character size together with the character count as the index.

A third technical means is characterized in that, in the first or second technical means, when the search document data contains a part where a line is broken without punctuation at a location other than the final position of a line of the document, the character count measurement unit regards that line break point as a punctuation mark when measuring the character counts.

A fourth technical means is characterized in that, in any one of the first to third technical means, the apparatus comprises a region extraction unit that extracts, from the documents of the document image data and the search document data, text regions in units of pages or in units of a plurality of regions contained in one page; the character count measurement unit measures the number of characters between punctuation marks for each text region extracted by the region extraction unit; and the registration unit registers the character counts of the document image as an index in association with the position information of the text region.

A fifth technical means is characterized in that, in the fourth technical means, when one continuous sentence is split across a plurality of text regions, the character count measurement unit merges the text regions that contain that sentence and measures the character counts using the merged text region as one search unit.

A sixth technical means is characterized in that, in any one of the first to fifth technical means, the search unit searches for an index whose sequence of character counts at least partially matches the sequence of character counts of the search document data measured by the character count measurement unit.

A seventh technical means is characterized in that, in any one of the first to sixth technical means, when registering the character counts as an index, the registration unit registers them with weights assigned to the full stops and/or commas contained in the document of the document image, and the search unit performs the search using the weights of the full stops and/or commas in addition to the character counts between the punctuation marks extracted from the search document.

An eighth technical means is characterized in that, in the seventh technical means, when there is a part where a line is broken at a location other than the final position of a line of the document, the registration unit registers that point with a weight as a line break point, and the search unit performs the search using the weights of the line break points in addition to the weights of the full stops and commas.

A ninth technical means is a document image search program that realizes the functions of the document image search apparatus of any one of the first to eighth technical means.

According to the present invention, the number of characters between punctuation marks is measured and registered as an index, and an index having the same character counts as the search document data is searched for. This shortens index generation compared with performing character recognition and greatly reduces the search processing load.

That is, because the number of characters between punctuation marks is measured and registered as an index, the index can be generated in less time than with OCR character recognition. And because the search looks for an index having the same character counts as the search document data of the search source, search time is shortened and, as a result, the search processing load is greatly reduced.

Further, according to the present invention, including the character size in the search condition in addition to the number of characters between punctuation marks enables a more accurate search than one based on character counts alone.
Furthermore, according to the present invention, searching within text regions extracted from the document image data to be searched allows document image data to be searched efficiently.

Furthermore, according to the present invention, when one continuous sentence is split across a plurality of text regions, the text regions containing that sentence are merged and the character counts are measured over them as a single measurement region. This allows an accurate search when, for example, the search is based on a modified document or on a digitized document.
Furthermore, according to the present invention, searching for an index whose sequence of character counts at least partially matches the measured sequence of character counts of the search document allows an efficient search even when parts of the document have been added or deleted, for example by a version update.
Furthermore, according to the present invention, assigning weights to full stops, commas, line breaks, and the like improves search accuracy.

FIG. 1 is a block diagram for explaining one embodiment of the document image search apparatus according to the present invention.
The document search apparatus of this embodiment is realized by an image filing server 10, which functions as a server that files image data and performs document image search. A scanner 20 is connected to the image filing server 10.
The scanner 20 includes a scanning unit 21 that reads a document image, a control unit 22 that controls each part of the scanner, and a communication I/F 23 for communicating with the image filing server 10. The document image data read by the scanning unit 21 is transmitted from the communication I/F 23 to the image filing server 10.

The image filing server 10 includes: a communication I/F 14 for communicating with the scanner 20; a storage unit 12 that stores the document image data read by the scanner 20 and the indexes used to search that data; a display/input unit 13 that displays various information such as search results and accepts user operations through a touch panel or the like; and a control unit 11 that controls all the elements of the image filing server 10. The communication I/F 14 is also configured to communicate with external devices such as a PC (not shown). When the control unit 11 receives document image data read by the scanner 20 via the communication I/F 14, it registers that data in the storage unit 12.

The control unit 11 reads a program that realizes the functions of the present invention from a memory (not shown) such as a ROM or HDD and executes it using a working area such as a RAM. This realizes the functions of the character count measurement unit, the region extraction unit, the search unit, and so on according to the present invention. The function of the registration unit of the present invention is realized by storing document image data and indexes in the storage unit 12 under the control of the control unit 11.

With these functions, the control unit 11 can execute an index creation process 11b that creates an index for received document image data.
In the index creation process 11b, the text region of the document image data transmitted from the scanner 20 is extracted first: the text regions contained in the document image data are identified, cut out, and extracted. A text region is either a whole page or a region contained within one page.

The extracted text region is then divided into units delimited by the punctuation marks of the document: punctuation marks are identified in the document data contained in the text region, and the text is split into the units lying between one punctuation mark and the next. Where a line is broken without punctuation at a location other than the final position of a line of the document, the line break position may also be regarded as containing a punctuation mark.
The number of characters is then measured for each unit, and the measured character counts are registered in the storage unit 12 as an index. The storage unit 12 registers the document image data and the index in association with each other.
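The splitting and counting step described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: it works on an already-decoded string as a stand-in for character shapes detected in the image, the delimiter set `PUNCTUATION` and the function name `count_between_punctuation` are hypothetical, and every newline in the input is assumed to be a mid-line break without punctuation (ordinary end-of-line wraps are assumed to have been removed beforehand).

```python
# Assumed delimiter set: the Japanese comma and full stop.
PUNCTUATION = {"、", "。"}


def count_between_punctuation(text, newline_is_delimiter=True):
    """Return the character count of each unit between delimiters.

    Each newline character is assumed to mark a mid-line break without
    punctuation. When newline_is_delimiter is False such breaks are skipped.
    """
    counts = []
    current = 0
    for ch in text:
        if ch in PUNCTUATION or (ch == "\n" and newline_is_delimiter):
            if current:          # close the unit, ignoring empty ones
                counts.append(current)
            current = 0
        elif ch != "\n":         # a skipped break adds nothing to the count
            current += 1
    if current:                  # trailing text with no closing delimiter
        counts.append(current)
    return counts
```

For example, `count_between_punctuation("今日は、晴れです。")` returns `[3, 4]`. Only the counts are kept, which is why no character needs to be recognized to build the index.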

When the text region is extracted, the user may set its position manually, or the control unit 11 may extract it automatically. In automatic extraction, a region can be identified as containing text from specific image patterns in the row and column directions. Alternatively, regions rendered with gradation may be judged to be photographs or graphics and excluded from the text region, and regions in which line segments or curves are recognized in the image may be judged to be figure drawings and likewise excluded.

When document image data is searched, the storage unit 12 is searched for document image data corresponding to the search image data serving as the search source.
In this case, when the image filing server 10 receives the search image data of the search source from the scanner 20, from an external device such as a PC (not shown), or from a recording medium, it measures the number of characters between the punctuation marks of the search image data (11c), detects an index corresponding to the measured character counts (11d), and extracts the document image data corresponding to the detected index from the storage unit 12 (11e). Information on the extracted document image data is displayed on the display/input unit 13 by a display control function 11a.
If more than one item of document image data is found, all of them are extracted and their information is displayed on the display/input unit 13. Then, in accordance with a user instruction or the default settings, OCR processing is applied to the extracted document image data and a detailed search is performed on the resulting text data.
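Steps 11c to 11e can be sketched as a lookup keyed by the sequence of character counts. The class `CountIndex`, its method names, and the document identifiers are hypothetical, and the in-memory dictionary merely stands in for the storage unit 12. The OCR refinement for multiple hits is left to the caller.

```python
from collections import defaultdict


class CountIndex:
    """Maps a sequence of character counts to the documents that produced it."""

    def __init__(self):
        self._by_counts = defaultdict(list)

    def register(self, doc_id, counts):
        # The tuple of counts is the index key; the document id is the payload.
        self._by_counts[tuple(counts)].append(doc_id)

    def search(self, query_counts):
        # Exact match on the whole sequence; may return several candidates.
        return list(self._by_counts.get(tuple(query_counts), []))


index = CountIndex()
index.register("doc-a", [3, 4, 7])
index.register("doc-b", [3, 4, 7])
index.register("doc-c", [5, 2])

candidates = index.search([3, 4, 7])
if len(candidates) > 1:
    # Several documents share the same counts: this is the point at which
    # the description falls back to OCR on the candidates only.
    pass
```

Because OCR is applied only to the few candidates that survive the count lookup, the expensive step is kept off the registration path.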

FIG. 2 is a diagram illustrating how a user registers document image data in the image filing server according to the embodiment of the present invention.
In this example, the scanner 20 is assumed to be built into a multifunction peripheral 30 that has a copying function. By connecting to a network, the multifunction peripheral 30 can access the image filing server 10 on that network.

The user uses the scanner function of the multifunction peripheral 30 to read in a document to be registered in the image filing server 10. The multifunction peripheral 30 transmits the document image data read by the scanner function to the image filing server 10 over the network and registers it there. The user can also transmit document image data to the image filing server 10 from a network scanner connected to the network or from an information processing device such as a PC. In that case, user authentication or similar processing is performed as necessary. The image filing server 10 accumulates and retains the document image data transmitted by users.

FIG. 3 is a diagram for explaining the text processing of document image data in the present invention.
When document image data to be registered is transmitted to the image filing server 10 from an external scanner 20 (or a copier or multifunction peripheral incorporating a scanner), the image filing server 10 registers and retains the transmitted document image data in its own storage unit 12 and performs text processing on it.

In the text processing, the control unit 11 of the image filing server 10 first cuts out and extracts the text region in the document image data. FIG. 3(A) shows an example of a text region of document image data.
The document in the extracted text region is then divided into units between punctuation marks (FIG. 3(B)). As noted above, when there is a part where a line is broken without punctuation at a location other than the final position of a line of the document, the apparatus may be set to regard that line break point as a punctuation mark as well.

The number of characters of text is then measured for each unit. FIG. 3(C) shows the measured character counts: the counts between full stops, commas, and line break points (line breaks without punctuation that are not at the final position of a line). As shown in FIG. 3(D), the measured character counts are registered in the storage unit 12 of the image filing server 10 as an index associated with the document image data.

Depending on the embodiment, this text processing measures not only the number of characters in each unit but also the character size, the text direction (vertical or horizontal writing), text blocks that take account of sentences split across a plurality of text regions, and weighted information on punctuation marks and line breaks. These are registered in addition to the character count information.

Through this text processing, document image data and the indexes associated with it are accumulated in the image filing server 10.
When a user actually searches the accumulated document image data, the user inputs the search document data serving as the search source to the image filing server 10 using a scanner, an external PC, or the like. The search document data may be image data read by the scanner 20 or the like, but it may also be text data created with a word processor, application-specific binary data, and so on.
The image filing server 10 applies to the input search document data the same text processing that it applies to document image data being registered.

The server then searches for document image data whose sequence of per-unit character counts is the same as the sequence obtained by that text processing. In addition to identical sequences, document image data whose sequence partially matches may be made retrievable as similar document image data.
Besides the sequence of character counts, the search uses, as appropriate, the character size, the text direction (vertical or horizontal writing), text blocks that take account of sentences split across a plurality of text regions, and weighted information on punctuation marks and line breaks.
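The description does not specify how a partial match is scored. One possible reading, sketched below, treats two documents as similar when their count sequences share a sufficiently long contiguous run. The functions `longest_common_run` and `find_similar` and the threshold `min_run` are assumptions made for illustration only.

```python
def longest_common_run(a, b):
    """Length of the longest contiguous run of counts shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1   # extend the run ending at (x, y)
                best = max(best, cur[j])
        prev = cur
    return best


def find_similar(index, query, min_run=3):
    """Return (doc_id, run_length) pairs, longest shared run first.

    index maps doc_id -> sequence of character counts.
    """
    hits = []
    for doc_id, counts in index.items():
        run = longest_common_run(query, counts)
        if run >= min_run:
            hits.append((doc_id, run))
    hits.sort(key=lambda h: (-h[1], h[0]))
    return hits
```

With this scoring, a revised version of a document in which one unit was inserted still shares the runs on either side of the insertion, so it can be reported as similar rather than missed.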

If the search finds document image data that satisfies the search conditions described above, that document image data is displayed; if the search fails, a message to that effect is displayed. When more than one item of document image data is found, further processing such as OCR can be applied to settle on the search target.

FIGS. 4 and 5 show an example of document image data that has a plurality of text regions within one page. Consider, for example, registering document image data 100 such as that shown in FIG. 4(A).
Suppose that applying the text processing described above to the document image data 100 and extracting its text regions yields a plurality of (here, six) text regions R1 to R6, as shown in FIG. 4(B).

In the text processing, the character count measurement and related processing described above are carried out on each of the extracted text regions R1 to R6. FIG. 4(B) shows the measured character counts for each of the text regions R1 to R6.
In this example too, as in the example above, the character size, the text direction (vertical or horizontal writing), text blocks that take account of sentences split across a plurality of text regions, and weighted information on punctuation marks and line breaks are measured as appropriate for each of the text regions R1 to R6, in addition to the sequence of character counts.

The information measured in the text processing is stored as an index in the storage unit 12 of the image filing server 10. At that time it is stored linked to the position information of the text regions R1 to R6 within the page.
FIG. 5 shows the position information of each text region. The position of each of the text regions R1 to R6 within the page is expressed by the coordinates of its four vertices. The coordinates may be, for example, pixel positions in a bitmap, or positions in an arbitrarily defined coordinate system.

When the user searches the document image data, the search image data of the search source is text-processed in the same way as above to obtain its text regions and their position information. The sequence of measured character counts is then compared for text regions of the stored document image data that lie at the same position as the text regions of the search image data, and document image data whose sequences match is extracted and displayed. The decision may be made by comparing the sequence for just one of the text regions, or by comparing the sequences for all or several of them.
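A minimal sketch of the position-based comparison, assuming each text region is stored with its four vertices and its count sequence. The `Region` type, the pixel tolerance `tol`, and the `require_all` switch are hypothetical; the description only says that regions at the same position are compared, using either one region or several.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Region:
    vertices: Tuple[Point, Point, Point, Point]  # four corners within the page
    counts: Tuple[int, ...]                      # character counts per unit


def same_position(a, b, tol=5):
    """True when every corner of a lies within tol of the matching corner of b."""
    return all(
        abs(ax - bx) <= tol and abs(ay - by) <= tol
        for (ax, ay), (bx, by) in zip(a.vertices, b.vertices)
    )


def page_matches(query, stored, require_all=True):
    """Compare count sequences of regions that occupy the same position."""
    if not query:
        return False
    hits = [
        any(same_position(q, s) and q.counts == s.counts for s in stored)
        for q in query
    ]
    # require_all=True compares every query region; False accepts a single hit.
    return all(hits) if require_all else any(hits)
```

A small tolerance is used here because two scans of the same page rarely align to the pixel; whether the apparatus tolerates such offsets is not stated in the text.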

FIG. 6 is a diagram for explaining an example of document image data search processing. It shows the processing when an index is searched using the measured character counts and the character size of a text region.
When the document image data to be searched is searched using the search document data of the search source, the measured character counts within the text region are used as the search condition, as described above. In addition, the character size of the text within the text region can be used.

For example, suppose that text of different character sizes is mixed within one text region R, as shown in FIG. 6(A).
In this case, when the image filing server 10 performs text processing on the text region R, it stores the character size in the index in association with the measured character counts between punctuation marks (FIG. 6(B)). The character size is size information generated according to a predetermined parameter.

When the stored document image data is searched, comparing the character size as well as the measured character counts of the text region R improves search accuracy.
Search efficiency can also be improved by, for example, first narrowing down the search targets by character size and then searching by the measured character counts. For instance, a search can be run on the condition "heading text whose character size is 1.5 or more", followed by a search on the measured character counts if more than one item of document image data matches. Heading text can be identified by conditions such as a character string at the beginning of the text whose character size is larger than that of the other characters, or a character string set in bold.
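The two-stage narrowing described above might look like the following sketch. The index layout (a list of `(char_count, char_size)` pairs per document), the rule that the first unit is the heading, and the function name `search_by_heading_then_counts` are assumptions for illustration; the value 1.5 comes from the example in the text.

```python
def search_by_heading_then_counts(index, min_heading_size, query_counts):
    """index maps doc_id -> list of (char_count, char_size) units.

    The first unit of each document is assumed to be its heading.
    """
    # Stage 1: narrow by the character size of the heading.
    candidates = [
        doc_id
        for doc_id, units in index.items()
        if units and units[0][1] >= min_heading_size
    ]
    if len(candidates) <= 1:
        return candidates
    # Stage 2: several documents remain, so compare the count sequences.
    wanted = list(query_counts)
    return [d for d in candidates if [c for c, _ in index[d]] == wanted]


index = {
    "a": [(5, 2.0), (12, 1.0)],
    "b": [(5, 1.5), (8, 1.0)],
    "c": [(5, 1.0), (12, 1.0)],
}
result = search_by_heading_then_counts(index, 1.5, [5, 12])
```

Here documents "a" and "b" pass the size filter, and only "a" also matches the count sequence, so `result` is `["a"]`.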

FIG. 7 is a diagram for explaining another example of the document image search process, showing the processing when the index is searched using the measured character counts of a text area and text direction information.
When the search-target document image data is searched using the source search document data, the text direction in the text area (vertical writing, horizontal writing, etc.) can be used as a search condition in addition to the measured character counts in the text area.

For example, as shown in FIG. 7(A), assume that the document image data 100 contains both text areas R1 and R2, whose text runs in different directions.
When the image filing server performs text processing, it associates the text direction (vertical writing, horizontal writing, etc.) with the measured character counts between punctuation marks and stores them together as an index (FIG. 7(B)).

The text direction (vertical or horizontal writing) can be determined from the image data patterns, which differ between vertical and horizontal writing. Alternatively, OCR may be applied to part of the document and morphological analysis performed on the resulting character data, with the direction in which meaningful character strings could be extracted taken as the text direction.

When the stored document image data is searched, search accuracy can be improved by comparing the text direction as well as the measured character counts of the text area.
For example, search efficiency can be improved by first narrowing down the search targets by text direction and then searching by measured character count.

FIGS. 8 and 9 are diagrams for explaining still another example of the document image data search process, showing the processing when the index is searched using the measured character counts of a text area and search unit information.
When the search-target document image data is searched using the source search document data, the search unit (sentence block) recognized as a text area may be used as a search condition in addition to the measured character counts in the text area.

For example, as shown in FIG. 8, assume that one sentence is split across two text areas R1 and R2. Here, the sentence runs from the end of the first page to the beginning of the next page. Because the text processing described above extracts text areas page by page, the single sentence ends up divided between the separate text areas R1 and R2.

In this embodiment, the control unit 11 of the image filing server 10 determines, for each of the extracted text areas R1 and R2, whether there is a punctuation mark at the final position of its text.
For example, the final position of the text area R1 is checked, and if there is no punctuation mark there, the text of the text area R1 is judged to continue into the next text area R2. The text areas R1 and R2 are then merged and treated as a sentence block. A sentence block constitutes one search unit.
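The merging rule above can be illustrated with a short sketch. It assumes, purely for illustration, that each text area is already available as a string in reading order and that the punctuation marks are the Japanese comma and period; the function name `build_sentence_blocks` is hypothetical. In the apparatus itself the check would be made on the image, not on decoded text.

```python
PUNCTUATION = "、。"  # assumed set of punctuation marks


def build_sentence_blocks(areas, punctuation=PUNCTUATION):
    """Merge consecutive text areas until one ends with a punctuation mark."""
    blocks = []
    current = ""
    for text in areas:
        current += text
        # A punctuation mark at the final position closes the block.
        if current and current[-1] in punctuation:
            blocks.append(current)
            current = ""
    if current:
        # Trailing text with no closing punctuation still forms a block.
        blocks.append(current)
    return blocks


print(build_sentence_blocks(["今日は晴れ", "です。", "明日は雨。"]))
```

Because the loop keeps accumulating until a closing punctuation mark is found, the same code covers blocks built from three or more text areas.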

In such a case, when the image filing server performs text processing, it associates information on the sentence block serving as the search unit with the character count information between punctuation marks and stores them together as an index (FIG. 9).
In the example of FIG. 9, “document” is set as the search unit for an ordinary text area cut out by text processing, and “sentence block” is set as the search unit for a sentence block formed by merging several text areas.

When document image data is searched using search document data that has not been altered at all relative to the document image data, the search can be performed with the page-by-page text areas described above. When the search is performed with altered search document data, however, a search in block units is effective. Likewise, when the search is performed with search document data such as electronic text data, text areas cannot be extracted, so this case can be handled by searching in block units.

Alternatively, when the measured character counts of document image data are registered as an index, the search unit may be registered only as “document” units. Then, when a document search is actually performed and the search-target text area has no punctuation mark at its final position, it is merged with the next text area into a block, so that the search is performed with the sentence block as the search target.
In the above example, two text areas are merged into a sentence block, but a sentence block may also be formed by merging three or more text areas.

FIGS. 10 to 12 are diagrams for explaining still another example of the document image data search process, showing a processing example in which document image data is registered with weights assigned to punctuation marks and line breaks so that searches can use these weights.
Here, assume for example that there is document image data containing a text area R as shown in FIG. 10(A). Performing text processing on this document image data and measuring the character counts of the extracted text area R gives the result shown in FIG. 10(B).

FIG. 11 shows the measured character counts of the text area as graphs. FIG. 11(A) is a graph in which the character counts between punctuation marks, measured from document image data such as that shown in FIG. 10, are plotted in order. This plotted waveform is referred to as the character count information waveform.
The character count information waveform of FIG. 11(A) is simply a plot of the measured character counts between punctuation marks; it can be understood as treating the periods, commas, and line breaks contained in the text area R all equally.

In contrast, FIG. 11(B) shows a character count information waveform that is weighted by treating the character count of each period as “0” so that it carries sentence-level information. Because periods are a characteristic feature of the text, weighting them and registering the result as search information can improve search accuracy.
In other words, in this example, when text processing is performed on document image data to be registered as a search target, the text areas are extracted and the character counts are measured for each text area. At that time, each period is weighted by treating its character count as 0, and this information is registered as an index together with the measured character counts.

When document image data is searched, the same text processing is performed on the source search document data: the character counts are measured for each text area and the periods are weighted in the same way. The search of the search-target document image data is then executed using the measured character counts and the period weighting information.

In FIG. 12, in addition to the period weighting described above, line breaks are weighted so that the waveform carries section-level information: wherever there is a line break, its character count is treated as two consecutive “0”s, and this information is registered together with the measured character counts. When a line break follows a period, the period weighting is omitted and only the two consecutive “0”s are applied. A “line break” is judged to be present when the text ends before the last position of a line and continues on the next line.

In this case, because both line breaks and periods are weighted, search accuracy can be improved further. Furthermore, in addition to the period weighting, or in addition to the period and line break weighting, commas may also be weighted for use in searches.
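The weighting scheme of FIGS. 11(B) and 12 might be sketched as below. Several points are assumptions made only for this illustration: the text is given as a string, `"\n"` stands for an already detected line break, the punctuation marks themselves are not counted as characters, and the function name `weighted_counts` and its flags are hypothetical.

```python
def weighted_counts(text, weight_period=True, weight_newline=True):
    """Return character counts between punctuation marks, with 0-weights.

    A period contributes one 0, a line break contributes two 0s, and a
    period directly followed by a line break contributes only the two 0s.
    """
    seq = []
    run = 0  # characters counted since the last division point
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "、。\n":
            if run:
                seq.append(run)
                run = 0
            if ch == "。":
                if weight_newline and text[i + 1:i + 2] == "\n":
                    seq.extend([0, 0])  # period weight omitted before a break
                    i += 1              # consume the line break as well
                elif weight_period:
                    seq.append(0)
            elif ch == "\n" and weight_newline:
                seq.extend([0, 0])
        else:
            run += 1
        i += 1
    if run:
        seq.append(run)
    return seq


sample = "あいう、えお。かきく。\nけこ"
print(weighted_counts(sample))
print(weighted_counts(sample, weight_period=False, weight_newline=False))
```

Exposing the weights as flags reflects the idea that the search method can be switched at search time, since weighting is not always the more accurate choice.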

However, when document data such as electronic text data is used as the source search document data, or when the document has been altered by layout changes or editing of the search-target document image data, search accuracy may actually be higher if periods and line breaks are treated equally without weighting. It is therefore desirable to allow the search method to be switched freely at search time.

FIGS. 13 and 14 are diagrams for explaining a processing example of searching for document image data similar to the search document data.
Here, as in the example of FIG. 11, assume that text processing of the search-target document image data has yielded a character count information waveform such as that shown in FIG. 13(A).

When document image data with such a character count information waveform is the search target, the character count information waveform of the source search document data has the same shape, provided the search document data has not been altered at all from the search-target document image data. The search can therefore be performed with only a short run of measured character counts corresponding to part of the waveform.
As shown in FIG. 13(B), when the search is performed with completely unaltered search document data, it is enough to use the measured character counts of part of the character count information waveform (for example, block D inside the dotted line). Because no alteration has been made, the search document data and the search-target document image data are certain to match.
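For the unaltered case, matching a short block such as block D amounts to looking for a contiguous run inside the registered sequence. A minimal sketch, assuming the counts are held as Python lists and using the hypothetical name `contains_block`:

```python
def contains_block(indexed_counts, query_counts):
    """Return True if query_counts occurs as a contiguous run in indexed_counts."""
    n, m = len(indexed_counts), len(query_counts)
    if m == 0:
        return False
    # Slide a window of the query length over the registered sequence.
    return any(indexed_counts[i:i + m] == query_counts for i in range(n - m + 1))


registered = [12, 5, 8, 20, 7, 3, 15]
print(contains_block(registered, [8, 20, 7]))
print(contains_block(registered, [8, 7]))
```

A very short block can of course match more than one registered document, which is one reason a later confirmation step is useful.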

However, when the source search document data has been altered from the search-target document image data, the character counts are likely to differ between them. In that case, the similarity of the document as a whole must be compared without being misled by local differences in the measured character counts. The search therefore needs to use a relatively long sequence of measured character counts.

For example, as shown in FIG. 14, assume that in the character count information waveform of the whole text area, blocks E and F are unaltered portions and block G is an altered portion. In such a case, the sequence of measured character counts corresponding to the waveform spanning blocks E, G, and F is used for the search.
Only the sequence of measured character counts in block G then fails to match that of the target document image data. As long as the sequences of at least part of the measured character counts (here, blocks E and F) match, the whole is judged to be similar and is extracted as document image data matching the search.
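One way to express this partial-match rule is with the standard library's `difflib.SequenceMatcher`, which reports the contiguous runs shared by two sequences. This is a swapped-in technique chosen for illustration, not a method named in the source; the function name `is_similar` and the threshold `min_run` are likewise assumptions.

```python
from difflib import SequenceMatcher


def is_similar(query_counts, indexed_counts, min_run=3):
    """Judge two count sequences similar if they share a long enough run."""
    matcher = SequenceMatcher(None, query_counts, indexed_counts, autojunk=False)
    # Each matching block is a contiguous run present in both sequences.
    return any(block.size >= min_run for block in matcher.get_matching_blocks())


registered = [12, 5, 8, 20, 7, 3, 15, 9, 11]   # blocks E, G, F as registered
edited = [12, 5, 8, 21, 6, 3, 15, 9, 11]       # block G altered in the query
print(is_similar(edited, registered))
print(is_similar([30, 31, 32, 33], registered))
```

The minimum run length trades recall against false matches; the source only requires that at least part of the sequence agrees, so the value 3 here is arbitrary.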

FIG. 15 is a flowchart for explaining an example of the index creation process in the document image search apparatus according to the present invention.
First, the document search apparatus receives document image data from the scanner (step S1). It then extracts the text areas of the received document image data (step S2). Here, for example, one page of document image data is treated as one text area, or, if one page contains several text areas, each of them is cut out and extracted separately.

Next, it is determined whether several text areas were extracted (step S3). If there is only one, the character counts of that text area are measured (step S4). Character count measurement counts the characters between punctuation marks in the text of the target text area. The measured character counts between punctuation marks are then registered as an index (step S5).
Depending on the embodiment of the character count measurement process, the character size, the text direction (vertical or horizontal writing, etc.), sentence blocks that account for text split across several text areas, and weighting information for punctuation marks and line breaks are also measured and registered in addition to the measured character counts.

If it is determined in step S3 that there are several text areas, the first text area is selected (step S6) and the character count measurement process is performed on it (step S7). This process is the same as the measurement process of step S4.
When the character count measurement of the selected text area is finished, it is determined whether any text area remains unmeasured (step S8). If so, that text area is selected as the next text area (step S10), and the process returns to the character count measurement of step S7.

If no unmeasured text area remains in step S8, the character count measurement of all text areas is complete, so the position information of each text area and the character counts between punctuation marks of each text area are registered as an index (step S9). As in step S5, depending on the embodiment, the character size, the text direction (the direction in which the character strings of the text run, such as vertical or horizontal writing), sentence blocks that account for text split across several text areas, and weighting information for punctuation marks and line breaks are registered in addition to the measured character counts.
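The flow of steps S3 to S10 could be outlined as follows. The sketch assumes that text-area extraction (step S2) has already produced a list of `(position, text)` pairs, with strings standing in for the image content; `count_between_punctuation`, `create_index`, and the dictionary layout are hypothetical names chosen for this illustration.

```python
import re


def count_between_punctuation(text):
    """Count characters in each unit delimited by a comma or period."""
    return [len(unit) for unit in re.split("[、。]", text) if unit]


def create_index(doc_id, areas):
    """Build an index record from extracted text areas (steps S3 to S10)."""
    if len(areas) == 1:
        # S3: a single text area. S4 measures it and S5 registers the counts.
        _, text = areas[0]
        return {"doc_id": doc_id,
                "areas": [{"counts": count_between_punctuation(text)}]}
    entries = []
    for position, text in areas:
        # S6, S8, S10: visit each text area in turn. S7 measures it.
        entries.append({"position": position,
                        "counts": count_between_punctuation(text)})
    # S9: register positions together with the counts of every text area.
    return {"doc_id": doc_id, "areas": entries}


print(create_index("scan-001", [((0, 0), "あい、う。"), ((0, 300), "えおか。")]))
```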

FIG. 16 is a flowchart for explaining an example of the character count measurement process for a text area. This example covers the determination of vertical or horizontal writing for the text in the text area, the measurement of character counts for each unit divided at punctuation marks, and the measurement of character size.
First, it is determined whether the text in the target text area is written vertically or horizontally (step S11). As described above, this may be determined from the pattern of the document image, or by applying OCR to part of the document image data and then performing morphological analysis.

The text in the text area is then divided at punctuation marks (step S12), and the character count of each divided unit between punctuation marks is measured (step S13). Line breaks may also be included among the division points here.
Next, it is determined whether the character size of the text is uniform within the text area (step S14). If it is uniform, the character count measurement process ends; if not, the character size is measured for each divided unit (step S15), and the process ends.
Although this example measures the character size, the process may omit character size measurement, as described above.
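Steps S12 to S15 might look like the sketch below. It assumes each character has already been isolated and is given as a `(character, size)` pair in reading order, and it takes the largest size in a unit as that unit's size; both the input format and that choice, as well as the name `measure_area`, are assumptions for illustration only.

```python
def measure_area(glyphs):
    """Return (counts, sizes) for a text area given (character, size) pairs.

    sizes is None when the character size is uniform across the area.
    """
    units, current = [], []
    for ch, size in glyphs:
        if ch in "、。":
            # S12: a punctuation mark closes the current unit.
            if current:
                units.append(current)
                current = []
        else:
            current.append(size)
    if current:
        units.append(current)

    counts = [len(unit) for unit in units]          # S13
    distinct = {size for unit in units for size in unit}
    if len(distinct) <= 1:                          # S14: uniform size
        return counts, None
    sizes = [max(unit) for unit in units]           # S15: size per unit
    return counts, sizes


heading = [(c, 1.5) for c in "見出し。"]
body = [(c, 1.0) for c in "本文、です。"]
print(measure_area(heading + body))
```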

FIG. 17 is a flowchart for explaining an example of the division process when one sentence is split across several text areas.
First, the process of dividing the document image data of a text area at punctuation marks, following the determined text direction (vertical or horizontal writing), is started (step S21). It is then determined whether the final position of the text in the text area is a period (step S22).

If the final position is not a period, it is further determined whether the same page contains a text area on which character count measurement has not yet been performed (step S23). If it does, the text area whose final position is not a punctuation mark and the next text area are merged into a single text area to form a sentence block, and the document image data of this sentence block is divided at punctuation marks (step S24).

If, on the other hand, the final position of the text area is a punctuation mark in step S22, the division process ends. If step S23 finds no text area on the same page awaiting character count measurement, the text area whose final position is not a punctuation mark is merged with a text area of the document image data on the next page to form a sentence block, and the document image data of this sentence block is divided at punctuation marks (step S25). If the next page contains several text areas, the first of them is used to form the sentence block.

FIG. 18 is a flowchart for explaining an example of the document search process in the document search apparatus according to the present invention.
When document image data is searched, the source search document data is first input (step S31). The document search apparatus extracts the text areas of the input search document data (step S32). It then determines whether several text areas were extracted (step S33); if there is only one, the character count measurement process is performed on the search document data of that text area (step S34).
As described above, the character count measurement process measures the character counts between punctuation marks and, depending on the embodiment, also measures the character size, the text direction (vertical or horizontal writing, etc.), sentence blocks that account for text split across several text areas, and weighting information for punctuation marks and line breaks.

The measurement results of the character count measurement process are then used to search the previously registered index (step S35).
It is determined whether the search extracted several document image data items (step S36). If only one was extracted, it is displayed as the search result (step S42). If several were extracted, OCR is applied to all of them, and the document image data corresponding to the search document data is selected and displayed as the search result (step S37).
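Steps S35 to S37 and S42 can be summarized in a few lines. In this sketch the index is assumed to be a dictionary from document id to count sequence, and the OCR comparison of step S37 is represented by a caller-supplied `confirm` callback; `retrieve` and `confirm` are hypothetical names, and no particular OCR library is implied.

```python
def retrieve(index, query_counts, confirm):
    """Look up the index, then disambiguate only if several documents match."""
    # S35: find every registered document with the same measured counts.
    hits = [doc_id for doc_id, counts in index.items() if counts == query_counts]
    if len(hits) <= 1:
        # S36 then S42: zero or one hit needs no further checking.
        return hits
    # S37: several hits, so confirm each candidate (for example via OCR).
    return [doc_id for doc_id in hits if confirm(doc_id)]


index = {"a": [3, 2], "b": [3, 2], "c": [4]}
print(retrieve(index, [4], confirm=lambda doc_id: False))
print(retrieve(index, [3, 2], confirm=lambda doc_id: doc_id == "b"))
```

The costly confirmation runs only on the few candidates that survive the count comparison, which is where the reduction in search load comes from.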

If, on the other hand, several text areas were cut out in step S33, the first text area is selected (step S38) and the character count measurement process is performed on it (step S39). This process is the same as the character count measurement process of step S34.

When the character count measurement of the selected text area is finished, it is determined whether any text area remains unmeasured (step S40). If so, the next unmeasured text area is selected (step S41) and the character count measurement process is executed on it (step S39).

If no unmeasured text area remains in step S40, the character count measurement of all text areas is complete, so the measurement results are used to search the previously registered index (step S35).
Alternatively, the character count measurement process may be performed on only one of the several text areas in the search document data, and the index search performed using that result.

FIG. 19 is a flowchart for further explaining an example of the index search within the search process.
In the index search, it is first determined whether a search for an identical document has been specified (step S51). For example, suppose two modes are selectable: one that searches for and extracts only completely identical document image data, excluding document image data partially altered by editing or the like, and one that extracts similar document image data in which part of the text area is identical. This step determines which of the modes has been specified. In the mode that searches for similar document image data, document image data altered by revising or editing the original document can be found.

If extraction of identical document image data has been specified, the index is searched for entries having the measured character counts of the text area of the search document data (step S52). If extraction of similar rather than identical document image data has been specified, the index is searched for entries whose measured character counts are similar to those of the search document data (step S54). Here, entries can be judged similar when the sequences of measured character counts between punctuation marks match in at least part of the text area.

The index search of step S52 can use not only the measured character counts but also, depending on the embodiment, the character size, the text direction (the direction in which the character strings of the text run, such as vertical or horizontal writing), sentence blocks that account for text split across several text areas, and weighting information for punctuation marks and line breaks.

If an index entry is detected in step S52 or step S54, the document image data registered in association with the detected index entry is extracted (step S53).

The program according to the present invention is a program for realizing the functions of the document image search apparatus described above. The program is stored in a memory of the document image search apparatus, such as a ROM or HDD, and the functions of the document image search apparatus described in the above embodiments are realized when a control means such as a CPU reads and executes the program. The functions of the embodiments described above may be realized not only by executing the program recorded in the memory, but also through processing performed jointly with an operating system or other application programs on the basis of the program's instructions.

The program can be recorded on a recording medium and distributed. Applicable recording media include semiconductor media (for example, ROMs and nonvolatile memory cards), optical recording media (for example, DVD, MO, MD, CD, and BD), and magnetic recording media (for example, magnetic tape and flexible disks). For distribution on the market, the program can also be held on a server computer connected via a network such as the Internet and transferred to the document image search apparatus.

FIG. 1 is a block diagram for explaining one embodiment of the document image search apparatus according to the present invention.
FIG. 2 is a diagram illustrating how a user registers document image data with the image filing server that is an embodiment of the present invention.
FIG. 3 is a diagram for explaining the text processing of document image data in the present invention.
FIG. 4 is a diagram showing an example of document image data with several text areas on one page.
FIG. 5 is another diagram showing an example of document image data with several text areas on one page.
FIG. 6 is a diagram for explaining an example of the document image data search process.
FIG. 7 is a diagram for explaining another example of the document image search process.
FIG. 8 is a diagram for explaining still another example of the document image data search process.
FIG. 9 is another diagram for explaining still another example of the document image data search process.
FIG. 10 is a diagram for explaining still another example of the document image data search process.
FIG. 11 is another diagram for explaining still another example of the document image data search process.
FIG. 12 is another diagram for explaining still another example of the document image data search process.
FIG. 13 is a diagram for explaining a processing example of searching for document image data similar to the search document data.
FIG. 14 is another diagram for explaining a processing example of searching for document image data similar to the search document data.
FIG. 15 is a flowchart for explaining an example of the index creation process in the document image search apparatus according to the present invention.
FIG. 16 is a flowchart for explaining an example of the character count measurement process for a text area.
FIG. 17 is a flowchart for explaining an example of the division process when one sentence is split across several text areas.
FIG. 18 is a flowchart for explaining an example of the document search process in the document search apparatus according to the present invention.
FIG. 19 is a flowchart for further explaining an example of the index search within the search process.

Explanation of symbols

R1 to R6 ... text areas, 10 ... image filing server, 11 ... control unit, 12 ... storage unit, 13 ... display/input unit, 14, 23 ... communication I/F, 20 ... scanner, 21 ... scan unit, 22 ... control unit, 30 ... multifunction peripheral, 100 ... text image data.

Claims

A document image search apparatus comprising:
an input unit for inputting document image data to be stored as a search target, and search document data serving as a search source for searching the document image data;
a character number measurement unit that recognizes punctuation marks in the document image data or the search document data input by the input unit, and measures the number of characters between punctuation marks;
a registration unit that registers, as an index, the number of characters of the document image data measured by the character number measurement unit; and
a search unit that searches for an index having the same number of characters as the number of characters between punctuation marks of the search document data measured by the character number measurement unit.
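For illustration only, the measuring, registering, and searching recited above can be sketched in Python. This is a minimal sketch under stated assumptions, not the apparatus itself: a plain string stands in for the result of recognizing punctuation in a scanned image, the punctuation set is limited to the Japanese comma and full stop, and the names `count_between_punctuation` and `CountIndex` are hypothetical.

```python
from collections import defaultdict

PUNCTUATION = "、。"  # assumed punctuation set: Japanese comma and full stop


def count_between_punctuation(text, marks=PUNCTUATION):
    """Return the number of characters in each run delimited by punctuation."""
    counts, run = [], 0
    for ch in text:
        if ch in marks:
            if run:                 # ignore empty runs between adjacent marks
                counts.append(run)
            run = 0
        elif not ch.isspace():      # whitespace is not counted as a character
            run += 1
    if run:                         # trailing run with no closing punctuation
        counts.append(run)
    return counts


class CountIndex:
    """Hypothetical index mapping a character-count sequence to document ids."""

    def __init__(self):
        self._by_counts = defaultdict(list)

    def register(self, doc_id, text):
        # The index key is the count sequence, not the recognized text itself.
        self._by_counts[tuple(count_between_punctuation(text))].append(doc_id)

    def search(self, query_text):
        # Exact match on the whole count sequence of the search document.
        key = tuple(count_between_punctuation(query_text))
        return list(self._by_counts.get(key, []))
```

A query whose characters differ but whose counts between punctuation marks agree (for example 5 and 4) retrieves the same registered document, which is why no character recognition of the body text is needed in this sketch.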

The document image search apparatus, wherein the character number measurement unit measures a character size in addition to the number of characters between punctuation marks, and the registration unit registers the index including the character size in addition to the number of characters.

The document image search apparatus according to claim 1, wherein, when the search document data contains a portion where a line break occurs without punctuation at a position other than the end of a line, the character number measurement unit regards the line break as a punctuation mark and measures the number of characters.
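As a hedged illustration of treating such a line break as punctuation, the sketch below assumes each line is given as a string and that a line is "broken early" when it is shorter than a known full line width. Both the `line_width` parameter and the function name `count_with_line_breaks` are assumptions made for this example; how an early line break is detected in an actual image is not stated in this section.

```python
PUNCTUATION = "、。"  # assumed punctuation set


def count_with_line_breaks(lines, line_width, marks=PUNCTUATION):
    """Count characters between delimiters, treating an early line break
    (a line shorter than line_width that does not end in punctuation)
    as if it were a punctuation mark."""
    counts, run = [], 0
    for line in lines:
        for ch in line:
            if ch in marks:
                if run:
                    counts.append(run)
                run = 0
            else:
                run += 1
        # run > 0 here means the text so far did not end with punctuation.
        if len(line) < line_width and run:
            counts.append(run)      # the early line break closes the run
            run = 0
        # A full-width line simply wraps, so the run continues on the next line.
    if run:
        counts.append(run)
    return counts
```

With a width of 10, an 8-character heading line followed by a wrapped 13-character sentence yields the counts 8 and 13 rather than a single run of 21.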

The document image search apparatus according to claim 1, further comprising an area extraction unit that extracts text areas, in units of a page or in units of a plurality of areas contained in one page, from the documents of the document image data and the search document data,
wherein the character number measurement unit measures the number of characters between punctuation marks for each text area extracted by the area extraction unit, and
the registration unit registers the number of characters of the document image as an index in association with position information of the text area.

The document image search apparatus according to claim 4, wherein, when continuous text is divided across a plurality of text areas, the character number measurement unit integrates the plurality of text areas containing the continuous text and measures the number of characters with the integrated text area as one search unit.
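The integration of text areas can be sketched as follows. This is only an illustration: the rule used to decide that text continues into the next area (the area does not end with a full stop) is an assumption of this example, and `TextArea`, `merge_continued_areas`, and the area labels are hypothetical names.

```python
from dataclasses import dataclass

MARKS = "、。"  # assumed punctuation set


@dataclass
class TextArea:
    name: str  # area label such as "R1" (illustrative)
    text: str  # stand-in for the recognized content of the area


def count_between_punctuation(text, marks=MARKS):
    """Return the number of characters in each run delimited by punctuation."""
    counts, run = [], 0
    for ch in text:
        if ch in marks:
            if run:
                counts.append(run)
            run = 0
        else:
            run += 1
    if run:
        counts.append(run)
    return counts


def merge_continued_areas(areas, sentence_end="。"):
    """Join consecutive areas until the joined text ends with a full stop.
    Assumption: an area not ending in a full stop continues in the next area."""
    merged, pending = [], None
    for area in areas:
        if pending is None:
            pending = TextArea(area.name, area.text)
        else:
            pending = TextArea(pending.name + "+" + area.name,
                               pending.text + area.text)
        if pending.text.endswith(sentence_end):
            merged.append(pending)
            pending = None
    if pending is not None:          # keep an unterminated final area as-is
        merged.append(pending)
    return merged
```

Counting each area separately would split a run that straddles two areas into two shorter counts, so the counts are taken only after merging.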

The document image search apparatus according to any one of claims 1 to 5, wherein the search unit searches for an index having an array of character counts that at least partially matches the array of character counts of the search document data measured by the character number measurement unit.
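One possible reading of "at least partially matches" is that the query's array of character counts appears as a contiguous run inside an indexed array. The sketch below implements only that reading, which is an assumption; `contains_sequence` and `partial_search` are hypothetical names, and the index is modeled as a plain dictionary from document id to count array.

```python
def contains_sequence(haystack, needle):
    """True if needle occurs as a contiguous run inside haystack."""
    haystack, needle = list(haystack), list(needle)
    n = len(needle)
    if n == 0:
        return False                 # an empty query matches nothing
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


def partial_search(index, query_counts):
    """Return ids of documents whose count array contains the query array.
    index: dict mapping a document id to its list of character counts."""
    return [doc_id for doc_id, counts in index.items()
            if contains_sequence(counts, query_counts)]
```

This lets a search document that is only an excerpt of a registered page, such as two consecutive clauses, still find that page.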

The document image search apparatus according to claim 1, wherein, when registering the number of characters as an index, the registration unit weights and registers the full stops and/or commas contained in the document of the document image, and
the search unit performs a search using the weights of the full stops and/or commas in addition to the number of characters between punctuation marks extracted from the search document.

The document image search apparatus according to claim 7, wherein, when the document contains a portion where a line break occurs at a position other than the end of a line, the registration unit weights and registers that position as a line-break point, and
the search unit performs a search using the weight of the line-break point in addition to the weights of the full stops and commas.
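A weighted comparison of this kind might look like the following sketch. The weight values, the scoring rule (sum the weights of positions where both the character count and the delimiter type agree), and the names `WEIGHTS`, `segments`, and `weighted_score` are all assumptions of this example; this section does not specify how the weights are chosen or combined. A newline in the input string stands in for a line break at a position other than the end of a line.

```python
# Hypothetical weights: full stop, comma, early line break.
WEIGHTS = {"。": 3.0, "、": 2.0, "\n": 1.0}


def segments(text):
    """Split text into (character count, terminating delimiter) pairs.
    A trailing run with no delimiter is dropped in this sketch."""
    out, run = [], 0
    for ch in text:
        if ch in WEIGHTS:
            if run:
                out.append((run, ch))
            run = 0
        else:
            run += 1
    return out


def weighted_score(indexed, query):
    """Sum the delimiter weights over positions where both the character
    count and the delimiter type agree between index entry and query."""
    return sum(WEIGHTS[d]
               for (n, d), (m, e) in zip(indexed, query)
               if n == m and d == e)
```

Ranking candidates by such a score would favor documents that agree with the search document on sentence boundaries over those that agree only on comma-delimited fragments.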

A document image search program for realizing the functions of the document image search apparatus according to claim 1.