JP3142986B2

JP3142986B2 - Document information retrieval device

Info

Publication number: JP3142986B2
Application number: JP05133746A
Authority: JP
Inventors: 由明黒沢; 久子田中; 善邦松村
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1993-06-04
Filing date: 1993-06-04
Publication date: 2001-03-07
Anticipated expiration: 2016-03-07
Also published as: JPH06348758A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a document information retrieval apparatus and method for retrieving a desired document in an apparatus that files document images or the results of recognition processing applied to them.

[0002]

[Prior Art] Filing apparatuses have conventionally been provided that read document images with a scanner and accumulate the image information, thereby electronically realizing a substitute for files of paper documents. However, once an enormous number of documents has been accumulated, a user who has filed a document may forget the keywords attached to it, and finding the desired document among the vast collection then takes a great deal of effort. Such filing apparatuses have therefore become very difficult to use.

[0003] A paper document can be found relatively easily from its appearance, on the basis of a human memory such as "the document that had a coffee stain on it". When a document is digitized, however, such information is discarded as superfluous, so a conventional electronic filing apparatus cannot search for a desired document in the natural way a person recalls it, as is possible with paper.

[0004] In particular, in an apparatus that files document images after applying processing such as document structure analysis and character recognition, noise such as stains is removed during recognition processing in order to raise the recognition rate. The document displayed at retrieval time is the one obtained after recognition processing, and it differs in appearance from the original document image that remains in the person's memory, so it cannot be judged at a glance whether it is the desired document.

[0005]

[Problems to Be Solved by the Invention] As described above, conventional electronic filing apparatuses have the problem that filed document information cannot be retrieved by the appearance characteristics of a document, which tend to remain in human memory.

[0006] The present invention has been made in view of this point, and its object is to provide a document information retrieval apparatus and method capable of retrieving a desired document using appearance characteristics of the document as a search key.

[0007]

[Means for Solving the Problems] The document information retrieval apparatus and method according to the present invention are characterized in that information representing appearance characteristics of a document, extracted from an input document image, is stored in association with the input document image or with the result of recognition processing applied to it; when information indicating appearance characteristics of a document is input as a search key, the input search key is collated with the stored information; and the corresponding document image or recognition-processed result is output according to the collation result.

[0008] A search using this information representing appearance characteristics of a document can also be combined with a search that specifies a document name or keyword. Information representing appearance characteristics of a document is roughly classified into the following three types: (1) information representing characteristics of the medium on which the document image was fixed (for example, paper color, paper quality, and paper type); (2) information representing characteristics of the substance fixed on the medium as the document image information (for example, the type of writing instrument and the presence or absence of stains); and (3) information representing characteristics, as an image, of the document image information shown on the medium (for example, the amount of margin, type of characters, writer, character density, and layout).

[0009]

[Operation] According to the present invention, information representing appearance characteristics of a document is extracted from the input document image and stored in association with the document, so a search can use as search keys the appearance characteristics of a document that tend to stick in human memory. Furthermore, because the information representing appearance characteristics, which is the key to realizing such natural retrieval, is extracted automatically from the input document image, no special sensor is needed and no extra burden is placed on the user.

[0010]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a schematic configuration diagram of the apparatus of this embodiment. A paper document to be stored as a file is input as image data from the image input unit 1 (for example, a scanner). Next, based on the input image data, the specific part extraction unit 2 extracts the components of the document (of which there are kinds such as the background part, character part, table ruled-line part, photograph part, illustration part, and graph part).

[0011] Meanwhile, when the input image data is to be filed as is, the storage information generation unit 3 stores the image data unchanged in the data storage unit 4; when some processing is to be applied to the input image data, it converts the image data into the data format used for the file and stores it in the data storage unit 4. The image data may also be subjected to document structure analysis or character recognition before being stored in the data storage unit 4. Data stored in this way is called stored data.

[0012] The specific part feature identification unit 5 identifies the features (for example, form, position, size, shape, color, and type) of the components extracted by the specific part extraction unit 2. This feature identification may instead be performed on the entire input image, in parallel with the specific part extraction.

[0013] From the analysis results of the specific part extraction or the specific part feature identification, the derived information extraction unit 6 determines derived information such as the paper type, paper quality, stains, paper color, type of writing instrument, write-in ratio, and document type.

[0014] The components obtained by the specific part extraction unit 2 and the features obtained by the specific part feature identification unit 5 (called the attribute information of the components) are then added to the stored data together with other attached information, and the whole is stored in the data storage unit 4 as a file. Alternatively, the derived information obtained by the derived information extraction unit 6 and its attribute information are added to the stored data together with other attached information, and the whole is stored in the data storage unit 4 as a file. The component information and derived information added to the stored data in this way are information representing appearance characteristics of the document.

[0015] At file retrieval time, the operator inputs information representing appearance characteristics of a document as search data through the search data input unit 7. The search unit 8 compares the component information and derived information added to the stored data with the input search data, and outputs the stored data for which they match to the output unit 9 as the search result.
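The storage and retrieval flow described above can be sketched as follows. This is only an illustration under assumptions: the names `StoredDocument` and `search`, the attribute keys, and the exact-match rule are hypothetical and are not specified in the patent text.

```python
from dataclasses import dataclass, field


@dataclass
class StoredDocument:
    """Stored data plus the appearance information attached to it (hypothetical layout)."""
    name: str
    appearance: dict = field(default_factory=dict)


def search(store, query):
    """Return the documents whose attached appearance information matches every query item."""
    return [doc for doc in store
            if all(doc.appearance.get(key) == value for key, value in query.items())]


# Illustrative contents of the data storage unit; all values are made up.
store = [
    StoredDocument("memo-1", {"paper_color": "white", "stain": True, "char_type": "handwritten"}),
    StoredDocument("report-7", {"paper_color": "white", "stain": False, "char_type": "printed"}),
    StoredDocument("flyer-3", {"paper_color": "pink", "stain": False, "char_type": "printed"}),
]

# "The white document that had a stain on it."
hits = search(store, {"paper_color": "white", "stain": True})
print([doc.name for doc in hits])  # ['memo-1']
```

An exact-match comparison is the simplest reading of "compares"; a tolerant or ranked match would be an equally plausible design.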

[0016] When the storage information generation unit 3 applies recognition processing to the image data, noise removal and stain removal are sometimes performed as preprocessing to raise recognition accuracy. The noise and stains removed there are the same as those identified by the specific part feature identification unit 5, so this preprocessing may be shared between the two.

[0017] The input image may be color data, gray (multi-level) data, or binary data. Information on the components appropriate to each kind of data is selected, and its features are identified and stored.

[0018] The specific part extraction unit 2, the specific part feature identification unit 5, and the derived information extraction unit 6 are described in detail below. First, extraction of the background part is described with reference to FIG. 2. The input image data is assumed to be expressed with color information.

[0019] First, the color separation unit 11 performs color separation on the input image data, and the image data separated and extracted for each color is stored in the corresponding color image buffer 12. Only specific colors may be stored in the color image buffers. Color separation is performed using the three RGB primaries or the three elements of lightness, saturation, and hue, but at the stage of storing in the color image buffers the data should preferably be separated by representative color (red, blue, yellow, green, purple, orange, indigo, white, and black; colors commonly used in documents, such as light blue, pink, and yellow-green, may also be set). In principle, this color separation analyzes the color components of each dot of the image data, determines which representative color the dot belongs to, and stores the information of that dot in the color image buffer corresponding to the determined representative color.
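The per-dot assignment to a representative color can be sketched as below. This is an assumption-laden illustration: the RGB values in `REPRESENTATIVE_COLORS`, the squared-distance rule, and the function names are hypothetical, since the patent text only says that each dot is assigned to a representative color.

```python
# Hypothetical palette: the color names follow the text, the RGB values are assumed.
REPRESENTATIVE_COLORS = {
    "white": (255, 255, 255),
    "black": (0, 0, 0),
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "green": (0, 128, 0),
}


def nearest_representative(rgb):
    """Name of the representative color with the smallest squared RGB distance to rgb."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, REPRESENTATIVE_COLORS[name]))
    return min(REPRESENTATIVE_COLORS, key=squared_distance)


def separate_colors(image):
    """Split an image (rows of (r, g, b) dots) into one coordinate list per representative color."""
    buffers = {name: [] for name in REPRESENTATIVE_COLORS}
    for y, row in enumerate(image):
        for x, rgb in enumerate(row):
            buffers[nearest_representative(rgb)].append((x, y))
    return buffers


image = [
    [(250, 250, 250), (10, 10, 10)],
    [(255, 255, 255), (200, 30, 30)],
]
buffers = separate_colors(image)  # plays the role of the color image buffers
```

Classifying in a lightness, saturation, and hue space, which the text also allows, would only change the distance function.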

[0020] Next, the dominant color among them is determined to be the background color. That is, the calculation unit 13 calculates the total number of dots (total area) stored in each color image buffer, and the background color determination unit 14 selects the color whose total area is largest. Based on the color determined here, the background part extraction unit 15 distinguishes the background part of the input image from the other parts and extracts the background part information from the color image buffers, centered on the information in the buffer of the background color. If other kinds of components obtained from the input image data (such as the character part described later) are identified beforehand or at the same time, and the part with the largest total area is extracted from what remains after excluding the parts identified as other components, the background part can be extracted more accurately.
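A minimal sketch of the largest-total-area rule, including the optional exclusion of dots already identified as other components. The function name `background_color` and the buffer layout (color name mapped to a list of dot coordinates) are assumptions made for illustration.

```python
def background_color(buffers, excluded=frozenset()):
    """Pick the color with the largest dot count, ignoring dots listed in excluded."""
    areas = {color: sum(1 for point in points if point not in excluded)
             for color, points in buffers.items()}
    return max(areas, key=areas.get), areas


# A densely printed page: black dots outnumber white ones.
buffers = {
    "white": [(0, 0), (1, 0), (2, 0)],
    "black": [(0, 1), (1, 1), (2, 1), (3, 1)],
}

naive, _ = background_color(buffers)

# Excluding the dots already identified as the character part changes the result.
character_part = set(buffers["black"])
refined, areas = background_color(buffers, excluded=character_part)
```

Here `naive` is "black" while `refined` is "white", which is the accuracy gain the paragraph attributes to identifying other components first.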

[0021] Instead of determining the background color from the total area, the input image data can be encoded into a run representation for each color, and the background color determined from the lengths of the runs of each color and their distribution. Another method obtains the connected regions of each color and identifies the background part from the sizes of the connected regions, the mean area of the connected regions, and their distribution.
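The run-based alternative might look like the sketch below. The patent text does not say how run lengths and their distribution are turned into a decision, so the rule used here (choose the color with the longest mean run) is purely an assumption, as are the function names and the one-letter color codes.

```python
from itertools import groupby


def runs(row):
    """Run-length encode one scan line into (color, length) pairs."""
    return [(color, len(list(group))) for color, group in groupby(row)]


def background_by_runs(rows):
    """Assumed rule: the background is the color whose runs are longest on average."""
    stats = {}  # color -> [total run length, number of runs]
    for row in rows:
        for color, length in runs(row):
            entry = stats.setdefault(color, [0, 0])
            entry[0] += length
            entry[1] += 1
    mean_length = {color: total / count for color, (total, count) in stats.items()}
    return max(mean_length, key=mean_length.get)


# "W" = white dot, "B" = black dot.
page = [list("WWWWBWWW"), list("WBBWWWWW"), list("WWWWWWWW")]
result = background_by_runs(page)
```

The intuition is that text strokes produce many short runs, while blank paper produces a few long ones.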

[0022] The background color determined here is information representing the "paper color" of the document, and this information is stored in the data storage unit 4 together with the stored data of the document via the "paper color" extraction unit 21 (derived information extraction unit 6). Information representing the "amount of margin" of the document can also be extracted from the size of the extracted background part and handled in the same way.

[0023] For the background part extracted here, the noise detection unit 17 counts the very small points of other colors contained in the background and calculates their density per unit area, thereby obtaining the degree of noise within the background color. This is information representing the quality of the paper ("paper quality"), and it is stored in the data storage unit 4 together with the stored data of the document via the "paper quality" extraction unit 19 (derived information extraction unit 6). "Paper quality" may be stored as the numerical noise value, or it may be compared with predetermined thresholds and stored after conversion into information such as "plain paper", "recycled paper", or "coarse paper". The definition of "paper quality" may also combine the paper color, density, amount of noise, and so on. In addition, the scanner may be provided with a mechanism that detects "paper thickness", and the information obtained from it may be handled in the same way.
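The conversion from noise density to a paper quality label can be sketched as follows. The threshold values and the function names are assumptions; the text only states that the density is compared with predetermined thresholds.

```python
def noise_density(speck_count, background_area):
    """Number of small off-color points per unit of background area."""
    if background_area <= 0:
        raise ValueError("background_area must be positive")
    return speck_count / background_area


def classify_paper_quality(density,
                           thresholds=((0.001, "plain paper"), (0.01, "recycled paper"))):
    """Map a noise density to a label; the threshold values are illustrative only."""
    for limit, label in thresholds:
        if density < limit:
            return label
    return "coarse paper"


label = classify_paper_quality(noise_density(50, 10_000))  # density 0.005
```

With these assumed thresholds, `label` is "recycled paper". Storing the raw density alongside the label would keep both options the paragraph mentions.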

[0024] The background may also consist of several colors. For example, when there is a white background part covering the whole area and a smaller background part of another color, the "stain" part extraction unit 16 extracts the non-white background part as a separate component called a "stain part". A part that has a color other than white and whose shape (outline) is not straight-edged may be determined to be a "stain" part. When such a "stain" part exists, information indicating the presence or absence of a "stain" is stored in the data storage unit 4 together with the stored data of the document via the "stain" information extraction unit 20 (derived information extraction unit 6). The size (total area) and position (center point or representative point) of the "stain" part may be detected by the size and position detection unit 18 and stored in the "stain" information together with the color of the "stain" part. "Stains" and "paper quality" can also be extracted when the input image is gray.

[0025] Next, extraction of the character part is described with reference to FIG. 3. The input image data may be color, gray, or binary. First, the connected region extraction unit 31 extracts from the input image data connected regions in which black pixels are clustered to some extent (in most cases one character forms one connected region). The analysis unit 32 then examines the connected regions, or the regions formed by merging connected regions that are interlocked or lie very close together, judging whether they are arranged in a straight line, whether their sizes are uniform, whether the pitch of the arrangement is nearly constant, and whether character recognition applied to them yields a reasonable confidence. Based on these judgments, the character part extraction unit 33 decides whether to identify the extracted connected regions as a character part.
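The connected region extraction step can be illustrated with a breadth-first flood fill over a binary image. The choice of 8-connectivity, the `'#'` encoding of black pixels, and the bounding-box output are assumptions for this sketch; the patent text does not prescribe a particular labeling algorithm.

```python
from collections import deque


def connected_regions(bitmap):
    """Bounding boxes (x0, y0, x1, y1) of 8-connected groups of '#' pixels.

    bitmap is a list of equal-length strings; boxes come out in scan order.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = set()
    boxes = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != "#" or (x, y) in seen:
                continue
            seen.add((x, y))
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] == "#" and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


bitmap = [
    "##..#",
    "##..#",
    ".....",
]
boxes = connected_regions(bitmap)
```

The resulting boxes are the kind of input on which the alignment, size uniformity, and pitch checks of the analysis unit could then operate.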

[0026] The features of the character part extracted here are identified as follows. When a color image has been input, the character color detection unit 35 determines the color of the character part. When a single document contains several character parts of different colors, the color is detected for each character part. Information representing the detected character color is stored in the data storage unit 4 together with the stored data of the document. When there are several colors, information associating the position of each character part with its color is stored.

[0027] Character type, one of the features of the character part, includes handwriting, printed type, stamps (seal impressions), dot-matrix printing, font kinds, and so on. These types are identified from the character size, color, arrangement of the character strings, shape of the frame surrounding a character string, and the like, and the result is stored in the data storage unit 4. For example, noting that whether most of a document was written by hand or was printed type is a feature that tends to remain in human memory, the character type determination unit 37 identifies handwriting or printed type using at least one of the handwriting dictionary 40 and the printed-type dictionary 41. If character recognition is performed with the printed-type dictionary 41, for instance, many characters obtain only low confidence when most of the document is handwritten, whereas nearly reasonable confidence is obtained when it is printed. The document is therefore judged to be printed if the sum of the confidences over all characters is higher than a predetermined threshold, and handwritten if it is lower. This handwritten-or-printed information is stored in the data storage unit 4 as character type information together with the stored data of the document. Handwriting and printed type can also be distinguished without a dictionary: if the misalignment of the connected regions in their vertical and horizontal arrangement is very small, the document is judged to be printed, and if the alignment is irregular, it is judged to be handwritten.
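Both ways of separating handwriting from printed type described above can be sketched in a few lines. The confidence values are assumed to come from a recognizer that is not shown; the function names, the threshold, and the use of a population standard deviation as the misalignment measure are assumptions.

```python
from statistics import pstdev


def classify_by_confidence(confidences, threshold):
    """Printed if the summed recognition confidence exceeds the threshold, else handwritten."""
    return "printed" if sum(confidences) > threshold else "handwritten"


def classify_by_alignment(baseline_ys, max_deviation=1.0):
    """Dictionary-free variant: printed if the character baselines barely deviate."""
    return "printed" if pstdev(baseline_ys) <= max_deviation else "handwritten"


# Hypothetical per-character confidences from a printed-type dictionary.
typed = classify_by_confidence([0.90, 0.95, 0.92], threshold=2.4)
written = classify_by_confidence([0.30, 0.50, 0.40], threshold=2.4)

# Hypothetical baseline positions (in dots) of the characters on one line.
aligned = classify_by_alignment([100, 100, 101, 100])
ragged = classify_by_alignment([100, 104, 97, 109])
```

Because the text compares a sum with a fixed threshold, the threshold has to scale with the number of characters; comparing the mean confidence instead would avoid that, at the cost of departing from the wording.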

[0028] When the document is judged to be handwritten, the following further processing is possible. If the apparatus is used personally, it matters whether a document was written by the user or by someone else. Using a dictionary 45 that holds the characteristic patterns of the apparatus owner's handwritten characters, the writer identification unit 44 therefore judges from the level of the confidence whether or not the writer is the owner. If the dictionary 45 is held for several people, the name of the writer of the document can also be estimated. This writer information is stored in the data storage unit 4 together with the stored data of the document.

[0029] For handwriting, the type of writing instrument can also be identified. The writing instrument identification unit 46 determines whether the document was written with a pencil, a ballpoint pen, a felt-tip pen, or the like, from the faintness and density of the character strokes (which requires gray image data), the stroke width detected by the character stroke thickness detection unit 36, and so on. This writing instrument information is stored in the data storage unit 4 together with the stored data of the document. The stroke thickness itself may be stored as the writing instrument information. Alternatively, the image input unit 1 may be provided with a separate reflected-light detector and a means for analyzing reflectance and spectral characteristics, so that the substance fixed on the paper is identified as pencil lead, ballpoint ink, felt-tip ink, copier toner, printer ribbon ink, or the like, thereby identifying the type of writing instrument or distinguishing a copy from an original.

[0030] When the document is judged to be printed, the font identification unit 47 may further determine its font (Mincho, brush script, Gothic, italic, and so on) and store it in the data storage unit 4. It suffices to extract the information representing these features for the characters that make up most of the document.

[0031] When handwritten and printed characters are judged to be mixed, it is also effective for the mixture ratio detection unit 48 to calculate the estimated number of handwritten characters, or the ratio of that number to the estimated total number of characters, and to store it in the data storage unit 4 as information representing the amount of characters written into the document by hand. This information can also be obtained by calculating the ratio between the total area of the regions containing handwritten characters and that of the regions containing printed characters.

[0032] Character kind, another feature of the character part, includes numerals, Latin letters, kana, kanji, and so on. Noting that whether a document was written in English or in Japanese is a feature that tends to remain in human memory, the character kind determination unit 38 identifies English or Japanese using at least one of the alphabet dictionary 42 and the kana-kanji dictionary 43. If character recognition is performed with the alphabet dictionary 42, for instance, many characters obtain only low confidence when the document is in Japanese, whereas nearly reasonable confidence is obtained when it is in English. The document is therefore judged to be English if the sum of the confidences over all characters is higher than a predetermined threshold, and Japanese if it is lower. This language information is stored in the data storage unit 4 together with the stored data of the document. Similar processing can also be applied to numerals, and information indicating that the document is a list of numbers, like a business form, can be added to the language information.

[0033] In addition, the pitch detection unit 39 detects the character pitch and line pitch of the extracted character part, and based on the result the vertical/horizontal identification unit 49 identifies whether the document is written vertically or horizontally. Four states are possible: (1) the sheet (for example, A4) placed in portrait orientation and written horizontally; (2) the sheet placed in landscape orientation and written vertically; (3) the sheet placed in portrait orientation and written vertically; and (4) the sheet placed in landscape orientation and written horizontally. If the pitch in the horizontal direction is smaller than the pitch in the vertical direction, the state is judged to be 1 or 2; in the opposite case it is judged to be 3 or 4. Furthermore, character recognition is performed twice, once assuming that the characters are written so as to be readable with the sheet in the orientation in which it was placed, and once assuming that they are readable when the sheet is turned by a right angle, and the results are compared to distinguish 1 from 2, or 3 from 4. Information indicating which of the four states applies is stored in the data storage unit 4 together with the stored data of the document.
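The two-stage decision among the four states can be sketched as below. The function name and parameters are hypothetical. The mapping of "readable as placed" to states 1 and 3 is an inference from the description, not something the text states outright, and equal pitches or equal confidences are resolved arbitrarily here.

```python
def classify_layout(horizontal_pitch, vertical_pitch,
                    confidence_as_placed, confidence_rotated):
    """Return the state number 1 to 4 for a scanned sheet.

    Pitches are measured in the scanned image. The confidences are assumed to be
    overall recognition scores with the sheet as placed and turned by a right angle.
    """
    readable_as_placed = confidence_as_placed >= confidence_rotated
    if horizontal_pitch < vertical_pitch:
        # Characters run along horizontal lines in the image: state 1 or 2.
        return 1 if readable_as_placed else 2
    # Characters run along vertical lines in the image: state 3 or 4.
    return 3 if readable_as_placed else 4


state = classify_layout(horizontal_pitch=10, vertical_pitch=18,
                        confidence_as_placed=0.9, confidence_rotated=0.2)
```

In this example `state` is 1: tight horizontal pitch and readable as placed, which corresponds to a portrait sheet written horizontally.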

[0034] The character size and density detection unit 50 can also determine the size and density of the characters. In this case too, instead of storing the numerical size and density values in the data storage unit 4 as they are, they may be converted into information such as "small characters, densely packed" or "large characters, sparse" and then stored.
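Converting the measured size and density into descriptive labels can be sketched as follows. The units (millimeters, characters per square centimeter), the cut-off values, and the intermediate "medium" and "normal" labels are all assumptions; the text gives only the two example labels.

```python
def describe_text(char_height_mm, chars_per_cm2,
                  small_below=3.0, large_above=6.0,
                  dense_above=4.0, sparse_below=1.0):
    """Turn numeric size and density into a label; all cut-offs are illustrative."""
    if char_height_mm < small_below:
        size = "small characters"
    elif char_height_mm > large_above:
        size = "large characters"
    else:
        size = "medium characters"

    if chars_per_cm2 > dense_above:
        density = "densely packed"
    elif chars_per_cm2 < sparse_below:
        density = "sparse"
    else:
        density = "normal density"

    return f"{size}, {density}"


label = describe_text(char_height_mm=2.0, chars_per_cm2=6.0)
```

Storing labels rather than raw numbers lets the operator search with the same vague wording they remember, which is the point of the paragraph.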

[0035] The example above described extracting the "paper quality" and "stain" parts from the background part. Alternatively, when character recognition is applied to the character part, recognition can be performed on the data as is, without the normalization and similar preprocessing normally performed to raise the recognition rate. A part where the recognition rate is collectively worse than a predetermined threshold can then be identified as a "stain" part, and if the recognition rate is poor over the whole sheet, the sheet can be identified as "poor-quality paper".

[0036] Extraction of the other specific parts (components) is described below with reference to FIG. 4. The ruled-line part of a table is extracted as an area in which the straight line and curve detection unit 61 detects many straight lines, the intersection detection unit 62 detects many intersections at which those lines cross at right angles, and the analysis unit 63 judges that the positions and lengths of the lines are aligned.

[0037] Next, when the paper type determination unit 71 judges that straight lines arranged at a constant pitch are present over the entire sheet, it determines that the paper is ruled report paper or letter paper. The arrangement of the lines also shows whether the paper is ruled vertically or horizontally. In addition, the ruled-line arrangement, colors, and marks (such as a printed company name) of frequently used paper can be registered in the paper type dictionary 72; matching the extracted group of lines and other features against this dictionary identifies the paper type more specifically, for example as "in-house report paper" or "entry form for submission to Department A". Information indicating the paper type and information indicating the position and size of the table's ruled-line portion are stored in the data storage unit 4 together with the stored data of the document.
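The constant-pitch test can be sketched as below, assuming the line detector has already produced the coordinates of a set of parallel lines. The function name, the 10% pitch tolerance, and the 80% coverage requirement are assumptions made for illustration.

```python
def is_ruled_paper(line_positions, page_length, tolerance=0.1, coverage=0.8):
    """Return True if parallel lines sit at a constant pitch across the page.

    line_positions: coordinates of detected parallel lines (for example,
    the y coordinate of each horizontal line), in the same units as
    page_length. tolerance and coverage are illustrative assumptions.
    """
    pos = sorted(line_positions)
    if len(pos) < 3:
        return False
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    mean_gap = sum(gaps) / len(gaps)
    constant_pitch = all(abs(g - mean_gap) <= tolerance * mean_gap for g in gaps)
    spans_page = (pos[-1] - pos[0]) >= coverage * page_length
    return constant_pitch and spans_page


print(is_ruled_paper([10, 20, 30, 40, 50, 60, 70, 80, 90], 100))
print(is_ruled_paper([10, 12, 50, 90], 100))
```

Running the same check on vertical-line x coordinates would distinguish vertically ruled from horizontally ruled paper.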

[0038] A drawing portion is extracted as an area in which many straight lines and curves, and many intersections between them, are detected but which is not regarded as the ruled-line portion of a table.

[0039] A photograph portion can be extracted using image-area separation, a known image processing technique. Photograph portions include gravure photograph portions, in which the shading of the image changes smoothly, and halftone photograph portions, which are characterized by rows of black dots whose size varies with the part of the image. Analyzing the colors of the photograph portion also makes it possible to determine whether it is a color or a monochrome photograph.

[0040] A graph portion is extracted using techniques commonly used in drawing recognition, such as circle extraction, rectangle extraction, and line-segment extraction. These extraction processes may be applied only to a drawing portion extracted as described above, in order to determine whether the drawing is a graph or some other kind of drawing. Graph portions include bar graphs, pie charts, and line graphs.

[0041] The components extracted in this way (background, characters, drawings, photographs, graphs, and so on) are stored in the data storage unit 4 together with the stored data of the document, along with their position, size, attribute information such as type, and derived information. Instead of storing the positions and sizes themselves, it is also effective to convert them, via the positional relationship/ratio detection unit 73 for the components, into positional information such as "a photograph portion is at the upper right and a graph portion is at the lower left" or ratio information such as "drawings occupy 60% of the whole" or "drawings and characters are present in a ratio of 1:2", and to store that instead.
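One way to derive such positional and ratio descriptions from bounding boxes is sketched below. The box format `(x, y, width, height)` with the origin at the top left, the quadrant-based labels, and the function names are assumptions for illustration only.

```python
def position_label(box, page_w, page_h):
    """Describe where a component sits, using the quadrant of its center."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    vertical = "upper" if cy < page_h / 2 else "lower"
    horizontal = "left" if cx < page_w / 2 else "right"
    return f"{vertical} {horizontal}"


def area_share(box, page_w, page_h):
    """Return the percentage of the page area covered by the component."""
    _, _, w, h = box
    return round(100 * w * h / (page_w * page_h))


photo = (120, 10, 60, 60)    # hypothetical photograph portion
graph = (10, 200, 80, 80)    # hypothetical graph portion
drawing = (0, 0, 200, 180)   # hypothetical drawing portion
print(position_label(photo, 200, 300), "/", position_label(graph, 200, 300))
print(area_share(drawing, 200, 300), "% of the page")
```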

[0042] A mark or color present at a predesignated location can serve as another component. That is, the device detects whether a specific mark or color is present at the designated location and stores this information in the data storage unit 4 together with the stored data of the document. For example, if the user adopts the practice of marking the upper right corner of important documents with a red pen, it is effective to detect whether red is present in the upper right corner of the input image and to add the result to the stored data as information indicating whether the document is important. Alternatively, the entire image may be searched for the specific mark or color without designating a location.
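The red-pen check in the upper right corner can be sketched as follows. The image is modeled as a plain list of rows of `(r, g, b)` tuples; the corner size, the color thresholds, and the minimum pixel count are all assumptions chosen for the example.

```python
def has_red_mark(pixels, corner_fraction=0.2, min_red_pixels=5):
    """Detect a red mark in the upper right corner of an RGB image.

    pixels: list of rows, each a list of (r, g, b) tuples in 0-255.
    corner_fraction, min_red_pixels, and the color thresholds below
    are illustrative assumptions.
    """
    height, width = len(pixels), len(pixels[0])
    rows = range(0, max(1, int(height * corner_fraction)))
    cols = range(width - max(1, int(width * corner_fraction)), width)

    def is_red(pixel):
        r, g, b = pixel
        return r > 180 and g < 90 and b < 90

    count = sum(1 for y in rows for x in cols if is_red(pixels[y][x]))
    return count >= min_red_pixels


# Hypothetical 20x20 page: all white, then a red check in the upper right.
WHITE, RED = (255, 255, 255), (220, 30, 30)
plain_page = [[WHITE] * 20 for _ in range(20)]
marked_page = [row[:] for row in plain_page]
for y in range(3):
    for x in range(17, 20):
        marked_page[y][x] = RED

print(has_red_mark(plain_page), has_red_mark(marked_page))
```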

[0043] FIG. 5 shows how the information described above is stored. Each component, each item of derived information, and their attribute information are assigned numerical data or codes corresponding to their names. For components, as shown in the left half of the figure, an attribute name is paired with its attribute value, and this pair is called an attribute set. A component is then paired with its attribute sets (there may be several) and stored, for example in table form. For derived information, as shown in the right half of the figure, each item of derived information is stored paired with its attribute value. The effect of the present invention is obtained even if only one of these two kinds of information is stored. FIG. 5 is an example, and not all of this information needs to be stored.

[0044] The tabular information on the components and the information expressed as pairs of derived information and attribute values may be stored in a location separate from the stored data (the document image or its recognition result), for example in a directory section, or may be attached to the header of the stored data. Storing it separately makes retrieval faster, because a search using this information only has to scan the directory section and read out the stored data for the matching entries.

[0045] The kinds of attribute names contained in each component, the kinds of derived information, and the kinds of attribute values that can be defined for each item of derived information are determined in advance. For a drawing portion, for example, the attribute names might be the three kinds color, size, and position. Which attribute name of which component is assigned to which place in the table (memory address) is also decided in advance. The attribute values obtained by the specific portion extraction unit 2 and the specific portion feature identification unit 5 are then written at the places of the corresponding attribute names. If extraction or identification fails, or if, for example, the scanner is not a color scanner and the color cannot be obtained, NULL is written at the place of each attribute name whose value could not be obtained.
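A fixed table with predetermined slots and NULL for missing values might look like the sketch below. Python's `None` stands in for NULL, a dictionary keyed by (component, attribute name) stands in for fixed memory addresses, and the `stain` entry in the schema is an assumption added for the example.

```python
# Predetermined attribute names per component. The "drawing" entry follows
# the text; the "stain" entry is an illustrative assumption.
SCHEMA = {
    "drawing": ("color", "size", "position"),
    "stain": ("color", "size", "position"),
}


def new_record():
    """Create a record with every predetermined slot set to None (NULL)."""
    return {(comp, attr): None for comp, attrs in SCHEMA.items() for attr in attrs}


def write_attributes(record, component, values):
    """Write obtained values into their slots; unobtained ones stay None."""
    for attr in SCHEMA[component]:
        record[(component, attr)] = values.get(attr)
    return record


# A monochrome scanner cannot supply the color, so that slot remains None.
rec = write_attributes(new_record(), "drawing", {"size": 12, "position": "upper right"})
print(rec[("drawing", "color")], rec[("drawing", "size")])
```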

[0046] For derived information as well, which item of derived information is assigned to which storage location in memory is decided in advance. The attribute values that can be defined for each item of derived information are also predetermined: for example, three values (large, medium, small) for margin, and two values (handwritten, printed) for character type. Storing the attribute values defined for each item of derived information in the form of a table can be convenient for the retrieval described later. The attribute value obtained by the derived information extraction unit 6 is chosen from among these predefined values and written at the place of the corresponding item of derived information. NULL is written at the place of any derived information that could not be obtained.

[0047] FIG. 6(a) shows an example of operation at retrieval time. The user enters a query in natural language from the search data input unit 7, for example "a document I wrote myself on pink paper, with a coffee stain on it". The search information extraction unit 81 then uses the information in the storage unit 82, which holds a table associating each item of derived information with the attribute values predefined for it, to extract from the query the words "pink", "myself", and "stain" on which the search information is based, and obtains three items of search information: "paper color: pink", "writer: self", and "stain: present". The search information comparison/collation unit 83 then looks, among the attribute sets stored in the data storage unit 4 in association with each item of stored data (document), for those that contain the items of the obtained search information (such as "paper color"); an attribute set here is a pair of derived information and attribute value as in the right half of FIG. 5. It compares each stored attribute value with the corresponding value in the search information (such as "pink"), selects the documents for which they match, and outputs them to the document presentation unit 85.
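The flow from a natural-language query to selected documents can be sketched as below. The keyword table, the simple word splitting, and the document store are hypothetical stand-ins for the storage unit 82 and the data storage unit 4; real language analysis would be more involved.

```python
# Hypothetical table: trigger word -> (derived information item, attribute value).
KEYWORD_TABLE = {
    "pink": ("paper color", "pink"),
    "white": ("paper color", "white"),
    "myself": ("writer", "self"),
    "stain": ("stain", "present"),
}


def extract_search_info(query):
    """Pick out the words that map to derived information items."""
    words = query.lower().replace(",", " ").replace(".", " ").split()
    info = {}
    for word in words:
        if word in KEYWORD_TABLE:
            item, value = KEYWORD_TABLE[word]
            info[item] = value
    return info


def select_documents(info, store):
    """Return ids of documents whose stored values match every search item."""
    return [doc_id for doc_id, derived in store.items()
            if all(derived.get(item) == value for item, value in info.items())]


store = {
    "doc1": {"paper color": "pink", "writer": "self", "stain": "present"},
    "doc2": {"paper color": "white", "writer": "self", "stain": "present"},
}
info = extract_search_info("A document I wrote myself on pink paper, with a coffee stain")
print(info)
print(select_documents(info, store))
```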

[0048] In judging whether values match, a similarity between attribute values may be defined in advance. For example, an exact match has a similarity of 100%, pink and red 80%, white and black 0%, and a stored attribute value of NULL 50% (indicating that no judgment can be made). Then, even if not every item of search information obtained by the search information extraction unit 81 matches exactly, a document may be selected if the sum, over all items of search information, of the similarities resulting from the comparison exceeds a predetermined threshold. Some derived information, such as "paper color", is of a kind whose attribute value can be extracted with a fair degree of confidence, whereas other derived information, such as "writer", often leaves it ambiguous whether the value should be self or someone else. A weight may therefore be set in advance for each item of derived information, and the similarities summed with these weights so that the similarity for reliable derived information counts heavily while the similarity for ambiguous derived information serves only as a reference. Instead of being fixed in advance, the weights may be determined from a confidence value that is also computed when the derived information extraction unit 6 performs the extraction.
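The weighted similarity sum can be sketched as follows. The similarity values (100%, 80%, 0%, and 50% for NULL) come from the text; the weights, the default similarity of 0 for unlisted pairs, and the threshold in the usage line are assumptions.

```python
# Similarities for non-identical value pairs, as given in the text.
PAIR_SIMILARITY = {
    frozenset(("pink", "red")): 0.8,
    frozenset(("white", "black")): 0.0,
}

# Hypothetical weights: paper color is reliable, writer is ambiguous.
WEIGHTS = {"paper color": 1.0, "writer": 0.3}


def similarity(query_value, stored_value):
    """Similarity between a query value and a stored attribute value."""
    if stored_value is None:          # NULL: cannot be judged
        return 0.5
    if query_value == stored_value:   # exact match
        return 1.0
    # Unlisted pairs default to 0.0 (an assumption of this sketch).
    return PAIR_SIMILARITY.get(frozenset((query_value, stored_value)), 0.0)


def score(info, derived, weights=WEIGHTS):
    """Weighted sum of similarities over all search items."""
    return sum(weights.get(item, 1.0) * similarity(value, derived.get(item))
               for item, value in info.items())


query = {"paper color": "pink", "writer": "self", "stain": "present"}
stored = {"paper color": "red", "writer": None, "stain": "present"}
total = score(query, stored)   # 1.0*0.8 + 0.3*0.5 + 1.0*1.0
print(total, total > 1.5)      # 1.5 is a hypothetical selection threshold
```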

[0049] It is also effective to combine this with the keyword search of an ordinary filing device. That is, keywords representing the content of a document are added to the document data automatically or manually at filing time; at retrieval time, if the user can recall a keyword, the candidates are first narrowed down by keyword, and a search using derived information as described above is then performed. In this way, the user no longer needs to worry about whether a keyword is unique when adding it, which reduces the burden on the user. In addition, information such as "filed around such-and-such month" can be obtained from a clock function provided in the filing device and added to the document data; searching with this time information in combination with the derived information described above can also improve retrieval accuracy.

[0050] The above described retrieval using derived information; retrieval using components and their attribute information can be performed as follows. In this case, the stored attribute values are closer to raw data than in the case of derived information, so the storage unit 82 holds each component, the attribute names it has, and the information name that each such pair represents. Suppose, for example, that the query "a document on pink paper with a stain, where the stain was large and located around the upper right" is entered. The items of search information "paper color", "stain size", and "stain position" are extracted by matching against the information names in the storage unit 82, the query words that immediately follow each item ("pink", "large", and "upper right") are extracted, and each item-value pair becomes an item of search information.

[0051] Next, using the information in the storage unit 82, the information name "paper color" is converted into the component and attribute name "color of the background portion", and the search information is compared with the component/attribute information pairs (as in the left half of FIG. 5) stored in the data storage unit 4 in association with each document. First, the attribute sets containing the search item "color of the background portion" are located in the data storage unit 4; their attribute values are compared with the value in the search information (such as "pink"), and the documents for which they match are selected. The search information uses rough expressions such as "large". For "stain size", for example, a correspondence in which the numerical values 1 to 10 are "small", 11 to 20 are "medium", and 21 to 30 are "large" is stored in advance, so that "large" is converted into the numerical range 21 to 30 and compared with the attribute values in the data storage unit 4. Because this is a comparison between numerical values, the similarity is easy to compute.
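Converting a rough label into a numerical range and comparing it with a stored value can be sketched as below. The three ranges are the ones given in the text; the function name is an assumption, and the stored value is assumed to be a number rather than NULL.

```python
# Correspondence between rough labels and numerical ranges for "stain size".
SIZE_RANGES = {"small": (1, 10), "medium": (11, 20), "large": (21, 30)}


def matches_size(label, stored_value):
    """Check whether a stored numerical size falls in the label's range."""
    low, high = SIZE_RANGES[label]
    return low <= stored_value <= high


print(matches_size("large", 25), matches_size("large", 15))
```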

[0052] Retrieval using both derived information and component attribute information is also possible. In particular, suppose the derived information holds coarse information such as "stain: present" while the component attribute information holds fine-grained information such as the color and size of the stain. From the query "with a stain", the device first consults the derived information attached to the document data and selects the documents (possibly several) with "stain: present". It then presents the attribute names of the corresponding component "stain", such as "color" and "size"; the user continues the query by entering, for example, "brown" for "color" and "large" for "size", and the candidates are narrowed down using the component attribute information. Some derived information is itself hierarchical, as with "character type" and "font" or "writer": if "character type" is "printed", the derived information "font" may exist, whereas if it is "handwritten", the derived information "writer" may exist. The interactive narrowing described above is possible here as well.

[0053] FIG. 6(b) shows another example of operation at retrieval time. First, the derived information item display unit 86 displays the derived information that the derived information extraction unit 6 can extract, together with the attribute values predefined for each item, as shown at 100 in the figure. Looking at this display, the user uses the search data input unit 7 to specify a value for each item of derived information while recalling, for example, that the "paper color" of the desired document was "pink" and that the "margin" was medium. Items the user cannot recall need not be entered; they are simply excluded from the subsequent comparison. For the search data entered in this way, the search information comparison/collation unit 83 selects and presents the matching documents in the same way as in FIG. 6(a).

[0054] The above is an embodiment in which component attribute information and derived information are extracted from the document image being filed and used for retrieval; it is also effective to use this information when flipping through documents. That is, in a system that lets the user select a desired document by presenting the stored data (documents) one after another as if flipping through pages, the component attribute information and derived information stored with each presented document are rendered back into the image. For example, image information for a "stain" is superimposed on the document image in accordance with the stored color and size of the stain. This is particularly useful when the stored data is the input document image with noise removed: without it, the document presented while flipping would lack the noise the user remembers, and the user could not tell at a glance whether it is the desired document. Rendering the information back avoids this problem and improves usability.

[0055] The device may also be configured to display the extracted derived information and attribute information (for example, as "paper color: pink") together with the original document image or the stored data, so that the user can correct, change, or add to it. In this case, if the user, for example, corrects the paper color to "white", adds the information "amount of writing: large", and deletes the information "stain: present", the derived information and other information as corrected are stored in the data storage unit 4 in association with the document.

【００５６】[0056]

[Effects of the Invention] As described in detail above, according to the present invention, a desired document can be retrieved using information representing the appearance of the document that is automatically extracted from the input image (paper color, paper quality, stains, character color, character type, character kind, writer, writing instrument, font, amount of writing in the margins, vertical/horizontal orientation of the document, character size and density, paper type, positional relationship of drawings and photographs, and so on). Even when the user does not clearly remember the content or keywords of a document, retrieval by recalling peripheral information about it becomes possible.

[Brief description of the drawings]

FIG. 1 is a diagram showing the schematic configuration of the device of this embodiment.

FIG. 2 is a diagram showing an example of processing when the device of this embodiment extracts the background portion.

FIG. 3 is a diagram showing an example of processing when the device of this embodiment extracts the character portion.

FIG. 4 is a diagram showing an example of processing when the device of this embodiment extracts the ruled-line portion of a table, the drawing portion, the photograph portion, and the graph portion.

FIG. 5 is a diagram showing an example of the format of the component information and derived information stored in the data storage unit 4.

FIG. 6 is a diagram showing the configuration for retrieval in the device of this embodiment.

[Explanation of symbols]

1: image input unit; 2: specific portion extraction unit; 3: stored information generation unit; 4: data storage unit; 5: specific portion feature identification unit; 6: derived information extraction unit; 7: search data input unit; 8: search unit; 9: output unit; 11: color separation unit; 12: color image buffer; 13: total area calculation unit; 14: background color determination unit; 15: background portion extraction unit; 16: "stain" portion extraction unit; 17: noise detection unit; 18: size/position detection unit; 19: "paper quality" extraction unit; 20: "stain" information extraction unit; 21: "paper color" extraction unit; 31: connected region extraction unit; 32: analysis unit; 33: character portion extraction unit; 34: image buffer; 35: character color detection unit; 36: character line thickness detection unit; 37: character type determination unit; 38: character kind determination unit; 39: pitch detection unit; 40: handwriting dictionary; 41: printed-type dictionary; 42: alphabet dictionary; 43: kana/kanji dictionary; 44: writer identification unit; 45: owner handwriting dictionary; 46: writing instrument identification unit; 47: font identification unit; 48: mixing ratio detection unit; 49: vertical/horizontal identification unit; 50: character size/density detection unit; 61: straight line/curve detection unit; 62: intersection detection unit; 63, 65: analysis units; 64: table ruled-line portion extraction unit; 66: drawing portion extraction unit; 67: image-area separation unit; 68: photograph portion extraction unit; 69: circle/rectangle/line-segment extraction unit; 70: graph portion extraction unit; 71: paper type determination unit; 72: paper type dictionary; 73: positional relationship/ratio detection unit for the components; 81: search information extraction unit; 82: derived information/attribute value correspondence table storage unit; 83: search information comparison/collation unit; 84: similarity/weight storage unit; 85: document presentation unit; 86: derived information item display unit

Continuation of front page

(56) References: JP-A-2-41566 (JP, A); JP-A-1-229373 (JP, A)

(58) Fields searched (Int. Cl.⁷, DB name): G06F 17/30; G06T 1/00; JICST file (JOIS)

Claims

(57) [Claims]

1. A document information retrieval device comprising: image input means for inputting a document image fixed on a medium; background color extraction means for performing color separation on the document image input by the image input means and extracting a background color of the document image from the distribution of the separated colors; background portion extraction means for extracting a background portion of the document image based on the background color extracted by the background color extraction means and on the components of the document image; noise detection means for detecting noise contained in the background portion extracted by the background portion extraction means; paper quality extraction means for extracting paper quality information from the noise detected by the noise detection means and from the background color; storage means for storing the paper quality information extracted by the paper quality extraction means in association with the input document image; search key input means for inputting, as a search key, information on the paper quality of a desired document image to be retrieved; search means for collating the search key input by the search key input means with the paper quality information stored in the storage means; and output means for outputting the corresponding document image from among the stored document images in accordance with the result of the collation by the search means.

2. A document information retrieval device comprising: image input means for inputting a document image fixed on a medium; background color extraction means for performing color separation on the document image input by the image input means and extracting a background color of the document image from the distribution of the separated colors; background portion extraction means for extracting a background portion of the document image based on the background color extracted by the background color extraction means and on the components of the document image; stain portion extraction means for extracting, as a stain portion, a portion of the background portion extracted by the background portion extraction means that has a color different from the background color and whose shape or outline is non-linear; storage means for storing information on the stain portion extracted by the stain portion extraction means in association with the input document image; search key input means for inputting, as a search key, information on a stain portion of a desired document image to be retrieved; search means for collating the search key input by the search key input means with the stain portion information stored in the storage means; and output means for outputting the corresponding document image from among the stored document images in accordance with the result of the collation by the search means.

3. A document information retrieval device comprising: image input means for inputting a document image fixed on a medium; character portion extraction means for extracting a character portion from the document image input by the image input means; writing instrument identification means for identifying the writing instrument used in the document image based on the thickness of the character portion extracted by the character portion extraction means; storage means for storing information on the writing instrument identified by the writing instrument identification means in association with the input document image; search key input means for inputting, as a search key, information on the writing instrument of a desired document image to be retrieved; search means for collating the search key input by the search key input means with the writing instrument information stored in the storage means; and output means for outputting the corresponding document image from among the stored document images in accordance with the result of the collation by the search means.