JPH06348758A

JPH06348758A - Device and method for retrieving document information

Info

Publication number: JPH06348758A
Application number: JP5133746A
Authority: JP
Inventors: Yoshiaki Kurosawa; 由明黒沢; Hisako Tanaka; 久子田中; Yoshikuni Matsumura; 善邦松村
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1993-06-04
Filing date: 1993-06-04
Publication date: 1994-12-22
Anticipated expiration: 2016-03-07
Also published as: JP3142986B2

Abstract

PURPOSE: To provide a device and method for retrieving document information that retrieve a desired document using information expressing the external appearance of the document as a retrieval key. CONSTITUTION: A specified part of the input image data (such as a background part, character part, ruled-line part of a table, drawing part, or photograph part) is extracted (2); its features (such as size, position, kind, or color) are discriminated; information expressing the external appearance of the document (such as paper color, paper quality, stains, character color, character type, character kind, writer, writing tool, font, amount of writing in the margins, vertical and horizontal dimensions of the document, size and density of characters, kind of paper, or positional relation of drawings and photographs) is extracted (6); and this information is stored in a data storage part (4) in association with the document image or its recognition result. At retrieval time, the desired document is retrieved using this information expressing the external appearance of the document.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document information retrieval apparatus and method for retrieving a desired document in an apparatus that files document images or the results of recognition processing applied to them.

[0002]

[Prior Art] Filing apparatuses have conventionally been provided that read document images with a scanner and store the image information, thereby electronically realizing a substitute for files of paper documents. However, once an enormous number of documents has been accumulated, a user who has filed a document may forget the keyword attached to it, and finding the desired document among the enormous collection then takes a great deal of effort. Such filing apparatuses have therefore become very difficult to use.

[0003] With paper documents, a desired document can be found relatively easily from its external features, on the basis of human memory such as "the document that had a coffee stain on it." When a document is digitized, however, such information is discarded as superfluous, so a conventional electronic filing apparatus does not allow a user to look for a desired document in the natural way people recall things, as they can with paper.

[0004] In particular, in an apparatus that files document images after applying processing such as document structure analysis and character recognition, noise such as stains is removed during recognition processing in order to raise the recognition rate. The document displayed at retrieval time is the one produced after recognition processing, and its appearance differs from the original document image that remains in the user's memory, so the user cannot tell at a glance whether it is the desired document.

[0005]

[Problem to Be Solved by the Invention] As described above, conventional electronic filing apparatuses have the problem that filed document information cannot be retrieved by the external features of a document, which tend to remain in human memory.

[0006] The present invention has been made in view of this point, and its object is to provide a document information retrieval apparatus and method capable of retrieving a desired document by using the external features of the document as a retrieval key.

[0007]

[Means for Solving the Problem] The document information retrieval apparatus and method according to the present invention are characterized in that information representing the external features of a document, extracted from an input document image, is stored in association with the input document image or with the result of recognition processing applied to it; when information indicating external features of a document is input as a retrieval key, the input retrieval key is collated with the stored information; and the corresponding document image or recognition-processed result is output according to the collation result.

[0008] Retrieval using this information representing the external features of a document can also be combined with retrieval that specifies a document name or keyword. Information representing the external features of a document falls broadly into three kinds: (1) information representing features of the medium on which the document image was fixed (for example, paper color, paper quality, paper type); (2) information representing features of the substance fixed on the medium as document image information (for example, type of writing instrument, presence or absence of stains); and (3) information representing features, as an image, of the document image information on the medium (for example, amount of margin, type of characters, writer, character density, layout).

[0009]

[Operation] According to the present invention, information representing the external features of a document is extracted from the input document image and stored in association with the document, so retrieval can be performed using as a retrieval key the external features of a document that tend to leave an impression on people. Furthermore, the information representing the external features of the document, which is the key to realizing such natural retrieval, is extracted automatically from the input document image, so no special sensor is required and no extra burden is placed on the user.

[0010]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a schematic block diagram of the apparatus of this embodiment. A paper document to be stored as a file is input as image data from an image input unit 1 (for example, a scanner). Next, a specific portion extraction unit 2 extracts the components of the document from the input image data. The kinds of component include, for example, the background portion of the document, character portions, ruled-line portions of tables, photograph portions, illustration portions, and graph portions.

[0011] Meanwhile, a storage information generation unit 3 stores the image data unchanged in a data storage unit 4 when the input image data is to be filed as is. When some processing is to be applied to the input image data, it converts the image data into the data format used for filing and stores the result in the data storage unit 4. The image data may also be subjected to document structure analysis or character recognition processing before being stored in the data storage unit 4. The data stored in this way is called stored data.

[0012] A specific portion feature identification unit 5 identifies the features (for example, form, position, size, shape, color, kind) of the components extracted by the specific portion extraction unit 2. This feature identification may also be performed on the entire input image in parallel with the specific portion extraction.

[0013] A derived information extraction unit 6 determines, as derived information, the paper type, paper quality, stains, paper color, type of writing instrument, proportion of written-in text, document type, and so on from the analysis results of the specific portion extraction or the specific portion feature identification.

[0014] The components obtained by the specific portion extraction unit 2 and the features obtained by the specific portion feature identification unit 5 (called the attribute information of the components) are then appended to the stored data together with other auxiliary information, and the whole is stored as a file in the data storage unit 4. Alternatively, the derived information obtained by the derived information extraction unit 6 and its attribute information are appended to the stored data together with other auxiliary information, and the whole is stored as a file in the data storage unit 4. The component information and derived information appended to the stored data in this way are the information representing the external features of the document.

[0015] At file retrieval time, the operator inputs information representing the external features of a document as retrieval data through a retrieval data input unit 7. A retrieval unit 8 compares and collates the component information and derived information appended to the stored data against the input retrieval data, and outputs the stored data for which they match to an output unit 9 as the retrieval result.
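The collation performed by the retrieval unit 8 can be illustrated with a minimal Python sketch. The `StoredDocument` class, the attribute names (`paper_color`, `stain`, `char_type`), and the exact-match rule are assumptions made for illustration; the specification does not prescribe a data format or a matching rule.

```python
from dataclasses import dataclass, field


@dataclass
class StoredDocument:
    """Stored data plus the appearance information appended to it (hypothetical layout)."""
    name: str
    appearance: dict = field(default_factory=dict)


def search(documents, query):
    """Return the documents whose appearance information matches every key in the query."""
    return [
        doc for doc in documents
        if all(doc.appearance.get(key) == value for key, value in query.items())
    ]


docs = [
    StoredDocument("memo-1", {"paper_color": "white", "stain": True, "char_type": "handwritten"}),
    StoredDocument("report-2", {"paper_color": "white", "stain": False, "char_type": "printed"}),
]

# "The document that had a stain on it"
hits = search(docs, {"stain": True})
print([doc.name for doc in hits])
```

A query with several keys narrows the result, which corresponds to combining several remembered features of the document.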

[0016] When the storage information generation unit 3 applies recognition processing to the image data, noise removal or stain removal may be performed as preprocessing to raise recognition accuracy. The noise and stains removed there are the same as those identified by the specific portion feature identification unit 5, so this preprocessing portion may be shared between the two.

[0017] The input image may be color data, gray (multi-level) data, or binary data. Information on the components suited to each kind of data is selected, and their features are identified and stored.

[0018] The specific portion extraction unit 2, the specific portion feature identification unit 5, and the derived information extraction unit 6 are described in detail below. First, extraction of the background portion is described with reference to FIG. 2. The input image data is assumed to be represented by color information.

[0019] First, a color separation unit 11 performs color separation on the input image data, and the image data separated and extracted for each color is stored in the respective color image buffers 12. Only specific colors may be stored in the color image buffers. Color separation is performed using the three RGB primaries or the three elements of lightness, saturation, and hue, but at the stage of storage in the color image buffers the data is preferably separated by representative color (red, blue, yellow, green, purple, orange, indigo, white, and black; light blue, pink, yellow-green, and other colors used in documents may also be set). In principle, this color separation is performed by analyzing the color components of each dot of the image data, determining which representative color the dot's color belongs to, and storing the dot's information in the color image buffer corresponding to the determined representative color.
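The per-dot assignment to representative colors can be sketched as follows. The RGB values chosen for each representative color and the nearest-color rule (squared Euclidean distance in RGB) are assumptions for illustration; the specification only states that each dot is assigned to a representative color.

```python
# Hypothetical RGB values for a subset of the representative colors.
REPRESENTATIVE_COLORS = {
    "white": (255, 255, 255),
    "black": (0, 0, 0),
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "green": (0, 128, 0),
}


def nearest_representative(rgb):
    """Return the name of the representative color closest to rgb."""
    def distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, REPRESENTATIVE_COLORS[name]))
    return min(REPRESENTATIVE_COLORS, key=distance)


def separate_colors(pixels):
    """Split {(x, y): (r, g, b)} into one coordinate list (buffer) per representative color."""
    buffers = {name: [] for name in REPRESENTATIVE_COLORS}
    for xy, rgb in pixels.items():
        buffers[nearest_representative(rgb)].append(xy)
    return buffers


pixels = {(0, 0): (250, 250, 245), (1, 0): (10, 10, 10), (2, 0): (200, 30, 20)}
buffers = separate_colors(pixels)
```

Each list in `buffers` plays the role of one color image buffer 12.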

[0020] Next, the dominant color among them is determined to be the background color. That is, a calculation unit 13 calculates the total number of dots (total area) stored in each color image buffer, and a background color determination unit 14 selects the color whose total area is largest. Based on the color determined here, a background portion extraction unit 15 distinguishes the background portion of the input image from the other portions and extracts the background portion information from the color image buffers, centering on the information in the color image buffer of the background color. If other kinds of components obtainable from the input image data are identified beforehand or at the same time (for example, by the character portion extraction described later), and the portion with the largest total area is extracted from what remains after excluding the parts identified as other kinds of components, the background portion can be extracted more accurately.
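A minimal sketch of the area-based decision, assuming each color image buffer is available as a list of dot coordinates keyed by color name (a hypothetical representation). The optional `excluded` set models the refinement in which dots already identified as other components are left out of the count.

```python
def background_color(buffers, excluded=frozenset()):
    """Return the color whose buffer holds the most dots, ignoring excluded coordinates."""
    areas = {
        color: sum(1 for point in points if point not in excluded)
        for color, points in buffers.items()
    }
    return max(areas, key=areas.get)


buffers = {
    "white": [(0, 0), (1, 0), (2, 0)],
    "black": [(0, 1)],
}
print(background_color(buffers))
```

If two colors have the same total area, `max` returns the first one encountered; a real system would need an explicit tie-breaking rule, which the specification does not give.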

[0021] Instead of determining the background color from total area, the input image data can be encoded into a run representation for each color, and the background color determined from the run lengths of each color's portions and their distribution. Another method obtains connected regions for each color and identifies the background portion from the sizes of the connected regions, the mean area of the connected regions, and their distribution.
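The run-based alternative might look like the sketch below. Rows are assumed to be strings with one color code per dot, and the decision rule (the color with the longest mean run is the background) is one possible reading; the specification says only that run lengths and their distribution are used.

```python
from itertools import groupby


def mean_run_lengths(rows):
    """Return the mean horizontal run length of each color code over all rows."""
    totals = {}  # color -> (number of runs, summed run length)
    for row in rows:
        for color, run in groupby(row):
            length = sum(1 for _ in run)
            count, total = totals.get(color, (0, 0))
            totals[color] = (count + 1, total + length)
    return {color: total / count for color, (count, total) in totals.items()}


def background_by_runs(rows):
    """Pick the color with the longest mean run as the background (assumed rule)."""
    means = mean_run_lengths(rows)
    return max(means, key=means.get)


rows = ["WWWWBWWWW", "WWBBWWWWW"]
print(mean_run_lengths(rows), background_by_runs(rows))
```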

[0022] The background color determined here is information representing the "paper color" of the document, and this information is stored in the data storage unit 4 together with the stored data of the document via a "paper color" extraction unit 21 (derived information extraction unit 6). Information representing the "amount of margin" of the document can also be extracted from the size of the extracted background portion and handled in the same way.

[0023] For the background portion extracted here, a noise detection unit 17 counts the very small points of other colors contained in the background and calculates their density per unit area, thereby obtaining the degree of noise within the background color. This is information representing the quality of the paper ("paper quality"), and it is stored in the data storage unit 4 together with the stored data of the document via a "paper quality" extraction unit 19 (derived information extraction unit 6). "Paper quality" may be stored as a numerical noise value, or it may be compared with predetermined thresholds and stored after conversion into information such as "plain paper," "recycled paper," or "coarse paper." The definition of "paper quality" may also be decided by combining paper color, density, amount of noise, and so on. A mechanism for detecting "paper thickness" may also be provided in the scanner, and the information obtained from it handled in the same way.
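The conversion from noise density to a paper-quality label can be sketched as below. The threshold values and their ordering are hypothetical; the specification states only that the density is compared with predetermined thresholds.

```python
def paper_quality(speck_count, background_area,
                  thresholds=((0.001, "plain paper"), (0.01, "recycled paper"))):
    """Map the density of small off-color specks in the background to a label.

    thresholds is a sequence of (upper density limit, label) pairs in
    ascending order; the values used here are illustrative assumptions.
    """
    if background_area <= 0:
        raise ValueError("background_area must be positive")
    density = speck_count / background_area
    for limit, label in thresholds:
        if density < limit:
            return label
    return "coarse paper"


print(paper_quality(5, 10_000))    # density 0.0005
print(paper_quality(50, 10_000))   # density 0.005
print(paper_quality(500, 10_000))  # density 0.05
```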

[0024] The background may also consist of multiple colors. In that case, when there is, for example, a white background portion covering the whole area and a smaller background portion of another color, a "stain" portion extraction unit 16 extracts the non-white background portion as a separate component called a "stain portion." A portion whose color is other than white and whose shape (outline) is not rectilinear may also be determined to be a "stain" portion. When such a "stain" portion exists, information indicating the presence or absence of a "stain" is stored in the data storage unit 4 together with the stored data of the document via a "stain" information extraction unit 20 (derived information extraction unit 6). At this time, the size (total area) and position (center point or representative point) of the "stain" portion may be detected by a size/position detection unit 18 and stored as part of the "stain" information together with the color of the "stain" portion. "Stains" and "paper quality" can be extracted even when the input image is gray.

[0025] Next, extraction of character portions is described with reference to FIG. 3. The input image data may be color, gray, or binary. First, a connected region extraction unit 31 extracts from the input image data connected regions in which black pixels are clustered to some degree (in most cases one character forms one connected region). An analysis unit 32 then examines the connected regions, or regions formed by merging connected regions that interlock or lie very close together, and judges whether they are arranged in a straight line, whether their sizes are uniform, whether the pitch of their arrangement is roughly constant, or whether character recognition applied to them yields a reasonable confidence. A character portion extraction unit 33 then decides whether to identify the extracted connected regions as a character portion.
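Connected region extraction on a binary image can be sketched with a breadth-first flood fill. The image representation (a list of equal-length strings with `#` for black dots) and the use of 8-connectivity are assumptions for illustration; the specification does not fix either.

```python
from collections import deque


def connected_regions(grid):
    """Return the bounding box (x0, y0, x1, y1) of each 8-connected group of '#' dots."""
    height = len(grid)
    width = len(grid[0]) if height else 0
    seen = set()
    boxes = []
    for y in range(height):
        for x in range(width):
            if grid[y][x] != "#" or (x, y) in seen:
                continue
            # Flood-fill one region, tracking its bounding box.
            queue = deque([(x, y)])
            seen.add((x, y))
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and grid[ny][nx] == "#" and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


grid = [
    "##..#",
    "##..#",
    ".....",
    "#....",
]
print(connected_regions(grid))
```

The bounding boxes give the sizes and positions from which alignment, size uniformity, and pitch could then be judged.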

[0026] The features of the character portions extracted here are identified as follows. First, when a color image has been input, a character color detection unit 35 determines the color of the character portion. When one sheet contains multiple character portions of different colors, the color is detected for each character portion. Information representing the detected character color is stored in the data storage unit 4 together with the stored data of the document. When there are multiple colors, information associating the position of each character portion with its color is stored.

[0027] Character type, one of the features of a character portion, includes handwriting, printed type, seal impressions, dot-matrix printing, font kind, and so on. These types are identified from the size and color of the characters, the arrangement of the character strings, the shape of any frame enclosing a character string, and the like, and the result is stored in the data storage unit 4. For example, noting that a feature that tends to remain in human memory is whether most of the document was handwritten or consisted of printed or printed-out type, a character type determination unit 37 identifies whether the characters are handwritten or printed using at least one of a handwriting dictionary 40 and a type dictionary 41. If character recognition is performed using the type dictionary 41, for instance, many characters yield only low confidence when most of the document is handwritten, whereas reasonable confidence is obtained when it is printed. The document is therefore judged to be printed if the sum of the confidences over all characters is higher than a predetermined threshold, and handwritten if it is lower. This handwritten-or-printed information is stored as character type information in the data storage unit 4 together with the stored data of the document. Handwriting and print can also be distinguished without a dictionary: if the deviation in the vertical and horizontal alignment of the connected regions is very small, the characters are judged to be printed, and if the alignment is irregular, they are judged to be handwritten.
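Both decision rules can be sketched in a few lines. The confidence scale, the threshold, and the use of the population standard deviation of baseline positions as the measure of "deviation" are assumptions; the specification names only a summed confidence compared with a threshold and a small or large alignment deviation.

```python
from statistics import pstdev


def classify_by_confidence(confidences, threshold):
    """Dictionary-based rule: recognition confidences come from the type dictionary."""
    return "printed" if sum(confidences) > threshold else "handwritten"


def classify_by_alignment(baselines, max_deviation=1.0):
    """Dictionary-free rule: baselines are the bottom edges (in dots) of the
    connected regions in one text line; max_deviation is a hypothetical limit."""
    return "printed" if pstdev(baselines) <= max_deviation else "handwritten"


print(classify_by_confidence([0.9, 0.8, 0.95], threshold=2.0))
print(classify_by_confidence([0.3, 0.2, 0.4], threshold=2.0))
print(classify_by_alignment([100, 100, 101, 100]))
print(classify_by_alignment([100, 104, 97, 108]))
```

Because a summed confidence grows with the number of characters, a practical threshold would have to depend on the character count; using the mean confidence instead would be one way to avoid this, though the specification does not say so.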

[0028] When the characters are judged to be handwritten, the following further processing is possible. If this apparatus is used personally, whether a document was written by the user or by someone else becomes important. A writer identification unit 44 therefore uses a dictionary 45 holding the feature patterns of the handwritten characters of the apparatus's owner, and judges from the level of the resulting confidence whether the writer is the owner. If dictionaries 45 are held for multiple people, the name of the document's writer can also be estimated. This writer information is stored in the data storage unit 4 together with the stored data of the document.

[0029] In the case of handwriting, the type of writing instrument can also be identified. A writing instrument identification unit 46 judges whether the characters were written with a pencil, a ballpoint pen, a felt-tip pen, or the like from the way the character strokes fade and from their density (which requires gray image data), from the stroke width detected by a character stroke thickness detection unit 36, and so on. This writing instrument information is stored in the data storage unit 4 together with the stored data of the document. The stroke thickness itself may be stored as the writing instrument information. Alternatively, a separate reflected-light detector may be provided in the image input unit 1, together with means for analyzing reflectance and spectral characteristics, to identify whether the substance fixed on the paper is pencil lead, ballpoint ink, felt-tip ink, copier toner, printer ribbon ink, or the like, thereby identifying the type of writing instrument or distinguishing a copy from an original.

[0030] When the characters are judged to be printed, a font identification unit 47 may further determine the font (Mincho, brush script, Gothic, italic, and so on) and store it in the data storage unit 4. It suffices to extract this feature information for the characters that make up most of the document.

[0031] When handwritten and printed characters are judged to be mixed, it is also effective for a mixture ratio detection unit 48 to calculate the estimated number of handwritten characters, or the ratio of that estimated number to the estimated total number of characters, and store it in the data storage unit 4 as information representing the amount of text handwritten into the document. This information can also be obtained by calculating the ratio of the total area of the regions containing handwritten characters to that of the regions containing printed characters.

[0032] Character kind, another feature of a character portion, includes numerals, alphabetic characters, kana, kanji, and so on. Noting that a feature that tends to remain in human memory is whether the document was written in English or in Japanese, a character kind determination unit 38 identifies whether the document is English or Japanese using at least one of an alphabetic dictionary 42 and a kana/kanji dictionary 43. If character recognition is performed using the alphabetic dictionary 42, for instance, many characters yield only low confidence when the document is Japanese, whereas reasonable confidence is obtained when it is English. The document is therefore judged to be English if the sum of the confidences over all characters is higher than a predetermined threshold, and Japanese if it is lower. This language information is stored in the data storage unit 4 together with the stored data of the document. The same processing can also be applied to numerals, so that information indicating that the document is a list of figures, as in a business form, is added to the language information.

[0033] In addition, a pitch detection unit 39 detects the character pitch and line pitch of the extracted character portions, and based on the result a vertical/horizontal identification unit 49 identifies whether the document is written vertically or horizontally. Four states are possible: (1) the sheet (for example, A4) is placed in portrait orientation and written horizontally; (2) the sheet is placed in landscape orientation and written vertically; (3) the sheet is placed in portrait orientation and written vertically; and (4) the sheet is placed in landscape orientation and written horizontally. If the horizontal pitch is smaller than the vertical pitch, the state is judged to be 1 or 2; in the opposite case it is judged to be 3 or 4. Furthermore, character recognition is performed twice, once assuming that the characters are written so as to be readable in the orientation in which the sheet was placed and once assuming that they are readable when the sheet is turned through a right angle, and the results are compared to distinguish 1 from 2, or 3 from 4. Information indicating which of the four states applies is stored in the data storage unit 4 together with the stored data of the document.
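The two-stage decision among the four states can be sketched as follows. The parameter names are hypothetical, and the mapping of the recognition comparison onto the states (readable as placed gives state 1 or 3, readable only after a right-angle turn gives state 2 or 4) is an inference from the state definitions rather than something the specification spells out.

```python
def orientation_state(horizontal_pitch, vertical_pitch,
                      confidence_as_placed, confidence_rotated):
    """Return which of the four states (1 to 4) the scanned sheet is in.

    Pitches are measured on the image as scanned; the confidences are the
    recognition scores obtained with the image as placed and turned 90 degrees.
    """
    tighter_horizontally = horizontal_pitch < vertical_pitch
    readable_as_placed = confidence_as_placed >= confidence_rotated
    if tighter_horizontally:
        return 1 if readable_as_placed else 2
    return 3 if readable_as_placed else 4


print(orientation_state(10, 18, 0.9, 0.2))  # portrait sheet, horizontal writing
print(orientation_state(18, 10, 0.1, 0.8))  # landscape sheet, horizontal writing
```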

[0034] A character size/density detection unit 50 can also determine the size and density of the characters. In this case too, rather than storing the numerical size and density values unchanged in the data storage unit 4, they may be converted into information such as "small characters, densely packed" or "large characters, sparse" before being stored.
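Converting the measured values into descriptive labels could be as simple as the sketch below. The units (character height in millimetres, characters per square centimetre) and both cut-off values are hypothetical.

```python
def describe_text(char_height_mm, chars_per_cm2, small_below=3.0, dense_above=4.0):
    """Turn numeric character size and density into a label a user might recall."""
    size = "small characters" if char_height_mm < small_below else "large characters"
    density = "densely packed" if chars_per_cm2 > dense_above else "sparse"
    return f"{size}, {density}"


print(describe_text(2.0, 6.0))
print(describe_text(5.0, 1.0))
```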

[0035] The above examples described extracting "paper quality" and "stain" portions from the background portion. Alternatively, when character recognition is applied to the character portions, the normalization and similar preprocessing ordinarily performed to raise the recognition rate can be omitted and recognition performed on the data as is. A portion where the recognition rate is collectively worse than a predetermined threshold can then be identified as a "stain" portion, and the paper can be identified as "poor-quality paper" if the recognition rate is poor over the whole sheet.

[0036] Extraction of the other specific portions (components) is described below with reference to FIG. 4. The ruled-line portion of a table is extracted as follows: the straight line/curve detection unit 61 detects a large number of straight lines, the intersection detection unit 62 detects a large number of intersections at which those lines cross at right angles, and the analysis unit 63 extracts, as the ruled-line portion of the table, an area in which the positions and lengths of the lines are judged to be aligned.

[0037] After that, when the paper type determination unit 71 judges that straight lines arranged at a constant pitch exist over the entire sheet, it determines that the paper is ruled report paper or letter paper. From the arrangement of the lines it can also determine whether the paper has vertical or horizontal ruled lines. Furthermore, by registering the ruled-line arrangement, color, and marks (such as a printed company name) of frequently used paper in the paper type dictionary 72 and matching them against the extracted group of straight lines and the like, the paper type can be identified specifically, for example as "in-house report paper" or "entry form for submission to department A". The information indicating the paper type and the information indicating the position and size of the ruled-line portion of the table are stored in the data storage unit 4 together with the stored data of the document.
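The constant-pitch test can be illustrated with a short sketch. It assumes the detected lines are already available as coordinate lists; the tolerance, the minimum line count, and the function names are hypothetical, and the check that the lines cover the whole sheet is omitted for brevity.

```python
def has_constant_pitch(line_positions, tolerance=0.1, min_lines=5):
    """Return True if the lines are spaced at a roughly constant pitch."""
    positions = sorted(line_positions)
    if len(positions) < min_lines:
        return False
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap <= 0:
        return False
    # Every gap must lie within the relative tolerance of the mean gap.
    return all(abs(gap - mean_gap) <= tolerance * mean_gap for gap in gaps)


def classify_ruled_paper(horizontal_line_ys, vertical_line_xs):
    """Label the sheet as horizontally or vertically ruled paper, or None."""
    if has_constant_pitch(horizontal_line_ys):
        return "horizontally ruled paper"
    if has_constant_pitch(vertical_line_xs):
        return "vertically ruled paper"
    return None
```

Matching against a paper type dictionary, as described above, would be a further step applied to sheets that pass this test.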

[0038] The drawing portion is extracted by extracting an area in which many straight lines and curves are detected, many of their intersections are detected, and which is not regarded as the ruled-line portion of a table.

[0039] The photograph portion can be extracted using image area separation, a known image processing technique. Photograph portions include gravure photograph portions, in which the image density changes smoothly, and halftone photograph portions, characterized by rows of black dots whose size varies with the part of the image. Analyzing the colors of the photograph portion also makes it possible to determine whether it is a color or a monochrome photograph.

[0040] The graph portion is extracted using techniques commonly used in drawing recognition, such as circle extraction, rectangle extraction, and line segment extraction. These extraction processes may be applied only to the drawing portion extracted as described above, so as to determine whether the drawing is a graph or some other kind of drawing. Graph portions include bar graphs, pie charts, and line graphs.

[0041] The components extracted in this way, such as the background, characters, drawings, photographs, and graphs, are stored in the data storage unit 4 together with the stored data of the document, along with their positions and sizes, attribute information such as their types, and derived information. Here, instead of storing the positions and sizes themselves, it is also effective to convert them, via the positional relationship/ratio detection unit 73 for the components, into positional-relationship information such as "a photograph portion is at the upper right and a graph portion is at the lower left", or into ratio information such as "drawings occupy 60% of the whole" or "drawings and characters are present in a ratio of 1:2", and to store the result.
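A minimal sketch of such a conversion is shown below. The `Region` type, the quadrant rule based on the region's center, and the function names are assumptions made for illustration; the description does not say how unit 73 derives its labels.

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str      # e.g. "photograph", "graph", "drawing", "character"
    x: float       # left edge, measured from the left of the page
    y: float       # top edge, measured from the top of the page
    width: float
    height: float


def position_label(region, page_width, page_height):
    """Describe where the center of a region lies on the page."""
    center_x = region.x + region.width / 2
    center_y = region.y + region.height / 2
    vertical = "upper" if center_y < page_height / 2 else "lower"
    horizontal = "left" if center_x < page_width / 2 else "right"
    return f"{vertical} {horizontal}"


def area_share(regions, kind, page_width, page_height):
    """Fraction of the page area occupied by regions of the given kind."""
    area = sum(r.width * r.height for r in regions if r.kind == kind)
    return area / (page_width * page_height)
```

A ratio such as "drawings and characters 1:2" could be obtained by dividing two `area_share` results.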

[0042] As another component, a mark or color present at a predesignated location can also be considered. That is, whether a specific mark or color exists at the designated location is detected, and this information is stored in the data storage unit 4 together with the stored data of the document. For example, if the user makes a habit of putting a check mark in red pen in the upper right corner of documents considered important, it is effective to detect whether red is present in the upper right corner of the input image and to add the result to the stored data as information indicating whether the document is important. Alternatively, without fixing the location, the entire image may be searched to find the specific mark or color.
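The red-check example can be sketched as follows, assuming the image is given as rows of (R, G, B) tuples. The color thresholds, the corner size, and the minimum pixel count are hypothetical values chosen for illustration.

```python
def is_red(pixel):
    """Rough test for a red pixel; the thresholds are assumptions."""
    red, green, blue = pixel
    return red >= 180 and green <= 90 and blue <= 90


def has_red_mark(image, corner_fraction=0.2, min_red_pixels=3):
    """Return True if enough red pixels lie in the upper right corner.

    `image` is a list of rows, each row a list of (R, G, B) tuples.
    """
    if not image or not image[0]:
        return False
    height, width = len(image), len(image[0])
    rows = max(1, int(height * corner_fraction))
    cols = max(1, int(width * corner_fraction))
    count = sum(
        1
        for row in image[:rows]
        for pixel in row[width - cols:]
        if is_red(pixel)
    )
    return count >= min_red_pixels
```

Setting `corner_fraction=1.0` scans the whole image, which corresponds to the variant that does not fix the location.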

[0043] FIG. 5 shows how the information described above is stored. Each component, each item of derived information, and their attribute information are assigned numerical data or a code corresponding to the name. For components, as shown in the left half of the figure, an attribute name is paired with its value (the attribute value), and this pair is called an attribute set. The attribute sets (a plurality) are paired with the component and stored, for example in table form. For derived information, as shown in the right half of the figure, each item of derived information is stored paired with its attribute value. The effect of the present invention is obtained even if only one of these two kinds of information is stored. FIG. 5 is only an example, and it is not necessary to store all of this information.

[0044] The tabular information on the components and the information expressed as pairs of derived information and attribute values may be stored in a location separate from the stored data (the document image or its recognition result), for example in a directory portion, or may be appended to the header portion of the stored data. Storing them separately makes retrieval faster, because a search using this information only needs to search the directory portion and read out the stored data for the matching entries alone.

[0045] The kinds of attribute names included in each component, the kinds of derived information, and the kinds of attribute values that can be defined for each item of derived information are determined in advance. For example, for the drawing portion, the attribute names are the three kinds: color, size, and position. It is also decided in advance which attribute name of which component is assigned to which place in the table (memory address). The attribute values obtained by the specific portion extraction unit 2 and the specific portion feature identification unit 5 are then written at the corresponding attribute names. If extraction or identification fails, or if, for instance, the scanner is not a color scanner and the color cannot be obtained, NULL is written at the attribute name whose value could not be obtained.

[0046] For derived information too, it is decided in advance which item of derived information is assigned to which storage location in memory. The attribute values that can be defined for each item of derived information are also predetermined: for example, three values (large, medium, small) for the margin, and two values (handwritten, printed) for the character type. Storing the attribute values defined for each item in table form can be convenient for the search described later. The attribute value obtained by the derived information extraction unit 6 is chosen from among these predefined values and is written at the corresponding item of derived information. NULL is written for any derived information that could not be obtained.
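The fixed layout with NULL entries might look like the sketch below. The margin and character-type values follow the text; the paper-color values, the use of Python's `None` for NULL, and the names `DERIVED_SCHEMA` and `make_record` are assumptions for illustration.

```python
# Predefined attribute values per item of derived information.
# "paper color" values are hypothetical; the other two follow the text.
DERIVED_SCHEMA = {
    "paper color": ("white", "pink", "red", "black"),
    "margin": ("large", "medium", "small"),
    "character type": ("handwritten", "printed"),
}

# The slot order is fixed in advance, like preassigned storage locations.
ITEM_ORDER = tuple(DERIVED_SCHEMA)


def make_record(extracted):
    """Build a fixed-layout record; None stands for NULL."""
    record = []
    for item in ITEM_ORDER:
        value = extracted.get(item)
        # Values outside the predefined set count as "could not be obtained".
        record.append(value if value in DERIVED_SCHEMA[item] else None)
    return record
```

Because every record has the same slot order, a search can read one slot directly without scanning attribute names.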

[0047] FIG. 6(a) shows an example of operation at search time. Search data is input from the search data input unit 7 in natural language, for example "a document I wrote myself on pink paper, with a coffee stain on it". The search information extraction unit 81 then uses the information in the storage unit 82, which holds a table associating each item of derived information with its predefined attribute values, to extract from the search data the words "pink", "myself", and "stain" that form the basis of the search information, and obtains three items of search information: "paper color: pink", "writer: self", and "stain: present". The search information comparison/collation unit 83 then looks, among the attribute sets stored in the data storage unit 4 in association with each item of stored data (document), for the attribute sets (pairs of derived information and attribute value, as in the right half of FIG. 5) that contain the items of the obtained search information (such as "paper color"), compares their attribute values with those of the search information (such as "pink"), selects the documents for which they match, and outputs them to the document presentation unit 85.
[0048] In judging whether they match, a similarity between attribute values may be defined (for example, an exact match has a similarity of 100%, pink and red 80%, white and black 0%, and a stored attribute value of NULL gives 50%, indicating that no judgment can be made). Then, even if not all of the search information obtained by the search information extraction unit 81 matches exactly, a document may be selected if the sum, over all the search information, of the similarities resulting from the comparison of each item is greater than a predetermined threshold. In addition, some derived information, such as "paper color", allows attribute values to be extracted with a fair degree of certainty, whereas other derived information, such as "writer", often leaves ambiguity in deciding whether the writer is the user or someone else. A weight may therefore be set in advance for each item of derived information, and a weighted sum taken when totaling the similarities, so that the similarity for reliable derived information is emphasized and the similarity for ambiguous derived information serves only as a reference. Instead of being fixed in advance, the weights may be determined by also obtaining a confidence level at the time of extraction by the derived information extraction unit 6.
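The weighted similarity total can be sketched as follows. The similarity figures for an exact match, pink/red, white/black, and NULL follow the text; the default of 0.0 for unlisted pairs, the default weight of 1.0, and the function names are assumptions.

```python
# Pairwise similarities taken from the examples in the text.
PAIR_SIMILARITY = {
    frozenset(("pink", "red")): 0.8,
    frozenset(("white", "black")): 0.0,
}


def similarity(query_value, stored_value):
    """Similarity between a query value and a stored attribute value."""
    if stored_value is None:  # NULL: no judgment can be made
        return 0.5
    if query_value == stored_value:
        return 1.0
    # Assumption: pairs not listed in the table are treated as dissimilar.
    return PAIR_SIMILARITY.get(frozenset((query_value, stored_value)), 0.0)


def weighted_score(query, stored, weights):
    """Sum the per-item similarities, each multiplied by its item weight."""
    return sum(
        weights.get(item, 1.0) * similarity(value, stored.get(item))
        for item, value in query.items()
    )


def is_selected(query, stored, weights, threshold):
    """A document is selected when its score exceeds the threshold."""
    return weighted_score(query, stored, weights) > threshold
```

For example, a query for pink paper written by the user, compared against a document stored as red paper with an unknown writer and weights of 1.0 and 0.5, scores 1.0 × 0.8 + 0.5 × 0.5 = 1.05.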

[0049] It is also effective to combine this with the keyword search of an ordinary filing apparatus. That is, keywords representing the content of a document are added to the document data, automatically or manually, at filing time; at search time, if the user can recall the keyword, the number of candidates is first narrowed down by keyword, and a search using the derived information described above is then performed. This removes the need to take care, when adding a keyword, that the keyword is unique, and so reduces the burden on the user. In addition, information on roughly which month the document was filed may be obtained from a clock function provided in the filing apparatus and added to the document data; searching with this time information in combination with the derived information described above can also improve retrieval accuracy.

[0050] The above described search using derived information; search using components and their attribute information can be performed as follows. In this case, the stored attribute values are closer to raw data than in the case of derived information, so the storage unit 82 stores each component, the attribute names it has, and the information name that each such pair represents. Suppose, for example, that the search data "a document where the paper color was pink, there was a stain, the size of the stain was large, and the position of the stain was around the upper right" is input. The search information items "paper color", "stain size", and "stain position" are extracted by matching against the information names in the storage unit 82, the search data written immediately after each item ("pink", "large", and "upper right") is extracted, and each item and value pair is taken as search information.

[0051] Further, using the information in the storage unit 82, an information name such as "paper color" is converted into the component and attribute name "color of the background portion", after which the search information is compared and collated with the pairs of component and attribute information, as in the left half of FIG. 5, stored in the data storage unit 4 in association with each document. Here, an attribute set containing the search information item "color of the background portion" is first looked for in the data storage unit 4, its attribute value is compared with that of the search information (such as "pink"), and the documents for which they match are selected. The search information uses rough expressions such as "large". However, by storing in advance a correspondence such as, for "stain size", the values 1 to 10 being "small", 11 to 20 being "medium", and 21 to 30 being "large", the expression "large" is converted into the numerical range 21 to 30 and then compared with the attribute value in the data storage unit 4. Since this is a comparison between numerical values, the similarity can be calculated easily.
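The conversion from a rough expression to a numerical range can be sketched as follows. The ranges are those given in the text for "stain size"; the function name and the treatment of a missing (NULL) value as a non-match are assumptions of this sketch.

```python
# Correspondence for "stain size" as given in the text.
SIZE_RANGES = {
    "small": (1, 10),
    "medium": (11, 20),
    "large": (21, 30),
}


def matches_size(label, stored_value):
    """Check whether a stored numerical size falls in the labeled range."""
    if stored_value is None:
        # Assumption: an unknown stored value is treated as not matching.
        return False
    low, high = SIZE_RANGES[label]
    return low <= stored_value <= high
```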

[0052] A search using both derived information and component attribute information is also possible. In particular, suppose the derived information holds rough information such as "stain: present" and the component attribute information holds detailed information such as the color and size of the "stain". From the search data "with a stain", the derived information attached to the document data is examined first and the documents (a plurality) with "stain: present" are selected. The attribute names of the corresponding component "stain", such as "color" and "size", are then presented, and the user enters the continuation of the search data, for example that the "color" is "brown" and the "size" is "large", to narrow down the results by the component attribute information. Some derived information also has a hierarchical structure, as with "character type" and "font" or "writer": if the "character type" is "printed", there may be derived information "font", whereas if it is "handwritten", there may be derived information "writer". The interactive narrowing described above is possible here as well.
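The two-stage narrowing for the stain example can be sketched as below. The document layout (a dictionary with "derived" and "components" entries) and the function names are hypothetical; the interactive prompt is reduced to passing the user's answers as a dictionary.

```python
def first_stage(documents):
    """Keep documents whose derived information says a stain is present."""
    return [d for d in documents if d["derived"].get("stain") == "present"]


def attribute_names(documents, component):
    """Collect the attribute names to present to the user, in first-seen order."""
    names = []
    for document in documents:
        for name in document["components"].get(component, {}):
            if name not in names:
                names.append(name)
    return names


def second_stage(documents, component, answers):
    """Narrow down by the component attribute values the user entered."""
    return [
        d
        for d in documents
        if all(
            d["components"].get(component, {}).get(name) == value
            for name, value in answers.items()
        )
    ]
```

The same pattern would apply to the hierarchical case, with "character type" as the first stage and "font" or "writer" as the follow-up item.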

[0053] FIG. 6(b) shows another example of operation at search time. First, the derived information item display unit 86 displays the derived information that can be extracted by the derived information extraction unit 6, together with the attribute values (a plurality) predefined for each item, as indicated by 100 in the figure. Looking at this display, the user recalls, for instance, that the "paper color" of the desired document was "pink" and that the "margin" was medium, and specifies each item of derived information through the search data input unit 7. Items the user cannot recall need not be entered, since the subsequent comparison is performed with those items excluded. For the search data input in this way, the search information comparison/collation unit 83 selects and presents matching documents in the same way as in FIG. 6(a).

[0054] The above embodiments extract component attribute information and derived information from the document image to be filed and use them for retrieval, but it is also effective to use them for page flipping. That is, in a system that lets the user select the desired document by presenting the stored data (documents) in rapid succession, as if flipping through pages, the component attribute information and derived information stored with each presented document are rendered into the image. For example, image information for a "stain" is superimposed on the image of the document according to the stored color and size of the stain. In particular, when what is stored is the input document image with noise removed, this avoids the situation in which the user cannot tell at a glance whether a document presented during flipping is the desired one because the familiar noise is missing, and thus improves usability.

[0055] The extracted derived information and attribute information may also be displayed together with the original document image or the stored data (for example, as "paper color: pink"), and the apparatus may be configured so that the user can correct, change, or add to them. In this case, if the user, for example, corrects the paper color to white, adds the information "amount of writing: large", and deletes the information "stain: present", the derived information and the like modified in this way are stored in the data storage unit 4 in association with the document.

【００５６】[0056]

[Effects of the Invention] As described in detail above, according to the present invention, a desired document can be retrieved using information representing the external appearance of the document that is automatically extracted from the input image (paper color, paper quality, stains, character color, character type, character kind, writer, writing tool, font, amount of writing in the margins, vertical or horizontal orientation of the document, size and density of characters, kind of paper, positional relationship of drawings and photographs, and so on). Even when the user does not clearly remember the content or keywords of a document, retrieval by recalling peripheral information about the document can be realized.

[Brief description of drawings]

FIG. 1 is a diagram showing the schematic configuration of the apparatus of this embodiment.

FIG. 2 is a diagram showing an example of processing when the background portion is extracted by the apparatus of this embodiment.

FIG. 3 is a diagram showing an example of processing when the character portion is extracted by the apparatus of this embodiment.

FIG. 4 is a diagram showing an example of processing when the ruled-line portion of a table, the drawing portion, the photograph portion, and the graph portion are extracted by the apparatus of this embodiment.

FIG. 5 is a diagram showing an example of the format of the component information and derived information stored in the data storage unit 4.

FIG. 6 is a diagram showing the configuration for retrieval in the apparatus of this embodiment.

[Explanation of symbols]

1 ... image input unit, 2 ... specific portion extraction unit, 3 ... storage information generation unit, 4 ... data storage unit, 5 ... specific portion feature identification unit, 6 ... derived information extraction unit, 7 ... search data input unit, 8 ... search unit, 9 ... output unit, 11 ... color separation unit, 12 ... color image buffer, 13 ... total area calculation unit, 14 ... background color determination unit, 15 ... background portion extraction unit, 16 ... "stain" portion extraction unit, 17 ... noise detection unit, 18 ... size/position detection unit, 19 ... "paper quality" extraction unit, 20 ... "stain" information extraction unit, 21 ... "paper color" extraction unit, 31 ... connected region extraction unit, 32 ... analysis unit, 33 ... character portion extraction unit, 34 ... image buffer, 35 ... character color detection unit, 36 ... character line thickness detection unit, 37 ... character type determination unit, 38 ... character kind determination unit, 39 ... pitch detection unit, 40 ... handwriting dictionary, 41 ... printed type dictionary, 42 ... alphabet dictionary, 43 ... kana/kanji dictionary, 44 ... writer identification unit, 45 ... owner handwriting dictionary, 46 ... writing tool identification unit, 47 ... font identification unit, 48 ... mixture ratio detection unit, 49 ... vertical/horizontal identification unit, 50 ... character size/density detection unit, 61 ... straight line/curve detection unit, 62 ... intersection detection unit, 63, 65 ... analysis unit, 64 ... table ruled-line portion extraction unit, 66 ... drawing portion extraction unit, 67 ... image area separation unit, 68 ... photograph portion extraction unit, 69 ... circle/rectangle/line segment extraction unit, 70 ... graph portion extraction unit, 71 ... paper type determination unit, 72 ... paper type dictionary, 73 ... positional relationship/ratio detection unit for the components, 81 ... search information extraction unit, 82 ... derived information/attribute value correspondence table storage unit, 83 ... search information comparison/collation unit, 84 ... similarity/weight storage unit, 85 ... document presentation unit, 86 ... derived information item display unit

Claims

[Claims]

1. A document information retrieval apparatus comprising storage means for storing, as a document, an input document image or a result of processing the document image, designation means for designating a stored document by a document name or a keyword, and output means for outputting the designated document, the apparatus being characterized by further comprising: means for extracting information representing external characteristics of a document from the input document image; means for storing the information extracted by this means in the storage means in association with the stored document; and means for determining the document to be output by accepting, instead of or in addition to the designation of the document by the designation means, information representing the external characteristics of a desired document as a search key, and collating the input search key with the information stored in association with the documents.

2. A document information retrieval method characterized by comprising the steps of: storing, as a document, an input document image or a result of processing the document image; storing information representing the external characteristics of the document, extracted from the input document image, in association with the document; inputting information representing the external characteristics of a desired document as a search key; collating the input search key with the stored information; and outputting the corresponding document from among the stored documents in accordance with the collation result.

3. A document information retrieval apparatus characterized by comprising: means for inputting a document image fixed on a medium; means for storing, as a document, the input document image or a result of processing the document image; means for extracting, from the input document image, information representing characteristics of the medium and storing this information in association with the document; means for inputting, as a search key, information indicating what kind of medium the desired document image was fixed on; means for collating the input search key with the stored information; and means for outputting the corresponding document from among the stored documents in accordance with the collation result.

4. A document information retrieval apparatus characterized by comprising: means for inputting a document image fixed on a medium; means for storing, as a document, the input document image or a result of processing the document image; means for extracting, from the input document image, information representing characteristics of the substance that fixes the document image on the medium and storing this information in association with the document; means for inputting, as a search key, information indicating what kind of substance fixed the desired document image on the medium; means for collating the input search key with the stored information; and means for outputting the corresponding document from among the stored documents in accordance with the collation result.

5. A document information retrieval apparatus characterized by comprising: means for inputting a document image fixed on a medium; means for storing, as a document, the input document image or a result of processing the document image; means for extracting, from the input document image, information representing the characteristics, as an image, of the information expressed on the medium and storing this information in association with the document; means for inputting, as a search key, information indicating as what kind of image the desired document image was expressed on the medium; means for collating the input search key with the stored information; and means for outputting the corresponding document from among the stored documents in accordance with the collation result.