JPH01106263A

JPH01106263A - Document storage retrieving device

Info

Publication number: JPH01106263A
Application number: JP62264888A
Authority: JP
Inventors: Yoshiharu Abe; 芳春阿部
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1987-10-20
Filing date: 1987-10-20
Publication date: 1989-04-24

Abstract

PURPOSE: To eliminate keyword input work and make retrieval easy by providing the device with a character recognition device, which segments the characters in a document image, recognizes them, and converts them into a character code string, and with a keyword group extraction device. CONSTITUTION: The character recognition device 7 segments character string areas 71-75 from a document image 70 read by an image reader 2, successively recognizes the character images in the areas 71-75 while leaving the ambiguity of the character boundaries unresolved, and outputs the codes of the respective candidate characters. The device 7, which includes a reference character dictionary for kana (Japanese syllabary) and kanji (Chinese characters), recognizes the characters one at a time and passes them to a register 8 together with their candidate characters. A keyword group extraction device 10 checks every partial character code string of length n contained in a character code string 72 output from the register 8 for a match against the terms registered in a word dictionary 9 and extracts the matching strings as keywords.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to an apparatus that converts documents into image data and stores and retrieves them.

[Conventional technology]

As the volume of data and files grows, large numbers of documents are being converted into electronic files and stored as electronic data on high-density storage media such as optical disks.

Conventionally, a device of this type is configured as shown in FIG. 5. When a document is stored, the input document 1 is photoelectrically converted into a document image by the image reader 2, a keyword group entered from the keyboard 5 according to the operator's own classification is attached to it, and it is stored in a database 3 on a storage medium such as an optical disk. When a document is retrieved, the device searches for document images whose attached keyword groups satisfy the search conditions entered from the keyboard 5 (conditions that follow the classification used at storage time) and displays the results on the display device 6.

[Problem that the invention seeks to solve]

In such a device, an appropriate keyword group had to be attached to each document 1 in advance, anticipating the circumstances under which it would later be searched for. Without a keyword group, even a document that has been stored cannot be retrieved, because there is no way to know what was stored where. In addition, each time a document was stored, the operator had to carry out the tedious work of devising and entering a keyword group. For a document management device in personal use it may be enough for the user to remember the keyword groups that were entered, but there is a problem in that another person searching for the document has great difficulty finding those keywords.

The present invention has been made to eliminate these drawbacks. It automates the creation of keyword groups, relieving the operator of that work, and makes retrieval easy even when the person who stored a document and the person searching for it are different. That is, its object is to provide a document storage and retrieval device that allows accurate storage and retrieval even when operators change.

[Means for Solving the Problems]

The present invention is a document storage and retrieval device comprising an image reading device 2 that reads a document 1 and photoelectrically converts it into a document image, a database 3 in which the document image is stored with data of a search keyword group attached, and a search device 4 that searches for and reads out document images from the database 3 when a search keyword group is input. The device is further provided with a character recognition device 7 that cuts out and recognizes the characters in the document image and converts them into a character code string, a word dictionary 9 in which words and phrases for classification items are registered in advance, and a keyword group extraction device 10 that matches the character code string against those words and phrases to extract keyword data for document 1, attaches that data to the document image, and stores it in the database 3.

[Operation]

The operator simply inputs document 1 using the image reading device 2; the keyword group extraction device 10 then automatically attaches a keyword group to the document image, which is stored in the database 3. At retrieval time, even if the searcher is a different person, the desired document can be retrieved simply by entering a group of general keywords for the field to which the document belongs.
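The storage flow described under [Operation] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `store_document`, the `recognize` callable standing in for character recognition device 7, and the in-memory list standing in for database 3 are all hypothetical names, and keyword matching is simplified here to substring containment in a single recognized string.

```python
from typing import Callable, Dict, List, Set


def store_document(
    image: bytes,
    recognize: Callable[[bytes], str],
    dictionary: Set[str],
    database: List[Dict],
) -> Dict:
    """Recognize the text, pick registered terms found in it, and store image plus keywords."""
    text = recognize(image)  # hypothetical stand-in for character recognition device 7
    # Hypothetical stand-in for keyword group extraction device 10:
    # keep every registered term that occurs somewhere in the recognized text.
    keywords = sorted(term for term in dictionary if term in text)
    record = {"image": image, "keywords": keywords}
    database.append(record)  # hypothetical stand-in for database 3
    return record


# Hypothetical usage: the lambda fakes recognition of a cooking-related page.
database: List[Dict] = []
cooking_terms = {"野菜", "煮物", "魚"}  # vegetables, simmered dish, fish
record = store_document(
    b"<scanned page>", lambda _image: "野菜の煮物", cooking_terms, database
)
print(record["keywords"])
```

The operator supplies only the scanned image; no keyword is typed at storage time, which is the point of the arrangement.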

[Example]

The present invention is described below with reference to the drawings. In FIG. 1, 1 is a document to be stored; 2 is an image reader serving as the image reading device, which optically inputs the character images on document 1 and generates a document image as an electronic signal; 3 is a database, such as a magnetic disk or optical disk, that stores electronic signals; 4 is a search device that searches the data stored in the database 3; 5 is a keyboard for entering characters and the like; and 6 is a display device such as a CRT.

Further, 7 is a character recognition device that cuts out each character in the document image read by the image reader 2, recognizes it, and converts it into a character code; 8 is a register that stores the character code string output by the character recognition device 7; 9 is a word dictionary, implemented as a ROM, used for extracting keyword groups; and 10 is a keyword group extraction device that uses the word dictionary 9 to extract a keyword group from the character code string in the register 8.

The word dictionary 9 is a read-only memory organized by field. For the cooking field, for example, classification phrases, characters, and the like that systematically categorize the field's terminology are registered in advance by a cooking expert. The character recognition device 7 has a built-in reference character dictionary and recognizes characters one at a time.

As an example, the operation is described below for the case where a document such as the one shown in FIG. 2 is input to the device.

First, as shown in FIG. 3, the character recognition device 7 cuts out character string regions 71 to 75 from the document image read by the image reader 2, recognizes the character images in these regions 71 to 75 in sequence while leaving the ambiguity of the character boundaries unresolved, and outputs the code of each candidate character. FIG. 4 shows an example of the result obtained in this way for one character string region 72. In that figure, the first row holds the first candidate characters and the second row the second candidate characters. The character recognition device 7 has a built-in dictionary of reference characters such as hiragana and kanji, recognizes the characters one at a time, and passes each to the register 8 together with its candidate characters.
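One way to picture the recognition output held in register 8 is a list with one entry per character position, each entry listing the candidate characters best-first. The sketch below is an assumption made for illustration: the type name `CandidateLine`, the helper `candidate_row`, the filler character, and the sample characters are all hypothetical and are not taken from FIG. 4.

```python
from typing import List, Tuple

# One entry per character position; each entry lists candidates best-first.
CandidateLine = List[Tuple[str, ...]]


def candidate_row(line: CandidateLine, rank: int, filler: str = "・") -> str:
    """Return the row of rank-th candidates (0 = first candidates).

    Positions that have no candidate of that rank are shown as `filler`.
    """
    return "".join(slot[rank] if rank < len(slot) else filler for slot in line)


# Hypothetical recognition result for one character string region.
# The second and fourth positions are uncertain and carry two candidates each.
region_72: CandidateLine = [("野",), ("采", "菜"), ("の",), ("煮", "者"), ("物",)]

first_row = candidate_row(region_72, 0)   # first candidate characters
second_row = candidate_row(region_72, 1)  # second candidate characters
print(first_row)
print(second_row)
```

Keeping the lower-ranked candidates means a later stage can still recover a word whose first candidate was misread.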

Next, for every partial character code string of length n contained in the character code string 72 output from the register 8, the keyword group extraction device 10 checks whether it matches a term or phrase registered in the word dictionary 9 and, if it is registered, extracts it as a keyword. This operation is performed on the character code strings 72 to 75 for every length n from the minimum length to the maximum length, so that the keyword group matching the registered classification terms is extracted from the character code strings 72 to 75. The keyword group extracted in this way is determined by the keyword group extraction device 10 to be the classification items of document 1, attached to the input document image, and stored in the database 3.
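The substring matching described above can be sketched as follows. This is a minimal illustration under stated assumptions: `extract_keywords` and its parameters are hypothetical names, and the way candidate characters are combined (trying every combination of candidates inside each window of length n) is an assumption, since the text does not spell that step out.

```python
from itertools import product
from typing import List, Sequence, Set, Tuple

CandidateLine = Sequence[Tuple[str, ...]]  # candidates per character position


def extract_keywords(
    lines: Sequence[CandidateLine],
    dictionary: Set[str],
    min_len: int,
    max_len: int,
) -> List[str]:
    """Collect every dictionary term that appears as a length-n window, min_len <= n <= max_len."""
    found: List[str] = []
    for line in lines:
        for n in range(min_len, max_len + 1):
            # Slide a window of n character positions along the line.
            for start in range(len(line) - n + 1):
                window = line[start:start + n]
                # Assumption: any mix of first and lower-ranked candidates may form a word.
                for combo in product(*window):
                    word = "".join(combo)
                    if word in dictionary and word not in found:
                        found.append(word)
    return found


# Hypothetical input: one recognized line with two uncertain positions.
lines = [[("野",), ("采", "菜"), ("の",), ("煮", "者"), ("物",)]]
cooking_terms = {"野菜", "煮物", "魚"}  # vegetables, simmered dish, fish
keywords = extract_keywords(lines, cooking_terms, min_len=1, max_len=3)
print(keywords)
```

In this sample, "野菜" is found only through the second candidate of the misread position. The number of combinations grows quickly with n and with the number of candidates per position, so a practical version would likely bound both.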

A stored document 1, in turn, is retrieved by searching the database 3 for document images whose attached keyword groups satisfy the search conditions for the desired field entered from the keyboard 5.

That is, the operator enters general classification terms or phrases related to the cooking field from the keyboard 5 as keywords.

The search device 4 then finds the classification items in the database 3 that match this keyword group, reads out the document data 1 belonging to the matching classification items, and displays it on the display device 6.
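The lookup performed by search device 4 can be sketched as a filter over stored records. The names `StoredDocument` and `search` are hypothetical, and the matching rule used here (a document matches when its keyword group contains every keyword in the query) is an assumption; the text only states that the query keyword group is matched against the stored classification items.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class StoredDocument:
    """A stored document image together with its attached keyword group."""
    image_id: str
    keywords: FrozenSet[str]


def search(database: Iterable[StoredDocument], query: Iterable[str]) -> List[StoredDocument]:
    """Return the documents whose keyword group contains every query keyword."""
    wanted = frozenset(query)
    return [doc for doc in database if wanted <= doc.keywords]


# Hypothetical database contents.
database = [
    StoredDocument("doc-001", frozenset({"野菜", "煮物"})),  # vegetables, simmered dish
    StoredDocument("doc-002", frozenset({"魚"})),            # fish
]

hits = search(database, ["煮物"])
print([doc.image_id for doc in hits])
```

Because the stored keywords come from a shared field dictionary rather than from one operator's personal scheme, a different person can issue the same query and reach the same documents.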

The operator inspects the displayed document. If it is the document wanted, the operator takes it, for example by printing it out; if it is not, the operator enters another keyword group and repeats the search until the desired document is obtained.

[Effect of the invention]

As described above, according to the present invention, a document storage and retrieval device having a search device that searches for and reads out a document image from a database of stored document images when a search keyword group is input is provided with a character recognition device that cuts out and recognizes the characters in the document image and converts them into a character code string, a word dictionary in which words and phrases for classification items are registered in advance, and a keyword group extraction device that matches the character codes against those words and phrases to extract keyword data for the document, attaches that data to the document image, and stores it in the database. The keyword group attached to a document image to be stored is therefore extracted automatically from the document image itself, and the automatically extracted keyword group always covers every word in the character strings of the input document image that can serve as a keyword. As a result, the user does not need to create and enter a keyword group when storing a document, and at retrieval time, whoever performs the search, a document whose keyword group satisfies the specified search conditions is always retrieved on the basis of the field-specific nature of the keyword group attached to it.

[Brief explanation of the drawing]

FIG. 1 is an overall configuration diagram of the document storage and retrieval device of the present invention, FIG. 2 shows an example of an input document, FIG. 3 illustrates how character regions are cut out, FIG. 4 illustrates a character code string, and FIG. 5 is a configuration diagram of a conventional device.

In the figures, 1 is the input document, 2 is the image reader, 3 is the database, 4 is the search device, 5 is the keyboard, 6 is the display device, 7 is the character recognition device, 8 is the register, 9 is the word dictionary, 10 is the keyword group extraction device, 70 is the document image, and 71 to 75 are character regions.

Agent: Masu Oiwa (and two others)

Procedural amendment (voluntary)

1. Indication of the case: Japanese Patent Application No. Sho 62-264888
3. Person making the amendment: representative Moriya Shiki
4. Agent
5. Subject of the amendment: the "Detailed Description of the Invention" section of the specification
6. Contents of the amendment
(1) On page 2, line 6 of the specification, 「電子ファイル化に、」 is amended to 「電子ファイル化し、」 ("are converted into electronic files and").
(2) On page 3, line 11 of the specification, 「間層である。」 is amended to 「問題がある。」 ("there is a problem.").

Claims

[Scope of Claims] A document storage and retrieval device comprising: an image reading device that reads a document and photoelectrically converts it into a document image; a database in which the document image is stored with data of a search keyword group attached; and a retrieval device that searches for and reads out the document image from the database when a search keyword group is input; the device further comprising: a character recognition device that cuts out and recognizes characters in the document image and converts them into a character code string; a word dictionary in which words and phrases for classification items are registered in advance; and a keyword group extraction device that matches the character code string against the words and phrases to extract keyword data relating to the document, attaches the keyword data to the document image, and stores it in the database.