JPH056398A

JPH056398A - Document register and document retrieving device

Info

Publication number: JPH056398A
Application number: JP3158139A
Authority: JP
Inventors: Shiyou Imasato; 詔今郷
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1991-06-28
Filing date: 1991-06-28
Publication date: 1993-01-14

Abstract

PURPOSE: To extract keywords automatically without auxiliary data such as a dictionary, by providing the document registration and retrieval device with a document encoding means, a document registration means, and a document index holding means that stores the correspondence between superimposed codes and documents. CONSTITUTION: The document encoding means 1 converts a document to be registered, or keywords assigned to the document, into a superimposed code. When the document itself is encoded, keywords are assigned automatically and no manual keyword assignment is needed. The document registration means 2 associates the bit string obtained by the means 1 with the document and stores the association in the document index holding means 3, which holds each bit string together with its document. Because keywords are assigned automatically, the labor of registering documents is reduced, and because no large data such as a word dictionary is used, less memory and disk space is needed for operation.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document registration device and a document retrieval device that automatically attach keywords to a document when storing it and retrieve the corresponding document when a keyword is input.

[0002]

[Prior Art] Conventionally, when electronic documents are filed, it is common to register each document with keywords attached and later retrieve the corresponding document by specifying those keywords. This raises two questions: how the keywords are to be "assigned", and how they are to be used to "search".

[0003] Keywords are generally assigned by a human, but some systems assign them automatically from the content of the document. In that method, all nouns are extracted from the document using a word dictionary, and words designated in advance as unsuitable for keywords, called "unnecessary words" (stop words), are removed from them. Keyword search is generally performed using a so-called inverted file, which associates documents with each keyword.

[0004] Another method uses superimposed codes, as disclosed under the title "dictionary lookup device" in Japanese Unexamined Patent Publication No. H2-297193. Each keyword is hashed into a bit string of a fixed length, and the logical OR of the bit strings of all keywords belonging to one document is stored in association with that document. At search time, the search keyword is hashed into a bit string of the same fixed length, this bit string is used as the search key, and the documents retrieved are those for which the logical AND of the document's bit string and the search key equals the search key. Even when several search keywords are specified, the search remains simple: the logical OR of their bit strings is used as the search key.
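The keyword-level superimposed coding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited publication: the 32-bit width, the four bit positions per keyword, the use of SHA-256, and the names `keyword_bits`, `signature`, and `matches` are all hypothetical choices made for the example.

```python
import hashlib

WIDTH = 32        # assumed length of the bit string
BITS_PER_KEY = 4  # assumed number of bit positions drawn per keyword


def keyword_bits(keyword: str) -> int:
    """Hash one keyword to a WIDTH-bit integer (at most BITS_PER_KEY bits set)."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    bits = 0
    for byte in digest[:BITS_PER_KEY]:
        bits |= 1 << (byte % WIDTH)
    return bits


def signature(keywords) -> int:
    """Logical OR of the bit strings of all keywords of one document."""
    sig = 0
    for keyword in keywords:
        sig |= keyword_bits(keyword)
    return sig


def matches(doc_signature: int, search_key: int) -> bool:
    """True when (document AND key) equals the key."""
    return (doc_signature & search_key) == search_key


doc = signature(["filing", "retrieval"])
print(matches(doc, signature(["filing"])))               # single keyword
print(matches(doc, signature(["filing", "retrieval"])))  # several keywords ORed
```

Because different keywords can set the same bits, a test of this kind can report a document that does not actually carry the keyword, so a match is best read as a candidate rather than a confirmed hit.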

[0005]

[Problems to Be Solved by the Invention] The conventional keyword assignment and search described above have the following problems. The first is that automatic keyword assignment requires large-scale dictionary data. In the conventional method, large data sets such as a word dictionary and a grammar dictionary are needed to extract keywords automatically. This increases the amount of memory or disk space required to run the system and slows processing. In addition, creating and maintaining the dictionary data takes enormous effort, which raises the cost of building the system.

[0006] The second problem is that a document cannot be retrieved unless exactly the same keyword as the assigned one is entered. Because the conventional method uses the assigned keyword as the index as it is, a search with even a slightly different keyword fails. For example, a document registered with the keyword "情報装置" ("information device") cannot be retrieved with keywords such as "情報検索" ("information retrieval") or "検索装置" ("retrieval device").

[0007]

[Means for Solving the Problems] The invention of claim 1 provides: document encoding means that divides a document, or keywords assigned to it, at the points where the character type changes and converts every pair of two consecutive characters in the character string of each segment into a superimposed code; document registration means that registers the correspondence between the superimposed code and the document; and document index holding means that holds the correspondence between the superimposed code and the document.

[0008] In the invention of claim 2, the document encoding means of the invention of claim 1 does not encode hiragana segments.

[0009] The invention of claim 3 provides: keyword encoding means that divides a keyword at the points where the character type changes and converts every pair of two consecutive characters in the character string of each segment into a superimposed code; document retrieval means that retrieves the documents corresponding to the superimposed code; and document index holding means that holds the correspondence between the superimposed code and the document.

[0010] In the invention of claim 4, the keyword encoding means of the invention of claim 3 does not encode hiragana segments.

[0011]

[Operation] In the invention of claim 1, keywords are assigned dynamically, which reduces the labor of registering documents, and no large-scale data such as a word dictionary is used, so less memory and disk space is needed for operation.

[0012] In the invention of claim 2, hiragana strings, which make up a large part of a text, are not processed, so processing is even faster. In addition, frequently used characters need not be hashed, which makes the hash function easier to design.

[0013] In the invention of claim 3, a keyword is decomposed before the search, so a document can be retrieved even when the search keyword does not have the same form as the assigned keyword.

[0014] In the invention of claim 4, hiragana strings, which make up a large part of a text, are not processed, so processing is even faster. In addition, frequently used characters need not be hashed, which makes the hash function easier to design.

[0015]

[Embodiment] An embodiment of the present invention is described with reference to the drawings. FIG. 1 shows the overall configuration of the device, which consists of document encoding means 1, document registration means 2, document index holding means 3, keyword encoding means 4, and document retrieval means 5. FIG. 2 shows the flow of processing when a document is registered, and FIG. 3 shows the flow of processing when documents are searched. The components are described in turn below with reference to these flows.

[0016] First, the document encoding means 1. It converts a document to be registered, or keywords assigned to the document, into a superimposed code. When the document itself is the input, keywords are in effect assigned automatically and no human needs to assign them. The operation is the same for either input, so the method is described here for encoding a document. The following steps are carried out in order.

[0017] [step 1] The document is divided at the points where the character type changes, such as a change from hiragana to kanji or from alphabet to katakana. For example, the document "きのう情報検索装置を開発した" ("Yesterday [we] developed an information retrieval device") is divided as "／きのう／情報検索装置／を／開発／した／". The subsequent operations are performed segment by segment. No processing is performed on segments that consist of hiragana.
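Step 1 can be sketched as follows. This is a simplified illustration, assuming that character types can be told apart by Unicode code point ranges; the ranges chosen, the "other" catch-all class, and the names `char_type` and `split_by_char_type` are assumptions of the example and are not specified in the text.

```python
def char_type(ch: str) -> str:
    """Classify one character by Unicode range (a simplifying assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    return "other"


def split_by_char_type(text: str, skip_hiragana: bool = True) -> list[str]:
    """Cut the text wherever the character type changes."""
    segments = []
    current = ""
    for ch in text:
        if current and char_type(ch) != char_type(current[-1]):
            segments.append(current)
            current = ""
        current += ch
    if current:
        segments.append(current)
    if skip_hiragana:
        # Hiragana segments receive no further processing.
        segments = [s for s in segments if char_type(s[0]) != "hiragana"]
    return segments


print(split_by_char_type("きのう情報検索装置を開発した", skip_hiragana=False))
# ['きのう', '情報検索装置', 'を', '開発', 'した']
print(split_by_char_type("きのう情報検索装置を開発した"))
# ['情報検索装置', '開発']
```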

[0018] [step 2] Every pair of two consecutive characters is extracted from the character string of the segment. For example, the segment "情報検索装置" yields five two-character pairs: '情報', '報検', '検索', '索装', and '装置'.
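Extracting the two-character pairs of step 2 amounts to sliding a window of width two over the segment. A minimal sketch, in which the function name `two_char_pairs` is a hypothetical choice for the example:

```python
def two_char_pairs(segment: str) -> list[str]:
    """Return every pair of two consecutive characters in the segment."""
    return [segment[i:i + 2] for i in range(len(segment) - 1)]


print(two_char_pairs("情報検索装置"))
# ['情報', '報検', '検索', '索装', '装置']
```

A segment of a single character yields no pairs under this sketch, so it contributes nothing to the code.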

[0019] [step 3] Using the character codes of each two-character pair as the key, a predetermined hash function converts the pair into a bit string of a predetermined length. The hash function is chosen so that every converted bit string contains the same number of 1 bits. For example, if the hash function is defined to set 4 bits in a bit string of length 32, the result is

[0020]

[Table 1]

[0021] (the bit strings actually obtained depend on the design of the hash function).
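The text leaves the hash function open, requiring only a fixed length and a fixed number of 1 bits. One way to meet that requirement is sketched below. It is an assumption of this example, not the hash function of the patent: it draws bit positions from SHA-256 digests of the pair plus a counter until the required number of distinct positions is set. The name `pair_to_bits` and its parameters are hypothetical.

```python
import hashlib


def pair_to_bits(pair: str, width: int = 32, ones: int = 4) -> int:
    """Map a two-character pair to a width-bit integer with exactly `ones` bits set.

    Requires ones <= width. Positions are derived deterministically from the pair.
    """
    bits = 0
    counter = 0
    while bin(bits).count("1") < ones:
        digest = hashlib.sha256(f"{pair}:{counter}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % width
        bits |= 1 << position  # a repeated position leaves bits unchanged
        counter += 1
    return bits


for pair in ["情報", "報検", "検索", "索装", "装置"]:
    print(pair, format(pair_to_bits(pair), "032b"))
```

The bit strings printed by this sketch will not coincide with those of Table 1, since they depend entirely on the hash function chosen.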

[0022] [step 4] The logical OR of all the bit strings obtained in step 3 is taken. In the example above, ORing the five bit strings gives the following bit string.

[0023]

[Table 2]

[0024] [step 5] The logical OR of the bit strings corresponding to all the segments is taken. The result is the superimposed code corresponding to the document.
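Steps 4 and 5 are the same operation applied at two levels: first over the pairs of one segment, then over the segments of the document. A minimal sketch follows; the 8-bit patterns are made-up values for readability and are not taken from the tables, and the name `superimpose` is hypothetical.

```python
from functools import reduce
from operator import or_


def superimpose(bit_strings) -> int:
    """Logical OR of all given bit strings (0 when there are none)."""
    return reduce(or_, bit_strings, 0)


# Step 4: OR the (made-up) bit strings of the pairs within each segment.
segment_a = superimpose([0b0000_0011, 0b0000_0110, 0b0001_0100])
segment_b = superimpose([0b0100_0001])

# Step 5: OR the segment codes to obtain the code of the whole document.
document_code = superimpose([segment_a, segment_b])
print(format(document_code, "08b"))  # 01010111
```

Since OR is associative and commutative, ORing all pairs of all segments in a single pass would give the same document code.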

[0025] Next, the document registration means 2. It associates the bit string obtained by the document encoding means 1 with the document and stores the pair in the document index holding means 3.

[0026] Next, the document index holding means 3. It stores bit strings in association with documents, for example as follows.

[0027]

[Table 3]
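The text does not say how the index is stored. As one possible reading, the registration means and the index holding means can be sketched as an in-memory list of (bit string, document) pairs. The class name `DocumentIndex`, its methods, and the 8-bit values below are assumptions of this example and do not reproduce Table 3.

```python
class DocumentIndex:
    """In-memory stand-in for the document index holding means."""

    def __init__(self) -> None:
        self._entries: list[tuple[int, str]] = []

    def register(self, code: int, document_id: str) -> None:
        """Role of the document registration means: store one (code, document) pair."""
        self._entries.append((code, document_id))

    def entries(self) -> list[tuple[int, str]]:
        """Return a copy of all stored pairs."""
        return list(self._entries)


index = DocumentIndex()
index.register(0b1110_1011, "document 1")  # made-up code
index.register(0b0110_0011, "document 2")  # made-up code
for code, document_id in index.entries():
    print(format(code, "08b"), document_id)
```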

[0028] Next, the keyword encoding means 4. It converts the keyword specified as the search key into a superimposed code, operating in exactly the same way as the document encoding means 1. If several keywords are specified for an AND search, the logical OR of the superimposed codes of the individual keywords becomes the code for the search key. For example, the search keyword "情報検索" ("information retrieval") is encoded as follows.

[0029]

[Table 4]
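Because keywords and documents are encoded by the same procedure, a single function can serve both. The sketch below combines the steps under stated assumptions: character types are approximated by Unicode ranges, the pair hash is a hypothetical SHA-256-based one with 4 of 32 bits set, and all function names are invented for the example. It relies on the observation that the pairs inside a segment are exactly the adjacent characters of the same type.

```python
import hashlib


def char_type(ch: str) -> str:
    """Classify one character by Unicode range (a simplifying assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def pair_bits(pair: str, width: int = 32, ones: int = 4) -> int:
    """Hypothetical hash: exactly `ones` of `width` bits set per pair."""
    bits, counter = 0, 0
    while bin(bits).count("1") < ones:
        digest = hashlib.sha256(f"{pair}:{counter}".encode("utf-8")).digest()
        bits |= 1 << (int.from_bytes(digest[:4], "big") % width)
        counter += 1
    return bits


def encode(text: str) -> int:
    """Superimposed code of a document or of a single keyword."""
    code = 0
    for a, b in zip(text, text[1:]):
        # Same type means same segment; hiragana segments are not encoded.
        if char_type(a) == char_type(b) != "hiragana":
            code |= pair_bits(a + b)
    return code


def encode_query(keywords) -> int:
    """AND search: OR together the codes of the individual keywords."""
    code = 0
    for keyword in keywords:
        code |= encode(keyword)
    return code


document = encode("きのう情報検索装置を開発した")
key = encode_query(["情報検索"])
print((document & key) == key)  # True: every pair of the keyword occurs in the document
```

Since "情報検索" contributes the pairs '情報', '報検', and '検索', all of which also occur in the document, the key's bits are a subset of the document's bits and the test succeeds whatever hash function is used.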

[0030] Finally, the document retrieval means 5. It searches the document index holding means 3 for documents that match the search key obtained by the keyword encoding means 4, in the following steps.

[0031] [step 1] The logical AND of the search key and the superimposed code of each document is taken.

[0032] [step 2] If the logical AND obtained in step 1 equals the search key, the document is judged to match the search key; otherwise it is judged not to match.
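The two retrieval steps reduce to one bitwise test per index entry. A minimal sketch follows, assuming the index is available as a list of (bit string, document) pairs; the function name `search` and the 8-bit values are made up for the example.

```python
def search(index, search_key: int) -> list[str]:
    """Return the documents whose code ANDed with the key equals the key."""
    hits = []
    for code, document_id in index:
        if (code & search_key) == search_key:  # step 1 (AND) and step 2 (compare)
            hits.append(document_id)
    return hits


index = [
    (0b1110_1011, "document 1"),
    (0b0110_0011, "document 2"),
]
search_key = 0b1010_0010

print(search(index, search_key))  # ['document 1']
```

Here document 1 contains every bit of the key (11101011 AND 10100010 = 10100010), whereas document 2 lacks the highest bit (01100011 AND 10100010 = 00100010) and is rejected.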

[0033] For example, if the search key is

[0034]

[Table 5]

[0035] then logical AND 1, of "情報検索" and document 1, is

[0036]

[Table 6]

[0037] which equals the search key. Document 1 is therefore judged to match the search key.

[0038] Logical AND 2, of "情報検索" and document 2, is

[0039]

[Table 7]

[0040] which differs from the search key. Document 2 is therefore judged not to match the search key.

[0041]

[Effects of the Invention] The invention of claim 1 provides document encoding means that divides a document, or keywords assigned to it, at the points where the character type changes and converts every pair of two consecutive characters in the character string of each segment into a superimposed code; document registration means that registers the correspondence between the superimposed code and the document; and document index holding means that holds the correspondence between the superimposed code and the document. Because keywords are assigned dynamically, the labor of registering documents is reduced; because no large-scale data such as a word dictionary is used, less memory and disk space is needed for operation; and because the operation is simple, processing is fast.

[0042] In the invention of claim 2, the document encoding means of the invention of claim 1 does not encode hiragana segments. Because hiragana strings, which make up a large part of a text, are not processed, processing is even faster, and because frequently used characters need not be hashed, the hash function is easier to design.

[0043] The invention of claim 3 provides keyword encoding means that divides a keyword at the points where the character type changes and converts every pair of two consecutive characters in the character string of each segment into a superimposed code; document retrieval means that retrieves the documents corresponding to the superimposed code; and document index holding means that holds the correspondence between the superimposed code and the document. Because the keyword is decomposed before the search, a document can be retrieved even when the search keyword does not have the same form as the assigned keyword.

[0044] In the invention of claim 4, the keyword encoding means of the invention of claim 3 does not encode hiragana segments. Because hiragana strings, which make up a large part of a text, are not processed, processing is even faster, and because frequently used characters need not be hashed, the hash function is easier to design. Furthermore, because particles and verb endings are written in hiragana, a keyword entered as a phrase, such as "情報を検索する装置" ("a device that retrieves information"), can be searched without any special handling once the hiragana is removed.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is a flowchart showing the flow of processing when a document is registered.

FIG. 3 is a flowchart showing the flow of processing when documents are searched.

[Explanation of symbols]

1 Document encoding means
2 Document registration means
3 Document index holding means
4 Keyword encoding means
5 Document retrieval means

Claims

[Claims]

1. A document registration device comprising: document encoding means that divides a document, or keywords assigned to it, at the points where the character type changes and converts every pair of two consecutive characters in the character string of each segment into a superimposed code; document registration means that registers the correspondence between the superimposed code and the document; and document index holding means that holds the correspondence between the superimposed code and the document.

2. The document registration device according to claim 1, wherein the document encoding means does not encode hiragana segments.

3. A document retrieval device comprising: keyword encoding means that divides a keyword at the points where the character type changes and converts every pair of two consecutive characters in the character string of each segment into a superimposed code; document retrieval means that retrieves the documents corresponding to the superimposed code; and document index holding means that holds the correspondence between the superimposed code and the document.

4. The document retrieval device according to claim 3, wherein the keyword encoding means does not encode hiragana segments.