JPH03172966A

JPH03172966A - Similar document retrieving device

Info

Publication number: JPH03172966A
Application number: JP1310562A
Authority: JP
Inventors: Hiroto Inagaki; 博人稲垣; Sueji Miyahara; 末治宮原; Hidefumi Kano; 加納　英文; Fumihiko Kobashi; 小橋　史彦
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1989-12-01
Filing date: 1989-12-01
Publication date: 1991-07-26
Anticipated expiration: 2013-04-22
Also published as: JP2742115B2

Abstract

PURPOSE: To retrieve similar documents efficiently and with high precision, without setting keywords, logical formulas, or the like, by providing a document input unit, an index extraction unit, etc., and by expanding index terms with a thesaurus. CONSTITUTION: A document input unit 1 converts document information input from outside into code information suited to this device. A dependency analysis unit 2 segments each sentence input from the document input unit 1 into word units and analyzes the dependency relations among the individual words. The similarity of documents is judged from the similarity of their dependency structures, taking into account the importance assigned to each index term by an index extraction unit 3 and the semantic closeness of the synonyms obtained by thesaurus expansion. Similar documents are thus retrieved efficiently and with high precision from a full-text database.

Description

DETAILED DESCRIPTION OF THE INVENTION [Industrial Application Field] The present invention relates to a similar document retrieval device that can retrieve similar documents from a full-text database efficiently and with high precision.

[Prior Art] Various systems have been proposed and implemented with the aim of building information retrieval systems. To retrieve the required information from a database, these systems have used three search methods: entering a document number or document name, searching the keywords assigned to each document, and searching all the characters of each document.

For example, in the patent search system PATOLIS, patent documents can be retrieved by bibliographic items such as the patent number (application number, examined publication number, laid-open publication number, etc.), international classification number, and applicant. In addition, the purpose of the invention, the industrial application field, effects, constitution, etc. are registered as free keywords for each patent, and the desired patent document can be retrieved by entering a combination of keyword strings and logical operators (AND, OR, NOT).

The last method is called full-text search; it outputs the documents that contain words matching the keywords entered by the user. The search results are output either in the order in which they were found or sorted in descending order of the number of keyword matches.
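As a rough illustration of the two output orders just described, the following Python sketch counts keyword matches per document. The function name, the substring test, and the sample data are assumptions made for illustration; the prior-art systems themselves are not specified at this level.

```python
def full_text_search(documents, keywords, sort_by_matches=True):
    """Return (document id, match count) for documents containing any keyword.

    documents maps a document id to its full text. This is a hypothetical
    helper, not the interface of any system named in the text.
    """
    hits = []
    for doc_id, text in documents.items():
        # Count how many of the user's keywords occur in the document text.
        count = sum(1 for keyword in keywords if keyword in text)
        if count > 0:
            hits.append((doc_id, count))
    if sort_by_matches:
        # sorted() is stable, so ties keep the order in which they were found.
        hits = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return hits
```

Passing `sort_by_matches=False` gives the "order found" variant; the default gives the "most matches first" variant.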

[Problems to be Solved by the Invention] With the prior art described above, even an elaborate combination of many keywords and logical formulas does not necessarily yield the desired document. Keyword matching schemes use direct keyword matching, logical formulas (AND, OR, NOT, etc.), and compound conditional expressions (proximity conditions, context conditions), but a search very often retrieves documents that are not needed, so the desired document has to be picked out by hand from a large volume of output.

The present invention was made to solve the above problems, and its object is to provide a similar document retrieval device that can retrieve similar documents from a full-text database efficiently and with high precision.

[Means for Solving the Problems] The similar document retrieval device according to the present invention comprises: a document input unit that allows the text to be searched for to be input directly from outside; a dependency analysis unit that segments the input document into words and analyzes the dependency relations between clauses from morpheme information; an index extraction unit that determines the sentence structure from the dependency analysis result, extracts index terms from that sentence structure, and assigns importance to them; a document storage unit that stores the morpheme information, dependency information, and index information of the input document; a thesaurus expansion unit that retrieves synonyms and near-synonyms of the index terms created by the index extraction unit from a thesaurus dictionary; a similar document search unit that calculates the similarity of sentence structures using the index terms and their synonyms and near-synonyms as keywords; and a similar document output unit that outputs the documents found by the similar document search unit to the screen in descending order of similarity.

[Operation] In the present invention, the document to be searched for is itself entered as the input, so similar documents can be retrieved without setting keywords or logical formulas. Moreover, document similarity is judged from the similarity of dependency structures, taking into account the importance assigned to each index term by the index extraction unit and the semantic closeness of the synonyms and near-synonyms obtained by thesaurus expansion, so similar documents can be retrieved with higher precision than before.

[Embodiment]

An embodiment of the present invention is described below.

First, the overall configuration of the similar document retrieval device of the present invention is described.

FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention. Reference numeral 1 denotes a document input unit, which converts document information input from outside (image information or code information) into code information suited to this device. 2 denotes a dependency analysis unit, which segments each sentence input from the document input unit 1 into word units, attaches part-of-speech, conjugation-form, and morpheme information to the individual words, and performs dependency analysis. 3 denotes an index extraction unit, which extracts index terms and assigns an importance to each extracted term. The importance of an index term is judged from the syntactic information obtained from the dependency information of the dependency analysis unit 2. 4 denotes a document storage unit, which stores morpheme information, dependency information, index information, and so on. 5 denotes a thesaurus expansion unit, which retrieves synonyms and near-synonyms of the index terms extracted by the index extraction unit 3 from a thesaurus dictionary. 6 denotes a similar document search unit, which calculates document similarity using the index terms extracted by the index extraction unit 3, the synonyms and near-synonyms obtained by the thesaurus expansion unit 5, and the dependency relations from the dependency analysis unit 2. 7 denotes a similar document output unit, which outputs the documents found by the similar document search unit 6 to the screen in descending order of similarity.

FIG. 2 shows a system configuration block diagram of the similar document retrieval device of the present invention. A character recognition unit 10 handles the conversion from image information to code information. The input devices for image information are an image reader 11, a hand scanner 12, a FAX 13, and a CD-ROM 14. The devices for entering code information directly are a floppy disk 15 and a magnetic tape 16. The user terminal is also provided with a keyboard 17 so that the user can type documents in directly.

The processing of the dependency analysis unit 2, index extraction unit 3, thesaurus expansion unit 5, and similar document search unit 6 is executed using a CPU 20 and a RAM 21. A DISK 30 stores a document database 100, a thesaurus dictionary 101, and a morphological analysis dictionary 102. The document database, dictionaries, etc. normally reside on the DISK 30, but to speed up the various processes they are transferred from the DISK 30 to the RAM 21 for use, as far as the capacity of the RAM 21 allows. Similar document search results are shown on a display 70. Units 10 to 17 and 70 above correspond to the document input unit 1 of FIG. 1, and documents are input directly from them. The document storage unit 4 of FIG. 1 corresponds to the document database 100.

The details of each process are described below with reference to FIGS. 1 and 2.

The document input unit 1 accepts both image information and code information. Image information is input from an image reading device such as the FAX 13, the CD-ROM 14, or the image reader 11.

Each device driver cuts characters out of the image information acquired from its device on the basis of document structure information and transfers the cut-out character images to the character recognition unit 10. The character recognition unit 10 converts the transferred character images into code information. This processing uses, for example, the invention of Miyahara (Miyahara: Character Reading Method, Japanese Patent Application No. Sho 57-222489). After conversion, the code information is transferred to the RAM 21 for further processing.

When code information is read directly from the floppy disk 15 or the magnetic tape 16, the information read is transferred directly to the RAM 21. Code information entered directly from the keyboard 17 is likewise transferred to the RAM 21.

The dependency analysis unit 2 first performs morphological analysis of the input document. Morphological analysis segments each sentence into words using the morphological analysis dictionary 102 and attaches morpheme information. The morpheme information attached includes the surface form, reading, part of speech, conjugation form, and semantic category number.

Dependency analysis is carried out using this morpheme information. A representative dependency analysis method is the invention of Inagaki et al. (see Inagaki and Kobashi: Dependency Analysis Method, Japanese Unexamined Patent Publication No. Sho 64-17152). FIG. 3 shows an example of an input sentence analyzed with this method.

When the input sentence "カナ文字列及び同音語選択指示信号を入力するための入力手段と、……" ("input means for inputting a kana character string and a homophone selection instruction signal, and ...") is divided into clauses (bunsetsu), the result is "カナ文字列 / 及び / 同音語選択指示信号を / 入力するための / 入力手段と、". Next, the dependency relations among the clauses are determined. In this input sentence, as shown in FIG. 3, "カナ文字列" (kana character string) and "同音語選択指示信号" (homophone selection instruction signal) form a parallel construction and both depend on "入力するための" (for inputting), and the clause "入力するための" depends on "入力手段" (input means). The dependency analysis unit 2 is responsible for determining such dependency relations between clauses.
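The dependency result for this example can be pictured as a list of clauses that each point at their head clause. The following Python sketch is only one possible representation; the `Clause` class, its field names, and the clause numbering are assumptions, not the storage format of the device.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clause:
    number: int            # clause number within the sentence (assumed numbering)
    surface: str           # clause text as segmented
    head: Optional[int]    # number of the clause this one depends on


# The four clauses whose relations the text states. The conjunction
# clause is left out because the text does not say what it depends on.
CLAUSES = [
    Clause(1, "カナ文字列", head=3),
    Clause(2, "同音語選択指示信号を", head=3),
    Clause(3, "入力するための", head=4),
    Clause(4, "入力手段と、", head=None),  # its head lies in the elided part
]


def dependents_of(clauses: List[Clause], head_number: int) -> List[str]:
    """Return the surface forms of the clauses that depend on head_number."""
    return [c.surface for c in clauses if c.head == head_number]
```

Calling `dependents_of(CLAUSES, 3)` returns the two parallel clauses, which is how the parallel construction shows up in this representation.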

The index extraction unit 3 extracts index terms from the words produced by morphological analysis and weights the extracted terms.

There are two methods of index extraction: the stop-word dictionary method and the controlled-vocabulary dictionary method. In the former, words that must not be extracted as index terms are registered in a stop-word (unnecessary-word) dictionary, and words not found in that dictionary are extracted as index terms. In the latter, words that should become index terms are registered in a controlled-vocabulary dictionary, and a word in the document is output as an index term when it matches a word in that dictionary.

The controlled-vocabulary dictionary method is used for documents for which a controlled-vocabulary dictionary is available, and the stop-word dictionary method is used for documents for which none is available. In an example such as that of FIG. 3, no controlled-vocabulary dictionary is available, so the stop-word dictionary method is applied. For documents of this kind, a stop-word dictionary such as that shown in FIG. 4 is prepared, and the noun-equivalent words (including proper nouns, sa-hen verbal nouns, etc.) that remain after the stop words are removed are extracted as index terms.
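The choice between the two extraction methods can be sketched in a few lines of Python. This is a minimal sketch that assumes the noun-equivalent words have already been picked out by morphological analysis; the function name and the sample dictionaries are hypothetical, and the actual contents of FIG. 4 are not reproduced here.

```python
def extract_index_terms(nouns, stop_words=None, controlled_terms=None):
    """Pick index terms from noun-equivalent words.

    If a controlled-vocabulary set is supplied, only words found in it are
    kept; otherwise every word that is not a stop word is kept.
    """
    if controlled_terms is not None:
        # Controlled-vocabulary dictionary method.
        return [word for word in nouns if word in controlled_terms]
    # Stop-word dictionary method.
    stop_words = stop_words or set()
    return [word for word in nouns if word not in stop_words]
```

For instance, with the hypothetical stop-word set `{"means"}`, the nouns `["kana character string", "means", "signal"]` reduce to the first and last entries.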

Index weighting uses sentence structure points calculated from the dependency analysis result. Sentence structure points are determined by the dependencies between clauses and the attributes of those dependency relations. The clause at the end of the sentence is given the reference value of the sentence structure points, and the sentence structure points of each other clause are the sum of the inter-clause link points of the dependencies traversed on the way from the sentence-final clause to that clause.

FIG. 5 shows the correspondence between the function word (particle) attached to a dependent clause and the inter-clause link points. FIG. 6 shows an example. In the input sentence "信号を 入力するための 入力手段。" ("input means for inputting a signal."), the clause "入力手段。" (input means) is given the reference value (0) of the sentence structure points. The clause "入力するための" (for inputting) depends on it through the particle "の", so its inter-clause link points are 2 and its sentence structure points are also 2. Likewise, the clause "信号を" (signal) has 0 inter-clause link points, so it has the same sentence structure points, 2, as "入力するための". The sentence structure points of all clauses are obtained in the same way. The larger the sentence structure points, the lower the importance of the index term.
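The calculation in this example amounts to summing link points along the dependency path from the sentence-final clause. The Python sketch below assumes each clause records its head and the particle that links it to that head; the table holds only the two link-point values quoted in the text, since the rest of FIG. 5 is not available here, and all names are illustrative.

```python
# Inter-clause link points per particle. Only the two values quoted in
# the text are listed; the full FIG. 5 table is not reproduced here.
LINK_POINTS = {"の": 2, "を": 0}


def sentence_structure_points(clauses, base=0):
    """Map clause number -> sentence structure points.

    clauses maps a clause number to (head number, particle); the
    sentence-final clause has head None and receives the base value.
    """
    points = {}

    def resolve(number):
        if number not in points:
            head, particle = clauses[number]
            if head is None:
                points[number] = base
            else:
                # Points of the head plus the link points of this dependency.
                points[number] = resolve(head) + LINK_POINTS[particle]
        return points[number]

    for number in clauses:
        resolve(number)
    return points


EXAMPLE = {
    1: (2, "を"),       # 信号を -> 入力するための
    2: (3, "の"),       # 入力するための -> 入力手段。
    3: (None, None),    # 入力手段。 (sentence-final clause)
}
```

Applied to `EXAMPLE`, the function gives 0 for the final clause and 2 for the other two clauses, matching the values stated for FIG. 6.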

The document storage unit 4 registers document information in the document database 100. The registered information is not only the code information of the document; the morpheme information attached by the dependency analysis unit 2 and the index information attached by the index extraction unit 3 are stored together with it. FIG. 7 shows an example input document, and FIG. 8 shows how the input text is stored. Each row corresponds to one clause, and each clause carries a clause number, clause surface form, clause reading, part of speech, dependency target number, index term, and so on. When a new document is stored, the index information needed by the similar document search unit 6 is updated as well. FIG. 9 shows the index information table. The index information table stores the index terms of the document database and the numbers of the documents to which each index term is assigned. This table is used to obtain the numbers of matching documents quickly.
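An index information table of this kind behaves like what is usually called an inverted index: each index term maps to the set of document numbers that carry it. The following Python sketch shows one way to build, update, and query such a table; the function names and the in-memory dictionary are assumptions, not the layout of FIG. 9.

```python
from collections import defaultdict


def build_index_table(document_terms):
    """Build term -> set of document numbers from {doc number: [terms]}."""
    table = defaultdict(set)
    for doc_no, terms in document_terms.items():
        add_document(table, doc_no, terms)
    return table


def add_document(table, doc_no, terms):
    """Update the table when a new document is stored."""
    for term in terms:
        table[term].add(doc_no)


def matching_documents(table, keywords):
    """Return the numbers of all documents carrying at least one keyword."""
    hits = set()
    for keyword in keywords:
        # .get() avoids creating empty entries for unknown keywords.
        hits |= table.get(keyword, set())
    return hits
```

Looking up a keyword is then a dictionary access rather than a scan of every stored document, which is the speed benefit the text attributes to the table.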

The thesaurus expansion unit 5 expands the index terms extracted by the index extraction unit 3 with synonyms and near-synonyms so that the retrieval device can cope with varied wording. FIG. 10 shows an example of index terms expanded with the thesaurus dictionary 101. The words listed at the left edge are the index terms extracted from the document. The thesaurus dictionary 101 is searched for each index term, and when a headword matches the index term, the synonyms and near-synonyms attached to that headword are extracted.

The thesaurus dictionary 101 classifies synonyms into three classes: synonym 0, synonym 1, and synonym 2. These classes are distinguished by closeness of meaning. A synonym 0 is defined as a word that is semantically equivalent to the headword and can replace it under any conditions. As the number rises through synonym 1 and synonym 2, the meaning moves further from the headword.

For example, for the index term "文書" (document), "ドキュメント" (the loanword for "document") is listed as synonym 0. Replacing "文書" with "ドキュメント" leaves the meaning exactly the same. The near-synonyms include "テキスト" (text), "文章" (writing), and "文" (sentence).
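The expansion step can be sketched as a dictionary lookup that returns each related word together with its class. In the Python sketch below, the entry for "文書" uses only the words given in the text; the dictionary layout, the class labels, and the function name are assumptions made for illustration.

```python
# Hypothetical in-memory form of one thesaurus entry, using the example
# words from the text. Classes with no listed words are simply absent.
THESAURUS = {
    "文書": {
        "synonym0": ["ドキュメント"],
        "near_synonym": ["テキスト", "文章", "文"],
    },
}

# Classes ordered from closest to furthest in meaning.
CLASSES = ("synonym0", "synonym1", "synonym2", "near_synonym")


def expand(index_term, thesaurus):
    """Return [(word, class)] for the index term and its related words."""
    expanded = [(index_term, "index")]
    entry = thesaurus.get(index_term)
    if entry is None:
        # No matching headword: the index term is used as it is.
        return expanded
    for word_class in CLASSES:
        for word in entry.get(word_class, []):
            expanded.append((word, word_class))
    return expanded
```

Keeping the class next to each word lets a later scoring step tell a synonym 0 apart from a near-synonym.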

The thesaurus dictionary 101 also provides an ambiguity determination table, which distinguishes words that are written identically but differ in meaning according to the field of the document. FIG. 11 shows the ambiguity determination table. The table holds the surface form, reading, and field of use of each word, and during thesaurus expansion it is used to output the synonyms and near-synonyms that are semantically most appropriate for the field of the input document. For example, "CD" alone does not reveal whether it means the "cash dispenser" of the banking field or the "compact disc" of the music field; in the device of the present invention, when the field of the document is listed in the ambiguity determination table, the synonyms and near-synonyms corresponding to that field are given priority. If there is no suitable field, all synonyms and near-synonyms are extracted.
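The field-based choice can be illustrated with a small lookup keyed by word and field. The Python sketch below uses the "CD" example from the text; the table layout, the field labels, and the function name are assumptions, and the real table of FIG. 11 also holds readings, which are omitted here.

```python
# Hypothetical ambiguity determination table: word -> field -> expansions.
AMBIGUITY_TABLE = {
    "CD": {
        "banking": ["cash dispenser"],
        "music": ["compact disc"],
    },
}


def resolve_ambiguity(word, field, table):
    """Return the expansions of word that suit the document's field.

    If the field is listed for the word, only that field's expansions are
    returned; otherwise all expansions are returned.
    """
    senses = table.get(word, {})
    if field in senses:
        return list(senses[field])
    # No suitable field: fall back to every listed expansion.
    all_words = []
    for words in senses.values():
        all_words.extend(words)
    return all_words
```

A document from an unlisted field therefore gets both readings of "CD", which matches the fallback described above.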

The similar document search unit 6 searches for documents similar to the input document. Matching between the input document and the document database 100 is carried out in two stages. The first stage is a keyword inclusion rate check, which examines how far the thesaurus-expanded keywords of the input document agree with the keywords in the document database. The matching is performed at high speed between the index information table of FIG. 9 and the thesaurus-expanded words. The keyword inclusion rate is the number of matching keywords divided by the number of keywords assigned to the document concerned (duplicate keywords are counted once).
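The keyword inclusion rate defined above can be written as a short Python function. This sketch assumes that "matching keywords" means the distinct keywords of the stored document that also appear among the thesaurus-expanded keywords of the input document; the text does not spell out the counting rule beyond counting duplicates once, so that reading is an assumption, as is the function name.

```python
def keyword_inclusion_rate(expanded_input_keywords, document_keywords):
    """Share of a stored document's distinct keywords matched by the input.

    expanded_input_keywords: keywords of the input document after
    thesaurus expansion. document_keywords: keywords assigned to one
    stored document (duplicates allowed, counted once).
    """
    distinct = set(document_keywords)
    if not distinct:
        return 0.0
    matched = distinct & set(expanded_input_keywords)
    return len(matched) / len(distinct)
```

A rate close to 1 means nearly all of the stored document's keywords are covered by the input document and its expansions.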

In the second stage of matching, the documents hit by the keyword inclusion rate check are narrowed down by judging the similarity of their sentence structures. Processing every document in which two or more keywords match would take time, so the processing level is set according to the user's needs. The processing level can be specified in two ways: by keyword inclusion rate or by number of documents.

As its narrowing method, the device of the present invention uses processing that focuses on the similarity of binary relations, of which dependency relations are typical. The binary relations extracted from dependency relations include the relation between a <noun (phrase)> and a <verb (phrase)> and the relation between a <noun> and a <noun> within a noun phrase. For a noun phrase, the last noun in the phrase represents the meaning of the phrase; for a verb phrase, all the words that make up the phrase are considered.

From the input sentence of FIG. 7, for example, binary relations between a <noun> and a <verb> are extracted [example pairs missing from this text]. As relations within noun phrases, <noun> / <noun> pairs such as "カナ文字 の 読み" (reading of kana characters), "使用頻度 の 高い" (high usage frequency), and "同ランク の 単語" (words of the same rank) are extracted. When a verb phrase consists of several predicates, the <noun (phrase)> is taken to have a binary relation with each of them. For example, the sentence "漢字カナ混じり文に 変換して 表示し……" ("converts to a kanji-kana mixed sentence and displays it ...") yields the <noun> / <verb> pairs "漢字カナ混じり文 に 変換" (kanji-kana mixed sentence, convert) and "漢字カナ混じり文 表示" (kanji-kana mixed sentence, display).
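The pairing rules just described can be sketched as two small Python helpers. This is only an illustration under stated assumptions: phrases are taken to be pre-segmented lists of words, the noun-noun pairs are formed between adjacent words of a noun phrase, and both function names are hypothetical.

```python
def noun_verb_relations(noun_phrase, verb_phrase):
    """Pair the representative noun with every predicate of the verb phrase.

    The last word of the noun phrase stands for the whole phrase, and a
    verb phrase made of several predicates yields one pair per predicate.
    """
    representative = noun_phrase[-1]
    return [(representative, predicate) for predicate in verb_phrase]


def noun_noun_relations(noun_phrase):
    """Pair adjacent words inside a noun phrase (adjacency is an assumption)."""
    return list(zip(noun_phrase, noun_phrase[1:]))
```

With the example from the text, the noun phrase `["漢字カナ混じり文"]` and the predicates `["変換", "表示"]` give one pair per predicate.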

When these binary relations are present in a document of the document database, matching points are awarded.

The matching points vary with the index importance and the thesaurus points. Each of these evaluation points is explained below.

The index importance is the standard weight assigned by the document storage unit 4 in view of the sentence structure, adjusted according to the outcome of the first-stage matching. The standard weighting based on sentence structure assigned earlier weights the general content of the document, and gives lighter weight to specialized content that goes deep into the document's subject matter. Search requests, however, include both broad requests (for example, "I want to collect patents on kana-kanji conversion devices") and narrow requests (for example, "I want to collect patents on homophone selection in kana-kanji conversion devices"), and both must be satisfied. The device of the present invention therefore varies the index weighting so that it can handle requests ranging from broad to narrow as far as possible.

When the keyword inclusion rate in the first-stage matching is low, the device judges that the current document database does not hold enough data relevant to the input document and gives priority to a broad search; the index importance is therefore left at the standard weighting. When the keyword inclusion rate is high, the device judges that the search goes into specialized content and runs the search with the index importance set so that the deeper content carries the greatest weight.

Thesaurus points express the semantic distance between an index term and the synonyms and near-synonyms attached by the thesaurus expansion unit 5. The closer a word is in meaning to the index term, the higher its thesaurus points. Specifically, the index term itself and synonym 0 receive 3 thesaurus points, synonym 1 and synonym 2 receive 2, and near-synonyms receive 1.

The evaluation value of each index term is the product of its index importance and its thesaurus points. When a dependency relation matches, the matching points are the product of the evaluation values of the two terms involved. The final similarity between documents is the sum of the matching points multiplied by the keyword inclusion rate.

FIG. 12 shows an example of calculating matching points. Suppose that the input sentence is "カナ文字列を漢字カナ混じり文に変換し、" ("converts a kana character string into a kanji-kana mixed sentence, ..."), that the sentence whose similarity is being checked (the matching sentence) is "第１の文字を第２の文字に置換し、" ("replaces a first character with a second character, ..."), and that the dependency relations and index importance of each have been obtained as shown in the figure. Looking at each clause in turn, the first clauses share the word "文字" (character), giving 3 thesaurus points. In the second clause, "文字" is a near-synonym of "漢字" (kanji), giving 1 thesaurus point. In the third clause, "置換" (replace) is a near-synonym of "変換" (convert), again giving 1 thesaurus point. The evaluation value of each index term is then calculated by multiplying together the index importance in the input sentence, the index importance in the matching sentence, and the thesaurus points. When the dependency relations are the same, the product of the index evaluation values of the dependent clause and the head clause is the matching points of that dependency relation. For example, "漢字カナ混じり文に" and "変換し、" are in a dependency relation in the first sentence, and this matches the dependency relation between "第２の文字に" and "置換し" in the second sentence, so the matching points for this dependency are the product of the points of the two clauses, 10 × 10 = 100 points.

In the same way, matching points are obtained for all dependency relations and summed, and the sum is multiplied by the keyword inclusion rate to give the document similarity.
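The scoring steps of this example can be put together in a short Python sketch. The thesaurus point values come from the text; the function names, the class labels, and the importance values used in any call are assumptions, because the importance values of FIG. 12 are not given here.

```python
# Thesaurus points by relation between the input term and the matched term.
THESAURUS_POINTS = {
    "index": 3,         # the same word
    "synonym0": 3,
    "synonym1": 2,
    "synonym2": 2,
    "near_synonym": 1,
}


def evaluation_value(input_importance, matched_importance, relation):
    """Importance in the input sentence x importance in the matching
    sentence x thesaurus points, as in the FIG. 12 walk-through."""
    return input_importance * matched_importance * THESAURUS_POINTS[relation]


def matching_points(dependent_value, head_value):
    """Points for one matching dependency: product of the two clause values."""
    return dependent_value * head_value


def document_similarity(all_matching_points, inclusion_rate):
    """Sum of matching points scaled by the keyword inclusion rate."""
    return sum(all_matching_points) * inclusion_rate
```

With evaluation values of 10 for both clauses, `matching_points(10, 10)` reproduces the 100 points of the example; the remaining dependencies would be scored the same way before the sum is scaled.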

The similar document output unit 7 shows the search results of the similar document search unit 6 on the display 70. The results are sorted in descending order of similarity, and the five most similar documents are shown on the screen in a form that makes the content of each easy to grasp. If all five cannot be shown on the screen at once, the documents with higher similarity are shown first.
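The ordering rule for the output can be sketched as a sort followed by a cut-off. In this Python sketch the function name and the mapping from document number to similarity are assumptions; only the descending order and the limit of five come from the text.

```python
def top_similar(similarities, limit=5):
    """Return the (document number, similarity) pairs to display.

    similarities maps a document number to its similarity; results are
    sorted from most to least similar and cut off at limit entries.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

Because the list is already ordered, showing it from the front also satisfies the rule that more similar documents take priority when the screen cannot hold all five.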

[Effects of the Invention] As described above, the present invention comprises: a document input unit that takes a document directly as input and turns it into code information; a dependency analysis unit that segments the input character string into words, attaches morpheme information, and determines the dependency relations between clauses on the basis of the morpheme information; an index extraction unit that determines the sentence structure from the dependency analysis result of the dependency analysis unit, extracts index terms from that sentence structure, and assigns importance to the index terms; a document storage unit that accumulates input documents, dependency analysis results, and index extraction results; a thesaurus expansion unit that expands the index terms of the index extraction unit with a thesaurus dictionary; a similar document search unit that judges the similarity between the input document and the stored documents from the similarity of index terms and the similarity of dependency relations; and a similar document output unit that outputs the similar documents found. The user can therefore search for similar documents without having to think about keywords, logical formulas, and the like.

In addition, when searching for similar documents, expanding the indexes with the thesaurus makes it possible to search without omissions. The retrieved documents are narrowed down using the indexes of the search targets, so the narrowing is carried out with high precision. Furthermore, the two-stage search method makes it possible to shorten the search time.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention; FIG. 2 is a system configuration block diagram of the similar document retrieval device; FIG. 3 is a diagram explaining an example of dependency analysis; FIG. 4 is a diagram explaining the unnecessary word dictionary; FIG. 5 is a table showing the correspondence between the attached words of dependent clauses and inter-clause link points; FIG. 6 is a diagram explaining an example of sentence structure point calculation; FIG. 7 is a diagram explaining an example of an input document; FIG. 8 is a diagram showing stored document information; FIG. 9 is an index information table; FIG. 10 is a diagram explaining the thesaurus expansion unit; FIG. 11 is an ambiguity determination table; and FIG. 12 is a diagram explaining an example of matching point calculation.

In the figures, 1 is a document input unit, 2 is a dependency analysis unit, 3 is an index extraction unit, 4 is a document storage unit, 5 is a thesaurus expansion unit, 6 is a similar document search unit, 7 is a similar document output unit, 10 is a character recognition unit, 11 is an image reader, 12 is a hand scanner, 13 is a FAX, 14 is a CD-ROM, 15 is a floppy disk, 16 is a magnetic tape, 17 is a keyboard, 20 is a CPU, 100 is a document database, 101 is a thesaurus dictionary, and 102 is a morphological analysis dictionary.

Claims

[Claims]

A similar document retrieval device comprising: a document input unit that directly inputs a document and converts it into code information; a dependency analysis unit that segments the input character string into words, adds morpheme information, and determines the dependency relationships between clauses based on the morpheme information; an index extraction unit that determines a sentence structure from the dependency analysis results of the dependency analysis unit, extracts indexes from this sentence structure, and assigns an importance to each index; a document storage unit that stores input documents, dependency analysis results, and index extraction results; a thesaurus expansion unit that expands the indexes of the index extraction unit using a thesaurus dictionary; a similar document search unit that determines the similarity between the input document and the stored documents from the similarity of the indexes and the similarity of the dependency relationships; and a similar document output unit that outputs the retrieved similar documents.