JP3500698B2 - Keyword extraction device and keyword extraction method

Info

Publication number: JP3500698B2
Application number: JP11160394A
Authority: JP
Inventors: 満美子岡; 忠信宮内; 寿平中垣
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1994-05-25
Filing date: 1994-05-25
Publication date: 2004-02-23
Anticipated expiration: 2019-02-23
Also published as: JPH07319885A

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a keyword extraction device and a keyword extraction method that automatically extract, from text, keywords for searching text information in a document database or the like.

[0002]

[Prior Art] Conventionally, as a way of retrieving desired information from a large amount of information stored in a database or the like, devices have been developed that assign keywords to each item of data in advance and then retrieve and output the information whose keywords match a search key entered by the user.

[0003] In keyword search of text information, it is common for a specialist called an indexer to assign appropriate keywords to the text information in advance. However, because assigning keywords takes an enormous amount of effort, much research has been done on techniques for extracting keywords automatically. For example, the automatic keyword importance evaluation device described in Japanese Unexamined Patent Publication No. H1-112331 extracts nouns from a document as keywords and additionally evaluates their statistical, syntactic, and semantic importance.

[0004] However, because such conventional keyword extraction methods generally extract word by word, the search results inevitably contain many items unrelated to what was actually sought. In other words, precision drops. This is because the concept behind a user's search request does not necessarily coincide with a word-level expression, so a search with word-level keywords also retrieves text in which the word is used in a sense different from the one the user had in mind. The same applies to importance evaluation: even when a word is used in several senses within one document, its importance is evaluated without regard to those senses, so the importance is not necessarily evaluated correctly.

[0005] One possible countermeasure is to extract keywords in units such as compound words, verb phrases, and noun phrases. For example, the keyword extraction device described in Japanese Examined Patent Publication No. S58-33993 proposes a method that uses compound words. This method removes the restriction that concepts must be extracted word by word. Moreover, even when an expression is not a compound word, it can be extracted as a keyword if the text contains an expression with the same meaning as a compound word, such as "絶縁する膜を形成する方法" ("method of forming a film that insulates") for "絶縁膜形成方法" ("insulating film forming method"). Keywords can therefore be extracted regardless of the surface expression.

[0006] However, this method requires the compound words to be extracted to be registered in a keyword table in advance; it does not freely extract groups of mutually related words from the text. In addition, when extracting expressions equivalent to compound words, it treats any words that stand in a dependency relation as able to form a compound word. Consequently, both "文書を検索する" ("search a document") and "文書から検索する" ("search from a document"), for example, are extracted as "文書／検索" ("document/search"). As a result, the search results still often contained inappropriate items. Furthermore, because the relations between words are ignored in this way, they could not be used in evaluating importance.
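The loss of information described in this paragraph can be shown with a minimal sketch. It is not taken from the cited publication: the function names and the (noun, particle, verb stem) layout are assumptions made for illustration, and keeping the particle is only a simple stand-in for retaining the relation, not the relation-symbol scheme of the present invention.

```python
def dependency_only_key(noun: str, particle: str, verb_stem: str) -> str:
    """Hypothetical index key that keeps only the two related words."""
    return f"{noun}/{verb_stem}"


def relation_aware_key(noun: str, particle: str, verb_stem: str) -> str:
    """Hypothetical index key that also keeps the linking particle."""
    return f"{noun}/{particle}/{verb_stem}"


search_a_document = ("文書", "を", "検索")       # 文書を検索する
search_from_document = ("文書", "から", "検索")  # 文書から検索する

# Dropping the particle makes the two expressions indistinguishable.
same = dependency_only_key(*search_a_document) == dependency_only_key(*search_from_document)
# Keeping it preserves the difference between them.
different = relation_aware_key(*search_a_document) != relation_aware_key(*search_from_document)
```

Both `same` and `different` evaluate to `True`, which is the ambiguity the paragraph describes.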

[0007]

[Problem to Be Solved by the Invention] The present invention has been made in view of the circumstances described above, and its object is to provide a keyword extraction device and a keyword extraction method capable of extracting appropriate keywords that enable high-precision search.

[0008]

[Means for Solving the Problem] In the invention of claim 1, a keyword extraction device that extracts keywords from text comprises: expression extraction means for extracting, from the text, expressions each consisting of a plurality of mutually related words or phrases; relation extraction means for estimating the relation between the words or phrases constituting each expression extracted by the expression extraction means, and outputting relation information representing that relation together with a relation expression consisting of the plurality of words or phrases; and relation expression selection means for extracting, from the relation information and relation expressions output by the relation extraction means, relation expressions that are composed of the same words or phrases and whose relation information is mutually consistent, and selecting as a keyword candidate the relation expression whose relation information has the lowest degree of abstraction among them.

[0009] In the invention of claim 2, a keyword extraction device that extracts keywords from text comprises: expression extraction means for extracting, from the text, expressions each consisting of a plurality of mutually related words or phrases; relation extraction means for estimating the relation between the words or phrases constituting each expression extracted by the expression extraction means, and outputting relation information representing that relation together with a relation expression consisting of the plurality of words or phrases; frequency counting means for counting the frequency of occurrence of the relation expressions output by the relation extraction means; relation expression evaluation means for extracting, from the relation information and relation expressions output by the relation extraction means, relation expressions that are composed of the same words or phrases and whose relation information is mutually consistent, selecting as a keyword candidate the relation expression whose relation information has the lowest degree of abstraction among them, and evaluating the importance of the selected relation expression using the frequencies of occurrence, counted by the frequency counting means, of both the relation expressions that were not selected and the relation expression selected as the candidate; and keyword selection means for selecting keywords based on the evaluation result of the relation expression evaluation means.

[0010] Further, in the invention of claim 3, a keyword extraction method for extracting keywords from text comprises: a step in which expression extraction means extracts, from the text, expressions each consisting of a plurality of mutually related words or phrases; a step in which relation extraction means estimates the relation between the words or phrases constituting each expression extracted by the expression extraction means and outputs relation information representing that relation together with a relation expression consisting of the plurality of words or phrases; and a step in which relation expression selection means extracts, from the relation information and relation expressions output by the relation extraction means, relation expressions that are composed of the same words or phrases and whose relation information is mutually consistent, and selects as a keyword candidate the relation expression whose relation information has the lowest degree of abstraction among them.

[0011]

[Operation] According to the inventions of claims 1 and 3, the expression extraction means extracts, from the text, expressions each consisting of a plurality of mutually related words or phrases, and the relation extraction means estimates the relation between the words or phrases constituting each extracted expression and outputs relation information representing that relation together with a relation expression consisting of the plurality of words or phrases. The relation expression selection means extracts, from the relation information and relation expressions output by the relation extraction means, relation expressions that are composed of the same words or phrases and whose relation information is mutually consistent, and extracts as a keyword the relation expression whose relation information has the lowest degree of abstraction among them. In this way, not only the words in the text but also the relations between them can be extracted as keywords, and an appropriate keyword can be extracted from among similar expressions whose relations differ in degree of abstraction.

[0012] In the invention of claim 2, frequency counting means is additionally provided to count the frequency of occurrence of the relation expressions output by the relation extraction means. Relation expressions that are composed of the same words or phrases and whose relation information is mutually consistent are extracted from the relation information and relation expressions output by the relation extraction means, and the one whose relation information has the lowest degree of abstraction is selected as a keyword candidate. The relation expression evaluation means then evaluates the importance of the selected relation expression using the frequencies of occurrence, counted by the frequency counting means, of both the relation expressions that were not selected and the relation expression selected as the candidate. Because the keyword selection means selects keywords based on this evaluation result, the frequencies of the keyword candidates that were not selected are also taken into account, so importance can be evaluated more accurately and keywords can be extracted with high precision.

[0013]

[0014]

[Embodiments] FIG. 1 is a block diagram showing the overall configuration of a first embodiment of the keyword extraction device of the present invention. In the figure, 1 is the keyword extraction device, 2 is a data input unit, 3 is an expression extraction unit, 4 is a relation extraction unit, 5 is a relation expression selection unit, 6 is a storage unit, 11 is a morphological analysis unit, and 12 is a word group extraction unit.

[0015] The data input unit 2 reads the Japanese text from which keywords are to be extracted from a magnetic disk, OCR, MT, or the like. The text may of course also be entered directly using an input device. The expression extraction unit 3 includes the morphological analysis unit 11 and the word group extraction unit 12. The morphological analysis unit 11 performs morphological analysis on the text read by the data input unit 2, and from the result the word group extraction unit 12 extracts, according to expression candidate extraction rules, expressions that can become keyword candidates, together with their parts of speech. The relation extraction unit 4 analyzes the relation between the words or phrases constituting each expression extracted by the expression extraction unit 3, converts it into a relation symbol representing that relation, and generates a relation expression consisting of the plurality of words or phrases and the relation symbol. From the keyword candidates output by the relation extraction unit 4, the relation expression selection unit 5 extracts relation expressions that are composed of the same words or phrases and whose relation symbols are mutually consistent, and selects as a keyword candidate only the relation expression whose relation symbol has the lowest degree of abstraction among them. The keywords extracted on the basis of the selected keyword candidates are recorded in the storage unit 6 together with the input document.

[0016] In addition to the morphological analysis unit 11, the expression extraction unit 3 may be provided with a syntactic analysis unit, a semantic analysis unit, or the like, or with an analysis unit that integrates these, so that deeper analysis is performed and expressions are extracted more accurately based on the result.

[0017] Next, an example of the operation of the first embodiment of the present invention is described. First, the data input unit 2 reads the document data electronically and sends it to the expression extraction unit 3. As described above, the expression extraction unit 3 reads the document data one sentence at a time, that is, up to each full stop, and the morphological analysis unit 11 performs morphological analysis to split the sentence into words. Morphological analysis is widely known as a basic technique of natural language processing, and various known methods can be used, such as the Japanese morphological analysis method described in Japanese Unexamined Patent Publication No. S60-20234.

[0018] Next, the word group extraction unit 12 extracts word groups that can become keyword candidates according to expression extraction rules based on combinations of parts of speech and surface expressions. In this example, word groups are extracted from the morphological analysis result alone, and the unit of extraction is two independent word groups directly linked through adjunct words such as particles. Word groups may of course also be extracted with reference to other information, for example by using semantic analysis results. Three or more independent word groups may also be used as the unit of extraction. In the following description, the pattern of the extracted expressions is "front independent word group + adjunct word group + rear independent word group", and to extract this pattern the front independent word group stream JF, the adjunct word group stream FZ, and the rear independent word group stream JB are used, respectively. The expression extraction rules describe the relations that the front independent word group, the adjunct word group, and the rear independent word group of an expression to be extracted may take.

[0019] In the following description, the front independent word group, adjunct word group, and rear independent word group in the pattern above are assumed to be directly adjacent. They need not be, however; the device can also be configured to extract the nearest word groups that match an expression extraction rule. Patterns other than this one may also be extracted.

[0020] FIG. 2 is a flowchart showing an example of the operation of the expression extraction unit 3. In S22 the expression extraction unit 3 takes one sentence from the document data, and in S23 the morphological analysis unit 11 performs morphological analysis to split the sentence into words. In S24 the front independent word group stream JF, the adjunct word group stream FZ, and the rear independent word group stream JB are cleared, and in S25 an independent word group, here a sequence of sa-hen verb stems, nouns, and adjectival verb stems, is read into the front independent word group stream JF. In the following description and the drawings, "sa-hen verb stem" is sometimes abbreviated to "sa-hen" and "adjectival verb stem" to "adjectival verb". In S26 the adjunct word group that follows the front independent word group is read into the adjunct word group stream FZ.

[0021] In S27 it is determined whether the combination of the front independent word group and the adjunct word group read into streams JF and FZ matches an expression extraction rule. If it does not, the combination is not a pattern to be extracted, so processing returns to S24 and the front independent word group and adjunct word group that were read are discarded. If it does match, in S28 the independent word group that follows the adjunct word group is read into the rear independent word group stream JB. Then, in S29, it is determined whether the contents of streams JF, FZ, and JB together match an expression extraction rule. If they match, in S31 the contents read into streams JF, FZ, and JB are output to the expression output stream EX. If they do not match, S30 determines whether a compound word has been read into the front independent word group stream JF, and if so, S31 outputs the contents of stream JF to the expression output stream EX.

[0022] Of the contents currently held in streams JF, FZ, and JB, the independent word group read into the rear independent word group stream JB can serve as the front independent word group of the word sequence that follows. Therefore, in S32, the independent word group held in stream JB is stored in the front independent word group stream JF, and the adjunct word group stream FZ and the rear independent word group stream JB are cleared.

[0023] In S33 it is determined whether the sentence has been read to its end. If the end has not yet been reached, processing returns to S25 and the pattern extraction described above continues. When the end of the sentence is reached, processing returns to S21, where it is determined whether the end of the document has been reached; if sentences remain, the processing from S22 onward is repeated. Once the document has been processed to its end, the processing of the expression extraction unit 3 ends. At that point, the expressions output to the expression output stream EX are the expressions extracted by the expression extraction unit 3. The extracted expressions are sent to the relation extraction unit 4.

[0024] The operation described above is now explained using a concrete example. FIG. 3 illustrates an example of a morphological analysis result, FIG. 4 illustrates an example of extracted expressions, and FIG. 5 illustrates an example of expression extraction rules. Suppose, for example, that the sentence "我々が実現したシステムは、本手法適用により文書の高速検索を実現する。" ("The system we have realized achieves high-speed document search by applying the present method.") is taken out in S22. In S23 the morphological analysis unit 11 performs morphological analysis and obtains the analysis result shown in FIG. 3.

[0025] Next, the word group extraction unit 12 extracts, from the segmented word sequence, word groups that can become keyword candidates according to the expression extraction rules. FIG. 5 shows examples of the expression extraction rules. Rule 1, for example, indicates that an expression is extracted when the front independent word group consists of a noun, a sa-hen verb stem, or an adjectival verb stem, or a sequence of these, is followed by "に" as the adjunct word group, and is further followed by a sa-hen verb stem as the rear independent word group. The other rules are read in the same way.

[0026] For the example sentence, based on the morphological analysis result shown in FIG. 3, "我々" ("we", pronoun) is first read into the front independent word group stream JF in S25 and "が" (case particle) is read into the adjunct word group stream FZ in S26, and S27 performs matching against the expression extraction rules shown in FIG. 5. Since those rules include no rule that begins with a pronoun, the subsequent processing is skipped, and processing returns to S24, where each stream is cleared.

[0027] Next, "実現" ("realize", sa-hen) is read into the front independent word group stream JF and "した" (terminal/attributive ending of a sa-hen verb, hereinafter sometimes abbreviated to "sa-hen ending") is read into the adjunct word group stream FZ, and matching against the expression extraction rules is performed. This combination matches rule 7 of the expression extraction rules shown in FIG. 5, so the contents of each stream are retained, and in S28 the following independent word "システム" ("system", noun) is read into the rear independent word group stream JB. The unit then attempts to read "は" (adverbial particle), but because it is not an independent word it is not read, and S29 performs matching against the expression extraction rules. Because the result matches rule 7 of FIG. 5, S31 outputs the expression "実現したシステム" ("the system realized") to the expression output stream EX together with its part-of-speech information.

[0028] Next, in S32 the contents of the rear independent word group stream JB are copied to the front independent word group stream JF, and S25 and S26 read the words that follow. That is, "システム" is copied to stream JF and "は" (adverbial particle) is read into the adjunct word group stream FZ. In this case no expression extraction rule matches, so processing returns to S24 and the next words are read. Here, "、" (symbol) and "本" (prefix) are not independent words and are therefore ignored.

[0029] Next, when "手法" ("method", noun) is read into the front independent word group stream JF, the following "適用" ("application", sa-hen) is also an independent word, so it is taken into stream JF as well and the two are treated as a compound word. A sequence is treated as an independent word group in this way when independent words without inflectional endings occur consecutively, that is, when nouns, sa-hen stems, or adjectival verb stems occur in succession; the part of speech of such an independent word group is treated as noun. Next, "により" ("by", case-particle equivalent) is read into the adjunct word group stream FZ. Because one of the expression extraction rules shown in FIG. 5 (rule 5) matches at this point, the following "文書" ("document", noun) is read into the rear independent word group stream JB. When matching against the expression extraction rules is then performed, no rule matches: the front independent word group and the adjunct word group match rule 5 of FIG. 5, but the rear independent word group is neither sa-hen nor an adjectival verb, so rule 5 does not match. Therefore, in S30, "手法適用" ("method application"), which was read into stream JF as a compound word, is output to the expression output stream EX together with its part-of-speech information as an expression that can become a keyword candidate. Then "文書", which was read into stream JB, is copied to stream JF, and extraction continues in the same way.

[0030] FIG. 4 shows the expressions extracted from the example sentence in this way. As described above, when a compound word constitutes a front or rear independent word group, the compound word is treated as a single noun, so the words that make up the compound word are shown joined by '-'. All other words are shown separated by '/'. Each expression is extracted together with its part-of-speech information.
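The flow of S24 to S33 for a single sentence can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the implementation of the embodiment: the part-of-speech tags are invented English labels (FIG. 3 is not reproduced here, so the tags for words the text does not discuss are guesses), only rules 1, 5, and 7 are encoded and only as far as the text describes them (FIG. 5 is not reproduced), rules match the adjunct group by surface form, and words that cannot start an independent word group are simply skipped. All names (`Rule`, `extract_expressions`, and so on) are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)

INDEPENDENT = frozenset({"noun", "sahen", "adjv"})   # words that may form JF / JB
ADJUNCT = frozenset({"particle", "sahen_ending"})    # words that may form FZ


class Rule(NamedTuple):
    front: FrozenSet[str]   # allowed part of speech of the front group
    adjunct: str            # required surface form of the adjunct group
    rear: FrozenSet[str]    # allowed part of speech of the rear group


# Hypothetical encoding of the three rules mentioned in the text.
RULES = [
    Rule(INDEPENDENT, "に", frozenset({"sahen"})),              # rule 1
    Rule(INDEPENDENT, "により", frozenset({"sahen", "adjv"})),   # rule 5 (front side assumed)
    Rule(frozenset({"sahen"}), "した", frozenset({"noun"})),     # rule 7 (inferred from the example)
]


def read_group(tokens: List[Token], i: int, kinds: FrozenSet[str]):
    """Read consecutive tokens whose tag is in `kinds`, starting at index i."""
    group = []
    while i < len(tokens) and tokens[i][1] in kinds:
        group.append(tokens[i])
        i += 1
    return group, i


def group_pos(group: List[Token]) -> str:
    """A compound (two or more words) is treated as a noun."""
    return "noun" if len(group) > 1 else group[0][1]


def join(group: List[Token], sep: str) -> str:
    return sep.join(surface for surface, _ in group)


def extract_expressions(tokens: List[Token], rules=RULES) -> List[str]:
    ex: List[str] = []      # expression output stream EX
    jf: List[Token] = []    # front independent word group stream JF
    i = 0
    while True:
        if not jf:                                              # S24, S25
            while i < len(tokens) and tokens[i][1] not in INDEPENDENT:
                i += 1                                          # skip words that cannot start a group
            jf, i = read_group(tokens, i, INDEPENDENT)
            if not jf:
                return ex                                       # end of sentence
        fz, i = read_group(tokens, i, ADJUNCT)                  # S26
        fz_text = join(fz, "")
        partial = [r for r in rules
                   if group_pos(jf) in r.front and r.adjunct == fz_text]  # S27
        if not partial:
            jf = []                                             # discard and start over
            continue
        jb, i = read_group(tokens, i, INDEPENDENT)              # S28
        if jb and any(group_pos(jb) in r.rear for r in partial):          # S29
            ex.append("/".join([join(jf, "-"), fz_text, join(jb, "-")]))  # S31
        elif len(jf) > 1:                                       # S30: JF holds a compound word
            ex.append(join(jf, "-"))
        jf = jb                                                 # S32


SENTENCE: List[Token] = [
    ("我々", "pronoun"), ("が", "particle"), ("実現", "sahen"), ("した", "sahen_ending"),
    ("システム", "noun"), ("は", "particle"), ("、", "symbol"), ("本", "prefix"),
    ("手法", "noun"), ("適用", "sahen"), ("により", "particle"), ("文書", "noun"),
    ("の", "particle"), ("高速", "noun"), ("検索", "sahen"), ("を", "particle"),
    ("実現", "sahen"), ("する", "sahen_ending"), ("。", "symbol"),
]

print(extract_expressions(SENTENCE))  # ['実現/した/システム', '手法-適用']
```

With only these three rules the sketch yields the two expressions walked through in the text; it will not necessarily reproduce every expression shown in FIG. 4, since the remaining rules of FIG. 5 are not encoded.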

[0031] When the expression extraction unit 3 detects in S33 that expression extraction has finished up to the end of the sentence, it takes out the next sentence in S22 and performs morphological analysis and word group extraction in the same way. When it is detected in S21 that expression extraction has finished up to the end of the document, the relation extraction unit 4 performs the relation extraction operation on the expressions output to the expression output stream EX.

【００３２】図６は、関係抽出部４の関係抽出動作の一
例を示すフローチャートである。関係抽出部４では、表
現抽出部３で抽出された表現を受け取り、関係推定規則
を用いて各表現を構成する語句の間の関係を解析し、関
係を表わすリレーション記号に変換し、複数の語句とリ
レーション記号の組からなるリレーション表現を生成す
る。FIG. 6 is a flowchart showing an example of the relation extraction operation of the relation extraction unit 4. The relation extraction unit 4 receives the expressions extracted by the expression extraction unit 3, analyzes the relations between the words and phrases that make up each expression using relation estimation rules, converts each relation into a relation symbol representing it, and generates a relation expression consisting of a set of words and phrases and a relation symbol.

【００３３】Ｓ４１において、表現出力ストリームＥＸ
から表現の１つを取り出す。Ｓ４２で取り出した表現と
マッチする関係推定規則を検索する。マッチした関係推
定規則に対応するリレーション記号を得て、Ｓ４３で表
現にリレーション記号を付与する。In S41, one expression is taken from the expression output stream EX. In S42, the relation estimation rules are searched for a rule that matches the expression taken out. The relation symbol corresponding to the matched relation estimation rule is obtained, and in S43 it is assigned to the expression.

【００３４】続いて、｛リレーション記号表現１表
現２｝という形式のリレーション表現を生成する。基本
的には、前自立語群が表現１に、後自立語群が表現２に
なるが、語順を入れ替えた言い換え表現ができるような
場合に限って、必要であれば前自立語群と後自立語群を
入れ替える。例えば、「システムの実現」と「実現した
システム」のような場合であり、この場合「実現したシ
ステム」の方の語順を入れ替える。どちらを入れ替える
かについては、一般に、体言−用言の順にすることを基
本とする。Ｓ４４で語順の入れ替えが必要か否かを判定
し、必要な場合には、Ｓ４５で表現１と表現２の項目を
入れ替えたリレーション表現を生成する。入れ替えの必
要がない場合には、Ｓ４６でそのままの順序でリレーシ
ョン表現を生成する。生成したリレーション表現は、Ｓ
４７でリレーション表現群ストリームＲＬに出力する。Next, a relation expression of the form {relation symbol expression 1 expression 2} is generated. Basically, the front independent word group becomes expression 1 and the rear independent word group becomes expression 2. However, only where a paraphrase with the word order swapped is possible, the front and rear independent word groups are swapped if necessary. An example is the pair "realization of the system" (システムの実現) and "realized system" (実現したシステム), where the word order of "realized system" is swapped. As to which one is swapped, the basic order is nominal (taigen) followed by predicate (yogen). S44 determines whether the word order needs to be swapped. If so, S45 generates a relation expression with expression 1 and expression 2 exchanged. If not, S46 generates the relation expression in the original order. The generated relation expression is output to the relation expression group stream RL in S47.

【００３５】Ｓ４８で表現出力ストリームＥＸ内のすべ
ての表現について処理されたか否かを判定し、未処理の
表現が残っている場合には、Ｓ４１へ戻り、繰り返しリ
レーション表現の生成を行なう。In S48, it is determined whether or not all expressions in the expression output stream EX have been processed. If any unprocessed expressions remain, the process returns to S41 to repeatedly generate relation expressions.

【００３６】上述の関係抽出部４の動作の一例を、具体
例をもとに説明する。図７は関係推定規則の一例の説明
図、図８は、リレーション表現の一例の説明図である。
ここでは、具体例として、図４に示した表現が表現抽出
部３で抽出され、表現出力ストリームＥＸに格納されて
関係抽出部４に渡された場合について説明する。An example of the operation of the above-mentioned relation extracting unit 4 will be described based on a concrete example. FIG. 7 is an explanatory diagram of an example of the relation estimation rule, and FIG. 8 is an explanatory diagram of an example of the relation expression.
Here, as a specific example, a case will be described in which the expressions shown in FIG. 4 are extracted by the expression extraction unit 3, stored in the expression output stream EX, and passed to the relationship extraction unit 4.

【００３７】関係推定規則は、図７に示すように、前自
立語群、後自立語群、付属語群の組と、リレーション記
号との対応表として記述されている。例えば、前自立語
群が名詞またはサ変、付属語群が「の」、後自立語群が
サ変の表現は、リレーション記号［ノ］が対応する。他
の組み合わせについても同様である。As shown in FIG. 7, the relation estimation rules are written as a table that maps each combination of front independent word group, rear independent word group, and adjunct word group to a relation symbol. For example, an expression whose front independent word group is a noun or sa-hen noun, whose adjunct word group is "no" (の), and whose rear independent word group is a sa-hen noun corresponds to the relation symbol [no] ([ノ]). The same applies to the other combinations.

【００３８】まず、Ｓ４１で表現出力ストリームＥＸか
ら表現を１つ取り出す。表現「実現したシステム」が取
り出されたものとする。関係抽出部４は、「実現」と
「システム」の間の関係を推定する。Ｓ４２で図７に示
した関係推定規則を探索する。この例では、前自立語群
である「実現」はサ変、後自立語群である「システム」
は名詞、付属語群は「した」であるから、図７に示した
関係推定規則のうち、最下欄に示す関係推定規則がマッ
チする。対応するリレーション記号［スル］を取り出
し、表現に対してこのリレーションを付与する。First, in S41, one expression is taken from the expression output stream EX. Suppose the expression "realized system" (実現したシステム) is taken out. The relation extraction unit 4 estimates the relation between "realization" (実現) and "system" (システム). In S42, the relation estimation rules shown in FIG. 7 are searched. In this example, the front independent word group "realization" is a sa-hen noun, the rear independent word group "system" is a noun, and the adjunct word group is "shita" (した), so the rule in the bottom row of FIG. 7 matches. The corresponding relation symbol [suru] ([スル]) is obtained and assigned to the expression.

【００３９】次に、Ｓ４４で入れ替えが必要か否かを判
定する。この例では、用言−体言の順に単語が並んでお
り、語順を入れ替えた表現も可能であるので、入れ替え
が必要であると判断する。このときの判定は、例えば、
入れ替えが行なえる特殊な場合を表わす語順入れ替え規
則とのマッチングを行ない、該当する規則があった場合
には、語順の入れ替えを行なうように構成することがで
きる。上述の「実現したシステム」の場合、Ｓ４５で語
順が入れ替えられ、｛［スル］システム実現｝とい
う形式のリレーション表現が生成される。生成されたリ
レーション表現は、順次、Ｓ４７でリレーション表現群
ストリームＲＬに出力される。Next, S44 determines whether swapping is necessary. In this example the words are in predicate-nominal order and an expression with the word order swapped is also possible, so swapping is judged necessary. This determination can be made, for example, by matching against word order swapping rules that describe the special cases in which swapping is allowed, and swapping the word order when an applicable rule is found. For "realized system", the word order is swapped in S45 and a relation expression of the form {[suru] system realization} is generated. The generated relation expressions are output one after another to the relation expression group stream RL in S47.
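
The flow of S41 to S47 can be sketched as follows. This is a minimal illustration, not the patent's implementation: only the two rules quoted in the text are encoded (FIG. 7 is not reproduced here, so the exact shape of its rows is an assumption), and `needs_swap` is a hypothetical stand-in for the word order swapping rules, whose contents the text does not give.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Expression:
    front: str      # front independent word group, e.g. "実現"
    front_pos: str  # "noun" or "sahen" (hypothetical tag names)
    adjunct: str    # adjunct word group, e.g. "した"
    rear: str       # rear independent word group
    rear_pos: str

# (allowed front POS, adjunct, allowed rear POS, relation symbol).
# Only the two rules mentioned in the text; the real table has more rows.
RULES = [
    ({"noun", "sahen"}, "の", {"sahen"}, "ノ"),
    ({"sahen"}, "した", {"noun"}, "スル"),
]

def estimate_relation(e):
    """S42: return the symbol of the first matching rule, or None."""
    for front_ok, adjunct, rear_ok, symbol in RULES:
        if e.front_pos in front_ok and e.adjunct == adjunct and e.rear_pos in rear_ok:
            return symbol
    return None

def needs_swap(e):
    """S44 (assumed rule): swap when the order is predicate then nominal."""
    return e.front_pos == "sahen" and e.rear_pos == "noun"

def to_relation_expression(e):
    """S43 to S46: build (symbol, expression 1, expression 2)."""
    symbol = estimate_relation(e)
    if symbol is None:
        return None
    if needs_swap(e):
        return (symbol, e.rear, e.front)
    return (symbol, e.front, e.rear)
```

With these assumptions, "実現したシステム" and "システムの実現" both end up with "システム" as expression 1 and "実現" as expression 2, differing only in the relation symbol.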

【００４０】なお、自立語のみの複合語の場合には、
［直結］というリレーション記号が付与される。ここ
で、２語からなる複合語の場合は２つの語がそれぞれ表
現１，表現２となる。３語以上の複合語の場合には、表
現１，表現２は特定しない。例えば、「高速検索機能」
という複合語の場合、｛［直結］高速−検索−機能｝
というリレーション表現を生成する。このとき、高速と
検索−機能、あるいは高速−検索と機能といった分け方
はしない。When a compound word consists only of independent words, the relation symbol [direct connection] ([直結]) is assigned. For a compound of two words, the two words become expression 1 and expression 2, respectively. For a compound of three or more words, expression 1 and expression 2 are not specified. For example, for the compound "high-speed search function" (高速検索機能), the relation expression {[direct connection] high-speed-search-function} is generated. It is not divided into "high-speed" and "search-function", or into "high-speed-search" and "function".
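
As a small sketch of the compound word case, assuming relation expressions are represented as plain tuples (a representation chosen here for illustration, not stated in the patent):

```python
def compound_relation(words):
    """Build a [直結] (direct connection) relation expression for a compound.

    Two words become expression 1 and expression 2. Three or more words
    are kept as one unit joined by '-', with no internal split.
    """
    if len(words) < 2:
        raise ValueError("a compound word needs at least two words")
    if len(words) == 2:
        return ("直結", words[0], words[1])
    return ("直結", "-".join(words))
```

The three-word case deliberately returns a single joined element, mirroring the statement that expression 1 and expression 2 are not specified for such compounds.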

【００４１】このようにして、最初の表現「実現したシ
ステム」から｛［スル］システム実現｝というリレー
ション表現が生成された。以下、「手法適用」、「文書
の高速検索機能」、「高速検索機能を実現」について
も、同様の処理により、図８に示すようなリレーション
表現が生成される。関係抽出部４において、表現抽出部
３で抽出された全ての表現がリレーション表現に変換さ
れると、リレーション表現群ストリームＲＬに出力され
たリレーション表現群は、リレーション表現選択部５に
渡される。In this way, the relation expression {[suru] system realization} has been generated from the first expression, "realized system". Relation expressions like those shown in FIG. 8 are then generated by the same processing for "method application" (手法適用), "high-speed search function of documents" (文書の高速検索機能), and "realize a high-speed search function" (高速検索機能を実現). Once the relation extraction unit 4 has converted all the expressions extracted by the expression extraction unit 3 into relation expressions, the relation expressions output to the relation expression group stream RL are passed to the relation expression selection unit 5.

【００４２】図９は、リレーション表現選択部５の動作
を示すフローチャートである。リレーション表現選択部
５は、Ｓ５２でリレーション表現群ストリームＲＬから
リレーション表現を１つ取り出し、リレーション表現ス
トリームＲＥにコピーする。続いて、Ｓ５３でリレーシ
ョン表現ストリームＲＥと同じ語から構成される全ての
リレーション表現をリレーション表現群ストリームＲＬ
中から探し、ストリームＳＩにコピーする。このとき、
リレーション表現ストリームＲＥ自身もストリームＳＩ
にコピーする。FIG. 9 is a flowchart showing the operation of the relation expression selection unit 5. In S52, the relation expression selection unit 5 takes one relation expression from the relation expression group stream RL and copies it to the relation expression stream RE. Then, in S53, it searches the relation expression group stream RL for all relation expressions composed of the same words as the one in the relation expression stream RE and copies them to the stream SI. The relation expression in RE itself is also copied to the stream SI.

【００４３】次に、ストリームＳＩ中のリレーション表
現のリレーション記号を比較する。リレーション記号に
は、あらかじめ抽象度およびそのリレーション記号と矛
盾しないリレーション記号の情報が与えられている。こ
の情報をもとに、Ｓ５４でストリームＳＩの中からリレ
ーション表現ストリームＲＥと矛盾するリレーション表
現を削除する。また、Ｓ５５で矛盾しないリレーション
表現について、重複して選択処理が行なわれないよう
に、ストリームＳＩ内のリレーション表現をリレーショ
ン表現群ストリームＲＬから削除する。そして、Ｓ５６
において、ストリームＳＩ内の各リレーション記号の有
する抽象度が最も低いものを選択し、結果出力ストリー
ムＳＯに出力する。Next, the relation symbols of the relation expressions in the stream SI are compared. Each relation symbol is given, in advance, a degree of abstraction and information on which relation symbols do not conflict with it. Based on this information, S54 deletes from the stream SI the relation expressions that conflict with the one in the relation expression stream RE. In S55, so that selection is not repeated for the non-conflicting relation expressions, the relation expressions in the stream SI are deleted from the relation expression group stream RL. Then, in S56, the relation expression in the stream SI whose relation symbol has the lowest degree of abstraction is selected and output to the result output stream SO.

【００４４】Ｓ５１へ戻り、リレーション表現群ストリ
ームＲＬ内にリレーション表現が存在しなくなるまで、
上述の処理を繰り返し行なう。これにより、結果出力ス
トリームＳＯには、類似の関係を有する表現が排除され
たキーワード候補が収容されることになる。結果出力ス
トリームＳＯに書き込まれたキーワードは、記憶部６に
入力文書とともに登録される。The process returns to S51, and the above processing is repeated until no relation expressions remain in the relation expression group stream RL. As a result, the result output stream SO holds keyword candidates from which expressions with similar relations have been eliminated. The keywords written to the result output stream SO are registered in the storage unit 6 together with the input document.

【００４５】具体例をもとに、上述のリレーション表現
選択部５の動作の一例を説明する。図１０は、同じ語か
ら構成されるリレーション表現の一例の説明図、図１１
は、リレーション記号が有する情報の一例の説明図であ
る。An example of the operation of the relation expression selection unit 5 is described using a concrete example. FIG. 10 is an explanatory diagram of an example of relation expressions composed of the same words, and FIG. 11 is an explanatory diagram of an example of the information held by relation symbols.

【００４６】Ｓ５２でリレーション表現群ストリームＲ
Ｌから、例えば、｛［スル］システム実現｝という
表現が取り出されたとすると、この表現がリレーション
表現ストリームＲＥにコピーされる。そして、リレーシ
ョン表現ストリームＲＥの内容、および、この表現と同
じ語、すなわち、「システム」と「実現」を有するリレ
ーション表現が、リレーション表現群ストリームＲＬか
らストリームＳＩにコピーされる。ストリームＳＩにコ
ピーされたリレーション表現を図１０に示している。図
１０に示すように、同じ語を有するリレーション表現で
あっても、リレーション記号の違うものが存在する。例
えば、「実現したシステム」、「システムを実現」、
「システムの実現」、「システム実現」などの表記が存
在する。これらは、関係抽出部４において、違うリレー
ション記号を付与して区別している。Suppose that in S52 the expression {[suru] system realization}, for example, is taken from the relation expression group stream RL. It is copied to the relation expression stream RE. Then the content of RE, together with the relation expressions that have the same words as this expression, namely "system" and "realization", is copied from the relation expression group stream RL to the stream SI. FIG. 10 shows the relation expressions copied to the stream SI. As FIG. 10 shows, relation expressions with the same words can have different relation symbols. For example, the surface forms "realized system" (実現したシステム), "realize the system" (システムを実現), "realization of the system" (システムの実現), and "system realization" (システム実現) all occur. The relation extraction unit 4 distinguishes these by assigning different relation symbols.

【００４７】次に、ＳＩ中のリレーション表現のリレー
ション記号を比較する。上述のように、リレーション記
号には、あらかじめ抽象度およびそのリレーション記号
と矛盾しないリレーション記号の情報が与えられてい
る。この例を図１１に示している。図１１に示したよう
に、抽象度は、例えば、１から７までの数字で表わして
おり、数字が大きいほど抽象度が大きいことを表してい
る。図１１では、例えば、リレーション記号［ヲ］の抽
象度は１であり、リレーション記号［ノ］の抽象度は３
である。これは、例えば、「システムの実現」という表
現は、「システムを実現」という意味を表している場合
もあるが、必ずしもそうであるとは言えないということ
を意味している。このように、表現によって確かさが異
なることを表したものが抽象度である。抽象度は、上述
したリレーション記号の種類に依存し、リレーション記
号の種類が変われば、抽象度の与え方も変わる。Next, the relation symbols of the relation expressions in SI are compared. As described above, each relation symbol is given in advance a degree of abstraction and information on the relation symbols that do not conflict with it. FIG. 11 shows an example. As shown in FIG. 11, the degree of abstraction is expressed, for example, as a number from 1 to 7, with a larger number indicating a higher degree of abstraction. In FIG. 11, for example, the relation symbol [wo] ([ヲ]) has a degree of abstraction of 1 and the relation symbol [no] ([ノ]) has a degree of abstraction of 3. This reflects the fact that, for example, the expression "realization of the system" (システムの実現) may mean "realize the system" (システムを実現), but does not necessarily do so. The degree of abstraction thus expresses how the certainty of the relation differs from expression to expression. It depends on the kinds of relation symbols described above. If the set of relation symbols changes, the way degrees of abstraction are assigned changes as well.

【００４８】このような情報を用い、まず、リレーショ
ン記号を比較して、矛盾するリレーション記号を持つキ
ーワード候補をＳＩから削除する。一般に、抽象度が同
じリレーション記号は互いに矛盾し、抽象度が違うリレ
ーション記号の中には矛盾するものとしないものがあ
る。図１１では、自分自身より抽象度が高いリレーショ
ン記号の中で矛盾しないものが、矛盾しないリレーショ
ン記号の情報として与えられている。すなわち、リレー
ション記号［ノ］は、それより抽象度が高いリレーショ
ン記号［名詞接続］、［スル］、［直結］の中で、矛盾
しないリレーション記号［スル］、［直結］が与えられ
ている。Using this information, the relation symbols are first compared and keyword candidates with conflicting relation symbols are deleted from SI. In general, relation symbols with the same degree of abstraction conflict with one another, while relation symbols with different degrees of abstraction may or may not conflict. In FIG. 11, the non-conflicting relation symbols listed for each symbol are those with a higher degree of abstraction than the symbol itself that do not conflict with it. For example, of the relation symbols [noun connection], [suru], and [direct connection], which have a higher degree of abstraction than [no], the non-conflicting symbols [suru] and [direct connection] are listed for [no].

【００４９】この矛盾するあるいは矛盾しないとは、例
えば、「システムを実現」は「システムの実現」と言い
換えることができるが、「展示会に出展」は「展示会の
出展」と言い換えることはできない。したがって、リレ
ーション記号［ヲ」と［ノ］は矛盾しないが、リレーシ
ョン記号［ニ］と［ノ］は矛盾する。Whether two symbols conflict can be illustrated as follows. "Realize the system" (システムを実現) can be rephrased as "realization of the system" (システムの実現), but "exhibit at an exhibition" (展示会に出展) cannot be rephrased as "exhibit of an exhibition" (展示会の出展). Therefore the relation symbols [wo] and [no] do not conflict, whereas the relation symbols [ni] and [no] do.

【００５０】リレーション記号の比較は、リレーション
表現ストリームＲＥおよびストリームＳＩ中のリレーシ
ョン記号の中で、最も抽象度の低いものを選び、その他
のリレーション記号が、選んだリレーション記号の持
つ、矛盾しないリレーション記号の情報に含まれれば、
矛盾しないと判断する。図１０に示した例では、最も抽
象度の低いリレーション記号［ヲ］を選択し、これと矛
盾しないリレーション記号［ノ］、［スル］、［直結］
と、他のリレーション表現のリレーション記号を比較す
る。矛盾するリレーション記号を持つリレーション表現
が見つかった場合には、抽象度の低いものを優先するな
どのあらかじめ決められた規則にしたがって矛盾する候
補を除去する。図１０に示す例では、全ての候補は矛盾
しないので、除去動作は行なわれない。このようにして
矛盾するリレーション表現の削除されたストリームＳＩ
中のリレーション表現は、それぞれが類似した意味関係
を有している。そのため、これらのリレーション表現の
中から１つをキーワード候補として抽出すればよい。To compare relation symbols, the symbol with the lowest degree of abstraction among those in the relation expression stream RE and the stream SI is chosen. Another relation symbol is judged not to conflict if it appears in the non-conflicting relation symbol information of the chosen symbol. In the example of FIG. 10, the relation symbol [wo], which has the lowest degree of abstraction, is chosen, and its non-conflicting relation symbols [no], [suru], and [direct connection] are compared with the relation symbols of the other relation expressions. If a relation expression with a conflicting relation symbol is found, the conflicting candidate is removed according to a predetermined rule, such as giving priority to the lower degree of abstraction. In the example of FIG. 10 none of the candidates conflict, so nothing is removed. The relation expressions left in the stream SI after conflicting ones have been deleted all have similar semantic relations, so it suffices to extract one of them as a keyword candidate.

【００５１】ストリームＳＩ中に残されたリレーション
表現は、選択処理によってキーワード候補が抽出される
ので、これらのリレーション表現から重複してキーワー
ドを抽出しないように、ストリームＳＩ中のリレーショ
ン表現をリレーション表現群ストリームＲＬから消去す
る。A keyword candidate is extracted from the relation expressions remaining in the stream SI by the selection process. To avoid extracting keywords from these relation expressions more than once, the relation expressions in the stream SI are erased from the relation expression group stream RL.

【００５２】次に、類似したリレーション表現の中から
キーワード候補を抽出する。ストリームＳＩ中のリレー
ション表現のリレーション記号を比較し、最も抽象度が
低いものを選択し、そのリレーション表現をキーワード
として結果出力ストリームＳＯに書き込む。抽象度が低
いリレーション記号を選択するのは、抽象度が低い方
が、単語間の関係が確かであり、キーワードとして有効
に機能するためである。ここでは、リレーション記号
［ヲ］が選択され、リレーション表現｛［ヲ］システ
ム実現｝がキーワードとして結果出力ストリームＳＯ
に書き込まれる。Next, a keyword candidate is extracted from the similar relation expressions. The relation symbols of the relation expressions in the stream SI are compared, the one with the lowest degree of abstraction is selected, and that relation expression is written to the result output stream SO as a keyword. A relation symbol with a low degree of abstraction is selected because the lower the degree of abstraction, the more certain the relation between the words and the more effectively the expression works as a keyword. Here the relation symbol [wo] is selected, and the relation expression {[wo] system realization} is written to the result output stream SO as a keyword.
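
The selection loop of S51 to S56 can be sketched as below. Several things here are assumptions made for illustration. Only the abstraction degrees of [ヲ] (1) and [ノ] (3) and the non-conflicting lists for those two symbols come from the text. The degrees given to [名詞接続], [スル], and [直結] are placeholders that merely respect "higher than [ノ]". Relation expressions are modelled as tuples, and the conflict check follows the "lowest abstraction as the base" reading rather than reproducing FIG. 9 step by step.

```python
# Degrees for ヲ and ノ are from the text; the other three are placeholders.
ABSTRACTION = {"ヲ": 1, "ノ": 3, "名詞接続": 4, "スル": 5, "直結": 6}

# Higher-abstraction symbols that do NOT conflict with the key symbol.
COMPATIBLE = {
    "ヲ": {"ノ", "スル", "直結"},
    "ノ": {"スル", "直結"},
}

def select_keywords(relation_exprs):
    """Return one relation expression per group of mutually consistent ones."""
    rl = list(relation_exprs)   # relation expression group stream RL
    so = []                     # result output stream SO
    while rl:                   # S51: repeat until RL is empty
        head = rl[0]            # S52: take one relation expression (RE)
        words = frozenset(head[1:])
        # S53: gather every expression built from the same words (stream SI)
        si = [r for r in rl if frozenset(r[1:]) == words]
        # Base symbol: the lowest degree of abstraction in SI
        base = min(si, key=lambda r: ABSTRACTION[r[0]])
        allowed = COMPATIBLE.get(base[0], set()) | {base[0]}
        # S54: drop expressions whose symbol conflicts with the base
        si = [r for r in si if r[0] in allowed]
        # S55: remove the consistent group from RL so it is not reprocessed
        rl = [r for r in rl if r not in si]
        # S56: the base is the lowest-abstraction member and becomes the keyword
        so.append(base)
    return so
```

Because the base expression is always removed from `rl`, each pass makes progress, and a conflicting expression left behind is handled as its own group on a later pass.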

【００５３】このようにして、一つのリレーション表現
ＲＥについての選択動作が終了する。リレーション表現
選択部６は、以下同様の動作を繰り返す。リレーション
表現群ストリームＲＬのすべてのリレーション表現の選
択が終了すると、リレーション表現選択動作が終了す
る。In this way, the selection operation for one relational expression RE is completed. The relation expression selecting unit 6 repeats the same operation thereafter. When the selection of all relation expressions of the relation expression group stream RL is completed, the relation expression selection operation ends.

【００５４】以上のように、第１の実施例によれば、同
じ自立語で構成されるリレーション表現の中から、リレ
ーション記号の抽象度が最も低いものをキーワードとし
て選択することにより、関係が確かなリレーション表現
をキーワードとすることができ、キーワード抽出の精度
を上げることができる。As described above, according to the first embodiment, the relation expression whose relation symbol has the lowest degree of abstraction is selected as the keyword from among relation expressions composed of the same independent words. A relation expression whose relation is certain can thus be used as the keyword, and the accuracy of keyword extraction can be improved.

【００５５】図１２は、本発明の第２の実施例の全体構
成を示すブロック図である。図中、図１と同様の部分に
は同じ符号を付して説明を省略する。７は頻度計数部、
８はリレーション表現選択／評価部、９はキーワード選
定部である。この第２の実施例では、第１の実施例の構
成に加えて、関係抽出部４の後に頻度計数部７が設けら
れている。また、図１のリレーション表現選択部５に代
わって、リレーション表現選択／評価部８が設けられ、
その後にキーワード選定部９が設けられている。FIG. 12 is a block diagram showing the overall configuration of a second embodiment of the present invention. Parts that are the same as in FIG. 1 are given the same reference numerals and their description is omitted. Reference numeral 7 denotes a frequency counting unit, 8 a relation expression selection/evaluation unit, and 9 a keyword selection unit. In the second embodiment, in addition to the configuration of the first embodiment, the frequency counting unit 7 is provided after the relation extraction unit 4. In place of the relation expression selection unit 5 of FIG. 1, the relation expression selection/evaluation unit 8 is provided, followed by the keyword selection unit 9.

【００５６】データ入力部２、表現抽出部３、関係抽出
部４は第１の実施例と同様である。頻度計数部７は、関
係抽出部４から出力されたリレーション表現から重複を
除いて各表現の出現頻度を計数し、リレーション表現を
出現頻度とともに出力する。リレーション表現選択／評
価部８は、まず、第１の実施例と同様に、関係抽出部４
から出力されたキーワード候補のうち、同じ語句によっ
て構成され、リレーション記号が互いに矛盾しないリレ
ーション表現を抽出し、その中で最も抽象度の低いリレ
ーション記号を持つリレーション表現のみをキーワード
候補として選択する。さらに、この第２の実施例では、
選択されなかったリレーション表現の出現頻度などをも
用いて、選択されたリレーション表現の重要度を評価す
る。キーワード選定部９は、リレーション表現選択／評
価部８の評価結果に基づいてキーワードを選定する。The data input unit 2, the expression extraction unit 3, and the relation extraction unit 4 are the same as in the first embodiment. The frequency counting unit 7 removes duplicates from the relation expressions output by the relation extraction unit 4, counts the frequency of occurrence of each expression, and outputs each relation expression together with its frequency. The relation expression selection/evaluation unit 8 first, as in the first embodiment, extracts from the keyword candidates output by the relation extraction unit 4 those relation expressions that are composed of the same words and phrases and whose relation symbols do not conflict, and selects as the keyword candidate only the relation expression whose relation symbol has the lowest degree of abstraction. In the second embodiment, it additionally evaluates the importance of the selected relation expression using, among other things, the frequencies of occurrence of the relation expressions that were not selected. The keyword selection unit 9 selects keywords based on the evaluation result of the relation expression selection/evaluation unit 8.

【００５７】次に、本発明のキーワード抽出装置の第２
の実施例における動作の一例を説明する。関係抽出部４
までの動作は、第１の実施例と同じである。関係抽出部
４において、表現抽出部３で抽出されたすべての表現が
リレーション表現に変換されると、頻度計数部５におい
て重複するリレーション表現を除き、出現頻度を付与す
る。これにより、リレーション表現は、例えば、｛リレ
ーション記号表現１表現２計数値｝という形式に変
換する。具体的には、例えば、｛［スル］システム表
現２｝のような形式となる。Next, an example of the operation of the second embodiment of the keyword extraction device of the present invention is described. The operation up to the relation extraction unit 4 is the same as in the first embodiment. Once the relation extraction unit 4 has converted all the expressions extracted by the expression extraction unit 3 into relation expressions, the frequency counting unit 5 removes duplicate relation expressions and attaches the frequency of occurrence. The relation expression is thereby converted into a form such as {relation symbol expression 1 expression 2 count}. Concretely, it takes a form such as {[suru] system expression 2}.

【００５８】頻度を付与されたリレーション表現群は、
リレーション表現選択／評価部８において選択／評価さ
れる。図１３は、リレーション表現選択／評価部８の動
作の一例を示すフローチャートである。同じ自立語から
構成されるリレーション表現の中から、リレーション記
号の抽象度が最も低いものを選択するところまでは、第
１の実施例とまったく同様である。すなわち、図１３の
Ｓ６１ないしＳ６５のステップは、図９のＳ５１ないし
Ｓ５５のステップと同様の処理が行なわれる。The relation expressions with frequencies attached are selected and evaluated by the relation expression selection/evaluation unit 8. FIG. 13 is a flowchart showing an example of the operation of the relation expression selection/evaluation unit 8. Up to the point where the relation expression whose relation symbol has the lowest degree of abstraction is selected from among relation expressions composed of the same independent words, the processing is exactly the same as in the first embodiment. That is, steps S61 to S65 of FIG. 13 perform the same processing as steps S51 to S55 of FIG. 9.

【００５９】Ｓ６６において、この第２の実施例では、
選択されたリレーション表現をストリームＣＯに書き込
む。続いて、Ｓ６７でストリームＣＯに書き込まれたリ
レーション表現の重要度を計算する。重要度の計算とし
ては、たとえば、ストリームＳＩ中の全リレーション表
現の出現頻度を単純に加算したものを重要度とすること
ができる。このほかにも、各リレーション表現の出現頻
度をリレーション記号の抽象度に応じて重み付けして加
算するなど、種々の方法を用いることができる。In S66, in the second embodiment, the selected relation expression is written to the stream CO. Then, in S67, the importance of the relation expression written to the stream CO is calculated. As the importance, for example, the simple sum of the frequencies of occurrence of all relation expressions in the stream SI can be used. Various other methods can also be used, such as weighting the frequency of each relation expression according to the degree of abstraction of its relation symbol before summing.

【００６０】Ｓ６８において、ストリームＣＯ内のリレ
ーション表現と、Ｓ６７で計算された重要度を結果出力
ストリームＳＯに出力し、リレーション表現ストリーム
ＲＥに読み込まれた１つのリレーション表現についての
選択／評価動作が終了する。リレーション表現選択／評
価部８は、以下同様の動作を繰り返す。リレーション表
現群ストリームＲＬのすべてのリレーション表現の選択
／評価が終了すると、リレーション表現選択／評価部８
の動作が終了する。In S68, the relation expression in the stream CO and the importance calculated in S67 are output to the result output stream SO, which completes the selection/evaluation operation for the one relation expression read into the relation expression stream RE. The relation expression selection/evaluation unit 8 then repeats the same operation. When all relation expressions in the relation expression group stream RL have been selected and evaluated, the operation of the relation expression selection/evaluation unit 8 ends.

【００６１】図１４は、同じ語から構成されるリレーシ
ョン表現の別の例の説明図である。具体例として、第１
の実施例と同様、Ｓ６２、Ｓ６３の処理により、図１４
に示すリレーション表現がリレーション表現ストリーム
ＲＥおよびストリームＳＩに読み込まれたものとする。
ここで、リレーション表現ストリームＲＥおよびストリ
ームＳＩに読み込まれた各リレーション表現は、頻度計
数部７によって出現頻度が計数され、計数値が付与され
ている。Ｓ６６において、これらのリレーション表現か
ら、抽象度が最も低いリレーション表現｛［ヲ］シス
テム実現２｝が選択され、ストリームＣＯに書き込
まれる。Ｓ６７では、このリレーション表現の重要度が
計算される。上述のように、重要度をストリームＳＩ中
の全リレーション表現の出現頻度を単純に加算したもの
とすれば、リレーション表現｛［ヲ］システム実現
２｝の重要度は８となる。このようにして、選択され
たリレーション表現とその重要度を、｛［ヲ］システ
ム実現８｝という形で結果出力ストリームＳＯに出
力する。FIG. 14 is an explanatory diagram of another example of relation expressions composed of the same words. As a concrete example, suppose that, as in the first embodiment, the processing of S62 and S63 has read the relation expressions shown in FIG. 14 into the relation expression stream RE and the stream SI. Each relation expression read into the relation expression stream RE and the stream SI has had its frequency of occurrence counted by the frequency counting unit 7 and carries that count. In S66, the relation expression with the lowest degree of abstraction, {[wo] system realization 2}, is selected from these relation expressions and written to the stream CO. In S67, the importance of this relation expression is calculated. If, as described above, the importance is taken to be the simple sum of the frequencies of occurrence of all relation expressions in the stream SI, the importance of the relation expression {[wo] system realization 2} is 8. The selected relation expression and its importance are then output to the result output stream SO in the form {[wo] system realization 8}.
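
The importance calculation of S66 to S68 can be sketched as follows. FIG. 14 is not reproduced here, so the individual counts in the usage example are hypothetical values chosen only so that the selected expression has count 2 and the total is 8, as in the text. The abstraction degrees other than [ヲ] = 1 and [ノ] = 3 are placeholders, and the `weights` argument is one possible reading of "weighting by degree of abstraction".

```python
# ヲ and ノ are from the text; the remaining degrees are placeholders.
ABSTRACTION = {"ヲ": 1, "ノ": 3, "スル": 5, "直結": 6}

def select_and_evaluate(si, weights=None):
    """si: list of (symbol, expression 1, expression 2, count) sharing the same words.

    Returns the lowest-abstraction expression with its count replaced by
    the importance: a plain sum of counts, or a weighted sum if weights
    (symbol -> factor) are given.
    """
    symbol, e1, e2, _ = min(si, key=lambda r: ABSTRACTION[r[0]])  # S66
    if weights is None:
        score = sum(r[3] for r in si)                             # S67, simple sum
    else:
        score = sum(weights[r[0]] * r[3] for r in si)             # S67, weighted
    return (symbol, e1, e2, score)                                # S68

# Hypothetical counts standing in for FIG. 14.
SI_EXAMPLE = [
    ("ヲ", "システム", "実現", 2),
    ("ノ", "システム", "実現", 3),
    ("スル", "システム", "実現", 2),
    ("直結", "システム", "実現", 1),
]
```

Calling `select_and_evaluate(SI_EXAMPLE)` yields the [ヲ] expression with importance 8 under these assumed counts.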

【００６２】キーワード選定部９は、リレーション表現
選択／評価部８で計算された重要度を用いて、例えば、
あらかじめ与えられた値以上のものをキーワードとして
選定し、記憶部９に入力文書とともに登録する。あらか
じめ与えておく値は、例えば、計算方法を与えておき、
キーワード候補数や重要度の分布によりキーワード選定
時に計算するように構成したり、あるいは、キーワード
抽出動作開始時にユーザがシステムに与えたり、重要度
評価結果をユーザに提示して閾値を入力させるなど、種
々の方法が考えられ、いずれの方法を用いても良い。Using the importance calculated by the relation expression selection/evaluation unit 8, the keyword selection unit 9 selects as keywords, for example, those candidates whose importance is at or above a value given in advance, and registers them in the storage unit 9 together with the input document. The value given in advance can be obtained in various ways. For example, a calculation method may be supplied and the value computed at keyword selection time from the number of keyword candidates or the distribution of importance values. Alternatively, the user may supply the value to the system when the keyword extraction operation starts, or the importance evaluation results may be presented to the user, who then enters a threshold. Any of these methods may be used.
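
A minimal sketch of threshold-based selection, assuming candidates are (keyword, importance) pairs. The fallback that derives the threshold from the mean importance is only one hypothetical instance of "computing the value from the distribution of importance"; the patent does not specify a formula.

```python
def select_by_threshold(candidates, threshold=None):
    """Keep candidates whose importance is at or above the threshold.

    candidates: list of (keyword, importance).
    threshold:  user-supplied value, or None to derive one from the data
                (assumed here to be the mean importance).
    """
    if not candidates:
        return []
    if threshold is None:
        threshold = sum(score for _, score in candidates) / len(candidates)
    return [kw for kw, score in candidates if score >= threshold]
```

Passing an explicit `threshold` corresponds to the user-supplied case, and leaving it as `None` corresponds to the computed case.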

【００６３】以上のように、第２の実施例によれば、キ
ーワードの候補として選択されなかったリレーション表
現の出現頻度も用いて、選択されたリレーション表現の
重要度を評価することにより、より正確な重要度の評価
ができ、精度良くキーワード抽出ができるキーワード抽
出装置を提供することができる。As described above, according to the second embodiment, the importance of the selected relation expression is evaluated using also the frequencies of occurrence of the relation expressions that were not selected as keyword candidates. This allows a more accurate evaluation of importance and makes it possible to provide a keyword extraction device that extracts keywords with high accuracy.

【００６４】上述の第２の実施例において、頻度計数部
７は、同一のリレーション表現の出現頻度を計数し、キ
ーワード選定部９は、リレーション表現選択／評価部８
におけるキーワードの候補として選択されなかったリレ
ーション表現の出現頻度および選択されたリレーション
表現の出現頻度から重要度を評価している。しかし、上
述の方法では、例えば、「文書処理」と「文書処理シス
テム」とは別のリレーション表現として抽出される。そ
して、単語群が同一でないため、リレーション表現選択
／評価部８において、同じ語を有するリレーション表現
として抽出されないので、別々のキーワード候補として
キーワード選定部９に出力されてしまう。文書中に、例
えば、「文書処理」が２回、「文書処理システム」が３
回出現したとし、キーワード選定部９で３回以上のリレ
ーション表現を選定するすれば、「文書処理システム」
がキーワードとして抽出され、「文書処理」はキーワー
ドとして選定されなくなってしまう。しかしながら、
「文書処理」と「文書処理システム」は全く異なる概念
ではなく、「文書処理」という概念に着目した場合、５
回出現したと考えるのが妥当である。したがって、実際
には「文書処理」の方が重要度が大きい可能性がある。
このように、単に同一の単語群についてのみから評価お
よび選択を行なうと、正確な重要度の評価が行なわれ
ず、検索の際の適合率を低下させる原因にもなる。In the second embodiment described above, the frequency counting unit 7 counts the frequency of occurrence of identical relation expressions, and the keyword selection unit 9 evaluates importance from the frequencies of occurrence of the relation expressions not selected as keyword candidates in the relation expression selection/evaluation unit 8 and of the selected relation expression. With this method, however, "document processing" (文書処理) and "document processing system" (文書処理システム), for example, are extracted as separate relation expressions. Because their word groups are not identical, the relation expression selection/evaluation unit 8 does not extract them as relation expressions having the same words, and they are output to the keyword selection unit 9 as separate keyword candidates. Suppose that "document processing" appears twice and "document processing system" three times in a document. If the keyword selection unit 9 selects relation expressions that appear three or more times, "document processing system" is extracted as a keyword and "document processing" is not selected. However, "document processing" and "document processing system" are not entirely different concepts. Focusing on the concept of "document processing", it is reasonable to regard it as having appeared five times. In reality, therefore, "document processing" may well be the more important of the two. Evaluating and selecting on the basis of identical word groups alone thus fails to evaluate importance accurately and can also lower precision at search time.

【００６５】これを解決するため、リレーション表現選
択／評価部８における評価の際、あるいは、キーワード
選定部９における選定の際に、ある第１のキーワード候
補を部分として持つ第２のキーワード候補があるとき、
少なくとも第１のキーワード候補の出現頻度と第２のキ
ーワード候補の出現頻度とに基づいてキーワード候補の
重要度を評価するように構成することができる。これに
より、あるキーワード候補が、別のキーワード候補に含
まれている場合でも、実際の出現頻度に見合った重要度
を付加することができる。例えば、リレーション表現選
択／評価部８で計算された重要度が、「文書処理」が
２、「文書処理システム」が３であるとき、「文書処
理」の重要度を５として評価するように構成することが
できる。To solve this, the device can be configured so that, at evaluation time in the relation expression selection/evaluation unit 8 or at selection time in the keyword selection unit 9, when there is a second keyword candidate that contains a first keyword candidate as a part, the importance of the keyword candidate is evaluated on the basis of at least the frequency of occurrence of the first keyword candidate and that of the second keyword candidate. In this way, even when one keyword candidate is contained in another, an importance commensurate with its actual frequency of occurrence can be assigned. For example, when the importance calculated by the relation expression selection/evaluation unit 8 is 2 for "document processing" and 3 for "document processing system", the device can be configured to evaluate the importance of "document processing" as 5.
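
The containment-based evaluation can be sketched as below. Two assumptions are made: candidates are plain strings, and "contains as a part" is tested by substring match. The patent leaves the containment test open, and for relation expressions it would need a different comparison.

```python
def merged_importance(scores):
    """Add to each candidate the importance of every longer candidate containing it.

    scores: dict mapping candidate string -> importance.
    Containment is tested by substring match (an assumption).
    """
    result = {}
    for cand, score in scores.items():
        extra = sum(
            s for other, s in scores.items()
            if other != cand and cand in other
        )
        result[cand] = score + extra
    return result
```

For the example in the text, `merged_importance({"文書処理": 2, "文書処理システム": 3})` raises "文書処理" to 5 and leaves "文書処理システム" at 3.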

【００６６】上述の評価方法は、リレーション表現を用
いたキーワード抽出装置以外でも適用することができ
る。図１５は、本発明のキーワード抽出装置の第３の実
施例を示すブロック図である。図中、図１、図１２と同
様の部分には同じ符号を付して説明を省略する。１０は
重要度評価部、１３は単語群抽出部である。データ入力
部２、記憶部６は、第１および第２の実施例と同様であ
る。また、キーワード選定部９についても、第２の実施
例と同様とした。The evaluation method described above can also be applied to keyword extraction devices other than those using relation expressions. FIG. 15 is a block diagram showing a third embodiment of the keyword extraction device of the present invention. Parts that are the same as in FIG. 1 and FIG. 12 are given the same reference numerals and their description is omitted. Reference numeral 10 denotes an importance evaluation unit and 13 a word group extraction unit. The data input unit 2 and the storage unit 6 are the same as in the first and second embodiments. The keyword selection unit 9 is also the same as in the second embodiment.

【００６７】表現抽出部３は、形態素解析部１１、単語
群抽出部１３より構成されている。表現抽出部３は、デ
ータ入力部２で読み込んだテキストを形態素解析部１１
で形態素解析し、その結果から単語群抽出部１３で表現
抽出規則にしたがってキーワード候補を抽出する。この
第３の実施例では、単語群抽出部１３は、複合語を抽出
するものとして、以下、説明する。しかし、これに限ら
ず、上述の第１、第２の実施例と同様のパターン等、種
々のパターンを抽出するように構成することもできる。
The expression extraction unit 3 consists of a morphological analysis unit 11 and a word group extraction unit 13. The text read by the data input unit 2 is morphologically analyzed by the morphological analysis unit 11, and the word group extraction unit 13 then extracts keyword candidates from the result according to the expression extraction rules. In this third embodiment, the word group extraction unit 13 is described below as extracting compound words. However, it is not limited to this and can also be configured to extract various other patterns, such as the patterns used in the first and second embodiments.

【００６８】重要度評価部１０は、表現抽出部３から出
力され、頻度計数部７で計数された表現の出現頻度に基
づいて重要度を計算する。このとき、あるキーワード候
補が別のキーワード候補に含まれている場合、両者の出
現頻度から、そのキーワード候補の重要度を計算する。
The importance evaluation unit 10 calculates importance based on the appearance frequencies of the expressions output from the expression extraction unit 3 and counted by the frequency counting unit 7. When a keyword candidate is contained in another keyword candidate, the importance of that keyword candidate is calculated from the appearance frequencies of both.

【００６９】次に、本発明のキーワード抽出装置の第３
の実施例における動作の一例について、具体例をもとに
説明する。具体例として、上述の第１および第２の実施
例で用いた「我々が実現したシステムは、本手法適用に
より文書の高速検索を実現する。」という例文が入力さ
れた場合を考える。この例文は、表現抽出部３内の形態
素解析部１１で形態素解析され、図３に示したような形
態素解析結果が得られる。単語群抽出部１３は、形態素
解析部１１で分割された単語列から、表現抽出規則に従
ってキーワードの候補となり得る単語群を抽出する。こ
こでは、表現抽出規則として、活用語尾を伴わない自立
語の連続、すなわち、名詞、サ変動詞語幹、形容動詞語
幹のいずれかが連続する場合に抽出するものとする。図
１６は、単語群抽出部１３により抽出されたキーワード
候補の一例の説明図である。上述の例文では、図１６に
示したような２つのキーワード候補が抽出される。
Next, an example of the operation of the third embodiment of the keyword extraction device of the present invention is described using a specific example. Consider the case where the example sentence used in the first and second embodiments, 「我々が実現したシステムは、本手法適用により文書の高速検索を実現する。」 ("The system we have realized achieves high-speed document retrieval by applying this method."), is input. This sentence is morphologically analyzed by the morphological analysis unit 11 in the expression extraction unit 3, yielding a morphological analysis result like that shown in FIG. 3. From the word sequence segmented by the morphological analysis unit 11, the word group extraction unit 13 extracts word groups that can become keyword candidates according to the expression extraction rules. Here, the expression extraction rule is to extract a run of independent words without inflectional endings, that is, a sequence in which nouns, sa-row irregular verb stems, or adjectival-noun stems occur consecutively. FIG. 16 is an explanatory diagram of an example of keyword candidates extracted by the word group extraction unit 13. For the example sentence above, two keyword candidates such as those shown in FIG. 16 are extracted.
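The extraction rule in this paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: the token segmentation, the part-of-speech tag names (`noun`, `sahen_stem`, `adjv_stem`, `inflection`, and so on), and the minimum run length of two words are hypothetical choices made here, not taken from FIG. 3 or FIG. 16.

```python
# Hypothetical POS tag names for the three classes named in the rule.
CONTENT_POS = {"noun", "sahen_stem", "adjv_stem"}

def extract_word_groups(tokens, min_len=2):
    """Return runs of consecutive content words that carry no inflectional ending.

    tokens: list of (surface, pos) pairs from a morphological analyzer.
    min_len: assumed minimum number of words in a run (compound words only).
    """
    groups, run = [], []
    for i, (surface, pos) in enumerate(tokens):
        next_pos = tokens[i + 1][1] if i + 1 < len(tokens) else None
        # A stem followed by an inflectional ending is not eligible.
        eligible = pos in CONTENT_POS and next_pos != "inflection"
        if eligible:
            run.append(surface)
        else:
            if len(run) >= min_len:
                groups.append(tuple(run))
            run = []
    if len(run) >= min_len:
        groups.append(tuple(run))
    return groups

# Assumed segmentation of the example sentence (illustrative only).
TOKENS = [
    ("我々", "pronoun"), ("が", "particle"), ("実現", "sahen_stem"),
    ("し", "inflection"), ("た", "auxiliary"), ("システム", "noun"),
    ("は", "particle"), ("、", "symbol"), ("本", "noun"), ("手法", "noun"),
    ("適用", "sahen_stem"), ("に", "particle"), ("より", "particle"),
    ("文書", "noun"), ("の", "particle"), ("高速", "adjv_stem"),
    ("検索", "sahen_stem"), ("を", "particle"), ("実現", "sahen_stem"),
    ("する", "inflection"),
]

word_groups = extract_word_groups(TOKENS)
```

Under these assumptions the sketch yields two word groups for the example sentence; whether they coincide with the candidates shown in FIG. 16 depends on the actual analyzer output.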

【００７０】形態素解析部１１および単語群抽出部１３
の処理は、１文ずつ行なわれ、以下、文書データの最後
までこれを繰り返す。文書データ全部についての単語群
の抽出が終了すると、頻度計数部７において、抽出され
た単語群の中から重複している単語群を探し、それらの
単語群の出現頻度を計数し、重複を除く。こうして得ら
れた単語群とその出現頻度の組は、重要度評価部１０に
渡される。
The processing by the morphological analysis unit 11 and the word group extraction unit 13 is performed one sentence at a time and is repeated until the end of the document data. When word group extraction has finished for the entire document data, the frequency counting unit 7 finds duplicated word groups among the extracted word groups, counts the appearance frequency of each, and removes the duplicates. The resulting pairs of word group and appearance frequency are passed to the importance evaluation unit 10.
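The counting step can be illustrated with a short Python sketch. The function name `count_word_groups` and the input layout (one list of word groups per sentence, each group a tuple of words) are assumptions for illustration; the English glosses stand in for the Japanese words.

```python
from collections import Counter

def count_word_groups(groups_per_sentence):
    """Merge duplicate word groups across sentences and count occurrences.

    groups_per_sentence: list of lists; each inner list holds the word groups
    (tuples of words) extracted from one sentence.
    Returns a dict mapping each distinct word group to its appearance frequency.
    """
    counts = Counter()
    for groups in groups_per_sentence:
        counts.update(groups)
    return dict(counts)

# Illustrative input: three sentences' worth of extracted word groups.
extracted = [
    [("high-speed", "search"), ("document", "processing")],
    [("high-speed", "search")],
    [("high-speed", "search")],
]
frequencies = count_word_groups(extracted)
```

The resulting dictionary plays the role of the word group and frequency pairs handed to the importance evaluation unit 10.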

【００７１】図１７は、本発明のキーワード抽出装置の
第３の実施例における重要度評価部１０の動作の一例を
示すフローチャートである。頻度計数部７から渡される
単語群と出現頻度の組は、キーワード候補群ストリーム
ＫＥに入力されているものとする。
FIG. 17 is a flowchart showing an example of the operation of the importance evaluation unit 10 in the third embodiment of the keyword extraction device of the present invention. The pairs of word group and appearance frequency passed from the frequency counting unit 7 are assumed to have been input to the keyword candidate group stream KE.

【００７２】重要度評価部１０は、まず、Ｓ７１におい
て、キーワード候補群ストリームＫＥを前方一致順に
（辞書順に）ソートした前方一致順キーワード候補群ス
トリームＫＦと、キーワード候補群ストリームＫＥを後
方一致順に（文字列の語尾から辞書順に）ソートした後
方一致順キーワード候補群ストリームＫＢを用意する。
First, in S71, the importance evaluation unit 10 prepares two sorted copies of the keyword candidate group stream KE: a prefix-order keyword candidate group stream KF, sorted in prefix-match order (dictionary order), and a suffix-order keyword candidate group stream KB, sorted in suffix-match order (dictionary order read from the end of the string).
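A minimal sketch of step S71 in Python, assuming each candidate is stored as a pair of (tuple of words, frequency). Sorting by word tuples rather than by raw character strings is an assumption made here; the text only says "dictionary order". The names `sort_streams`, `ke`, `kf`, and `kb` mirror the stream names but are otherwise hypothetical.

```python
def sort_streams(ke):
    """Build the prefix-order stream KF and the suffix-order stream KB from KE.

    ke: list of (words, frequency) pairs, where words is a tuple of strings.
    """
    # Prefix order: ordinary dictionary order over the word sequence.
    kf = sorted(ke, key=lambda cand: cand[0])
    # Suffix order: dictionary order over the word sequence read from its end.
    kb = sorted(ke, key=lambda cand: cand[0][::-1])
    return kf, kb

# Illustrative candidates (English glosses, hypothetical frequencies).
ke = [
    (("high-speed", "search", "system"), 5),
    (("document", "high-speed", "search"), 3),
    (("high-speed", "search"), 3),
    (("high-speed", "search", "function"), 4),
]
kf, kb = sort_streams(ke)
```

After sorting, candidates that share leading words with a given candidate sit next to it in `kf`, and candidates that share trailing words sit next to it in `kb`, which is what makes the later matching steps cheap.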

【００７３】続いて、Ｓ７２において、前方一致順キー
ワード候補群ストリームＫＦにキーワード候補が存在す
ることを確認し、Ｓ７３において、前方一致順キーワー
ド候補群ストリームＫＦからキーワード候補をひとつ取
り出し、変数ＫＹに読み込む。
Next, in S72, it is confirmed that a keyword candidate exists in the prefix-order keyword candidate group stream KF, and in S73, one keyword candidate is taken from KF and read into the variable KY.

【００７４】次に、Ｓ７４において、前方一致順キーワ
ード候補群ストリームＫＦの中で、変数ＫＹと前方一致
するキーワード候補が存在するか否かを判定し、存在す
る場合には、Ｓ７５において、変数ＫＹと前方一致する
キーワード候補をすべて前方一致候補ストリームＦＯに
コピーする。さらに、Ｓ７６において、後方一致順キー
ワード候補群ストリームＫＢの中で、変数ＫＹと後方一
致するキーワード候補が存在するか否かを判定し、存在
する場合には、Ｓ７７において、後方一致するキーワー
ド候補をすべて後方一致候補ストリームＢＡにコピーす
る。ここで、前方一致、後方一致の判断は、文字単位で
はなく、単語単位で行ない、変数ＫＹを構成する単語を
全て含んでいる場合に、前方一致、後方一致したと判断
する。
Next, in S74, it is determined whether the prefix-order keyword candidate group stream KF contains a keyword candidate that prefix-matches the variable KY. If so, in S75, all keyword candidates that prefix-match KY are copied to the prefix-match candidate stream FO. Then, in S76, it is determined whether the suffix-order keyword candidate group stream KB contains a keyword candidate that suffix-matches KY. If so, in S77, all suffix-matching keyword candidates are copied to the suffix-match candidate stream BA. Here, prefix and suffix matches are judged word by word rather than character by character: a candidate is judged to prefix-match or suffix-match only when it contains all the words that make up KY.
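The word-level matching rule can be sketched as two small Python predicates. Representing candidates as tuples of words, and requiring a matching candidate to be strictly longer than KY (so that KY does not match itself), are assumptions made for this illustration.

```python
def prefix_matches(ky, cand):
    """True if cand starts with every word of ky, compared word by word."""
    n = len(ky)
    return len(cand) > n and cand[:n] == ky

def suffix_matches(ky, cand):
    """True if cand ends with every word of ky, compared word by word."""
    n = len(ky)
    return len(cand) > n and cand[-n:] == ky

ky = ("high-speed", "search")
fo_hit = prefix_matches(ky, ("high-speed", "search", "system"))    # True
ba_hit = suffix_matches(ky, ("document", "high-speed", "search"))  # True
# Word-level comparison: "searchlight" begins with the characters of
# "search" but is a different word, so it does not count as a match.
no_hit = prefix_matches(("search",), ("searchlight", "beam"))      # False
```

Comparing tuples of words rather than raw strings is what keeps partial-word overlaps from being counted.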

【００７５】図１８は、前方一致、後方一致により得ら
れるキーワード候補の一例の説明図である。変数ＫＹに
｛高速−検索３｝が読み込まれているものとする。こ
こで、「高速−検索」はキーワード候補であり、「３」
はその出現頻度である。前方一致候補ストリームＦＯに
は、このキーワード候補と前方一致する、例えば、「高
速−検索−システム」、「高速−検索−機能」などが書
き込まれる。また、後方一致候補ストリームＢＡには、
例えば、「文書−高速−検索」などが書き込まれる。
FIG. 18 is an explanatory diagram of an example of keyword candidates obtained by prefix matching and suffix matching. Assume that {high-speed−search 3} has been read into the variable KY. Here, "high-speed−search" is the keyword candidate and "3" is its appearance frequency. Candidates that prefix-match this keyword candidate, for example "high-speed−search−system" and "high-speed−search−function", are written to the prefix-match candidate stream FO. Candidates that suffix-match it, for example "document−high-speed−search", are written to the suffix-match candidate stream BA.

【００７６】図１７に戻り、続いて、Ｓ７８において、
前方一致候補ストリームＦＯおよび後方一致候補ストリ
ームＢＡにある候補の情報を用いて、変数ＫＹの重要度
を計算する。ここでは、重要度の計算は、前方一致候補
ストリームＦＯおよび後方一致候補ストリームＢＡ中の
すべての候補の出現頻度を、変数ＫＹの出現頻度に加算
することにより行なう。図１８に示した例では、キーワ
ード候補「高速−検索」の重要度は１５となる。重要度
を付与されたキーワード候補は、Ｓ７９において、重要
度付き候補群ストリームＩＭに書き込まれる。重要度評
価部１０は、以上の処理を、前方一致順キーワード候補
群ストリームＫＦにキーワード候補がなくなるまで繰り
返す。
Returning to FIG. 17, in S78 the importance of the variable KY is calculated using the information on the candidates in the prefix-match candidate stream FO and the suffix-match candidate stream BA. Here, the importance is calculated by adding the appearance frequencies of all candidates in FO and BA to the appearance frequency of KY. In the example shown in FIG. 18, the importance of the keyword candidate "high-speed−search" is 15. In S79, the keyword candidate with its assigned importance is written to the importance-tagged candidate group stream IM. The importance evaluation unit 10 repeats the above processing until no keyword candidates remain in the prefix-order keyword candidate group stream KF.
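The loop S72 to S79 with simple frequency addition might look like the following Python sketch. Several things here are assumptions: the function name `evaluate_importance`, the use of a linear scan over all candidates instead of the sorted streams KF and KB, and the individual frequencies of the longer candidates, which are made-up values chosen so that the total for "high-speed−search" comes to the 15 stated in the text (the actual values are in FIG. 18).

```python
def evaluate_importance(ke):
    """Assign each candidate its own frequency plus that of every longer
    candidate that prefix-matches or suffix-matches it at the word level.

    ke: dict mapping word tuples to appearance frequencies.
    Returns a dict mapping word tuples to importance (the stream IM).
    """
    im = {}
    for ky, freq in ke.items():
        n = len(ky)
        # FO: longer candidates that begin with all words of KY.
        fo = [c for c in ke if len(c) > n and c[:n] == ky]
        # BA: longer candidates that end with all words of KY.
        ba = [c for c in ke if len(c) > n and c[-n:] == ky]
        im[ky] = freq + sum(ke[c] for c in fo) + sum(ke[c] for c in ba)
    return im

# Hypothetical frequencies; only the KY frequency (3) and the total (15)
# come from the text.
ke = {
    ("high-speed", "search"): 3,
    ("high-speed", "search", "system"): 5,
    ("high-speed", "search", "function"): 4,
    ("document", "high-speed", "search"): 3,
}
im = evaluate_importance(ke)
```

A linear scan keeps the sketch short; with the sorted streams, the matching candidates for each KY form a contiguous block and can be found without scanning everything. Note that a candidate that both begins and ends with KY would be counted twice under this reading.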

【００７７】重要度評価部１０における重要度の計算
は、上述のように、出現頻度を単純に加算する方法を用
いる以外にも、例えば、前方一致候補ストリームＦＯお
よび後方一致候補ストリームＢＡ中の候補とキーワード
候補ＫＹとの重複語数、重複語の割合、前方一致か後方
一致か、などによって頻度を重み付けして加算するな
ど、種々のものが考えられる。頻度情報を用いるもので
あれば、どのような方法を用いてもよい。
Besides simply adding the appearance frequencies as described above, the importance evaluation unit 10 may calculate importance in various other ways. For example, the frequencies may be weighted before being added according to the number of words that a candidate in the prefix-match candidate stream FO or the suffix-match candidate stream BA shares with the keyword candidate KY, the proportion of shared words, or whether the match is a prefix match or a suffix match. Any method may be used as long as it uses frequency information.
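One possible weighting along these lines is sketched below in Python. The specific formula is an assumption chosen for illustration, not something the text prescribes: each matching candidate's frequency is scaled by the proportion of its words shared with KY and by a separate factor for prefix and suffix matches. The parameter names `prefix_weight` and `suffix_weight` are hypothetical.

```python
def weighted_importance(ke, prefix_weight=1.0, suffix_weight=1.0):
    """Importance with frequencies weighted by overlap ratio and match type.

    ke: dict mapping word tuples to appearance frequencies.
    """
    im = {}
    for ky, freq in ke.items():
        n = len(ky)
        score = float(freq)
        for cand, cand_freq in ke.items():
            if len(cand) <= n:
                continue
            overlap = n / len(cand)  # proportion of cand's words shared with KY
            if cand[:n] == ky:       # prefix match
                score += prefix_weight * overlap * cand_freq
            if cand[-n:] == ky:      # suffix match
                score += suffix_weight * overlap * cand_freq
        im[ky] = score
    return im

# Hypothetical frequencies, as in a simple-addition example.
ke = {
    ("high-speed", "search"): 3,
    ("high-speed", "search", "system"): 5,
    ("high-speed", "search", "function"): 4,
    ("document", "high-speed", "search"): 3,
}
im_equal = weighted_importance(ke)
im_suffix_half = weighted_importance(ke, suffix_weight=0.5)
```

Setting both weights to 1.0 and dropping the overlap factor would reduce this to the simple addition described earlier.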

【００７８】重要度評価部１０で重要度が付与されたキ
ーワード候補群は、キーワード選定部９に渡され、重要
度を付与されたキーワード候補の中から、重要度に基づ
いてキーワードが選定される。例えば、あらかじめ決め
られた閾値よりも大きな重要度を持つ候補をキーワード
として選定し、記憶部６に入力文書とともに登録する。
このキーワード選定部９で用いるキーワードを選定する
ためのあらかじめ決められた閾値は、例えば、計算方法
を与えておき、キーワード候補数や重要度の分布により
キーワード選定時に計算したり、キーワード抽出動作開
始時にユーザがシステムに与えたり、あるいは、重要度
評価結果をユーザに提示して閾値を入力させるなど、種
々の方法を用いることができる。
The keyword candidates to which importance has been assigned by the importance evaluation unit 10 are passed to the keyword selection unit 9, which selects keywords from among them based on importance. For example, candidates whose importance exceeds a predetermined threshold are selected as keywords and registered in the storage unit 6 together with the input document. The threshold used by the keyword selection unit 9 can be obtained in various ways: a calculation method may be supplied in advance and the threshold computed at selection time from the number of keyword candidates or the distribution of importance; the user may give it to the system when the keyword extraction operation starts; or the importance evaluation results may be presented to the user, who then enters the threshold.
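The selection step can be sketched in Python as below. The function name `select_keywords` is hypothetical, and using the mean importance as the computed threshold is only one assumed example of "a calculation method supplied in advance"; a user-supplied value can be passed instead.

```python
from statistics import mean

def select_keywords(im, threshold=None):
    """Return candidates whose importance exceeds the threshold, highest first.

    im: dict mapping word tuples to importance.
    threshold: user-supplied value, or None to derive one from the
    distribution of importance (here, assumed to be the mean).
    """
    if not im:
        return []
    if threshold is None:
        threshold = mean(im.values())
    ranked = sorted(im.items(), key=lambda item: -item[1])
    return [words for words, score in ranked if score > threshold]

# Illustrative importance values.
im = {
    ("high-speed", "search"): 15,
    ("high-speed", "search", "system"): 5,
    ("high-speed", "search", "function"): 4,
    ("document", "high-speed", "search"): 3,
}
auto_selected = select_keywords(im)               # threshold = mean = 6.75
user_selected = select_keywords(im, threshold=3)  # threshold given by the user
```

The selected keywords would then be registered together with the input document.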

【００７９】この第３の実施例では、表現抽出部３で、
キーワード候補として複合語を抽出する場合を示した
が、間に付属語があったり、連続していなくても特定の
品詞パターンを規則に基づき抽出するようにしても良い
し、付属語も抽出してもよい。付属語も抽出する場合、
前方一致、後方一致などの判断は自立語のみでするよう
にすればよい。もちろん、第１、第２の実施例のよう
に、リレーション表現を抽出するように構成することも
できる。
In this third embodiment, the case where the expression extraction unit 3 extracts compound words as keyword candidates has been shown. However, specific part-of-speech patterns may instead be extracted according to rules even when attached words intervene or the words are not contiguous, and attached words may also be extracted. When attached words are also extracted, prefix and suffix matches should be judged using the independent words only. Of course, the device can also be configured to extract relation expressions, as in the first and second embodiments.

【００８０】[0080]

【発明の効果】以上の説明から明らかなように、本発明
によれば、テキスト中の単語の表記だけでなく、その間
の関係も含めてキーワードの抽出を行なうことができ、
関係の抽象度が異なる類似した表現の中から、適切なキ
ーワードを抽出できる。また、選択されなかったキーワ
ード候補の出現頻度をも利用することにより、より正確
に重要度を評価でき、これにより精度良くキーワードを
抽出することができるという効果がある。
As is clear from the above description, according to the present invention, keywords can be extracted taking into account not only the surface form of the words in a text but also the relations between them, and appropriate keywords can be extracted from among similar expressions whose relations differ in degree of abstraction. Furthermore, by also using the appearance frequencies of keyword candidates that were not selected, importance can be evaluated more accurately, so that keywords can be extracted with high precision.

【００８１】[0081]

[Brief description of drawings]

【図１】本発明のキーワード抽出装置の第１の実施例
の全体構成を示すブロック図である。FIG. 1 is a block diagram showing the overall configuration of a first embodiment of a keyword extraction device of the present invention.

【図２】表現抽出部３の動作の一例を示すフローチャ
ートである。FIG. 2 is a flowchart showing an example of the operation of the expression extraction unit 3.

【図３】形態素解析結果の一例の説明図である。FIG. 3 is an explanatory diagram of an example of a morphological analysis result.

【図４】抽出された表現の一例の説明図である。FIG. 4 is an explanatory diagram of an example of extracted expressions.

【図５】表現抽出規則の一例の説明図である。FIG. 5 is an explanatory diagram of an example of an expression extraction rule.

【図６】関係抽出部４の関係抽出動作の一例を示すフ
ローチャートである。FIG. 6 is a flowchart showing an example of a relation extracting operation of a relation extracting unit 4.

【図７】関係推定規則の一例の説明図である。FIG. 7 is an explanatory diagram of an example of a relationship estimation rule.

【図８】リレーション表現の一例の説明図である。FIG. 8 is an explanatory diagram of an example of a relation expression.

【図９】リレーション表現選択部５の動作を示すフロ
ーチャートである。FIG. 9 is a flowchart showing an operation of the relation expression selection unit 5.

【図１０】同じ語から構成されるリレーション表現の
一例の説明図である。FIG. 10 is an explanatory diagram of an example of a relational expression composed of the same words.

【図１１】リレーション記号が有する情報の一例の説
明図である。FIG. 11 is an explanatory diagram of an example of information included in a relation symbol.

【図１２】本発明の第２の実施例の全体構成を示すブ
ロック図である。FIG. 12 is a block diagram showing an overall configuration of a second exemplary embodiment of the present invention.

【図１３】リレーション表現選択／評価部８の動作の
一例を示すフローチャートである。FIG. 13 is a flowchart showing an example of the operation of the relation expression selection / evaluation unit 8.

【図１４】同じ語から構成されるリレーション表現の
別の例の説明図である。FIG. 14 is an explanatory diagram of another example of relation expressions composed of the same word.

【図１５】本発明のキーワード抽出装置の第３の実施
例を示すブロック図である。FIG. 15 is a block diagram showing a third embodiment of the keyword extracting device of the present invention.

【図１６】単語群抽出部１３により抽出されたキーワ
ード候補の一例の説明図である。FIG. 16 is an explanatory diagram of an example of keyword candidates extracted by the word group extraction unit 13.

【図１７】本発明のキーワード抽出装置の第３の実施
例における重要度評価部１０の動作の一例を示すフロー
チャートである。FIG. 17 is a flowchart showing an example of an operation of the importance evaluation section 10 in the third exemplary embodiment of the keyword extracting device of the present invention.

【図１８】前方一致、後方一致により得られるキーワ
ード候補の一例の説明図である。FIG. 18 is an explanatory diagram of an example of keyword candidates obtained by prefix matching and suffix matching.

[Explanation of symbols]

１…キーワード抽出装置、２…データ入力部、３…表現
抽出部、４…関係抽出部、５…リレーション表現選択
部、６…記憶部、７…頻度計数部、８…リレーション表
現選択／評価部、９…キーワード選定部、１０…重要度
評価部、１１…形態素解析部、１２…単語群抽出部、１
３…単語群抽出部。1 ... keyword extraction device, 2 ... data input unit, 3 ... expression extraction unit, 4 ... relation extraction unit, 5 ... relation expression selection unit, 6 ... storage unit, 7 ... frequency counting unit, 8 ... relation expression selection/evaluation unit, 9 ... keyword selection unit, 10 ... importance evaluation unit, 11 ... morphological analysis unit, 12 ... word group extraction unit, 13 ... word group extraction unit.

フロントページの続き (72)発明者中垣寿平神奈川県横浜市保土ヶ谷区神戸町134番地横浜ビジネスパークイーストタワー富士ゼロックス株式会社内 (56)参考文献特開平２−158872（ＪＰ，Ａ) 特開平７−244673（ＪＰ，Ａ) (58)調査した分野(Int.Cl.⁷，ＤＢ名) G06F 17/30 210 - 419 ＪＩＣＳＴファイル（ＪＯＩＳ)
Continuation of front page
(72) Inventor: Juhei Nakagaki, c/o Fuji Xerox Co., Ltd., Yokohama Business Park East Tower, 134 Kobe-cho, Hodogaya-ku, Yokohama-shi, Kanagawa
(56) References: JP-A-2-158872 (JP, A); JP-A-7-244673 (JP, A)
(58) Fields searched (Int. Cl.⁷, DB name): G06F 17/30 210 - 419; JICST file (JOIS)

Claims

(57) [Claims]

1. A keyword extraction device for extracting keywords from text, comprising: expression extraction means for extracting from the text an expression consisting of a plurality of mutually related words; relation extraction means for estimating the relation between the words constituting the expression extracted by the expression extraction means and outputting relation information representing that relation together with a relation expression consisting of the plurality of words; and relation expression selection means for extracting, from the relation information and relation expressions output by the relation extraction means, relation expressions that are composed of the same words and whose relation information does not contradict each other, and selecting the relation expression whose relation information has the lowest degree of abstraction as a keyword candidate.

2. A keyword extraction device for extracting keywords from text, comprising: expression extraction means for extracting from the text an expression consisting of a plurality of mutually related words; relation extraction means for estimating the relation between the words constituting the expression extracted by the expression extraction means and outputting relation information representing that relation together with a relation expression consisting of the plurality of words; frequency counting means for counting the appearance frequency of the relation expressions output from the relation extraction means; relation expression evaluation means for extracting, from the relation information and relation expressions output from the relation extraction means, relation expressions that are composed of the same words and whose relation information does not contradict each other, selecting the relation expression whose relation information has the lowest degree of abstraction as a keyword candidate, and evaluating the importance of the selected relation expression using the appearance frequencies of the relation expressions that were not selected and of the relation expression selected as a candidate; and keyword selection means for selecting keywords based on the evaluation result of the relation expression evaluation means.

3. A keyword extraction method for extracting keywords from text, comprising: a step in which expression extraction means extracts from the text an expression consisting of a plurality of mutually related words; a step in which relation extraction means estimates the relation between the words constituting the expression extracted by the expression extraction means and outputs relation information representing that relation together with a relation expression consisting of the plurality of words; and a step in which relation expression selection means extracts, from the relation information and relation expressions output from the relation extraction means, relation expressions that are composed of the same words and whose relation information does not contradict each other, and selects the relation expression whose relation information has the lowest degree of abstraction as a keyword candidate.