JP2003167894A

JP2003167894A - Automatic related word extracting method, automatic related word extracting device, plural-important word extracting program and important word vertical hierarchical relationship extracting program

Info

Publication number: JP2003167894A
Application number: JP2001367472A
Authority: JP
Inventors: Genichiro Sueki; 源一郎末木; Hiroaki Fujiki; 宏明藤木; Naoko Yoshino; 直子吉野; Kazuko Adachi; 和子足立
Original assignee: Mitsubishi Space Software Co Ltd
Current assignee: Mitsubishi Space Software Co Ltd
Priority date: 2001-11-30
Filing date: 2001-11-30
Publication date: 2003-06-13
Anticipated expiration: 2021-11-30
Also published as: JP3553543B2; WO2003046765A1

Abstract

PROBLEM TO BE SOLVED: To provide an automatic related word extraction method and device that can automatically extract technical terms, new words, and buzzwords which appear in a specific field designated by a user but are not recorded in general existing thesaurus dictionaries, and that can accurately extract, with high precision, important words closely related to a word designated by the user.

SOLUTION: A document group in a field designated by the user is stored in a database part 1. An important word analyzing part 2 selects important words, that is, words of high importance, from the document group in the database part 1. A count part 3 creates a count list as statistical information on the important words or on pairs of important words. A related word extracting part 4 determines the degree of association between important words on the basis of this count list.

COPYRIGHT: (C)2003, JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an automatic related word extraction method that automatically extracts words closely related to a word designated by a user, based on statistical information about the words contained in a database. More specifically, it relates to an automatic related word extraction method and an automatic related word extraction device capable of extracting technical terms, new words, and buzzwords that appear in a specific field designated by the user and are not recorded in general existing thesaurus dictionaries.

[0002]

[Prior Art] A conventional automatic related word extraction device generally contains an existing thesaurus dictionary as an internal component, and merely looks up the word designated by the user in that thesaurus dictionary and displays the result as the related word extraction result.

[0003]

[Problems to Be Solved by the Invention] However, conventional automatic related word extraction devices have the drawback that technical terms, new words, and buzzwords not recorded in the existing thesaurus dictionary cannot be extracted, regardless of their importance.

[0004] Furthermore, when related words were needed for a plurality of fields, a separate thesaurus dictionary had to be prepared for each field, which was also wasteful in terms of cost.

[0005] Moreover, even among methods that automatically extract related words from the statistical information of a database without using an existing thesaurus dictionary, conventional methods generally use only, for example, the appearance frequency of words that occur in isolation.

[0006] Consequently, even when a document database containing technical terms, new words, and buzzwords is used, the extraction accuracy of the related word extraction method is deficient, and it has been difficult to extract the accurate related words that the user wants.

[0007] The present invention has been made to solve the above problems of the prior art. Its object is to provide an automatic related word extraction method and an automatic related word extraction device that can automatically extract technical terms, new words, and buzzwords which appear in a specific field designated by the user and are not recorded in general existing thesaurus dictionaries, and that can furthermore extract, accurately and with high precision, important words closely related to a word designated by the user.

[0008]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 is characterized by using an automatic related word extraction method in which a document group in a field designated by the user is used as a database, important words, that is, words of high importance, are selected from the documents in the database, and the degree of association between important words is calculated using statistical information on the important words or on pairs of important words. Here, importance refers to the degree to which a word well represents the characteristics of the content of a document, or the characteristics of that document within its genre.

[0009] The invention according to claim 2 is characterized in that, in addition to the configuration of claim 1, when document groups of a plurality of fields are stored in the database, related words can be extracted automatically for each field.

[0010] The invention according to claim 3 is characterized in that, in addition to the configuration of claim 1 or 2, the database can be updated or added to at any time, and the differential data are reflected successively when related words are automatically extracted.

[0011] The invention according to claim 4 is characterized in that, in addition to the configuration of any one of claims 1 to 3, the document group in the database is one from which duplicates have been removed: the header information of each document is used to judge whether documents are identical, and when a plurality of identical documents are included, one document is kept and the other identical documents are removed.

[0012] The invention according to claim 5 is characterized in that, in addition to the configuration of any one of claims 1 to 4, the important words are compound words created from the morphemes obtained by dividing the documents in the database into part-of-speech units.

[0013] The invention according to claim 6 is characterized in that, in addition to the configuration of any one of claims 1 to 5, the important words are words of the parts of speech that are predicted to express the characteristics of each document in the database.

[0014] The invention according to claim 7 is characterized in that, in addition to the configuration of any one of claims 1 to 6, words to be excluded from the important words are held as an exclusion list, and after the important words are extracted, the words in the exclusion list are excluded from the important words.

[0015] The invention according to claim 8 is characterized in that, in addition to the configuration of any one of claims 1 to 7, important words having the same meaning are held as a same-word list, and when the important words are extracted, the statistical information of the words in the same-word list is stored together.

[0016] The invention according to claim 9 is characterized in that, in addition to the configuration of any one of claims 1 to 8, the statistical information is the total number of appearances in the database and the proportion of documents in the database that contain the important word.

[0017] The invention according to claim 10 is characterized in that, in addition to the configuration of claim 9, the statistical information uses not only the number of times each important word contained in the documents in the database appears on its own but also the number of times a plurality of important words appear within a certain range.

[0018] The invention according to claim 11 is characterized in that, in addition to the configuration of claim 9, surface expressions contained in the documents in the database are automatically extracted, and the upper/lower hierarchical relationships of important words automatically constructed from those surface expressions are used in addition to the statistical information.

[0019] The invention according to claim 12 is characterized in that, in addition to the configuration of any one of claims 1 to 11, when the statistical information is calculated, a plurality of different search condition expressions are created and set separately on the plurality of different processors of a massively parallel computer, the document group stored in the database is full-text searched simultaneously and in parallel with the plurality of different search condition expressions, and the results that match the search condition expressions are used.

[0020] The automatic related word extraction device according to claim 13 comprises, in addition to the configuration of claim 1: the database unit according to claim 1, which stores a document group in a field designated by the user; an important word analysis unit that extracts and selects the important words contained in the database unit; a count unit that acquires statistical information on the important words selected by the important word analysis unit and information on the upper/lower hierarchical relationships of the important words; and a related word extraction unit that calculates the degree of association between important words using the count list generated by the count unit. The device is characterized in that the automatic related word extraction method according to claim 1 is used for the series of processes.

[0021] The invention according to claim 14 is a plural important word extraction program that automatically extracts plural important words using, in addition to the number of times each important word contained in the documents in the database appears on its own, the number of times a plurality of important words appear within a certain range. The program reads the documents in the database one at a time; searches each document for important words; searches for whether another important word exists within a predefined range of each important word found; and, when such an important word is found, stores the pair of important words in a count list. To do so, it looks up the pair in the count list created so far. If the same pair already exists in the count list, it adds 1 to the appearance count and updates the count list; if the pair does not exist, it newly stores the pair in the count list with a count of 1. The program performs these processes for a plurality of documents designated in advance in the database, and judges the importance of the pairs of important words on the basis of the count list thus created.

[0022] The invention according to claim 15 is an important word upper/lower hierarchical relationship extraction program that automatically extracts surface expressions contained in the documents in the database and uses the upper/lower hierarchical relationships of important words automatically constructed from those surface expressions. The program reads the documents in the database one at a time; extracts from each document the surface expressions listed in a surface expression list prepared in advance; and searches for whether the upper-word part and the lower-word part of each extracted surface expression contain important words extracted by the important word analysis unit 2. When important words are found in both the upper-word part and the lower-word part, the pair of upper and lower important words is stored in a count list. If the same pair already exists in the count list, 1 is added to the appearance count and the count list is updated; if the pair does not exist, it is newly stored in the count list with a count of 1. The program performs these processes for a plurality of documents designated in advance in the database, and constructs the upper/lower hierarchical relationships of the important words on the basis of the count list thus created.

[0023]

[Embodiments of the Invention] The present invention is described in detail below on the basis of the illustrated embodiment.

[0024] FIG. 1 is a block diagram of an automatic related word extraction device according to an embodiment of the present invention.

[0025] The automatic related word extraction device comprises a database unit 1 that stores a document group in a field designated by the user, an important word analysis unit 2 that extracts and selects the important words contained in the database unit 1, a count unit 3 that acquires statistical information on the important words selected by the important word analysis unit 2 and information on the upper/lower hierarchical relationships of the important words, and a related word extraction unit 4 that calculates the degree of association between important words using the count list generated by the count unit 3. The device selects important words, that is, words of high importance, from the documents in the database unit 1, and calculates the degree of association between important words using statistical information on the important words or on pairs of important words.

[0026] The database unit 1 consists of an identical document determination function unit 11 and a database 12. The identical document determination function unit 11 identifies identical documents in the input document group and, when a plurality of identical documents are included, keeps one document and removes the other identical documents. The database 12 stores the documents that remain after the identical document determination function unit 11 has removed the duplicates.

[0027] The identical document determination function unit 11 is described in detail below.

[0028] For example, assuming that the documents in the database 12 are patent documents, the "name of the applicant", the "title of the invention", and the "names of the inventors" are extracted from the header of each patent document, and it is judged whether (1) the "name of the applicant" is the same, (2) the "title of the invention" is the same, and (3) the number of inventors is the same and all of the "names of the inventors" match (in any order). All documents that satisfy conditions (1) to (3) are regarded as identical documents.
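The identity test in conditions (1) to (3) can be sketched as a key comparison. The following is a minimal illustration in Python, not the implementation described in the patent: the `PatentHeader` type, its field names, and the `deduplicate` helper are assumptions introduced here, and the header fields are assumed to have been parsed into strings already.

```python
from typing import List, NamedTuple, Tuple


class PatentHeader(NamedTuple):
    """Hypothetical container for the three header fields named in the text."""
    applicant: str
    title: str
    inventors: Tuple[str, ...]


def identity_key(header: PatentHeader) -> tuple:
    # Sorting the inventor names makes the comparison order-independent,
    # while still requiring the same number of inventors and the same names.
    return (header.applicant, header.title, tuple(sorted(header.inventors)))


def deduplicate(headers: List[PatentHeader]) -> List[PatentHeader]:
    """Keep the first document of each group of identical documents."""
    seen = set()
    kept = []
    for header in headers:
        key = identity_key(header)
        if key not in seen:
            seen.add(key)
            kept.append(header)
    return kept
```

In this sketch two documents with the same applicant, title, and inventors listed in a different order collapse to one, while a document with a different number of inventors is kept.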

[0029] The important word analysis unit 2 consists of a morphological analysis unit 21 and an important word extraction unit 22.

[0030] The morphological analysis unit 21 divides the documents in the database into part-of-speech units by morphological analysis and acquires the part-of-speech information.

[0031] The important word extraction unit 22 creates compound words by applying compound word processing, for example joining consecutive nouns, to the morphemes that the morphological analysis unit 21 has divided into part-of-speech units, and stores each compound word as an important word in an important word list together with its part-of-speech information and statistical information. Creating compound words avoids the loss of specificity that results from splitting words apart, and so improves the accuracy of the related words that are finally extracted.
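As an illustration of the compound word processing mentioned above, the sketch below joins runs of consecutive nouns into a single compound word. It is only a minimal example under stated assumptions: the morphological analyzer output is assumed to be a list of `(surface, part_of_speech)` pairs, the tag name `"noun"` and the function name `build_compounds` are hypothetical, and morphemes are joined without a separator, as is natural for Japanese text.

```python
from typing import List, Tuple


def build_compounds(morphemes: List[Tuple[str, str]]) -> List[str]:
    """Join each run of consecutive nouns into one compound word."""
    compounds = []
    run = []  # surfaces of the current run of consecutive nouns
    for surface, pos in morphemes:
        if pos == "noun":
            run.append(surface)
        elif run:
            # A non-noun ends the current run; emit it as one compound.
            compounds.append("".join(run))
            run = []
    if run:
        compounds.append("".join(run))
    return compounds
```

For example, the morpheme sequence 関連 / 語 / を / 自動 / 抽出 / する, with を tagged as a particle and する as a verb, yields the two compounds 関連語 and 自動抽出.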

[0032] Important words are not limited to the compound words created by the above method. For each genre of document in the database 12, the parts of speech of the words considered to characterize the content of the documents are designated, for example common nouns, proper nouns, and undefined words other than compound words.

[0033] In addition, a function may be added in which words that must always be excluded are held as an exclusion list and, after the important words are extracted, the words in the exclusion list are excluded from the important words. Specifically, for each genre of document in the database, words that cannot characterize the content of the documents may be registered in the exclusion list; for patent documents, for example, "inventor" and "comparative example".

[0034] In addition to words whose exclusion condition is an exact match on each morpheme, the exclusion list may contain words that are excluded whenever they match partially.
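The two kinds of exclusion entry, exact match and partial match, can be sketched as a simple filter. This is a minimal illustration under assumptions: the function name `apply_exclusion_list` and the separation into two sets are introduced here, exact matching is simplified to whole-word equality rather than per-morpheme comparison, and partial matching is taken to mean substring containment.

```python
from typing import Iterable, List, Set


def apply_exclusion_list(words: Iterable[str],
                         exact: Set[str],
                         partial: Set[str]) -> List[str]:
    """Drop words that equal an exact entry or contain a partial entry."""
    kept = []
    for word in words:
        if word in exact:
            continue  # exact-match exclusion
        if any(fragment in word for fragment in partial):
            continue  # partial-match exclusion (substring containment)
        kept.append(word)
    return kept
```

With `exact={"inventor"}` and `partial={"comparative example"}`, both "inventor" and "comparative example 1" are removed, while "inventor list" is kept because it is neither equal to an exact entry nor a superstring of a partial entry.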

[0035] Furthermore, important words having the same meaning can be held as a same-word list, and when the important words are extracted, the statistical information of the words in this list can be stored together, which improves the accuracy of important word extraction.
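One way to picture storing the statistics of same-meaning words together is to map every word in a group to a single representative and sum the counts. The sketch below is an assumption-laden illustration: the function name `merge_same_words`, the choice of the first word in each group as the representative, and the use of plain occurrence counts as the statistic are all introduced here.

```python
from collections import Counter
from typing import Dict, List


def merge_same_words(counts: Dict[str, int],
                     same_word_groups: List[List[str]]) -> Dict[str, int]:
    """Sum the counts of words that belong to the same same-word group."""
    representative = {}
    for group in same_word_groups:
        for word in group:
            # Assumption: the first word of a group stands for the whole group.
            representative[word] = group[0]
    merged = Counter()
    for word, count in counts.items():
        merged[representative.get(word, word)] += count
    return dict(merged)
```

Words that appear in no group keep their own counts unchanged.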

[0036] FIG. 2 is a conceptual diagram of the important word list.

[0037] Here, the "statistical information" to be stored in the important word list consists of the total number of appearances 25 of each important word 23 in the database and the proportion of the number of documents 24 in the database that contain the important word. This information is the basis of the various statistics used later by the count unit 3 and the related word extraction unit 41.
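The two statistics named above can be computed directly from tokenized documents. The following is a minimal sketch, assuming each document is already a list of tokens in which important words appear as single tokens; the function name `important_word_statistics` and the tuple layout of the result are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def important_word_statistics(
        docs: List[List[str]],
        important_words: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Return (total occurrences, proportion of documents containing the word)."""
    stats = {}
    for word in important_words:
        total = sum(doc.count(word) for doc in docs)
        containing = sum(1 for doc in docs if word in doc)
        # Guard against an empty database to avoid division by zero.
        ratio = containing / len(docs) if docs else 0.0
        stats[word] = (total, ratio)
    return stats
```

A word that appears twice in one of three documents gets the pair `(2, 1/3)`: two occurrences in total, found in one third of the documents.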

[0038] To obtain the number of documents in the database 12 that contain each important word, a plurality of different search condition expressions, one for each important word, can be created and set separately on the plurality of different processors of a massively parallel computer. The document group stored in the database 12 is then full-text searched with these search condition expressions simultaneously and in parallel, and the results that match the expressions are used. The number of results matching each search condition expression is the number of documents in the database 12 that contain the corresponding important word. Performing this full-text search each time the important word analysis unit 2 runs keeps the statistical information accurate.
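The idea of evaluating many search conditions against the same document group at once can be imitated on an ordinary machine with a worker pool. The sketch below is only an analogy under assumptions, not a model of the massively parallel hardware: each search condition is reduced to a plain substring test, and the function names and the `workers` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def count_matching_documents(docs: List[str], keyword: str) -> int:
    """Number of documents whose text contains the keyword."""
    return sum(1 for text in docs if keyword in text)


def document_counts(docs: List[str],
                    keywords: List[str],
                    workers: int = 4) -> Dict[str, int]:
    """Evaluate one search condition per keyword, submitted to a worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda kw: count_matching_documents(docs, kw), keywords)
        return dict(zip(keywords, counts))
```

Each keyword plays the role of one search condition expression, and the number of hits per keyword is the document count that feeds the statistics.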

[0039] The massively parallel computer incorporates several thousand to several tens of thousands of processors (hereinafter collectively called a pipeline), so that a plurality of different search condition expressions can be set on the pipeline at the same time. By operating this large number of processors simultaneously, it executes a full-text search that matches the plurality of different search condition expressions against the database. When the matching finds a document that satisfies a search condition expression, the computer treats that document as a hit.

[0040] The massively parallel computer is preferably a device such as a full-text search engine (for example, the FDF (registered trademark) 4T TextFinder made by Paracel), but a device such as a workstation with equivalent functions and performance may also be used.

[0041] The count unit 3 consists of an extraction unit 31 for plural important words within a certain range and an extraction unit 32 for the upper/lower hierarchical relationships of important words.

[0042] In the automatic related word extraction method, the user selects in advance either the processing of the extraction unit 31 for plural important words within a certain range or the processing of the extraction unit 32 for the upper/lower hierarchical relationships of important words, and only the selected processing is performed.

[0043] The extraction unit 31 for plural important words within a certain range takes each important word extracted by the important word analysis unit 2 as a reference point, defines as "plural important words" the case in which another important word exists within a predefined range of that reference point, counts the number of appearances of such plural important words, and stores the result as a count list. The procedure for extracting plural important words is shown in the flowchart of FIG. 5, and its details are described later.
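The counting performed by the extraction unit 31 can be sketched as a pass over each document that records every pair of important words lying within the predefined range of each other. This is a minimal illustration under assumptions the text does not fix: the range is measured in tokens, pairs are treated as unordered, a word is not paired with itself, and the function name `count_pairs_in_range` is hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def count_pairs_in_range(docs: List[List[str]],
                         important_words: Set[str],
                         max_distance: int) -> Counter:
    """Count pairs of important words at most max_distance tokens apart."""
    count_list: Counter = Counter()
    for tokens in docs:  # read the documents one at a time
        hits = [(i, t) for i, t in enumerate(tokens) if t in important_words]
        for n, (i, first) in enumerate(hits):
            for j, second in hits[n + 1:]:
                if j - i > max_distance:
                    break  # later hits are even farther away
                if first != second:
                    pair: Tuple[str, str] = tuple(sorted((first, second)))
                    # Counter adds 1 to an existing pair or starts a new one at 1.
                    count_list[pair] += 1
    return count_list
```

The resulting mapping from pair to count corresponds to the count list: a pair seen for the first time is stored with a count of 1, and each later sighting adds 1.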

[0044] The extraction unit 32 for the upper/lower hierarchical relationships of important words uses surface expressions, defined in advance, in which the relationship between an upper word (hypernym) and a lower word (hyponym) is expressed explicitly, and extracts the occurrences of those surface expressions that contain important words extracted by the important word analysis unit 2. The important words in each extracted surface expression are treated as an upper important word and a lower important word, their appearances are counted, and the result is stored as a count list. The procedure for extracting the upper/lower hierarchical relationships of important words is shown in the flowchart of FIG. 6, and its details are described later.
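Extraction from predefined surface expressions can be sketched with regular expressions that mark which part is the upper word and which is the lower word. The patterns below are English stand-ins chosen purely for illustration; the patent does not list its surface expressions in this section, so `SURFACE_PATTERNS`, the named groups, and the function name are all assumptions.

```python
import re
from collections import Counter
from typing import List, Set

# Hypothetical surface expressions; each names its upper and lower parts.
SURFACE_PATTERNS = [
    re.compile(r"(?P<upper>\w+) such as (?P<lower>\w+)"),
    re.compile(r"(?P<lower>\w+) is a kind of (?P<upper>\w+)"),
]


def count_hierarchy_pairs(docs: List[str], important_words: Set[str]) -> Counter:
    """Count (upper, lower) pairs where both parts are important words."""
    count_list: Counter = Counter()
    for text in docs:  # read the documents one at a time
        for pattern in SURFACE_PATTERNS:
            for match in pattern.finditer(text):
                upper, lower = match.group("upper"), match.group("lower")
                # Keep the pair only if both parts are important words.
                if upper in important_words and lower in important_words:
                    count_list[(upper, lower)] += 1
    return count_list
```

Unlike the pairs counted within a range, these pairs are ordered: the first element is the upper word and the second is the lower word.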

[0045] In FIG. 1, the related word extraction unit 4 consists of a related word extraction unit 41. The related word extraction unit 41 judges related words on the basis of the count list created by the count unit 3.

[0046] For the related word judgment, a judgment index such as Information Radius, which measures the dissimilarity of two words (Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press (MANFH0-262-13360-1)), can be used. The judgment is not limited to this index, however. For example, when the extraction unit 31 for plural important words within a certain range has been selected, pairs of important words that share a common important word within the certain range can be judged to be related words; when the extraction unit 32 for the upper/lower hierarchical relationships of important words has been selected, pairs of important words that share a common lower important word can be judged to be related words.
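The simpler alternative, judging two important words to be related when they share a common neighbouring important word, can be sketched from a count list of pairs. This is a minimal illustration under assumptions: the count list is taken to be a mapping from unordered pairs to counts, any single shared neighbour is enough, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def related_by_shared_neighbour(
        count_list: Dict[Tuple[str, str], int]) -> List[Tuple[str, str]]:
    """Return word pairs that co-occur with at least one common word."""
    neighbours: Dict[str, Set[str]] = {}
    for first, second in count_list:
        neighbours.setdefault(first, set()).add(second)
        neighbours.setdefault(second, set()).add(first)
    related = []
    for a, b in combinations(sorted(neighbours), 2):
        if neighbours[a] & neighbours[b]:  # at least one shared neighbour
            related.append((a, b))
    return related
```

For the hierarchical case the same idea applies with ordered pairs, collecting for each upper word its set of lower words and comparing those sets.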

[0047] FIG. 3 is a conceptual diagram of the count list. The count list has as its list items the ID 33 of important word 1, the ID 34 of important word 2, and the number of appearances 35 of the pair of important word 1 and important word 2.

[0048] FIG. 4 is a conceptual diagram of the association degree judgment list created from the count list of FIG. 3 and the important word list of FIG. 2.

[0049] The words A, B, C, D, ... placed in the columns of FIG. 4 are the set of related word judgment target words (important words) 42, that is, the words whose relatedness is judged. The words a, b, c, d, ... placed in the rows are the related word judgment use words (important words) 43, that is, the words used to make the judgment. Basically, both the columns and the rows hold important words extracted by the important word analysis unit 2: one member of each important word pair extracted by the count unit 3 is placed in a column and the other in a row. For example, for a pair of important words existing within a certain range as in FIG. 5, important word A is placed in a column and important word B in a row. For a pair of upper and lower important words as in FIG. 6, the upper important word is placed in a column and the lower important word in a row. In the association degree judgment list of FIG. 4, the number in each cell represents an appearance probability. For example, the cell at row c, column A represents "the probability that important word A and important word c appear within a certain range" or "the probability that important word A is the upper word and important word c is the lower word".

[0050] As one example of the related word judgment, a judgment that uses the Information Radius index to measure the dissimilarity of two words is described below.

[0051] The statistic is the "dissimilarity of two words," calculated from these appearance probabilities, and it is computed for every pair of the uppercase-letter words arranged in the columns (A and B, A and C, A and D, ..., B and C, B and D, ..., C and D, ...). Taking the determination of the degree of association between important word A and important word D as an example, the difference between the appearance probabilities of a, b, c, d, ... for A and the appearance probabilities of a, b, c, d, ... for D is calculated as the dissimilarity. If the appearance probabilities are equal in every row (row a column A = row a column D, row b column A = row b column D, row c column A = row c column D, row d column A = row d column D, ...), the dissimilarity is 0, that is, the similarity between A and D is at its maximum, and therefore the degree of association between important word A and important word D is at its maximum. Conversely, if there is no word a, b, c, d, ... for which both appearance probabilities are nonzero, the dissimilarity is at its maximum, that is, the degree of association is at its minimum. In this way, the statistic is calculated for every pair of uppercase-letter words, and only pairs whose statistic is at or below a certain threshold are determined to be mutually related words (related words).
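
The determination described in paragraph [0051] can be illustrated with a short sketch. The text names the Information Radius but does not give its formula, so the Python below assumes the standard definition IRad(p, q) = D(p || m) + D(q || m) with m = (p + q) / 2 and base-2 logarithms, and it further assumes that each column of the determination list is normalised to sum to 1. The names `information_radius` and `related_pairs` are illustrative and do not appear in the original.

```python
import math


def information_radius(p, q):
    """Information Radius (base 2) between two columns of the determination list.

    `p` and `q` map each row word (a, b, c, ...) to its appearance probability
    for one column word. Normalising each column to sum to 1 is an assumption.
    """
    def normalise(dist):
        total = sum(dist.values())
        if total == 0:
            raise ValueError("column has no probability mass")
        return {word: value / total for word, value in dist.items()}

    p, q = normalise(p), normalise(q)
    irad = 0.0
    for word in set(p) | set(q):
        pk, qk = p.get(word, 0.0), q.get(word, 0.0)
        mk = (pk + qk) / 2.0          # mean distribution m = (p + q) / 2
        if pk > 0:
            irad += pk * math.log2(pk / mk)
        if qk > 0:
            irad += qk * math.log2(qk / mk)
    return irad


def related_pairs(columns, threshold):
    """Return the column-word pairs whose dissimilarity is at or below `threshold`."""
    names = sorted(columns)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if information_radius(columns[x], columns[y]) <= threshold
    ]
```

Under these assumptions, identical columns give a dissimilarity of 0 and columns that share no nonzero row give the maximum value of 2, which matches the two extremes described above.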

[0052] FIG. 5 is a flowchart showing the procedure for counting the number of simultaneous appearances of a plurality of important words existing within a certain range in the related word automatic extraction method according to the embodiment of the present invention.

[0053] First, the documents in the database are read one at a time (step S1), and each document is searched for the important words extracted by the important word analysis unit 2 (step S2).

[0054] The important words to be searched for here are not limited to those extracted by the important word analysis unit 2; in some cases they may be words included in a user-defined important word list defined in advance by the user. In addition to words whose search condition is an exact match, the user-defined important word list may include words that are treated as search targets when they match partially.

[0055] Furthermore, as measures for judging the importance of the words to be searched for, the total number of appearances in the database, the proportion of documents in the database that contain the important word, and the number of characters may be applied, as needed, as filters on the important words to be searched for. Applying these filters narrows down the important words further, which in turn improves the accuracy of the related words finally extracted.

[0056] When an important word is found (YES in step S3), a search is made as to whether another important word (called important word B) exists within a predefined range of the found important word (called important word A) (step S4).

[0057] The certain range is defined, for example, as covering important words that lie within one sentence (from the beginning of the sentence to the full stop "。") and within two positions before or after the important word. The range is not limited to this, however; a range that is expected to represent the characteristics of each document in the database is specified.

[0058] When an important word B existing within the certain range of important word A is found (YES in step S5), the pair of important word A and important word B is stored sequentially in the count list.

[0059] The pair of important word A and important word B is searched for in the count list created so far (step S6). If the same pair already exists in the count list (YES in step S7), 1 is added to its appearance count and the count list is updated (step S8).

[0060] If the pair does not exist in the count list (NO in step S7), the pair of important word A and important word B is newly stored in the count list with a count of 1 (step S9).

[0061] The processing of steps S1 to S9 above is performed for a plurality of documents designated in advance in the database (step S10).
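
Steps S1 to S10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: sentences are split on "。" or ".", the "certain range" is read as "same sentence, at most `max_gap` important-word positions apart", pairs are treated as unordered, and the function name `count_cooccurrences` is hypothetical.

```python
import re
from collections import Counter


def count_cooccurrences(documents, important_words, max_gap=2):
    """Build a count list of important-word pairs found within the range."""
    count_list = Counter()
    # Longest-first alternation so that a longer important word is preferred
    # over a shorter one that is its prefix.
    pattern = re.compile("|".join(
        re.escape(w) for w in sorted(important_words, key=len, reverse=True)))
    for document in documents:                        # S1: read one document at a time
        for sentence in re.split(r"[。.]", document):
            found = pattern.findall(sentence)         # S2/S3: search for important words
            for i, word_a in enumerate(found):
                # S4/S5: look for another important word within the range
                for word_b in found[i + 1:i + 1 + max_gap]:
                    if word_a == word_b:
                        continue
                    pair = tuple(sorted((word_a, word_b)))
                    # S6 to S9: a Counter adds 1 to an existing pair and
                    # starts a new pair at 1.
                    count_list[pair] += 1
    return count_list                                 # S10: all documents processed
```

A `Counter` covers both branches of step S7 in a single statement, which is why the sketch has no explicit "already in the list" test.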

[0062] After that, the importance of the pair of important word A and important word B is determined based on the statistical information in the count list created in steps S1 to S10 and in the important word list (step S11). In step S11, for example, the Dice coefficient or mutual information can be used.
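
As a hedged illustration of step S11, the two measures named above can be computed from the counts as shown below. The text does not state which form of mutual information is intended; the sketch assumes pointwise mutual information with base-2 logarithms, with probabilities estimated as count / `total`, where `total` is an assumed number of counting units (for example, sentences). The function names are illustrative.

```python
import math


def dice_coefficient(pair_count, count_a, count_b):
    """Dice coefficient 2 * f(A, B) / (f(A) + f(B))."""
    if count_a + count_b == 0:
        return 0.0
    return 2.0 * pair_count / (count_a + count_b)


def pointwise_mutual_information(pair_count, count_a, count_b, total):
    """log2(P(A, B) / (P(A) * P(B))) with P estimated as count / total."""
    if pair_count == 0 or count_a == 0 or count_b == 0:
        return float("-inf")
    return math.log2(pair_count * total / (count_a * count_b))
```

Here `pair_count` would come from the count list and `count_a`, `count_b` from the single-word statistics in the important word list.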

[0063] FIG. 6 is a flowchart showing the procedure for extracting the upper/lower hierarchical relationships of important words in the related word automatic extraction method according to the embodiment of the present invention.

[0064] First, the documents in the database are read one at a time (step S21), and the surface expressions written in a surface expression list prepared in advance are extracted from each document (step S22).

[0065] The surface expressions to be written in the surface expression list are those in which the relationship between a superordinate word and its subordinate words is expressed explicitly. For example, in the expression "Ａ、Ｂ、Ｃ等のＤ" ("D such as A, B, and C," where A to D are each important words), the superordinate word is D and the subordinate words are A, B, and C.
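
The example expression can be matched with a regular expression, as in the sketch below. It assumes a surface expression list with a single entry (the "…等の…" pattern from the example), splits the subordinate part on "、", and uses substring matching to decide whether a part contains an important word. `SURFACE_PATTERN` and `extract_upper_lower_pairs` are hypothetical names, and the sample sentence in the usage note is invented for illustration.

```python
import re

# Assumed single-entry surface expression list: "<subordinate words>等の<superordinate word>".
SURFACE_PATTERN = re.compile(r"(?P<lower>[^。]+?)等の(?P<upper>[^。、]+)")


def extract_upper_lower_pairs(sentence, important_words):
    """Return (upper, lower) important-word pairs found via the surface pattern.

    `important_words` is an ordered list; the first word contained in a part wins.
    """
    pairs = []
    for match in SURFACE_PATTERN.finditer(sentence):
        # Does the superordinate part contain an important word?
        upper = next((w for w in important_words if w in match.group("upper")), None)
        if upper is None:
            continue
        # Check each comma-separated subordinate part in turn.
        for part in match.group("lower").split("、"):
            lower = next((w for w in important_words if w in part), None)
            if lower is not None and lower != upper:
                pairs.append((upper, lower))
    return pairs
```

For the sentence "衛星、ロケット、探査機等の宇宙機を開発する。" and the important words 衛星, ロケット, 探査機, and 宇宙機, the sketch yields three pairs with 宇宙機 as the upper word.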

[0066] Next, when a surface expression has been extracted in step S22 (YES in step S23), the superordinate-word part and the subordinate-word part of the surface expression are searched to determine whether they contain important words extracted by the important word analysis unit 2 (step S24).

[0067] The important words to be searched for here are not limited to those extracted by the important word analysis unit 2; in some cases they may be words included in a user-defined important word list defined in advance by the user. In addition to words whose search condition is an exact match, the user-defined important word list may include words that are treated as search targets when they match partially.

[0068] If this search finds important words in both the superordinate-word part and the subordinate-word part (YES in step S25), the upper/lower important word pair that was found is stored sequentially in the count list. At this time, the following may be applied as needed as measures for judging the importance of the upper/lower important word pair: a comparison of the proportions of documents in the database 12 that contain the upper important word and the lower important word; a comparison of the morphemes of the upper important word and the lower important word; and a function that holds the upper/lower important word pairs that must always be excluded as an upper/lower important word pair exclusion list and excludes the pairs in that list.

[0069] The upper/lower important word pair is searched for in the count list created so far (step S26). If the same pair already exists in the count list (YES in step S27), 1 is added to its appearance count and the count list is updated (step S28).

[0070] If the pair does not exist in the count list (NO in step S27), the upper/lower important word pair is newly stored in the count list with a count of 1 (step S29).

[0071] The processing of steps S21 to S29 above is performed for a plurality of documents designated in advance in the database (step S30).

[0072] After that, the upper/lower hierarchical relationships of the important words are constructed based on the statistical information in the count list created in steps S21 to S30 and in the important word list (step S31).

[0073] Specifically, suppose for example that upper important words A and B sharing a common lower important word C have been extracted, and that the pair of upper important word A and lower important word B has also been extracted. Viewed as a whole, the only pairs in a direct upper/lower relationship are the A (upper)-B (lower) pair and the B (upper)-C (lower) pair; the A (upper)-C (lower) pair is merely redundant. Therefore, the redundant A-C pair is excluded when the upper/lower hierarchical relationships of the important words are constructed.
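
The exclusion of the redundant A-C pair can be sketched as follows. The sketch covers only the case in the example, where the redundant pair is implied by a chain through exactly one intermediate word; longer chains and cyclic input are not handled. The function name `remove_redundant_pairs` is illustrative.

```python
def remove_redundant_pairs(pairs):
    """Drop (upper, lower) pairs that are implied by a two-step chain.

    If (A, B) and (B, C) are both present, the pair (A, C) is treated as redundant.
    """
    pair_set = set(pairs)
    # Index the lower words of each upper word for the chain lookup.
    lowers_of = {}
    for upper, lower in pair_set:
        lowers_of.setdefault(upper, set()).add(lower)
    redundant = {
        (upper, lower)
        for upper, middle in pair_set
        for lower in lowers_of.get(middle, ())
        if (upper, lower) in pair_set
    }
    return pair_set - redundant
```

With the pairs (A, B), (B, C), and (A, C) as input, only (A, B) and (B, C) remain.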

[0074] In addition, when the upper/lower hierarchical relationships are constructed, a threshold may be set on the total number of appearances of an upper/lower important word pair in the database, and pairs whose count is below the threshold may be excluded as needed.
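
A minimal sketch of this optional threshold, assuming the count list is a mapping from (upper, lower) pairs to total appearance counts; the function name is hypothetical.

```python
def filter_by_total_count(count_list, threshold):
    """Keep only the pairs whose total appearance count is at least `threshold`."""
    return {pair: n for pair, n in count_list.items() if n >= threshold}
```

Pairs whose count equals the threshold are kept, since the text excludes only pairs below it.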

[0075]

[Effects of the Invention] As described above, according to the invention of claim 1, a document group in a field designated by the user is used as a database, important words, which are words of high importance, are selected from the documents in the database, and the degree of association between important words is calculated using statistical information on the important words or pairs of important words in order to extract related words. It is therefore possible to provide a method, and a device using the method, for automatically extracting technical terms that appear in the specific field designated by the user, as well as new words and buzzwords, that are not listed in a general thesaurus dictionary.

[0076] It is also possible to provide a method, and a device using the method, for extracting important words closely related to a word designated by the user with high accuracy and precision.

[0077] According to the invention of claim 2, in addition to the effect of claim 1, related words can be extracted automatically for each field when document groups of a plurality of fields are accumulated in the database. This makes it possible to extract field-specific related words, for example a word that is a related word of a given word in one field but not in another. Also, since the user can define fields independently of the fields of existing thesaurus dictionaries, related words can be extracted at the level of the fields the user has set.

[0078] According to the invention of claim 3, in addition to the effect of claim 1 or 2, the database can be updated and added to at any time, and by reflecting the differential data successively, it is possible to extract the latest related words, including new words and buzzwords, that always reflect the latest information in the database.

[0079] According to the invention of claim 4, in addition to the effect of any one of claims 1 to 3, the documents in the database are those obtained by using document header information to determine whether documents are identical and, when a plurality of identical documents are included, keeping one document and removing the other identical documents. This removes the unwanted bias in the statistical information that arises when a particular document has many identical copies, and as a result the related word extraction accuracy can be improved.
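
The header-based removal of identical documents can be sketched as below. The text does not say which header fields are compared; the sketch assumes each document is a dictionary with a `header` mapping and compares the hypothetical fields `title` and `date`.

```python
def remove_duplicate_documents(documents, header_fields=("title", "date")):
    """Keep the first document for each distinct header key and drop the rest."""
    seen = set()
    unique = []
    for doc in documents:
        # Documents with the same values in the chosen header fields are
        # treated as the same document (an assumption of this sketch).
        key = tuple(doc["header"].get(field) for field in header_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(doc)
    return unique
```

Keeping the first occurrence preserves the original document order, so the later counting steps see each document exactly once.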

[0080] According to the invention of claim 5, in addition to the effect of any one of claims 1 to 4, the important words are compound words created from the morphemes obtained by dividing the documents in the database into part-of-speech units. The abstraction of words caused by the division can therefore be avoided, and the accuracy of the related words finally extracted can be improved.

[0081] According to the invention of claim 6, in addition to the effect of any one of claims 1 to 5, the important words are taken to be the parts of speech that are expected to represent the characteristics of each document in the database, so omissions among the extracted important words can be reduced.

[0082] According to the invention of claim 7, in addition to the effect of any one of claims 1 to 6, words to be excluded from the important words are held as an exclusion list, and the words in the exclusion list are excluded from the important words after important word extraction, so unnecessary words can be eliminated.

[0083] According to the invention of claim 8, in addition to the effect of any one of claims 1 to 7, important words having the same meaning are held as a same-word list, and the statistical information of the words in the same-word list is stored together when important words are extracted, so the accuracy of important word extraction can be improved.
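
Storing the statistics of same-meaning words together can be sketched as follows. The sketch assumes the same-word list is a list of groups and that the first word of each group serves as the representative under which counts are merged; that choice and the function name `merge_same_words` are assumptions.

```python
from collections import Counter


def merge_same_words(word_counts, same_word_list):
    """Merge the appearance counts of words that the same-word list treats as one."""
    # Map every word in a group to the group's first word.
    canonical = {}
    for group in same_word_list:
        for word in group:
            canonical[word] = group[0]
    merged = Counter()
    for word, n in word_counts.items():
        merged[canonical.get(word, word)] += n
    return merged
```

Words that appear in no group keep their own counts unchanged.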

[0084] According to the invention of claim 9, in addition to the effect of any one of claims 1 to 8, the statistical information consists of the total number of appearances in the database and the proportion of documents in the database that contain the important word, so the extraction accuracy can be improved.

[0085] According to the invention of claim 10, in addition to the effect of claim 9, the statistical information uses not only the number of times each important word contained in the documents in the database appears on its own but also the number of appearances of a plurality of important words within a certain range. The meaning conveyed by pairs of important words can therefore be captured more accurately, and as a result the related word extraction accuracy can be improved.

[0086] According to the invention of claim 11, in addition to the effect of claim 9, besides the statistical information, surface expressions contained in the documents in the database are extracted automatically, and the upper/lower hierarchical relationships of important words constructed automatically from those surface expressions are used. This removes the noise caused by the accidental co-occurrence of mutually unrelated important words, and as a result the related word extraction accuracy can be improved.

[0087] According to the invention of claim 12, in addition to the effect of any one of claims 1 to 11, when the statistical information is calculated, a plurality of different search condition expressions are created; the search condition expressions are set separately on the plurality of different processors of a massively parallel computer having a plurality of different processors; the document group accumulated in the database is full-text searched with the plurality of different search condition expressions simultaneously and in parallel; and the results matching the search condition expressions are used. Accurate statistical information corresponding to the latest database can therefore be used each time the related word automatic extraction method is applied, and as a result the related word extraction accuracy can be improved.

[0088] According to the invention of claim 13, the related word automatic extraction device comprises the database unit of claim 1, which stores a document group in a field designated by the user; an important word analysis unit that extracts and selects important words contained in the database unit; a counting unit that acquires statistical information on the important words selected by the important word analysis unit and upper/lower hierarchical relationship information on the important words; and a related word extraction unit that calculates the degree of association between important words using the count list generated by the counting unit. Because the related word automatic extraction method of claim 1 is used for the series of processes, the user can accurately extract the desired related words, such as technical terms, new words, and buzzwords, without being aware of the internal structure of the device.

[0089] According to the invention of claim 14, in a multiple important word extraction program that automatically extracts multiple important words by using not only the number of times each important word contained in the documents in the database appears on its own but also the number of appearances of a plurality of important words within a certain range, the documents in the database are read one at a time; each document is searched for important words; a search is made as to whether another important word exists within a predefined range of a found important word; when such an important word is found, the pair of important words is stored sequentially in the count list; the pair is searched for in the count list created so far; if the same pair already exists in the count list, 1 is added to its appearance count and the count list is updated; if it does not exist, the pair is newly stored in the count list with a count of 1; this processing is performed for a plurality of documents designated in advance in the database; and the importance of each pair of important words is determined based on the resulting count list. The meaning conveyed by pairs of important words can therefore be captured in a rational manner, and as a result the related word extraction accuracy can be improved.

[0090] According to the invention of claim 15, in an important word upper/lower hierarchical relationship extraction program that automatically extracts surface expressions contained in the documents in the database and uses the upper/lower hierarchical relationships of important words constructed automatically from those surface expressions, the documents in the database are read one at a time; the surface expressions written in a surface expression list prepared in advance are extracted from each document; the superordinate-word part and the subordinate-word part of each extracted surface expression are searched to determine whether they contain important words extracted by the important word analysis unit; when important words are found in both the superordinate-word part and the subordinate-word part, the upper/lower important word pair that was found is stored sequentially in the count list; if the same pair already exists in the count list, 1 is added to its appearance count and the count list is updated; if it does not exist, the pair is newly stored in the count list with a count of 1; this processing is performed for a plurality of documents designated in advance in the database; and the upper/lower hierarchical relationships of the important words are constructed based on the resulting count list. The noise caused by the accidental co-occurrence of mutually unrelated important words can therefore be removed in a rational manner.

[Brief description of drawings]

FIG. 1 is a block diagram of the related word automatic extraction device according to an embodiment of the present invention.

FIG. 2 is a conceptual diagram of the important word list used in the related word automatic extraction device according to the embodiment.

FIG. 3 is a conceptual diagram of the count list used in the related word automatic extraction device according to the embodiment.

FIG. 4 is a conceptual diagram of the degree-of-association determination list created from the count list of FIG. 3 and the important word list of FIG. 2.

FIG. 5 is a flowchart showing the procedure for extracting a plurality of important words within a certain range in the related word automatic extraction method according to the embodiment.

FIG. 6 is a flowchart showing the procedure for extracting the upper/lower hierarchical relationships of important words in the related word automatic extraction method according to the embodiment.

[Explanation of symbols]

1: database unit; 2: important word analysis unit; 3: counting unit; 4: related word extraction unit

Continued from front page

(72) Inventor: Naoko Yoshino, c/o Kamakura Division, Mitsubishi Space Software Co., Ltd., 792 Kamimachiya, Kamakura-shi, Kanagawa
(72) Inventor: Kazuko Adachi, c/o Kamakura Division, Mitsubishi Space Software Co., Ltd., 792 Kamimachiya, Kamakura-shi, Kanagawa
F-terms (reference): 5B075 ND03 NK33 NK35 NR05 PQ32 PR04 PR08 QP03

Claims

[Claims]

1. A related word automatic extraction method, characterized in that a document group in a field designated by a user is used as a database, important words, which are words of high importance, are selected from the documents in the database, and the degree of association between important words is calculated using statistical information, for the important words or pairs of important words, on the words contained in the database, whereby related words are extracted.

2. The related word automatic extraction method according to claim 1, wherein, when document groups of a plurality of fields are accumulated in the database, related words can be extracted automatically for each field.

3. The related word automatic extraction method according to claim 1 or 2, wherein the database can be updated and added to at any time, and the differential data is successively reflected in the automatic extraction of related words.

4. The related word automatic extraction method according to claim 1, wherein the document group in the database is one obtained by using document header information to determine whether documents are identical and, when a plurality of identical documents are included, keeping one document and removing the other identical documents.

5. The related word automatic extraction method according to claim 1, wherein the important words are compound words created from the morphemes obtained by dividing the documents in the database into part-of-speech units.

6. The related word automatic extraction method according to claim 1, wherein the important word is a part of speech predicted to represent a feature for each document in the database.

7. The related word automatic extraction method according to claim 1, wherein words to be excluded from the important words are held as an exclusion list, and the words in the exclusion list are excluded from the important words after the important words are extracted.

8. The related word automatic extraction method according to claim 1, wherein important words having the same meaning are held as a same-word list, and the statistical information of the words in the same-word list is stored together when the important words are extracted.

9. The related word automatic extraction method according to claim 1, wherein the statistical information is the total number of appearances in the database and the proportion of documents in the database that contain the important word.

10. The related word automatic extraction method according to claim 9, wherein the statistical information uses, in addition to the number of times each important word contained in the documents in the database appears on its own, the number of appearances of a plurality of important words within a certain range.

11. The related word automatic extraction method according to claim 9, wherein, in addition to the statistical information, surface expressions contained in the documents in the database are extracted automatically, and the upper/lower hierarchical relationships of important words constructed automatically from the surface expressions are used.

12. The related word automatic extraction method according to any one of claims 1 to 11, wherein, when the statistical information is calculated, a plurality of different search condition expressions are created, the plurality of different search condition expressions are set separately on the plurality of different processors of a massively parallel computer having a plurality of different processors, the document group accumulated in the database is full-text searched with the plurality of different search condition expressions simultaneously and in parallel, and the results matching the search condition expressions are used.

13. A related word automatic extraction device comprising: the database unit according to claim 1, which stores a document group in a field designated by a user; an important word analysis unit that extracts and selects important words contained in the database unit; a counting unit that acquires statistical information on the important words selected by the important word analysis unit and upper/lower hierarchical relationship information on the important words; and a related word extraction unit that calculates the degree of association between important words using the count list generated by the counting unit, wherein the related word automatic extraction method according to claim 1 is used for the series of processes.

14. A multiple important word extraction program for automatically extracting multiple important words by using, in addition to the number of times each important word contained in the documents in a database appears on its own, the number of appearances of a plurality of important words within a certain range, the program being characterized by: reading the documents in the database one at a time; searching each document for important words; searching whether another important word exists within a predefined range of a found important word; when an important word existing within the certain range is found, storing the pair of important words sequentially in a count list; searching for the pair of important words in the count list created so far; if the same pair of important words already exists in the count list, adding 1 to its appearance count and updating the count list; if it does not exist in the count list, newly storing the pair of important words in the count list with a count of 1; performing this processing for a plurality of documents designated in advance in the database; and determining the importance of each pair of important words based on the resulting count list.

15. An important word upper/lower hierarchical relationship extraction program that automatically extracts surface expressions contained in the documents in a database and uses the upper/lower hierarchical relationships of important words constructed automatically from the surface expressions, the program being characterized by: reading the documents in the database one at a time; extracting from each document the surface expressions written in a surface expression list prepared in advance; searching whether the superordinate-word part and the subordinate-word part of each extracted surface expression contain important words extracted by the important word analysis unit 2; when important words are found in both the superordinate-word part and the subordinate-word part, storing the upper/lower important word pair that was found sequentially in a count list; if the same pair of important words already exists in the count list, adding 1 to its appearance count and updating the count list; if it does not exist in the count list, newly storing the upper/lower important word pair in the count list with a count of 1; performing this processing for a plurality of documents designated in advance in the database; and constructing the upper/lower hierarchical relationships of the important words based on the resulting count list.