JP2003256447A

JP2003256447A - Related term extraction method and device

Info

Publication number: JP2003256447A
Application number: JP2002050415A
Authority: JP
Inventors: Kyoji Umemura; 恭司梅村
Original assignee: Individual
Current assignee: Individual
Priority date: 2002-02-26
Filing date: 2002-02-26
Publication date: 2003-09-12

Abstract

PROBLEM TO BE SOLVED: To solve the problem that it is difficult to extract related terms from new text that includes unknown words.

SOLUTION: A pre-processing part 50 performs pre-processing that creates a double-word list Bi(α) with reference to a document file 26. A first processing part 52 performs first processing that generates a list F(a) of prefix words x and a list B(a) of postfix words y for a word of interest a. A second processing part 54 performs second processing that generates, with reference to the double-word list Bi(α), a set BF(a) of postfix words for each prefix word x and a set FB(a) of prefix words for each postfix word y. A third processing part 56 performs third processing that extracts candidate related-term pairs (a, b) from the elements common to the postfix word set BF(a) and the prefix word set FB(a).

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001] [Field of the Invention] The present invention relates to a technique for extracting related words, and more particularly to a method and apparatus for extracting related words contained in documents in order to construct a thesaurus.

[0002] [Description of the Related Art] With the spread of the Internet, databases of various kinds are increasingly used in networked environments, and practical applications such as electronic libraries built on document databases are also spreading. To use a database that stores a large number of such electronic documents efficiently, it is necessary to extract appropriate keywords from the stored documents and either index the documents for easy retrieval or classify them based on the keywords.

[0003] Attempts have also been made to enable similarity search and concept search by using a thesaurus in which the related words of keywords are registered, so that documents related to a keyword can be retrieved even if they do not directly contain the keyword the user specified at search time.

[0004] [Problems to be Solved by the Invention] In magazine and newspaper articles and in WWW (World Wide Web) text, coined words and new words that express new concepts are increasing every day. Text containing such new words often describes the latest information and is therefore highly valuable to users. However, because such new words are rarely registered in dictionary databases, at present they cannot be used effectively as search terms in Internet search engines or document databases.

[0005] Registering each new word in a dictionary database as it appears is laborious, and a large-capacity storage medium is needed to accommodate the new words. Moreover, when a thesaurus is constructed by judging the similarity between such new words and other keywords, computational cost becomes a problem as the number of words covered by the thesaurus grows.

[0006] The present invention has been made in view of these circumstances, and its object is to provide a method and an apparatus capable of efficiently extracting related words from documents, regardless of whether the words are unknown or known.

[0007] [Means for Solving the Problems] One aspect of the present invention relates to a related word extraction device. The device includes: a document database that stores a set of documents and a related word dictionary; a candidate selection unit that selects, as a candidate pair of related words, two different words that are each preceded and followed by the same words in the documents; a determination unit that determines the degree of association of the candidate pair based on the similarity of the character strings adjoining those words before and after; and a registration unit that registers candidate pairs with a high degree of association in the document database as the related word dictionary.

[0008] The "document" referred to here may be in any language. "Document" is a broad concept covering any file in which an arbitrary symbol string is recorded. It includes documents whose characters or grammar have not necessarily been deciphered, such as ancient documents and encrypted documents, as well as sequences of arbitrary symbol strings such as gene sequences. Accordingly, a "character string" contained in a document is not limited to the alphabet of a language and includes any symbol string that belongs to a single symbol system.

[0009] A "word" is a character string extracted from such a document. Words include new words, such as coined words and neologisms, and unknown words, and word extraction does not necessarily presuppose the use of a dictionary of known words. In some cases, as with ancient documents and encrypted documents, no dictionary exists in the first place. In such cases, words may be extracted by a statistical method that uses, for example, the occurrence frequency of character strings.

[0010] The candidate selection unit may include: a first processing unit that extracts from the document set a list of prefix words that immediately precede the word of interest in the documents and a list of postfix words that immediately follow the word of interest; a second processing unit that extracts from the document set, for each prefix word in the prefix word list, the set of postfix words that follow it in the documents and, for each postfix word in the postfix word list, the set of prefix words that precede it in the documents; and a third processing unit that extracts the words contained in both the postfix word set and the prefix word set as related word candidates for the word of interest.
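The three processing stages can be illustrated with a minimal sketch. It assumes the documents have already been reduced to adjacent word pairs; the function name `related_candidates` and the in-memory dictionaries are illustrative choices, not part of the patent. The names F, B, BF, and FB follow the abstract. Excluding the word of interest itself from the result is also an assumption, based on the requirement that a candidate pair consist of two different words.

```python
from collections import defaultdict

def related_candidates(bigrams, a):
    """Return related word candidates for the word of interest `a`.

    `bigrams` is an iterable of (left, right) pairs of adjacent words.
    """
    followers = defaultdict(set)  # word -> words seen immediately after it
    preceders = defaultdict(set)  # word -> words seen immediately before it
    for left, right in bigrams:
        followers[left].add(right)
        preceders[right].add(left)

    # First processing: prefix word list F(a) and postfix word list B(a).
    F = preceders.get(a, set())
    B = followers.get(a, set())

    # Second processing: BF(a) collects the words that follow any prefix word,
    # FB(a) collects the words that precede any postfix word.
    BF = set().union(*(followers.get(x, set()) for x in F))
    FB = set().union(*(preceders.get(y, set()) for y in B))

    # Third processing: words common to both sets (a itself is excluded).
    return (BF & FB) - {a}
```

For example, with the pairs ("parallel", "computer"), ("computer", "producer"), ("parallel", "calculator"), and ("calculator", "producer"), the only candidate returned for "computer" is "calculator", because it shares both a prefix word and a postfix word with the word of interest.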

[0011] The number of possible combinations of words that could form a candidate pair of related words is on the order of K squared, where K is the number of words. However, once the first processing unit has extracted the list of prefix words preceding the word of interest and the list of postfix words following it, the number of elements in those lists is of an order that can be handled practically for the word of interest. The postfix word set and prefix word set subsequently generated by the second processing unit therefore rarely become large either, and related word candidates can be extracted in a practical amount of time even when the document set is large.

[0012] A prefix word of the word of interest is a word located immediately before the word of interest in a document. The prefix word and the word of interest, in that order, may form a phrase with a definite meaning. For example, when the prefix word 「並列」 ("parallel") immediately precedes the word of interest 「コンピュータ」 ("computer"), the phrase 「並列コンピュータ」 ("parallel computer") is formed. Likewise, a postfix word of the word of interest is a word located immediately after the word of interest in a document. The word of interest and the postfix word, in that order, may form a phrase with a definite meaning. For example, when the postfix word 「生産者」 ("producer") immediately follows the word of interest 「コンピュータ」, the phrase 「コンピュータ生産者」 ("computer producer") is formed.

[0013] A related word of the word of interest is a synonym or near-synonym of the word of interest, a word expressing a superordinate or subordinate concept of the word of interest, a word that shares a superordinate or subordinate concept with the word of interest (such as "SRAM" for "DRAM"), or an antonym of the word of interest. A related word candidate is a word that may be a related word of the word of interest. Whether it actually is a related word is finally determined by evaluating, as a measure of relatedness, the degree to which it is used in the same way as the word of interest. In particular, a word in a substitutable relationship with the word of interest is highly likely to be a related word and may therefore be selected as a related word candidate for the word of interest. Two words a and b are in a substitutable relationship when a word that frequently appears in the same document in combination with one word a also appears in combination with the other word b. A word pair in such a substitutable relationship is likely to consist of synonyms or near-synonyms. In the 「コンピュータ」 example above, 「計算機」 (keisanki, another Japanese word for "computer") is in a substitutable relationship with 「コンピュータ」 (konpyūta, "computer"): phrases in which 「コンピュータ」 is replaced by 「計算機」, such as 「並列計算機」 and 「計算機生産者」, may appear in documents.

[0014] The candidate selection unit may further include a pre-processing unit that generates a double-word list by extracting, from the word sequences of two consecutive words in the documents, those whose estimated occurrence probability is significantly higher than the product of the estimated occurrence probabilities of the two words that make up the sequence. The first and second processing units may then extract each element of the prefix word list, the postfix word list, the postfix word set, and the prefix word set from the double-word list. Here, the occurrence probability of a character string is estimated by dividing its document frequency, that is, the total number of documents in which the string appears at least once, by the total number of documents. Creating the double-word list in advance in this way makes it easier to narrow down the candidates when words in a substitutable relationship are selected as related word candidates.
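A minimal sketch of the double-word list follows. It assumes each document is already a list of words and reads "significantly higher" as the ratio test P(xy) / (P(x)P(y)) > α that Definition 2 later in the description applies to word sequences. The function name `double_word_list` and the threshold value in the example are illustrative only.

```python
from collections import Counter

def double_word_list(docs, alpha):
    """Return the (x, y) word pairs with P(xy) / (P(x) * P(y)) > alpha.

    `docs` is a list of documents, each a list of words. Probabilities are
    estimated as document frequency divided by the number of documents.
    """
    n = len(docs)
    df_word = Counter()  # word -> number of documents containing it
    df_pair = Counter()  # (x, y) -> number of documents containing x followed by y
    for words in docs:
        df_word.update(set(words))
        df_pair.update(set(zip(words, words[1:])))

    result = set()
    for (x, y), d in df_pair.items():
        p_xy = d / n
        if p_xy / ((df_word[x] / n) * (df_word[y] / n)) > alpha:
            result.add((x, y))
    return result
```

With four short documents in which "parallel computer" appears twice, that pair has a ratio of about 1.33 and is kept at α = 1.2, while pairs that co-occur no more often than chance predicts are dropped.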

[0015] For a candidate pair of related words, the determination unit may determine the degree of association based on an evaluation value that depends on the number of times a combined character string, which includes the character strings adjoining the words before and after, appears in the document set and on the number of documents in which the combined character string appears. The combined character string z for a word c is the character string xcy obtained by joining to c the fixed-length character strings x and y that appear before and after c in a document. The total occurrence frequency cf(z), which is the number of times the combined character string z appears in the document set, and the document frequency df(z), which is the total number of documents in which z appears at least once, are counted. With N as the total number of documents, the inverse document frequency IDF(z) = -log(df(z)/N) is calculated. The degree of association of the candidate pair may then be determined by evaluating, for example, the product of cf(z) and IDF(z) as the association score.

[0016] The device may further include a keyword extraction unit that extracts keywords from the documents based on the occurrence frequency of character strings in the document set, and the candidate selection unit may perform the selection of candidate pairs of related words on those keywords. The registration unit may register the keywords in the document database as a keyword index. The device may further include a search unit that, in response to a search request from a user, searches the document database for matching documents using the related word dictionary together with the keyword index. This enables similarity search and concept search using keywords and their related words.

[0017] To count the occurrence frequency of character strings, a suffix array may be generated for the document set. A suffix is the character string that runs from a character at some position in a document to the end of that document, and the suffix array is the set of these suffixes arranged in lexicographic order. The suffix array is an array that stores the positions at which the suffixes occur in the document set, and a step of counting the occurrence frequency of character strings using this suffix array may be included.
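The following sketch shows how a suffix array can be used to count occurrences of a string. It is a naive illustration over a single text, not the construction the patent relies on: sorting the suffixes directly is quadratic in the worst case, and the helper names `build_suffix_array` and `count_occurrences` are hypothetical. Counting document frequency would additionally require tracking document boundaries, which is omitted here.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order.

    Naive construction by direct sorting; adequate only for small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    """Count occurrences of `pattern` by binary search over the suffix array."""
    m = len(pattern)
    # Truncating sorted suffixes to length m keeps them in sorted order, so the
    # suffixes that start with `pattern` form one contiguous block.
    # A production version would compare lazily instead of building this list.
    heads = [text[i:i + m] for i in sa]
    return bisect_right(heads, pattern) - bisect_left(heads, pattern)
```

Because every occurrence of a string is a prefix of exactly one suffix, the size of the contiguous block found by the two binary searches equals the number of occurrences, including overlapping ones.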

[0018] Another aspect of the present invention relates to a related word extraction method. The method includes: selecting from a set of documents stored in a document database, as a candidate pair of related words, two different words that are each preceded and followed by the same words in the documents; determining the degree of association of the candidate pair based on the similarity of the character strings adjoining those words before and after; and registering candidate pairs with a high degree of association in the document database as related words.

[0019] Yet another aspect of the present invention also relates to a related word extraction method. The method includes: extracting, from a document set stored in a document database, a list of prefix words that immediately precede a word of interest in the documents; extracting, from the document set, a list of postfix words that immediately follow the word of interest in the documents; for each prefix word in the prefix word list for the word of interest, extracting from the document set the postfix words that follow that prefix word and generating a postfix word set having those postfix words as elements; for each postfix word in the postfix word list for the word of interest, extracting from the document set the prefix words that precede that postfix word and generating a prefix word set having those prefix words as elements; and extracting the words contained in both the postfix word set and the prefix word set as related word candidates for the word of interest.

[0020] The method may further include generating a double-word list by extracting, from the word sequences of two consecutive words in the documents, those whose estimated occurrence probability is significantly higher than the product of the estimated occurrence probabilities of the two words that make up the sequence. Each element of the prefix word list, the postfix word list, the postfix word set, and the prefix word set may then be extracted from the double-word list.

[0021] The method may further include: determining the degree of association between each related word candidate and the word of interest from the commonality of the character strings that appear before and after them; and registering related word candidates with a high degree of association in the document database as related words of the word of interest.

[0022] Any combination of the above constituent elements, and any conversion of the expression of the present invention among a method, an apparatus, a server, a system, a computer program, a recording medium, and the like, is also effective as an aspect of the present invention.

[0023] [Embodiments of the Invention] The related word extraction technique according to the embodiment takes as preconditions for thesaurus construction that no dictionary is used in any step and that the technique can be realized on a general-purpose computer. The purpose of the embodiment is to construct a thesaurus that helps in understanding new text containing unknown words. To understand new text, the meaning of the unknown words in it must be known, and the information most useful for learning the meaning of an unknown word is information about its related words. Related words are words used with the same meaning or with similar meanings. In the present embodiment, therefore, a related word list is constructed as the thesaurus. To determine whether two words are related, the embodiment examines whether they are used in the same way in the text set. Specifically, it examines whether the two words appear in the target text set with the same character strings before and after them. For example, in the sentence 「年賀状を印刷しなければならない」 ("I have to print the New Year's cards"), replacing 「印刷」 (insatsu, "printing") with 「プリント」 (purinto, the loanword "print") yields a sentence with the same meaning. This substitution is possible precisely because 「印刷」 and 「プリント」 are synonyms. Conversely, if two words appear with the same character strings before and after them, they are not necessarily synonymous, but they are assumed to be related words with at least some semantic relationship, and their relatedness is then evaluated.

[0024] 1. Definition of Related Words

In the embodiment, related words are defined as words that have the same character strings before and after them. The related words obtained in the present embodiment are therefore words that are used in the same way in text. To judge whether two words a and b are related words, their degree of association is evaluated with a score function score(a, b) that computes a score from occurrence frequency information. Before the score function is defined, statistics on the occurrence frequency of character strings are defined.

[0025] Suppose that a set of documents d (also called texts) is given as a corpus (also called a text set). Let tf(d, x) be the number of occurrences of the character string x in document d. The following statistics can be calculated from tf(d, x).

[0026] cf(x): the number of times the character string x appears in the corpus (hereinafter, the total occurrence frequency)

$$\mathrm{cf}(x) = \sum_{d} \mathrm{tf}(d, x)$$

df(x): the number of documents in which the character string x appears at least once (hereinafter, the document frequency)

$$\mathrm{df}(x) = |\{d \mid \mathrm{tf}(d, x) \geq 1\}|$$

df_2(x): the number of documents in which the character string x appears at least twice

$$\mathrm{df}_2(x) = |\{d \mid \mathrm{tf}(d, x) \geq 2\}|$$

[0027] Using the document frequency df(x), the inverse document frequency IDF (Inverse Document Frequency) is defined by the following expression.

$$\mathrm{IDF}(x) = -\log\left(\frac{\mathrm{df}(x)}{N}\right)$$

Here N is the total number of documents. A method for obtaining the document frequency df is disclosed in reference [8].
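The statistics defined above can be written down directly. The sketch below is a brute-force illustration that assumes documents are plain strings, counts overlapping occurrences, and uses the natural logarithm (the description does not fix the base). It is not the efficient method of reference [8].

```python
import math

def tf(doc, x):
    """Occurrences of string x in document doc (overlapping matches counted)."""
    return sum(1 for i in range(len(doc) - len(x) + 1) if doc.startswith(x, i))

def cf(corpus, x):
    """Total occurrence frequency of x over all documents."""
    return sum(tf(d, x) for d in corpus)

def df(corpus, x, k=1):
    """Number of documents containing x at least k times; k=2 gives df_2."""
    return sum(1 for d in corpus if tf(d, x) >= k)

def idf(corpus, x):
    """IDF(x) = -log(df(x) / N); raises ValueError when x never occurs."""
    return -math.log(df(corpus, x) / len(corpus))
```

For the three-document corpus `["abab", "ab", "cd"]`, the string "ab" has cf = 3, df = 2, df_2 = 1, and IDF = log(3/2).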

[0028] The score function score(a, b) uses cf·IDF, the product of the total occurrence frequency cf and the inverse document frequency IDF. cf·IDF represents how characteristic a word is and is considered a quantitative measure of how characteristically often the word occurs (reference [2]). If IDF(z) = 0, the word z appears in every text; if cf(z) = 0, the word z never appears in the text set. A positive value of cf(z)·IDF(z) therefore means that z is a meaningful word in the text set. cf·IDF is an index widely used in current retrieval systems, and its usefulness has been demonstrated empirically.

[0029] The first score function is defined as follows, using the cf·IDF values of the two combined character strings xay and xby obtained by joining the character strings x and y before and after each of the two words a and b whose relatedness is to be determined.

[0030] Definition 1. Score function and related word list

$$\mathrm{score}(a, b) = \sum_{x, y} \frac{\mathrm{cf}(xay) \cdot \mathrm{IDF}(xay) \cdot \mathrm{cf}(xby) \cdot \mathrm{IDF}(xby)}{(\log N)^2}$$

Here N is the total number of texts. Using this score function, the related word list Relevants is defined as follows.

[0031]

$$\mathrm{Relevants} = \{(a, b) \mid \exists x\, \exists y\, (\mathrm{cf}(xay) > 1 \land \mathrm{cf}(xby) > 1) \land \mathrm{score}(a, b) > \alpha\}$$

Here α is the score threshold.

[0032] This definition means that, for two words a and b with matching preceding and following character strings, the score is evaluated only when the total occurrence frequency cf(z) of the combined character string z satisfies cf(z) > 1 for both words, and that a and b are judged to be related words when the value of the score function exceeds the threshold α. The condition cf(z) > 1 is imposed on both combined character strings in order to exclude cases where the two words merely happen to be used in the same way in the text set. Based on the rule of thumb in information retrieval that words with cf(z) = 1 occur only rarely and are not useful for retrieval, related words are not extracted for such words.
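The first score function and the membership test for Relevants can be sketched as follows. This is a brute-force illustration under several assumptions: documents are plain strings, the sum over x and y is restricted to an explicitly supplied list of context pairs (`contexts`, a hypothetical parameter), the natural logarithm is used, and cf·IDF is taken as 0 for strings that never occur. It also assumes more than one text, since log N appears in the denominator.

```python
import math

def cf(corpus, s):
    """Total occurrence frequency of s (overlapping matches counted)."""
    return sum(
        sum(1 for i in range(len(d) - len(s) + 1) if d.startswith(s, i))
        for d in corpus
    )

def df(corpus, s):
    """Number of texts containing s at least once."""
    return sum(1 for d in corpus if s in d)

def cf_idf(corpus, s):
    """cf(s) * IDF(s), taken as 0 when s never occurs (assumption)."""
    if df(corpus, s) == 0:
        return 0.0
    return cf(corpus, s) * -math.log(df(corpus, s) / len(corpus))

def score(corpus, a, b, contexts):
    """First score function, summed over the supplied (x, y) context pairs."""
    norm = math.log(len(corpus)) ** 2  # requires more than one text
    total = sum(
        cf_idf(corpus, x + a + y) * cf_idf(corpus, x + b + y) for x, y in contexts
    )
    return total / norm

def is_related(corpus, a, b, contexts, alpha):
    """Membership test for Relevants as written in Definition 1."""
    both_repeat = any(
        cf(corpus, x + a + y) > 1 and cf(corpus, x + b + y) > 1 for x, y in contexts
    )
    return both_repeat and score(corpus, a, b, contexts) > alpha
```

Because each term is a product, a context in which either combined string never occurs contributes nothing, so only contexts shared by both words raise the score.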

[0033] Alternatively, the related word list may be defined as follows, using as a second score function

$$\mathrm{score}(a, b) = \sum_{x, y} \frac{\mathrm{MAX}(\mathrm{cf}(xay) \cdot \mathrm{IDF}(xay),\ \mathrm{cf}(xby) \cdot \mathrm{IDF}(xby))}{\log N}$$

[0034]

$$\mathrm{Relevants} = \{(a, b) \mid \exists x\, \exists y\, (\mathrm{cf}(xay) > 1 \lor \mathrm{cf}(xby) > 1) \land \mathrm{score}(a, b) > \alpha\}$$

[0035] This means that if cf(z) > 1 holds for at least one of the two combined character strings z of the words a and b, the higher of the two cf(z)·IDF(z) values is added to the score, and a and b are judged to be related words when the value of the score function exceeds the threshold α. The intent is to cover the case where one word is rare but the other is reasonably characteristic in the text set, so that the association between the two may still be useful. Evaluation with this second score function can be used to examine related word information that is discarded when the related word list is extracted with the first score function.
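The difference from the first score function is easiest to see when the cf·IDF values are treated as given. The sketch below assumes they have been precomputed into two dictionaries keyed by context pair (x, y), a hypothetical representation chosen for brevity, and treats a missing context as 0.

```python
import math

def second_score(cfidf_a, cfidf_b, n_texts):
    """Second score function over precomputed cf*IDF values.

    cfidf_a maps a context pair (x, y) to cf(xay) * IDF(xay), and cfidf_b
    does the same for xby. Contexts missing from a dictionary count as 0.
    """
    contexts = set(cfidf_a) | set(cfidf_b)
    total = sum(
        max(cfidf_a.get(c, 0.0), cfidf_b.get(c, 0.0)) for c in contexts
    )
    return total / math.log(n_texts)  # requires more than one text
```

Taking the maximum rather than the product means a context still contributes when only one of the two words is well attested there, which is why this variant can surface pairs involving a rare word.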

[0036] If related words were extracted by applying this definition of the related word list directly, the degree of association would have to be evaluated for every conceivable pair of words in the word set extracted from the text, which raises a problem of computational cost. For example, about 100,000 words are extracted from the 125-megabyte text set used in the experiment, in which case there are 10 billion word pairs to judge. If the combined character string score is calculated with preceding and following character strings of 4 characters each, judging one pair of words takes about 0.001 seconds on a general-purpose computer. Judging 10 billion pairs would therefore take roughly 120 days, which is not a practical computation time. In the present embodiment, therefore, the related word candidates are narrowed down from the word set extracted from the text.
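The cost estimate can be checked with a few lines of arithmetic. The function name `exhaustive_cost` is illustrative; the inputs are the figures stated in the text.

```python
def exhaustive_cost(num_words, seconds_per_pair):
    """Return (number of word pairs, total days) for exhaustive pairwise scoring."""
    pairs = num_words ** 2  # K squared, as counted in the description
    days = pairs * seconds_per_pair / 86_400  # 86,400 seconds per day
    return pairs, days

pairs, days = exhaustive_cost(100_000, 0.001)
print(pairs, round(days))  # 10000000000 116
```

The exact figure is about 116 days, which matches the rough value of 120 days given in the text.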

[0037] Known methods for judging whether words are related use the similarity of the words' occurrence distributions, but two or more words expressing the same concept are rarely used within a single document. This is because a single document is written by a single author, so word choice tends to be uniform. In technical documents in particular, terminology tends to be deliberately unified to aid the reader's understanding. By contrast, words in a substitutable relationship tend to appear frequently even within the same document. Here, two words a and b are in a substitutable relationship when a word that frequently appears in the same document in combination with one word a also appears in combination with the other word b. In the present embodiment, a word pair in such a substitutable relationship is treated as a candidate pair of related words, on the grounds that the words may be synonyms or near-synonyms.

[0038] The related word candidates Candidates are defined as follows.

Definition 2. Related word candidates

Given

$$\mathrm{Tri}(\alpha) = \left\{xyz \;\middle|\; \frac{P(xy)}{P(x)P(y)} > \alpha \land \frac{P(yz)}{P(y)P(z)} > \alpha\right\},$$

$$\mathrm{Candidates} = \{(a, b) \mid xaz \in \mathrm{Tri}(\alpha) \land xbz \in \mathrm{Tri}(\alpha) \land x \neq a \land x \neq b \land z \neq a \land z \neq b\}$$

Here x, y, z, a, and b are words, xyz is a word sequence of three consecutive words, P(w) is the occurrence probability of the word sequence w, and α is a threshold.

【００３９】Ｔｒｉ（α）は、テキスト中で三つの単語
ｘ、ｙ、ｚが連なった三連単語ｘｙｚの内、前半の二連
単語ｘｙの出現確率の推定値がその二連単語を構成する
各単語ｘ、ｙのそれぞれの出現確率の推定値の積よりも
有意に高く、かつ後半の二連単語ｙｚの出現確率の推定
値がその二連単語を構成する各単語ｙ、ｚのそれぞれの
出現確率の推定値の積に比べて有意に高いものを集めた
集合である。この定義により、二つの単語ａ，ｂがその
前後にそれぞれ同じ単語を持つなら、その二つの単語
ａ，ｂは関連語の候補対となる。Tri(α) is the set of triple words xyz (three consecutive words x, y, z in the text) for which the estimated occurrence probability of the first double word xy is significantly higher than the product of the estimated occurrence probabilities of its constituent words x and y, and the estimated occurrence probability of the second double word yz is significantly higher than the product of the estimated occurrence probabilities of its constituent words y and z. By this definition, if two words a and b each have the same word before them and the same word after them, a and b form a candidate pair of related words.
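Definition 2 can be illustrated with a minimal Python sketch. It assumes that documents are already split into word lists and that probabilities are estimated from document frequency as P(w) = df(w)/N, with N the number of documents. The function names `doc_freq`, `tri`, and `candidates` are hypothetical, and the sketch additionally excludes pairs with a = b, which Definition 2 does not state explicitly.

```python
from collections import defaultdict

def doc_freq(docs, n):
    """Count, for each n-word sequence, the number of documents containing it."""
    df = defaultdict(int)
    for words in docs:
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        for gram in grams:
            df[gram] += 1
    return df

def tri(docs, alpha):
    """Tri(alpha): triples xyz whose halves xy and yz both exceed the ratio alpha."""
    n_docs = len(docs)
    df1, df2 = doc_freq(docs, 1), doc_freq(docs, 2)

    def strong(u, v):
        # P(uv) / (P(u) P(v)), each probability estimated as df / N (assumption)
        p_uv = df2[(u, v)] / n_docs
        return p_uv / ((df1[(u,)] / n_docs) * (df1[(v,)] / n_docs)) > alpha

    return {g for g in doc_freq(docs, 3) if strong(g[0], g[1]) and strong(g[1], g[2])}

def candidates(tri_set):
    """Pairs (a, b) that occur between the same x and z in Tri(alpha)."""
    middles = defaultdict(set)
    for x, y, z in tri_set:
        middles[(x, z)].add(y)
    pairs = set()
    for (x, z), mids in middles.items():
        for a in mids:
            for b in mids:
                # a != b is an added assumption; the other conditions follow Definition 2
                if a != b and x not in (a, b) and z not in (a, b):
                    pairs.add((a, b))
    return pairs
```

Grouping the triples by their outer words (x, z) avoids comparing every triple with every other triple, but the sketch still has to enumerate all of Tri(α).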

【００４０】２．シソーラス構築手法実施の形態では、次の工程により、新規テキスト用シソ
ーラスとなる関連語のリストを生成する。2. Thesaurus construction method
In the embodiment, a list of related words that serves as a thesaurus for new text is generated by the following steps.

【００４１】（１）テキスト集合から単語を切り出し、
単語集合を求める。（２）単語集合から得られる単語対の中から、関連語の
候補対を絞り込む。（３）関連語の候補対について、関連度を判定し、関連
度が高い候補対から関連語リストを生成する。以下、上
記の三つの工程を順に詳しく説明する。(1) Words are extracted from the text set to obtain a word set.
(2) From the word pairs that can be formed from the word set, candidate pairs of related words are narrowed down.
(3) The degree of association is determined for each candidate pair, and a related-word list is generated from the candidate pairs with a high degree of association.
These three steps are described in detail below, in order.

【００４２】２．１単語の切り出し工程第一の工程では関連語リストの対象となる単語がテキス
トから切り出される。本実施の形態では、新規テキスト
に含まれる未知語を理解することに役立つ関連語の発見
を目的とするため、テキストにある未知語、既知語を問
わず単語を切り出す。日本語には語の境界がないため、
日本語テキストから単語を抽出するのは容易ではない。
このため、関連語のリストの対象となる単語、特に未知
語の切出しに失敗する場合が多い。そこで、本実施の形
態では未知語に対しても有効に適用できるキーワード抽
出手法（文献［４］、［５］）を利用する。このキーワ
ード抽出手法は辞書を用いないで、文字列の頻度情報か
らテキストの部分文字列から概念を示す単語と判定でき
る文字列を切り出し、その単語をキーワードとして抽出
する。本実施の形態では、この手法を用いて抽出したキ
ーワードを関連語リストの対象となる単語として扱う。2.1 Word extraction step
In the first step, the words to be covered by the related-word list are extracted from the text. Because the aim of the present embodiment is to find related words that help in understanding unknown words in new text, words are extracted from the text regardless of whether they are unknown or known. Japanese has no explicit word boundaries, so extracting words from Japanese text is not easy, and extraction of the words to be covered by the related-word list, unknown words in particular, often fails. The present embodiment therefore uses a keyword extraction method (references [4], [5]) that also works effectively for unknown words. Without using a dictionary, this method uses character-string frequency information to cut out, from the substrings of the text, strings that can be judged to be words denoting a concept, and extracts those words as keywords. In the present embodiment, the keywords extracted by this method are treated as the words to be covered by the related-word list.

【００４３】このキーワード抽出システムは、対象とす
るテキスト集合から文字列の頻度情報のみでキーワード
を抽出するシステムである。文字列の出現頻度は、統計
的に言語を処理するときの基本の統計量である。情報検
索においても「ある単語がドキュメントに現れる確率」
に関する情報量で重みをつけることにより、検索性能を
向上させることができる。単語ｘが１回出現する確率Ｐ
（１回出現）を推定するために、Ｐ（１回出現）＝ｄｆ
（ｘ）／Ｎが使われる。またアダプテーションとして知
られる統計量（文献［３］）がある。これはある単語が
一つのドキュメントに現れたという条件で、同じ単語が
もう一度出現する条件付き確率Ｐ（２回出現｜１回出
現）の推定値である。この確率は、対象の文字列ｘに関
して、文字列ｘを含むドキュメントの数ｄｆ（ｘ）とそ
の文字列ｘを２回以上含むドキュメントの数ｄｆ
_２（ｘ）を数え上げ、ベイズの規則を考慮することによ
り推定される。文字列の出現がポアソン分布に従うとす
れば、Ｐ（２回出現｜１回出現）はＰ（１回出現）の値
と等しくなるはずであるが、英語の場合、キーワードの
Ｐ（２回出現｜１回出現）は、Ｐ（１回出現）には依存
せず一定の値となることが報告されており、その確率が
統計的に単語の性質を識別できることを示している。
（文献［３］）。Ｐ（２回出現｜１回出現）の推定値を
Ｐ＾（２回出現｜１回出現）と表すと、アダプテーショ
ンは、次のように計算される。This keyword extraction system extracts keywords from the target text set using only character-string frequency information. The occurrence frequency of a character string is a basic statistic in statistical language processing. In information retrieval as well, retrieval performance can be improved by weighting with the information content associated with "the probability that a word appears in a document". To estimate the probability P(one occurrence) that a word x appears once, P(one occurrence) = df(x)/N is used. There is also a statistic known as adaptation (reference [3]): an estimate of the conditional probability P(two occurrences | one occurrence) that a word appears again, given that it has appeared in a document. For a target string x, this probability is estimated by counting the number of documents df(x) that contain x and the number of documents df_2(x) that contain x two or more times, and applying Bayes' rule. If string occurrences followed a Poisson distribution, P(two occurrences | one occurrence) would equal P(one occurrence); for English, however, it has been reported that P(two occurrences | one occurrence) for keywords takes a constant value independent of P(one occurrence), which shows that this probability can statistically discriminate the nature of a word (reference [3]). Writing the estimate of P(two occurrences | one occurrence) as P^(two occurrences | one occurrence), adaptation is calculated as follows.

【００４４】Ｐ＾（２回出現｜１回出現）＝Ｐ＾（２回
出現∧１回出現）／Ｐ＾（１回出現）＝Ｐ＾（２回出
現）／Ｐ＾（１回出現）＝（ｄｆ_２（ｘ）／Ｎ）／（ｄ
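ｆ（ｘ）／Ｎ）＝ｄｆ_２（ｘ）／ｄｆ（ｘ）
\[ \hat{P}(\text{two occurrences} \mid \text{one occurrence}) = \frac{\hat{P}(\text{two occurrences} \wedge \text{one occurrence})}{\hat{P}(\text{one occurrence})} = \frac{\hat{P}(\text{two occurrences})}{\hat{P}(\text{one occurrence})} = \frac{df_2(x)/N}{df(x)/N} = \frac{df_2(x)}{df(x)} \]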
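The adaptation estimate can be checked on a toy corpus with a minimal sketch. It assumes each document is a plain string, that N is the number of documents, and that overlapping occurrences are counted separately; the function names `count_occurrences` and `adaptation` are hypothetical.

```python
def count_occurrences(doc, x):
    """Count occurrences of string x in doc, allowing overlaps."""
    count, start = 0, doc.find(x)
    while start != -1:
        count += 1
        start = doc.find(x, start + 1)
    return count

def adaptation(docs, x):
    """Return (df, df2, df2 / df) for string x over a list of document strings."""
    counts = [count_occurrences(d, x) for d in docs]
    df = sum(1 for c in counts if c >= 1)    # documents containing x
    df2 = sum(1 for c in counts if c >= 2)   # documents containing x at least twice
    return df, df2, (df2 / df if df else 0.0)
```

For a corpus such as `["robot arm robot", "robot", "arm"]`, the string `"robot"` has df = 2 and df_2 = 1, so the estimated adaptation is 1/2, while the estimate of P(one occurrence) is 2/3.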

【００４５】文献［４］［５］で、日本語のように語の
境界のない言語におけるキーワードのＰ（２回出現｜１
回出現）とＰ（１回出現）を分析したところ、実際のコ
ーパスではＰ（２回出現｜１回出現）はＰ（１回出現）
に比べ大きく、たとえば「自立移動ロボット」というキ
ーワードの文字列のＰ（２回出現｜１回出現）はそのキ
ーワードにその次の文字も含めた文字列「自立移動ロボ
ットに」のＰ（２回出現｜１回出現）に比べ非常に大き
いことが観測されている。文献［４］［５］にはこの観
測に基づき、キーワードと判定される文字列を推定し、
新規テキスト中から辞書を用いないでキーワードを抽出
する手法が説明されている。この方法は一つの方法に過
ぎず、この手法を用いなくても、統計量をもとにした他
のキーワード抽出手法を利用してもよいが、ここでは一
例として、この手法を用いてテキストから抽出されるキ
ーワードの例を次に示す。以下のようなテキストが与え
られたとする。References [4] and [5] analyze P(two occurrences | one occurrence) and P(one occurrence) for keywords in a language without word boundaries, such as Japanese. They observe that, in an actual corpus, P(two occurrences | one occurrence) is larger than P(one occurrence), and that, for example, P(two occurrences | one occurrence) for the keyword string "independent mobile robot" (自立移動ロボット) is much larger than that for the string 自立移動ロボットに, which extends the keyword by the character that follows it. Based on this observation, references [4] and [5] describe a method that estimates which character strings are keywords and extracts keywords from new text without using a dictionary. This is only one possible method; another statistics-based keyword extraction method may be used instead. Here, as an example, keywords extracted from text by this method are shown. Suppose the following text is given.

【００４６】gakkai-0000000001 電気回路演習用ＣＡＩ
とその改良大学等での基礎的な電気回路演習を支援す
るＣＡＩソフトウェアとその改良について述べている。
本ＣＡＩはコンピュータが出題される回路を学習者各人
のレベルに応じて自動的に作成すること、解答を数式で
入力することが大きな特長である。また、誤った解答に
対しては、原因の検討を容易にするメッセージが表示さ
れるなど、効果的な個別学習が限られた設備・要員で実
施可能であるよう配慮した。昨年度の学生による本ＣＡ
Ｉの使用結果のアンケート、および発表の場における質
疑等を参考に、操作を容易とし、効果を上げるための改
良を行った。電気回路演習gakkai-0000000001 CAI for electric circuit exercises and its improvement
This paper describes CAI software that supports basic electric circuit exercises at universities and similar institutions, and improvements to it. The major features of this CAI are that the computer automatically generates the circuits to be posed as problems according to each learner's level, and that answers are entered as mathematical expressions. For incorrect answers, messages that make it easier to examine the cause are displayed, among other measures, so that effective individual learning can be carried out with limited equipment and personnel. With reference to a questionnaire on last year's students' use of this CAI and to questions raised at presentations, improvements were made to ease operation and increase effectiveness. Electric circuit exercises

【００４７】ｄｆ_２（ｘ）／ｄｆ（ｘ）を計測すること
により、このテキストから単語の切れ目を検出してテキ
ストの文字列を分割すると、以下のようになる。When word boundaries are detected in this text by measuring df_2(x)/df(x) and the character string of the text is split at them, the result is as follows.

【００４８】gakkai-000000/0001/電気回路/演習/用/Ｃ
ＡＩ/と/その/改良//大学/等/で/の基礎/的/な/電気回
路/演習/を/支援/する/ＣＡＩ/ソフトウエア/と/その/
改良/について述べ/ている。/本/ＣＡＩ/は/コンピュー
タ/が/出/題/され/る/回路/を/学習者/各/人の/レベル/
に応じて/自動/的/に/作成/する/こと、/解答/を/数式/
で/入力/する/ことが/大き/な/特/長/である。/ま/た
/、/誤/った/解答/に対しては、/原因/の/検討/を容易
に/する/メッセージ/が/表示/され/る/など/、/効果/的
/な/個別/学習/が/限/ら/れ/た/設備/・/要員/で/実施/
可能/である/よ/う/配/慮/した。/昨年度/の/学生/によ
る/本/ＣＡＩ/の/使用/結/果/の/アンケート/、/および
/発/表/の/場における/質/疑/等/を/参/考/に/、/操作/
を/容易/とし、/効果/を上げる/ため/の改良/を行っ
た。/電気回路//演習/Gakkai-000000 / 0001 / electric circuit / exercise / for / C
AI / and / that / improvement // university / etc / de / basic / target / electrical circuit / exercise / support / support / CAI / software / and / the /
Describes / describes improvements. / Book / CAI / Is / Computer / is / issue / title / do / circuit / is / learner / each / person / level /
According to / automatic / target / to / create / do / that / answer / to / mathematical /
It is / input / enter / that / large / large / special / long /. /Also
For /, / wrong / wrong / answer /
/ Na / individual / learning / ga / limit / la / re / ta / equipment / ・ / personnel / in / implement /
Possible / Yes / Yes / No / Distribution / Consideration. / Last year / of / student / by / book / CAI / of / use / result / result / of / questionnaire /, / and
/ Departure / Table / Of / On site / Quality / Suspicion / Etc. / Reference / Consider / In /, / Operation /
/ Easy / and / Effect / Increase / To improve / To improve. / Electrical Circuit // Practice /

【００４９】このように、文字分割された文書ファイル
からキーワードとなりそうな文字列の分布をもつものを
選ぶことにより、次のキーワードが抽出される。From the document file split in this way, the following keywords are extracted by selecting the strings whose distribution suggests that they are keywords.

【００５０】演習ＣＡＩ演習支援ＣＡＩソフ
トウエアＣＡＩ学習者自動メッセージ表示
学習学生ＣＡＩ演習exercise, CAI, exercise, support, CAI, software, CAI, learner, automatic, message, display, learning, student, CAI, exercise

【００５１】このように、単語の辞書がまったくない状
態で、日本語の文章からキーワードが抽出できる。In this way, keywords can be extracted from Japanese text without any word dictionary at all.

【００５２】２．２候補の絞込み工程テキストから切り出された単語集合から考え得るすべて
の単語対について関連度を調査すると、計算量が問題と
なる。そこで、第二の工程では第一の工程で抽出した単
語の集合から考え得る単語対を関連語となる候補の対に
絞り込む処理を行う。定義２で三連単語のリストＴｒｉ
（α）を用いて、関連語の候補とする対を定義し、調査
対象を限定しているが、その定義通りに候補の対を調べ
るのでは、Ｔｒｉ（α）の大きさから考えて、実用的に
計算できない。そこで、候補対の定義を次の定義に置き
換えて効率的に候補となる対を選び出す。2.2 Candidate narrowing step
Examining the degree of association for every word pair that can be formed from the word set extracted from the text would be computationally expensive. The second step therefore narrows the word pairs that can be formed from the word set extracted in the first step down to candidate pairs of related words. Definition 2 defines the candidate pairs using the list of triple words Tri(α) and thereby limits what has to be examined, but, given the size of Tri(α), examining candidate pairs exactly as defined is not computationally practical. The definition of candidate pairs is therefore replaced with the following one so that candidate pairs can be selected efficiently.

【００５３】定義３候補対の選択Ｂｉ（α）＝｛ｘｙ｜Ｐ（ｘｙ）／（Ｐ（ｘ）Ｐ
（ｙ））＞α｝，Ｆ（ｃ）＝｛ｘ｜ｘｃ∈Ｂｉ（α）｝，Ｂ（ｃ）＝｛ｙ｜ｃｙ∈Ｂｉ（α）｝のとき、ＢＦ（ａ）＝｛Ｂ（ｘ）｜ｘ∈Ｆ（ａ）｝，ＦＢ（ａ）＝｛Ｆ（ｙ）｜ｙ∈Ｂ（ａ）｝，Ｃａｎｄｉｄａｔｅｓ＝｛（ａ，ｂ）｜ｂ∈（ＦＢ
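（ａ）∩ＢＦ（ａ））∧ａ≠ｂ｝Definition 3. Selection of candidate pairs
With
\[ \mathrm{Bi}(\alpha) = \{\, xy \mid \frac{P(xy)}{P(x)P(y)} > \alpha \,\}, \quad F(c) = \{\, x \mid xc \in \mathrm{Bi}(\alpha) \,\}, \quad B(c) = \{\, y \mid cy \in \mathrm{Bi}(\alpha) \,\} \]
define
\[ BF(a) = \{\, B(x) \mid x \in F(a) \,\}, \quad FB(a) = \{\, F(y) \mid y \in B(a) \,\}, \]
\[ \mathrm{Candidates} = \{\, (a, b) \mid b \in (FB(a) \cap BF(a)) \wedge a \neq b \,\} \]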
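As a rough illustration of how Bi(α) might be built, the following sketch assumes documents are already split into word lists and that probabilities are estimated from document frequency as df/N, with N the number of documents. The function name `bi` is hypothetical.

```python
from collections import Counter

def bi(docs, alpha):
    """Return the adjacent word pairs (x, y) with P(xy) / (P(x) P(y)) > alpha."""
    n_docs = len(docs)
    df_word, df_pair = Counter(), Counter()
    for words in docs:
        df_word.update(set(words))                    # count each word once per document
        df_pair.update(set(zip(words, words[1:])))    # count each adjacent pair once per document
    return {
        (x, y)
        for (x, y), df_xy in df_pair.items()
        if (df_xy / n_docs) / ((df_word[x] / n_docs) * (df_word[y] / n_docs)) > alpha
    }
```

Only adjacent pairs that actually occur are scored, so the work grows with the size of the text rather than with the square of the vocabulary.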

【００５４】Ｂｉ（α）はテキスト中で二つの単語ｘ，
ｙが連なった単語列ｘｙの内、その単語列ｘｙの出現確
率の推定値がその単語列を構成する二つの単語ｘ、ｙの
それぞれの出現確率の推定値の積に比べて有意に高いも
のを選んで作られた二連単語のリストである。ここで出
現確率Ｐ（ｘ）などは前述のように、ドキュメント頻度
を用いてＰ（ｘ）＝ｄｆ（ｘ）／Ｎと推定される。Ｆ
（ｃ）は、ある単語ｃの前に連なる前置単語ｘを二連単
語の集合Ｂｉ（α）から抽出して作られた前置単語のリ
ストである。Ｂ（ｃ）は、ある単語ｃの後に連なる後置
単語ｘを二連単語の集合Ｂｉ（α）から抽出して作られ
た後置単語のリストである。ＢＦ（ａ）は、注目単語ａ
についての前置単語の集合Ｆ（ａ）に含まれる前置単語
ｘの各々について、テキスト中でその前置単語ｘの後に
連なる後置単語を二連単語の集合Ｂｉ（α）から抽出し
て作られた後置単語の集合である。ＦＢ（ａ）は、注目
単語ａについての後置単語の集合Ｂ（ａ）に含まれる前
置単語ｙの各々について、テキスト中でその前置単語ｙ
の前に連なる前置単語を二連単語の集合Ｂｉ（α）から
抽出して作られた前置単語の集合である。ある注目単語
ａに対して、最終的に得られた後置単語の集合ＢＦ
（ａ）と前置単語のＦＢ（ａ）の共通要素ｂ（≠ａ）が
選ばれ、単語対（ａ，ｂ）が関連語の候補対のリストＣ
ａｎｄｉｄａｔｅｓに加えられる。Bi(α) is the list of double words formed by selecting, from the word strings xy of two consecutive words x and y in the text, those whose estimated occurrence probability is significantly higher than the product of the estimated occurrence probabilities of the two constituent words x and y. As described above, occurrence probabilities such as P(x) are estimated from document frequency as P(x) = df(x)/N. F(c) is the list of prefix words, obtained by extracting from Bi(α) the words x that immediately precede a word c. B(c) is the list of postfix words, obtained by extracting from Bi(α) the words y that immediately follow a word c. BF(a) is the set of postfix words obtained by extracting from Bi(α), for each prefix word x in the prefix-word set F(a) of the word of interest a, the postfix words that follow x in the text. FB(a) is the set of prefix words obtained by extracting from Bi(α), for each postfix word y in the postfix-word set B(a) of the word of interest a, the prefix words that precede y in the text. For a given word of interest a, each common element b (≠ a) of the resulting postfix-word set BF(a) and prefix-word set FB(a) is selected, and the word pair (a, b) is added to the list Candidates of candidate pairs of related words.

【００５５】関連語の候補対が選択される過程を説明す
るために、次の文章をテキストの例として考える。To illustrate how candidate pairs of related words are selected, consider the following sentences as an example text.

【００５６】「昨夜、年賀状を印刷して郵便ポストに投函した。」「昨夜、原稿を印刷して郵送した。」「昨夜、原稿を修正して郵送した。」「昨夜、宣伝チラシを１００部印刷した。」「昨夜、雪が降った。」[0056] "I printed a New Year's card last night and posted it in a mailbox." "I printed and mailed the manuscript last night." "I revised the manuscript last night and mailed it." "I printed 100 copies of an advertising leaflet last night." "It snowed last night."

【００５７】キーワード抽出手法により単語が切り出さ
れ、二連単語のリストＢｉ（α）が次のように得られた
とする。Suppose that words are extracted by the keyword extraction method and that the double-word list Bi(α) is obtained as follows.

【００５８】Ｂｉ（α）＝｛「昨夜」「年賀状」，「年
賀状」「印刷」，「印刷」「郵便ポスト」，「郵便ポス
ト」「投函」，「昨夜」「原稿」，「原稿」「印刷」，
「印刷」「郵送」，「原稿」「修正」，「修正」「郵
送」，「昨夜」「宣伝チラシ」，「宣伝チラシ」「印
刷」，「昨夜」「雪」｝Bi(α) = { "last night" "New Year's card", "New Year's card" "print", "print" "mailbox", "mailbox" "post", "last night" "manuscript", "manuscript" "print", "print" "mail", "manuscript" "revise", "revise" "mail", "last night" "advertising leaflet", "advertising leaflet" "print", "last night" "snow" }

【００５９】ここで、注目単語「年賀状」に関する関連
語の候補を選択する。まず注目単語「年賀状」について
前置単語のリストＦ（「年賀状」）、後置単語のリスト
Ｂ（「年賀状」）が二連単語のリストＢｉ（α）を参照
して次のように生成される。Ｆ（「年賀状」）＝｛「昨夜」｝，Ｂ（「年賀状」）＝｛「印刷」｝Here, related-word candidates are selected for the word of interest "New Year's card". First, the prefix-word list F("New Year's card") and the postfix-word list B("New Year's card") are generated for the word of interest by referring to the double-word list Bi(α), as follows. F("New Year's card") = { "last night" }, B("New Year's card") = { "print" }

【００６０】次に前置単語のリストＦ（「年賀状」）内
の要素「昨夜」についての後置単語の集合ＢＦ（「年賀
状」）が二連単語のリストＢｉ（α）を参照して次のよ
うに生成される。ＢＦ（「年賀状」）＝｛「年賀状」，「原稿」，「宣伝
チラシ」，「雪」｝Next, for the element "last night" in the prefix-word list F("New Year's card"), the postfix-word set BF("New Year's card") is generated by referring to the double-word list Bi(α), as follows. BF("New Year's card") = { "New Year's card", "manuscript", "advertising leaflet", "snow" }

【００６１】また、後置単語のリストＢ（「年賀状」）
内の要素「印刷」についての前置単語の集合ＦＢ（「年
賀状」）が二連単語のリストＢｉ（α）を参照して次の
ように生成される。ＦＢ（「年賀状」）＝｛「年賀状」，「原稿」，「宣伝
チラシ」｝Similarly, for the element "print" in the postfix-word list B("New Year's card"), the prefix-word set FB("New Year's card") is generated by referring to the double-word list Bi(α), as follows. FB("New Year's card") = { "New Year's card", "manuscript", "advertising leaflet" }

【００６２】このとき「年賀状」に関する関連語の候補
対は、後置単語の集合ＢＦ（「年賀状」）と前置単語の
集合ＦＢ（「年賀状」）の共通要素を選択することによ
り次のように生成される。Ｃａｎｄｉｄａｔｅｓ＝｛「年賀状」「原稿」，「年賀
状」「宣伝チラシ」｝The candidate pairs of related words for "New Year's card" are then generated by selecting the common elements of the postfix-word set BF("New Year's card") and the prefix-word set FB("New Year's card"), as follows. Candidates = { "New Year's card" "manuscript", "New Year's card" "advertising leaflet" }

【００６３】この例では、「年賀状」に関する候補のほ
か、注目単語としてたとえば「印刷」を選べば、関連単
語の候補として「修正」が選択される。In this example, besides the candidates for "New Year's card", choosing "print" as the word of interest yields "revise" as a related-word candidate.
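The worked example above can be reproduced with a minimal sketch of Definition 3. It assumes Bi(α) is given as a set of (preceding word, following word) tuples, reads BF(a) and FB(a) as the unions of the sets B(x) and F(y), and uses English glosses in place of the Japanese words. The names `prefix_words`, `postfix_words`, `candidates`, and `BI` are hypothetical.

```python
def prefix_words(bi, c):
    """F(c): words that immediately precede c in Bi(alpha)."""
    return {x for (x, y) in bi if y == c}

def postfix_words(bi, c):
    """B(c): words that immediately follow c in Bi(alpha)."""
    return {y for (x, y) in bi if x == c}

def candidates(bi, a):
    """Candidate pairs (a, b) with b in both BF(a) and FB(a), b != a."""
    bf = set().union(*(postfix_words(bi, x) for x in prefix_words(bi, a)))
    fb = set().union(*(prefix_words(bi, y) for y in postfix_words(bi, a)))
    return {(a, b) for b in bf & fb if b != a}

# Bi(alpha) of the example text, written with English glosses (assumption).
BI = {
    ("last night", "New Year's card"), ("New Year's card", "print"),
    ("print", "mailbox"), ("mailbox", "post"),
    ("last night", "manuscript"), ("manuscript", "print"),
    ("print", "mail"), ("manuscript", "revise"), ("revise", "mail"),
    ("last night", "advertising leaflet"), ("advertising leaflet", "print"),
    ("last night", "snow"),
}
```

With this data, `candidates(BI, "New Year's card")` gives the pairs with "manuscript" and "advertising leaflet", and `candidates(BI, "print")` gives the pair with "revise". "snow" is dropped because it appears in BF but not in FB.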

【００６４】３．３関連語の判定工程第三の工程は第二の工程で得られた関連語の候補対につ
いて関連度を評価し、候補が関連語であるかどうかを判
定する。通常、関連語を取り出す手法は単語の出現分布
が類似しているかどうかを判定して、関連語を取り出す
が、本実施の形態では、定義１のスコア関数による判定
を用いて、単語の前後に接続している文字列の関係を基
に関連語の対であるかどうかを判定する。この判定では
前後に接続する文字列の長さが問題となる。調べる前後
の文字列が短すぎる場合、偶然前後の文字列が一致して
いる単語の対がスコア関数での評価に加わることにな
り、実際には関連のない単語対でも関連度があると評価
されてしまう。また反対に、調べる前後の文字列が長す
ぎると、一致することが少なくなり、スコア関数での評
価値が下がり、実際には関係のある単語対でも抽出され
ずに、取りこぼされるという問題が生じる。このように
調べる前後の文字列の長さは最終的に関連語を抽出する
ための重要なパラメータであり、最適な長さに調整され
る。3.3 Related-word determination step
The third step evaluates the degree of association of the candidate pairs obtained in the second step and determines whether each candidate is a related word. Methods for extracting related words usually decide whether the occurrence distributions of words are similar; the present embodiment instead uses the score function of Definition 1 to decide whether a pair consists of related words, based on the character strings adjoining the words before and after them. The length of these adjoining strings matters in this decision. If the strings examined before and after a word are too short, word pairs whose adjoining strings happen to match by chance enter the score-function evaluation, and pairs that are actually unrelated are judged to be related. Conversely, if the strings are too long, matches become rare, the score-function value drops, and pairs that are actually related are missed rather than extracted. The length of the strings examined before and after a word is thus an important parameter for the final extraction of related words, and it is adjusted to an optimal length.

【００６５】４．実施例図１は、実施の形態に係る関連語抽出装置１００の構成
図である。この構成は、ハードウエア的には、任意のコ
ンピュータのＣＰＵ、メモリ、その他のＬＳＩで実現で
き、ソフトウエア的にはメモリにロードされた関連語抽
出機能のあるプログラムなどによって実現されるが、こ
こではそれらの連携によって実現される機能ブロックを
描いている。したがって、これらの機能ブロックがハー
ドウエアのみ、ソフトウエアのみ、またはそれらの組合
せによっていろいろな形で実現できることは、当業者に
は理解されるところである。4. Example
FIG. 1 is a configuration diagram of the related word extraction device 100 according to the embodiment. In hardware terms, this configuration can be realized by the CPU, memory, and other LSIs of any computer; in software terms, it is realized by, for example, a program with a related-word extraction function loaded into memory. The figure, however, depicts functional blocks realized by their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, by software alone, or by a combination of the two.

【００６６】計数処理部１０は、文書データベース２４
に格納された複数の文書ファイル２６を読み込んで文書
ファイル２６中の文字列の出現頻度に関する統計量を計
数する。複数の文書ファイル２６はコーパスを形成す
る。計数処理部１０は、SuffixArray生成部１２と、文
字列頻度計数部１４と、文書頻度算出部１６とをもつ。
Suffix Array生成部１２は、各々の文書ファイル２６に
ついてsuffixを求め、すべてのsuffixの集合を辞書順に
並べたSuffix Arrayを生成する。生成されたSuffix Arr
ayに関するデータはSuffix Arrayファイル２８として文
書データベース２４に格納される。The counting processing unit 10 reads the plurality of document files 26 stored in the document database 24 and counts statistics on the occurrence frequency of character strings in the document files 26. The plurality of document files 26 form a corpus. The counting processing unit 10 has a Suffix Array generation unit 12, a character string frequency counting unit 14, and a document frequency calculation unit 16. The Suffix Array generation unit 12 obtains the suffixes of each document file 26 and generates a Suffix Array in which the set of all suffixes is arranged in dictionary order. The data of the generated Suffix Array is stored in the document database 24 as a Suffix Array file 28.

【００６７】ここで、Suffix Arrayは文献［６］によっ
て示されたデータ構造である。このデータ構造はあるテ
キストが与えられたときに、そのテキストのある位置の
文字からコーパスの終了までの範囲の文字列(これを接
尾辞（suffix）とよぶ)の集合を考え、その集合を辞書
順に並べたものである。どのsuffixも開始場所により一
意に定まるため、テキストの本体がメモリにあるとする
と、一つのsuffixを格納するのに、文字列の開始場所を
示す一つの整数を格納すればよい。このように、Suffix
Arrayを用いれば、任意の部分文字列の場所を知ること
ができるにもかかわらず、必要な記憶容量はテキストの
サイズをＮとするとＯ（Ｎ）（Ｎのオーダー）で済む。Here, the Suffix Array is the data structure presented in reference [6]. Given a text, consider the set of character strings that run from the character at each position of the text to the end of the corpus (these strings are called suffixes); the Suffix Array is this set arranged in dictionary order. Every suffix is uniquely determined by its start position, so, provided the text itself is held in memory, a suffix can be stored as a single integer giving the start position of the string. Thus, although a Suffix Array makes it possible to locate any substring, the required storage is only O(N) (on the order of N), where N is the size of the text.
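The idea of storing each suffix as one integer can be shown with a deliberately naive sketch. It sorts the start positions by comparing the suffixes directly, which is much slower than the efficient construction algorithms known in the literature. The function name `build_suffix_array` is hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in dictionary order."""
    # Each suffix is represented only by its integer start position;
    # the text itself is kept in memory and sliced during comparison.
    return sorted(range(len(text)), key=lambda i: text[i:])

# For "banana" the sorted suffixes are: a, ana, anana, banana, na, nana
suffix_array = build_suffix_array("banana")
```

The result holds exactly one integer per character of the text, which is the O(N) storage the description refers to.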

【００６８】一般にコーパスがドキュメントの構造をも
たないとき、単純な方法では、文字列の比較計算の上限
を見積もることが難しいが、Suffix Arrayを生成する効
率の良いアルゴリズムが知られており（文献［７］）、
この方法によると、Suffix Arrayの作成の計算量はＯ
（ＮｌｏｇＮ）である。In general, when the corpus has no document structure, it is difficult with a naive method to bound the cost of the string comparisons; however, an efficient algorithm for generating a Suffix Array is known (reference [7]), and with it the computational cost of building the Suffix Array is O(N log N).

【００６９】このデータ構造により、任意の文字列を与
え、その文字列が出現する場所を特定するために２分探
索ができ、Ｏ（ｌｏｇＮ）の時間で出現場所のリストが
求められる。ここでいう出現場所とは、コーパス内のあ
る部分と一対一で対応する整数値であり、Suffix Array
の性質から、これはある一つのsuffixにも対応する。Su
ffix Arrayを用いると、文字列の出現場所が特定でき、
その特定できた場所についてドキュメントの数を数えれ
ば、任意の文字列に対するドキュメント頻度を計算でき
る。With this data structure, given an arbitrary character string, a binary search can be used to locate where the string occurs, and the list of occurrence positions is obtained in O(log N) time. An occurrence position here is an integer that corresponds one-to-one with a location in the corpus; by the nature of the Suffix Array, it also corresponds to exactly one suffix. With the Suffix Array, the positions where a string occurs can be identified, and by counting the documents at those positions, the document frequency of any string can be computed.
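The lookup described here can be sketched as follows. The sketch assumes the corpus is formed by joining the documents with a separator that never occurs in a query, builds the Suffix Array naively, and maps each occurrence position back to its document. All names (`build_index`, `occurrence_range`, `document_frequency`) are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def build_index(docs, sep="\n"):
    """Join documents into one corpus and build a (naively sorted) suffix array."""
    text = sep.join(docs)
    doc_starts, pos = [], 0
    for d in docs:
        doc_starts.append(pos)          # start offset of each document in the corpus
        pos += len(d) + len(sep)
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_starts

def occurrence_range(text, sa, query):
    """Binary-search the half-open range of sa whose suffixes start with query."""
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose m-character prefix >= query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose m-character prefix > query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

def document_frequency(index, query):
    """Return (df, df2): documents containing query at least once / at least twice."""
    text, sa, doc_starts = index
    start, end = occurrence_range(text, sa, query)
    per_doc = Counter(bisect_right(doc_starts, sa[i]) - 1 for i in range(start, end))
    return len(per_doc), sum(1 for c in per_doc.values() if c >= 2)
```

Because the suffixes are sorted, all occurrences of a query occupy one contiguous range of the array, so two binary searches are enough to find them; counting the distinct documents in that range gives df, and counting the documents hit twice or more gives df_2.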

【００７０】文字列頻度計数部１４は、Suffix Arrayフ
ァイル２８を参照して、文字列がドキュメントに現れる
頻度などの統計量を計数する。文書頻度算出部１６は、
文字列頻度計数部１４により計測された文字列頻度から
文字列が出現するドキュメントの頻度を求める。文字列
頻度計数部１４および文書頻度算出部１６により計測さ
れた統計量は、文字列出現頻度データ３０として文書デ
ータベース２４に格納される。The character string frequency counting unit 14 refers to the Suffix Array file 28 and counts statistics such as the frequency with which character strings appear in documents. The document frequency calculation unit 16 obtains, from the string frequencies measured by the character string frequency counting unit 14, the frequency of documents in which each string appears. The statistics measured by the character string frequency counting unit 14 and the document frequency calculation unit 16 are stored in the document database 24 as character string appearance frequency data 30.

【００７１】キーワード抽出部２２は、文字列出現頻度
データ３０を参照して、各文書ファイル２６を特徴づけ
るキーワードを抽出する。キーワードの抽出はドキュメ
ント頻度を用いた単語の切れ目の検出と単語の出現のパ
ターンにもとづいて行い、単語辞書を必要としない。抽
出されたキーワードは各文書ファイル２６の検索用のイ
ンデックスとしてキーワードインデックスファイル３２
に格納される。The keyword extracting unit 22 refers to the character string appearance frequency data 30 and extracts keywords that characterize each document file 26. Keywords are extracted on the basis of word-boundary detection using document frequency and of word occurrence patterns, so no word dictionary is required. The extracted keywords are stored in the keyword index file 32 as a search index for each document file 26.

【００７２】関連語抽出部４０は、文字列出現頻度デー
タ３０とキーワードインデックスファイル３２とを参照
して関連語を抽出し、関連語リスト３４を生成して文書
データベース２４に格納する。候補選定部４２は、キー
ワードインデックスファイル３２に格納されたキーワー
ドについて、文字列出現頻度データ３０を用いて関連語
の候補を絞り込む。関連度判定部４４は、文字列出現頻
度データ３０を用いて関連語の候補のスコアを計算し、
関連度を評価する。登録部４６は関連度の高い関連語の
候補を関連語リスト３４に登録する。The related word extracting unit 40 refers to the character string appearance frequency data 30 and the keyword index file 32 to extract related words, generates the related word list 34, and stores it in the document database 24. The candidate selection unit 42 narrows down the related-word candidates for the keywords stored in the keyword index file 32, using the character string appearance frequency data 30. The degree-of-association determination unit 44 calculates the scores of the related-word candidates using the character string appearance frequency data 30 and evaluates their degree of association. The registration unit 46 registers related-word candidates with a high degree of association in the related word list 34.

【００７３】検索部２０は、ユーザが指定する文字列を
キーワードとして受けつけ、キーワードインデックスフ
ァイル３２と関連語リスト３４を参照して、指定された
キーワードおよびその関連語を含む文書ファイル２６を
検索し、ユーザに提供する。The search unit 20 accepts a character string specified by the user as a keyword, refers to the keyword index file 32 and the related word list 34, searches for document files 26 that contain the specified keyword and its related words, and provides them to the user.

【００７４】図２は、関連語抽出部４０の候補選定部４
２の機能構成図である。前処理部５０は、文書データベ
ース２４の文書ファイル２６を参照して、二連単語リス
トＢｉ（α）を作成する前処理を行う。第１処理部５２
は、注目単語ａに対して、前置単語ｘのリストＦ（ａ）
と後置単語ｙのリストＢ（ａ）を生成する第１の処理を
行う。第２処理部５４は、二連単語リストＢｉ（α）を
参照して、各前置単語ｘに対する後置単語の集合ＢＦ
（ａ）と、各後置単語ｙに対する前置単語の集合ＦＢ
（ａ）を生成する第２の処理を行う。第３処理部５６
は、これらの後置単語の集合ＢＦ（ａ）と前置単語の集
合ＦＢ（ａ）の共通要素から、関連語の候補対（ａ，
ｂ）を抽出する第３の処理を行う。FIG. 2 is a functional configuration diagram of the candidate selection unit 42 of the related word extracting unit 40. The preprocessing unit 50 refers to the document files 26 in the document database 24 and performs preprocessing to create the double-word list Bi(α). The first processing unit 52 performs a first process of generating, for the word of interest a, the list F(a) of prefix words x and the list B(a) of postfix words y. The second processing unit 54 refers to the double-word list Bi(α) and performs a second process of generating the set BF(a) of postfix words for each prefix word x and the set FB(a) of prefix words for each postfix word y. The third processing unit 56 performs a third process of extracting candidate pairs (a, b) of related words from the common elements of the postfix-word set BF(a) and the prefix-word set FB(a).

【００７５】図３は、関連語抽出装置１００が行う関連
語抽出処理のフローチャートである。キーワード抽出部
２２は、文書データベース２４の文書ファイル２６から
キーワードを抽出する（Ｓ１０）。これは前述の単語の
切り出し工程である。候補選定部４２は、二連単語のリ
ストを生成し（Ｓ１２）、関連語の候補対を選定する
（Ｓ１４）。これは前述の候補の絞り込み工程である。
関連度判定部４４は、選定された候補対について、関連
度を判定する（Ｓ１６）。これは前述の関連語の判定工
程である。登録部４６は、関連度の高い候補対を関連語
リスト３４に登録する（Ｓ１８）。FIG. 3 is a flowchart of the related word extraction process performed by the related word extraction device 100. The keyword extracting unit 22 extracts keywords from the document files 26 in the document database 24 (S10). This is the word extraction step described above. The candidate selection unit 42 generates the list of double words (S12) and selects candidate pairs of related words (S14). This is the candidate narrowing step described above. The degree-of-association determination unit 44 determines the degree of association of the selected candidate pairs (S16). This is the related-word determination step described above. The registration unit 46 registers candidate pairs with a high degree of association in the related word list 34 (S18).

【００７６】図４は、注目単語から関連単語の候補が特
定される例を図示したものである。注目単語ａとして
「コンピュータ」が与えられる。注目単語ａに対する前
置単語ｘのリストＦ（ａ）として、「並列」、「ホス
ト」、「分散」などが抽出される。これが第１の処理の
結果である。前置単語ｘのリストＦ（ａ）に含まれる
「並列」について、「処理」、「接続」といった後置単
語が特定される。また他の前置単語「分散」について、
「計算機」、「メモリ」といった後置単語が特定され
る。これにより、各前置単語ｘに対する後置単語の集合
ＢＦ（ａ）には、「処理」、「コンピュータ」、「計算
機」、「データ」、「接続」、「メモリ」などの単語が
含まれることになる。これが第２の処理の結果である。FIG. 4 shows an example in which candidates for related words are specified from the word of interest. “Computer” is given as the attention word a. “Parallel”, “host”, “distributed”, etc. are extracted as the list F (a) of the prefix word x for the target word a. This is the result of the first process. For "parallel" included in the list F (a) of the prefix words x, suffix words such as "processing" and "connection" are specified. For the other prefix word "dispersion",
Postfix words such as "computer" and "memory" are identified. As a result, the set BF (a) of the suffix words for each prefix word x includes words such as “processing”, “computer”, “computer”, “data”, “connection”, and “memory”. It will be. This is the result of the second processing.

【００７７】同様に、注目単語ａに対する後置単語ｙの
リストＢ（ａ）として、「生産者」、「キーボード」、
「故障」などが抽出される。後置単語ｙのリストＢ
（ａ）に含まれる「生産者」について、「自動車」、
「計算機」といった前置単語が特定される。また他の後
置単語「キーボード」について、「マウス」、「コンピ
ュータ」といった前置単語が特定される。これにより、
各後置単語ｙに対する前置単語の集合ＦＢ（ａ）には、
「自動車」、「マウス」、「ハードディスク」、「コン
ピュータ」、「メモリ」、「計算機」などの単語が含ま
れることになる。Similarly, "producer", "keyboard", "failure", and so on are extracted as the list B(a) of postfix words y for the word of interest a. For "producer" in the list B(a), prefix words such as "automobile" and "computing machine" (計算機) are identified. For another postfix word, "keyboard", prefix words such as "mouse" and "computer" (コンピュータ) are identified. As a result, the set FB(a) of prefix words for the postfix words y contains words such as "automobile", "mouse", "hard disk", "computer", "memory", and "computing machine".

【００７８】第２の処理の結果得られた後置単語の集合
ＢＦ（ａ）と前置単語の集合ＦＢ（ａ）の共通要素を抽
出すると、「コンピュータ」、「計算機」、「メモリ」
が得られる。これらの共通要素には注目単語「コンピュ
ータ」が当然に含まれる。共通要素から注目単語を除い
たものが関連語の候補ｂとなる。これが第３の処理の結
果である。Extracting the common elements of the postfix-word set BF(a) and the prefix-word set FB(a) obtained by the second process yields "computer" (コンピュータ), "computing machine" (計算機), and "memory". These common elements naturally include the word of interest, "computer". The common elements other than the word of interest become the related-word candidates b. This is the result of the third process.

【００７９】５．実行例日本語で書かれたＮＴＣＩＲの学術文書データ（文献
［１］）をテキスト集合として、本実施の形態の関連語
抽出装置１００による関連語リストの構築を行った実行
例を説明する。ＮＴＣＩＲの学術文書データは、様々な
分野からの学術文書の抄録を集めたものであり、そのデ
ータセットの一つであるＮＴＣＩＲ１は、３３万件び日
本語論文のアブストラクトのコレクションである。5. Execution example
An execution example is described in which the related word extraction device 100 of the present embodiment builds a related-word list using the NTCIR academic document data written in Japanese (reference [1]) as the text set. The NTCIR academic document data is a collection of abstracts of academic documents from various fields; NTCIR1, one of its data sets, is a collection of 330,000 abstracts of Japanese papers.

【００８０】５．１単語の切り出し日本語論文のアブストラクトの一例として次のようなテ
キストが入力され、単語の切り出し工程で、単語が切り
出された。5.1 Word extraction
The following text was input as an example of an abstract of a Japanese paper, and words were extracted from it in the word extraction step.

【００８１】[0081]

Input text: "Superconducting three-winding reactor type fault current limiter. We devised the use of a superconducting three-winding reactor type current limiter as a current limiter for suppressing fault currents in power systems, built and tested a small prototype device, and confirmed its operation. For a one-line ground fault, which accounts for a certain proportion of transmission line faults, this current limiter limits the current by means of a large zero-phase reactance and does not quench. For two-line and three-line ground faults, the superconducting winding quenches and the current is limited by a large normal-conducting resistance. Economical operation is therefore possible."

【００８２】[0082]

Extracted words: "superconducting" (超電導), "winding" (巻線), "reactor" (リアクトル), "current limiter" (限流器), "power system" (電力系統), "fault" (故障), "device" (装置), "transmission line" (送電線), "zero-phase" (零相), "current limiting" (限流), "two-line ground fault" (二線地絡)

【００８３】[0083]

5.2 Narrowing Down Candidates

For "superconductivity" (超電導), candidate pairs of related words such as ("superconductivity", "switch"), ("superconductivity", "propulsion coil"), and ("superconductivity" (超電導), "superconductivity" (超伝導)) were extracted. Words such as "current", "voltage", and "magnetism" appeared before and after the word strings of the related words.

【００８４】[0084]

5.3 Examples of Final Related Words

Scores were evaluated for the related word candidates to identify the final related words. The length of the character strings examined before and after each word in the score functions was set to 4, a value chosen empirically as suitable for the corpus data. The threshold α was set to 5 for the first score function and to 3 for the second score function. The following kinds of words with the same meaning were extracted as related words: words with spelling variation, such as 「サーバ」 and 「サーバー」 (both "server"); abbreviated forms, such as 「メール」 ("mail") and 「電子メール」 ("e-mail"); loanwords transcribed differently, such as 「ウィルス」 and 「ウイルス」 (both "virus"); identical words written in different character types, such as 「タンパク質」 and 「蛋白質」 (both "protein"); and identical words written with different character codes, such as 「靱性」 and 「靭性」 (both "toughness").
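To make the context length of 4 concrete, the sketch below collects the 4-character strings immediately before and after each occurrence of a word and counts how many of those contexts two words share. The overlap count is a hypothetical stand-in for illustration only; it is not the first or second score function of the embodiment, and the names `contexts`, `shared_context_count`, and `TEXT` are assumptions.

```python
# Hypothetical toy corpus; the embodiment uses the NTCIR1 abstracts instead.
TEXT = "in the winter season. in the summer season."

def contexts(text, word, width=4):
    """Collect the strings of `width` characters before and after each occurrence of `word`."""
    before, after = set(), set()
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        before.add(text[max(0, start - width):start])
        after.add(text[end:end + width])
        start = text.find(word, start + 1)
    return before, after

def shared_context_count(text, a, b, width=4):
    """Count the preceding and following contexts that words a and b have in common."""
    before_a, after_a = contexts(text, a, width)
    before_b, after_b = contexts(text, b, width)
    return len(before_a & before_b) + len(after_a & after_b)

print(shared_context_count(TEXT, "winter", "summer"))  # 2
print(shared_context_count(TEXT, "winter", "season"))  # 0
```

A candidate pair would be kept only when its score reaches the threshold. The thresholds 5 and 3 given above apply to the embodiment's own score functions, not to this simplified count.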

【００８５】[0085]

In addition, the following were also extracted as related words: words with similar meanings, such as 「せん断強度」 ("shear strength") and 「せん断耐力」 ("shear capacity"); words in a superordinate or subordinate concept relation, such as 「ネットワーク管理」 and 「網管理」 (both "network management"); words sharing the same superordinate concept, such as "DRAM" and "SRAM"; and words with opposite meanings, such as 「冬季」 ("winter") and 「夏季」 ("summer").

【００８６】[0086]

The present invention has been described above based on an embodiment. Those skilled in the art will understand that the embodiment is illustrative, that various modifications are possible in the combinations of its constituent elements and processing steps, and that such modifications also fall within the scope of the present invention. Such modifications are described below.

【００８７】[0087]

In the above embodiment, Japanese was taken as an example of the text from which related words are extracted, but this related word extraction technique is language independent. It can also be applied to documents for which no dictionary has been prepared, such as ancient documents, and to undeciphered documents such as encrypted documents. Although the embodiment described the extraction of related words using documents as an example, the target of extraction is not limited to documents. For example, with this technique it is also possible to measure the appearance frequency of gene sequences in genetic information, determine the degree of association between gene sequences, and extract related gene sequences.

【００８８】[0088]

The references cited in this specification are listed below.
[1] Noriko Kando, Kazuko Kuriyama, Toshihiko Nozue, Koji Eguchi, Hiroyuki Kato, and Souichiro Hidaka, Overview of IR Tasks at the First NTCIR Workshop, Proceedings of NTCIR1 Workshop, Vol.1, pp.11-44, 1999.
[2] Akiko Aizawa, On the quantitative representation of feature degree based on the co-occurrence of words and documents, Transactions of the Information Processing Society of Japan, Vol.41, No.12, pp.3332-3342, 2000.
[3] Kenneth W. Church, Empirical Estimates of Adaptation, Coling2000, pp.180-186, 2000.
[4] Michiko Tanaka, Yoshiyuki Takeda, Daiya Nakamura, Eiko Yamamoto, and Kyoji Umemura, Keyword extraction experiment by purely statistical processing, Proceedings of the 42nd Programming Symposium, pp.155-158, 2001.
[5] Yoshiyuki Takeda and Kyoji Umemura, Document frequency analysis for keyword extraction, Keiryo Kokugogaku (Mathematical Linguistics), Vol.23, No.2, pp.65-90, 2001.
[6] Udi Manber and Gene Myers, Suffix arrays: A new method for on-line string searches, Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, pp.319-327, 1990.
[7] Hideo Ito, Efficient construction method of Suffix Array, Transactions of the Information Processing Society of Japan, Vol.41, No.SIG1 (TOD5), pp.31-39, 2000.
[8] Mikio Yamamoto and Kenneth W. Church, Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, Computational Linguistics, Vol.27, No.1, pp.1-30, MIT Press.

【００８９】[0089]

[Effect of the Invention]

According to the present invention, related words can be extracted efficiently.

[Brief description of drawings]

FIG. 1 is a configuration diagram of a related word extraction device according to the embodiment.

FIG. 2 is a functional configuration diagram of the candidate selection unit of FIG. 1.

FIG. 3 is a flowchart of a related word extraction procedure according to the embodiment.

FIG. 4 is a diagram illustrating the procedure by which related word candidates are identified from a target word.

[Explanation of symbols]

10 counting processing unit, 12 Suffix Array generation unit, 14 character string frequency counting unit, 16 document frequency calculation unit, 20 search unit, 22 keyword extraction unit, 24 document database, 26 document file, 28 Suffix Array file, 30 character string appearance frequency data, 32 keyword index file, 34 related word list, 40 related word extraction unit, 42 candidate selection unit, 44 relevance determination unit, 46 registration unit, 100 related word extraction device.

Claims

[Claims]

1. A related word extraction device comprising: a document database that stores a document set; a candidate selection unit that selects, as a candidate pair of related words, two different words that have common words contiguous before and after them in the documents; a determination unit that determines, for the candidate pair of related words, a degree of association based on the similarity of the character strings connected before and after those words; and a registration unit that registers candidate pairs having a high degree of association in the document database as a related word dictionary.

2. The related word extraction device according to claim 1, wherein the candidate selection unit includes: a first processing unit that extracts from the document set a list of prefix words that immediately precede a target word in the documents and a list of suffix words that immediately follow the target word in the documents; a second processing unit that extracts from the document set, for each prefix word in the prefix word list, a set of suffix words that immediately follow it in the documents and, for each suffix word in the suffix word list, a set of prefix words that immediately precede it in the documents; and a third processing unit that extracts words included in both the suffix word set and the prefix word set as candidates for related words of the target word.

3. The related word extraction device according to claim 2, wherein the candidate selection unit further includes a pre-processing unit that generates a double word list by extracting, from among word strings in which two words are connected in the documents, those for which the estimated appearance probability of the word string is significantly higher than the product of the estimated appearance probabilities of the two words forming it, and wherein the first and second processing units extract each element of the prefix word list, the suffix word list, the suffix word set, and the prefix word set from the double word list.

4. The related word extraction device according to claim 1, wherein the determination unit determines the degree of association for the candidate pair of related words based on an evaluation value relating to the number of times a combined character string, which includes the character strings connected before and after those words, appears in the document set and the number of documents in which the combined character string appears.

5. The related word extraction device according to any one of claims 1 to 4, further comprising a keyword extraction unit that extracts keywords in the documents from the document set based on the frequency of occurrence of character strings, wherein the candidate selection unit performs the process of selecting candidate pairs of related words for the keywords.

6. The related word extraction device according to claim 5, wherein the registration unit registers the keywords as a keyword index in the document database, and the device further comprises a search unit that, in response to a search request from a user, searches for matching documents using the related word dictionary together with the keyword index.

7. A related word extraction method comprising: a step of selecting, from a document set stored in a document database, two different words that have common words contiguous before and after them in the documents as a candidate pair of related words; a step of determining, for the candidate pair of related words, a degree of association based on the similarity of the character strings connected before and after those words; and a step of registering candidate pairs having a high degree of association in the document database as related words.

8. A related word extraction method comprising: a step of extracting, from a document set stored in a document database, a list of prefix words that immediately precede a target word in the documents; a step of extracting, from the document set, a list of suffix words that immediately follow the target word in the documents; a step of extracting from the document set, for each prefix word in the prefix word list for the target word, the suffix words that immediately follow that prefix word, and generating a suffix word set having those suffix words as elements; a step of extracting from the document set, for each suffix word in the suffix word list for the target word, the prefix words that immediately precede that suffix word, and generating a prefix word set having those prefix words as elements; and a step of extracting words included in both the suffix word set and the prefix word set as candidates for related words of the target word.

9. The related word extraction method according to claim 8, further comprising: a step of determining a degree of association between a related word candidate and the target word from the commonality of the character strings appearing before and after them; and a step of registering a related word candidate having a high degree of association in the document database as a related word of the target word.

10. A computer program causing a computer to execute: a step of generating a double word list by extracting, from a document set stored in a document database, those word strings in which two words are connected in a document and for which the estimated appearance probability of the word string is significantly higher than the product of the estimated appearance probabilities of the two words constituting it; a step of extracting from the double word list the prefix words that immediately precede a target word and generating a prefix word list; a step of extracting from the double word list the suffix words that immediately follow the target word and generating a suffix word list; a step of extracting from the double word list, for each prefix word in the prefix word list for the target word, the suffix words that immediately follow that prefix word, and generating a suffix word set having those suffix words as elements; a step of extracting from the double word list, for each suffix word in the suffix word list for the target word, the prefix words that immediately precede that suffix word, and generating a prefix word set having those prefix words as elements; and a step of extracting words included in both the suffix word set and the prefix word set as candidates for related words of the target word.