JP5542729B2

JP5542729B2 - Related word extraction device, related word extraction method, and related word extraction program

Info

Publication number: JP5542729B2
Application number: JP2011089567A
Authority: JP
Inventors: 貴行足立; 俊郎内山; 考藤村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-04-13
Filing date: 2011-04-13
Publication date: 2014-07-09
Anticipated expiration: 2031-04-13
Also published as: JP2012221431A

Description

The present invention relates to a related word extraction technique for extracting words related to a given word.

A vast amount of electronic text with varied content exists in the world, and information retrieval systems are used to find desired information in it. In such a system, a user enters a search term related to the desired information and obtains electronic text information related to that term. Taken as a set, search terms reflect the search intentions of many people, so a system provider can analyze the set to expand the search targets that people need and to improve the ways information is accessed. Because the set can also be regarded as reflecting interests shared by many people, it is useful for marketing analysis as well.

As prior art, an information association device has been disclosed that, in an information retrieval system, determines the strength of association between search terms used within a predetermined period and regards strongly associated terms as terms used to obtain the same information. It thereby associates search terms whose relationship has temporarily strengthened, such as "New Year's card" and "winning number" at the beginning of the year (see Patent Document 1).

There is also prior art that associates words based on the co-occurrence frequency of words within queries in the query log of an information retrieval system. For example, sorting the words that appear together with "Ginza" (its co-occurring words) in descending order of co-occurrence frequency yields related words such as "lunch" and "movie theater".

Patent Document 1: Japanese Patent No. 3547069

However, a word with the same spelling may be treated as having different meanings depending on the context in which it is used. For example, "one piece" can mean a type of clothing or the title of an anime. When related words are extracted for such a word, the prior art cannot distinguish related words that belong to different meanings.

One conceivable approach is to add "clothes" to "one piece" in order to restrict it to the clothing sense, and to treat words that co-occur with both words as related words. However, because a word must co-occur with both, a clothing-related word that co-occurs only with "one piece" cannot be extracted as a related word.

Conversely, one could add "-anime" to "one piece" in order to restrict it to the clothing sense, and treat as related words those words that co-occur with "one piece" but do not co-occur with "anime". However, an anime-related word that co-occurs only with "one piece" cannot be removed from the related words.

The present invention solves the above problems. Its object is to provide a related word extraction technique that extracts related words of a specific meaning for a target word whose meaning is ambiguous. To do so, it groups the co-occurring words of the target word, uses the co-occurrence frequencies of co-occurring words closely related to other words to total the support level of the target word's co-occurring words by addition and subtraction, selects a group with a high support level, and takes the co-occurring words belonging to that group as the related words of the target word. In this way it outputs related words of a specific meaning that is closely related to the other words.

To solve the above problems, the present invention is configured as a related word extraction device that extracts related words for a word, the device comprising:
co-occurrence word data creation means for receiving a text set, obtaining the co-occurrence frequency between each word in the text and the words that co-occur with it, and outputting the result to co-occurrence word data storage means;
word group creation means for using the co-occurrence word data stored in the co-occurrence word data storage means to obtain, for each word of a predetermined word set, the words that co-occur with that word, grouping them, and outputting the result to word group storage means;
related word group extraction means for extracting group data for an input target word from the word group data stored in the word group storage means and outputting it to related word group data storage means;
support word group extraction means for extracting, from the word group data, group data for each support word listed in an input support word list and outputting it to support word group data storage means;
support co-occurrence word extraction means for obtaining, from the support word list and the support word group data stored in the support word group data storage means, the co-occurrence frequency between each support word and its support co-occurrence words, which are the co-occurring words belonging to groups closely related to that support word, totaling the co-occurrence frequencies over all support words, and outputting the totals to support co-occurrence word data storage means; and
related word extraction means for obtaining, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support level of each support co-occurrence word that matches a co-occurring word belonging to each related word group of the target word, totaling the support levels for each related word group of the target word, selecting co-occurring words closely related to the support co-occurrence words as related words, and outputting them.

The present invention can also be configured as the related word extraction device in a related word extraction system that has a related word extraction device for extracting related words for a word and an external device, wherein
the external device comprises co-occurrence word data creation means for receiving a text set, obtaining the co-occurrence frequency between each word in the text and the words that co-occur with it, and outputting the result as co-occurrence word data, and
the related word extraction device comprises:
co-occurrence word data storage means for storing the co-occurrence word data created by the external device;
word group creation means for using the co-occurrence word data stored in the co-occurrence word data storage means to obtain, for each word of a predetermined word set, the words that co-occur with that word, grouping them, and outputting the result to word group storage means;
related word group extraction means for extracting group data for an input target word from the word group data stored in the word group storage means and outputting it to related word group data storage means;
support word group extraction means for extracting, from the word group data, group data for each support word listed in an input support word list and outputting it to support word group data storage means;
support co-occurrence word extraction means for obtaining, from the support word list and the support word group data stored in the support word group data storage means, the co-occurrence frequency between each support word and its support co-occurrence words, which are the co-occurring words belonging to groups closely related to that support word, totaling the co-occurrence frequencies over all support words, and outputting the totals to support co-occurrence word data storage means; and
related word extraction means for obtaining, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support level of each support co-occurrence word that matches a co-occurring word belonging to each related word group of the target word, totaling the support levels for each related word group of the target word, selecting co-occurring words closely related to the support co-occurrence words as related words, and outputting them.

The present invention may also be configured as the related word extraction device in a related word extraction system that has a related word extraction device for extracting related words for a word and an external device, wherein
the external device comprises co-occurrence word data creation means for receiving a text set, obtaining the co-occurrence frequency between each word in the text and the words that co-occur with it, and outputting the result to co-occurrence word data storage means, and word group creation means for using the co-occurrence word data stored in the co-occurrence word data storage means to obtain, for each word of a predetermined word set, the words that co-occur with that word, grouping them, and outputting the result as word group data, and
the related word extraction device comprises:
word group storage means for storing the word group data created by the external device;
related word group extraction means for extracting group data for an input target word from the word group data stored in the word group storage means and outputting it to related word group data storage means;
support word group extraction means for extracting, from the word group data, group data for each support word listed in an input support word list and outputting it to support word group data storage means;
support co-occurrence word extraction means for obtaining, from the support word list and the support word group data stored in the support word group data storage means, the co-occurrence frequency between each support word and its support co-occurrence words, which are the co-occurring words belonging to groups closely related to that support word, totaling the co-occurrence frequencies over all support words, and outputting the totals to support co-occurrence word data storage means; and
related word extraction means for obtaining, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support level of each support co-occurrence word that matches a co-occurring word belonging to each related word group of the target word, totaling the support levels for each related word group of the target word, selecting co-occurring words closely related to the support co-occurrence words as related words, and outputting them.

In the support co-occurrence word extraction means, each support word in the support word list may carry information indicating whether its support method is selection or exclusion. Based on that information, when the co-occurrence frequencies are totaled over all support words, the frequencies may be added for a support word whose method is selection and subtracted for a support word whose method is exclusion.
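The addition and subtraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `aggregate_support`, the tuple layout of the entries, and all frequencies are assumptions made for the example.

```python
from collections import Counter


def aggregate_support(support_entries):
    """Total co-occurrence frequencies over all support words.

    support_entries: list of (support_word, exclude, cooc) tuples, where
    `exclude` is True for a support word marked for exclusion and `cooc`
    maps each support co-occurrence word to its frequency (hypothetical layout).
    """
    scores = Counter()
    for _support_word, exclude, cooc in support_entries:
        sign = -1 if exclude else 1  # subtract for exclusion, add for selection
        for word, freq in cooc.items():
            scores[word] += sign * freq
    return dict(scores)


# Made-up frequencies: "clothes" is selected, "anime" is excluded.
entries = [
    ("clothes", False, {"dress": 30, "skirt": 20, "cosplay": 5}),
    ("anime", True, {"manga": 40, "cosplay": 8}),
]
support_scores = aggregate_support(entries)
```

A word that co-occurs with both kinds of support word, such as "cosplay" here, ends up with the difference of the two frequencies, so it can come out negative.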

The target word may be input, for example, in the form of a search expression. In that case, the related word group extraction means extracts group data for each word contained in the search expression and outputs it to the related word group data storage means. The related word extraction means then obtains, for each word contained in the search expression, the support level of each support co-occurrence word that matches a co-occurring word belonging to each related word group, totals the support levels for each related word group of the target word, extracts the co-occurring words closely related to the support co-occurrence words, and selects as related words the co-occurring words that satisfy the conditions of the search expression.

The support word list may likewise be input, for example, in the form of a search expression. In that case, the support word group extraction means looks up group data for each word contained in the search expression and outputs it to the support word group data storage means, and the support co-occurrence word extraction means selects as support co-occurrence words the co-occurring words that satisfy the conditions of the search expression.

In the related word extraction device, the processing of the word group creation means may be performed within the related word group extraction means and/or the support word group extraction means, so that the word group data for the target word and the support words is created on demand after the target word and the support word list are input.

The present invention may also be configured as a related word extraction method executed by the related word extraction device. Furthermore, the present invention may be configured as a related word extraction program for causing a computer to function as each means of the related word extraction device.

According to the present invention, related words of a specific meaning can be extracted with high accuracy for a target word whose meaning is ambiguous. This is achieved by grouping the co-occurring words of the target word, using the co-occurrence frequencies of co-occurring words closely related to other words to total the support level of the target word's co-occurring words by addition and subtraction, selecting a group with a high support level, and taking the co-occurring words belonging to that group as the related words of the target word.

FIG. 1 is a configuration diagram of a related word extraction device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing flow of the related word extraction device shown in FIG. 1.
FIG. 3 shows an example (1) of various data.
FIG. 4 shows an example (2) of various data.
FIG. 5 shows an example (3) of various data.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to the following embodiments.

(Device configuration and operation overview)
FIG. 1 is a configuration diagram of a related word extraction device 100 according to an embodiment of the present invention, and FIG. 2 is a flowchart showing the processing flow of the related word extraction device 100 of FIG. 1. S100 to S150 in FIG. 2 denote the individual processing steps.

As shown in FIG. 1, the related word extraction device 100 of this embodiment includes a co-occurrence word data creation unit 110, a word group creation unit 120, a related word group extraction unit 130, a support word group extraction unit 140, a support co-occurrence word extraction unit 150, a related word extraction unit 160, a co-occurrence word database 170, a word group database 180, a related word group database 190, a support word group database 200, and a support co-occurrence word database 210. The related word extraction device 100 takes a text set 300, a target word 400, and a support word list 500 as input, and outputs the related words extracted by the related word extraction unit 160 to a related word database 600.

The related word extraction device 100 can be realized by causing a computer to execute a program that describes the processing explained in this embodiment. That is, the functions of all or some of the functional units of the related word extraction device 100 can be realized by executing programs corresponding to the processing of each unit, using hardware resources such as the CPU and memory built into the computer that constitutes the device. Each database in the related word extraction device 100 is realized by storage means such as memory. The program can be recorded on a computer-readable recording medium, for example an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, BD (Blu-ray Disk)-ROM, CD-R, CD-RW, DVD-R, DVD-RW, BD-R, BD-RE, HDD, or removable disk, and then stored or distributed. The program can also be provided over a network, such as the Internet or electronic mail.

Next, an overview of the operation of the related word extraction device 100 is given with reference to the flowchart of FIG. 2.

Step 100) The co-occurrence word data creation unit 110 receives the text set 300, obtains the co-occurrence frequency between each word in the text and the words that co-occur with it, and outputs the result to the co-occurrence word database 170.

Step 110) The word group creation unit 120 uses the co-occurrence word data stored in the co-occurrence word database 170 to obtain, for each word of a predetermined word set, the words that co-occur with that word, groups them, and outputs the result to the word group database 180.

Step 120) The related word group extraction unit 130 searches the word group database 180 for group data for the target word 400 and outputs it to the related word group database 190.

Step 130) The support word group extraction unit 140 searches the word group database 180 for group data for each support word listed in the support word list 500 and outputs it to the support word group database 200.

Step 140) The support co-occurrence word extraction unit 150 obtains, from the support word list 500 and the support word group database 200, the co-occurrence frequency between each support word and the co-occurring words belonging to groups closely related to that support word (support co-occurrence words), totals the co-occurrence frequencies over all support words, and outputs the totals to the support co-occurrence word database 210.

Step 150) The related word extraction unit 160 uses the data stored in the related word group database 190 and the support co-occurrence word database 210 to obtain the support level of each support co-occurrence word that matches a co-occurring word belonging to each related word group of the target word, totals the support levels for each related word group of the target word, selects co-occurring words closely related to the support co-occurrence words as related words, and outputs them to the related word database 600.
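As a rough illustration of Step 150, the sketch below totals the support level per related word group and returns the members of one group. The rule used here, picking the single group with the highest total, is an assumption made for the example, since this section only says that a group with a high support level is selected. The function name, data layout, and numbers are hypothetical.

```python
def extract_related_words(target_groups, support_scores):
    """Pick the related word group of the target word with the highest support.

    target_groups: dict mapping group id -> list of co-occurring words of the target word.
    support_scores: dict mapping support co-occurrence word -> signed support level.
    Returns (per-group totals, id of the chosen group, its words).
    """
    group_support = {
        gid: sum(support_scores.get(word, 0) for word in words)
        for gid, words in target_groups.items()
    }
    # Assumption: "high support" is read as "the single highest total".
    best = max(group_support, key=group_support.get)
    return group_support, best, sorted(target_groups[best])


# Made-up groups for the target word "one piece" and made-up support levels.
target_groups = {
    1: ["dress", "skirt", "sale"],
    2: ["manga", "episode"],
    3: ["puzzle"],
}
support_scores = {"dress": 30, "skirt": 20, "manga": -40, "cosplay": -3}
group_support, best_group, related = extract_related_words(target_groups, support_scores)
```

In this example "sale" has no support level of its own, yet it is returned because it belongs to the chosen group. That is the situation the simple "must co-occur with both words" approach cannot handle.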

(Operation details)
Next, the series of operations of the related word extraction device 100 in this embodiment is described in more detail using concrete examples. The examples used in the following description are merely illustrative.

The co-occurrence word data creation unit 110 receives the text set 300, obtains the co-occurrence frequency between each word in the text and the words that co-occur with it, and outputs the result to the co-occurrence word database 170. For example, FIG. 3(A) shows a text set obtained from the search query log of an information retrieval system. In the text set of FIG. 3(A), each line is one text representing one query. From this text set, the pairs formed by a word in a text and a word co-occurring with it are obtained, the number of occurrences of each pair (the co-occurrence frequency) is totaled, and the pairs are sorted in descending order of frequency, which gives the co-occurrence word data of FIG. 3(B). Even when the text set consists of ordinary sentences, it can be processed in the same way by extracting independent words through morphological analysis and treating one sentence or one clause as one text.
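The counting step described above can be sketched as follows. This is a minimal illustration under stated assumptions: queries are split on whitespace, each word pair is counted at most once per query, and the sample queries are invented rather than taken from FIG. 3(A). The function name `build_cooccurrence` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(queries):
    """Count, for each unordered word pair, how many queries contain both words."""
    counts = Counter()
    for query in queries:
        # Assumption: whitespace tokenization; duplicates within a query are ignored.
        words = sorted(set(query.split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


# Invented query log: one query per line, as in a search query log.
queries = [
    "one_piece dress",
    "one_piece anime",
    "one_piece dress sale",
    "ginza lunch",
]
cooc = build_cooccurrence(queries)

# Pairs in descending order of co-occurrence frequency.
ranked_pairs = cooc.most_common()
```

Sorting each query's words before pairing gives every pair a single canonical key, so ("dress", "one_piece") and ("one_piece", "dress") are not counted separately.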

The word group creation unit 120 uses the co-occurrence word data stored in the co-occurrence word database 170 to obtain, for each word of a predetermined word set, the words that co-occur with that word, groups them, and outputs the result to the word group database 180.

For example, the predetermined word set is first defined as the top N entries of the co-occurrence word count data of FIG. 3(C), in which the words of the co-occurrence word data of FIG. 3(B) are sorted in descending order of their number of co-occurring words. Alternatively, the predetermined word set may be defined as the top N words ranked by a value computed with a function based on word frequency in the text set, or as the set of words listed in a word list prepared in advance.
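The first way of defining the predetermined word set, taking the top N words by number of distinct co-occurring words, might look like the sketch below. The function name, the alphabetical tie-break, and the sample pair counts are assumptions made for the example.

```python
from collections import defaultdict


def top_n_by_cooccurring_words(pair_counts, n):
    """Return the n words with the most distinct co-occurring words."""
    partners = defaultdict(set)
    for a, b in pair_counts:
        partners[a].add(b)
        partners[b].add(a)
    # Assumption: ties are broken alphabetically to keep the result deterministic.
    ranked = sorted(partners, key=lambda w: (-len(partners[w]), w))
    return ranked[:n]


# Invented co-occurrence word data: (word, word) -> co-occurrence frequency.
pair_counts = {
    ("dress", "one_piece"): 2,
    ("anime", "one_piece"): 1,
    ("dress", "sale"): 1,
}
word_set = top_n_by_cooccurring_words(pair_counts, 2)
```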

Next, the processing performed for each word of the predetermined word set is described. For example, if the word is "one piece", its co-occurring words are obtained from the co-occurrence word data of FIG. 3(B), and the co-occurring word data of FIG. 3(D) is extracted. Because the co-occurrence frequency between "one piece" and each co-occurring word is also available from FIG. 3(B), words whose co-occurrence frequency is below a predetermined threshold are not extracted.

Next, to represent the characteristics of each co-occurring word in the co-occurring word data of FIG. 3(D), the words that co-occur with each co-occurring word, together with their co-occurrence frequencies, are obtained from the co-occurrence word data of FIG. 3(B), and the feature vector data of FIG. 3(E) is created, whose elements are the co-occurrence frequencies for the words obtained. Since the feature vector only needs to express the characteristics of a word on the basis of co-occurrence frequency, a value obtained by applying an arbitrary function to the co-occurrence frequency may be used in its place. Finally, the co-occurring words of "one piece" are clustered using the feature vectors, which yields group data consisting of four groups as shown in FIG. 3(F), and this group data is output to the word group database 180. Any clustering method may be used as long as it operates on feature vectors. The number of clusters is determined in advance; FIG. 3(F), for example, is the result of processing with four clusters. This is the processing for one word of the predetermined word set, and every word in the predetermined word set is processed in the same way.
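The feature vector construction and clustering can be sketched as below. The text leaves the clustering method open, so this example swaps in a simple average-linkage agglomerative clustering with cosine similarity purely for illustration. All names and the sample frequencies are assumptions, not data from FIG. 3.

```python
import math


def feature_vectors(cooc, words):
    """Build one vector per word; each element is a co-occurrence frequency.

    cooc: dict mapping word -> {co-occurring word: frequency} (hypothetical layout).
    """
    vocab = sorted({v for w in words for v in cooc.get(w, {})})
    return {w: [cooc.get(w, {}).get(v, 0) for v in vocab] for w in words}


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cluster(vectors, k):
    """Merge the two most similar clusters (average linkage) until k remain."""
    clusters = [[w] for w in sorted(vectors)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


# Invented co-occurrence data for four co-occurring words of "one piece".
cooc = {
    "dress": {"sale": 5, "summer": 3},
    "skirt": {"sale": 4, "summer": 2},
    "manga": {"episode": 6, "character": 2},
    "movie": {"episode": 3, "character": 4},
}
vectors = feature_vectors(cooc, ["dress", "skirt", "manga", "movie"])
groups = cluster(vectors, k=2)
```

With these numbers the clothing-related words and the anime-related words end up in separate groups, because their vectors share no non-zero dimensions.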

The related word group extraction unit 130 searches the word group database 180 for group data for the target word 400 and outputs it to the related word group database 190. For example, if the target word is "one piece", the group data of FIG. 3(F), whose group data name is "one piece", is found in the word group data and output to the related word group database 190.

The support word group extraction unit 140 searches the word group database 180 for group data for each support word listed in the support word list 500 and outputs it to the support word group database 200. For example, suppose the support word list is the one shown in FIG. 4(G). In this list, "clothes" indicates that co-occurring words closely related to "clothes" are to be selected, while "-anime", which has "-" in front of the word, indicates that co-occurring words closely related to "anime" are to be excluded. In this case, the group data of the support words "clothes" and "anime", shown in FIG. 4(H) and FIG. 4(I), are found in the word group database by group data name and output to the support word group database 200.

The support co-occurrence word extraction unit 150 obtains, from the support word list 500 and the support word group database 200, the co-occurrence frequency between each support word and the co-occurring words belonging to groups closely related to that support word (support co-occurrence words), totals the co-occurrence frequencies over all support words, and outputs the totals to the support co-occurrence word database 210. For example, for "clothes" in the support word list shown in FIG. 4(G), the support word group database 200 holds the group data for "clothes" of FIG. 4(H), and only the groups of co-occurring words closely related to "clothes" are extracted from it. The extraction works as follows: for each group, the co-occurrence frequencies between the support word and the co-occurring words in the group are totaled; the groups are then taken in descending order of their totals while each group's share of the overall total is accumulated; and the relevant groups are those up to and including the group at which the cumulative share first reaches a predetermined threshold.

For example, the number in parentheses after each co-occurrence word in the groups of FIG. 4(H) is the co-occurrence frequency between the support word "clothes" and that word. Summing these frequencies for each group and computing each group's share of the overall total and the cumulative share yields the group aggregate data for "clothes" in FIG. 4(J). The group numbers in FIG. 4(H) and FIG. 4(J) correspond to each other. If the predetermined threshold is 0.9, the cumulative share first reaches or exceeds the threshold when it becomes 0.99, so groups 1 to 3 are the relevant groups. Taking the co-occurrence frequencies between the support word and the co-occurrence words in groups 1 to 3 from the parenthesized numbers in the group data of FIG. 4(H) gives FIG. 4(K), the data of words closely related to "clothes". Similarly, for "−anime" in the support word list, FIG. 4(I) is the group data of the support word "anime" with the "−" removed. From the group aggregate data for "anime" in FIG. 5(L), the cumulative share first reaches or exceeds the threshold when it becomes 0.94, so group 1 is the relevant group. Taking the co-occurrence frequencies between the support word and the co-occurrence words in group 1 from the group data of FIG. 4(I) gives FIG. 5(M), the data of words closely related to "anime". Although the co-occurrence words belonging to groups closely related to the support word are obtained here from the groups up to the point at which the cumulative share first reaches the threshold, the method is not limited to this as long as similar results are obtained.
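The cumulative-share selection described above can be sketched as follows. The function name and the sample frequencies are hypothetical; they are chosen only so that the cumulative share reaches 0.99 at the third group with a threshold of 0.9, mirroring the "clothes" example, and do not reproduce the actual values of FIG. 4(H) or FIG. 4(J).

```python
def select_groups(group_words, threshold):
    """Select groups, in descending order of summed frequency, up to and
    including the one at which the cumulative share first reaches the threshold.

    group_words: {group number -> {co-occurring word: co-occurrence frequency}}
    """
    totals = {g: sum(words.values()) for g, words in group_words.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    selected, running = [], 0
    for group, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(group)
        running += total  # accumulate counts, then divide, to limit rounding error
        if running / grand_total >= threshold:
            break
    return selected

# Hypothetical group data for the support word "clothes".
clothes_groups = {
    1: {"fashion": 300, "coordinate": 200},  # 500 -> share 0.50, cumulative 0.50
    2: {"mail order": 300},                  # 300 -> share 0.30, cumulative 0.80
    3: {"brand": 190},                       # 190 -> share 0.19, cumulative 0.99
    4: {"storage": 10},                      #  10 -> share 0.01
}
print(select_groups(clothes_groups, 0.9))  # [1, 2, 3]
```

The words of the selected groups, with their frequencies, then form the data of words closely related to the support word.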

The same processing is performed for every word in the support word list, and finally the data of words closely related to the support words is aggregated and output to the support co-occurrence word database 210. For example, "clothes" in the support word list is used to select words closely related to the support word, so the corresponding data of FIG. 4(K) is added. "−anime" has a leading "−" and is therefore used to exclude words closely related to the support word, so the data of FIG. 5(M), corresponding to "anime" with the "−" removed, is subtracted. The aggregated result, the data of words closely related to the support words in FIG. 5(N), is output to the support co-occurrence word database 210.
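The signed aggregation can be sketched as follows, assuming each support word carries a sign of +1 (selection) or -1 (exclusion). The function name and all frequencies are hypothetical, except that "mail order" is given the value 1200, which the description mentions for FIG. 5(N).

```python
from collections import defaultdict

def aggregate_support(related_word_data, signs):
    """Add frequencies for selected support words and subtract them for excluded ones.

    related_word_data: {support word -> {closely related word: frequency}}
    signs:             {support word -> +1 or -1}
    """
    totals = defaultdict(int)
    for support_word, word_freqs in related_word_data.items():
        sign = signs[support_word]
        for word, freq in word_freqs.items():
            totals[word] += sign * freq
    return dict(totals)

# Hypothetical stand-ins for FIG. 4(K) ("clothes") and FIG. 5(M) ("anime").
related_word_data = {
    "clothes": {"mail order": 1200, "fashion": 800},
    "anime": {"character": 900, "fashion": 100},
}
signs = {"clothes": +1, "anime": -1}
support_freq = aggregate_support(related_word_data, signs)
print(support_freq)  # {'mail order': 1200, 'fashion': 700, 'character': -900}
```

A word that is related to both a selected and an excluded support word ends up with the difference, and a word related only to excluded support words ends up negative.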

The related word extraction unit 160 uses the related word group database 190 and the support co-occurrence word database 210 to obtain the support degree of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of the target word. It aggregates the support degrees for each related word group of the target word, selects the co-occurrence words closely related to the support words as related words, and outputs them to the related word database 600.

For example, for the co-occurrence word "mail order" in group 4 of the "one piece" group data of FIG. 3(F) in the related word group database 190, the data of words closely related to the support words in FIG. 5(N), held in the support co-occurrence word database 210, shows a co-occurrence frequency of 1200 for "mail order". This frequency is added to group 4 of FIG. 3(F). Co-occurrence frequencies are looked up in FIG. 5(N) in the same way for all co-occurrence words in the "one piece" group data of FIG. 3(F), summed for each group of FIG. 3(F), and sorted in descending order of support degree, which yields FIG. 5(O). Next, from the cumulative share computed in descending order of support degree in FIG. 5(O), the groups up to the point at which the cumulative share first reaches or exceeds the predetermined threshold are selected. The share of a group with a negative support degree is set to 0. If the predetermined threshold is 0.9, the cumulative share in FIG. 5(O) first reaches or exceeds the threshold when it becomes 1.0, so groups 3 and 4 are the relevant groups, and the words in those groups are output to the related word database 600 as related word data.
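A rough sketch of this step follows. The function name, group contents, and frequencies are hypothetical (only the value 1200 for "mail order" comes from the text). The sketch also assumes that the overall total is taken over groups with positive support only, which is one way to make the share of a negative-support group 0; the description does not spell out how the total is computed.

```python
def select_related_words(target_groups, support_freq, threshold):
    """Score each related word group of the target word and pick the supported ones.

    target_groups: {group number -> list of co-occurring words of the target word}
    support_freq:  {support co-occurrence word -> aggregated (signed) frequency}
    Returns (selected group numbers, related words).
    """
    support = {g: sum(support_freq.get(w, 0) for w in words)
               for g, words in target_groups.items()}
    positive_total = sum(s for s in support.values() if s > 0)
    if positive_total == 0:
        return [], []
    selected, running = [], 0
    for group, score in sorted(support.items(), key=lambda kv: kv[1], reverse=True):
        if score <= 0:  # zero or negative support contributes a share of 0
            break
        selected.append(group)
        running += score
        if running / positive_total >= threshold:
            break
    related = [w for g in selected for w in target_groups[g]]
    return selected, related

# Hypothetical group data for the target word "one piece".
target_groups = {
    1: ["manga", "character"],
    2: ["movie"],
    3: ["fashion", "coordinate"],
    4: ["mail order", "brand"],
}
support_freq = {"mail order": 1200, "fashion": 700, "character": -900}
groups, words = select_related_words(target_groups, support_freq, 0.9)
print(groups)  # [4, 3]
print(words)   # ['mail order', 'brand', 'fashion', 'coordinate']
```

Note that "brand" and "coordinate" are returned even though they carry no support of their own: they are extracted because they share a group with supported words.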

The words output in this way are related words of the target word whose meaning has been narrowed to the sense closely related to the support words. Although the co-occurrence words belonging to groups closely related to the support words are obtained here from the groups up to the point at which the cumulative share first reaches the threshold, the method is not limited to this as long as similar results are obtained.

In the present embodiment, suppose for example that only related words concerning clothing are to be extracted from the target word "one piece", and that the co-occurrence words of "one piece" include a brand name of one-piece dresses. Grouping the co-occurrence words places that brand name in a group concerning clothing. Even if the support word "clothes", which is designated for selection, does not co-occur with the brand name, the co-occurrence words closely related to "clothes" support the clothing-related group of "one piece", so the brand name can still be extracted as a related word.

Likewise, in the present embodiment, suppose that only related words concerning clothing are to be extracted from the target word "one piece", and that the co-occurrence words of "one piece" include the name of a character in the anime. Grouping the co-occurrence words places that character name in a group concerning the anime. Even if the support word "anime", which is designated for exclusion, does not co-occur with the character name, the co-occurrence words closely related to "anime" withhold support from the anime-related group of "one piece", so the character name is excluded from the related words.

As an alternative to the present embodiment, one could directly look up, in the related word group database 190 created by the related word group extraction unit 130 of FIG. 1, the cluster that contains the support word designated for selection, and select the search terms belonging to that cluster as related words. With that approach, if the support word is "clothes" and "clothes" is contained in group 4 of the "one piece" group data of FIG. 3(F), the co-occurrence words of group 4 can be extracted as related words, but co-occurrence words in group 3 that are also related to clothing, such as "fashion", cannot. In contrast, the present embodiment does not depend on whether a group directly contains the support word; because it uses the co-occurrence words of the support word, both group 3 and group 4 can be extracted.

In the above embodiment, the target word 400 in the related word extraction device 100 of FIG. 1 is a single word, but the target word 400 may instead be given as a search expression. For example, if "(one piece OR suit) AND NOT spring wear" is specified, the words extracted as related words are those that are co-occurrence words of "one piece" or of "suit", excluding the co-occurrence words of "spring wear". Specifically, the related word group extraction unit 130 of FIG. 1 searches the word group database 180 for the group data of "one piece", "suit", and "spring wear" and stores them in the related word group database 190. The related word extraction unit 160 then creates group aggregate data for "one piece", "suit", and "spring wear", extracts for each of them the co-occurrence words of the groups up to the point at which the cumulative share first exceeds the threshold, and finally extracts as related words the co-occurrence words that are contained in either the "one piece" or the "suit" result, excluding those contained in the "spring wear" result.
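For an expression of the shape "(A OR B) AND NOT C", the final combination step amounts to a set union followed by a set difference. The sketch below assumes the per-word results have already been extracted; the function name and the word sets are hypothetical, and it does not parse general search expressions.

```python
def combine_by_expression(include_sets, exclude_sets):
    """Union the results of the OR-ed words, then remove the results of the NOT-ed words."""
    included = set().union(*include_sets)
    excluded = set().union(*exclude_sets)
    return included - excluded

# Hypothetical per-word results after the threshold-based group selection.
one_piece = {"dress", "mail order", "pastel"}
suit = {"tie", "mail order"}
spring_wear = {"pastel"}

related = combine_by_expression([one_piece, suit], [spring_wear])
print(sorted(related))  # ['dress', 'mail order', 'tie']
```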

Also, in the above embodiment, the support word list 500 in the related word extraction device 100 of FIG. 1 is a list of words, but the support word list 500 may instead be given as a search expression. For example, if "(clothes OR garments) AND NOT anime" is specified (where "clothes" renders 洋服 and "garments" renders 服), the words extracted as support co-occurrence words are those that are co-occurrence words of "clothes" or of "garments", excluding the co-occurrence words of "anime". Specifically, the support word group extraction unit 140 of FIG. 1 searches the word group database 180 for the group data of "garments", "clothes", and "anime" and stores them in the support word group database 200. The support co-occurrence word extraction unit 150 then creates group aggregate data for "garments", "clothes", and "anime", extracts for each of them the co-occurrence words of the groups up to the point at which the cumulative share first exceeds the threshold, and finally treats as support co-occurrence words the co-occurrence words that are contained in either the "garments" or the "clothes" result, excluding those contained in the "anime" result.

Also, in the above embodiment, the word group creation unit 120 of FIG. 1 uses the co-occurrence word database 170 to create word group data for arbitrary words in advance. Alternatively, the processing of the word group creation unit 120 may be performed inside the related word group extraction unit 130 and the support word group extraction unit 140 (or inside either one of them), so that the word group data of the target word and the support words is created on demand after the target word 400 and the support word list 500 are input. In that case, the number of clusters used for clustering may be varied between target words and support words, or for each word.

Also, in the above embodiment, the related word extraction device 100 of FIG. 1 takes the text set 300 as input. Alternatively, the co-occurrence word data creation unit 110 and the word group creation unit 120 may be implemented as functions of an external device, so that the data of the co-occurrence word database 170 and of the word group database 180 is created beforehand by the external device and supplied to the related word extraction device as input.

That is, for example, an external device implemented by a computer includes a co-occurrence word data creation unit 110 that receives a text set as input, obtains the co-occurrence frequencies between each word in the text and the words that co-occur with it, and outputs them as co-occurrence word data. The related word extraction device 100 then includes: co-occurrence word data storage means (a database) that stores the co-occurrence word data created by the external device; a word group creation unit 120 that, using the co-occurrence word data stored in the co-occurrence word data storage means, obtains and groups, for each word of a predetermined word set, the words that co-occur with that word, and outputs the result to word group storage means; a related word group extraction unit 130 that extracts group data for an input target word from the word group data stored in the word group storage means and outputs it to related word group data storage means; a support word group extraction unit 140 that extracts, from the word group data, group data for each support word in an input support word list and outputs it to support word group data storage means; a support co-occurrence word extraction unit 150 that obtains, from the support word list and the support word group data stored in the support word group data storage means, the co-occurrence frequencies between each support word and the support co-occurrence words, which are co-occurrence words belonging to groups closely related to the support word, aggregates the co-occurrence frequencies over all support words, and outputs the aggregation result to support co-occurrence word data storage means; and a related word extraction unit 160 that obtains, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support degree of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of the target word, aggregates the support degrees for each related word group of the target word, selects the co-occurrence words closely related to the support co-occurrence words as related words, and outputs them.

As another example, the external device includes: a co-occurrence word data creation unit 110 that receives a text set as input, obtains the co-occurrence frequencies between each word in the text and the words that co-occur with it, and outputs them to co-occurrence word data storage means; and a word group creation unit 120 that, using the co-occurrence word data stored in the co-occurrence word data storage means, obtains and groups, for each word of a predetermined word set, the words that co-occur with that word, and outputs the result as word group data. The related word extraction device 100 then includes: word group storage means that stores the word group data; a related word group extraction unit 130 that extracts group data for an input target word from the word group data stored in the word group storage means and outputs it to related word group data storage means; a support word group extraction unit 140 that extracts, from the word group data, group data for each support word in an input support word list and outputs it to support word group data storage means; a support co-occurrence word extraction unit 150 that obtains, from the support word list and the support word group data stored in the support word group data storage means, the co-occurrence frequencies between each support word and the support co-occurrence words, which are co-occurrence words belonging to groups closely related to the support word, aggregates the co-occurrence frequencies over all support words, and outputs the aggregation result to support co-occurrence word data storage means; and a related word extraction unit 160 that obtains, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support degree of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of the target word, aggregates the support degrees for each related word group of the target word, selects the co-occurrence words closely related to the support co-occurrence words as related words, and outputs them.

In these cases, for example, the external device and the related word extraction device 100 may be connected by a communication network, and the data created by the external device may be input to the related word extraction device 100 over that network.

The present invention is not limited to the above embodiments, and various modifications and applications are possible within the scope of the claims.

100 … related word extraction device
110 … co-occurrence word data creation unit
120 … word group creation unit
130 … related word group extraction unit
140 … support word group extraction unit
150 … support co-occurrence word extraction unit
160 … related word extraction unit
170 … co-occurrence word database
180 … word group database
190 … related word group database
200 … support word group database
210 … support co-occurrence word database
300 … text set
400 … target word
500 … support word list
600 … related word database

Claims

A related word extraction device for extracting related words for a word, the device comprising:
co-occurrence word data creation means for receiving a text set as input, obtaining co-occurrence frequencies between each word in the text and the words that co-occur with it, and outputting them to co-occurrence word data storage means;
word group creation means for, using the co-occurrence word data stored in the co-occurrence word data storage means, obtaining and grouping, for each word of a predetermined word set, the words that co-occur with that word, and outputting the result to word group storage means;
related word group extraction means for extracting group data for an input target word from the word group data stored in the word group storage means, and outputting it to related word group data storage means;
support word group extraction means for extracting, from the word group data, group data for each support word in an input support word list, and outputting it to support word group data storage means;
support co-occurrence word extraction means for obtaining, from the support word list and the support word group data stored in the support word group data storage means, co-occurrence frequencies between each support word and support co-occurrence words, which are co-occurrence words belonging to groups closely related to the support word, aggregating the co-occurrence frequencies over all support words, and outputting the aggregation result to support co-occurrence word data storage means; and
related word extraction means for obtaining, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support degree of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of the target word, aggregating the support degrees for each related word group of the target word, selecting co-occurrence words closely related to the support co-occurrence words as related words, and outputting them.

A related word extraction device in a related word extraction system that has the related word extraction device, which extracts related words for a word, and an external device, wherein
the external device comprises
co-occurrence word data creation means for receiving a text set as input, obtaining co-occurrence frequencies between each word in the text and the words that co-occur with it, and outputting them as co-occurrence word data, and
the related word extraction device comprises:
co-occurrence word data storage means for storing the co-occurrence word data created by the external device;
word group creation means for, using the co-occurrence word data stored in the co-occurrence word data storage means, obtaining and grouping, for each word of a predetermined word set, the words that co-occur with that word, and outputting the result to word group storage means;
related word group extraction means for extracting group data for an input target word from the word group data stored in the word group storage means, and outputting it to related word group data storage means;
support word group extraction means for extracting, from the word group data, group data for each support word in an input support word list, and outputting it to support word group data storage means;
support co-occurrence word extraction means for obtaining, from the support word list and the support word group data stored in the support word group data storage means, co-occurrence frequencies between each support word and support co-occurrence words, which are co-occurrence words belonging to groups closely related to the support word, aggregating the co-occurrence frequencies over all support words, and outputting the aggregation result to support co-occurrence word data storage means; and
related word extraction means for obtaining, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support degree of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of the target word, aggregating the support degrees for each related word group of the target word, selecting co-occurrence words closely related to the support co-occurrence words as related words, and outputting them.

A related word extraction device in a related word extraction system that has the related word extraction device, which extracts related words for a word, and an external device, wherein
the external device comprises:
co-occurrence word data creation means for receiving a text set as input, obtaining co-occurrence frequencies between each word in the text and the words that co-occur with it, and outputting them to co-occurrence word data storage means; and
word group creation means for, using the co-occurrence word data stored in the co-occurrence word data storage means, obtaining and grouping, for each word of a predetermined word set, the words that co-occur with that word, and outputting the result as word group data, and
the related word extraction device comprises:
word group storage means for storing the word group data created by the external device;
related word group extraction means for extracting group data for an input target word from the word group data stored in the word group storage means, and outputting it to related word group data storage means;
support word group extraction means for extracting, from the word group data, group data for each support word in an input support word list, and outputting it to support word group data storage means;
support co-occurrence word extraction means for obtaining, from the support word list and the support word group data stored in the support word group data storage means, co-occurrence frequencies between each support word and support co-occurrence words, which are co-occurrence words belonging to groups closely related to the support word, aggregating the co-occurrence frequencies over all support words, and outputting the aggregation result to support co-occurrence word data storage means; and
related word extraction means for obtaining, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support degree of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of the target word, aggregating the support degrees for each related word group of the target word, selecting co-occurrence words closely related to the support co-occurrence words as related words, and outputting them.

The related word extraction device according to any one of claims 1 to 3, wherein the support word list contains, for each support word, information that determines whether the support method is selection or exclusion, and wherein the support co-occurrence word extraction means, when aggregating the co-occurrence frequencies over all support words on the basis of that information, adds the frequencies for support words whose support method is selection and subtracts them for support words whose support method is exclusion.

The related word extraction device according to a preceding claim, wherein the target word is input in the form of a search expression; the related word group extraction means extracts group data for each word included in the search expression and outputs it to the related word group data storage means; and the related word extraction means, for each word included in the search expression, obtains the support degree of each support co-occurrence word that matches a co-occurrence word belonging to a related word group, aggregates the support degrees for each related word group of the target word, extracts the co-occurrence words closely related to the support co-occurrence words, and selects the co-occurrence words that satisfy the condition of the search expression as related words.

The related word extraction device according to claim 1, wherein the support word list is input in the form of a search expression; the support word group extraction means searches for group data for each word included in the search expression and outputs it to the support word group data storage means; and the support co-occurrence word extraction means selects the co-occurrence words that satisfy the condition of the search expression as support co-occurrence words.

The related word extraction device according to any one of claims 1 to 6, wherein the processing of the word group creation means is performed within the related word group extraction means and/or the support word group extraction means, so that the word group data of the target word and the support words is created on demand after the target word and the support word list are input.

A related word extraction method executed by a related word extraction device that extracts related words for a word, the method comprising:
a co-occurrence word data creation step of receiving a text set as input, obtaining co-occurrence frequencies between each word in the text and the words that co-occur with it, and outputting them to co-occurrence word data storage means;
a word group creation step of, using the co-occurrence word data stored in the co-occurrence word data storage means, obtaining and grouping, for each word of a predetermined word set, the words that co-occur with that word, and outputting the result to word group storage means;
a related word group extraction step of extracting group data for an input target word from the word group data stored in the word group storage means, and outputting it to related word group data storage means;
a support word group extraction step of extracting, from the word group data, group data for each support word in an input support word list, and outputting it to support word group data storage means;
a support co-occurrence word extraction step of obtaining, from the support word list and the support word group data stored in the support word group data storage means, co-occurrence frequencies between each support word and support co-occurrence words, which are co-occurrence words belonging to groups closely related to the support word, aggregating the co-occurrence frequencies over all support words, and outputting the aggregation result to support co-occurrence word data storage means; and
a related word extraction step of obtaining, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support degree of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of the target word, aggregating the support degrees for each related word group of the target word, selecting co-occurrence words closely related to the support co-occurrence words as related words, and outputting them.

A related word extraction method executed by a related word extraction device in a related word extraction system that has the related word extraction device, which extracts related words for a word, and an external device, wherein
the external device comprises
co-occurrence word data creation means for receiving a text set as input, obtaining co-occurrence frequencies between each word in the text and the words that co-occur with it, and outputting them as co-occurrence word data, and
the related word extraction method comprises:
a word group creation step of, using the co-occurrence word data created by the external device, obtaining and grouping, for each word of a predetermined word set, the words that co-occur with that word, and outputting the result to word group storage means;
a related word group extraction step of extracting group data for an input target word from the word group data stored in the word group storage means, and outputting it to related word group data storage means;
a support word group extraction step of extracting, from the word group data, group data for each support word in an input support word list, and outputting it to support word group data storage means;
a support co-occurrence word extraction step of obtaining, from the support word list and the support word group data stored in the support word group data storage means, co-occurrence frequencies between each support word and support co-occurrence words, which are co-occurrence words belonging to groups closely related to the support word, aggregating the co-occurrence frequencies over all support words, and outputting the aggregation result to support co-occurrence word data storage means; and
a related word extraction step of obtaining, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support degree of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of the target word, aggregating the support degrees for each related word group of the target word, selecting co-occurrence words closely related to the support co-occurrence words as related words, and outputting them.

A related word extraction method executed by a related word extraction device in a related word extraction system that has an external device and the related word extraction device, which extracts related words for a word, wherein
The external device comprises
A co-occurrence word data creation means for receiving a text set as input, obtaining, for each word in the text, the co-occurrence frequency of the words that co-occur with it, and outputting the result to the co-occurrence word data storage means; and
A word group creation means for using the co-occurrence word data stored in the co-occurrence word data storage means to obtain, for each word in a predetermined word set, the words that co-occur with that word, grouping them, and outputting the result as word group data, and
The related word extraction method comprises:
A related word group extraction step of extracting the group data for the input target word from the word group data created by the external device and outputting it to the related word group data storage means;
A support word group extraction step of extracting, from the word group data, the group data for each support word described in the input support word list and outputting it to the support word group data storage means;
A support co-occurrence word extraction step of obtaining, from the support word list and the support word group data stored in the support word group data storage means, the co-occurrence frequency between each support word and its support co-occurrence words, which are the co-occurrence words belonging to a group closely related to that support word, totaling the co-occurrence frequencies over all the support words, and outputting the totals to the support co-occurrence word data storage means; and
A related word extraction step of obtaining, from the related word group data stored in the related word group data storage means and the support co-occurrence word data stored in the support co-occurrence word data storage means, the support level of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of the target word, totaling the support levels for each related word group of the target word, and selecting and outputting, as related words, the co-occurrence words closely related to the support co-occurrence words. A related word extraction method characterized by the above.

Information specifying a support method, either selection or exclusion, is written for each support word in the support word list, and in the support co-occurrence word extraction step the related word extraction device, based on that information, adds the co-occurrence frequency for a support word to the total when the support method is selection and subtracts it when the support method is exclusion. The related word extraction method according to Item 1.
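The signed totaling described in this claim can be illustrated with a short Python sketch. The data layout (a mapping from each support word to its co-occurrence frequencies) and the labels "select" and "exclude" are assumptions made for the example, not part of the claim.

```python
from collections import Counter


def total_support(cooc, support_list):
    """Total co-occurrence frequencies over all support words.

    support_list holds (support_word, method) pairs, where method is
    "select" (frequency is added) or "exclude" (frequency is subtracted).
    """
    totals = Counter()
    for word, method in support_list:
        if method not in ("select", "exclude"):
            raise ValueError(f"unknown support method: {method!r}")
        sign = 1 if method == "select" else -1
        for co_word, freq in cooc.get(word, {}).items():
            totals[co_word] += sign * freq
    return totals


cooc = {
    "engine": {"car": 2, "speed": 1},
    "jungle": {"cat": 1, "speed": 1},
}
totals = total_support(cooc, [("engine", "select"), ("jungle", "exclude")])
print(dict(totals))
```

A word that co-occurs only with excluded support words ends up with a negative total, which lowers the support level of any related word group that contains it.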

The target word is input in the form of a search expression; in the related word group extraction step, the related word extraction device extracts the group data for each word included in the search expression and outputs it to the related word group data storage means; and in the related word extraction step, the related word extraction device obtains, for each word included in the search expression, the support level of each support co-occurrence word that matches a co-occurrence word belonging to a related word group of that word, totals the support levels for each related word group, extracts the co-occurrence words closely related to the support co-occurrence words, and selects, as related words, the co-occurrence words that satisfy the conditions of the search expression. The related word extraction method according to any one of claims 8 to 11.
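One way to picture the final selection in this claim is the sketch below. It assumes, purely for illustration, that the search expression combines words with AND and OR, that it is represented as nested tuples, and that a candidate set of co-occurrence words has already been extracted for each word in the expression. The claim itself does not specify the operators or the representation.

```python
def evaluate(expr, candidates):
    """Return the co-occurrence words that satisfy a search expression.

    expr is either a word (str) or a tuple ("AND" | "OR", operand, ...).
    candidates maps each word to the co-occurrence words extracted for it.
    """
    if isinstance(expr, str):
        return set(candidates.get(expr, ()))
    op, *operands = expr
    if not operands:
        raise ValueError("operator needs at least one operand")
    sets = [evaluate(e, candidates) for e in operands]
    if op == "AND":
        return set.intersection(*sets)
    if op == "OR":
        return set.union(*sets)
    raise ValueError(f"unsupported operator: {op!r}")


candidates = {
    "jaguar": {"car", "speed", "cat"},
    "fast": {"car", "speed"},
    "wild": {"cat"},
}
print(sorted(evaluate(("AND", "jaguar", "fast"), candidates)))
```

Under these assumptions, AND keeps only the co-occurrence words extracted for every operand, and OR keeps those extracted for at least one operand.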

The support word list is input in the form of a search expression; in the support word group extraction step, the related word extraction device extracts the group data for each word included in the search expression and outputs it to the support word group data storage means; and in the support co-occurrence word extraction step, the related word extraction device selects, as support co-occurrence words, the co-occurrence words that satisfy the conditions of the search expression. The related word extraction method of any one of these.

By performing the word group creation processing within the related word group extraction step and/or the support word group extraction step, the word group data of the target word or the support word is created sequentially after the target word and the support word list have been input. The related word extraction method according to any one of claims 8 to 13.
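The on-demand creation of word group data described here could look roughly like the following Python sketch. The class name, the `make_groups` callback, and the use of a cache are assumptions for illustration; the claim only states that the groups are created after the target word and the support word list are input.

```python
class LazyWordGroups:
    """Create word group data only for words that are actually requested."""

    def __init__(self, cooc, make_groups):
        self._cooc = cooc                # co-occurrence word data
        self._make_groups = make_groups  # hypothetical grouping function
        self._cache = {}                 # assumption: reuse groups once built
        self.builds = 0                  # number of times groups were built

    def get(self, word):
        if word not in self._cache:
            self._cache[word] = self._make_groups(self._cooc, word)
            self.builds += 1
        return self._cache[word]


# Trivial stand-in grouping: all co-occurring words form a single group.
def single_group(cooc, word):
    return [set(cooc.get(word, {}))]


cooc = {"jaguar": {"car": 2, "cat": 1}, "engine": {"car": 2}}
groups = LazyWordGroups(cooc, single_group)
groups.get("jaguar")   # built on first request (target word)
groups.get("engine")   # built on first request (support word)
groups.get("jaguar")   # served from the cache
print(groups.builds)
```

Compared with grouping every word of a predetermined word set in advance, this arrangement trades a slower first query for not having to store group data for words that are never used.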

A related word extraction program for causing a computer to function as each means of the related word extraction device according to any one of claims 1 to 7.