JPH0728823A - Co-occurrence word extracting method and device therefor

Info

Publication number: JPH0728823A
Application number: JP5170122A
Authority: JP
Inventors: Yoshimoto Yamamoto; 喜大山本; Junichi Kubota; 淳市久保田
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-07-09
Filing date: 1993-07-09
Publication date: 1995-01-31

Abstract

PURPOSE: To automatically extract highly reliable co-occurrence relations that are effective for homophone selection. CONSTITUTION: A homophone supply unit 1 sequentially supplies each word of a set of homophones. A text extraction unit 2 extracts a plurality of texts that include the word supplied by the homophone supply unit 1. A word extraction unit 3 extracts specific words from the extracted texts. A text number counting unit 4 determines, for each word in the texts, the number of texts in which that word appears. A word number counting unit 5 determines the number of words for each number of texts. A co-occurrence word selection unit 6 extracts a co-occurrence word group based on the numbers of texts counted by the text number counting unit 4 and the numbers of words counted by the word number counting unit 5. A co-occurrence word deletion unit 7 removes, from each of the co-occurrence word groups corresponding to the homophones, any word that is included in at least two of those groups.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a co-occurrence word extraction method, and a device therefor, for extracting co-occurrence relations between words.

[0002]

[Prior Art] In recent years, attempts have been made to use co-occurrence relations between words in natural language analysis. For example, in kana-kanji conversion, techniques have been developed in which co-occurrence word pairs, each consisting of two co-occurring words, are prepared in advance, and if a co-occurrence word pair is found among the conversion candidates, that candidate is given priority when selecting among homophones. Because such co-occurrence words were initially compiled by hand, the work involved was enormous. To improve on this, methods have been developed that extract co-occurrence relations based on the number of times two words appear together in a text.

[0003] FIG. 5 is a block diagram of a conventional co-occurrence word extraction device. In the figure, 504 is a text database that holds a plurality of texts. 501 is a word extraction unit that extracts words from the texts in the text database 504. 502 is an appearance frequency counting unit that forms two-word pairs from the words extracted by the word extraction unit 501 and counts their appearance frequency. 503 is a co-occurrence word selection unit that selects co-occurrence words based on the appearance frequency.

[0004] The operation of the conventional co-occurrence word extraction device configured as described above is explained below.
(1) The word extraction unit 501 extracts independent words from each text in the text database 504.

[0005] (2) The appearance frequency counting unit 502 forms two-word combinations from the words in each text and obtains their appearance frequency. It then sums, for each combination, the appearance frequencies obtained from all the texts.

[0006] (3) The co-occurrence word selection unit 503 selects two-word combinations based on the appearance frequency (end).

[0007] A concrete example of the operation of the conventional co-occurrence word extraction device described above follows.

[0008] The text database 504 stores a plurality of texts delimited in fixed units. Here, the text database 504 is assumed to hold a plurality of texts delimited sentence by sentence.

[0009] (1) The word extraction unit 501 performs morphological analysis on each text in the text database 504 and extracts the independent words in each text. For example, from the text "Stress caused by information equipment is called techno-stress, and in recent years techno-stress has been increasing." (「情報機器に対するストレスをテクノ・ストレスと呼び、近年、テクノ・ストレスが増加している。」), it extracts the independent words "information" (情報), "equipment" (機器), "stress" (ストレス), "techno" (テクノ), "call" (呼ぶ), "recent years" (近年), and "increase" (増加).

[0010] (2) The appearance frequency counting unit 502 forms the two-word combinations in each text and obtains their appearance frequency. For the text above ("Stress caused by information equipment..."), for example, the result shown in FIG. 6 is obtained.

[0011] (3) The appearance frequency counting unit 502 then sums, for each combination, the appearance frequencies obtained from all the texts.

[0012] (4) The co-occurrence word selection unit 503 selects two-word combinations based on frequency. For example, if combinations with a frequency of 3 or more are selected, the combinations "information-stress" and "equipment-stress" end up being extracted from the text above.

[0013]

[Problems to Be Solved by the Invention] In the concrete example above, the combinations "information-stress" and "equipment-stress" are extracted even if the text database contains only a single sentence that includes both "information" and "stress", or both "equipment" and "stress". Thus, in the conventional co-occurrence word extraction method and device, when a word appears with high frequency within a single text, two-word combinations containing that word tend to be extracted as co-occurrence words. This means that the co-occurrence words extracted by the conventional method and device are highly sensitive to the content of the texts supplied. As a result, word combinations that are unsuitable as co-occurrence words for homophone selection are easily extracted.

[0014] In view of the above problems of the conventional co-occurrence word extraction method and device, an object of the present invention is to provide a co-occurrence word extraction method and device that can automatically extract highly reliable co-occurrence relations and that are effective for homophone selection.

[0015]

[Means for Solving the Problems]

(1) To achieve the above object, the present invention provides a co-occurrence word extraction device comprising: a homophone supply unit that sequentially supplies each word of a set of homophones; a text extraction unit that extracts a plurality of texts including the word supplied by the homophone supply unit; a word extraction unit that extracts specific words from the extracted texts; a text number counting unit that obtains, for each word in the texts, the number of texts in which that word appears; and a co-occurrence word selection unit that extracts a co-occurrence word group based on the numbers of texts counted by the text number counting unit.

[0016] (2) The co-occurrence word extraction device further includes a co-occurrence word deletion unit that removes, from each of the co-occurrence word groups corresponding to the homophones, any word included in at least two of those groups.

[0017] (3) The co-occurrence word extraction device further includes a word number counting unit that obtains, for each number of texts counted by the text number counting unit, the number of words having that number of texts, and the co-occurrence word selection unit extracts the co-occurrence word group based on those numbers of words.

[0018] (4) In the co-occurrence word extraction device, the word extraction unit extracts the independent words in the texts.
(5) The co-occurrence word extraction device further includes a co-occurrence word candidate storage unit that stores co-occurrence word candidates in advance, and the word extraction unit extracts, from the words in the texts, those held in the co-occurrence word candidate storage unit.

[0019]

[Operation] With the configuration described above, the co-occurrence word extraction method and device of the present invention operate as follows. When the homophone supply unit supplies each homophone to the text extraction unit, the text extraction unit extracts a plurality of texts including the supplied word. The word extraction unit extracts specific words from the extracted texts. The text number counting unit counts, for each word in the texts, the number of texts in which that word appears. The word number counting unit counts the number of words for each number of texts counted by the text number counting unit. The co-occurrence word selection unit extracts a co-occurrence word group based on the numbers of texts counted by the text number counting unit and the numbers of words counted by the word number counting unit. Finally, the co-occurrence word deletion unit removes, from each of the co-occurrence word groups corresponding to the homophones, any word included in at least two of those groups.

[0020]

[Embodiment] An embodiment of the co-occurrence word extraction device of the present invention is described below with reference to the drawings.

[0021] FIG. 1 is a block diagram of a co-occurrence word extraction device according to an embodiment of the present invention. In the figure, 8 is a text database that holds a plurality of texts. 9 is a co-occurrence word candidate storage unit that stores co-occurrence word candidates. 1 is a homophone supply unit that sequentially supplies each word of a set of homophones. 2 is a text extraction unit that extracts, from the text database 8, texts including the word supplied by the homophone supply unit 1. 3 is a word extraction unit that extracts specific words contained in the texts extracted by the text extraction unit 2. 4 is a text number counting unit that obtains, for each word extracted by the word extraction unit 3, the number of texts in which that word was contained. 5 is a word number counting unit that obtains, for each number of texts counted by the text number counting unit 4, the number of words having that number of texts. 6 is a co-occurrence word selection unit that selects valid co-occurrence words. 7 is a co-occurrence word deletion unit that compares the co-occurrence word groups corresponding to the homophones and excludes any word included in at least two of those groups.

[0022] The operation of the co-occurrence word extraction device of this embodiment, configured as described above, is explained below with reference to the flowchart of FIG. 2.

[0023] (201) The homophone supply unit 1 gives one of the homophones to the text extraction unit 2.
(202) The text extraction unit 2 extracts, from the text database 8, the texts that include the word given by the homophone supply unit 1.

[0024] (203) The word extraction unit 3 extracts, from the words in each text extracted by the text extraction unit 2, those stored in the co-occurrence word candidate storage unit 9.

[0025] (204) For each word extracted by the word extraction unit 3, the text number counting unit 4 obtains the number of texts in which that word appears.

[0026] (205) From the pairs of word and number of texts obtained in (204), the words can be grouped into sets that share the same number of texts. The word number counting unit 5 obtains, for each number of texts, the number of words having that number of texts.

[0027] (206) The co-occurrence word selection unit 6 derives, from the numbers of texts and the numbers of words obtained by the word number counting unit 5, a reference value for selecting co-occurrence words, and extracts co-occurrence words based on that reference value. In general, the number of words increases as the number of texts decreases. The number of texts immediately before the point at which this increase becomes extremely large is therefore taken as the reference value, and the words whose number of texts is equal to or greater than the reference value are extracted.

[0028] (207) If all the homophones have been processed, processing proceeds to step (208). If any homophone remains unprocessed, processing returns to (201).

[0029] (208) The co-occurrence word deletion unit 7 compares the co-occurrence word groups extracted for the individual homophones and removes from each group any word included in at least two of the groups (end).

[0030] A concrete example of the operation of the co-occurrence word extraction device of this embodiment follows.

[0031] The homophone supply unit 1 is assumed to hold, for example, the homophones 「機器」 ("equipment") and 「危機」 ("crisis"), both of which have the reading "kiki" (きき). The text database 8 stores a plurality of texts delimited in fixed units. In this embodiment, the text database 8 is assumed to hold a plurality of texts delimited sentence by sentence.

[0032] The co-occurrence word candidate storage unit 9 stores co-occurrence word candidates. The words stored there may be chosen, for example, on the basis of grammatical information such as part of speech, on the basis of semantic codes, or by selecting specific words individually. In this embodiment, the co-occurrence word candidate storage unit 9 is assumed to store a subset of the independent words that excludes general-purpose words such as "recent years" (近年).

[0033] (1) The homophone supply unit 1 takes out one of the homophones, "equipment" (機器), and gives it to the text extraction unit 2.

[0034] (2) The text extraction unit 2 extracts from the text database 8 a plurality of texts that include the word "equipment" given by the homophone supply unit 1. It obtains a plurality of texts such as "Stress caused by information equipment is called techno-stress, and in recent years techno-stress has been increasing."

[0035] (3) The word extraction unit 3 performs morphological analysis on each text extracted by the text extraction unit 2 and extracts only the words stored in the co-occurrence word candidate storage unit 9. In the text above ("Stress caused by information equipment..."), for example, the independent words are "information", "equipment", "stress", "techno", "call", "recent years", and "increase", but "recent years" is removed because it is not stored in the co-occurrence word candidate storage unit 9. As a result, the word extraction unit 3 extracts the independent words "information", "equipment", "stress", "techno", "call", and "increase".

[0036] (4) For each word extracted by the word extraction unit 3, the text number counting unit 4 counts the number of texts in which that word appeared, obtaining a result such as that shown in FIG. 3(a).

[0037] (5) The word number counting unit 5 obtains, for each number of texts, the number of words having that number of texts, obtaining a result such as that shown in FIG. 4.

[0038] (6) In FIG. 4, the number of words grows far more steeply where the number of texts is below 3 than where it is 3 or more. The co-occurrence word selection unit 6 therefore extracts, from the word group corresponding to "equipment", only the words whose number of texts is 3 or more.

[0039] Here, the word "stress" appears three times in the text above, but if it does not appear in any other text that includes "equipment", the number of texts for "stress" is 1 and it does not remain as a valid co-occurrence word.

[0040] (7) The homophone supply unit 1 takes out the next homophone, "crisis" (危機), and the corresponding word group is extracted by the same procedure as for "equipment", giving a result such as that shown in FIG. 3(b).

[0041] (8) Since all the homophones have been processed, processing moves to the next step.
(9) The co-occurrence word deletion unit 7 compares the co-occurrence word groups corresponding to "equipment" and "crisis" and removes from each group any word included in at least two groups. In the example of FIGS. 3(a) and 3(b), the word "structure" (構造) is removed from both.

[0042] As a result, in the example of FIG. 3, "apparatus" (装置), "electronic" (電子), "sales" (販売), and so on are automatically extracted as co-occurrence words for "equipment", and "military" (軍事), "era" (時代), "theory" (説), and so on as co-occurrence words for "crisis".

[0043] In this embodiment, the word extraction unit 3 extracts the independent words after the text extraction unit 2 has extracted the necessary texts from the text database 8, which stores a plurality of texts. Alternatively, independent words may be extracted in advance for each text in the text database 8 and stored in an independent word database, from which the necessary groups of independent words are then retrieved. Also, although the word extraction unit 3 performs morphological analysis when extracting words from a text, a simplified method may be used instead, such as matching the surface form against a dictionary prepared in advance. Also, instead of using as the reference value the number of texts at which the growth in the number of words becomes large, the co-occurrence word selection unit 6 may use a fixed number of texts determined in advance, for example selecting the words that appear in at least two texts. The processing of the co-occurrence word deletion unit 7 may also be executed before that of the co-occurrence word selection unit 6.

[0044]

[Effects of the Invention] In the co-occurrence word extraction method and device of the present invention, co-occurrence words are selected based on the number of texts in which a pair of words appears together, so the extracted co-occurrence words are less affected by the content of the texts prepared in advance.

[0045] In addition, words included in at least two of the co-occurrence word groups extracted for the individual homophones are removed from each group. During homophone selection in kana-kanji conversion, this prevents the same word among the conversion candidates from having a co-occurrence relation with more than one of the homophones.

[0046] In addition, although the number of words increases as the number of texts decreases, co-occurrence words are selected using as the reference the number of texts at which that increase becomes markedly large. Co-occurrence words offering good cost performance can therefore be extracted by weighing the storage capacity they require against their effectiveness.

[0047] As described above, the co-occurrence word extraction method and device of the present invention can automatically extract highly reliable co-occurrence words that are effective for homophone selection, and are therefore of great practical value.

[Brief description of drawings]

FIG. 1 is a block diagram of a co-occurrence word extraction device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of an embodiment of the present invention.

FIG. 3(a) is a schematic diagram showing an example of the words corresponding to "equipment" (機器) and their numbers of texts in an embodiment of the present invention. FIG. 3(b) is a schematic diagram showing an example of the words corresponding to "crisis" (危機) and their numbers of texts in an embodiment of the present invention.

FIG. 4 is a schematic diagram showing an example of the numbers of texts and the corresponding numbers of words in an embodiment of the present invention.

FIG. 5 is a block diagram of a conventional co-occurrence word extraction device.

FIG. 6 is a schematic diagram showing an example of two-word combinations and their appearance frequencies in a conventional co-occurrence word extraction device.

[Explanation of symbols]

1 homophone supply unit
2 text extraction unit
3 word extraction unit
4 text number counting unit
5 word number counting unit
6 co-occurrence word selection unit
7 co-occurrence word deletion unit
9 co-occurrence word candidate storage unit

Claims

[Claims]

1. A co-occurrence word extraction method comprising: a homophone supply step of sequentially supplying each word of a set of homophones; a text extraction step of extracting a plurality of texts including the word supplied in the homophone supply step; a word extraction step of extracting specific words from the extracted texts; a text number counting step of obtaining, for each word in the texts, the number of texts in which that word appears; and a co-occurrence word selection step of extracting a co-occurrence word group based on the numbers of texts counted in the text number counting step.

2. A co-occurrence word extraction device comprising: a homophone supply unit that sequentially supplies each word of a set of homophones; a text extraction unit that extracts a plurality of texts including the word supplied by the homophone supply unit; a word extraction unit that extracts specific words from the extracted texts; a text number counting unit that obtains, for each word in the texts, the number of texts in which that word appears; and a co-occurrence word selection unit that extracts a co-occurrence word group based on the numbers of texts counted by the text number counting unit.

3. The co-occurrence word extraction method according to claim 1, further comprising a co-occurrence word deletion step of removing, from each of the co-occurrence word groups corresponding to the homophones, any word included in at least two of those groups.

4. The co-occurrence word extraction device according to claim 2, further comprising a co-occurrence word deletion unit that removes, from each of the co-occurrence word groups corresponding to the homophones, any word included in at least two of those groups.

5. The co-occurrence word extraction method according to claim 1, further comprising a word number counting step of obtaining the number of words for each number of texts counted in the text number counting step, wherein the co-occurrence word selection step extracts the co-occurrence word group based on the number of words.

6. The co-occurrence word extraction device, further comprising a word number counting unit that obtains the number of words for each number of texts counted by the text number counting unit, wherein the co-occurrence word selection unit extracts the co-occurrence word group based on the number of words.

7. The co-occurrence word extracting method according to claim 1, wherein the word extracting step extracts independent words in the text.

8. The co-occurrence word extraction device according to claim 1, wherein the word extraction unit extracts an independent word in the text.

9. The co-occurrence word extraction method according to claim 1, wherein the word extraction step extracts, from the words in the text, those that are among co-occurrence word candidates prepared in advance.

10. The co-occurrence word extraction device, further comprising a co-occurrence word candidate storage unit that stores co-occurrence word candidates in advance, wherein the word extraction unit extracts, from the words in the text, those held in the co-occurrence word candidate storage unit.