JPH0728823A - Coocurrence word extracting method and device therefor - Google Patents

Coocurrence word extracting method and device therefor

Info

Publication number
JPH0728823A
Authority
JP
Japan
Prior art keywords
word
text
occurrence
words
occurrence word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP5170122A
Other languages
Japanese (ja)
Inventor
Yoshimoto Yamamoto
喜大 山本
Junichi Kubota
淳市 久保田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd
Priority to JP5170122A
Publication of JPH0728823A

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

PURPOSE: To automatically extract highly reliable co-occurrence relations that are effective for selecting among homophones. CONSTITUTION: A homophone supply part 1 supplies each word of a set of homophones in turn. A text extraction part 2 extracts plural texts that include the word supplied from the homophone supply part 1. A word extraction part 3 extracts specified words from the extracted texts. A text number counting part 4 determines, for each word in the texts, the number of texts in which that word appears. A word number counting part 5 determines, for each text count, the number of words having that count. A co-occurrence word selection part 6 extracts a co-occurrence word group based on the text counts from the text number counting part 4 and the word counts from the word number counting part 5. A co-occurrence word deletion part 7 removes, from each of the co-occurrence word groups corresponding to the homophones, any word included in at least two of those groups.

Description

【発明の詳細な説明】Detailed Description of the Invention

【0001】[0001]

【産業上の利用分野】本発明は、語と語の共起関係を抽
出する共起語抽出方法およびその装置に関する。
[Industrial Field of Application] The present invention relates to a co-occurrence word extraction method, and an apparatus therefor, for extracting co-occurrence relations between words.

【0002】[0002]

【従来の技術】近年、語と語の共起関係を用いて自然言
語の解析処理を行うことが試みられている。例えば、か
な漢字変換において、共起する2語で構成される共起語
対をあらかじめ作成しておき、変換候補中に共起語対が
あれば、その候補を優先して同音語を選択する等の手法
が開発されている。このような共起語は当初人手で作成
されていたため作業が膨大であった。これを改善するた
めに、テキスト中に同時に出現する2語の出現回数に基
づいて共起関係を抽出する方法が開発されている。
[Prior Art] In recent years, attempts have been made to analyze natural language using co-occurrence relations between words. For example, techniques have been developed for kana-kanji conversion in which co-occurrence word pairs, each composed of two co-occurring words, are created in advance, and if a co-occurrence word pair is found among the conversion candidates, that candidate is given priority when selecting among homophones. Such co-occurrence words were initially created by hand, which required an enormous amount of work. To improve on this, methods have been developed that extract co-occurrence relations based on the number of times two words appear together in a text.

【0003】図5は従来の共起語抽出装置の構成図であ
る。図中、504はテキストデータベースで、複数のテ
キストを保持するものである。501は語抽出部で、テ
キストデータベース504中のテキストから語を抽出す
るものである。502は出現頻度計数部で、語抽出部5
01が抽出した語から2語の組を作りその出現頻度を計
数するものである。503は共起語選択部で、前記出現
頻度に基づいて共起語を選択する。
FIG. 5 is a block diagram of a conventional co-occurrence word extraction device. In the figure, 504 is a text database, which holds a plurality of texts. 501 is a word extraction unit, which extracts words from the texts in the text database 504. 502 is an appearance frequency counting unit, which forms two-word pairs from the words extracted by the word extraction unit 501 and counts their appearance frequency. 503 is a co-occurrence word selection unit, which selects co-occurrence words based on that appearance frequency.

【0004】前述のように構成された従来の共起語抽出
装置の動作を以下に説明する。 (1)語抽出部501は、テキストデータベース504
の各テキストから自立語を抽出する。
The operation of the conventional co-occurrence word extraction device configured as described above will be described below. (1) The word extraction unit 501 extracts independent words from each text in the text database 504.

【0005】(2)出現頻度計数部502は、各テキス
ト中の語から2語の組合せを作り、その出現頻度を求め
る。さらに、各組合せ毎に全てのテキストから得た出現
頻度を加算する。
(2) The appearance frequency counting unit 502 creates a combination of two words from the words in each text and obtains the appearance frequency thereof. Furthermore, the appearance frequencies obtained from all the texts are added for each combination.

【0006】(3)共起語選択部503は、出現頻度に
基づいて2語の組合せを選択する(おわり)。
(3) The co-occurrence word selection unit 503 selects a combination of two words based on the appearance frequency (end).

【0007】以上のように動作する従来の共起語抽出装
置における具体的な動作を次に説明する。
The specific operation of the conventional co-occurrence word extraction device that operates as described above will be described below.

【0008】テキストデータベース504は、一定の単
位に区切られた複数のテキストを格納するものである。
ここではテキストデータベース504は文単位に区切ら
れた複数のテキストを持つものとする。
The text database 504 stores a plurality of texts delimited by a certain unit.
Here, it is assumed that the text database 504 has a plurality of texts divided for each sentence.

【0009】(1)語抽出部501は、テキストデータ
ベース504中の各テキストを形態素解析し、各テキス
ト中の自立語を抽出する。例えばテキスト「情報機器に
対するストレスをテクノ・ストレスと呼び、近年、テク
ノ・ストレスが増加している。」から自立語「情報」
「機器」「ストレス」「テクノ」「呼ぶ」「近年」「増
加」を抽出する。
(1) The word extraction unit 501 morphologically analyzes each text in the text database 504 and extracts the independent words in each text. For example, from the text "Stress on information devices is called techno-stress, and in recent years techno-stress has been increasing," it extracts the independent words "information", "device", "stress", "techno", "call", "recent years", and "increase".

【0010】(2)出現頻度計数部502は、各テキス
ト中の2語の組合せを作りその出現頻度を求める。例え
ば前述のテキスト「情報機器に対する…」の場合には、
図6の結果を得る。
(2) The appearance frequency counting unit 502 forms two-word combinations from the words in each text and obtains their appearance frequency. For example, for the aforementioned text "Stress on information devices ...", the result shown in FIG. 6 is obtained.

【0011】(3)出現頻度計数部502は、さらに各
組合せ毎に、全てのテキストから得た出現頻度を加算す
る。
(3) The appearance frequency counting unit 502 further adds the appearance frequencies obtained from all the texts for each combination.

【0012】(4)共起語選択部503は、頻度に基づ
いて2語の組合せを選択する。例えば、頻度3以上の組
合せを選択した場合、前述のテキスト「情報機器に対す
る…」からは「情報-ストレス」「機器-ストレス」の組
合せが抽出されてしまう。
(4) The co-occurrence word selection unit 503 selects two-word combinations based on the frequency. For example, if combinations with a frequency of 3 or more are selected, the combinations "information-stress" and "device-stress" end up being extracted from the aforementioned text "Stress on information devices ...".
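
The weakness of the conventional steps (1) to (4) can be illustrated with a minimal Python sketch. FIG. 6 is not reproduced in this text, so the exact way a pair's frequency is computed is an assumption here: the sketch takes it to be the product of the two words' occurrence counts within the text. The function name `pair_frequencies` and the English glosses are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_frequencies(tokens):
    """Count two-word combinations within a single text.

    Assumption (not confirmed by the source): the frequency of a pair
    is the product of the occurrence counts of its two words.
    """
    counts = Counter(tokens)
    return {
        tuple(sorted((a, b))): counts[a] * counts[b]
        for a, b in combinations(counts, 2)
    }

# Independent words of the example sentence (English glosses), with repeats:
# "stress" occurs three times and "techno" twice.
sentence = ["information", "device", "stress", "techno", "stress",
            "call", "recent years", "techno", "stress", "increase"]

freq = pair_frequencies(sentence)

# Conventional selection: keep every pair whose frequency is 3 or more.
selected = {pair for pair, n in freq.items() if n >= 3}
```

Under this assumption a single sentence is enough to push every pair containing "stress" over the threshold, which is the sensitivity to individual texts that the invention sets out to remove.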

【0013】[0013]

【発明が解決しようとする課題】従来の共起語抽出方法
およびその装置の前述の具体例では、「情報」と「スト
レス」や、「機器」と「ストレス」を同時に含む文がテ
キストデータベース中にただ一つしかなくても、「情報
-ストレス」と「機器-ストレス」の組合せが抽出されて
しまう。このように、従来の共起語抽出方法およびその
装置では、ある一つのテキスト中で出現頻度の高い語が
ある場合に、その語を一方に持つ2語の組合せが共起語
として抽出されやすくなるという性質があった。この性
質は、従来の共起語抽出方法およびその装置が抽出した
共起語が、用意するテキストの内容から極めて敏感に影
響を受けやすいことを示す。したがって、同音語選択用
の共起語として、ふさわしくない語の組合せが抽出され
やすいという問題点を有していた。
[Problems to be Solved by the Invention] In the specific example above of the conventional co-occurrence word extraction method and apparatus, the combinations "information-stress" and "device-stress" are extracted even if the text database contains only a single sentence that includes both "information" and "stress", or both "device" and "stress". Thus, with the conventional method and apparatus, when a word appears with high frequency within one particular text, two-word combinations that include that word tend to be extracted as co-occurrence words. This property means that the co-occurrence words extracted by the conventional method and apparatus are extremely sensitive to the content of the texts prepared. Consequently, word combinations that are unsuitable as co-occurrence words for homophone selection are easily extracted.

【0014】本発明は、従来の共起語抽出方法およびそ
の装置が有していた前記の問題点に鑑み、信頼性の高い
共起関係を自動的に抽出することが可能であり、また、
同音語の選択に有効である共起語抽出方法およびその装
置を提供することを目的とする。
In view of the above problems of the conventional co-occurrence word extraction method and apparatus, an object of the present invention is to provide a co-occurrence word extraction method and apparatus that can automatically extract highly reliable co-occurrence relations and that are effective for homophone selection.

【0015】[0015]

【課題を解決するための手段】[Means for Solving the Problems]

(1)本発明は、前記の目的を達成するために、同音語
の各語を順次供給する同音語供給部と、同音語供給部が
供給した語を含む複数のテキストを抽出するテキスト抽
出部と、抽出されたテキスト中の特定の語を抽出する語
抽出部と、テキスト中の各語について各語が出現するテ
キストの数を求めるテキスト数計数部と、テキスト数計
数部で計数されたテキストの数に基づいて共起語群を抽
出する共起語選択部を備えたことを特徴とする共起語抽
出装置を構成する。
(1) To achieve the above object, the present invention constitutes a co-occurrence word extraction device comprising: a homophone supply unit that supplies each word of a set of homophones in turn; a text extraction unit that extracts a plurality of texts containing the word supplied by the homophone supply unit; a word extraction unit that extracts specific words from the extracted texts; a text number counting unit that obtains, for each word in the texts, the number of texts in which that word appears; and a co-occurrence word selection unit that extracts a co-occurrence word group based on the numbers of texts counted by the text number counting unit.

【0016】(2)同音語の各語に対応する共起語群の
うち、少なくとも2つの共起語群に含まれる語を各共起
語群より取り除く共起語削除部を追加した共起語抽出装
置を構成する。
(2) A co-occurrence word extraction device is constituted by adding a co-occurrence word deletion unit that removes, from each co-occurrence word group, any word included in at least two of the co-occurrence word groups corresponding to the individual homophones.

【0017】(3)テキスト数計数部で計数されたテキ
スト数毎にその語数を求める語数計数部を追加し、共起
語選択部が語数に基づいて共起語群を抽出する共起語抽
出装置を構成する。
(3) A co-occurrence word extraction device is constituted by adding a word number counting unit that obtains, for each text count produced by the text number counting unit, the number of words having that count, with the co-occurrence word selection unit extracting the co-occurrence word group based on those word counts.

【0018】(4)語抽出部がテキスト中の自立語を抽
出する共起語抽出装置を構成する。 (5)あらかじめ共起語の候補を格納する共起語候補格
納部を追加し、語抽出部がテキスト中の語のうち共起語
候補格納部にある語を抽出する共起語抽出装置を構成す
る。
(4) A co-occurrence word extraction device is constituted in which the word extraction unit extracts the independent words in the texts. (5) A co-occurrence word extraction device is constituted by adding a co-occurrence word candidate storage unit that stores co-occurrence word candidates in advance, with the word extraction unit extracting, from among the words in the texts, those words that are in the co-occurrence word candidate storage unit.

【0019】[0019]

【作用】本発明の共起語抽出方法およびその装置は前記
した構成により、同音語供給部が同音語の各語をテキス
ト抽出部に供給すると、テキスト抽出部は同音語供給部
で供給された語を含む複数のテキストを抽出し、語抽出
部が抽出されたテキスト中の特定の語を抽出し、テキス
ト数計数部がテキスト中の各語について各語が出現する
テキストの数を計数し、語数計数部がテキスト数計数部
で計数されたテキスト数毎にその語数を計数し、共起語
選択部がテキスト数計数部で計数されたテキスト数と語
数計数部で計数された語数に基づいて共起語群を抽出
し、共起語削除部が同音語の各語に対応する共起語群の
うち少なくとも2つの共起語群に含まれる語を各共起語
群より取り除く。
[Operation] With the configuration described above, the co-occurrence word extraction method and apparatus of the present invention operate as follows. When the homophone supply unit supplies each word of a set of homophones to the text extraction unit, the text extraction unit extracts a plurality of texts containing the supplied word. The word extraction unit extracts specific words from the extracted texts. The text number counting unit counts, for each word in the texts, the number of texts in which that word appears. The word number counting unit counts, for each text count produced by the text number counting unit, the number of words having that count. The co-occurrence word selection unit extracts a co-occurrence word group based on the text counts and the word counts. Finally, the co-occurrence word deletion unit removes, from each of the co-occurrence word groups corresponding to the individual homophones, any word included in at least two of those groups.
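
The two counting stages in this paragraph can be sketched in Python as follows. This is only an illustration under assumptions: the function names `text_counts` and `words_per_text_count` and the sample texts are hypothetical, and each text is taken to be an already extracted list of words.

```python
from collections import Counter

def text_counts(texts):
    """Text number counting: for each word, the number of texts containing it."""
    df = Counter()
    for tokens in texts:
        df.update(set(tokens))  # a word counts once per text, however often it repeats
    return df

def words_per_text_count(df):
    """Word number counting: for each text count, how many words have that count."""
    return Counter(df.values())

# Hypothetical texts that all contain the supplied homophone "device".
texts = [
    ["device", "information", "stress", "stress", "stress"],
    ["device", "electronic", "apparatus"],
    ["device", "electronic", "sales"],
]

df = text_counts(texts)
hist = words_per_text_count(df)
```

Because each text contributes at most 1 to a word's count, the three repetitions of "stress" in the first text leave its text count at 1, unlike the conventional pair frequency.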

【0020】[0020]

【実施例】以下、本発明の共起語抽出装置の一実施例を
図面を用いて説明する。
DESCRIPTION OF THE PREFERRED EMBODIMENTS An embodiment of the co-occurrence word extraction device of the present invention will be described below with reference to the drawings.

【0021】図1は本発明の一実施例における共起語抽
出装置の構成図である。図中、8はテキストデータベー
スで、複数のテキストを保持する。9は共起語候補格納
部で、共起語の候補を格納する。1は同音語供給部で、
同音語の各語を順次供給するものである。2はテキスト
抽出部で、同音語供給部1から供給された語を含むテキ
ストをテキストデータベース8から抽出するものであ
る。3は語抽出部で、テキスト抽出部2が抽出したテキ
スト中に含まれる特定の語を抽出するものである。4は
テキスト数計数部で、語抽出部3が抽出した各語につい
て、その語が含まれていたテキストの数を求めるもので
ある。5は語数計数部で、テキスト数計数部4で計数し
たテキスト数毎にその語数を求めるものである。6は共
起語選択部で、有効な共起語を選択するものである。7
は共起語削除部で、同音語の各語に対応する共起語群を
比べて少なくとも2つの共起語群に含まれる語を除外す
るものである。
FIG. 1 is a block diagram of a co-occurrence word extraction device according to an embodiment of the present invention. In the figure, 8 is a text database, which holds a plurality of texts. 9 is a co-occurrence word candidate storage unit, which stores co-occurrence word candidates. 1 is a homophone supply unit, which supplies each word of a set of homophones in turn. 2 is a text extraction unit, which extracts from the text database 8 the texts containing the word supplied by the homophone supply unit 1. 3 is a word extraction unit, which extracts specific words contained in the texts extracted by the text extraction unit 2. 4 is a text number counting unit, which obtains, for each word extracted by the word extraction unit 3, the number of texts that contained that word. 5 is a word number counting unit, which obtains, for each text count produced by the text number counting unit 4, the number of words having that count. 6 is a co-occurrence word selection unit, which selects valid co-occurrence words. 7 is a co-occurrence word deletion unit, which compares the co-occurrence word groups corresponding to the individual homophones and excludes words included in at least two of the groups.

【0022】前述のように構成された本実施例の共起語
抽装置の動作を、図2のフローチャートを参照しながら
説明する。
The operation of the co-occurrence word extracting device of the present embodiment configured as described above will be described with reference to the flowchart of FIG.

【0023】(201)同音語供給部1は、同音語の一
つをテキスト抽出部2に与える。 (202)テキスト抽出部2は、同音語供給部1から与
えられた語を含むテキストをテキストデータベース8か
ら抽出する。
(201) The homophone word supply unit 1 supplies one of the homophone words to the text extraction unit 2. (202) The text extracting unit 2 extracts from the text database 8 the text including the word given from the homophone supplying unit 1.

【0024】(203)語抽出部3は、テキスト抽出部
2が抽出した各テキスト中の語のうち、共起語候補格納
部9に格納された語を抽出する。
(203) The word extraction unit 3 extracts the words stored in the co-occurrence word candidate storage unit 9 from the words in each text extracted by the text extraction unit 2.

【0025】(204)テキスト数計数部4は、語抽出
部3が抽出した各語について、各語が出現するテキスト
の数を求める。
(204) For each word extracted by the word extraction unit 3, the text number counting unit 4 obtains the number of texts in which each word appears.

【0026】(205)(204)で求めた語とテキス
ト数の組合せから、テキスト数を同じくする語の集合が
想定できる。語数計数部5は、テキスト数を同じくする
語の数を各テキスト数ごとに求める。
(205) From the pairs of word and text count obtained in (204), sets of words sharing the same text count can be formed. The word number counting unit 5 obtains, for each text count, the number of words having that text count.

【0027】(206)共起語選択部6は、語計数部5
が求めたテキスト数と語数から共起語を取捨選択するた
めの基準値を求め、基準値に基づいて共起語を抽出す
る。一般に、テキスト数が小さくなるにしたがって語数
が多くなる。そこで、その語数の増え方が極端に大きく
なる直前のテキスト数を基準値とし、テキスト数が基準
値以上の語を抽出する。
(206) The co-occurrence word selection unit 6 obtains, from the text counts and word counts produced by the word number counting unit 5, a reference value for selecting co-occurrence words, and extracts co-occurrence words based on that reference value. In general, the number of words increases as the text count decreases. Therefore, the text count immediately before the point at which the increase in the number of words becomes extremely large is taken as the reference value, and words whose text count is equal to or greater than the reference value are extracted.

【0028】(207)全ての同音語について処理され
たならステップ(208)へ進む。未処理の同音語があ
るなら(201)に進む。
(207) If all homophones have been processed, the process proceeds to step (208). If an unprocessed homophone remains, the process returns to (201).

【0029】(208)共起語削除部7は、各同音語と
対応づけて抽出された共起語群を比べて、少なくとも2
つの共起語群に含まれる語を各共起語群より取り除く
(おわり)。
(208) The co-occurrence word deletion unit 7 compares the co-occurrence word groups extracted for the individual homophones and removes from each group any word that is included in at least two of the groups (end).
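
The reference value used in step (206) can be sketched as below. The source does not define when the increase in the number of words counts as "extremely large", so the ratio parameter `jump` is purely an assumption, as are the function names and the sample numbers (they are not taken from FIG. 4).

```python
def select_threshold(hist, jump=3.0):
    """Pick the reference text count from a {text_count: number_of_words} table.

    Walks from the highest text count downward and returns the last text
    count before the number of words grows by more than `jump` times.
    `jump` is a hypothetical criterion, not given in the source.
    """
    levels = sorted(hist, reverse=True)
    if not levels:
        raise ValueError("empty table")
    for higher, lower in zip(levels, levels[1:]):
        if hist[lower] > jump * hist[higher]:
            return higher
    return levels[-1]  # no sharp increase found: keep every word

def select_words(df, threshold):
    """Keep the words whose text count is at least the reference value."""
    return sorted(w for w, n in df.items() if n >= threshold)

# Illustrative numbers only: the word count explodes below a text count of 3.
hist = {5: 2, 4: 3, 3: 5, 2: 40, 1: 300}
threshold = select_threshold(hist)

df = {"apparatus": 5, "electronic": 4, "sales": 3, "stress": 1}
kept = select_words(df, threshold)
```

A ratio test is only one way to detect the knee; a fixed text count chosen in advance would be a simpler alternative.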

【0030】以上のように動作する本実施例の共起語抽
出装置における具体的な動作を次に説明する。
The specific operation of the co-occurrence word extraction device of this embodiment, which operates as described above, will be described below.

【0031】同音語供給部1は、例えば、読み文字列が
「きき」である同音語「機器」と「危機」を保持するも
のとする。また、テキストデータベース8は一定の単位
に区切られた複数のテキストを格納するものである。本
実施例ではテキストデータベース8は文単位に区切られ
た複数のテキストを持つものとする。
The homophone supply unit 1 is assumed to hold, for example, the homophones "機器" (device) and "危機" (crisis), whose reading is "きき" (kiki). The text database 8 stores a plurality of texts divided into fixed units. In this embodiment, the text database 8 is assumed to hold a plurality of texts divided into sentences.

【0032】また、共起語候補格納部9は共起語の候補
を格納する。例えば品詞等の文法的情報や意味コードに
基づいたり、あるいは個別に特定の語を選別して得られ
た語を、共起語候補格納部9に格納することが考えられ
る。本実施例では共起語候補格納部9は、汎用的に使わ
れる「近年」等の語を除く自立語の一部を格納するもの
とする。
The co-occurrence word candidate storage unit 9 stores co-occurrence word candidates. For example, the words stored there may be chosen on the basis of grammatical information such as part of speech or of semantic codes, or by selecting specific words individually. In this embodiment, the co-occurrence word candidate storage unit 9 stores a subset of the independent words, excluding general-purpose words such as "recent years".

【0033】(1)同音語供給部1は同音語の一つ「機
器」を取り出し、テキスト抽出部2に取り出した語「機
器」を与える。
(1) The homophone supplying unit 1 takes out one of the homophones "device" and gives the extracted word "device" to the text extracting unit 2.

【0034】(2)テキスト抽出部2は、同音語供給部
1から与えられた語「機器」を含む複数のテキストをテ
キストデータベース8から抽出し、例えばテキスト「情
報機器に対するストレスをテクノ・ストレスと呼び、近
年、テクノ・ストレスが増加している。」等の複数のテ
キストを得る。
(2) The text extraction unit 2 extracts from the text database 8 a plurality of texts containing the word "device" given by the homophone supply unit 1, obtaining, for example, the text "Stress on information devices is called techno-stress, and in recent years techno-stress has been increasing," among others.

【0035】(3)語抽出部3は、テキスト抽出部2で
抽出された各テキストを形態素解析し、共起語候補格納
部9に格納されている語のみを抽出する。例えば前記テ
キスト「情報機器に対する…」の場合、自立語は「情
報」「機器」「ストレス」「テクノ」「呼ぶ」「近年」
「増加」であるが、「近年」は共起語候補格納部9に格
納されていないため取り除かれる。この結果、語抽出部
3は、自立語の「情報」「機器」「ストレス」「テク
ノ」「呼ぶ」「増加」を抽出する。
(3) The word extraction unit 3 morphologically analyzes each text extracted by the text extraction unit 2 and extracts only the words stored in the co-occurrence word candidate storage unit 9. For example, in the case of the text "Stress on information devices ...", the independent words are "information", "device", "stress", "techno", "call", "recent years", and "increase", but "recent years" is removed because it is not stored in the co-occurrence word candidate storage unit 9. As a result, the word extraction unit 3 extracts the independent words "information", "device", "stress", "techno", "call", and "increase".

【0036】(4)テキスト数計数部4は、語抽出部3
が抽出した各語について、各語が出現したテキストの数
を計数し、例えば図3(a)のような結果を得る。
(4) The text number counting unit 4 counts, for each word extracted by the word extraction unit 3, the number of texts in which that word appeared, and obtains, for example, the result shown in FIG. 3(a).

【0037】(5)語数計数部5は、テキスト数毎にそ
のテキスト数を持つ語の数を求め、例えば図4の結果を
得る。
(5) The word number counting unit 5 obtains, for each text count, the number of words having that text count, and obtains, for example, the result shown in FIG. 4.

【0038】(6)図4で、テキスト数が3より小さい
の語の数の増え方は、テキスト数が3以上の語の数の増
え方よりも非常に大きい。そこで、共起語選択部6は、
「機器」に対応する語群のうち、テキスト数が3以上の
語だけを語群より抽出する。
(6) In FIG. 4, the number of words grows far more steeply for text counts below 3 than for text counts of 3 or more. Therefore, from the word group corresponding to "device", the co-occurrence word selection unit 6 extracts only the words whose text count is 3 or more.

【0039】ここで、前述のテキストの中で語「ストレ
ス」は3回出現するが、語「ストレス」が「機器」を含
む他のテキストの中に出現しないならば、「ストレス」
に対応するテキスト数は1で、有効な共起語として残ら
ない。
Here, the word "stress" appears three times in the aforementioned text, but if "stress" does not appear in any other text containing "device", the text count for "stress" is 1, and it does not remain as a valid co-occurrence word.

【0040】(7)同音語供給部1は次の同音語「危
機」を取り出し、「機器」の場合と同様の手順で対応す
る語群を抽出し、例えば図3(b)の結果を得る。
(7) The homophone supply unit 1 takes out the next homophone, "crisis", and the corresponding word group is extracted by the same procedure as for "device", yielding, for example, the result shown in FIG. 3(b).

【0041】(8)同音語の全てを処理したので、次の
ステップに進む。 (9)共起語削除部7は、「機器」と「危機」に対応す
る共起語群を比べて、少なくとも2つの共起語群に含ま
れる語を各共起語群より取り除く。例えば、図3(a)
と図3(b)に示す例では、語「構造」が双方から取り
除かれる。
(8) Since all the homophones have been processed, the process proceeds to the next step. (9) The co-occurrence word deletion unit 7 compares the co-occurrence word groups corresponding to "device" and "crisis" and removes from each group any word included in at least two of the groups. For example, in the example shown in FIG. 3(a) and FIG. 3(b), the word "structure" is removed from both.
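
The deletion in step (9) amounts to dropping every word that belongs to two or more groups. A minimal Python sketch, using English glosses for the words named in the example; the function name `drop_shared_words` is hypothetical.

```python
from collections import Counter

def drop_shared_words(groups):
    """Remove from each group any word found in at least two groups."""
    membership = Counter(w for words in groups.values() for w in set(words))
    return {
        homophone: [w for w in words if membership[w] < 2]
        for homophone, words in groups.items()
    }

# Co-occurrence word groups for the homophones read "kiki" (glosses).
groups = {
    "device": ["apparatus", "electronic", "sales", "structure"],
    "crisis": ["military", "era", "theory", "structure"],
}

cleaned = drop_shared_words(groups)
```

Only "structure" appears in both groups, so it is the only word removed; the same function works unchanged for readings with three or more homophones.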

【0042】以上の結果、図3の例の場合には、「機
器」の共起語として「装置」「電子」「販売」…を、ま
た、「危機」共起語として「軍事」「時代」「説」…を
自動的に抽出できる。
As a result, in the example of FIG. 3, "apparatus", "electronic", "sales", ... can be extracted automatically as co-occurrence words of "device", and "military", "era", "theory", ... as co-occurrence words of "crisis".

【0043】なお、本実施例においては、テキスト抽出
部2が複数のテキストを格納したテキストデータベース
8から必要なテキストを抽出した後で、語抽出部2がそ
の中に含まれる自立語を抽出したが、テキストデータベ
ース8中の各テキストに対応してあらかじめ自立語を抽
出しておき、これを格納した自立語データベースを備
え、必要な自立語群を自立語データベースから抽出する
方法であってもよい。また、語抽出部3がテキスト中か
ら語を抽出する際、形態素解析処理を行ったが、あらか
じめ用意された辞書との表記上でマッチングする等の簡
略化された方法であってもよい。また、共起語選択部6
は、語数の増え方が大きくなるテキスト数を基準値とす
るかわりに、一定のテキスト数をあらかじめ定め、例え
ば少なくとも2つのテキストに出現した語を選ぶなどの
方法であってもよい。また、共起語選択部6の処理に先
立って共起語削除部7の処理を実行する方法であっても
よい。
In this embodiment, the text extraction unit 2 extracts the necessary texts from the text database 8 storing a plurality of texts, and the word extraction unit 3 then extracts the independent words they contain. Alternatively, independent words may be extracted in advance for each text in the text database 8 and stored in an independent word database, from which the necessary independent word groups are then extracted. Also, although the word extraction unit 3 performs morphological analysis when extracting words from a text, a simplified method may be used instead, such as matching surface forms against a dictionary prepared in advance. Also, instead of taking as the reference value the text count at which the increase in the number of words becomes large, the co-occurrence word selection unit 6 may use a text count fixed in advance, for example selecting words that appear in at least two texts. The processing of the co-occurrence word deletion unit 7 may also be executed before the processing of the co-occurrence word selection unit 6.

【0044】[0044]

【発明の効果】本発明の共起語抽出方法およびその装置
においては、テキスト中に同時に出現する語の組が出現
するテキストの数に基づいて共起語を取捨選択するの
で、抽出された共起語はあらかじめ用意するテキストの
内容の影響を受けにくくなる。
[Effect of the Invention] In the co-occurrence word extraction method and apparatus of the present invention, co-occurrence words are selected based on the number of texts in which a pair of words appears together, so the extracted co-occurrence words are less affected by the content of the texts prepared in advance.

【0045】また、各同音語に対応づけて抽出された共
起語群のうち、少なくとも2つの共起語群に含まれる語
を各共起語群より取り除くので、かな漢字変換の同音語
選択の際、変換候補中で同じ語が同音語中の複数の語と
共起関係を持つことを抑制する。
Further, among the co-occurrence word groups extracted for the individual homophones, words included in at least two groups are removed from each group. As a result, when selecting a homophone in kana-kanji conversion, the same word among the conversion candidates is prevented from having a co-occurrence relation with more than one of the homophones.
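
How such disjoint groups might be used during homophone selection can be sketched as follows. The source describes only the general idea of prioritizing a candidate that has a co-occurring word in the input; the scoring by overlap size, the fallback to the first candidate, and the name `choose_homophone` are all assumptions made for illustration.

```python
def choose_homophone(candidates, context_words, cooccurrence):
    """Prefer the candidate with the most co-occurring words in the context.

    Assumption: when no candidate (or several equally) matches, the first
    candidate in the list is returned, since max() keeps the first maximum.
    """
    context = set(context_words)
    return max(
        candidates,
        key=lambda c: len(context & set(cooccurrence.get(c, ()))),
    )

# Groups after shared words have been removed (English glosses).
cooccurrence = {
    "device": ["apparatus", "electronic", "sales"],
    "crisis": ["military", "era", "theory"],
}

picked = choose_homophone(["device", "crisis"], ["military", "era"], cooccurrence)
```

Because no word belongs to both groups, a single context word can never support two homophones at once, which is the effect this paragraph describes.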

【0046】また、テキスト数が小さくなるにしたがっ
て語数が多くなるが、その語数の増え方の際立って大き
くなるテキスト数を基準として共起語を取捨選択するの
で、共起語の記憶容量と有効性を比較してコストパフォ
ーマンスのよい共起語を抽出することができる。
Also, although the number of words increases as the text count decreases, co-occurrence words are selected using as the criterion the text count at which the increase in the number of words becomes conspicuously large. Co-occurrence words with good cost performance can therefore be extracted by weighing the storage capacity required for the co-occurrence words against their effectiveness.

【0047】以上のように本発明の共起語抽出方法およ
びその装置においては、同音語の選択に有効で信頼性の
高い共起語を自動的に抽出することができるため、その
実用的価値は大きい。
As described above, the co-occurrence word extraction method and apparatus of the present invention can automatically extract highly reliable co-occurrence words that are effective for homophone selection, and their practical value is therefore great.

【図面の簡単な説明】[Brief description of drawings]

【図1】本発明の一実施例における共起語抽出装置の構
成図
FIG. 1 is a configuration diagram of a co-occurrence word extraction device according to an embodiment of the present invention.

【図2】本発明の一実施例における動作を示すフローチ
ャート
FIG. 2 is a flowchart showing the operation in one embodiment of the present invention.

【図3】(a)は本発明の一実施例における「機器」に
対応する語とそのテキスト数の一例を示す模式図 (b)は本発明の一実施例における「危機」に対応する
語とそのテキスト数の一例を示す模式図
FIG. 3(a) is a schematic diagram showing an example of the words corresponding to "device" and their text counts in one embodiment of the present invention. FIG. 3(b) is a schematic diagram showing an example of the words corresponding to "crisis" and their text counts in one embodiment of the present invention.

【図4】本発明の一実施例におけるテキスト数とその語
数との一例を示す模式図
FIG. 4 is a schematic diagram showing an example of the number of texts and the number of words in one embodiment of the present invention.

【図5】従来の共起語抽出装置の構成図
FIG. 5 is a block diagram of a conventional co-occurrence word extraction device.

【図6】従来の共起語抽出装置における2語の組合せと
出現頻度の一例を示す模式図
FIG. 6 is a schematic diagram showing an example of a combination of two words and an appearance frequency in a conventional co-occurrence word extraction device.

【符号の説明】[Explanation of symbols]

1 同音語供給部 2 テキスト抽出部 3 語抽出部 4 テキスト数計数部 5 語数計数部 6 共起語選択部 7 共起語削除部 9 共起語候補格納部 1 homophone supply unit 2 text extraction unit 3 word extraction unit 4 text number counting unit 5 word number counting unit 6 co-occurrence word selection unit 7 co-occurrence word deletion unit 9 co-occurrence word candidate storage unit

Claims (10)

【特許請求の範囲】[Claims] 【請求項1】同音語の各語を順次供給する同音語供給ス
テップと、同音語供給ステップが供給した語を含む複数
のテキストを抽出するテキスト抽出ステップと、抽出さ
れたテキスト中の特定の語を抽出する語抽出ステップ
と、テキスト中の各語について各語が出現するテキスト
の数を求めるテキスト数計数ステップと、テキスト数計
数ステップで計数されたテキストの数に基づいて共起語
群を抽出する共起語選択ステップを実行することを特徴
とする共起語抽出方法。
1. A co-occurrence word extraction method comprising executing: a homophone supplying step of supplying each word of a set of homophones in turn; a text extracting step of extracting a plurality of texts containing the word supplied in the homophone supplying step; a word extracting step of extracting specific words from the extracted texts; a text number counting step of obtaining, for each word in the texts, the number of texts in which that word appears; and a co-occurrence word selecting step of extracting a co-occurrence word group based on the numbers of texts counted in the text number counting step.
【請求項2】同音語の各語を順次供給する同音語供給部
と、同音語供給部が供給した語を含む複数のテキストを
抽出するテキスト抽出部と、抽出されたテキスト中の特
定の語を抽出する語抽出部と、テキスト中の各語につい
て各語が出現するテキストの数を求めるテキスト数計数
部と、テキスト数計数部で計数されたテキストの数に基
づいて共起語群を抽出する共起語選択部を備えたことを
特徴とする共起語抽出装置。
2. A co-occurrence word extraction device comprising: a homophone supply unit that supplies each word of a set of homophones in turn; a text extraction unit that extracts a plurality of texts containing the word supplied by the homophone supply unit; a word extraction unit that extracts specific words from the extracted texts; a text number counting unit that obtains, for each word in the texts, the number of texts in which that word appears; and a co-occurrence word selection unit that extracts a co-occurrence word group based on the numbers of texts counted by the text number counting unit.
【請求項3】同音語の各語に対応する共起語群のうち、
少なくとも2つの共起語群に含まれる語を各共起語群よ
り取り除く共起語削除ステップを備えた請求項1記載の
共起語抽出方法。
3. The co-occurrence word extraction method according to claim 1, further comprising a co-occurrence word deleting step of removing, from each co-occurrence word group, any word included in at least two of the co-occurrence word groups corresponding to the individual homophones.
【請求項4】同音語の各語に対応する共起語群のうち、
少なくとも2つの共起語群に含まれる語を各共起語群よ
り取り除く共起語削除部を備えた請求項2記載の共起語
抽出装置。
4. The co-occurrence word extraction device according to claim 2, further comprising a co-occurrence word deletion unit that removes, from each co-occurrence word group, any word included in at least two of the co-occurrence word groups corresponding to the individual homophones.
【請求項5】テキスト数計数ステップで計数されたテキ
スト数毎にその語数を求める語数計数ステップを備え、
共起語選択ステップが語数に基づいて共起語群を抽出す
る請求項1記載の共起語抽出方法。
5. The co-occurrence word extraction method according to claim 1, further comprising a word number counting step of obtaining, for each text count counted in the text number counting step, the number of words having that count, wherein the co-occurrence word selecting step extracts the co-occurrence word group based on those word counts.
【請求項6】テキスト数計数部で計数されたテキスト数
毎にその語数を求める語数計数部を備え、共起語選択部
が語数に基づいて共起語群を抽出する請求項2記載の共
起語抽出装置。
6. The co-occurrence word extraction device according to claim 2, further comprising a word number counting unit that obtains, for each text count counted by the text number counting unit, the number of words having that count, wherein the co-occurrence word selection unit extracts the co-occurrence word group based on those word counts.
【請求項7】語抽出ステップがテキスト中の自立語を抽
出する請求項1記載の共起語抽出方法。
7. The co-occurrence word extracting method according to claim 1, wherein the word extracting step extracts independent words in the text.
【請求項8】語抽出部がテキスト中の自立語を抽出する
請求項1記載の共起語抽出装置。
8. The co-occurrence word extraction device according to claim 1, wherein the word extraction unit extracts an independent word in the text.
【請求項9】語抽出ステップが、テキスト中の語のう
ち、あらかじめ用意された共起語の候補から語を抽出す
る請求項1記載の共起語抽出方法。
9. The co-occurrence word extracting method according to claim 1, wherein the word extracting step extracts a word from a co-occurring word candidate prepared in advance among the words in the text.
【請求項10】あらかじめ共起語の候補を格納する共起
語候補格納部を備え、語抽出部がテキスト中の語のうち
共起語候補格納部にある語を抽出する請求項2記載の共
起語抽出装置。
10. The co-occurrence word extraction device according to claim 2, further comprising a co-occurrence word candidate storage unit that stores co-occurrence word candidates in advance, wherein the word extraction unit extracts, from among the words in the texts, those words that are in the co-occurrence word candidate storage unit.
JP5170122A 1993-07-09 1993-07-09 Coocurrence word extracting method and device therefor Pending JPH0728823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP5170122A JPH0728823A (en) 1993-07-09 1993-07-09 Coocurrence word extracting method and device therefor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP5170122A JPH0728823A (en) 1993-07-09 1993-07-09 Coocurrence word extracting method and device therefor

Publications (1)

Publication Number Publication Date
JPH0728823A true JPH0728823A (en) 1995-01-31

Family

ID=15899055

Family Applications (1)

Application Number Title Priority Date Filing Date
JP5170122A Pending JPH0728823A (en) 1993-07-09 1993-07-09 Coocurrence word extracting method and device therefor

Country Status (1)

Country Link
JP (1) JPH0728823A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011164865A (en) * 2010-02-08 2011-08-25 Kddi Corp Image-selecting device, image-selecting method, and image-selecting program
