JPH08161340A

JPH08161340A - Automatic compound word extraction device

Info

Publication number: JPH08161340A
Application number: JP6307223A
Authority: JP
Inventors: Takashi Katooka; 隆加登岡
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1994-12-12
Filing date: 1994-12-12
Publication date: 1996-06-21

Abstract

PURPOSE: To efficiently and automatically collect collocations with a high degree of co-occurrence, including idioms and collocations formed by combining words that are not highly specialized. CONSTITUTION: An N-gram cutout device 2 cuts out word N-grams from a target document read by a target document input unit 1. A frequency aggregation device 3 totals the appearance frequency of each cut-out N-gram concatenated word, and a word storage device 4 stores the N-gram concatenated words together with the appearance frequencies totaled by the frequency aggregation device 3. A co-occurrence degree calculation device 5 calculates the co-occurrence degree of each N-gram from the appearance frequencies, in the target document, of the individual words constituting the N-gram and the appearance frequency of the N-gram itself. A classification device 6 rearranges the information in the word storage device 4 by the co-occurrence degree calculated by the co-occurrence degree calculation device 5.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an automatic collocation extraction device in a language processing apparatus, and more particularly to an automatic collocation extraction device for efficiently and automatically collecting collocations from a target document. It is applied, for example, to term dictionary creation devices for machine translation systems, word processors, and the like.

[0002]

[Prior Art] One known document describing a conventional language processing device is, for example, Japanese Patent Laid-Open No. 6-19968. In that publication, in order to make it easy to extract technical terms from a huge number of words and to build a technical term dictionary easily and in a short time, a word segmentation device normalizes the input sentence, for example by dividing it into words and attaching part-of-speech information; the normalized input data is output to a technical term determination device; that device evaluates each word while referring to the dictionaries; and technical term candidates are extracted according to this evaluation. However, the technical term determination takes into account the number of constituent words, the frequency of use of the constituent words, field-specific term dictionaries, and character type (katakana words), and therefore requires field-specific term dictionaries. Moreover, the publication does not describe how the technical term candidates to be judged are selected.

[0003]

[Problems to Be Solved by the Invention] In a machine translation system, for example, when text in a particular field is translated, translation performance depends heavily on how many of that field's terms have been registered in the dictionary in advance. However, conventional technical term detection methods based on unknown-word search have the drawback that technical terms consisting of multiple words, such as idioms and collocations, cannot be extracted efficiently. In addition, the device of the above publication collects terms using field-specific term dictionaries, which has the drawback of making the device heavy.

[0004] The present invention has been made in view of these circumstances. Its object is to provide an automatic collocation extraction device that, even for idioms and collocations formed by combining words that are not highly specialized, determines whether a multi-word entry is a genuine co-occurrence or merely a chance concatenation, and efficiently and automatically collects collocations with a strong degree of co-occurrence.

[0005]

[Means for Solving the Problems] To solve the above problems, the present invention is characterized by (1) having a target document input unit that reads a target document; a cutout device that cuts out word N-grams (N = 1, 2, 3, ..., Nmax) from the target document read by the target document input unit; a frequency aggregation device that totals the appearance frequency of the N-gram concatenated words cut out by the cutout device; a word storage device that stores each N-gram concatenated word and its appearance frequency; a co-occurrence degree calculation device that calculates the co-occurrence degree of an N-gram using the appearance frequency in the target document of each word constituting the N-gram (the N = 1 case) and the appearance frequency of the N-gram itself; and a classification device that rearranges the information in the word storage device by the co-occurrence degree calculated by the co-occurrence degree calculation device; and further by (2) extracting collocations with higher accuracy by using a condition setting storage device that stores configurations that do not meet the conditions for extraction as a term, and a pattern matching device for excluding N-grams that match the configurations stored in the word storage device.

[0006]

[Operation] The automatic collocation extraction device of the present invention having the above configuration (1) inputs a target document, cuts out word N-grams (N = 1, 2, 3, ..., Nmax) from the target document, totals the appearance frequency of the cut-out N-gram concatenated words, stores the N-gram concatenated words and their appearance frequencies, calculates the co-occurrence degree of each N-gram using the appearance frequency in the target document of each word constituting the N-gram (the N = 1 case) and the appearance frequency of the N-gram itself, and rearranges the information in the storage device by co-occurrence degree. Because the strength of co-occurrence among the constituent words of a collocation can thus be obtained from the appearance frequencies of the constituent words and of the collocation, words that appear in the input text as collocations with strong co-occurrence can be extracted efficiently, automatically, and quickly with a simple device and without using a dictionary or the like. (2) Configurations that do not meet the conditions for extraction as a term are stored, and a pattern matching device for excluding N-grams that match the stored configurations is used to extract collocations with higher accuracy. By storing in the condition setting storage device, in advance, N-gram patterns considered unsuitable for extraction as terms from among the words extracted in (1), candidates matching these patterns can be excluded from the term candidates, and terms can be extracted accurately.

[0007]

[Embodiments] Embodiments are described below with reference to the drawings. First, N-gram concatenated words (N = 1, 2, 3, ..., Nmax) are extracted from the target document. If the target document is in English, it is divided word by word using cues such as space characters, based on the morphological characteristics of the language. For 3-grams (N = 3), terms of up to three consecutive words are cut out. A device that totals appearance frequencies counts the appearance frequency of each extracted N-word concatenation. The appearance frequency of each single word is also counted and totaled. The results are stored in the N-gram word storage device.
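The cutting-out and counting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `count_ngrams`, the use of plain whitespace splitting, and the sample text are all assumptions made for the example.

```python
from collections import Counter

def count_ngrams(text, n_max=3):
    """Count every run of 1..n_max consecutive words in text.

    Returns (counts, total_words). Keys of counts are tuples of words.
    """
    words = text.split()  # assumption: whitespace is enough to separate words
    counts = Counter()
    for i in range(len(words)):
        for n in range(1, n_max + 1):
            if i + n <= len(words):  # skip runs that would pass the end
                counts[tuple(words[i:i + n])] += 1
    return counts, len(words)

# Hypothetical input, chosen only to show a repeated two-word run.
counts, total = count_ngrams("the cat saw the cat", n_max=2)
print(counts[("the", "cat")])  # 2
print(total)                   # 5
```

Single-word counts are kept in the same table as the longer runs, which matches the description of storing the per-word frequencies (N = 1) alongside the N-gram frequencies.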

[0008] When the appearance frequencies for the input document have been totaled, the co-occurrence degree of each N-gram concatenated word in the N-gram word storage device is calculated, for example, according to the following equation. Let the N concatenated words, that is, the constituent words of the collocation, be w1, w2, w3, ..., wN; let their respective appearance frequencies be H(w1), H(w2), ..., H(wN); and let the appearance frequency of the N-word concatenation itself be H(w1, w2, w3, ..., wN). Let A be the total number of words in the target input document.

[0009]

[Equation 1]

[0010] The denominator of equation (1) represents the probability that the words are concatenated by chance, derived from the appearance probability of each word constituting the collocation. The numerator of equation (1) is the probability that the words actually appear concatenated. Equation (1) is therefore the ratio of the probability that a collocation actually occurs to the probability that it occurs by chance. The higher the value of equation (1), the more strongly the N-gram concatenated word can be said to co-occur. Conversely, when the value is low, the words are more likely to have been concatenated by chance than to co-occur.
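Equation (1) itself is not reproduced in this text. One form consistent with the description of its numerator and denominator is given below as a reconstruction only. It assumes that each appearance probability is the corresponding frequency divided by the total word count A, and that the chance probability is the product of the single-word probabilities; the name C for the co-occurrence degree is introduced here for illustration.

```latex
C(w_1, w_2, \ldots, w_N)
  = \frac{H(w_1, w_2, \ldots, w_N) / A}
         {\prod_{i=1}^{N} \bigl( H(w_i) / A \bigr)}
```

With hypothetical figures A = 1,000,000, H(w1) = 1,000, H(w2) = 500, and H(w1, w2) = 100, the numerator is 10^-4 and the denominator is 10^-3 × 5 × 10^-4 = 5 × 10^-7, so the ratio is 200: the pair would appear together 200 times more often than chance alone predicts.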

[0011] FIG. 1 is a configuration diagram for explaining one embodiment (Embodiment 1) of the automatic collocation extraction device according to the present invention. In the figure, 1 is a target document input unit, 2 is an N-gram cutout device, 3 is an N-gram frequency aggregation device, 4 is an N-gram word storage device (appearance frequency storage device), 5 is a co-occurrence degree calculation device, and 6 is a classification device.

[0012] A target document is read from the target document input unit 1, and the N-gram cutout device 2 cuts out word N-grams (N = 1, 2, 3, ..., Nmax) from the target document so read. The frequency aggregation device 3 totals the appearance frequency of the N-gram concatenated words cut out by the N-gram cutout device 2, and the word storage device 4 stores the N-gram concatenated words together with the appearance frequencies totaled by the frequency aggregation device 3.

[0013] The co-occurrence degree calculation device 5 calculates the co-occurrence degree of each N-gram using the appearance frequency in the target document of each word constituting the N-gram (the N = 1 case) and the appearance frequency of the N-gram itself. The classification device 6 rearranges the information in the word storage device 4 by the co-occurrence degree calculated by the co-occurrence degree calculation device 5. In this way, even for idioms and collocations formed by combining words that are not highly specialized, it is possible to determine whether a multi-word entry is a genuine co-occurrence or a chance concatenation, and to collect collocations with a strong degree of co-occurrence efficiently and automatically.
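The roles of the co-occurrence degree calculation device 5 and the classification device 6 can be sketched as below. The score is assumed to be the ratio of the observed N-gram probability to the product of the single-word probabilities, which follows the verbal description of equation (1) but is not confirmed by a printed formula. The function names and all counts are hypothetical.

```python
from math import prod

def cooccurrence(ngram, counts, total):
    """Observed probability of the n-gram divided by its chance probability."""
    observed = counts[ngram] / total
    chance = prod(counts[(w,)] / total for w in ngram)
    return observed / chance

def rank(counts, total):
    """Score every entry of two or more words and sort by descending score."""
    scored = [(g, cooccurrence(g, counts, total)) for g in counts if len(g) >= 2]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical frequency table: single words (N = 1) and two-word runs.
counts = {
    ("machine",): 4, ("translation",): 2, ("the",): 10,
    ("machine", "translation"): 2, ("the", "machine"): 2,
}
total = 100  # hypothetical total word count A

for ngram, score in rank(counts, total):
    print(" ".join(ngram), round(score, 2))
```

Both two-word runs occur twice, but "machine translation" scores about five times higher than "the machine" because "the" is frequent on its own, which makes a chance adjacency more likely.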

[0014] FIGS. 2 and 3 are flowcharts for explaining the operation of the automatic collocation extraction device according to the present invention. The steps (S) are described in order below. First, the variables i and j are set to 1 (S1), and it is determined whether the value of j exceeds the maximum number of concatenated words N (S2). If it does not exceed N, j words starting from the i-th word from the beginning of the original text are input from the target document input unit 1 and stored in the variable words (S3). Next, it is determined whether the (i + j - 1)-th word exists (S4). If it exists, it is determined whether the word string in words is already present in the appearance frequency storage device 4 (S5). If it is not present, the contents of words are stored in the appearance frequency storage device 4 with an appearance count of 1 (S6), j is incremented by 1 (S7), and the process returns to step S2.

[0015] If, in step S5, the word string in words is already present in the appearance frequency storage device 4, the appearance count stored for the contents of words in the appearance frequency storage device 4 is incremented by 1 (S8), and the process goes to step S7. If, in step S2, the value of j exceeds the maximum number of concatenated words N, j is set to 1 and i is incremented by 1 (S9), and the process goes to step S3. If, in step S4, the (i + j - 1)-th word does not exist, it is determined whether j is equal to i (S10). If they are not equal, the process goes to step S9; if they are equal, the variable A, which stores the total number of words in the target document, is set to i - 1 (S11).

[0016] Next, the co-occurrence degree of the collocations of up to N words stored in the appearance frequency storage device 4 is calculated. The results are stored in the appearance frequency storage device 4 (S12), and the classification device 6 rearranges the information stored in the appearance frequency storage device 4 in descending order of co-occurrence degree (S13).
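The counting loop of steps S1 to S11 can be traced with a sketch like the one below, which keeps the 1-based variables i and j of the flowchart. Two points are assumptions rather than statements from the text: a dictionary stands in for the appearance frequency storage device 4, and the end-of-document test at S10 is implemented as "j is 1" (no word at all is left at position i), whereas the text states the comparison as j against i. The function name `tally` is hypothetical.

```python
def tally(words, n_max):
    """Follow steps S1-S11 and return (store, A)."""
    store = {}      # stands in for the appearance frequency storage device 4
    i, j = 1, 1     # S1: 1-based position i and run length j
    while True:
        if j > n_max:                        # S2: run length exceeded
            i, j = i + 1, 1                  # S9: move to the next start word
        if i + j - 1 > len(words):           # S4: word (i + j - 1) is missing
            if j == 1:                       # S10 (assumed form of the test)
                return store, i - 1          # S11: A = total number of words
            i, j = i + 1, 1                  # S9
            continue
        key = tuple(words[i - 1:i - 1 + j])  # S3: j words from the i-th word
        store[key] = store.get(key, 0) + 1   # S5 with S6 (new) or S8 (seen)
        j += 1                               # S7

# Hypothetical four-word input with a repeated pair.
store, total_words = tally(["a", "b", "a", "b"], n_max=2)
print(store[("a", "b")])  # 2
print(total_words)        # 4
```

The co-occurrence calculation (S12) and the sort (S13) would then run over `store` and `total_words`.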

[0017] Embodiment 1 of the present invention is described concretely below using an example sentence. The original text is input from the target document input unit.

[0018]

[Table 1]

[0019] An embodiment that automatically extracts collocations of up to 3-grams is described. First, the variables i and j are set to the initial value 1 (S1). The target document is read from the target document input unit 1, and since j does not exceed the maximum collocation length of 3 (S2), one word (j = 1) is obtained from the beginning (i = 1) (S3). That is, "The" is obtained. Since the (i + j - 1 = 1)st word exists (S4), it is checked whether this word is stored in the N-gram word storage device 4 (S5). Since it is not yet stored, the word "The" is newly stored in the N-gram word storage device 4 with an appearance count of 1 (S6). If it had already been stored, its appearance count would be incremented by 1 (S8). j is incremented by 1 (S7), and two words (j + 1) are then obtained from the beginning (i = 1).

[0020] That is, "The orchestra" is obtained. Since the (i + j - 1 = 2)nd word exists (S4), it is checked whether this word string is stored in the N-gram word storage device 4. Since it is not yet stored, it is newly stored with an appearance count of 1. Similarly, when j = 3, "The orchestra gave" is obtained. Incrementing j by 1 (S7) gives j = 4. Since j now exceeds the maximum number of concatenated words, 3, j is set to 1 and i is incremented by 1 to 2 (S9).

[0021] Next, one word (j = 1) is obtained starting from the second word (i = 2), which gives "orchestra". As in the case of i = 1, cutting out concatenated words while incrementing j up to the maximum of 3 yields "orchestra gave" and "orchestra gave him", which are stored in the N-gram word storage device together with their appearance counts. This processing is repeated up to i = 6. In the end, the N-gram word storage device 4 holds the contents shown in FIG. 4. The co-occurrence degree of each collocation is then calculated from the appearance frequencies according to equation (1), and the results are stored in the word storage device 4 (S12). Finally, the entries are re-sorted in descending order of co-occurrence degree.

[0022] For example, when one million words of target input text are prepared and concatenated words of up to three words are extracted, the N-gram word storage device 4 after the co-occurrence calculation looks like FIG. 5. No co-occurrence degree is calculated for N = 1. Collocations with strong co-occurrence can be extracted automatically by sorting on the co-occurrence degree and setting a threshold, for example taking the top 50% as candidates, or by sorting the collocations of each length separately by co-occurrence degree (sorting methods include quicksort, merge sort, and simple sorting methods) and, for each length, extracting the collocations whose co-occurrence degree is at or above a set threshold.
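The two selection strategies mentioned above, a top-fraction cut and a per-length threshold, might look like this in outline. The function names, the scores, and the threshold values are hypothetical and serve only to show the mechanics.

```python
def top_fraction(scored, fraction=0.5):
    """Keep the highest-scoring fraction of (ngram, score) pairs."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    keep = int(len(ranked) * fraction)  # rounds down
    return ranked[:keep]

def by_length_threshold(scored, thresholds):
    """Keep n-grams whose score reaches the threshold set for their length."""
    return [(g, s) for g, s in scored
            if len(g) in thresholds and s >= thresholds[len(g)]]

# Hypothetical scored entries: two 2-word and two 3-word runs.
scored = [
    (("a", "b"), 30.0),
    (("c", "d"), 2.0),
    (("a", "b", "c"), 500.0),
    (("d", "e", "f"), 40.0),
]

print(top_fraction(scored, 0.5))
print(by_length_threshold(scored, {2: 10.0, 3: 100.0}))
```

On this data the top-fraction cut keeps only the two 3-word runs, while the per-length thresholds keep one run of each length. Under the ratio form of the score, longer runs tend to receive larger values, which is one reason to compare runs of the same length with each other.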

[0023] FIG. 6 is a configuration diagram for explaining another embodiment (Embodiment 2) of the automatic collocation extraction device according to the present invention. In the figure, 7 is a pattern matching device and 8 is a condition setting storage device; parts that function in the same way as in FIG. 1 are given the same reference numerals.

[0024] A target document is read from the target document input unit 1, and the N-gram cutout device 2 cuts out word N-grams (N = 1, 2, 3, ..., Nmax) from the target document so read. The frequency aggregation device 3 totals the appearance frequency of the N-gram concatenated words cut out by the N-gram cutout device 2, and the word storage device 4 stores the N-gram concatenated words together with the appearance frequencies totaled by the frequency aggregation device 3.

[0025] The co-occurrence degree calculation device 5 calculates the co-occurrence degree of each N-gram using the appearance frequency in the target document of each word constituting the N-gram (the N = 1 case) and the appearance frequency of the N-gram itself. The classification device 6 rearranges the information in the word storage device 4 by the co-occurrence degree calculated by the co-occurrence degree calculation device 5. In this way, even for idioms and collocations formed by combining words that are not highly specialized, it is possible to determine whether a multi-word entry is a genuine co-occurrence or a chance concatenation, and to collect collocations with a strong degree of co-occurrence efficiently and automatically.

[0026] The condition setting storage device 8 stores configurations that do not meet the conditions for extraction as a term, and the pattern matching device 7 excludes N-grams that match the configurations stored in the condition setting storage device 8. Using these makes it possible to extract collocations with higher accuracy. In Embodiment 2, a configuration that does not meet the extraction conditions for a term means a pattern that is unlikely to form a collocation. Examples include a two-word concatenation of an article and some other word, and a two-word concatenation of a pronoun (in English, his, my, your, their, them, him, and so on) and another word.

[0027] An example of Embodiment 2 is shown below. In the example of Table 2, the number of concatenated words is limited to 2, and a two-word concatenation is excluded from the extraction candidates when its first word is any of "the", "a", "an", "his", and so on, as stored in the device. The pattern matching is performed by the pattern matching device 7 by comparing character strings.
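A first-word exclusion of this kind can be sketched as a set lookup. The set below contains only the four words named in the text; the source list continues beyond them and is not filled in here. Lower-casing before comparison is an assumption added so that "The" and "the" match, and the function name and sample pairs are hypothetical.

```python
# Only the words named in the text; the stored list is longer in the source.
EXCLUDED_FIRST_WORDS = {"the", "a", "an", "his"}

def drop_by_first_word(bigrams, excluded=EXCLUDED_FIRST_WORDS):
    """Remove two-word runs whose first word is in the excluded set."""
    # Assumption: comparison is case-insensitive.
    return [g for g in bigrams if g[0].lower() not in excluded]

# Hypothetical candidates.
candidates = [("The", "orchestra"), ("orchestra", "gave"), ("his", "baton")]
print(drop_by_first_word(candidates))  # [('orchestra', 'gave')]
```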

[0028]

[Table 2]

[0029] In the example of Table 3, the number of concatenated words is limited to 2, and a collocation is excluded from the extraction candidates when the first word of the two-word concatenation is any of "the", "a", or "an" stored in the device and the second word is any of "in", "of", "with", ..., "from", "to". The pattern matching is performed by the pattern matching device 7 by comparing character strings.
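The two-position pattern differs from a first-word filter in that both conditions must hold at once. A minimal sketch follows, using only the words named in the text (the second list is elided in the source and is left incomplete here). The names and the case-insensitive comparison are assumptions.

```python
FIRST_WORDS = {"the", "a", "an"}
# Only the words named in the text; the source elides part of this list.
SECOND_WORDS = {"in", "of", "with", "from", "to"}

def is_excluded(bigram):
    """True when the first word AND the second word both match the pattern."""
    first, second = bigram
    # Assumption: comparison is case-insensitive.
    return first.lower() in FIRST_WORDS and second.lower() in SECOND_WORDS

# Hypothetical candidates.
candidates = [("a", "of"), ("the", "orchestra"), ("piece", "of")]
kept = [g for g in candidates if not is_excluded(g)]
print(kept)  # [('the', 'orchestra'), ('piece', 'of')]
```

Here ("the", "orchestra") survives because only its first word matches, so this pattern is narrower than one that rejects every run beginning with an article.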

[0030]

[Table 3]

[0031]

[Effects of the Invention] As is clear from the above description, the present invention has the following effects. (1) Effect corresponding to claim 1: Because the strength of co-occurrence among the constituent words of a collocation can be obtained from the appearance frequencies of the constituent words and of the collocation, words that appear in the input text as collocations with strong co-occurrence can be extracted efficiently, automatically, and quickly with a simple device and without using a dictionary or the like. (2) Effect corresponding to claim 2: Because N-gram patterns considered unsuitable for extraction as terms from among the words extracted in (1) are stored in advance in the condition setting storage device, candidates matching these patterns can be excluded from the term candidates, and terms can be extracted accurately.

[Brief description of drawings]

FIG. 1 is a configuration diagram for explaining one embodiment of the automatic collocation extraction device according to the present invention.

FIG. 2 is a flowchart (part 1) for explaining the operation of the automatic collocation extraction device according to the present invention.

FIG. 3 is a flowchart (part 2) for explaining the operation of the automatic collocation extraction device according to the present invention.

FIG. 4 is a diagram showing an example of the contents stored in the N-gram word storage device of the present invention.

FIG. 5 is a diagram showing another example of the contents stored in the N-gram word storage device of the present invention.

FIG. 6 is a configuration diagram for explaining another embodiment of the automatic collocation extraction device according to the present invention.

[Explanation of symbols]

1 ... target document input unit, 2 ... N-gram cutout device, 3 ... N-gram frequency aggregation device, 4 ... N-gram word storage device, 5 ... co-occurrence degree calculation device, 6 ... classification device, 7 ... pattern matching device, 8 ... condition setting storage device.

Continuation of the front page (51) Int.Cl.⁶ Identification code Internal reference number FI technical display location 9288-5L G06F 15/20 524 A 9288-5L 550 A

Claims

[Claims]

1. An automatic collocation extraction device comprising: a target document input unit for reading a target document; a cutout device for cutting out word N-grams (N = 1, 2, 3, ..., Nmax) from the target document read by the target document input unit; a frequency aggregation device for totaling the appearance frequency of the N-gram concatenated words cut out by the cutout device; a word storage device for storing each N-gram concatenated word and its appearance frequency; a co-occurrence degree calculation device for calculating the co-occurrence degree of an N-gram using the appearance frequency in the target document of each word constituting the N-gram and the appearance frequency of the N-gram itself; and a classification device for rearranging the information in the word storage device according to the co-occurrence degree calculated by the co-occurrence degree calculation device.

2. The automatic collocation extraction device according to claim 1, further comprising a condition setting storage device for storing configurations that do not meet the conditions for extraction as a term, and a pattern matching device for excluding N-grams that match the configurations stored in the word storage device, whereby collocations are extracted with higher accuracy.