JP2003085170A

JP2003085170A - Fixed form sentence clustering device and method

Info

Publication number: JP2003085170A
Application number: JP2001274396A
Authority: JP
Inventors: Naoto Kato; 直人加藤
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2001-09-11
Filing date: 2001-09-11
Publication date: 2003-03-20

Abstract

PROBLEM TO BE SOLVED: To provide a fixed-form sentence clustering device and method capable of collecting similar fixed-form sentences into one class, clustering them, and outputting them. SOLUTION: This fixed-form sentence clustering device comprises a means 10 for dividing each sentence in a corpus into words; a word n-gram dictionary means 20 for forming word n-grams by concatenating the words obtained by the division, and for extracting and storing those word n-grams that satisfy a prescribed condition; a fixed-form sentence extracting means 30 for forming a set of word n-grams by combining some or all of the stored word n-grams, and for extracting from the corpus, as fixed-form sentences, those sentences whose rate of containing the set is at or above a prescribed threshold; a means 40 for storing the extracted fixed-form sentences and the sets of word n-grams; and a means 50 for calculating the similarity between classes, each class being a set of prescribed sentences included in the corpus, and for merging the two classes having the highest similarity into one.

Description

Detailed Description of the Invention

[0001]

FIELD OF THE INVENTION: The present invention relates to a clustering device and method for collecting mutually similar sentences from among a plurality of fixed-form sentences and classifying them into classes.

[0002]

DESCRIPTION OF THE RELATED ART: Conventionally, when sentences containing a word string in which words are arranged in a fixed pattern (hereinafter, this pattern is called a fixed-form pattern and such a sentence is called a fixed-form sentence) are extracted from a set of sentences, they have been extracted sentence by sentence. As disclosed in, for example, Naoto Kato, "Machine Translation Method for Sentences Containing Fixed-Form Patterns", IPSJ Journal, Vol. 36, No. 9, pp. 2081-2090, 1995, the extraction method consists roughly of the following three steps.

[0003] In step 1, a word n-gram dictionary (n is a natural number of 1 or more) is created. In step 2, for each sentence in the set of sentences, the rate at which the sentence contains word n-grams (hereinafter called the content rate) is calculated. In step 3, sentences whose content rate is equal to or higher than a threshold are extracted as fixed-form sentences.

[0004] Next, the conventional method is described, taking as an example the extraction of fixed-form sentences from economic news. First, in step 1, word n-grams, each a string of n consecutive words, are formed, and a word n-gram dictionary whose entries are word n-grams is created. The dictionary is built by extracting, from the word n-grams formed, those that appear with high frequency in the sentences of the corpus.

[0005] A word n-gram dictionary 1 created from economic news consists of, for example, the word n-grams W1, W2, and W3 shown in FIG. 5(a). Economic news in fact contains many expressions that include word strings arranged in a fixed-form pattern, so a large number of word n-grams are extracted from it. The symbol "/" in FIG. 5(a) marks a word boundary.

[0006] Next, in step 2, the content rate, that is, the rate at which a sentence contains the word n-grams obtained in step 1, is calculated for each sentence in the economic news. FIG. 5(c) shows an example in which content rates are calculated using the word n-gram dictionary 1 shown in FIG. 5(a) and the four example sentences shown in FIG. 5(b).

[0007] Example sentences 1, 2, and 4 shown in FIG. 5(b) each contain one occurrence of each of the word n-grams W1, W2, and W3 in the word n-gram dictionary 1, whereas example sentence 3 contains only W1. In the content rate formula shown in FIG. 5(c), the denominator is the total number of words in the sentence for which the content rate is calculated, and the numerator is the sum of the numbers of words (here, n = 3) making up each of the word n-grams W1, W2, and W3 contained in that sentence.
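The content rate described above can be sketched in Python as follows. This is a minimal illustration, not the formula of FIG. 5(c) itself: the function name `content_rate`, the tuple representation of n-grams, and the toy tokens are assumptions, and the text does not say how repeated or overlapping n-grams are handled, so the sketch simply counts every occurrence.

```python
def content_rate(words, ngram_dictionary):
    """Return the share of a sentence's words covered by dictionary n-grams.

    words: list of tokens for one sentence.
    ngram_dictionary: iterable of n-grams, each a tuple of tokens.
    """
    if not words:
        return 0.0
    covered = 0
    for ngram in ngram_dictionary:
        n = len(ngram)
        # Count every position where this n-gram occurs in the sentence
        # (assumption: repeated occurrences are all counted).
        occurrences = sum(
            1
            for i in range(len(words) - n + 1)
            if tuple(words[i:i + n]) == tuple(ngram)
        )
        # Each occurrence contributes n words to the numerator.
        covered += occurrences * n
    # Denominator: total number of words in the sentence.
    return covered / len(words)


# Hypothetical 8-word sentence and two hypothetical 3-grams.
sentence = ["a", "b", "c", "x", "d", "e", "f", "y"]
dictionary = [("a", "b", "c"), ("d", "e", "f")]
rate = content_rate(sentence, dictionary)  # (3 + 3) / 8
```

With a threshold of 0.5, this toy sentence would qualify as a fixed-form sentence, since six of its eight words belong to dictionary n-grams.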

[0008] In step 3, sentences whose content rate is equal to or higher than a predetermined threshold are extracted. For example, with a threshold of 0.5, example sentences 1, 2, and 4 shown in FIG. 5(b) are extracted as fixed-form sentences, but example sentence 3 is not. In general, fixed-form sentences are used, for example, when linguistic phenomena are analyzed by hand. In such cases, the work is often easier if similar sentences have been collected together.

[0009]

PROBLEMS TO BE SOLVED BY THE INVENTION: With the conventional extraction method, however, fixed-form sentences can be extracted only sentence by sentence, not as sets of similar sentences (hereinafter called classes). Every sentence therefore has to be examined when linguistic phenomena are analyzed, which is inefficient. In practice, tens to hundreds of similar fixed-form sentences may be obtained, and examining all of them by hand is difficult. Similar fixed-form sentences are therefore easier to analyze when they are grouped into one class.

[0010] The present invention has been made to solve this problem, and its object is to provide a fixed-form sentence clustering device and method capable of clustering similar fixed-form sentences into one class and outputting them.

[0011]

MEANS FOR SOLVING THE PROBLEMS: In view of the above, the invention according to claim 1 comprises: a morphological analysis means for dividing each sentence included in a corpus, which is a set of predetermined sentences, into the words constituting that sentence; a word n-gram dictionary means for generating word n-grams (n is a natural number of 1 or more) by concatenating, according to a predetermined rule, the words obtained by dividing each sentence in the corpus, and for extracting and storing those word n-grams that satisfy a predetermined condition; a fixed-form sentence extraction means for generating a set of word n-grams by combining word n-grams of predetermined types from among those stored by the word n-gram dictionary means, and for extracting, as fixed-form sentences, those sentences in the corpus whose rate of containing the set of word n-grams is equal to or higher than a predetermined threshold; a fixed-form sentence storage means for storing the fixed-form sentences extracted by the fixed-form sentence extraction means together with the sets of word n-grams contained in them; and a fixed-form sentence clustering means for calculating the similarity between classes, each class being a set of predetermined sentences included in the corpus, and for merging the two classes with the highest similarity into one class.

[0012] With this configuration, the fixed-form sentence clustering means detects and merges similar fixed-form sentences, so a fixed-form sentence clustering device can be realized that groups similar fixed-form sentences extracted from the corpus into one class and outputs them as a class of fixed-form sentences.

[0013] In the invention according to claim 2, which depends on claim 1, the predetermined condition to be satisfied by the word n-grams extracted by the word n-gram dictionary means is that the occurrence count, that is, the number of times the word n-gram appears in the sentences from which it is extracted, is equal to or greater than a predetermined number. With this configuration, the fixed-form sentence clustering means detects and merges similar fixed-form sentences, so a fixed-form sentence clustering device can be realized that groups similar fixed-form sentences extracted from the corpus into one class and outputs them as a class of fixed-form sentences.

[0014] The invention according to claim 3 comprises: a step of dividing each sentence included in a corpus, which is a set of predetermined sentences, into the words constituting that sentence; a word n-gram dictionary step of generating word n-grams (n is a natural number of 1 or more) by concatenating, according to a predetermined rule, the words obtained by dividing each sentence in the corpus, and of extracting and storing those word n-grams that satisfy a predetermined condition; a fixed-form sentence extraction step of generating a set of word n-grams by combining word n-grams of predetermined types from among those stored in the word n-gram dictionary step, and of extracting, as fixed-form sentences, those sentences in the corpus whose rate of containing the set of word n-grams is equal to or higher than a predetermined threshold; a step of storing the fixed-form sentences extracted in the fixed-form sentence extraction step together with the sets of word n-grams contained in them; and a step of calculating the similarity between classes, each class being a set of predetermined sentences included in the corpus, and merging the two classes with the highest similarity into one class.

[0015] With this configuration, a step of detecting and merging similar fixed-form sentences is provided, so a fixed-form sentence clustering method can be realized that groups similar fixed-form sentences extracted from the corpus into one class and outputs them as a class of fixed-form sentences.

[0016] In the invention according to claim 4, which depends on claim 3, the predetermined condition to be satisfied by the word n-grams extracted in the word n-gram dictionary step is that the occurrence count, that is, the number of times the word n-gram appears in the sentences from which it is extracted, is equal to or greater than a predetermined number. With this configuration, a step of detecting and merging similar fixed-form sentences is provided, so a fixed-form sentence clustering method can be realized that groups similar fixed-form sentences extracted from the corpus into one class and outputs them as a class of fixed-form sentences.

[0017] In the invention according to claim 5, which depends on claim 3 or 4, the predetermined rule used when generating word n-grams in the word n-gram dictionary step is a rule of generating a word n-gram by concatenating n words (n is a natural number of 1 or more) while preserving the order in which the words are arranged in each sentence. With this configuration, a step of detecting and merging similar fixed-form sentences is provided, so a fixed-form sentence clustering method can be realized that groups similar fixed-form sentences extracted from the corpus into one class and outputs them as a class of fixed-form sentences.

[0018]

DESCRIPTION OF THE EMBODIMENTS: A fixed-form sentence clustering device according to a first embodiment of the present invention is described below with reference to the accompanying drawings. FIG. 1 is a block diagram of the fixed-form sentence clustering device according to the first embodiment. In FIG. 1, the fixed-form sentence clustering device 100 comprises a morphological analysis means 10, a word n-gram dictionary means 20 (n is a natural number of 1 or more), a fixed-form sentence extraction means 30, a fixed-form sentence storage means 40, and a fixed-form sentence clustering means 50.

[0019] The morphological analysis means 10 divides each sentence included in a publicly known corpus, input to the fixed-form sentence clustering device 100 from outside, into the words constituting that sentence (hereinafter called morphological analysis). The word n-gram dictionary means 20 generates word n-grams (n is a natural number of 1 or more) by concatenating, according to a predetermined rule, the words obtained by dividing each sentence in the corpus, and extracts and stores those word n-grams that satisfy a predetermined condition.

[0020] The predetermined condition may be that the number of times a word n-gram appears in the sentences from which it is extracted (hereinafter called the occurrence count) is equal to or greater than a predetermined number. The predetermined rule may be to generate a word n-gram by concatenating n words (n is a natural number of 1 or more) while preserving the order in which the words are arranged in each sentence.

[0021] The fixed-form sentence extraction means 30 generates a set of word n-grams by combining word n-grams of predetermined types from among those stored by the word n-gram dictionary means 20, and extracts, as fixed-form sentences, those sentences in the corpus whose rate of containing the set of word n-grams is equal to or higher than a predetermined threshold. The fixed-form sentence storage means 40 stores the fixed-form sentences extracted by the fixed-form sentence extraction means 30 together with the sets of word n-grams contained in them.

[0022] The fixed-form sentence clustering means 50 calculates the similarity between classes, each class being a set of predetermined sentences included in the corpus, and merges the two classes with the highest similarity into one class.

[0023] The first embodiment of the present invention has been described for the case where Japanese is processed, but fixed-form sentences can be extracted and clustered in the same way when English, French, or another language is processed. A fixed-form sentence clustering device for those languages is self-evident from the above description, so its explanation is omitted.

[0024] As described above, the fixed-form sentence clustering device according to the first embodiment of the present invention uses the fixed-form sentence clustering means to detect and merge similar fixed-form sentences, so it can cluster similar fixed-form sentences extracted from the corpus into one class and output them as a class of fixed-form sentences.

[0025] FIG. 2 is a flowchart showing the flow of processing in a fixed-form sentence clustering method according to a second embodiment of the present invention. The device that executes this method has the same configuration as the fixed-form sentence clustering device according to the first embodiment shown in FIG. 1, so its description is omitted.

[0026] In step S201, the morphological analysis means 10 performs morphological analysis on all sentences in a publicly known corpus. Specifically, when a corpus is input from outside, each sentence included in the corpus is divided into its constituent words. In step S202, the word n-gram dictionary means 20 generates word n-grams (n is a natural number of 1 or more) by concatenating, according to a predetermined rule, the words obtained by dividing each sentence in the corpus, and extracts and stores those word n-grams that satisfy a predetermined condition.

[0027] The predetermined condition may be that the number of times a word n-gram appears in the sentences from which it is extracted (hereinafter called the occurrence count) is equal to or greater than a predetermined number. The predetermined rule may be to generate a word n-gram by concatenating n words (n is a natural number of 1 or more) while preserving the order in which the words are arranged in each sentence.

[0028] In step S203, the fixed-form sentence extraction means 30 uses the word n-gram dictionary obtained in step S202 to generate a set of word n-grams by combining word n-grams of predetermined types, and calculates, for each sentence in the corpus, the content rate of that set. In step S204, the fixed-form sentence extraction means 30 extracts from the corpus, as fixed-form sentences, those sentences whose content rate calculated in step S203 is equal to or higher than a predetermined threshold, and builds a set of fixed-form sentences by collecting each fixed-form sentence together with the set of word n-grams that yields it. The extracted fixed-form sentences and the set of fixed-form sentences may be stored in the fixed-form sentence storage means 40.

[0029] In step S205, the fixed-form sentence clustering means 50 takes a predetermined natural number as the number of classes C, takes C + 1 fixed-form sentences out of the set of fixed-form sentences created in step S204, creates for each of them a class consisting of that one sentence, and calculates the similarity between the classes. In step S206, the fixed-form sentence clustering means 50 merges the two classes with the highest similarity among the inter-class similarities obtained in step S205.

[0030] In step S207, the fixed-form sentence clustering means 50 determines whether the set of fixed-form sentences left after the predetermined number of sentences has been taken out in step S205 or in step S208 described below (hereinafter called the remaining set of fixed-form sentences) is empty. If it is determined in step S207 that the remaining set is not empty, then in step S208 the fixed-form sentence clustering means 50 takes one more fixed-form sentence out of the remaining set, creates a new class from it, and calculates the similarity between the classes.

[0031] When the processing in step S208 is finished, processing returns to step S206 and the above processing is repeated. If it is determined in step S207 that the remaining set of fixed-form sentences is empty, the fixed-form sentence clustering means 50 outputs the result of classifying the fixed-form sentences into classes (hereinafter called clustering).
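The loop of steps S205 to S208 can be sketched in Python as below. This is an illustrative reading of the flowchart description, not the patented implementation: the function names, the list-of-lists class representation, the guard for inputs with fewer than C + 1 sentences, and the toy overlap measure `toy_similarity` (a stand-in for the patent's own similarity function) are all assumptions.

```python
from itertools import combinations


def cluster(sentences, num_classes, similarity):
    """Incrementally cluster sentences into num_classes classes (S205 to S208)."""
    if num_classes < 1:
        raise ValueError("num_classes must be a natural number")
    remaining = list(sentences)
    # S205: take C + 1 sentences and make each one a single-sentence class.
    classes = [[s] for s in remaining[:num_classes + 1]]
    remaining = remaining[num_classes + 1:]
    while True:
        # S206: merge the two classes with the highest similarity.
        # Guard (an assumption): skip the merge if there are not yet C + 1 classes.
        if len(classes) > num_classes:
            i, j = max(
                combinations(range(len(classes)), 2),
                key=lambda pair: similarity(classes[pair[0]], classes[pair[1]]),
            )
            classes[i] = classes[i] + classes[j]
            del classes[j]  # safe because i < j
        # S207: finish when no fixed-form sentences remain.
        if not remaining:
            return classes
        # S208: take one more sentence and make it a new class.
        classes.append([remaining.pop(0)])


def toy_similarity(class_a, class_b):
    """Placeholder measure: overlap of the n-gram labels found in two classes."""
    a = set().union(*class_a)
    b = set().union(*class_b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical sentences, each represented only by its set of n-gram labels.
s1 = frozenset({"W1", "W2", "W3"})
s2 = frozenset({"W1", "W2", "W3"})
s5 = frozenset({"W4", "W5"})
s4 = frozenset({"W1", "W2"})
s6 = frozenset({"W4", "W5"})
result = cluster([s1, s2, s5, s4, s6], num_classes=2, similarity=toy_similarity)
```

Because one sentence is added and one merge is performed per iteration, the number of classes returns to C after every pass, which is why the procedure ends with exactly C classes once the remaining set is empty.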

[0032] The processing in each of the above steps is described below for Japanese. In the following description, the corpus input from outside is assumed to contain the six sentences shown as example sentences in FIG. 3(a). In step S201, the morphological analysis means 10 divides each of example sentences 1 to 6 into words. FIG. 3(b) shows the result of morphological analysis of example sentence 1.

[0033] In step S202, the word n-gram dictionary means 20 uses the words obtained by dividing each sentence in the corpus to obtain the word n-grams that satisfy the predetermined condition. FIG. 3(c) shows examples of word n-grams generated for this corpus. The word 3-gram (n = 3) shown in FIG. 3(c) is a word string formed from three consecutive words ("現在", "は", and "、") in their original order.
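The rule of concatenating n consecutive words in their original order can be sketched as a sliding window. The function name `word_ngrams` and the tuple representation are assumptions for illustration; the four tokens are those of the 4-gram "現在/は/、/きのう" quoted in paragraph [0034].

```python
def word_ngrams(words, n):
    """Return all runs of n consecutive words, preserving sentence order."""
    if n < 1:
        raise ValueError("n must be a natural number of 1 or more")
    # Slide a window of width n across the tokenized sentence.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


tokens = ["現在", "は", "、", "きのう"]
trigrams = word_ngrams(tokens, 3)
```

Here `trigrams` contains the 3-gram ("現在", "は", "、") followed by ("は", "、", "きのう"); a sentence shorter than n simply yields no n-grams.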

[0034] The same processing is performed on all sentences in the corpus, and the resulting word n-grams are tallied to calculate their frequencies of appearance, for example as shown in FIG. 3(d). If the predetermined condition is, for example, "word n-grams with a frequency of appearance of 800 or more", then the word 3-gram "現在/は/、" in FIG. 3(d) is registered in the word n-gram dictionary, but the word 4-gram "現在/は/、/きのう" is not.
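A frequency-thresholded dictionary of this kind might be built as in the following sketch. The function name `build_ngram_dictionary`, the `n_values` parameter (the text does not state which values of n are tallied), and the tiny corpus with placeholder tokens "X" and "Y" are assumptions; the threshold of 2 stands in for the 800 used in the text so that the example stays small.

```python
from collections import Counter


def build_ngram_dictionary(tokenized_sentences, n_values, min_count):
    """Tally word n-grams over a corpus and keep those seen at least min_count times."""
    counts = Counter()
    for words in tokenized_sentences:
        for n in n_values:
            # Every run of n consecutive words counts as one occurrence.
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    # The registration condition: occurrence count >= min_count.
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Hypothetical three-sentence corpus; "X" and "Y" are placeholder tokens.
corpus = [
    ["現在", "は", "、", "きのう"],
    ["現在", "は", "、", "X"],
    ["現在", "は", "Y"],
]
dictionary = build_ngram_dictionary(corpus, n_values=(3, 4), min_count=2)
```

As in the text's example, the 3-gram "現在/は/、" meets the threshold and is registered, while the longer 4-gram "現在/は/、/きのう" occurs too rarely and is left out.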

[0035] In the same way, the word n-grams that satisfy this condition are collected to create the word n-gram dictionary. For example, a word n-gram dictionary with entries such as those shown in FIG. 4(a) is obtained. In step S203, the fixed-form sentence extraction means 30 forms a predetermined set of word n-grams using the word n-gram dictionary obtained in step S202 and calculates, for each sentence in the corpus, the content rate, that is, the rate at which the sentence contains this set of word n-grams.

[0036] For each sentence of example 2 shown in FIG. 3(a), the content rate is calculated as shown in FIG. 4(b) using the word n-gram dictionary 2 shown in FIG. 4(a). The explanation below uses example sentence 1 of example 2 shown in FIG. 3(a). As shown in FIG. 4(c), example sentence 1 contains one each of the word n-grams W1, W2, and W3 shown in FIG. 4(a). In the content rate calculation of FIG. 4(b), the denominator is the total number of words in the sentence for which the content rate is calculated, and the numerator is the total number of words making up the word n-grams contained in that sentence.

[0037] Example sentence 1 consists of 14 words in total, and the word n-grams W1, W2, and W3 contained in it consist of 3, 3, and 3 words, respectively. In step S204, the fixed-form sentence extraction means 30 extracts, as fixed-form sentences, those sentences in the corpus whose content rate calculated in step S203 is equal to or higher than the predetermined threshold, and outputs each extracted fixed-form sentence together with the set of word n-grams that yields it.
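As a worked check, the word counts stated for example sentence 1 can be substituted into the content rate definition of paragraph [0036]. The decimal value below is computed here from those counts and is not quoted from FIG. 4(b).

```latex
\[
\text{content rate (example sentence 1)}
  = \frac{3 + 3 + 3}{14}
  = \frac{9}{14}
  \approx 0.64
\]
```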

[0038] In the following, the content rate threshold is taken to be, for example, 0.5. In this case, example sentences 1, 2, 4, 5, and 6 each have a content rate of 0.5 or more and are therefore extracted as fixed-form sentences. Example sentence 3, however, has a content rate of 0.25, which is below 0.5, so it is not extracted. Each extracted fixed-form sentence is paired with the set of word n-grams that yields it, and these pairs are collected to create the set of fixed-form sentences.
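The thresholding and pairing of step S204 can be sketched as a simple filter. The function name and record layout are assumptions for illustration. Only two records are shown, because the text gives usable figures only for example sentence 1 (9 of 14 words, from paragraph [0037]) and example sentence 3 (0.25); the n-gram set of example sentence 3 is not stated and is left empty here.

```python
def extract_fixed_form_sentences(records, threshold=0.5):
    """Keep (sentence, n-gram set) pairs whose content rate meets the threshold."""
    return [
        (sentence, ngram_set)
        for sentence, ngram_set, rate in records
        if rate >= threshold
    ]


records = [
    ("example sentence 1", ("W1", "W2", "W3"), 9 / 14),
    ("example sentence 3", (), 0.25),  # n-gram set not given in the text
]
fixed_form = extract_fixed_form_sentences(records)
```

Only example sentence 1 survives the filter, paired with the n-gram set W1, W2, W3 that the clustering step later uses to compare classes.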

[0039] The set of fixed-form sentences is structured, for example, as shown in FIG. 4(c). FIG. 4(c) shows that example sentence 1 contains the set of word n-grams consisting of W1, W2, and W3. In step S205, the fixed-form sentence clustering means 50 takes a predetermined natural number as the number of classes C, takes C + 1 sentences out of the set of fixed-form sentences created in step S204, creates for each of them a class consisting of that one sentence, and calculates the similarity between the classes.

[0040] In the following, the number of classes C is taken to be 2. In step S205, three sentences, for example example sentences 1, 2, and 5, are taken out of the set of fixed-form sentences generated in step S204 and made the members of class 1 (consisting of example sentence 1), class 2 (consisting of example sentence 2), and class 3 (consisting of example sentence 5), respectively. As the inter-class similarity also calculated in step S205, the function Sim(x_i, x_j) defined by the following equation (1) is used, for example.

[0041]

[Equation 1]

where

[Equation 2]

[0042] s_i and s_j are symbols identifying the i-th class and the j-th class. x_i is the set of word n-grams contained in class s_i and consists of p kinds of word n-grams x_i^1 to x_i^p. f_i is the set of data consisting of the numbers of times f_i^1 to f_i^p that the respective word n-grams x_i^1 to x_i^p making up the set appear in the sentences (hereinafter called occurrence counts).

[0043] Similarly, x_j is the set of word n-grams contained in class s_j and consists of q kinds of word n-grams x_j^1 to x_j^q. f_j is the set of data consisting of the occurrence counts f_j^1 to f_j^q of the respective word n-grams x_j^1 to x_j^q making up the set. The symbols and expression placed under the summation sign Σ in the numerator indicate that, for word n-grams that are the same in the sets x_i and x_j of classes s_i and s_j (for example, x_i^r and x_j^s), their occurrence counts (in this example, f_i^r and f_j^s) are added together. The two classes shown in FIG. 4(d) are used as an example.

【００４４】The similarity between the two classes shown in FIG. 4(d) is obtained by substituting the numbers of occurrences shown in FIG. 4(d) into equation (1) above, which gives

【数３】[Equation 3]

【００４５】In step S206, the fixed-form sentence clustering means 50 merges the two classes that have the highest similarity among the inter-class similarities obtained in step S205. In example 2 above, the similarity is calculated between class 1 (the class consisting of example sentence 1) and class 2 (the class consisting of example sentence 2), between class 1 and class 3 (the class consisting of example sentence 5), and between class 2 and class 3.

【００４６】In example 2 above, the types of word n-grams appearing in class 1 and class 2 and their numbers of occurrences (1 in every case) are the same, so the similarity between class 1 and class 2 is 1.0. The similarity between class 1 and class 3 and the similarity between class 2 and class 3 are both 0, because the types of word n-grams that appear are entirely different. The similarity between class 1 and class 2 is therefore judged to be the maximum, and class 1 and class 2 are merged into one class (the class consisting of example sentence 1 and example sentence 2).
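As a minimal sketch of this similarity calculation, the following Python function assumes that equation (1), which is not reproduced in this text, divides the numerator described above by the total number of occurrences in both classes. That denominator, the function name `similarity`, and the n-gram labels are assumptions made for illustration. Only the numerator (the sum of the numbers of occurrences of the word n-grams common to both classes) and the resulting values 1.0 and 0 come from the description.

```python
def similarity(counts_i, counts_j):
    """Similarity between two classes given as {word n-gram: number of occurrences}.

    Numerator: for every word n-gram found in both classes, add the two
    numbers of occurrences (as the description of equation (1) states).
    Denominator: total occurrences in both classes (ASSUMED normalization).
    """
    shared = counts_i.keys() & counts_j.keys()
    numerator = sum(counts_i[g] + counts_j[g] for g in shared)
    denominator = sum(counts_i.values()) + sum(counts_j.values())
    return numerator / denominator if denominator else 0.0


# Hypothetical classes: the first two share every n-gram, the third shares none.
class_1 = {"ngram_a": 1, "ngram_b": 1}
class_2 = {"ngram_a": 1, "ngram_b": 1}
class_3 = {"ngram_x": 1, "ngram_y": 1}
```

Under this assumed normalization, `similarity(class_1, class_2)` evaluates to 1.0 and `similarity(class_1, class_3)` to 0.0, in line with the values stated for example 2.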

【００４７】The word n-grams of the merged class include, for example, every type of word n-gram contained in the two classes before merging; for word n-grams of the same type, the number of occurrences is the sum of the numbers of occurrences in the two classes. For example, the class s₃ obtained by merging class s₁ and class s₂ shown in FIG. 4(d) is expressed as x₃ = (W27, W441, W78, W84), f₃ = (3, 3, 1, 1, 3).
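The merging rule above can be sketched in Python as follows. The function name `merge_classes` and the numbers of occurrences are hypothetical and are not the values of FIG. 4(d). The sketch relies on `collections.Counter`, whose addition keeps every key from both operands and sums the counts of keys that appear in both.

```python
from collections import Counter


def merge_classes(counts_a, counts_b):
    """Merge two classes given as {word n-gram: number of occurrences}.

    The merged class contains every type of word n-gram from both classes;
    an n-gram present in both gets the sum of its two numbers of occurrences.
    """
    return Counter(counts_a) + Counter(counts_b)


# Hypothetical numbers of occurrences, for illustration only.
s_a = {"W27": 2, "W441": 1}
s_b = {"W27": 1, "W78": 1}
merged = merge_classes(s_a, s_b)
```

Here `merged` holds W27 with 3 occurrences and W441 and W78 with 1 occurrence each.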

【００４８】In step S207, the fixed-form sentence clustering means 50 determines whether the set of fixed-form sentences remaining after the predetermined number of sentences was taken out in step S205 or step S208 is an empty set. If it is determined in step S207 that the set of remaining fixed-form sentences is not empty, then in step S208 the fixed-form sentence clustering means 50 takes one more fixed-form sentence from the remaining set, creates a new class from it, and calculates the similarity between the classes.

【００４９】With the sentences shown in example 2 above, after the processing in step S207, example sentence 4 is taken out, for example, and made the constituent element of class 4 (the class consisting of example sentence 4), and its similarity to each of the C classes that exist after the processing in step S206 is calculated. That is, the similarity between class 1 and class 4 and the similarity between class 3 and class 4 are calculated, after which the process returns to step S206.

【００５０】From step S206 onward, the same processing is repeated until the set of fixed-form sentences becomes empty. If it is determined in step S207 that the set of remaining fixed-form sentences is empty, the fixed-form sentence clustering means 50 outputs the result of clustering the fixed-form sentences. For example, two classes are output: class 1, consisting of example sentences 1, 2, and 4, and class 5, consisting of example sentences 5 and 6.
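The flow of steps S205 to S208 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the clustering means 50: the normalization in `similarity` is assumed because equation (1) is not reproduced in this text, the function names are hypothetical, and the word n-grams assigned to the five sentences are invented so that the result matches the grouping given above.

```python
from collections import Counter
from itertools import combinations


def similarity(a, b):
    # Numerator as described for equation (1); the denominator is ASSUMED.
    total = sum(a.values()) + sum(b.values())
    return sum(a[g] + b[g] for g in a.keys() & b.keys()) / total if total else 0.0


def cluster(sentences, num_classes):
    """sentences: list of (sentence id, {word n-gram: occurrences}).

    Returns a list of (sentence ids, merged n-gram counts), one per class.
    """
    pending = list(sentences)
    classes = []
    # S205: take C + 1 sentences and make each one its own class.
    for _ in range(min(num_classes + 1, len(pending))):
        sid, grams = pending.pop(0)
        classes.append(([sid], Counter(grams)))
    while True:
        if len(classes) > num_classes:
            # S206: merge the two classes with the highest similarity.
            i, j = max(combinations(range(len(classes)), 2),
                       key=lambda p: similarity(classes[p[0]][1], classes[p[1]][1]))
            merged = (classes[i][0] + classes[j][0], classes[i][1] + classes[j][1])
            classes = [c for k, c in enumerate(classes) if k not in (i, j)]
            classes.append(merged)
        # S207: stop when no fixed-form sentences remain.
        if not pending:
            return classes
        # S208: take one more sentence as a new class, then go back to S206.
        sid, grams = pending.pop(0)
        classes.append(([sid], Counter(grams)))


# Hypothetical n-grams for example sentences 1, 2, 5, 4, 6 (in processing order).
data = [(1, {"a": 1, "b": 1}), (2, {"a": 1, "b": 1}), (5, {"x": 1, "y": 1}),
        (4, {"a": 1, "c": 1}), (6, {"x": 1, "z": 1})]
result = cluster(data, num_classes=2)
groups = sorted(sorted(ids) for ids, _ in result)
```

With this invented data and C = 2, `groups` is `[[1, 2, 4], [5, 6]]`, the same grouping as the output example above.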

【００５１】The second embodiment of the present invention has been described for the case where Japanese is processed, but fixed-form sentences can be extracted and clustered in the same way when English, French, or another foreign language is processed. The fixed-form sentence clustering method for those languages is obvious from the above description, so its explanation is omitted.

【００５２】As described above, the fixed-form sentence clustering method according to the second embodiment of the present invention includes a step of detecting and merging similar fixed-form sentences. It can therefore gather similar fixed-form sentences among those extracted from the corpus into one class, cluster them, and output them as classes of fixed-form sentences.

【００５３】[0053]

[Effects of the Invention] As described above, the present invention realizes a fixed-form sentence clustering device and method that can gather similar fixed-form sentences among those extracted from a corpus into one class, cluster them, and output them as classes of fixed-form sentences.

[Brief description of drawings]

FIG. 1 is a block diagram of a fixed-form sentence clustering device according to a first embodiment of the present invention.

FIG. 2 is a flowchart showing the flow of processing in a fixed-form sentence clustering method according to a second embodiment of the present invention.

FIG. 3 is an explanatory diagram of processing in the fixed-form sentence clustering method according to the second embodiment of the present invention.

FIG. 4 is an explanatory diagram of processing in the fixed-form sentence clustering method according to the second embodiment of the present invention.

FIG. 5 is an explanatory diagram of processing in a conventional fixed-form sentence extraction method.

[Explanation of symbols]

10 Morphological analysis means
20 Word n-gram dictionary means
30 Fixed-form sentence extraction means
40 Fixed-form sentence storage means
50 Fixed-form sentence clustering means
100 Fixed-form sentence clustering device

Claims

[Claims]

1. A fixed-form sentence clustering device comprising: morphological analysis means for dividing each sentence included in a corpus, which is a set of predetermined sentences, into the words constituting the sentence; word n-gram dictionary means for generating word n-grams (n is a natural number of 1 or more) by concatenating, according to a predetermined rule, the words obtained by dividing each sentence included in the corpus, and for extracting and storing, from the word n-grams, the word n-grams that satisfy a predetermined condition; fixed-form sentence extraction means for generating a set of word n-grams by combining word n-grams stored in the word n-gram dictionary means, and for extracting, as fixed-form sentences, the sentences included in the corpus in which the proportion accounted for by the set of word n-grams is equal to or greater than a predetermined threshold; fixed-form sentence storage means for storing the fixed-form sentences extracted by the fixed-form sentence extraction means and the sets of word n-grams included in those fixed-form sentences; and fixed-form sentence clustering means for calculating the similarity between classes, each class being a set of predetermined sentences included in the corpus, and merging the two classes having the highest similarity into one class.

2. The fixed-form sentence clustering device according to claim 1, wherein the predetermined condition satisfied by the word n-grams extracted by the word n-gram dictionary means is that the number of occurrences of the word n-gram in the extracted sentences is equal to or greater than a predetermined number.

3. A fixed-form sentence clustering method comprising: a step of dividing each sentence included in a corpus, which is a set of predetermined sentences, into the words constituting the sentence; a word n-gram dictionary step of generating word n-grams (n is a natural number of 1 or more) by concatenating, according to a predetermined rule, the words obtained by dividing each sentence included in the corpus, and of extracting and storing, from the word n-grams, the word n-grams that satisfy a predetermined condition; a fixed-form sentence extraction step of generating a set of word n-grams by combining word n-grams of predetermined types among the word n-grams stored in the word n-gram dictionary step, and of extracting, as fixed-form sentences, the sentences included in the corpus in which the proportion accounted for by the set of word n-grams is equal to or greater than a predetermined threshold; a step of storing the fixed-form sentences extracted in the fixed-form sentence extraction step and the sets of word n-grams included in those fixed-form sentences; and a step of calculating the similarity between classes, each class being a set of predetermined sentences included in the corpus, and merging the two classes having the highest similarity into one class.

4. The fixed-form sentence clustering method according to claim 3, wherein the predetermined condition satisfied by the word n-grams extracted in the word n-gram dictionary step is that the number of occurrences of the word n-gram in the extracted sentences is equal to or greater than a predetermined number.

5. The fixed-form sentence clustering method according to claim 3 or 4, wherein the predetermined rule on which the generation of the word n-grams in the word n-gram dictionary step is based is a rule of generating a word n-gram by concatenating n words (n is a natural number of 1 or more) while maintaining the order in which the words constituting each sentence are arranged.