JP5022319B2 - Text mining apparatus, method, program, and recording medium thereof

Info

Publication number: JP5022319B2
Application number: JP2008200574A
Authority: JP
Inventors: 済央野本; 喜昭野田; 哲郎甘粕
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-08-04
Filing date: 2008-08-04
Publication date: 2012-09-12
Anticipated expiration: 2028-08-04
Also published as: JP2010039671A

Description

The present invention relates to a text mining technique that divides unstructured text data into words and the like and analyzes their appearance frequencies, correlations, and so on with data mining methods in order to obtain knowledge and insights.

When multiple text documents, such as free-form questionnaire responses about products or blog articles, are collected and their subject trends are examined, a word ranking by document frequency (DF) is used to find out which subjects are present and how often each occurs.

The document frequency of a word is the number of documents that contain that word. For example, in FIG. 5, each of the three documents 1, 2, and 3 contains the word "telephone", so the document frequency of "telephone" is 3. Document frequency does not take into account how many times a word occurs within a document. For example, each of documents 1, 2, and 3 contains "telephone" twice, but the document frequency ignores this. If a document contains a word, it contributes 1 to the document frequency regardless of how many times the word occurs in that document.

A word ranking by document frequency is a list of words sorted in descending order of document frequency and, where necessary, assigned ranks. FIG. 6 shows the word ranking for the words and document frequencies listed in the right-hand column of FIG. 5. This ranking shows that the words "telephone", "Yokosuka", and "city hall" each appear in 3 documents, which is the largest number.
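The document frequency and its word ranking described above can be sketched as follows. This is a minimal illustration, assuming each document has already been split into words. The function names and the sample tokens other than the two occurrences of "telephone" per document are hypothetical and are not taken from FIG. 5.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # set(): a word counts once per document
    return df

def rank_by_df(df):
    """Sort words by descending document frequency (ties broken by word)."""
    return sorted(df.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical pre-tokenized documents; "telephone" occurs twice in each,
# but still contributes only 1 per document to the document frequency.
docs = [
    ["telephone", "telephone", "Yokosuka", "city hall", "passport"],
    ["telephone", "telephone", "Yokosuka", "city hall", "opening hours"],
    ["telephone", "telephone", "Yokosuka", "city hall", "reception hours"],
]
ranking = rank_by_df(document_frequency(docs))
```

In this sketch the three greeting words share the top of `ranking` with a document frequency of 3, which is the situation the following paragraphs identify as a problem.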

When the documents contain a formulaic portion that appears in nearly all of them and is unrelated to their subject (hereinafter, the fixed part), the words in the fixed part can occupy the top of the word ranking, so that the true subject trends cannot be grasped. In FIG. 7, the fixed parts are underlined. The underlined greetings in document 1, "Hello, thank you for calling. This is the Yokosuka City Hall citizens' service counter." and "Thank you for calling. Goodbye.", appear in every document and can therefore be regarded as fixed parts. The words "telephone", "Yokosuka", and "city hall" contained in these fixed parts occupy the top of the word ranking in FIG. 6. When words from the fixed part occupy the top of the ranking in this way, it becomes difficult to tell which subjects are common among the documents. The influence of the fixed part therefore needs to be removed from the word ranking.

Two methods for removing the influence of the fixed part are described below.

The first method registers words that frequently appear in the fixed part as stop words and generates the word ranking with those stop words excluded (see, for example, Non-Patent Document 1). An example of stop words is shown in the upper right of FIG. 8. In this example, "telephone", "Yokosuka", "city hall", "citizen", "service counter", "thank you", and "excuse me", which frequently appear in the fixed parts of documents 1, 2, and 3, are registered as stop words. The lower right of FIG. 8 shows an example of the word ranking generated with these stop words excluded. It is the word ranking of FIG. 6 with the stop words removed.
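The stop-word approach of the first method can be sketched as follows. This is only an illustration under assumptions: the stop-word list follows the example given for FIG. 8, while the function name and the tokenized documents are hypothetical.

```python
from collections import Counter

def df_without_stop_words(docs, stop_words):
    """Document frequency computed after discarding registered stop words."""
    df = Counter()
    for words in docs:
        df.update(set(words) - stop_words)
    return df

stop_words = {"telephone", "Yokosuka", "city hall", "citizen",
              "service counter", "thank you", "excuse me"}

# Hypothetical tokenized documents (not the contents of FIG. 8).
docs = [
    ["telephone", "city hall", "passport"],
    ["telephone", "city hall", "opening hours"],
    ["telephone", "service counter", "reception hours"],
]
df = df_without_stop_words(docs, stop_words)
```

Because the exclusion is unconditional, a stop word is dropped even in a document where it is part of the subject; "city hall" and "service counter" never reach `df` here.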

The second method uses a text segmentation technique such as the text tiling method (see, for example, Non-Patent Document 2). As illustrated on the left side of FIG. 9, the text tiling method is used to separate each document into a fixed part and a subject part, and the document frequency is computed from the subject part alone to generate the word ranking. That is, a word contributes 1 to the document frequency when it appears in the subject part of a document. Words that appear only in the fixed part of a document are ignored when computing the document frequency.

The way the text tiling method segments a document is briefly described with reference to FIG. 10. A reference point is set in the document, and a window consisting of a predetermined number of sentences is placed on each side of it. In the example of FIG. 10, the window size is 3, and two windows of three sentences each are placed to the left and right of the reference point. The window to the left of the reference point is called the left window, and the window to the right is called the right window. The left-window appearance frequency, which is the frequency with which each word in the left window appears in the left window, and the right-window appearance frequency, which is the frequency with which each word in the right window appears in the right window, are computed. The similarity between the left-window and right-window appearance frequencies is then computed. The reference point is slid at regular intervals while the change in similarity is observed, and positions where the similarity drops are identified as breaks in the document. This works because, at a position where the topic changes, the left and right windows are expected to have little lexical overlap and hence a low similarity. The fixed part and the subject part are likewise expected to have little lexical overlap with each other, so the document is split into a fixed part and a subject part at a position where the similarity is low.

[Non-Patent Document 1] Kenji Kita, Kazuhiko Tsuda, Masami Shishibori, "Information Retrieval Algorithms" (情報検索アルゴリズム), Kyoritsu Shuppan, pp. 29-30
[Non-Patent Document 2] Marti A. Hearst, "Multi-Paragraph Segmentation of Expository Text", 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16, 1994
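The sliding-window comparison described for the text tiling method with reference to FIG. 10 can be sketched as follows. This is a rough illustration rather than the method of Non-Patent Document 2: the text above only says that a "similarity" is computed, so the use of cosine similarity, the function names, and the sample sentences are all assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def window_similarities(sentences, window=3):
    """Map each usable reference point to its left/right window similarity.

    sentences: list of token lists. A reference point p lies between
    sentence p-1 and sentence p; points closer than `window` sentences
    to either end of the document are skipped.
    """
    sims = {}
    for point in range(window, len(sentences) - window + 1):
        left = Counter(w for s in sentences[point - window:point] for w in s)
        right = Counter(w for s in sentences[point:point + window] for w in s)
        sims[point] = cosine(left, right)
    return sims

# Hypothetical tokenized sentences: two greeting sentences, then the subject.
sentences = [
    ["greeting", "call"], ["greeting", "city"],
    ["passport", "renew"], ["passport", "photo"],
    ["passport", "fee"], ["passport", "renew"],
]
sims = window_similarities(sentences, window=2)
boundary = min(sims, key=sims.get)  # lowest similarity: candidate break
```

With a window of 2, only reference points 2 to 4 can be evaluated; no full pair of windows fits nearer to the start or end of the document.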

With the first method, when a word registered as a stop word also appears in the subject part and forms part of a document's subject, that subject cannot be found. For example, in FIG. 8, "city hall" forms part of the subject "city hall opening hours" in the subject part of document 2, and "service counter" forms part of the subject "service counter reception hours" in the subject part of document 3. However, because "city hall" and "service counter" appear frequently in the fixed parts, they are registered as stop words. In this example, "city hall" and "service counter" are therefore excluded from the word ranking, and the subjects of documents 2 and 3 are missed. The first method thus excludes more subject-bearing words than necessary, and cannot be said to remove the influence of the fixed part appropriately.

With the second method, it is difficult to delimit a fixed part that consists of fewer sentences than the window size. One could try to delimit smaller fixed parts by reducing the window size. However, a smaller window contains fewer words, so the similarity between the left-window and right-window appearance frequencies becomes markedly low and a statistically reliable similarity can no longer be computed. In addition, at the beginning or end of a document the left and right windows cannot be given the same size, so a fixed part near the beginning or end of a document is also difficult to delimit. Because the fixed part is difficult to delimit properly, the second method cannot be said to remove its influence appropriately either.

In view of the above problems, an object of the present invention is to provide a text mining apparatus, method, program, and recording medium thereof that can remove the influence of the fixed part more appropriately.

The text mining apparatus according to claim 1 includes: an analysis target document storage unit that stores a plurality of analysis target documents; a morphological analysis unit that divides each of the analysis target documents read from the analysis target document storage unit into a plurality of words; a word appearance frequency calculation unit that obtains, for all or some of the divided words, the frequency with which each word appears in the analysis target document that was read (hereinafter, the word appearance frequency); a fixed part influence removal unit that, where the fixed part of an analysis target document is a formulaic portion unrelated to the subject of that document and the fixed-part average word appearance frequency of a word is the estimated average frequency with which the word appears in the fixed part of each analysis target document, subtracts the fixed-part average word appearance frequency of a divided word from the word appearance frequency obtained for that word to obtain the word appearance frequency after fixed-part influence removal for that word; and a document frequency calculation unit that, where the document frequency of a word is the number of analysis target documents containing the word, obtains the document frequency of each divided word whose word appearance frequency after fixed-part influence removal is higher than, or equal to or higher than, a predetermined frequency.

Subtracting the fixed-part average word appearance frequency from the word appearance frequency removes the influence of the fixed part more appropriately.

Because the fixed-part average word appearance frequency is the estimated average frequency with which a word appears in the fixed part, subtracting it from the word appearance frequency does not, unlike the first method, exclude more subject-bearing words than necessary. And because, unlike the second method, the text tiling method is not used, the problem of being unable to delimit the fixed part properly does not arise.

An embodiment of the present invention is described below. FIG. 1 shows an example of the functional configuration of a text mining apparatus according to the invention. FIG. 2 shows an example of a text mining method according to the invention.

The analysis target document storage unit 10 stores a plurality of documents to be analyzed. A document to be analyzed is referred to as an analysis target document. An analysis target document is text data such as a transcript of a telephone conversation, a free-form questionnaire response about a product, or a blog article.

Some of the analysis target documents stored in the analysis target document storage unit 10 are selected (step S1). The fixed part is then extracted from each of the selected analysis target documents and stored in the fixed part storage unit 20.

In the present invention, the formulaic portion of an analysis target document that is unrelated to its subject is called the fixed part of that document. When one analysis target document contains several fixed portions, all of them together are called the fixed part of that document. For example, document 1 in FIG. 7 contains a first fixed portion, "Hello, thank you for calling. This is the Yokosuka City Hall citizens' service counter.", and a second fixed portion, "Thank you for calling. Goodbye." The fixed part of document 1 therefore means both of these portions together, not just one of them.

The analysis target documents are preferably selected so that the word appearance frequency of each word in the fixed parts of the selected documents does not differ greatly from the word appearance frequency of each word in the fixed parts of all the analysis target documents stored in the analysis target document storage unit 10. This improves the accuracy of the estimate of the fixed-part average word appearance frequency described later. As long as this condition is met, the selection may be made at random by a computer or the like, or arbitrarily by a person. For example, about 50 analysis target documents are selected.

The fixed part is extracted from an analysis target document by a method that can extract it accurately, such as manual work. In the present invention, the fixed part does not have to be extracted from every analysis target document. It is sufficient to extract it from the selected documents, that is, from a subset of the analysis target documents. A method that is accurate but relatively time-consuming, such as manual work, can therefore be used. Of course, if a method exists by which a computer can extract the fixed part accurately, the extraction may be performed mechanically by a computer using that method.

The fixed part morphological analysis unit 30 reads each fixed part from the fixed part storage unit 20 and divides it into a plurality of words (step S2). The divided words are sent to the fixed part word appearance frequency calculation unit 40.

A well-known morphological analysis method can be used for the division into words. Examples include the longest-match method, which selects the candidate containing the largest number of characters from among the morphological analysis candidates; a method that selects the candidate with the smallest number of phrases; a method that selects the candidate with the longest independent words; a method that performs morphological analysis based on predetermined rules; and a method that selects the statistically most plausible candidate using a hidden Markov model (see, for example, Reference 1).

[Reference 1] Shun Ishizaki, "Natural Language Processing" (自然言語処理), Shokodo, pp. 27-29

The fixed part word appearance frequency calculation unit 40 obtains the frequency with which each divided word appears in the fixed part (step S3). That is, for each fixed part, it counts the number of times each word in that fixed part appears in that fixed part. The obtained frequencies are sent to the fixed part average word appearance frequency calculation unit 50.

For example, in the fixed part of document 1 in FIG. 3, "telephone" appears three times, so its frequency is 3. The frequency of each of the other words appearing in the fixed part of document 1, such as "thank you", "Yokosuka", and "city hall", is obtained in the same way. Likewise, for document 2 the frequency with which each word appears in the fixed part of document 2 is obtained, and for document 3 the frequency with which each word appears in the fixed part of document 3 is obtained. FIG. 3 shows an example in which three analysis target documents, documents 1, 2, and 3, have been selected.

The fixed part average word appearance frequency calculation unit 50 obtains the fixed-part average word appearance frequency by summing the obtained frequencies for each word and dividing the sum by the number of selected analysis target documents (step S4). The fixed-part average word appearance frequency of a word is the estimated average frequency with which that word appears in the fixed part of an analysis target document. "Summing the obtained frequencies for each word" means adding up the frequencies with which the word appears in each fixed part. Because the number of selected analysis target documents equals the number of fixed parts, the division can equally be regarded as division by the number of fixed parts. The obtained fixed-part average word appearance frequency is stored in the fixed part average word appearance frequency storage unit 51.

For example, in FIG. 3, "telephone" appears three times in the fixed part of document 1, once in the fixed part of document 2, and twice in the fixed part of document 3. The fixed part average word appearance frequency calculation unit 50 adds these counts (3 + 1 + 2 = 6) to obtain the number of times "telephone" appears in the fixed parts of all the selected analysis target documents (6). It then divides this sum by 3, the number of selected analysis target documents, to obtain the fixed-part average word appearance frequency of "telephone" (2). The fixed-part average word appearance frequency of other words such as "thank you" and "Yokosuka" is obtained in the same way.
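Steps S3 and S4 can be sketched as follows. This is a minimal illustration, assuming each fixed part has already been split into words. The function name is hypothetical, and only the counts for "telephone" (3, 1, and 2) come from the example of FIG. 3; the counts for "thank you" are made up.

```python
from collections import Counter

def fixed_part_average_frequency(fixed_parts):
    """Average per-document frequency of each word in the fixed parts.

    fixed_parts: one token list per selected document, holding all
    fixed portions of that document together.
    """
    totals = Counter()
    for words in fixed_parts:
        totals.update(words)           # step S3, summed over the documents
    n = len(fixed_parts)               # number of selected documents
    return {word: count / n for word, count in totals.items()}  # step S4

fixed_parts = [
    ["telephone"] * 3 + ["thank you"] * 2,   # document 1
    ["telephone"] * 1 + ["thank you"] * 1,   # document 2
    ["telephone"] * 2 + ["thank you"] * 2,   # document 3
]
avg = fixed_part_average_frequency(fixed_parts)
# avg["telephone"] is (3 + 1 + 2) / 3 = 2.0
```

A word that appears in only some of the fixed parts is still divided by the full number of selected documents, which matches the division by "the number of selected analysis target documents" in the text.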

In this example, the fixed-part average word appearance frequency is thus estimated from a selected subset of the analysis target documents rather than from all of them. The fixed-part average word appearance frequency may be computed in advance, or in parallel with the processing of steps S5 to S8 described below.

The morphological analysis unit 60 divides each of the analysis target documents read from the analysis target document storage unit 10 into a plurality of words (step S5). The divided words are sent to the word appearance frequency calculation unit 70. As with the fixed part morphological analysis unit 30, the division into words can be performed with a well-known morphological analysis method.

The word appearance frequency calculation unit 70 obtains the frequency with which each word divided by the morphological analysis unit 60 appears in the analysis target document that was read (step S6). That is, for each analysis target document, it counts the number of times each word in that document appears in that document. This frequency is called the word appearance frequency (term frequency, TF). In other words, the word appearance frequency for a given word and a given analysis target document is the number of times the word appears in that document. The obtained word appearance frequencies are sent to the fixed part influence removal unit 80.

For example, in FIG. 4, "telephone" appears twice in the document, so its word appearance frequency is 2, and "Yokosuka" appears once, so its word appearance frequency is 1.

The fixed part influence removal unit 80 subtracts the fixed-part average word appearance frequency of each divided word from the word appearance frequency obtained for that word, and thereby obtains the word appearance frequency after fixed-part influence removal for that word (step S7). That is, for each divided word, the result of subtracting the word's fixed-part average word appearance frequency from its word appearance frequency is taken as its word appearance frequency after fixed-part influence removal. The obtained values are sent to the document frequency calculation unit 90.
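Step S7 is a per-word subtraction, which can be sketched as follows. The function name and the treatment of words that have no entry in the fixed-part averages (their average is taken as 0) are assumptions. Of the sample values, the word appearance frequencies of "telephone" (2) and "Yokosuka" (1) follow the example of FIG. 4 and the average for "telephone" (2) follows FIG. 3; the remaining numbers are made up.

```python
def remove_fixed_part_influence(tf, avg_fixed_tf):
    """Subtract each word's fixed-part average frequency from its TF.

    tf:           word -> word appearance frequency in one document
    avg_fixed_tf: word -> fixed-part average word appearance frequency
    Words absent from avg_fixed_tf are assumed to have an average of 0.
    The result is not clamped, so it can be negative.
    """
    return {word: freq - avg_fixed_tf.get(word, 0.0)
            for word, freq in tf.items()}

tf = {"telephone": 2, "Yokosuka": 1, "passport": 3}
avg_fixed_tf = {"telephone": 2.0, "Yokosuka": 1.0}
adjusted = remove_fixed_part_influence(tf, avg_fixed_tf)
```

Here the greeting words drop to 0 while "passport" keeps its full count. A word such as "city hall" that occurs once in the greeting and once more in the subject would keep a positive value, which is how the approach differs from a stop-word list.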

When the fixed-part average word appearance frequency has been computed in advance, the fixed part influence removal unit 80 reads the corresponding value from the fixed part average word appearance frequency storage unit 51 as needed. When the fixed-part average word appearance frequency is computed in parallel with the processing of steps S5 to S8, the value computed by the fixed part average word appearance frequency calculation unit 50 may be sent directly to the fixed part influence removal unit 80.

The document frequency calculation unit 90 obtains the document frequency of each divided word whose word appearance frequency after fixed-part influence removal is higher than a predetermined frequency, where the document frequency of a word is the number of analysis target documents containing the word (step S8). Alternatively, the document frequency may be obtained for each divided word whose word appearance frequency after fixed-part influence removal is equal to or higher than the predetermined frequency.
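Step S8 can be sketched as follows. This is one possible reading, stated as an assumption: a document counts toward a word's document frequency only when the word's frequency after fixed-part influence removal in that document passes the threshold. The function name, the `inclusive` flag, the threshold of 0, and the sample values are all hypothetical.

```python
from collections import Counter

def thresholded_document_frequency(adjusted_tfs, threshold=0.0, inclusive=False):
    """Document frequency restricted to words that pass the threshold.

    adjusted_tfs: one dict per document, word -> frequency after
                  fixed-part influence removal.
    inclusive:    False -> "higher than" the threshold,
                  True  -> "equal to or higher than" the threshold.
    """
    df = Counter()
    for adjusted in adjusted_tfs:
        for word, freq in adjusted.items():
            if freq > threshold or (inclusive and freq == threshold):
                df[word] += 1
    return df

adjusted_tfs = [
    {"telephone": 0.0, "city hall": 0.0, "passport": 3.0},
    {"telephone": 0.0, "city hall": 1.0},
    {"telephone": -1.0, "service counter": 1.0},
]
df = thresholded_document_frequency(adjusted_tfs)
```

With the strict comparison, "telephone" receives no document frequency at all, while "city hall" and "service counter" are counted in the documents where they occur beyond the greeting.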

Subtracting the fixed-part average word appearance frequency from the word appearance frequency in this way removes the influence of the fixed part more appropriately. Because the fixed-part average word appearance frequency is the estimated average frequency with which a word appears in the fixed part, subtracting it from the word appearance frequency does not, unlike the first method described in the background art, exclude more subject-bearing words than necessary. And because, unlike the second method described in the background art, the text tiling method is not used, the problem of being unable to delimit the fixed part properly does not arise.

[Modifications, etc.]

When the fixed part morphological analysis unit 30 divides a fixed part and outputs words, it may, depending on the division method, output words that do not constitute a subject on their own, such as particles and conjunctions. In that case, the fixed part word appearance frequency calculation unit 40 may obtain the appearance frequency only for words that constitute a subject on their own, such as nouns and verbs, and need not obtain it for words such as particles and conjunctions. In other words, the fixed part word appearance frequency calculation unit 40 may obtain the appearance frequency for some, rather than all, of the words divided by the fixed part morphological analysis unit 30.

Similarly, when the morphological analysis unit 60 divides an analysis target document and outputs words, it may, depending on the division method, output words that do not constitute a subject on their own, such as particles and conjunctions. In that case, the word appearance frequency calculation unit 70 may obtain the word appearance frequency only for words that constitute a subject on their own, such as nouns and verbs, and need not obtain it for words such as particles and conjunctions. In other words, the word appearance frequency calculation unit 70 may obtain the word appearance frequency for some, rather than all, of the words divided by the morphological analysis unit 60.

As indicated by the dotted line in FIG. 1, a word sorting unit 100 that outputs words in descending order of document frequency may be provided. This produces a word ranking and makes subject trends easier to grasp. The word sorting unit 100 may assign ranks to the words according to their document frequency. The word sorting unit 100 need not output all of the sorted words and may output only some of them. For example, it may output only the words whose document frequency is equal to or higher than a predetermined document frequency, or only the words ranked at or above a predetermined rank. This makes subject trends even easier to grasp.
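The optional word sorting unit can be sketched as follows. The function name, the parameter names `min_df` and `top_n`, the alphabetical tie-breaking, and the sample document frequencies are assumptions made for illustration.

```python
def rank_words(df, min_df=None, top_n=None):
    """Return (rank, word, document frequency) in descending DF order.

    min_df: keep only words whose document frequency is at least min_df
    top_n:  keep only the first top_n words of the ranking
    Ties are broken alphabetically and receive consecutive ranks.
    """
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    if min_df is not None:
        ranked = [(word, n) for word, n in ranked if n >= min_df]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [(rank, word, n) for rank, (word, n) in enumerate(ranked, start=1)]

df = {"passport": 5, "city hall": 4, "service counter": 4, "fee": 1}
by_frequency = rank_words(df, min_df=4)
by_rank = rank_words(df, top_n=2)
```

Both cutoffs shorten the output without changing the order, so they can also be combined.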

In the above example, the fixed part average word appearance frequency is calculated from some of the analysis target documents. However, it may instead be calculated in the same manner from documents that are not analysis target documents. For example, if documents were analysis targets in the past but are not this time, and the word appearance frequency of each word in the fixed part has not changed significantly, the fixed part average word appearance frequency may be calculated from those past documents. Likewise, the fixed part average word appearance frequency may be calculated in the same manner from a set of documents that includes both analysis target documents and documents that are not analysis targets.
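The calculation and use of the fixed part average word appearance frequency might be sketched as follows. The averaging (sum per word, divided by the number of sample documents) and the subtraction follow the claims. The function names, the default threshold value of 0.5, and the reading that a document counts toward a word's document frequency only when the adjusted frequency in that document exceeds the threshold are assumptions made for this illustration.

```python
from collections import Counter


def fixed_part_average(fixed_part_word_lists):
    """Average per-document count of each word over sample fixed parts.

    fixed_part_word_lists: one list of words per sample document's fixed part.
    The samples may come from the analysis targets, from past documents,
    or from a mixture of both.
    """
    totals = Counter()
    for words in fixed_part_word_lists:
        totals.update(words)
    n = len(fixed_part_word_lists)
    return {word: count / n for word, count in totals.items()}


def document_frequency(documents, fixed_avg, threshold=0.5):
    """Count, per word, the documents in which the word still stands out.

    For each document, the fixed part average is subtracted from the word's
    count; the document is counted only if the result exceeds the threshold
    (hypothetical value).
    """
    df = Counter()
    for words in documents:
        for word, count in Counter(words).items():
            if count - fixed_avg.get(word, 0.0) > threshold:
                df[word] += 1
    return df


# Toy data: "thank" and "call" are typical of the fixed parts.
avg = fixed_part_average([["thank", "call"], ["thank", "call", "call"]])
docs = [
    ["thank", "call", "phone", "phone"],
    ["thank", "call", "call", "call", "bill"],
    ["phone", "thank"],
]
df = document_frequency(docs, avg)
```

In this toy data, "thank" appears in every document but never exceeds its fixed part average, so it does not contribute to any document frequency.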

When the above configuration is implemented by a computer, the processing performed by the functions of each unit of the text mining device is described by a program. Executing this program on the computer implements the functions of these units on the computer.

That is, as the CPU sequentially reads and executes the program, the functions of the fixed part morphological analysis unit 30, the fixed part word appearance frequency calculation unit 40, the fixed part average word appearance frequency calculation unit 50, the morphological analysis unit 60, the word appearance frequency calculation unit 70, the fixed part influence removal unit 80, the document frequency calculation unit 90, and the word rearrangement unit 100 are each implemented. In addition, the auxiliary storage device or the memory functions as the analysis target document storage unit 10, the fixed part storage unit 20, and the fixed part average word appearance frequency storage unit 51.

The CPU, functioning as each unit of the text mining device, processes data read from the memory or the auxiliary storage device and stores the processed data in the memory or the auxiliary storage device. That is, data is exchanged between the units of the text mining device via the memory or the auxiliary storage device.

The program describing this processing can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

As an execution form different from the embodiment described above, the computer may read the program directly from the portable recording medium and execute processing according to the program. Furthermore, each time the program is transferred to this computer from the server computer, the computer may sequentially execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer. However, at least part of this processing may be implemented in hardware.
The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, according to the processing capability of the device that executes them or as necessary.
Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

FIG. 1 is a diagram showing an example of the functional configuration of the text mining device.
FIG. 2 is a diagram showing an example of the text mining method.
FIG. 3 is a diagram for explaining the fixed part average word appearance frequency.
FIG. 4 is a diagram for explaining the word appearance frequency.
FIG. 5 is a diagram for explaining the document frequency.
FIG. 6 is a diagram for explaining the word ranking.
FIG. 7 is a diagram for explaining the fixed part and the subject part.
FIG. 8 is a diagram for explaining the first conventional method of removing the influence of the fixed part.
FIG. 9 is a diagram for explaining the first conventional method of removing the influence of the fixed part.
FIG. 10 is a diagram for explaining the text tiling method.

Explanation of symbols

10 Analysis target document storage unit
20 Fixed part storage unit
30 Fixed part morphological analysis unit
40 Fixed part word appearance frequency calculation unit
50 Fixed part average word appearance frequency calculation unit
51 Fixed part average word appearance frequency storage unit
60 Morphological analysis unit
70 Word appearance frequency calculation unit
80 Fixed part influence removal unit
90 Document frequency calculation unit
100 Word rearrangement unit

Claims

A text mining device including:
an analysis target document storage unit that stores a plurality of analysis target documents;
a morphological analysis unit that divides each of the plurality of analysis target documents read from the analysis target document storage unit into a plurality of words;
a word appearance frequency calculation unit that obtains the frequency at which each of all or some of the plurality of divided words appears in the read analysis target document (hereinafter referred to as the word appearance frequency);
a fixed part influence removal unit that, where the fixed part of an analysis target document is a formulaic portion unrelated to the subject of that document, and the fixed part average word appearance frequency of a word is the average frequency at which the word is presumed to appear in the fixed part of each of the plurality of analysis target documents, subtracts the fixed part average word appearance frequency of a divided word from the word appearance frequency obtained for that word to obtain the word appearance frequency after fixed part influence removal for that word; and
a document frequency calculation unit that, where the document frequency of a word is the number of the plurality of analysis target documents that include the word, obtains the document frequency of each word, among the plurality of divided words, whose word appearance frequency after fixed part influence removal is greater than, or greater than or equal to, a predetermined frequency.

The text mining device according to claim 1, further including:
a fixed part morphological analysis unit that divides the fixed part of each of some analysis target documents selected from the plurality of analysis target documents into a plurality of words;
a fixed part word appearance frequency calculation unit that obtains the frequency at which each of all or some of the plurality of divided words appears in that fixed part; and
a fixed part average word appearance frequency calculation unit that adds up the obtained frequencies for each word and then divides the sum by the number of the selected analysis target documents to obtain the fixed part average word appearance frequency of that word.

The text mining device according to claim 1 or 2, further including:
a word rearrangement unit that outputs words in descending order of the document frequency.

A text mining method in which an analysis target document storage unit stores a plurality of analysis target documents, the method including:
a morphological analysis step of dividing each of the plurality of analysis target documents read from the analysis target document storage unit into a plurality of words;
a word appearance frequency calculation step of obtaining the frequency at which each of all or some of the plurality of divided words appears in the read analysis target document (hereinafter referred to as the word appearance frequency);
a fixed part influence removal step of, where the fixed part of an analysis target document is a formulaic portion unrelated to the subject of that document, and the fixed part average word appearance frequency of a word is the average frequency at which the word is presumed to appear in the fixed part of each of the plurality of analysis target documents, subtracting the fixed part average word appearance frequency of a divided word from the word appearance frequency obtained for that word to obtain the word appearance frequency after fixed part influence removal for that word; and
a document frequency calculation step of, where the document frequency of a word is the number of the plurality of analysis target documents that include the word, obtaining the document frequency of each word, among the plurality of divided words, whose word appearance frequency after fixed part influence removal is greater than, or greater than or equal to, a predetermined frequency.

A program for causing a computer to function as each unit of the text mining device according to any one of claims 1 to 3.

A computer-readable recording medium storing the program according to claim 5.