JP7221527B2

JP7221527B2 - Analysis method, analysis device and analysis program

Info

Publication number: JP7221527B2
Application number: JP2019084332A
Authority: JP
Inventors: 耕爾野守
Original assignee: 株式会社アナリティクスデザインラボ
Priority date: 2019-04-25
Filing date: 2019-04-25
Publication date: 2023-02-14
Anticipated expiration: 2039-04-25
Also published as: JP2020181390A

Description

本発明は、テキストデータから個性的なトピックを抽出することができる分析方法、分析装置及び分析プログラムに関する。 The present invention relates to an analysis method, an analysis device, and an analysis program capable of extracting unique topics from text data.

昨今では、テキストの電子化の急増とテキストマイニングツールの普及に伴い、テキストデータからいかに有用な知識を抽出するかということが課題となっている。 Recently, with the rapid increase in digitization of texts and the spread of text mining tools, how to extract useful knowledge from text data has become an issue.

本発明者は、テキストデータから、単語そのものではなく文章のトピックを抽出する手法として知られるＰＬＳＡを応用した分析方法を発明した（特許文献１参照）。ＰＬＳＡは、元々文章分類のために開発された手法で、文章とそこに出現する単語の間には観測できない潜在的な意味クラスがあることを想定し、文章と単語の共通のトピックとなるような特徴を見つける手法である。 The present inventor previously invented an analysis method that applies PLSA, a technique known for extracting the topics of sentences, rather than the words themselves, from text data (see Patent Document 1). PLSA was originally developed for sentence classification. It assumes that unobservable latent semantic classes exist between sentences and the words that appear in them, and it finds features that serve as topics common to both the sentences and the words.

このような分析方法においても、テキストデータからマイニングを行い、潜在的なトピックを抽出することはできる。しかしながら、ＰＬＳＡは、元々のテキストデータに高い頻度で発生する単語を元にトピックを抽出する傾向にあり、得られたトピックは典型的で目新しいものではない場合がある。 Such an analysis method can also mine text data and extract latent topics. However, PLSA tends to extract topics based on words that occur frequently in the original text data, so the resulting topics may be typical and lack novelty.

特開２０１６－０５１２２０号公報JP 2016-051220 A

本発明は、上記事情に鑑みてなされたものであり、テキストデータに低い頻度で発生するような単語であっても、当該単語に基づく個性的なトピックを抽出することができる分析方法、分析装置及び分析プログラムを提供することを目的とする。 The present invention has been made in view of the above circumstances, and its object is to provide an analysis method, an analysis device, and an analysis program capable of extracting unique topics based on words even when those words occur infrequently in the text data.

上記課題を解決する本発明の第１の態様は、分析装置が実行するテキストデータの分析方法であって、前記テキストデータに含まれている第１語群に属する語及び第２語群に属する語の組み合わせの頻度に基づく要素からなる共起行列を作成する共起行列作成ステップと、前記共起行列を入力とし、第１語群に属する語及び第２語群に属する語で構成される複数のトピックを抽出する潜在意味解析法を実行することにより、各トピックを条件とした第１語群に属する語の第１条件付確率、及び各トピックを条件とした第２語群に属する語の第２条件付確率を求めるトピック抽出ステップと、前記第１条件付確率及び第１語群の出現頻度、並びに前記第２条件付確率及び第２語群の出現頻度に基づいて、各トピックを条件とした各前記テキストデータの条件付確率を計算し、前記条件付確率に基づいて各前記テキストデータに対する各トピックのスコアを求めるスコア計算ステップと、を備え、前記共起行列作成ステップは、前記テキストデータから前記第１語群に属する語及び前記第２語群に属する語の組み合わせの頻度を要素とする実測共起行列を作成し、前記テキストデータから前記第１語群に属する語及び前記第２語群に属する語の組み合わせの期待頻度を要素とする期待共起行列を作成し、前記期待共起行列の各要素に対する前記実測共起行列の各要素の差分あるいは比率を各要素とする前記共起行列を作成することを特徴とする分析方法にある。 A first aspect of the present invention for solving the above problems is a text data analysis method executed by an analysis device, comprising: a co-occurrence matrix creation step of creating a co-occurrence matrix composed of elements based on the frequencies of combinations of words belonging to a first word group and words belonging to a second word group contained in the text data; a topic extraction step of obtaining a first conditional probability of the words belonging to the first word group conditioned on each topic, and a second conditional probability of the words belonging to the second word group conditioned on each topic, by executing, with the co-occurrence matrix as input, a latent semantic analysis method that extracts a plurality of topics composed of words belonging to the first word group and words belonging to the second word group; and a score calculation step of calculating a conditional probability of each piece of the text data conditioned on each topic, based on the first conditional probability and the appearance frequency of the first word group and on the second conditional probability and the appearance frequency of the second word group, and obtaining a score of each topic for each piece of the text data based on that conditional probability, wherein the co-occurrence matrix creation step creates, from the text data, a measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group and words belonging to the second word group, creates, from the text data, an expected co-occurrence matrix whose elements are the expected frequencies of those combinations, and creates the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix relative to the corresponding elements of the expected co-occurrence matrix.

本発明の第２の態様は、第１の態様に記載の分析方法であって、前記共起行列作成ステップは、前記テキストデータから文章を抽出し、各文章に含まれている前記第１語群に属する語及び前記第２語群に属する語の組み合わせの頻度を要素とする前記実測共起行列を作成し、前記テキストデータから文章を抽出し、各文章に含まれている前記第１語群に属する語及び前記第２語群に属する語の組み合わせの期待頻度を要素とする前記期待共起行列を作成し、前記期待共起行列の各要素に対する前記実測共起行列の各要素の差分あるいは比率を各要素とする前記共起行列を作成し、前記スコア計算ステップは、前記第１条件付確率及び第１語群の出現頻度、並びに前記第２条件付確率及び第２語群の出現頻度に基づいて、各トピックを条件とした各文章の条件付確率を計算し、前記条件付確率に基づいて各前記テキストデータに対する各トピックのスコアを求めることを特徴とする分析方法にある。 A second aspect of the present invention is the analysis method according to the first aspect, wherein the co-occurrence matrix creation step extracts sentences from the text data and creates the measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group and words belonging to the second word group contained in each sentence, extracts sentences from the text data and creates the expected co-occurrence matrix whose elements are the expected frequencies of combinations of words belonging to the first word group and words belonging to the second word group contained in each sentence, and creates the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix relative to the corresponding elements of the expected co-occurrence matrix, and wherein the score calculation step calculates a conditional probability of each sentence conditioned on each topic, based on the first conditional probability and the appearance frequency of the first word group and on the second conditional probability and the appearance frequency of the second word group, and obtains a score of each topic for each piece of the text data based on that conditional probability.

本発明の第３の態様は、第１の態様に記載の分析方法であって、前記テキストデータは、カテゴリに分類されたテキスト部を含み、前記共起行列作成ステップは、第１のカテゴリに分類された前記テキスト部から抽出した前記第１語群に属する語、及び第２のカテゴリに分類された前記テキスト部から抽出した前記第２語群に属する語の組み合わせの頻度を要素とする前記実測共起行列を作成し、第１のカテゴリに分類された前記テキスト部から抽出した前記第１語群に属する語、及び第２のカテゴリに分類された前記テキスト部から抽出した前記第２語群に属する語の組み合わせの期待頻度を要素とする前記期待共起行列を作成し、前記期待共起行列の各要素に対する前記実測共起行列の各要素の差分あるいは比率を各要素とする前記共起行列を作成することを特徴とする分析方法にある。 A third aspect of the present invention is the analysis method according to the first aspect, wherein the text data includes text parts classified into categories, and the co-occurrence matrix creation step creates the measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group extracted from the text parts classified into a first category and words belonging to the second word group extracted from the text parts classified into a second category, creates the expected co-occurrence matrix whose elements are the expected frequencies of those combinations, and creates the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix relative to the corresponding elements of the expected co-occurrence matrix.

本発明の第４の態様は、テキストデータの分析装置であって、前記テキストデータに含まれている第１語群に属する語及び第２語群に属する語の組み合わせの頻度に基づく要素からなる共起行列を作成する共起行列作成手段と、前記共起行列を入力とし、第１語群に属する語及び第２語群に属する語で構成される複数のトピックを抽出する潜在意味解析法を実行することにより、各トピックを条件とした第１語群に属する語の第１条件付確率、及び各トピックを条件とした第２語群に属する語の第２条件付確率を求めるトピック抽出手段と、前記第１条件付確率及び第１語群の出現頻度、並びに前記第２条件付確率及び第２語群の出現頻度に基づいて、各トピックを条件とした各前記テキストデータの条件付確率を計算し、前記条件付確率に基づいて各前記テキストデータに対する各トピックのスコアを求めるスコア計算手段と、を備え、前記共起行列作成手段は、前記テキストデータから前記第１語群に属する語及び前記第２語群に属する語の組み合わせの頻度を要素とする実測共起行列を作成し、前記テキストデータから前記第１語群に属する語及び前記第２語群に属する語の組み合わせの期待頻度を要素とする期待共起行列を作成し、前記期待共起行列の各要素に対する前記実測共起行列の各要素の差分あるいは比率を各要素とする前記共起行列を作成することを特徴とする分析装置にある。 A fourth aspect of the present invention is a text data analysis device comprising: co-occurrence matrix creation means for creating a co-occurrence matrix composed of elements based on the frequencies of combinations of words belonging to a first word group and words belonging to a second word group contained in the text data; topic extraction means for obtaining a first conditional probability of the words belonging to the first word group conditioned on each topic, and a second conditional probability of the words belonging to the second word group conditioned on each topic, by executing, with the co-occurrence matrix as input, a latent semantic analysis method that extracts a plurality of topics composed of words belonging to the first word group and words belonging to the second word group; and score calculation means for calculating a conditional probability of each piece of the text data conditioned on each topic, based on the first conditional probability and the appearance frequency of the first word group and on the second conditional probability and the appearance frequency of the second word group, and obtaining a score of each topic for each piece of the text data based on that conditional probability, wherein the co-occurrence matrix creation means creates, from the text data, a measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group and words belonging to the second word group, creates, from the text data, an expected co-occurrence matrix whose elements are the expected frequencies of those combinations, and creates the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix relative to the corresponding elements of the expected co-occurrence matrix.

本発明の第５の態様は、テキストデータをコンピュータに分析させる分析プログラムであって、前記コンピュータを、前記テキストデータに含まれている第１語群に属する語及び第２語群に属する語の組み合わせの頻度に基づく要素からなる共起行列を作成する共起行列作成手段と、前記共起行列を入力とし、第１語群に属する語及び第２語群に属する語で構成される複数のトピックを抽出する潜在意味解析法を実行することにより、各トピックを条件とした第１語群に属する語の第１条件付確率、及び各トピックを条件とした第２語群に属する語の第２条件付確率を求めるトピック抽出手段と、前記第１条件付確率及び第１語群の出現頻度、並びに前記第２条件付確率及び第２語群の出現頻度に基づいて、各トピックを条件とした各前記テキストデータの条件付確率を計算し、前記条件付確率に基づいて各前記テキストデータに対する各トピックのスコアを求めるスコア計算手段として機能させ、前記共起行列作成手段は、前記テキストデータから前記第１語群に属する語及び前記第２語群に属する語の組み合わせの頻度を要素とする実測共起行列を作成し、前記テキストデータから前記第１語群に属する語及び前記第２語群に属する語の組み合わせの期待頻度を要素とする期待共起行列を作成し、前記期待共起行列の各要素に対する前記実測共起行列の各要素の差分あるいは比率を各要素とする前記共起行列を作成することを特徴とする分析プログラムにある。 A fifth aspect of the present invention is an analysis program for causing a computer to analyze text data, the program causing the computer to function as: co-occurrence matrix creation means for creating a co-occurrence matrix composed of elements based on the frequencies of combinations of words belonging to a first word group and words belonging to a second word group contained in the text data; topic extraction means for obtaining a first conditional probability of the words belonging to the first word group conditioned on each topic, and a second conditional probability of the words belonging to the second word group conditioned on each topic, by executing, with the co-occurrence matrix as input, a latent semantic analysis method that extracts a plurality of topics composed of words belonging to the first word group and words belonging to the second word group; and score calculation means for calculating a conditional probability of each piece of the text data conditioned on each topic, based on the first conditional probability and the appearance frequency of the first word group and on the second conditional probability and the appearance frequency of the second word group, and obtaining a score of each topic for each piece of the text data based on that conditional probability, wherein the co-occurrence matrix creation means creates, from the text data, a measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group and words belonging to the second word group, creates, from the text data, an expected co-occurrence matrix whose elements are the expected frequencies of those combinations, and creates the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix relative to the corresponding elements of the expected co-occurrence matrix.

本発明によれば、テキストデータに低い頻度で発生するような単語であっても、当該単語に基づく個性的なトピックを抽出することができる分析方法、分析装置及び分析プログラムが提供される。 According to the present invention, there are provided an analysis method, an analysis device, and an analysis program capable of extracting unique topics based on even words that occur infrequently in text data.

本実施形態に係る分析方法を実装した分析プログラムを実行する分析装置の機能ブロック図である。 FIG. 1 is a functional block diagram of an analysis device that executes an analysis program implementing the analysis method according to this embodiment.
ＰＬＳＡの概念図である。 FIG. 2 is a conceptual diagram of PLSA.
分析装置での処理を示すフローチャートである。 The remaining figure is a flowchart showing processing in the analysis device.

以下、本発明を実施するための形態について説明する。なお、実施形態の説明は例示であり、本発明は以下の説明に限定されない。 Embodiments for carrying out the present invention are described below. The description of the embodiments is illustrative, and the present invention is not limited to the following description.

〈実施形態１〉 <Embodiment 1>
図１は、本実施形態に係る分析方法を実行する分析プログラムを実行する分析装置の機能ブロック図である。分析プログラム１０は、分析装置１にインストールされて実行されるものである。分析装置１は、特に図示しないが、ＣＰＵ、ＲＡＭ、ハードディスク、入出力装置、通信手段等を備えた一般的なコンピュータである。 FIG. 1 is a functional block diagram of an analysis device that runs an analysis program implementing the analysis method according to this embodiment. The analysis program 10 is installed on and executed by the analysis device 1. Although not specifically illustrated, the analysis device 1 is a general-purpose computer equipped with a CPU, RAM, a hard disk, input/output devices, communication means, and the like.

ハードディスクには、分析装置１のＣＰＵ等を制御するためのオペレーティングシステムがインストールされている。このオペレーティングシステムにより、ハードディスクにインストールされた分析プログラム１０がＲＡＭに読み込まれ、ＲＡＭに読み込まれた分析プログラムがＣＰＵにより実行される。 An operating system for controlling the CPU and the like of the analyzer 1 is installed in the hard disk. The analysis program 10 installed in the hard disk is read into the RAM by this operating system, and the analysis program read into the RAM is executed by the CPU.

このような分析プログラムは、テキストデータを処理対象とする。テキストデータとは、文章を符号化したデータである。本発明でいう文章とは、テキストデータに含まれる一文である。テキストデータの符号化の方式（文字コード）は特に限定はなく、符号化により表される言語の種別も問わない。本実施形態では、テキストデータは日本語の文からなり、ＵＴＦ－８などの文字コードで表現されている。 Such analysis programs process text data. Text data is data obtained by encoding sentences. A sentence in the present invention is a sentence included in text data. The encoding method (character code) of the text data is not particularly limited, and the type of language represented by the encoding does not matter. In this embodiment, the text data consists of Japanese sentences and is represented by a character code such as UTF-8.

本実施形態では、テキストデータとして、日本の特許出願に添付された要約書の文章を用いる。具体的には、要約書及び特許請求の範囲に「電気」及び「車」を含む１０年分（出願日が２００７年１月１日から２０１６年１２月３１日）の電気自動車に関する特許出願（２６，４１９件）を抽出し、その特許出願の要約書の記載をテキストデータとする。 In this embodiment, the text of the abstracts attached to Japanese patent applications is used as the text data. Specifically, 26,419 patent applications relating to electric vehicles are extracted, covering 10 years (filing dates from January 1, 2007 to December 31, 2016) and containing "electricity" and "vehicle" in the abstract and claims, and the abstract of each application is used as text data.

表１にテキストデータの一例を示す。表１には、４つのテキストデータが例示されている。テキストデータＩＤは、個々のテキストデータを識別する情報であり、ここでは重複しない数値である。テキストデータは、発明の要約文である。文章ＩＤは、テキストデータに含まれる個々の文章を識別する情報であり、ここでは重複しない数値である。各文章ＩＤは、テキストデータＩＤとの関連も保持されている。以後、ＩＤが「１」であるテキストデータをテキストデータ「１」と表記し、ＩＤが「１」である文章を文章「１」と表記する。 Table 1 shows an example of text data. Table 1 exemplifies four text data. The text data ID is information for identifying individual text data, and is a unique numerical value here. The text data is the abstract of the invention. The sentence ID is information for identifying individual sentences included in the text data, and is a unique numerical value here. Each sentence ID also holds a relationship with the text data ID. Hereinafter, text data with an ID of "1" will be referred to as text data "1", and sentences with an ID of "1" will be referred to as sentences "1".

テキストデータを分析対象とする分析装置１は、共起行列作成手段１１、トピック抽出手段１２、及びスコア計算手段１３を備えている。本実施形態では、それらの各手段は、分析装置１で実行される分析プログラム１０として実装されている。分析プログラム１０は、分析装置１を各手段１１～１３として機能させるプログラムである。 The analysis device 1, which analyzes text data, includes co-occurrence matrix creation means 11, topic extraction means 12, and score calculation means 13. In this embodiment, each of these means is implemented as part of the analysis program 10 executed by the analysis device 1. The analysis program 10 is a program that causes the analysis device 1 to function as the means 11 to 13.

共起行列作成手段１１は、テキストデータから共起行列を作成する。共起行列とは、第１語群に属する語、及び第２語群に属する語の組み合わせの頻度に基づく要素からなる行列であり、具体的には、以下のように、実測共起行列と期待共起行列とから作成される。 A co-occurrence matrix creation means 11 creates a co-occurrence matrix from text data. A co-occurrence matrix is a matrix composed of elements based on the frequencies of combinations of words belonging to the first word group and words belonging to the second word group. expected co-occurrence matrix.

実測共起行列とは、第１語群に属する語及び第２語群に属する語の組み合わせ（共起ペアと称する）を含むテキストデータの頻度（件数）を要素とする行列である。実測共起行列は、次のようにして作成される。 The measured co-occurrence matrix is a matrix whose elements are frequencies (number of cases) of text data including combinations of words belonging to the first word group and words belonging to the second word group (referred to as co-occurrence pairs). The measured co-occurrence matrix is created as follows.

まず、共起行列作成手段１１は、テキストデータから文章を抽出する。具体的には、共起行列作成手段１１は、テキストデータを一つずつ読み込み、各テキストデータについて、句点など一文の末尾に用いられる文字を基準として文章を出力する。例えば、テキストデータＩＤ「１」については、表１に示すように４つの文章が抽出される。 First, the co-occurrence matrix creation means 11 extracts sentences from text data. Specifically, the co-occurrence matrix creating means 11 reads text data one by one, and outputs sentences based on characters used at the end of sentences, such as punctuation marks, for each text data. For example, for text data ID "1", four sentences are extracted as shown in Table 1.
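The sentence extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_sentences` and the exact set of terminator characters are assumptions, and the sample text is made up.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split a piece of text data into sentences at sentence-ending characters.

    Assumption: the Japanese full stop and (full-width or ASCII) '!' and '?'
    are treated as terminators; the patent only says "characters used at the
    end of a sentence, such as the full stop".
    """
    # Zero-width split right after each terminator (Python 3.7+),
    # so the terminator stays attached to its sentence.
    parts = re.split(r"(?<=[。！？!?])", text)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical abstract text, used only to exercise the function.
abstract = "電気自動車の航続距離を延ばす。モータを制御する。"
sentences = split_sentences(abstract)
for sentence_id, sentence in enumerate(sentences, start=1):
    print(sentence_id, sentence)
```

In a fuller pipeline each sentence would also keep the ID of the text data it came from, since the sentence IDs retain that association.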

一つのテキストデータは、発明に関する記載が含まれているが、各文章に着目すると異なる観点で記載されていることが多い。表１のテキストデータＩＤ「１」からは、電気自動車の課題について述べた文章（文章ＩＤ「１」）や電気自動車の動作について述べた文章（文章ＩＤ「２」）などが得られることになる。 A single piece of text data contains a description of an invention, but its individual sentences are often written from different viewpoints. For example, text data ID "1" in Table 1 yields a sentence describing a problem of the electric vehicle (sentence ID "1") and a sentence describing the operation of the electric vehicle (sentence ID "2").

後述するトピック抽出手段１２では、文章を元にトピックを抽出するが、もし、仮にテキストデータを元にトピックを抽出する場合、テキストデータに異なる観点の文章が複数含まれていると、適切なトピックとはいえない結果となりうる。しかし、本発明では、テキストデータから抽出した文章を元にトピックを抽出するので、後述するトピック抽出手段１２による抽出精度を向上させることができる。 The topic extraction means 12 described later extracts topics from sentences. If topics were instead extracted from whole pieces of text data, a piece of text data containing several sentences written from different viewpoints could yield topics that are not appropriate. In the present invention, topics are extracted from the sentences extracted from the text data, so the extraction accuracy of the topic extraction means 12 can be improved.

次に、共起行列作成手段１１は、各文章から第１語群及び第２語群を抽出する。第１語群及び第２語群は、所定の基準により文章から抽出された複数の語からなる。例えば、所定の基準としては、単語や特定の品詞、係り受け表現（文法的構造を持つ単語と単語のペア）などが挙げられる。第１語群と第２語群とで、異なる基準を用いるようにする。このような第１語群及び第２語群は、公知の形態素解析手法あるいは構文解析手法を適用することで得ることができる。 Next, the co-occurrence matrix creation means 11 extracts the first word group and the second word group from each sentence. The first word group and the second word group consist of a plurality of words extracted from the sentence according to a predetermined criterion. For example, the predetermined criteria include words, specific parts of speech, and dependency expressions (pairs of words having a grammatical structure). Different criteria are used for the first word group and the second word group. Such first word group and second word group can be obtained by applying a known morphological analysis method or syntactic analysis method.

次に、共起行列作成手段１１は、第１語群に属する語と、第２語群に属する語との組み合わせである共起ペアを含む文章の頻度を計算する。そして、その頻度を要素とする実測共起行列を作成する。実測共起行列のｉ行ｊ列の要素（ｉ，ｊ）は、第１語群に属するｉ番目の語と、第２語群に属するｊ番目の語からなる共起ペアを含む文章の頻度となる。 Next, the co-occurrence matrix creation means 11 calculates the frequency of sentences containing each co-occurrence pair, that is, each combination of a word belonging to the first word group and a word belonging to the second word group. It then creates a measured co-occurrence matrix whose elements are these frequencies. The element (i, j) in row i and column j of the measured co-occurrence matrix is the frequency of sentences containing the co-occurrence pair consisting of the i-th word of the first word group and the j-th word of the second word group.

表２に、実測共起行列を例示する。この実測共起行列は、文章から「単語」を抽出して第１語群とし、文章から「係り受け表現」を抽出して第２語群とするものである。第１語群に属する単語として「構成」「モータ」「制御」などが行方向に並び、第２語群に属する係り受け表現として「電力－供給」「否－判定」「バッテリ－充電」などが列方向に並んでいる。共起行列作成手段１１は、「構成」と「電力－供給」の共起ペアを含む文章の数をカウントする。表２の実測共起行列の例では、要素（１，１）の「１１８」は、「構成」及び「電力－供給」という共起ペアが存在する文章の頻度（件数）が１１８件であることを表している。 Table 2 shows an example of a measured co-occurrence matrix. In this matrix, "words" extracted from the sentences form the first word group, and "dependency expressions" extracted from the sentences form the second word group. Words belonging to the first word group, such as "configuration", "motor", and "control", are arranged along the rows, and dependency expressions belonging to the second word group, such as "power-supply", "no-judgment", and "battery-charging", are arranged along the columns. The co-occurrence matrix creation means 11 counts, for example, the number of sentences containing the co-occurrence pair "configuration" and "power-supply". In the measured co-occurrence matrix of Table 2, the value "118" of element (1, 1) indicates that 118 sentences contain the co-occurrence pair "configuration" and "power-supply".
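A minimal sketch of how such a measured co-occurrence matrix could be counted is given below. It assumes the words and dependency expressions have already been extracted from each sentence by a morphological or syntactic analyzer (not shown). The function name `measured_cooccurrence` and the three toy sentences are hypothetical and do not reproduce the counts in Table 2.

```python
import numpy as np

def measured_cooccurrence(sentences, vocab1, vocab2):
    """Count, for every (word, dependency expression) pair, the number of
    sentences that contain both.

    sentences: iterable of (group-1 terms, group-2 terms), one per sentence.
    vocab1 / vocab2: row and column labels of the matrix.
    """
    idx1 = {w: i for i, w in enumerate(vocab1)}
    idx2 = {t: j for j, t in enumerate(vocab2)}
    matrix = np.zeros((len(vocab1), len(vocab2)), dtype=np.int64)
    for terms1, terms2 in sentences:
        # Sets make each pair count once per sentence, even if a term repeats.
        rows = {idx1[w] for w in terms1 if w in idx1}
        cols = {idx2[t] for t in terms2 if t in idx2}
        for i in rows:
            for j in cols:
                matrix[i, j] += 1
    return matrix

# Toy input (made up for illustration).
vocab1 = ["configuration", "motor", "control"]
vocab2 = ["power-supply", "battery-charging"]
toy_sentences = [
    (["configuration", "motor"], ["power-supply"]),
    (["configuration", "configuration"], ["power-supply", "battery-charging"]),
    (["control"], []),
]
measured = measured_cooccurrence(toy_sentences, vocab1, vocab2)
print(measured)
```

Counting per sentence rather than per occurrence matches the definition above, where each element is the number of sentences containing the pair.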

なお、第１語群と第２語群の選び方は上述の例に限定されない。例えば、テキストデータ中に含まれる「名詞」に分類される語を第１語群とし、「動詞又は形容詞」に分類される語を第２語群としてもよい。この第２語群のように複数の品詞の何れかに分類される語から第１語群又は第２語群を抽出してもよい。 The method of selecting the first word group and the second word group is not limited to the above example. For example, words included in the text data that are classified as "nouns" may be the first word group, and words that are classified as "verbs or adjectives" may be the second word group. Like this second word group, the first word group or the second word group may be extracted from words classified into any of a plurality of parts of speech.

期待共起行列とは、第１語群に属する語及び第２語群に属する語の共起ペアの期待頻度を要素とする行列である。期待頻度とは、理論的に推定される共起ペアを含む文章の頻度である。 The expected co-occurrence matrix is a matrix whose elements are the expected frequencies of co-occurrence pairs of words belonging to the first word group and words belonging to the second word group. The expected frequency is the theoretically estimated frequency of sentences containing a co-occurrence pair.
第１語群に属するｉ番目の語（Ｘ_ｉ）が含まれる文章の件数を総頻度（ｎ（Ｘ_ｉ））とする。 Let the number of sentences containing the i-th word (X_i) belonging to the first word group be the total frequency n(X_i).
第２語群に属するｊ番目の語（Ｙ_ｊ）が含まれる文章の件数を総頻度（ｎ（Ｙ_ｊ））とする。 Let the number of sentences containing the j-th word (Y_j) belonging to the second word group be the total frequency n(Y_j).
文章の全件数を総文章数Ｎとする。 Let N be the total number of sentences.
期待頻度は、ｎ（Ｘ_ｉ）・ｎ（Ｙ_ｊ）／Ｎである。 The expected frequency is n(X_i)·n(Y_j)/N.

共起行列作成手段１１は、第１語群に属する語が含まれる文章の件数を計上して総頻度（ｎ（Ｘ_ｉ））を求め、第２語群に属する語が含まれる文章の件数を計上して総頻度（ｎ（Ｙ_ｊ））を求める。そして、文章の全件数を計上して総文章数Ｎとし、期待頻度を計算する。このような期待頻度を、第１語群に属する語及び第２語群に属する語からなる全ての共起ペアについて計算する。 The co-occurrence matrix creation means 11 counts the number of sentences containing words belonging to the first word group to obtain the total frequency (n(X _i )), and counts the number of sentences containing words belonging to the second word group. to obtain the total frequency (n(Y _j )). Then, the total number of sentences is counted to obtain the total number of sentences N, and the expected frequency is calculated. Such expected frequencies are calculated for all co-occurrence pairs of words belonging to the first word group and words belonging to the second word group.

表３は、共起行列作成手段１１により作成された期待共起行列の一例である。総文章数Ｎは、２２９，５９８件である。第１語群の一番目の語（Ｘ１）である「構成」の総頻度（ｎ（Ｘ１））は、５，８８０件である。第２語群の一番目の語（Ｙ１）である「電力－供給」の総頻度（ｎ（Ｙ１））は、１，３５０件である。要素（１，１）は「３４．６」である。これは、「構成」と「電力－供給」からなる共起ペアを含む文章の頻度は、理論的には「３４．６」であることを表している。 Table 3 is an example of an expected co-occurrence matrix created by the co-occurrence matrix creation means 11. The total number of sentences N is 229,598. The total frequency n(X1) of "configuration", the first word (X1) of the first word group, is 5,880. The total frequency n(Y1) of "power-supply", the first word (Y1) of the second word group, is 1,350. Element (1, 1) is "34.6", meaning that the theoretical frequency of sentences containing the co-occurrence pair "configuration" and "power-supply" is 34.6.
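The expected frequency n(X_i)·n(Y_j)/N can be checked numerically with a short sketch. The function name `expected_cooccurrence` is an assumption. The three input figures (5,880, 1,350, and 229,598) are the ones quoted for Table 3.

```python
import numpy as np

def expected_cooccurrence(n_x, n_y, n_sentences):
    """Expected sentence counts n(X_i) * n(Y_j) / N for every pair (i, j)."""
    n_x = np.asarray(n_x, dtype=float)
    n_y = np.asarray(n_y, dtype=float)
    # The outer product gives one row per group-1 word
    # and one column per group-2 term.
    return np.outer(n_x, n_y) / n_sentences

# n("configuration") = 5,880; n("power-supply") = 1,350; N = 229,598
expected = expected_cooccurrence([5880], [1350], 229598)
print(round(float(expected[0, 0]), 1))  # 34.6
```

The result agrees with the value 34.6 given for element (1, 1), well below the 118 sentences actually observed for the same pair in Table 2.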

共起行列は、第１語群に属する語、及び第２語群に属する語の組み合わせの頻度に基づく要素からなる行列である。より具体的には、共起行列は、期待共起行列の各要素に対する実測共起行列の各要素の差分あるいは比率を各要素とする行列である。 The co-occurrence matrix is a matrix composed of elements based on the frequencies of combinations of words belonging to the first word group and words belonging to the second word group. More specifically, the co-occurrence matrix is a matrix whose elements are differences or ratios of each element of the measured co-occurrence matrix to each element of the expected co-occurrence matrix.

共起行列作成手段１１は、期待共起行列の各要素に対する実測共起行列の各要素の差分あるいは比率を計算して共起行列を作成する。この共起行列は、次のトピック抽出手段１２の入力データとなる。期待共起行列の各要素に対する実測共起行列の各要素の差分あるいは比率として、実測共起行列の各要素（ｉ，ｊ）／期待共起行列の各要素（ｉ，ｊ）の対数を計算し、その値を共起行列の要素（ｉ，ｊ）とする。実測共起行列及び期待共起行列の各要素（ｉ，ｊ）がゼロの場合や、上記対数が負であれば、共起行列の要素はゼロとする。このようにして作成した共起行列を表４に例示する。 The co-occurrence matrix creation means 11 creates the co-occurrence matrix by calculating the difference or ratio of each element of the measured co-occurrence matrix relative to the corresponding element of the expected co-occurrence matrix. This co-occurrence matrix is the input data for the topic extraction means 12. As this difference or ratio, the logarithm of (element (i, j) of the measured co-occurrence matrix) / (element (i, j) of the expected co-occurrence matrix) is calculated and used as element (i, j) of the co-occurrence matrix. If element (i, j) of the measured co-occurrence matrix or of the expected co-occurrence matrix is zero, or if the logarithm is negative, the element of the co-occurrence matrix is set to zero. Table 4 shows an example of a co-occurrence matrix created in this way.

なお、差分あるいは比率の取り方は、単純な差分（絶対誤差）としてもよいし、絶対誤差を期待頻度で除した相対誤差としてもよいし、単純な比率としてもよいし、そうした差分あるいは比率の絶対値を取ったり、二乗を取ったり、対数を取ったりしてもよい。ただしゼロで除して値が計算不可となることや値が負数となることがないように、そのような場合は上記のようにゼロに置換するなどの調整を施す。 The difference or ratio may be a simple difference (absolute error), a relative error obtained by dividing the absolute error by the expected frequency, or a simple ratio, and the absolute value, square, or logarithm of such a difference or ratio may also be used. However, so that no value becomes incomputable through division by zero and no value becomes negative, an adjustment such as replacing the value with zero, as described above, is applied in those cases.
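The log-ratio variant with zero replacement can be sketched as below. The function name `log_ratio_matrix` and the use of the natural logarithm are assumptions, since the text does not state the base. Apart from the pair (118, 34.6) quoted from Tables 2 and 3, the input values are made up to exercise the edge cases. When the expected frequency is n(X_i)·n(Y_j)/N, this quantity corresponds to what is commonly called positive pointwise mutual information.

```python
import numpy as np

def log_ratio_matrix(measured, expected):
    """Element-wise log(measured / expected), with zero substituted wherever
    either input is zero or the logarithm would be negative."""
    measured = np.asarray(measured, dtype=float)
    expected = np.asarray(expected, dtype=float)
    result = np.zeros_like(measured)
    # Only take the logarithm where it is defined.
    valid = (measured > 0) & (expected > 0)
    result[valid] = np.log(measured[valid] / expected[valid])
    # Pairs observed less often than expected are clipped to zero.
    return np.maximum(result, 0.0)

measured = [[118.0, 0.0],
            [10.0, 5.0]]
expected = [[34.6, 12.0],
            [20.0, 0.0]]
cooc = log_ratio_matrix(measured, expected)
print(cooc)
```

Because the ratio, not the raw count, drives each element, a rare pair that occurs far more often than chance can receive a larger value than a frequent but unremarkable pair.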

トピック抽出手段１２は、前記共起行列を入力とし、第１語群に属する語及び第２語群に属する語で構成される複数のトピックを抽出する潜在意味解析法を実行することにより、各トピックを条件とした第１語群に属する語の第１条件付確率、及び各トピックを条件とした第２語群に属する語の第２条件付確率を求める。トピックは、発明に関する文章の主題を表しているといえる。 The topic extracting means 12 receives the co-occurrence matrix as an input and executes a latent semantic analysis method for extracting a plurality of topics composed of words belonging to the first word group and words belonging to the second word group. A first conditional probability of words belonging to the first word group conditioned on the topic and a second conditional probability of words belonging to the second word group conditioned on each topic are determined. The topic can be said to represent the subject matter of the text about the invention.

潜在意味解析法とは、自然言語処理の技法の一つであり、文書群と文書に含まれる用語群について、それらに関連した概念の集合を生成することで、その関係を分析する手法である。潜在意味解析法の具体例としては、ＬＳＩ（ＬａｔｅｎｔＳｅｍａｎｔｉｃＩｎｄｅｘｉｎｇ）、ＬＤＡ（ＬａｔｅｎｔＤｉｒｉｃｈｌｅｔＡｌｌｏｃａｔｉｏｎ）、ＰＬＳＡ（ＰｒｏｂａｂｉｌｉｓｔｉｃＬａｔｅｎｔＳｅｍａｎｔｉｃＡｎａｌｙｓｉｓ）を挙げることができる。 Latent semantic analysis is one of the techniques of natural language processing, and it is a method of analyzing the relationship between a group of documents and a group of terms contained in the documents by generating a set of related concepts. . Specific examples of latent semantic analysis methods include LSI (Latent Semantic Indexing), LDA (Latent Dirichlet Allocation), and PLSA (Probabilistic Latent Semantic Analysis).

本実施形態では、ＰＬＳＡを用いて説明する。図２は、ＰＬＳＡの概念図である。図２（ａ）に示すように、ＰＬＳＡは、文書分類に用いられるクラスタリング手法の一つであり、一般には、文章Ｄと、その文章に含まれる単語Ｗの間に潜在的なトピックＺがあると想定し、文章Ｄ及び単語Ｗの組み合わせで構成されるトピックＺを抽出するものである。ＰＬＳＡによるトピック抽出は、各トピックＺに属する文章Ｄの条件付確率及び各トピックＺに属する単語Ｗの条件付確率及びトピックＺの確率がＥＭアルゴリズムにより計算される。 This embodiment is described using PLSA. FIG. 2 is a conceptual diagram of PLSA. As shown in FIG. 2(a), PLSA is a clustering method used for document classification. In general, it assumes that a latent topic Z lies between a sentence D and a word W contained in that sentence, and it extracts topics Z each composed of a combination of sentences D and words W. In topic extraction by PLSA, the conditional probability of each sentence D given each topic Z, the conditional probability of each word W given each topic Z, and the probability of each topic Z are calculated by the EM algorithm.

本実施形態では、このようなＰＬＳＡに入力するデータは、上述した共起行列である。ＰＬＳＡは、このような共起行列を入力として、図２（ｂ）に示すように、第１語群に属する語Ｗ１と、第２語群に属する語Ｗ２との間に潜在的なトピックＺがあると想定し、第１語群に属する語Ｗ１と第２語群に属する語Ｗ２の組み合わせで構成されるトピックＺを抽出するものである。すなわち、トピック抽出手段１２は、共起行列を入力としてＰＬＳＡを実行することで、各トピックＺを条件とした第１語群に属する語Ｗ１の第１条件付確率としてＰ（Ｗ１｜Ｚ）、及び各トピックＺを条件とした第２語群に属する語Ｗ２の第２条件付確率としてＰ（Ｗ２｜Ｚ）を計算する。本実施形態の例では、第１語群に属する語として単語（名詞、動詞、形容詞）を、第２語群に属する語として係り受け表現（名詞と動詞・形容詞の係り受けペア）を設定している。ＰＬＳＡの具体的な計算方法は、「Hofmann, T.:Probabilistic latent semantic analysis, Proc. Of Uncertainty in Artificial Intelligence, pp.289-296, 1999.」などの文献に記載の公知の技法を用いて実行することができる。 In this embodiment, the data input to PLSA is the co-occurrence matrix described above. Taking this co-occurrence matrix as input, PLSA assumes, as shown in FIG. 2(b), that a latent topic Z lies between a word W1 belonging to the first word group and a word W2 belonging to the second word group, and extracts topics Z each composed of a combination of words W1 and words W2. That is, by executing PLSA with the co-occurrence matrix as input, the topic extraction means 12 calculates P(W1|Z) as the first conditional probability of each word W1 belonging to the first word group conditioned on each topic Z, and P(W2|Z) as the second conditional probability of each word W2 belonging to the second word group conditioned on each topic Z. In the example of this embodiment, single words (nouns, verbs, adjectives) are set as the words belonging to the first word group, and dependency expressions (dependency pairs of a noun and a verb or adjective) are set as the words belonging to the second word group. The specific PLSA computation can be carried out using known techniques described in the literature, such as "Hofmann, T.: Probabilistic latent semantic analysis, Proc. of Uncertainty in Artificial Intelligence, pp.289-296, 1999."
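To make the computation concrete, here is a minimal sketch of EM for the symmetric PLSA model P(W1, W2) = Σ_Z P(Z)·P(W1|Z)·P(W2|Z), applied to a small non-negative matrix. It is an illustration under stated assumptions rather than the patent's implementation: the function name `plsa`, the random initialization, the fixed iteration count, and the toy matrix are all hypothetical, and the dense three-dimensional array is only practical for small vocabularies.

```python
import numpy as np

def plsa(weights, n_topics, n_iter=100, seed=0):
    """Fit P(Z), P(W1|Z), P(W2|Z) to a non-negative matrix by EM.

    weights[i, j] is treated as the (possibly non-integer) weight of the
    pair (W1_i, W2_j), e.g. an element of the co-occurrence matrix.
    """
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    n1, n2 = weights.shape
    eps = 1e-12
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_w1_z = rng.random((n1, n_topics))
    p_w1_z /= p_w1_z.sum(axis=0)
    p_w2_z = rng.random((n2, n_topics))
    p_w2_z /= p_w2_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: posterior P(Z | W1_i, W2_j), shape (n1, n2, n_topics).
        joint = p_z[None, None, :] * p_w1_z[:, None, :] * p_w2_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: re-estimate parameters from posterior-weighted input.
        weighted = weights[:, :, None] * post
        mass = weighted.sum(axis=(0, 1))              # weight per topic
        p_w1_z = weighted.sum(axis=1) / (mass + eps)  # P(W1 | Z)
        p_w2_z = weighted.sum(axis=0) / (mass + eps)  # P(W2 | Z)
        p_z = mass / mass.sum()                       # P(Z)
    model = np.einsum("k,ik,jk->ij", p_z, p_w1_z, p_w2_z)
    loglik = float((weights * np.log(model + eps)).sum())
    return p_z, p_w1_z, p_w2_z, loglik

# Toy block-structured matrix (made up): two groups of strongly related pairs.
toy = [[3.0, 2.5, 0.0, 0.0],
       [2.0, 3.0, 0.0, 0.0],
       [0.0, 0.0, 1.5, 2.0],
       [0.0, 0.0, 2.5, 1.0]]
p_z, p_w1_z, p_w2_z, loglik = plsa(toy, n_topics=2)
print(p_z)
print(p_w1_z.round(3))
```

Each column of `p_w1_z` and `p_w2_z` is a distribution over one word group for one topic, which is the form of the first and second conditional probabilities described above.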

表５に、ＰＬＳＡにより計算されたトピックに属する単語及び係り受け表現を例示する。表５には、複数作成されたトピックのうち、２つのトピックＺ０８とトピックＺ２１に属する単語及び係り受け表現が示されている。それぞれ条件付確率が高い順に単語および係り受け表現を並べており、それぞれの総頻度（ｎ（Ｘ_ｉ））と総頻度（ｎ（Ｙ１））も掲載している。 Table 5 shows examples of the words and dependency expressions belonging to topics calculated by PLSA. Of the topics created, Table 5 shows the words and dependency expressions belonging to two of them, topic Z08 and topic Z21. The words and dependency expressions are listed in descending order of conditional probability, together with their total frequencies (n(X_i)) and (n(Y1)).

Looking at topic Z08, the word with the highest first conditional probability is "master cylinder", and the dependency expression with the highest second conditional probability is "based on-occurrence". The meaning of topic Z08 can be interpreted from the words and dependency expressions that belong to it. For example, judging from the words with high first conditional probabilities, topic Z08 can be interpreted as a topic about brakes. Looking also at the total frequency of each word and dependency expression, brake-related words that are less frequent than the word "brake" itself, such as "master cylinder", "brake hydraulic pressure", "brake operation", and "hydraulic pressure", are also assigned high conditional probabilities. This shows that a unique topic composed of more specific expressions has been extracted.

PLSA requires the number of topics to be set in advance, and because it depends on initial values, the result differs with the initial values. The topic extraction means 12 of this embodiment therefore sets a range of candidate topic numbers, executes PLSA multiple times with different initial values for each topic number, and calculates the value of an information criterion for each result. Of all the results, the one with the best information criterion is adopted. The information criterion can be calculated by a known method (see, for example, Sadanori Konishi and Genshiro Kitagawa, "Information Criteria" (情報量基準), Asakura Shoten, 2004). The number of topics need not be determined from such an information criterion and may be set arbitrarily.
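The selection over topic numbers and initial values can be sketched as follows. The embodiment does not state which information criterion is used, so AIC is assumed here purely for illustration. `fit` is a hypothetical callable that runs PLSA for a given topic number and seed and returns the log-likelihood together with the fitted model; the parameter count assumes the symmetric PLSA parameterization P(Z), P(W1|Z), P(W2|Z).

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion; smaller values are better."""
    return -2.0 * log_likelihood + 2.0 * n_params


def plsa_param_count(n_topics, n_rows, n_cols):
    """Free parameters of P(Z), P(W1|Z) and P(W2|Z); each distribution sums to 1."""
    return (n_topics - 1) + n_topics * (n_rows - 1) + n_topics * (n_cols - 1)


def select_model(fit, topic_counts, seeds, n_rows, n_cols):
    """Run fit(n_topics, seed) for every combination and keep the lowest-AIC run.

    fit is a hypothetical callable returning (log_likelihood, model).
    Returns (aic_value, n_topics, seed, model) of the best run.
    """
    best = None
    for n_topics in topic_counts:
        for seed in seeds:
            log_likelihood, model = fit(n_topics, seed)
            value = aic(log_likelihood,
                        plsa_param_count(n_topics, n_rows, n_cols))
            if best is None or value < best[0]:
                best = (value, n_topics, seed, model)
    return best
```

Any other criterion, for example BIC, could be substituted by replacing `aic`; the loop structure stays the same.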

In this embodiment, as shown in Table 6, the topic extraction means 12 extracted 50 topics, and each topic was interpreted. Table 6 exemplifies the interpretations given to the topics extracted by the topic extraction means.

Based on the first conditional probability and the appearance frequency of the first word group, and on the second conditional probability and the appearance frequency of the second word group, the score calculation means 13 calculates the conditional probability of each sentence given each topic. The value obtained by dividing this conditional probability by the occurrence probability of the sentence is used as the score of each topic for that sentence. These scores are then aggregated per piece of text data to obtain the score of each topic for each piece of text data.

The score of topic Z_k for sentence S_h is P(S_h|Z_k)/P(S_h) (equation (1)). Here k is the number identifying a topic created by PLSA, a natural number from 1 up to the total number of topics, and h is the number identifying a sentence (sentence ID), a natural number from 1 up to the total number of sentences.

Let Sx_h be the set of words of the first word group (row elements X_i) contained in sentence S_h (equation (2)), and let Sy_h be the set of words of the second word group (column elements Y_j) contained in sentence S_h (equation (3)).

P(S_h|Z_k) in equation (1) is calculated by decomposing sentence S_h into Sx_h and Sy_h, calculating P(Sx_h|Z_k) and P(Sy_h|Z_k) separately, and then combining them.

The conditional probability P(Sx_h|Z_k) of Sx_h given topic Z_k is calculated by equation (4), and the conditional probability P(Sy_h|Z_k) of Sy_h given topic Z_k is calculated by equation (5).

P(Sx_h|X_i) in equation (4), the probability that Sx_h appears among the occurrences of row element X_i (the appearance frequency of the first word group), is calculated as the reciprocal of the total frequency n(X_i) with which X_i appears (equation (6)).

P(Sy_h|Y_j) in equation (5), the probability that Sy_h appears among the occurrences of column element Y_j (the appearance frequency of the second word group), is calculated as the reciprocal of the total frequency n(Y_j) with which Y_j appears (equation (7)).

P(X_i|Z_k) in equation (4), the conditional probability of row element X_i given topic Z_k (the first conditional probability), and P(Y_j|Z_k) in equation (5), the conditional probability of column element Y_j given topic Z_k (the second conditional probability), are obtained by executing PLSA. The conditional probability P(S_h|Z_k) of sentence S_h given topic Z_k in equation (1) is therefore expressed by equation (8).

Within sentence S_h, Sx_h defined by the row elements X and Sy_h defined by the column elements Y carry the same weight. The conditional probability P(S_h|Sx_h) of sentence S_h given Sx_h and the conditional probability P(S_h|Sy_h) of sentence S_h given Sy_h in equation (8) are therefore each set to 0.5.

The probability P(S_h) of sentence S_h in equation (1) is expressed by equation (9), in which P(Z_k) is obtained by executing PLSA.

In this way, the ratio of P(S_h|Z_k) to P(S_h) is used as the score of topic Z_k for sentence S_h. A value above 1 means that the occurrence probability of sentence S_h rises when conditioned on topic Z_k, that is, the sentence is strongly related to topic Z_k. Adopting such a score makes the strength of the relationship between each sentence S_h and each topic Z_k easy to grasp. Table 7 exemplifies the score of each topic Z_k for each sentence S_h.
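The score calculation can be sketched in Python as follows. Equations (4), (5), (8), and (9) are not reproduced in this text, so their forms below are assumptions inferred from the description: P(Sx_h|Z_k) is taken as the sum over X_i in Sx_h of P(Sx_h|X_i) P(X_i|Z_k), likewise for Sy_h, the two parts are mixed with weights of 0.5, and P(S_h) is taken as the sum over k of P(S_h|Z_k) P(Z_k). The function and argument names are illustrative.

```python
def sentence_topic_scores(sx, sy, n_x, n_y, p_x_given_z, p_y_given_z, p_z):
    """Return the score P(S_h|Z_k) / P(S_h) of every topic k for one sentence.

    sx, sy         : row elements X_i / column elements Y_j found in the sentence
    n_x, n_y       : dicts of total frequencies n(X_i) and n(Y_j)
    p_x_given_z[k] : dict X_i -> P(X_i|Z_k) from PLSA; p_y_given_z[k] likewise
    p_z[k]         : P(Z_k) from PLSA
    """
    conditional = []
    for k in range(len(p_z)):
        # Assumed form of equations (4) and (6): weight each P(X_i|Z_k) by 1/n(X_i).
        p_sx = sum(p_x_given_z[k].get(x, 0.0) / n_x[x] for x in sx)
        # Assumed form of equations (5) and (7): weight each P(Y_j|Z_k) by 1/n(Y_j).
        p_sy = sum(p_y_given_z[k].get(y, 0.0) / n_y[y] for y in sy)
        # Assumed form of equation (8), with P(S_h|Sx_h) = P(S_h|Sy_h) = 0.5.
        conditional.append(0.5 * p_sx + 0.5 * p_sy)
    # Assumed form of equation (9): marginalize over the topics.
    p_s = sum(c * pz for c, pz in zip(conditional, p_z))
    if p_s == 0.0:
        return [0.0] * len(p_z)
    # Equation (1): ratio of the conditional to the unconditional probability.
    return [c / p_s for c in conditional]
```

Under these assumptions the scores of a sentence, weighted by P(Z_k), always sum to 1, so a score above 1 for one topic is offset by scores below 1 for others. This is consistent with treating 1 as the reference value.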

For example, sentence ID "1" has a score of 3.1 for topic Z1 and a score of 0.9 for topic Z2; such scores are calculated for all topics.

The score calculation means 13 aggregates the topic scores calculated per sentence ID into scores per text data ID. Sentence-level scores can be aggregated into text-data-level scores by, for example, taking the maximum or the average. In this embodiment, the maximum score for each topic is used as the score of that topic for the text data ID.

The aggregation of scores is described concretely using Table 8. Text data "1" consists of sentences "1" to "4". For each topic, the maximum value among sentences "1" to "4" is found.

The scores of topic Z1 for sentences "1" to "4" are "3.1", "1.4", "0.8", and "1.2", so the maximum is "3.1". This maximum, "3.1", becomes the score of topic Z1 for text data "1". The maximum is calculated in the same way for each of topics Z2 to Z50, which gives the score of each topic for text data "1". This calculation, taking the maximum as the score of each topic for the text data, is carried out for all text data. The scores shown in italics in Table 8 are the scores of each topic for the text data. In this way, a score for each topic is obtained for each piece of text data.
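The aggregation just described can be sketched as follows. The function name and data layout are illustrative assumptions. The topic Z1 scores are the ones quoted above; the topic Z2 scores for sentences "2" to "4" are invented only to complete the example.

```python
def aggregate_scores(sentence_scores, sentence_to_text, reducer=max):
    """Collapse sentence-level topic scores into text-data-level scores.

    sentence_scores  : dict sentence ID -> list of topic scores
    sentence_to_text : dict sentence ID -> text data ID
    reducer          : max (as in this embodiment) or another function, e.g. a mean
    """
    grouped = {}
    for sentence_id, scores in sentence_scores.items():
        grouped.setdefault(sentence_to_text[sentence_id], []).append(scores)
    # zip(*rows) regroups the scores topic by topic before reducing.
    return {
        text_id: [reducer(column) for column in zip(*rows)]
        for text_id, rows in grouped.items()
    }


# Columns are topics Z1 and Z2; Z2 values for sentences 2 to 4 are made up.
sentence_scores = {1: [3.1, 0.9], 2: [1.4, 5.8], 3: [0.8, 2.0], 4: [1.2, 1.1]}
sentence_to_text = {1: 1, 2: 1, 3: 1, 4: 1}
print(aggregate_scores(sentence_scores, sentence_to_text))  # {1: [3.1, 5.8]}
```

Passing a different `reducer`, for example `lambda column: sum(column) / len(column)`, would give the average-based aggregation that is mentioned as an alternative.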

From the scores obtained in this way, 1/0 information indicating whether or not each topic applies may be assigned. For example, with the threshold set to "3", flag information "1" may be assigned when the score is 3 or more and "0" when it is less than 3. Table 9 shows the flag information.

For text data "1", the score of topic Z1 is "3.1" (see Table 9), so the flag information is "1". Likewise, the score of topic Z2 is "5.8", so the flag information is "1", and the score of topic Z50 is "2.4", so the flag information is "0". The threshold does not have to be "3". Because 1 can be regarded as the reference value for the score defined as P(S_h|Z_k)/P(S_h), the threshold may also be set to "1".
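The flagging rule is a simple threshold comparison and can be sketched as follows; `to_flags` is an illustrative name, and the three scores are those quoted above for topics Z1, Z2, and Z50 of text data "1".

```python
def to_flags(scores, threshold=3.0):
    """Return 1 where the score is at or above the threshold, otherwise 0."""
    return [1 if score >= threshold else 0 for score in scores]


print(to_flags([3.1, 5.8, 2.4]))                 # [1, 1, 0]
print(to_flags([3.1, 5.8, 2.4], threshold=1.0))  # [1, 1, 1]
```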

Next, the operation of the analysis device 1 according to this embodiment is described. FIG. 3 is a flowchart showing the processing performed by the analysis device.

First, a co-occurrence matrix is created from the text data (step S1: co-occurrence matrix creation step). Specifically, the co-occurrence matrix creation means 11 extracts sentences from the text data and creates a co-occurrence matrix representing the number of combinations of words belonging to the first word group and words belonging to the second word group contained in each sentence. This matrix is created from the measured co-occurrence matrix and the expected co-occurrence matrix. A specific example was given above and is not repeated here.

Next, latent semantic analysis is executed with the co-occurrence matrix as input (step S2: topic extraction step). Specifically, the topic extraction means 12 takes the co-occurrence matrix as input and executes a latent semantic analysis method that extracts multiple topics, each composed of words belonging to the first word group and words belonging to the second word group. This yields the first conditional probability of each word belonging to the first word group given each topic, and the second conditional probability of each word belonging to the second word group given each topic. A specific example was given above and is not repeated here.

Next, the score of each topic for each piece of text data is calculated (step S3: score calculation step). Specifically, based on the first conditional probability and the appearance frequency of the first word group, and on the second conditional probability and the appearance frequency of the second word group, the score calculation means 13 obtains the conditional probability of each sentence given each topic as the score of each topic for that sentence, and aggregates these scores per piece of text data to obtain the score of each topic for each piece of text data. A specific example was given above and is not repeated here.

As described above, the analysis method, analysis device, and analysis program according to this embodiment extract topics from text data and obtain a topic score for each piece of text data. The co-occurrence matrix on which these scores are based is obtained from the difference or ratio of the measured co-occurrence matrix to the expected co-occurrence matrix.

Using the co-occurrence matrix obtained in this way makes it possible to extract more unique topics from the text data, for the following reason. Each element of the measured co-occurrence matrix is called a measured co-occurrence frequency, and each element of the expected co-occurrence matrix is called an expected co-occurrence frequency. Even for a co-occurrence pair with a high measured co-occurrence frequency, the expected co-occurrence frequency is also high when the pair contains an element whose overall frequency is high to begin with (a word of the first or second word group with a high total frequency in Table 2). Dividing the measured co-occurrence frequency by the expected co-occurrence frequency therefore limits the magnitude of the resulting element. Conversely, even for a co-occurrence pair whose measured co-occurrence frequency is not high, the element of the co-occurrence matrix becomes large if the expected co-occurrence frequency is sufficiently lower than the measured one, and in the solution obtained by applying PLSA such elements may also be assigned high probabilities. In other words, ordinary PLSA tends not to assign high probabilities to low-frequency elements, whereas the present invention, which uses the co-occurrence matrix described above, may assign high probabilities to such elements as well, so more unique topics can be expected.

When PLSA is applied to an ordinary co-occurrence matrix, by contrast, high probabilities are assigned to high-frequency elements, so the extracted topics tend to be typical and lack novelty.

In addition, the co-occurrence matrix was created on a per-sentence basis from the sentences contained in the text data, and the topic extraction means 12 extracted topics on the basis of sentences. As a result, even when a piece of text data contains multiple sentences written from different viewpoints, the topics extracted by the topic extraction means 12 are less prone to the ambiguity of mixed viewpoints, and topics with clearer content can be extracted.

Each element of the co-occurrence matrix was set to the logarithm of the ratio of the measured co-occurrence matrix to the expected co-occurrence matrix. Using the logarithm keeps the ratio from becoming extremely large. In particular, the expected co-occurrence frequency is often less than 1, and with the ratio alone some values become too high. In that state the values of the co-occurrence matrix are widely dispersed and extreme gaps arise, so the optimization performed when PLSA is applied is pulled toward these extremely large values, which can produce distorted topics that are exaggerated more than necessary. Taking the logarithm of the ratio evens out the distribution of values, limits this effect, and can be expected to yield more appropriate topics. The value of each element of the co-occurrence matrix is calculated as a difference or ratio of the measured co-occurrence matrix to the expected co-occurrence matrix. This may be a simple difference (absolute error), a relative error obtained by dividing the absolute error by the expected frequency, or a simple ratio, and the absolute value, square, or logarithm of such a difference or ratio may also be taken.
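The effect of the logarithm can be seen in a small sketch of the element calculations listed above. The function name, the `mode` labels, and the rule of returning 0 for the logarithm when the measured frequency is 0 are illustrative assumptions; the description here does not specify how zero counts are handled.

```python
import math


def element_value(observed, expected, mode="log_ratio"):
    """Compare a measured co-occurrence frequency with its expected frequency."""
    if mode == "difference":          # simple difference (absolute error)
        return observed - expected
    if mode == "relative_error":      # absolute error divided by expected frequency
        return (observed - expected) / expected
    if mode == "ratio":               # simple ratio
        return observed / expected
    if mode == "log_ratio":           # logarithm of the ratio
        # Assumption: a pair that never co-occurs contributes 0.
        return math.log(observed / expected) if observed > 0 else 0.0
    raise ValueError(f"unknown mode: {mode}")


# A rare pair whose expected frequency is below 1, and a common pair.
for observed, expected in [(3, 0.05), (200, 150.0)]:
    print(element_value(observed, expected, "ratio"),
          element_value(observed, expected, "log_ratio"))
```

With these hypothetical numbers the plain ratio differs by a factor of about 45 between the two pairs, while the logarithm compresses the gap to roughly 4.1 versus 0.29. This is the smoothing effect described above.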

Although the present invention has been described on the basis of the embodiment above, it is not limited to that embodiment. For example, the processing by the means 11 to 13 was executed on a single analysis device 1, but the means may instead be distributed across and executed by multiple analysis devices.

The embodiment above deals with patent documents, but the invention is not limited to them. It can be applied to text data in general: for example, free-text questionnaire responses from customers can be used as text data to extract latent customer needs, or call center inquiry histories can be used as text data to extract consumers' hidden evaluation viewpoints.

〈Comparative Example〉
This comparative example uses the same text data as the embodiment above, but instead of deriving the co-occurrence matrix from a measured co-occurrence matrix and an expected co-occurrence matrix, it uses the measured co-occurrence matrix itself as the co-occurrence matrix for topic extraction and score aggregation. Specifically, sentences are extracted from the text data, the first word group and the second word group are extracted from each sentence, and a co-occurrence matrix is created that represents the number of combinations of words belonging to the first word group and words belonging to the second word group contained in each sentence.

Table 10 shows the result of topic extraction performed on the co-occurrence matrix created in this way, in the same manner as in the embodiment above. Whereas 50 topics were extracted with the present invention, as shown in Table 6, 34 topics were extracted in the comparative example, as shown in Table 10.

Some of the 50 topics extracted by the present invention correspond to the 34 topics extracted in the comparative example, but there were also topics that were not extracted in the comparative example and were obtained only by the present invention. Table 11 shows examples. Topic Z09 has high probabilities for expressions such as "shift range", "parking range", "detection", "stop", and "automatically-perform", and can be interpreted as technology relating to driving assistance, such as suppressing erroneous operation by the driver or automatic stopping. Topic Z29 has high probabilities for expressions such as "navigation device", "information", "destination", and "position information", and can be interpreted as technology relating to the acquisition and provision of information, such as acquiring position information and providing it to the driver as navigation information. Both are important functions that add value in the automobile industry of recent years, and the present invention was able to obtain them from the text data.

〈Embodiment 2〉
In Embodiment 1, sentences were extracted from each of multiple pieces of text data, and the co-occurrence matrix was created from the sentences. The present invention is not limited to this, however, and the co-occurrence matrix may be created from the pieces of text data themselves. The analysis method, analysis device, and analysis program of this embodiment are described below; descriptions that overlap with Embodiment 1 are omitted.

The co-occurrence matrix creation means 11 creates, from the text data, a co-occurrence matrix representing the frequencies of combinations of words belonging to the first word group and words belonging to the second word group. That is, although each piece of text data consists of one or more sentences, processing is performed per piece of text data rather than per sentence. The text data used as an example is the same as in Table 1 of Embodiment 1.

First, the co-occurrence matrix creation means 11 extracts the first word group and the second word group from each piece of text data.

Next, the co-occurrence matrix creation means 11 calculates the frequency of text data containing each co-occurrence pair, that is, each combination of a word belonging to the first word group and a word belonging to the second word group, and creates a measured co-occurrence matrix with these frequencies as its elements. Element (i, j) in row i and column j of the measured co-occurrence matrix is the frequency of text data containing the co-occurrence pair made up of the i-th word of the first word group and the j-th word of the second word group.

Next, the co-occurrence matrix creation means 11 counts the number of pieces of text data containing each word of the first word group to obtain the total frequency n(X_i), and counts the number of pieces of text data containing each word of the second word group to obtain the total frequency n(Y_j). It then counts all pieces of text data to obtain the total number of text data N and calculates the expected frequency. The expected frequency is calculated for every word of the first word group and every word of the second word group to create the expected co-occurrence matrix.

Next, the co-occurrence matrix creation means 11 creates the co-occurrence matrix by calculating the difference or ratio of each element of the measured co-occurrence matrix to the corresponding element of the expected co-occurrence matrix. As in Embodiment 1, the logarithm of (element (i, j) of the measured co-occurrence matrix) / (element (i, j) of the expected co-occurrence matrix) is calculated, and that value is used as element (i, j) of the co-occurrence matrix.
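The matrix construction of this embodiment can be sketched as follows. The expected-frequency formula is not restated in this passage, so the sketch assumes the usual independence-based value n(X_i) × n(Y_j) / N. The function name, the data layout, and the rule of storing 0 for pairs that never co-occur are also illustrative assumptions.

```python
import math


def build_cooccurrence(docs, rows, cols):
    """Build the log-ratio co-occurrence matrix at the text-data level.

    docs : list of (first_group_words, second_group_words) sets, one per text data
    rows : words of the first word group (X_i); cols : second word group (Y_j)
    Returns (matrix, n_x, n_y) with matrix[i][j] = log(measured / expected).
    """
    n_docs = len(docs)  # total number of text data N
    n_x = {x: sum(1 for xs, _ in docs if x in xs) for x in rows}
    n_y = {y: sum(1 for _, ys in docs if y in ys) for y in cols}
    matrix = []
    for x in rows:
        row = []
        for y in cols:
            # Measured frequency: text data containing both X_i and Y_j.
            observed = sum(1 for xs, ys in docs if x in xs and y in ys)
            # Assumed expected frequency under independence.
            expected = n_x[x] * n_y[y] / n_docs
            if observed > 0 and expected > 0:
                row.append(math.log(observed / expected))
            else:
                row.append(0.0)  # assumption: no co-occurrence contributes 0
        matrix.append(row)
    return matrix, n_x, n_y
```

Pairs that co-occur less often than expected produce negative logarithms. How such elements are treated before PLSA is applied is not specified in this passage, so any clipping or shifting would be a further assumption.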

The topic extraction means 12 extracts topics from the co-occurrence matrix obtained in this way. This extraction is the same as in Embodiment 1, so its description is omitted here.

Embodiment 1 calculated the conditional probability of each sentence given each topic, whereas this embodiment calculates the conditional probability of each piece of text data given each topic.

Specifically, based on the first conditional probability and the appearance frequency of the first word group, and on the second conditional probability and the appearance frequency of the second word group, the score calculation means 13 calculates the conditional probability of each piece of text data given each topic. The value obtained by dividing this conditional probability by the occurrence probability of the text data is used as the score of each topic for that text data.

The score of topic Z_k for text data D_h is P(D_h|Z_k)/P(D_h) (equation (10)). Here k is the number identifying a topic created by PLSA, a natural number from 1 up to the total number of topics, and h is the number identifying a piece of text data (text data ID), a natural number from 1 up to the total number of pieces of text data.

Let Dx_h be the set of words of the first word group (row elements X_i) contained in text data D_h (equation (11)), and let Dy_h be the set of words of the second word group (column elements Y_j) contained in text data D_h (equation (12)).

Using these sets, the conditional probability P(Dx_h|Z_k) of Dx_h given topic Z_k is calculated by equation (13), and the conditional probability P(Dy_h|Z_k) of Dy_h given topic Z_k is calculated by equation (14).

P(Dx_h|X_i) in equation (13), the probability that Dx_h appears among the occurrences of row element X_i (the appearance frequency of the first word group), is calculated as the reciprocal of the total frequency n(X_i) with which X_i appears (equation (15)).

P(Dy_h|Y_j) in equation (14), the probability that Dy_h appears among the occurrences of column element Y_j (the appearance frequency of the second word group), is calculated as the reciprocal of the total frequency n(Y_j) with which Y_j appears (equation (16)).

P(X_i|Z_k) in equation (13), the conditional probability of row element X_i given topic Z_k (the first conditional probability), and P(Y_j|Z_k) in equation (14), the conditional probability of column element Y_j given topic Z_k (the second conditional probability), are obtained by executing PLSA. The conditional probability P(D_h|Z_k) of text data D_h given topic Z_k in equation (10) is therefore expressed by equation (17).

Within text data D_h, Dx_h defined by the row elements X and Dy_h defined by the column elements Y carry the same weight. The conditional probability P(D_h|Dx_h) of text data D_h given Dx_h and the conditional probability P(D_h|Dy_h) of text data D_h given Dy_h in equation (17) are therefore each set to 0.5.

The probability P(D_h) of text data D_h in equation (10) is expressed by equation (18), in which P(Z_k) is obtained by executing PLSA.

As described above, the analysis method, analysis device, and analysis program according to this embodiment provide the same effects as Embodiment 1. In addition, this embodiment creates the co-occurrence matrix per piece of text data rather than per sentence. The analysis method and the like of this embodiment are therefore particularly useful when the text data does not contain multiple sentences written from different viewpoints.

〈Embodiment 3〉
Embodiment 1 created the co-occurrence matrix from sentences extracted from the text data, and Embodiment 2 created it from the text data itself, but the present invention is not limited to these.

The text data of this embodiment has a structure comprising multiple text parts (each consisting of one or more sentences) that are classified into categories. Table 12 exemplifies the text data.

As shown in Table 12, each piece of text data consists of multiple text parts, and each text part is classified into a category. For example, text data relating to the specification of a patent application contains text parts classified into categories such as title (title of the invention), problem, solution, and effect.

The co-occurrence matrix creation means 11 uses two specific categories out of the multiple categories. These two categories are specified by the user. One of them is called the first category and the other the second category.

The co-occurrence matrix creation means 11 creates a co-occurrence matrix representing the frequencies of combinations of a word belonging to the first word group, taken from the text part classified into the first category, and a word belonging to the second word group, taken from the text part classified into the second category.

For all of the text data, the co-occurrence matrix creation means 11 extracts the first word group from the text parts classified into the first category and extracts the second word group from the text parts classified into the second category.

Table 13 shows an example of a measured co-occurrence matrix created with the first category set to "Title", the second category set to "Solution", the first word group set to nouns, and the second word group set to dependency expressions.

For example, element (1, 1) indicates that there are eight text data items containing the co-occurrence pair in which the text part classified into the first category "Title" contains the noun "brake" and the text part classified into the second category "Solution" contains the dependency expression "based on-occurrence".
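The counting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `docs` sample, the category keys, and the function name `measured_cooccurrence` are hypothetical, and the word groups are assumed to have already been extracted from each text part (for example by morphological and dependency analysis).

```python
from collections import Counter

# Hypothetical sample: each text data item maps a category name to the set of
# word-group terms already extracted from that text part (assumed input).
docs = [
    {"title": {"brake", "vehicle"},
     "solution": {"based on-occurrence", "control-perform"}},
    {"title": {"brake"}, "solution": {"based on-occurrence"}},
    {"title": {"sensor"}, "solution": {"control-perform"}},
]

def measured_cooccurrence(docs, first_category, second_category):
    """For each (x, y) pair, count the text data items whose first-category
    part contains x and whose second-category part contains y."""
    counts = Counter()
    for doc in docs:
        xs = doc.get(first_category, set())
        ys = doc.get(second_category, set())
        # Terms are sets, so one item contributes at most 1 to each pair.
        for x in xs:
            for y in ys:
                counts[(x, y)] += 1
    return counts

measured = measured_cooccurrence(docs, "title", "solution")
print(measured[("brake", "based on-occurrence")])
```

Each element is a count of text data items rather than of word occurrences, which matches the reading of element (1, 1) as "eight text data items".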

Next, the co-occurrence matrix creation means 11 counts the number of text data items whose text part classified into the first category contains a word belonging to the first word group, and takes this count as the total frequency n(X_i). Likewise, it counts the number of text data items whose text part classified into the second category contains a word belonging to the second word group, and takes this count as the total frequency n(Y_j). It then counts all text data items to obtain the total number of text data items N, and calculates the expected frequency. This expected frequency is calculated for every word belonging to the first word group and every word belonging to the second word group to create the expected co-occurrence matrix.
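The following sketch illustrates one way to compute these quantities. This passage does not restate the expected-frequency formula, so the code assumes the usual independence-based expectation n(X_i) * n(Y_j) / N; that formula, the `docs` sample, and the function name `expected_cooccurrence` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical sample: category name -> set of already-extracted terms.
docs = [
    {"title": {"brake", "vehicle"},
     "solution": {"based on-occurrence", "control-perform"}},
    {"title": {"brake"}, "solution": {"based on-occurrence"}},
    {"title": {"sensor"}, "solution": {"control-perform"}},
]

def expected_cooccurrence(docs, first_category, second_category):
    """Return {(x, y): expected frequency} for every pair of terms."""
    n_total = len(docs)   # N: total number of text data items
    n_x = Counter()       # n(X_i): items whose first-category part contains x
    n_y = Counter()       # n(Y_j): items whose second-category part contains y
    for doc in docs:
        n_x.update(doc.get(first_category, set()))
        n_y.update(doc.get(second_category, set()))
    # ASSUMPTION: expectation under independence, n(X_i) * n(Y_j) / N.
    return {(x, y): n_x[x] * n_y[y] / n_total for x in n_x for y in n_y}

expected = expected_cooccurrence(docs, "title", "solution")
print(expected[("brake", "based on-occurrence")])
```

Under this assumption, a pair whose measured count is well above its expected frequency co-occurs more often than its individual frequencies alone would suggest, even when both words are rare.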

The score calculation in this embodiment is the same as in Embodiment 2 except that Dx_h and Dy_h are defined differently, so a detailed description is omitted. Dx_h is the set of words belonging to the first word group (row elements X_i) obtained from the text part classified into the first category (Equation (19)). Dy_h is the set of words belonging to the second word group (column elements Y_j) obtained from the text part classified into the second category (Equation (20)).

As described above, the analysis method, analysis device, and analysis program according to this embodiment provide the same effects as Embodiments 1 and 2. This embodiment is particularly useful when analyzing structured text data that contains text parts divided into categories.

1 analysis device
10 analysis program
11 co-occurrence matrix creation means
12 topic extraction means
13 score calculation means

Claims

A text data analysis method executed by an analysis device, comprising:
a co-occurrence matrix creation step of creating a co-occurrence matrix whose elements are based on the frequencies of combinations of words belonging to a first word group and words belonging to a second word group contained in the text data;
a topic extraction step of executing a latent semantic analysis method with the co-occurrence matrix as input to extract a plurality of topics each composed of words belonging to the first word group and words belonging to the second word group, and obtaining, with each topic as a condition, a first conditional probability of the words belonging to the first word group and a second conditional probability of the words belonging to the second word group; and
a score calculation step of calculating a conditional probability of each text data item with each topic as a condition, based on the first conditional probability and the appearance frequency of the first word group and on the second conditional probability and the appearance frequency of the second word group, and obtaining a score of each topic for each text data item based on that conditional probability,
wherein the co-occurrence matrix creation step includes:
creating, from the text data, a measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group and words belonging to the second word group;
creating, from the text data, an expected co-occurrence matrix whose elements are the expected frequencies of combinations of words belonging to the first word group and words belonging to the second word group; and
creating the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix with respect to the corresponding elements of the expected co-occurrence matrix.

The analysis method according to claim 1, wherein
the co-occurrence matrix creation step includes:
extracting sentences from the text data and creating the measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group and words belonging to the second word group contained in each sentence;
extracting sentences from the text data and creating the expected co-occurrence matrix whose elements are the expected frequencies of combinations of words belonging to the first word group and words belonging to the second word group contained in each sentence; and
creating the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix with respect to the corresponding elements of the expected co-occurrence matrix, and
the score calculation step includes calculating a conditional probability of each sentence with each topic as a condition, based on the first conditional probability and the appearance frequency of the first word group and on the second conditional probability and the appearance frequency of the second word group, and obtaining a score of each topic for each text data item based on that conditional probability.

The analysis method according to claim 1, wherein
the text data includes text parts classified into categories, and
the co-occurrence matrix creation step includes:
creating the measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group extracted from the text part classified into a first category and words belonging to the second word group extracted from the text part classified into a second category;
creating the expected co-occurrence matrix whose elements are the expected frequencies of combinations of words belonging to the first word group extracted from the text part classified into the first category and words belonging to the second word group extracted from the text part classified into the second category; and
creating the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix with respect to the corresponding elements of the expected co-occurrence matrix.

A text data analysis device comprising:
co-occurrence matrix creation means for creating a co-occurrence matrix whose elements are based on the frequencies of combinations of words belonging to a first word group and words belonging to a second word group contained in the text data;
topic extraction means for executing a latent semantic analysis method with the co-occurrence matrix as input to extract a plurality of topics each composed of words belonging to the first word group and words belonging to the second word group, and obtaining, with each topic as a condition, a first conditional probability of the words belonging to the first word group and a second conditional probability of the words belonging to the second word group; and
score calculation means for calculating a conditional probability of each text data item with each topic as a condition, based on the first conditional probability and the appearance frequency of the first word group and on the second conditional probability and the appearance frequency of the second word group, and obtaining a score of each topic for each text data item based on that conditional probability,
wherein the co-occurrence matrix creation means:
creates, from the text data, a measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group and words belonging to the second word group;
creates, from the text data, an expected co-occurrence matrix whose elements are the expected frequencies of combinations of words belonging to the first word group and words belonging to the second word group; and
creates the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix with respect to the corresponding elements of the expected co-occurrence matrix.

An analysis program that causes a computer to analyze text data,
the program causing the computer to function as:
co-occurrence matrix creation means for creating a co-occurrence matrix whose elements are based on the frequencies of combinations of words belonging to a first word group and words belonging to a second word group contained in the text data;
topic extraction means for executing a latent semantic analysis method with the co-occurrence matrix as input to extract a plurality of topics each composed of words belonging to the first word group and words belonging to the second word group, and obtaining, with each topic as a condition, a first conditional probability of the words belonging to the first word group and a second conditional probability of the words belonging to the second word group; and
score calculation means for calculating a conditional probability of each text data item with each topic as a condition, based on the first conditional probability and the appearance frequency of the first word group and on the second conditional probability and the appearance frequency of the second word group, and obtaining a score of each topic for each text data item based on that conditional probability,
wherein the co-occurrence matrix creation means:
creates, from the text data, a measured co-occurrence matrix whose elements are the frequencies of combinations of words belonging to the first word group and words belonging to the second word group;
creates, from the text data, an expected co-occurrence matrix whose elements are the expected frequencies of combinations of words belonging to the first word group and words belonging to the second word group; and
creates the co-occurrence matrix whose elements are the differences or ratios of the elements of the measured co-occurrence matrix with respect to the corresponding elements of the expected co-occurrence matrix.