JP2008234618A - Knowledge extracting device, knowledge extracting method and computer program

Info

Publication number: JP2008234618A
Application number: JP2007249739A
Authority: JP
Inventors: Kazuhiko Shudo; 和彦首藤; Masaki Matsudaira; 正樹松平
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2007-02-23
Filing date: 2007-09-26
Publication date: 2008-10-02

Abstract

PROBLEM TO BE SOLVED: To provide a knowledge extracting method capable of extracting knowledge with higher accuracy and at higher speed by using text data and structured data.

SOLUTION: This knowledge extracting device has a text dividing part 122 that divides text data into words, a word pair making part 126 that makes word pairs by combining the words of the text data, a co-occurrence score calculating part 128 that calculates a co-occurrence score indicating the degree to which both constituent words of a word pair appear together across a plurality of text data, a priority determining part 130 that determines the priority of each word pair based on the co-occurrence score, a text-time series data corresponding part 150 that acquires the time series data corresponding to the constituent words of a word pair, and a correlation coefficient calculating part 140 that calculates, in the determined priority order, a correlation coefficient of the word pair by using the time series data associated with its constituent words.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a knowledge extraction device, a knowledge extraction method, and a computer program.

Data mining, an analysis technique that extracts relationships and regularities from large amounts of data, is used to discover useful business patterns. For example, in consumer finance, a customer's creditworthiness can be measured by mining a large volume of customer attribute data (for example, income and age).

With the recent rise of the Internet and the growth in storage capacity, large amounts of text now coexist with structured data such as POS data. Conventionally, the coexisting data have been analyzed separately: text data by text mining and structured data by data mining. For example, when a customer's creditworthiness was measured as described above, only attribute data, which is structured data, was used, and the content of the customer's statements was not.

However, analyzing this information in an integrated manner can be expected to yield more accurate and faster knowledge extraction. To this end, methods that analyze text data together with structured data have been proposed (for example, Patent Documents 1 to 4). Suppose the structured data is time-series data. A news article may, for instance, discuss a company's earnings in connection with the exchange rate. In that case, it can be presumed that some correlation exists between the time-series data of that company's earnings and the time-series data of the exchange rate. The correlation between the two time series, corporate earnings and the exchange rate, can thus be inferred from text data, namely the news article.

JP 2005-78240 A
JP 2000-348015 A
JP 2001-134575 A
JP 2005-135167 A

However, conventional methods for analyzing both text data and structured data analyze each kind of data separately and acquire knowledge from the analysis results. This increases the processing required to extract new knowledge, and finding correlations among data, particularly among time-series data, takes too much processing time, so extracting new knowledge has not been easy.

The present invention has been made in view of the above problem, and its object is to provide a new and improved knowledge extraction device and knowledge extraction method capable of extracting knowledge with higher accuracy and at higher speed by using text data and structured data.

To solve the above problem, according to one aspect of the present invention, there is provided a knowledge extraction device that extracts knowledge using text data and time-series data. The knowledge extraction device comprises: a text dividing unit that divides text data into words; a word pair creation unit that creates word pairs by combining the words of the text data; a co-occurrence score calculation unit that calculates a co-occurrence score indicating the degree to which the constituent words of a word pair appear together in a plurality of text data; a priority determination unit that determines the priority of the word pairs based on the co-occurrence score; a text/time-series data correspondence unit that acquires the time-series data corresponding to the constituent words of a word pair; and a correlation coefficient calculation unit that calculates, in the determined priority order, the correlation coefficient of a word pair using the time-series data associated with its constituent words.

According to the present invention, in order to find correlated data in a large amount of time-series data, a plurality of text data input to the knowledge extraction device are first divided into words by the text dividing unit, and a co-occurrence score is calculated for each word pair formed by combining the divided words. The co-occurrence score indicates how often one constituent word of a word pair appears together with the other. The correlation coefficient of each word pair is then calculated from the time-series data in the priority order determined from the co-occurrence scores. In this way, the knowledge extraction device of the present invention uses text analysis to find pairs of time-series data that are likely to be highly correlated and calculates correlation coefficients in descending order of that likelihood, so that correlations between time-series data can be found with high accuracy and at high speed.
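The scoring step described above can be illustrated with a minimal sketch. This passage does not fix a formula for the co-occurrence score, so the sketch assumes, purely for illustration, that the score is the Pearson correlation of per-document word counts. The function names `pearson` and `cooccurrence_scores` and the sample documents are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cooccurrence_scores(documents):
    """Map each word pair to an assumed co-occurrence score.

    documents: list of token lists, i.e. text data already divided into words.
    Assumption (not from the source): the score is the Pearson correlation
    of the two words' per-document counts.
    """
    vocabulary = sorted({word for doc in documents for word in doc})
    counts = {word: [doc.count(word) for doc in documents] for word in vocabulary}
    return {
        (a, b): pearson(counts[a], counts[b])
        for a, b in combinations(vocabulary, 2)
    }


# Hypothetical, already-tokenized text data.
docs = [
    ["yen", "profit", "yen"],
    ["weather"],
    ["yen", "profit"],
    ["weather", "weather"],
]
scores = cooccurrence_scores(docs)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Under this assumed definition, words that rise and fall together across documents receive a score near 1, while words that tend to exclude each other receive a negative score.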

Here, the priority determination unit can, for example, assign priorities in descending order of the absolute value of the co-occurrence score. The co-occurrence score is taken to be high when the degree to which the constituent words appear in each text data moves together (that is, when, in each text data, the constituent words either both appear frequently or both hardly appear). A word pair with a higher absolute co-occurrence score is regarded as more likely to be correlated, and its correlation coefficient is calculated from the time-series data preferentially.
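As a minimal sketch of this ordering, the following sorts word pairs by the absolute value of their co-occurrence score. The function name `prioritize` and the score values are hypothetical and serve only as an illustration.

```python
def prioritize(pair_scores):
    """Return word pairs in descending order of absolute co-occurrence score."""
    return sorted(pair_scores, key=lambda pair: abs(pair_scores[pair]), reverse=True)


# Hypothetical co-occurrence scores for three word pairs.
example_scores = {
    ("profit", "yen"): 0.90,
    ("profit", "weather"): -0.95,
    ("weather", "yen"): 0.10,
}
ordered = prioritize(example_scores)
print(ordered)
```

Because the absolute value is used, a strongly negative score is ranked ahead of a weaker positive one.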

The knowledge extraction device may also comprise a time-series data storage unit that stores time-series data in association with a time-series data ID for identifying that time-series data, and a correspondence table storage unit that stores words in association with time-series data IDs. In this case, the correlation coefficient calculation unit may skip calculating the correlation coefficient of a word pair when the time-series data ID corresponding to a constituent word of the word pair is not stored in the correspondence table storage unit. The correlation coefficient calculation unit may also skip calculating the correlation coefficient of a word pair when the time-series data IDs corresponding to the constituent words of that word pair are identical. By excluding, before any correlation coefficient is calculated, the cases in which a coefficient cannot be calculated from the time-series data or would be calculated from one and the same time series, correlation coefficients can be calculated more efficiently.
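The two exclusion rules can be sketched as a simple filter. Here a plain dictionary stands in for the correspondence table storage unit. The function name `computable_pairs`, the IDs such as `TS001`, and the sample words are hypothetical.

```python
def computable_pairs(ordered_pairs, word_to_series_id):
    """Yield (pair, id_a, id_b) only for pairs whose coefficient is worth computing."""
    for word_a, word_b in ordered_pairs:
        id_a = word_to_series_id.get(word_a)
        id_b = word_to_series_id.get(word_b)
        if id_a is None or id_b is None:
            continue  # no time-series data ID is registered for one of the words
        if id_a == id_b:
            continue  # both words map to the same time series
        yield (word_a, word_b), id_a, id_b


# Hypothetical correspondence table: word -> time-series data ID.
correspondence_table = {
    "yen": "TS001",
    "exchange rate": "TS001",
    "profit": "TS002",
}
pairs = [("yen", "exchange rate"), ("yen", "profit"), ("yen", "weather")]
kept = list(computable_pairs(pairs, correspondence_table))
print(kept)
```

In this example only the pair ("yen", "profit") survives: the first pair maps to a single series, and "weather" has no registered series.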

The correlation coefficient calculation unit may sequentially introduce predetermined delays between the time-series data and calculate a correlation coefficient for each delay, thereby obtaining a plurality of correlation coefficients for one word pair, and may take the coefficient with the largest absolute value among them as the correlation coefficient of the word pair. In this way, word pairs whose strong correlation appears only when one time series is delayed relative to the other can also be found, and knowledge can be extracted with higher accuracy.
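A minimal sketch of the delayed correlation follows. It assumes, for illustration only, that the correlation coefficient is the Pearson coefficient and that delays are tried in both directions up to a hypothetical `max_lag`. The function names and the sample series are likewise hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lagged_correlation(series_a, series_b, max_lag):
    """Return (coefficient, lag) for the lag with the largest absolute coefficient.

    A positive lag compares series_a[t] with series_b[t + lag].
    """
    best = (0.0, 0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = series_a[: len(series_a) - lag], series_b[lag:]
        else:
            xs, ys = series_a[-lag:], series_b[: len(series_b) + lag]
        n = min(len(xs), len(ys))
        if n < 2:
            continue  # too few overlapping points to correlate
        r = pearson(xs[:n], ys[:n])
        if abs(r) > abs(best[0]):
            best = (r, lag)
    return best


# Hypothetical series: b repeats a two steps later.
a = [1, 3, 2, 5, 4, 6, 1, 0]
b = [0, 0, 1, 3, 2, 5, 4, 6]
coefficient, lag = best_lagged_correlation(a, b, max_lag=3)
print(coefficient, lag)
```

With no delay the two sample series look only loosely related, while a delay of two steps aligns them exactly, which is the situation the paragraph above describes.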

The device may further comprise a correlation determination unit that determines, based on the correlation coefficient of a word pair, whether the constituent words of the word pair are correlated. The correlation determination unit may determine that a correlation exists when the absolute value of the correlation coefficient is equal to or greater than a predetermined value. This allows the knowledge extraction device to select highly correlated word pairs and present them to the user as new knowledge.
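The determination can be sketched as a threshold test on the absolute value of the coefficient. The threshold of 0.7, the function names, and the sample coefficients are hypothetical. The source says only that the value is predetermined.

```python
def is_correlated(coefficient, threshold=0.7):
    """Judge a pair correlated when |coefficient| reaches the (assumed) threshold."""
    return abs(coefficient) >= threshold


def select_knowledge(pair_coefficients, threshold=0.7):
    """Keep only the pairs judged to be correlated."""
    return {
        pair: r for pair, r in pair_coefficients.items() if is_correlated(r, threshold)
    }


# Hypothetical correlation coefficients per word pair.
coefficients = {("profit", "yen"): -0.82, ("profit", "weather"): 0.35}
selected = select_knowledge(coefficients)
print(selected)
```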

To solve the above problem, according to another aspect of the present invention, there is provided a computer program that causes a computer to function as the knowledge extraction device described above. The computer program is stored in a storage device of the computer and is read and executed by the CPU of the computer, thereby causing the computer to function as the knowledge extraction device. A computer-readable recording medium on which the computer program is recorded can also be provided. The recording medium is, for example, a magnetic disk or an optical disk.

Furthermore, to solve the above problem, according to another aspect of the present invention, there is provided a knowledge extraction method that extracts knowledge using text data and time-series data. The knowledge extraction method comprises: a text dividing step of dividing text data into words; a word pair creation step of creating word pairs by combining the divided words; a co-occurrence score calculation step of calculating a co-occurrence score indicating the degree to which the constituent words of a created word pair appear together in a plurality of text data; a priority determination step of determining the priority of the word pairs based on the co-occurrence score; a text/time-series data correspondence step of acquiring the time-series data corresponding to the constituent words of a word pair; and a correlation coefficient calculation step of calculating, in the determined priority order, the correlation coefficient of a word pair using the time-series data associated with its constituent words.

According to the present invention, in order to find correlated data in a large amount of time-series data, a plurality of text data input to the knowledge extraction device are first divided into words, and a co-occurrence score is calculated for each word pair formed by combining the divided words. The co-occurrence score indicates how often one constituent word of a word pair appears together with the other. The correlation coefficient of each word pair is then calculated from the time-series data in the priority order determined from the co-occurrence scores. In this way, the knowledge extraction method of the present invention uses text analysis to find pairs of time-series data that are likely to be highly correlated and calculates correlation coefficients in descending order of that likelihood, so that correlations between time-series data can be found with high accuracy and at high speed.

Here, the priority determination step may assign priorities in descending order of the absolute value of the co-occurrence score. If the co-occurrence score is taken to be high when the degree to which the constituent words appear in each text data moves together, a word pair with a higher absolute co-occurrence score is regarded as more likely to be correlated, and its correlation coefficient is calculated from the time-series data preferentially.

In the correlation coefficient calculation step, the correlation coefficient of a word pair may be left uncalculated when the time-series data ID identifying the time-series data that corresponds to a constituent word of the word pair is not stored in the correspondence table storage unit, which stores words in association with time-series data IDs. Likewise, in the correlation coefficient calculation step, the correlation coefficient of a word pair may be left uncalculated when the time-series data corresponding to the constituent words of that word pair are identical. By excluding, before any correlation coefficient is calculated, the cases in which a coefficient cannot be calculated from the time-series data or would be calculated from one and the same time series, correlation coefficients can be calculated more efficiently.

The correlation coefficient calculation step may include a step of sequentially introducing predetermined delays between the time-series data and calculating a correlation coefficient for each delay, thereby obtaining a plurality of correlation coefficients for one word pair, and a step of taking the coefficient with the largest absolute value among the calculated coefficients as the correlation coefficient of the word pair. In this way, word pairs whose strong correlation appears only when one time series is delayed relative to the other can also be found, and knowledge can be extracted with higher accuracy.

The knowledge extraction method of the present invention may further include a correlation determination step of determining, based on the correlation coefficient of a word pair, whether the constituent words of the word pair are correlated. This allows highly correlated word pairs to be selected and presented to the user as new knowledge.

To solve the above problem, according to another aspect of the present invention, there is provided a knowledge extraction device that extracts knowledge using text data, time-series data, and a plurality of user input models, which are items of information that correlate with one another. The knowledge extraction device comprises: a text dividing unit that divides text data into constituent words; an appearance frequency measurement unit that measures the appearance frequency of the constituent words of the text data; a model input unit to which at least two user input models are input from an input device for inputting user input models; a time-series data co-occurrence score calculation unit that calculates, for each user input model and based on the appearance frequency of the constituent words of the text data, a co-occurrence score indicating the degree to which the user input model and the time-series data appear together; a time-series data correlation score calculation unit that calculates, based on the calculated co-occurrence scores, a correlation score indicating the strength of correlation of each time-series data pair created by combining any two time-series data; a priority determination unit that determines, based on the correlation scores of the time-series data pairs, the priority order in which the correlation coefficients of the time-series data pairs are calculated; and a correlation coefficient calculation unit that calculates the correlation coefficients of the time-series data pairs in the determined priority order.

According to the present invention, a plurality of text data input to the knowledge extraction device are first divided into constituent words by the text dividing unit, and the appearance frequency of the constituent words in each text data is measured by the appearance frequency measurement unit. Then, in order to find correlated data in a large amount of time-series data, when user input models that the user considers to be mutually correlated information are input to the knowledge extraction device, the co-occurrence score between each user input model and the time-series data is calculated based on the previously measured appearance frequencies of the constituent words of the text data. Based on the calculated co-occurrence scores, a correlation score is calculated for each time-series data pair created by combining any two time-series data. In this way, the knowledge extraction device of the present invention analyzes the text data, the user input models, and the time-series data in an integrated manner, so that correlations between time-series data can be found with high accuracy. In addition, the priority order for calculating the correlation coefficients of the time-series data pairs is determined using the mutually correlated user input models, and the correlation coefficients are then calculated in that order. The device therefore does not blindly calculate the correlation coefficient of every possible time-series data pair, which reduces the calculation load and allows correlations between time-series data to be found at high speed.

Here, the time-series data co-occurrence score calculation unit can calculate the co-occurrence score between a user input model and time-series data using two quantities: the appearance frequency, in the text data, of the constituent word identical to the user input model, and the appearance frequency, in the text data, of the constituent word identical to the word corresponding to the time-series data. Calculating the co-occurrence score from the appearance frequencies of the constituent words in this way allows the priority order for calculating the correlation coefficients of time-series data pairs to reflect not only the correlation among the models obtained from the user input models but also the relationships between constituent words obtained from the text data.

The knowledge extraction device of the present invention may further comprise a time-series data storage unit that stores time-series data in association with a time-series data ID for identifying that time-series data, a correspondence table storage unit that stores words in association with time-series data IDs, and a time-series data word list acquisition unit that acquires, for each time-series data ID stored in the time-series data storage unit, a word list consisting of the one or more words stored in the correspondence table storage unit for that same time-series data ID. In this case, the time-series data co-occurrence score calculation unit can calculate the co-occurrence score between the user input model and each word in the word list of the time-series data, and take the score with the largest absolute value among them as the co-occurrence score between that time-series data and that model. Thus, even when a plurality of words share the same time-series data ID, a co-occurrence score with the user input model can be determined for each time series.
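The selection over a word list can be sketched as follows. The per-word scores are assumed to have been computed already and are hypothetical, as are the function name `series_model_score`, the IDs, and the model name "export".

```python
def series_model_score(word_list, model, word_model_scores):
    """Co-occurrence score between one time series and one user input model.

    Takes the per-word score with the largest absolute value, keeping its sign.
    """
    candidates = [word_model_scores[(word, model)] for word in word_list]
    return max(candidates, key=abs)


# Hypothetical word lists per time-series data ID.
word_lists = {"TS001": ["yen", "exchange rate"], "TS002": ["profit"]}
# Hypothetical co-occurrence scores between each word and the model "export".
word_model_scores = {
    ("yen", "export"): 0.4,
    ("exchange rate", "export"): -0.6,
    ("profit", "export"): 0.3,
}
series_scores = {
    series_id: series_model_score(words, "export", word_model_scores)
    for series_id, words in word_lists.items()
}
print(series_scores)
```

This sketch keeps the sign of the selected score. Whether the sign is retained in the device is not stated in this passage.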

Furthermore, the time-series data correlation score calculation unit may combine the user input models to create user input model pairs, each consisting of any two user input models, a first user input model and a second user input model. For a time-series data pair that is the target of correlation score calculation, consisting of first time-series data and second time-series data, the unit calculates two products: the co-occurrence score between the first user input model and the first time-series data multiplied by the co-occurrence score between the second user input model and the second time-series data, and the co-occurrence score between the second user input model and the first time-series data multiplied by the co-occurrence score between the first user input model and the second time-series data. The largest of the products calculated over all possible user input model pairs may then be taken as the correlation score of the target time-series data pair.
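The product-and-maximum rule can be sketched as follows, assuming the model-to-series co-occurrence scores are already available in a dictionary. The function name `correlation_score`, the model names, the IDs, and the score values are hypothetical.

```python
from itertools import combinations


def correlation_score(series_1, series_2, models, cooc):
    """Correlation score of a time-series pair from model/series co-occurrence scores.

    cooc maps (model, series_id) to a co-occurrence score.
    For every pair of user input models, both assignments of models to
    series are tried, and the largest product overall is returned.
    """
    best = None
    for model_1, model_2 in combinations(models, 2):
        direct = cooc[(model_1, series_1)] * cooc[(model_2, series_2)]
        crossed = cooc[(model_2, series_1)] * cooc[(model_1, series_2)]
        candidate = max(direct, crossed)
        if best is None or candidate > best:
            best = candidate
    return best


# Hypothetical user input models and co-occurrence scores.
models = ["export", "oil"]
cooc = {
    ("export", "TS001"): 0.8,
    ("oil", "TS002"): 0.5,
    ("oil", "TS001"): 0.1,
    ("export", "TS002"): 0.2,
}
score = correlation_score("TS001", "TS002", models, cooc)
print(score)
```

Here the direct assignment gives 0.8 × 0.5 and the crossed assignment gives 0.1 × 0.2, so the direct product is selected as the correlation score.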

The priority determination unit can assign priorities in descending order of the correlation score of the time-series data pairs. Because the user input models are items of information that correlate with one another, the co-occurrence score between a user input model and time-series data is taken to be high when the two are considered to move together (for example, when, in the text data, the user input model and the time-series data either both appear frequently or both hardly appear), and in that case the user input model and the time-series data are considered likely to be correlated. The correlation score indicating the strength of correlation of each time-series data pair is calculated, for example, by multiplying co-occurrence scores. The larger its value, the stronger the correlation of the time-series data pair is considered to be, and correlation coefficients are calculated preferentially for time-series data pairs with high correlation scores.

The correlation coefficient calculation unit can also sequentially introduce predetermined delays between the time-series data and calculate a correlation coefficient for each delay, thereby obtaining a plurality of correlation coefficients for one time-series data pair, and take the coefficient with the largest absolute value among them as the correlation coefficient of the time-series data pair. In this way, time-series data pairs whose strong correlation appears only when one time series is delayed relative to the other can also be found, and knowledge can be extracted with higher accuracy.

The knowledge extraction device of the present invention may further comprise a correlation determination unit that determines, based on the correlation coefficient of a time-series data pair, whether the time-series data constituting the pair are correlated. The correlation determination unit can determine that a correlation exists when the absolute value of the correlation coefficient is equal to or greater than a predetermined value. This allows the knowledge extraction device to select highly correlated time-series data pairs and present them to the user as new knowledge.

Furthermore, to solve the above problem, according to another aspect of the present invention, there is provided a knowledge extraction method that extracts knowledge using text data, time-series data, and a plurality of user input models, which are items of information that correlate with one another. The knowledge extraction method comprises: a text dividing step of dividing text data into constituent words; an appearance frequency measurement step of measuring the appearance frequency of the constituent words of the text data; a model input step in which at least two user input models are input from an input device for inputting user input models; a time-series data co-occurrence score calculation step of calculating, for each user input model and based on the appearance frequency of the constituent words of the text data, a co-occurrence score indicating the degree to which the user input model and the time-series data appear together; a time-series data correlation score calculation step of calculating, based on the calculated co-occurrence scores, a correlation score indicating the strength of correlation of each time-series data pair created by combining any two time-series data; a priority determination step of determining, based on the correlation scores of the time-series data pairs, the priority order in which the correlation coefficients of the time-series data pairs are calculated; and a correlation coefficient calculation step of calculating the correlation coefficients of the time-series data pairs in the determined priority order.

According to the present invention, a plurality of input text data are first divided into constituent words, and the appearance frequency of the constituent words in each text data is measured. Meanwhile, to find correlated data in a large amount of time-series data, the user inputs user input models, that is, items of information the user considers to be mutually correlated. When they are input, a co-occurrence score between each user input model and the time-series data is calculated based on the previously measured appearance frequency of the constituent words of the text data. Then, based on the calculated co-occurrence scores, a correlation score is calculated for each time-series data pair created by combining any two items of time-series data. In this way, the knowledge extraction method of the present invention analyzes text data, user input models, and time-series data in an integrated manner, so that correlations between time-series data can be found with high accuracy. In addition, the priority order for calculating the correlation coefficients of the time-series data pairs is determined using the mutually correlated user input models, and the correlation coefficients are then calculated in that order. Correlation coefficients are therefore not calculated blindly for every time-series data pair, which reduces the calculation load and allows correlations between time-series data to be found at high speed.

Here, the time-series data co-occurrence score calculating step can calculate the co-occurrence score between a user input model and an item of time-series data using two appearance frequencies in the text data: that of the constituent word identical to the user input model, and that of the constituent word identical to the word corresponding to the time-series data. By calculating the co-occurrence score between the user input model and the time-series data from the appearance frequency of the constituent words of the text data in this way, the priority order for calculating the correlation coefficients of the time-series data pairs reflects not only the correlation between models obtained from the user input models but also the relationships between constituent words obtained from the text data.

The knowledge extraction method of the present invention may further include a time-series data word list acquisition step. For each time-series data ID, which is stored in the time-series data storage unit in association with the time-series data in order to identify it, this step acquires the one or more words corresponding to the same time-series data ID from a correspondence table storage unit, which stores words in association with time-series data IDs, and creates a word list. In this case, the time-series data co-occurrence score calculating step calculates the co-occurrence score between the user input model and each word in the word list of the time-series data, and takes the score with the largest absolute value among them as the co-occurrence score between the time-series data and the model. As a result, even when a plurality of words share the same time-series data ID, a co-occurrence score with the user input model can be determined for each item of time-series data.
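The selection rule described above (take the word-level score with the largest absolute value) can be sketched as follows. The function name `series_model_score`, the callback `word_score`, and the toy values are illustrative assumptions; the source does not specify an implementation.

```python
def series_model_score(word_list, model, word_score):
    """Co-occurrence score between one time series and one user input model.

    word_list  : words that share the same time-series data ID
    model      : a user input model (treated here as a word)
    word_score : callable(word, model) -> signed co-occurrence score (assumed)
    """
    scores = [word_score(word, model) for word in word_list]
    if not scores:
        return 0.0  # assumption: no registered words means no evidence
    # Keep the signed score whose absolute value is largest.
    return max(scores, key=abs)


# Hypothetical word-level scores for two words mapped to the same series.
toy_scores = {
    ("exchange rate", "yen"): 0.4,
    ("foreign exchange rate", "yen"): -0.6,
}
best = series_model_score(
    ["exchange rate", "foreign exchange rate"],
    "yen",
    lambda word, model: toy_scores[(word, model)],
)
```

Keeping the sign of the selected score is a design choice in this sketch; the text only states that the score with the largest absolute value is used.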

The time-series data correlation score calculating step may include: a user input model pair creating step of combining the user input models to create user input model pairs, each consisting of a first user input model and a second user input model chosen arbitrarily; a multiplication step of calculating, for the time-series data pair whose correlation score is to be calculated, which consists of first time-series data and second time-series data, (i) the product of the co-occurrence score between the first user input model and the first time-series data and the co-occurrence score between the second user input model and the second time-series data, and (ii) the product of the co-occurrence score between the second user input model and the first time-series data and the co-occurrence score between the first user input model and the second time-series data; and a correlation score determining step of taking the largest of the products calculated over all possible user input model pairs as the correlation score of that time-series data pair.
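As a minimal sketch of the three steps above, the following function enumerates every pair of user input models, forms the two products, and keeps the largest. The dictionary layout `cooc[(model, series_id)]` and all names are assumptions made for illustration.

```python
from itertools import combinations


def correlation_score(t1, t2, models, cooc):
    """Correlation score of the time-series pair (t1, t2).

    models : list of user input models (at least two)
    cooc   : dict mapping (model, series_id) -> co-occurrence score (assumed layout)
    """
    if len(models) < 2:
        raise ValueError("at least two user input models are required")
    best = None
    for m1, m2 in combinations(models, 2):
        # (i) m1 paired with t1, m2 paired with t2
        straight = cooc[(m1, t1)] * cooc[(m2, t2)]
        # (ii) m2 paired with t1, m1 paired with t2
        crossed = cooc[(m2, t1)] * cooc[(m1, t2)]
        candidate = max(straight, crossed)
        best = candidate if best is None else max(best, candidate)
    return best


# Hypothetical co-occurrence scores for two models and two series.
toy_cooc = {
    ("A", "D1"): 0.9, ("B", "D2"): 0.8,
    ("B", "D1"): 0.1, ("A", "D2"): 0.2,
}
score = correlation_score("D1", "D2", ["A", "B"], toy_cooc)  # roughly 0.72
```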

The priority determining step can assign priorities in descending order of the correlation scores of the time-series data pairs. The correlation score, which indicates the strength of correlation of each time-series data pair, is calculated, for example, by multiplying co-occurrence scores, and a larger value is considered to indicate a stronger correlation between the time-series data. Correlation coefficients are therefore calculated preferentially for time-series data pairs with higher correlation scores.

Furthermore, the correlation coefficient calculating step may include: a step of successively introducing predetermined delays between the time-series data and calculating a correlation coefficient for each delay, thereby obtaining a plurality of correlation coefficients for one time-series data pair; and a step of taking, among the calculated correlation coefficients, the one with the largest absolute value as the correlation coefficient of the time-series data pair. Delaying one item of time-series data relative to the other also makes it possible to find time-series data pairs that show a strong correlation only with a delay, so knowledge can be extracted with higher accuracy.

The method may further include a correlation determination step of determining, based on the correlation coefficient of a time-series data pair, whether the time-series data constituting the pair are correlated. This allows highly correlated time-series data pairs to be selected and provided to the user as new knowledge.

As described above, the present invention can provide a knowledge extraction device and a knowledge extraction method capable of extracting knowledge with higher accuracy and at higher speed by using text data and structured data.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and duplicate description is omitted.

(First Embodiment)
First, the schematic configuration of the knowledge extraction device according to the first embodiment of the present invention is described with reference to FIGS. 1 to 4. FIG. 1 is a block diagram showing the schematic configuration of the knowledge extraction device according to this embodiment. FIG. 2 shows an example of a table representing the appearance frequency of the words constituting documents, which are text data. FIG. 3 shows an example of the configuration of the time-series data storage unit. FIG. 4 shows an example of the configuration of the correspondence table storage unit, which stores a correspondence table associating words with time-series data.

<Schematic configuration of knowledge extraction device>
As shown in FIG. 1, the knowledge extraction device 100 according to this embodiment includes a text input unit 110, a text analysis unit 120, a priority determination unit 130, a correlation coefficient calculation unit 140, a text/time-series data correspondence unit 150, a correlation determination unit 160, a time-series data storage unit 170, and a correspondence table storage unit 180.

The text input unit 110 is an input unit to which text data is input. Newspaper or magazine articles and other documents, for example, can be used as the text data.

The text analysis unit 120 is a functional unit that analyzes the co-occurrence relationships of the words constituting the text data. It includes, for example, a text dividing unit 122, an appearance frequency measuring unit 124, a word pair creation unit 126, and a co-occurrence score calculation unit 128.

The text dividing unit 122 is a functional unit that divides text data into words, for example by using a technique such as morphological analysis.

The appearance frequency measuring unit 124 is a functional unit that measures how often words appear in the text data. It measures the appearance frequency of the words obtained by the text dividing unit 122 and creates a matrix 125 of word appearance frequencies, for example as shown in FIG. 2. The matrix 125 in FIG. 2 shows the number of appearances of each word 125b, measured for each text data by the appearance frequency measuring unit 124. The document ID 125a is an ID for identifying the text data.

The word pair creation unit 126 is a functional unit that creates a word pair by combining two of the words obtained by the text dividing unit 122. The knowledge extraction device 100 of this embodiment extracts strongly correlated word pairs from among the word pairs created by the word pair creation unit 126.

The co-occurrence score calculation unit 128 is a functional unit that calculates a co-occurrence score indicating the co-occurrence relationship between the constituent words of a word pair created by the word pair creation unit 126. The co-occurrence score is described in detail later.

The priority determination unit 130 is a functional unit that determines the order in which the correlation coefficient calculation unit 140, described later, calculates correlation coefficients. The priority determination unit 130 of this embodiment determines this order based on the co-occurrence scores calculated by the co-occurrence score calculation unit 128.

The correlation coefficient calculation unit 140 is a functional unit that calculates the correlation coefficient of the time-series data associated with the respective constituent words of a word pair. It calculates correlation coefficients starting from the word pair given the highest priority by the priority determination unit 130. The calculation method is described in detail later.

The text/time-series data correspondence unit 150 is a functional unit that acquires the time-series data associated with the constituent words of a word pair. For the constituent words of a word pair received from the correlation coefficient calculation unit 140, it acquires the corresponding time-series data from the time-series data storage unit 170 and the correspondence table storage unit 180, both described later, and outputs the acquired time-series data to the correlation coefficient calculation unit 140.

The correlation determination unit 160 is a functional unit that determines, based on the calculated correlation coefficient of a word pair, whether the constituent words of the pair are correlated. In this embodiment, the correlation determination unit 160 determines that the constituent words of a word pair are highly correlated when the absolute value of the correlation coefficient calculated by the correlation coefficient calculation unit 140 is equal to or greater than a predetermined value. The correlation determination unit 160 outputs strongly correlated word pairs as new knowledge.

The time-series data storage unit 170 is a storage unit that stores the time-series data associated with words, and includes, for example, memory such as RAM or a hard disk. As shown in FIG. 3, the time-series data storage unit 170 holds, for example, a time-series data ID 171, which is an ID for identifying time-series data, and time-series data 172. The time-series data 172 is a collection of data that changes over time, recorded at predetermined time intervals. The time-series data 172 stored in the time-series data storage unit 170 are not necessarily all acquired at the same time interval, and their acquisition start and end times may also differ. To calculate a correlation coefficient from the time-series data 172, however, data acquired at the same timing must be used. The time-series data storage unit 170 may therefore also store the acquisition time interval, acquisition start time, acquisition end time, and similar information so that the data can be made consistent.
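One simple way to obtain same-timing data from two series with different acquisition periods is to keep only the timestamps they share. The sketch below assumes each series is a mapping from an ISO-format date string to a value; that representation and the name `align_series` are illustrative assumptions and are not taken from the described storage unit, which is only stated to hold interval and start/end metadata.

```python
def align_series(series_a, series_b):
    """Return two equal-length value lists restricted to shared timestamps.

    series_a, series_b : dict mapping ISO date string -> numeric value (assumed)
    """
    # ISO-format date strings sort chronologically when sorted as text.
    common = sorted(set(series_a) & set(series_b))
    values_a = [series_a[t] for t in common]
    values_b = [series_b[t] for t in common]
    return values_a, values_b


# Hypothetical weekly data with different start and end dates.
d1 = {"2007-01-01": 118.9, "2007-01-08": 119.7, "2007-01-15": 120.4}
d2 = {"2007-01-08": 10.0, "2007-01-15": 11.5, "2007-01-22": 12.1}
aligned_d1, aligned_d2 = align_series(d1, d2)
```

Series sampled at different intervals would additionally need resampling, which this sketch does not attempt.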

The correspondence table storage unit 180 is a storage unit that stores the correspondence between words and time-series data, and includes, for example, memory such as RAM or a hard disk. As shown in FIG. 4, the correspondence table storage unit 180 holds, for example, words 181 and time-series data IDs 182. The words 181 are the words that have time-series data. The time-series data ID 182 is the same as the time-series data ID 171 in FIG. 3 and is used to identify the time-series data corresponding to a word 181. When different words 181 have the same time-series data, that is, when they are semantically identical, the same time-series data ID 182 is stored for them. In FIG. 4, the time-series data ID corresponding to both of the words "exchange rate" (為替レート) and "foreign exchange rate" (為替相場) is "D1".

The schematic configuration of the knowledge extraction device 100 according to this embodiment has been described above. Next, the knowledge extraction method according to this embodiment is described with reference to FIGS. 5 and 6. FIG. 5 is a flowchart showing the knowledge extraction method according to this embodiment. FIG. 6 is an explanatory diagram illustrating the nature of the co-occurrence score.

<Knowledge extraction method>
In the knowledge extraction method according to this embodiment, as shown in FIG. 5, text data is first input to the text input unit 110 (S101). In this embodiment, a large number of documents are assumed to be used as the text data. The documents are, for example, articles or books related to the time-series data stored in the time-series data storage unit 170.

Next, the text input unit 110 sends the input text data to the text analysis unit 120. In the text analysis unit 120, the text dividing unit 122 first divides the text data into words (S103). For each text data, the text dividing unit 122 breaks the sentences down into words, for example by morphological analysis. It is then measured how often each resulting word appears in the text data (S105). The appearance frequency of a word is expressed by counting the number of times the word appears in each text data. This produces, for example, the matrix 125 shown in FIG. 2.
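A matrix like the one in FIG. 2 can be sketched as follows. The example assumes the documents have already been divided into words (the morphological analysis of S103 is not implemented here), and the function name and data layout are illustrative assumptions.

```python
from collections import Counter


def build_frequency_matrix(tokenized_docs):
    """Count word occurrences per document (step S105).

    tokenized_docs : dict mapping document ID -> list of words (assumed input)
    Returns (doc_ids, matrix) where matrix[word] is a list of counts,
    one entry per document in the order of doc_ids.
    """
    doc_ids = sorted(tokenized_docs)
    counts = {doc_id: Counter(tokenized_docs[doc_id]) for doc_id in doc_ids}
    vocabulary = sorted({word for c in counts.values() for word in c})
    # Counter returns 0 for words that do not occur in a document.
    matrix = {word: [counts[d][word] for d in doc_ids] for word in vocabulary}
    return doc_ids, matrix


# Hypothetical pre-tokenized documents.
docs = {1: ["yen", "rate", "yen"], 2: ["rate"]}
doc_ids, matrix = build_frequency_matrix(docs)
```

Each row of `matrix` is the N-dimensional count vector of one word, which is the input the co-occurrence score works on.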

After that, word pairs are created by combining two of the words 125b that appear in the text data (S107). For example, word pairs are created for every possible combination of the words 125b appearing in the text data. Then, based on the word appearance frequencies in the matrix 125, the co-occurrence score between the constituent words of each word pair is calculated (S109). The co-occurrence score indicates the degree to which the constituent words of a word pair appear together in the text data, and in this embodiment it is calculated from Equation 1 below as the cosine distance between N-dimensional vectors.

Here, x_i and y_i are the numbers of times the word X and the word Y, respectively, appear in the document with document ID = i, and N is the total number of text data.

The co-occurrence score takes a value in the range −1 ≤ (co-occurrence score) ≤ 1, and its absolute value is large when the constituent words of the word pair co-occur strongly. As described above, the co-occurrence score indicates how strongly one word tends to appear together with another. FIG. 6 shows a graph with the document ID identifying the text data on the horizontal axis and the number of appearances of a word on the vertical axis. In FIG. 6, for example, the word X and the word Y show different waveforms. That is, when the appearance frequency of the word X is high, the appearance frequency of the word Y is sometimes high and sometimes low, and no relationship is apparent between the appearance frequencies of the word X and the word Y across the documents. In this case, the word X and the word Y can be said to co-occur only weakly, and the co-occurrence score is close to zero.

The word X and the word Z, on the other hand, show almost identical waveforms. This is because the numbers of appearances of the word X and the word Z rise and fall together, so the word X and the word Z can be said to co-occur. In this case, the co-occurrence score is close to 1. Conversely, there may be a relationship in which the number of appearances of a word W is low in documents where the number of appearances of the word X is high, and high in documents where that of the word X is low. In that case there is a negative co-occurrence relationship between the words X and W, and the co-occurrence score is close to −1. Here too, some relationship is considered to exist between the word X and the word W. What matters is therefore the magnitude of the absolute value of the co-occurrence score, and that is what the following processing uses.
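Equation 1 itself is not reproduced in this text. As a minimal sketch under that limitation, `cosine_score` below computes the ordinary cosine between two count vectors. Because appearance counts are non-negative, this plain cosine always lies between 0 and 1, so it cannot by itself produce the negative scores described above. The second function, which subtracts each vector's mean before taking the cosine, is shown only as one assumed way to obtain values down to −1; whether the original equation does this is not known from the text.

```python
import math


def cosine_score(x, y):
    """Cosine between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def centered_cosine_score(x, y):
    """Cosine after subtracting each vector's mean (assumed variant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_score([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical count vectors over three documents.
same_shape = cosine_score([1, 2, 3], [2, 4, 6])           # close to 1
opposite = centered_cosine_score([1, 2, 3], [3, 2, 1])    # close to -1
```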

Once the co-occurrence score of each word pair has been calculated in step S109, the priority order in which correlation coefficients are calculated for the word pairs is determined (S111). In step S111, priorities are assigned, for example, in descending order of the absolute value of the co-occurrence score. A word pair list in which the word pairs are arranged in descending order of priority may be created here, and the processing of steps S113 to S121 below may be performed in the order of the list.
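Step S111 amounts to sorting the word pairs by the absolute value of their co-occurrence scores. A minimal sketch, with an assumed dictionary layout and hypothetical scores:

```python
def rank_word_pairs(pair_scores):
    """Return word pairs ordered by |co-occurrence score|, largest first.

    pair_scores : dict mapping (word_x, word_y) -> signed score (assumed layout)
    """
    return sorted(pair_scores, key=lambda pair: abs(pair_scores[pair]), reverse=True)


# Hypothetical scores; a strongly negative score ranks ahead of a weak positive one.
toy_pair_scores = {("a", "b"): 0.2, ("a", "c"): -0.9, ("b", "c"): 0.5}
word_pair_list = rank_word_pairs(toy_pair_scores)
```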

Next, the correlation coefficients of the word pairs are calculated in descending order of the priority determined in step S111. Before a correlation coefficient is calculated, the processing of steps S113 and S115 below may be performed.

First, because the correlation coefficient is calculated from time-series data, it is checked whether time-series data exists for the constituent words of the word pair (S113). In step S113, the text/time-series data correspondence unit 150 attempts to acquire the time-series data ID 182 corresponding to each constituent word from the correspondence table storage unit 180. If a constituent word is not registered among the words 181, or is registered but its time-series data ID 182 is blank (no corresponding time-series data), it is determined that no time-series data exists, and the process proceeds to step S123 without calculating a correlation coefficient for this word pair.

Next, it is checked whether the time-series data IDs of the two constituent words of the word pair differ (S115). If the constituent words have the same time-series data ID, they are semantically almost identical and share the same time-series data. In FIG. 4, for example, "exchange rate" (為替レート) and "foreign exchange rate" (為替相場) are both associated with the time-series data ID "D1". There is no need to look for a correlation in such a word pair, so the process proceeds to step S123 without calculating a correlation coefficient.
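The two checks in steps S113 and S115 can be sketched as a single lookup against the correspondence table. The table is modeled here as a plain dictionary from word to time-series data ID, with an empty string standing for a blank ID; both choices, the function name, and the sample entries other than the "D1"/"D2" IDs are illustrative assumptions.

```python
def resolve_series_ids(word_pair, correspondence):
    """Return (id_x, id_y) if the pair passes S113 and S115, else None.

    word_pair      : tuple of two words
    correspondence : dict mapping word -> time-series data ID ("" if blank)
    """
    id_x = correspondence.get(word_pair[0])
    id_y = correspondence.get(word_pair[1])
    # S113: a word is unregistered (None) or has a blank ID ("").
    if not id_x or not id_y:
        return None
    # S115: both words point at the same time series.
    if id_x == id_y:
        return None
    return id_x, id_y


# Hypothetical correspondence table in the spirit of FIG. 4.
table = {
    "exchange rate": "D1",
    "foreign exchange rate": "D1",
    "Company X revenue": "D2",
    "weather": "",
}
```

For instance, `resolve_series_ids(("exchange rate", "Company X revenue"), table)` yields the two IDs, while the pair of the two exchange-rate synonyms is skipped.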

After these checks, the correlation coefficient is calculated for word pairs that satisfy the requirements of steps S113 and S115 (S117). Suppose, for example, that the correlation coefficient between the words "exchange rate" and "Company X revenue" is to be calculated. First, the text/time-series data correspondence unit 150 acquires the time-series data corresponding to each word. The correspondence table storage unit 180 shows that the time-series data ID corresponding to "exchange rate" is "D1" and that corresponding to "Company X revenue" is "D2". The text/time-series data correspondence unit 150 therefore acquires the time-series data with the time-series data IDs "D1" and "D2" from the time-series data storage unit 170 and sends them to the correlation coefficient calculation unit 140. In the following, the time-series data corresponding to the time-series data IDs "D1" and "D2" are assumed to have the same acquisition time interval, acquisition start time, and so on.

On receiving the time-series data, the correlation coefficient calculation unit 140 calculates the correlation coefficient from Equation 2 below.

Here, x̄ (x bar) is the mean of the time-series data of the word X, and σ_x is the standard deviation of the time-series data of the word X. Further, t is the index of the time-series data, and n is the sequence length of the time-series data.
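Equation 2 is not reproduced in this text. A form consistent with the symbols defined above is the standard Pearson correlation coefficient, given here as a hedged reconstruction rather than the original equation. The symbol r_XY, the corresponding quantities ȳ and σ_y for the word Y, and the use of the population standard deviation (division by n) are assumptions.

```latex
r_{XY} = \frac{1}{n} \sum_{t=1}^{n}
         \frac{(x_t - \bar{x})\,(y_t - \bar{y})}{\sigma_x\,\sigma_y},
\qquad
\sigma_x = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (x_t - \bar{x})^2}
```

With these definitions r_XY lies between −1 and 1, which matches the range the description gives for the correlation coefficient.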

Once the correlation coefficient has been calculated from Equation 2, it is checked whether its absolute value is equal to or greater than a predetermined value (S119). The correlation coefficient takes a value in the range −1 ≤ (correlation coefficient) ≤ 1, and a larger absolute value indicates a stronger correlation. A word pair whose correlation coefficient has an absolute value at or above the predetermined value is therefore output as new knowledge (S121). If the absolute value is below the predetermined value, the word pair is not output as new knowledge, and the process proceeds to step S123. The predetermined value can be chosen freely, but it can be set to a value at which the correlation is considered strong, for example 0.7.

When the correlation for one word pair has been obtained, it is checked whether the processing has finished for all word pairs whose correlation coefficients are to be calculated (S123). The word pairs whose correlation coefficients are to be calculated may be, for example, all word pairs, or the top half of the word pairs by priority. The number of word pairs for which correlation coefficients are calculated can be decided, for example, in view of the time that can be allotted to the knowledge extraction processing. The knowledge extraction device 100 of this embodiment uses the text data to select, in priority order, word pairs that are likely to be highly correlated. Even when the number of word pairs for which correlation coefficients are calculated is limited, highly correlated word pairs are therefore likely to be found.

If it is determined in step S123 that the processing for obtaining correlations has finished, the entire process ends. If word pairs whose correlation coefficients are to be calculated still remain, the process proceeds to step S125, where the word pair with the next highest priority is read, for example from the word pair list (S125), and the processing is repeated from step S113.
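The loop formed by steps S113 to S125 can be sketched as below. The sketch assumes a caller-supplied `correlate` function that returns a correlation coefficient for a word pair, or `None` when the pair is skipped by the S113/S115 checks; that function, the `budget` parameter, and all names are illustrative assumptions. The default threshold of 0.7 follows the example value given in the description.

```python
def extract_knowledge(ranked_pairs, correlate, threshold=0.7, budget=None):
    """Process word pairs in priority order and collect strongly correlated ones.

    ranked_pairs : word pairs, highest priority first
    correlate    : callable(pair) -> correlation coefficient, or None if skipped
    threshold    : minimum |correlation| to report (S119)
    budget       : optional cap on how many pairs are processed (S123)
    """
    pairs = ranked_pairs if budget is None else ranked_pairs[:budget]
    knowledge = []
    for pair in pairs:                      # S125: next highest priority pair
        r = correlate(pair)                 # S113 to S117
        if r is not None and abs(r) >= threshold:
            knowledge.append((pair, r))     # S121: output as new knowledge
    return knowledge


# Hypothetical correlation results for three ranked pairs.
toy_results = {("a", "b"): 0.8, ("c", "d"): None, ("e", "f"): -0.75}
ranked = [("a", "b"), ("c", "d"), ("e", "f")]
all_found = extract_knowledge(ranked, toy_results.get)
top_two_only = extract_knowledge(ranked, toy_results.get, budget=2)
```

Capping the number of processed pairs is where the priority order pays off: the pairs dropped by the cap are the ones the text analysis considers least likely to be correlated.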

The knowledge extraction method according to the first embodiment of the present invention has been described above. According to this embodiment, when correlations are sought in a large amount of time-series data, the text data is analyzed in advance, and pairs of time-series data that are likely to be highly correlated are determined and given priorities. The correlation coefficients of the time-series data are then calculated in descending order of priority to obtain the correlations. Analyzing text data and time-series data in an integrated manner in this way can be expected to yield more accurate knowledge. Compared with calculating correlation coefficients blindly for every pair of time-series data, it also allows correlations between time-series data to be found, and new knowledge to be extracted, at higher speed.

Note that, when the correlation coefficient is calculated in step S117, a delay (lag) in the time-series data can also be taken into account. When a delay L is considered, the correlation coefficient can be calculated from Equation 3.

The maximum delay MAX is fixed in advance, the delay L is varied from 0 to MAX, and the correlation coefficient σ(L) is calculated from Equation 3 for each value. For example, for the correlation between "Company X revenue" and "exchange rate", the time-series data is delayed by 0 weeks (no delay), 1 week, 2 weeks, and so on up to 10 weeks (the maximum delay MAX), and the correlation coefficient σ(L) is calculated for each delay. Of the calculated coefficients σ(L), the one with the largest absolute value is taken as the correlation coefficient of the word pair. Taking the delay of the time-series data into account in this way makes it possible to find word pairs whose strong correlation appears only when one series is delayed, so that highly correlated knowledge can be extracted more accurately.
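The lag search described above can be sketched as follows. This is a minimal illustration, not the patent's Equation 3 (which is not reproduced in this text): it assumes an ordinary Pearson correlation coefficient evaluated on the overlapping part of the two series, and the function names `pearson` and `best_lagged_correlation` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def best_lagged_correlation(x, y, max_lag):
    """Return (lag, coefficient) with the largest |coefficient| for lags 0..max_lag.

    Assumes len(x) == len(y). For a given lag, x[t] is compared with y[t + lag],
    i.e. y is treated as the delayed series.
    """
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        n = len(x) - lag  # length of the overlapping part
        if n < 2:
            break
        r = pearson(x[:n], y[lag:lag + n])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Usage: y repeats x two steps later, so the strongest correlation is at lag 2.
x = [1, 3, 2, 5, 4, 6, 8, 7]
y = [0, 0, 1, 3, 2, 5, 4, 6]
lag, r = best_lagged_correlation(x, y, max_lag=3)
```

Selecting by absolute value means that a strong negative correlation at some delay is kept as well, which matches the rule of taking the σ(L) with the largest absolute value.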

(Second Embodiment)
Next, a knowledge extraction device according to the second embodiment of the present invention will be described. The knowledge extraction device according to the present embodiment differs from that of the first embodiment in that it determines the priority for calculating correlation coefficients by using, in addition to the text data, a user input model entered by the user. First, the schematic configuration of the knowledge extraction device 200 according to the present embodiment will be described with reference to FIG. 7. FIG. 7 is a block diagram showing the schematic configuration of the knowledge extraction device according to the present embodiment.

<Schematic configuration of knowledge extraction device>
As shown in FIG. 7, the knowledge extraction device 200 according to the present embodiment includes a text input unit 211, a model input unit 213, a text analysis unit 220, a text/time-series data correspondence unit 231, a time-series data word list acquisition unit 233, a time-series data co-occurrence score calculation unit 240, a time-series data correlation score calculation unit 250, a priority determination unit 260, a correlation coefficient calculation unit 270, a correlation determination unit 280, a text data storage unit 291, a time-series data storage unit 293, and a correspondence table storage unit 295.

The text input unit 211 is an input unit that receives text data from the text data storage unit 291 described later. Newspaper or magazine articles and other documents, for example, can be used as the text data.

The model input unit 213 is a functional unit that receives a user input model from the user and acquires a list of user input models. The user input model is information used both to limit the data for which co-occurrence scores are calculated and to determine the priority for calculating the correlation coefficients of the time-series data; in the present embodiment it consists of mutually related words. The user enters the words using an external input device 205 such as a keyboard. The model input unit 213 acquires the words entered by the user with the input device 205 (hereinafter, "user input words"), creates a word list from them, and outputs the list to the time-series data co-occurrence score calculation unit 240 described later.

The text analysis unit 220 is a functional unit that analyzes the appearance frequency of the words constituting the text data, and includes, for example, a text division unit 221 and an appearance frequency measurement unit 223. The text division unit 221 is a functional unit that divides text data into words, for example by a technique such as morphological analysis. The appearance frequency measurement unit 223 is a functional unit that measures how often each word appears in the text data. It measures the appearance frequency of the words produced by the text division unit 221 and, like the appearance frequency measurement unit 124 of the first embodiment, creates a matrix 125 of word appearance frequencies such as the one shown in FIG. 2.

The text/time-series data correspondence unit 231 is a functional unit that acquires the correspondence between time-series data and words. In response to a request from the time-series data word list acquisition unit 233 described later, the text/time-series data correspondence unit 231 refers to the time-series data storage unit 293 and the correspondence table storage unit 295, both described later, acquires the correspondence between the time-series data and the words, and outputs it to the time-series data word list acquisition unit 233. The text/time-series data correspondence unit 231 also outputs the requested time-series data to the correlation coefficient calculation unit 270.

The time-series data word list acquisition unit 233 is a functional unit that acquires the word lists of the time-series data. It requests the text/time-series data correspondence unit 231 to acquire the correspondence between the time-series data and words, creates a word list for each item of time-series data from the correspondence returned by the text/time-series data correspondence unit 231, and outputs the word lists to the time-series data co-occurrence score calculation unit 240 described later.

The time-series data co-occurrence score calculation unit 240 is a functional unit that calculates the co-occurrence relationship between the word lists created by the time-series data word list acquisition unit 233 and the word list of user input words supplied by the model input unit 213. The calculation of the co-occurrence score is described in detail later. The time-series data co-occurrence score calculation unit 240 outputs the calculated co-occurrence scores to the time-series data correlation score calculation unit 250 described later.

The time-series data correlation score calculation unit 250 is a functional unit that calculates, for each pair of time-series data, a correlation score that serves as the priority for calculating the pair's correlation coefficient. The method of calculating the correlation score is described later. The time-series data correlation score calculation unit 250 outputs the calculated correlation scores to the priority determination unit 260 described later.

The priority determination unit 260 is a functional unit that determines the order in which the correlation coefficient calculation unit 270, described later, calculates correlation coefficients. The priority determination unit 260 of the present embodiment determines this order based on the correlation scores calculated by the time-series data correlation score calculation unit 250.

The correlation coefficient calculation unit 270 is a functional unit that calculates the correlation coefficient of a pair of time-series data. It calculates the correlation coefficients in order, starting from the pair of time-series data given the highest priority by the priority determination unit 260.

The correlation determination unit 280 is a functional unit that determines, from the correlation coefficient calculated by the correlation coefficient calculation unit 270, whether a pair of time-series data is correlated. The correlation determination unit 280 of the present embodiment determines that the pair is highly correlated when the absolute value of the calculated correlation coefficient is equal to or greater than a predetermined value. The correlation determination unit 280 outputs strongly correlated pairs of time-series data as new knowledge.

The text data storage unit 291 is a storage unit that stores text data such as newspaper or magazine articles and other documents, and includes, for example, a memory such as a RAM or a hard disk. In the present embodiment the text data is stored in the text data storage unit 291 in advance, but the text data may instead be supplied from outside, as the user input model is.

The time-series data storage unit 293 is a storage unit that stores time-series data in association with words. It can have the same configuration as the time-series data storage unit 170 of the first embodiment, as shown in FIG. 3.

The correspondence table storage unit 295 is a storage unit that stores the correspondence between words and time-series data. It can have the same configuration as the correspondence table storage unit 180 of the first embodiment, as shown in FIG. 4.

The schematic configuration of the knowledge extraction device 200 according to the present embodiment has been described above. Next, the knowledge extraction method according to the present embodiment will be described with reference to FIGS. 8 to 10. FIG. 8 is a flowchart showing the knowledge extraction method according to the present embodiment. FIG. 9 is an explanatory diagram of the words input to the model input unit 213. FIG. 10 is an explanatory diagram of the processing of the time-series data correlation score calculation unit 250.

<Knowledge extraction method>
In the knowledge extraction method according to the present embodiment, the text data stored in the text data storage unit 291 is first analyzed as preprocessing (S201 to S205). First, as shown in FIG. 8, the text input unit 211 reads the text data stored in the text data storage unit 291 (S201). In the present embodiment a large number of documents are used as the text data, for example articles and books related to the time-series data stored in the time-series data storage unit 293.

Next, the text input unit 211 sends the input text data to the text analysis unit 220. In the text analysis unit 220, the text division unit 221 first divides the text data into words (S203), breaking the sentences of each item of text data into words by, for example, morphological analysis. The appearance frequency measurement unit 223 then measures how frequently each of these words appears in the text data (S205). The appearance frequency of a word is obtained by counting, for each item of text data, the number of times the word appears. As a result, a matrix 125 such as the one shown in FIG. 2 is formed.
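The preprocessing steps S203 and S205 can be sketched as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for morphological analysis, the layout (one row per text, one column per word) is assumed because FIG. 2 is not reproduced here, and `build_frequency_matrix` is a hypothetical name.

```python
from collections import Counter


def build_frequency_matrix(documents):
    """Return (vocabulary, matrix); matrix[d][w] is the count of word w in text d.

    Assumption: str.split() is a stand-in tokenizer. Japanese text would need a
    real morphological analyzer at this step.
    """
    counts = [Counter(doc.split()) for doc in documents]   # S203 + S205 per text
    vocabulary = sorted(set().union(*counts))               # all distinct words
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Usage with two toy texts.
vocabulary, matrix = build_frequency_matrix(["rate rises rate", "yen rate"])
```

A `Counter` returns 0 for a word that does not occur in a text, so every row has an entry for every word in the vocabulary.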

The text data preprocessing described above (S201 to S205) is completed before the co-occurrence scores of the time-series data are calculated. Next, based on this preprocessing, correlations in the time-series data are discovered and new knowledge is extracted (S211 to S229). First, the user inputs a model (S211). As described above, the model is information used both to limit the data for which co-occurrence scores are calculated and to determine the priority for calculating the correlation coefficients of the time-series data; in the present embodiment it consists of mutually related words. The user enters the mutually related words into the model input unit 213 using the input device 205.

For example, suppose that, as shown in FIG. 9, the user considers "interest rate" (U1) to be related to "exchange" (U2), "economy" (U3), and "inflation" (U4), and "inflation" (U4) to be related to "unemployment rate" (U5) and "raw material price" (U6). The user then enters concepts and words one after another without describing the detailed relationships between them. That is, "exchange" (U2), "economy" (U3), and "inflation" (U4) are entered as words related to "interest rate" (U1), and "unemployment rate" (U5) and "raw material price" (U6) are entered as words related to "inflation" (U4); a list of mutually related words can be created through this process. In this way, a word list U (U1, U2, ..., Um) of user input words is obtained.

At least two user input words must be entered. Because the knowledge extraction method of the present embodiment treats the user input words as being related to one another, entering a small number of words (for example, a few dozen) that are thought to be strongly correlated yields more appropriate priorities than entering many weakly related words (for example, several hundred).

Next, for each item of time-series data stored in the time-series data storage unit 293, the text/time-series data correspondence unit 231 and the time-series data word list acquisition unit 233 acquire the corresponding word list W (W1, W2, ..., Wn) (S213). For example, for the time-series data D1, the text/time-series data correspondence unit 231 refers to the correspondence table storage unit 295 and associates D1 with the words "為替相場" (exchange market rate, W1) and "為替レート" (exchange rate, W2), and the time-series data word list acquisition unit 233 acquires the word list W consisting of W1 and W2.

Further, using the word list of each item of time-series data acquired in step S213 and the user input words, a co-occurrence score is obtained for each pair of a time-series data word and a user input word (S215). In other words, step S215 determines how strongly each item of time-series data is related to each user input word. Specifically, the frequencies with which the words corresponding to the time-series data and the user input words appear in the text data are first obtained from the matrix 125 shown in FIG. 2, and the co-occurrence score between each word Wi (i = 1, 2, ..., n) in the word list of the time-series data and each user input word Uj (j = 1, 2, ..., m) is calculated. The co-occurrence score can be calculated using Equation 1, as in step S109 of the first embodiment. When the co-occurrence score is calculated by Equation 1, the score is set to 0 if either the time-series data word Wi or the user input word Uj is not registered in the matrix 125. In this way, a list of co-occurrence scores against the word list of the time-series data is obtained for each user input word.

For example, for the user input word U1, the co-occurrence scores with the time-series data words Wi,
Score(W1, U1), Score(W2, U1), ..., Score(Wn, U1)
are obtained. Of the co-occurrence scores in this list, the one with the largest absolute value is taken as the score Score(Dx, Uy) of the time-series data Dx for the user input word Uy. For example, the score of the time-series data D1 for the user input word U1 is written Score(D1, U1), and in the co-occurrence score list 241 for the user input word U1 in FIG. 10 the value of Score(D1, U1) is 0.7. In this way, the co-occurrence scores of the time-series data D1, D2, ..., Dl for each user input word,
Score(D1, U1), Score(D2, U1), ..., Score(Dl, U1)
Score(D1, U2), Score(D2, U2), ..., Score(Dl, U2)
Score(D1, U3), Score(D2, U3), ..., Score(Dl, U3)
...
are obtained.
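The reduction from word-level scores to a single score per time series and user input word can be sketched as follows. The word-level values are assumed to have been computed already (Equation 1 is not reproduced in this text), the dictionary layout and the name `series_word_score` are hypothetical, and the 0.4 in the usage example is an illustrative value only.

```python
def series_word_score(series_words, user_word, word_scores):
    """Score(D, U): the word-level co-occurrence score with the largest |value|.

    series_words: the words W1..Wn associated with one time series D.
    word_scores:  maps (series_word, user_word) to a co-occurrence score.
    A pair that is missing from word_scores counts as 0, mirroring the rule
    that unregistered words give a co-occurrence score of 0.
    """
    candidates = [word_scores.get((w, user_word), 0.0) for w in series_words]
    return max(candidates, key=abs, default=0.0)


# Usage: D1 is associated with two words; only the stronger score is kept.
word_scores = {
    ("exchange market rate", "interest rate"): 0.4,  # illustrative value
    ("exchange rate", "interest rate"): 0.7,
}
score_d1_u1 = series_word_score(
    ["exchange market rate", "exchange rate"], "interest rate", word_scores
)
```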

FIG. 10 shows the co-occurrence score list 241 for the user input word U1, the co-occurrence score list 242 for the user input word U2, and the co-occurrence score list 243 for the user input word U3, each sorted in descending order of co-occurrence score. The co-occurrence score lists obtained in step S215 may be stored in a storage unit (not shown). The scores in a co-occurrence score list can be regarded as a measure of how strongly each item of time-series data is likely to be related to each user input word.

After that, a correlation score is calculated for each pair (Di, Dj) of the time-series data D1, D2, ..., Dl (S217). The subsequent processing is performed on the time-series data D1, D2, ..., Dl. This set may consist of all of the time-series data held in the time-series data storage unit 293, or it may be limited to a subset, as follows, to reduce the amount of computation in the subsequent processing.

As described above, the co-occurrence scores of the time-series data D1, D2, ..., Dl for each user input word,
Score(D1, U1), Score(D2, U1), ..., Score(Dl, U1)
Score(D1, U2), Score(D2, U2), ..., Score(Dl, U2)
Score(D1, U3), Score(D2, U3), ..., Score(Dl, U3)
...
have been obtained. For the time-series data D1, for example, there are the co-occurrence scores for the individual user input words,
Score(D1, U1), Score(D1, U2), Score(D1, U3), ...
If, for example, the largest absolute value among these scores is smaller than a certain threshold, the time-series data D1 may be excluded from the set of time-series data subject to the subsequent processing, because D1 is then considered to have little relation to any of the user input words. For each of the other time-series data D2, D3, ..., the largest absolute value of its scores is likewise compared with the threshold to decide whether to include it in the time-series data subject to the subsequent processing.

The threshold can be set arbitrarily. It may be a predetermined value based on experience, or it may be derived from the co-occurrence scores of the time-series data D1, D2, ..., Dl for each user input word,
Score(D1, U1), Score(D2, U1), ..., Score(Dl, U1)
Score(D1, U2), Score(D2, U2), ..., Score(Dl, U2)
Score(D1, U3), Score(D2, U3), ..., Score(Dl, U3)
...
by computing the average Aver of their absolute values and using as the threshold either Aver itself or a value based on it, for example 1/3 * Aver.
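The pruning of weakly related time series can be sketched as follows, assuming the threshold is derived from the average of the absolute scores. The nested dictionary layout, the `factor` parameter, and the name `filter_series` are hypothetical, and most of the score values in the usage example are illustrative.

```python
def filter_series(scores, factor=1.0):
    """Drop time series whose largest |score| is below factor * Aver.

    scores: {series_id: {user_word: co-occurrence score}}.
    Aver is the mean absolute score over all (series, user word) entries.
    Returns (kept_series_ids, threshold).
    """
    all_abs = [abs(s) for per_word in scores.values() for s in per_word.values()]
    if not all_abs:
        return [], 0.0
    threshold = factor * sum(all_abs) / len(all_abs)
    kept = [
        series_id
        for series_id, per_word in scores.items()
        if max((abs(s) for s in per_word.values()), default=0.0) >= threshold
    ]
    return kept, threshold


# Usage: D2 is only weakly related to every user input word and is dropped.
scores = {
    "D1": {"U1": 0.7, "U2": 0.1},
    "D2": {"U1": 0.05, "U2": 0.1},
    "D3": {"U1": 0.9, "U2": 0.6},
}
kept, threshold = filter_series(scores)
```

Passing `factor=1/3` corresponds to the 1/3 * Aver variant mentioned above; a smaller factor keeps more series at the cost of more correlation score calculations later.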

In step S217, two of the user input words are first combined to form a word pair (Uk, Ul). Next, letting Score(Di, Uk) and Score(Dj, Uk) be the co-occurrence scores of the time-series data Di and Dj for the user input word Uk, and Score(Di, Ul) and Score(Dj, Ul) be their co-occurrence scores for the user input word Ul, the products
Score(Di, Uk) × Score(Dj, Ul)
Score(Dj, Uk) × Score(Di, Ul)
are calculated. The larger of the two is taken as the correlation score of the time-series data Di and Dj for the user input word pair (Uk, Ul). Correlation scores are calculated in the same way for every possible pair of user input words, and the largest of them is taken as the correlation score Score(Di, Dj) of the pair (Di, Dj). The correlation score Score(Di, Dj) can be regarded as an estimate, inferred from co-occurrence relationships based on the user input words and the text data, of how strongly the pair of time-series data is likely to be related.

For example, suppose that three user input words U1, U2, and U3 have been entered by the user, as shown in FIG. 10. Calculating the correlation score of the time-series data D3 and D5 for the word pair of user input words U1 and U2 gives
Score(D3, U1) × Score(D5, U2) = 0.9 × 0.7 = 0.63
Score(D5, U1) × Score(D3, U2) = 0.5 × 0.6 = 0.30
so the correlation score of D3 and D5 for the word pair (U1, U2) is 0.63. Similarly, the correlation score of D3 and D5 for the word pair (U2, U3) is 0.35, and that for the word pair (U3, U1) is 0.27. The correlation score Score(D3, D5) of the pair is therefore the largest of these values, 0.63.
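The calculation of step S217 can be sketched as follows. The dictionary keyed by (time series, user input word) and the name `pair_correlation_score` are hypothetical; the usage example uses only the four score values quoted in the worked example for U1 and U2.

```python
from itertools import combinations


def pair_correlation_score(d_i, d_j, scores, user_words):
    """Score(Di, Dj): the best product of scores over all user input word pairs.

    scores maps (series_id, user_word) to Score(D, U). For each word pair
    (Uk, Ul) both assignments of the two series to the two words are tried
    and the larger product is kept.
    """
    per_word_pair = []
    for u_k, u_l in combinations(user_words, 2):
        direct = scores[(d_i, u_k)] * scores[(d_j, u_l)]
        swapped = scores[(d_j, u_k)] * scores[(d_i, u_l)]
        per_word_pair.append(max(direct, swapped))
    return max(per_word_pair, default=0.0)


# Usage with the values from the worked example for U1 and U2.
scores = {
    ("D3", "U1"): 0.9, ("D5", "U2"): 0.7,
    ("D5", "U1"): 0.5, ("D3", "U2"): 0.6,
}
score_d3_d5 = pair_correlation_score("D3", "D5", scores, ["U1", "U2"])
```

With m user input words there are m(m-1)/2 word pairs per pair of time series, which is one reason a short, well-chosen list of user input words keeps this step cheap.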

Next, based on the correlation scores of the time-series data pairs calculated in step S217, the priority with which the correlation coefficient of each pair is calculated is determined (S219). Priority is assigned in descending order of the correlation score calculated in step S217. A pair list in which the time-series data pairs are arranged in descending order of priority may be created at this point, and the following processing (S221 to S229) may then be performed in the order of that list.

Further, based on the priority determined in step S219, the correlation coefficients of the time-series data pairs are calculated in descending order of priority (S221). On receiving the time-series data, the correlation coefficient calculation unit 270 can calculate the correlation coefficient from Equation 2 described above, as in the first embodiment. The calculation of the correlation coefficient is the same as in the first embodiment, so a detailed description is omitted.

When the correlation coefficient calculation unit 270 has calculated a correlation coefficient, the correlation determination unit 280 checks whether its absolute value is equal to or greater than a predetermined value (S223). The correlation coefficient takes a value in the range −1 ≦ (correlation coefficient) ≦ 1, and a larger absolute value indicates a stronger correlation. A pair of time-series data whose correlation coefficient has an absolute value equal to or greater than the predetermined value is therefore output as new knowledge (S225). If the absolute value of the correlation coefficient is smaller than the predetermined value, the pair is not output as new knowledge and the process proceeds to step S227. The predetermined value can be chosen arbitrarily, and can be set to a value regarded as indicating a strong correlation, for example 0.7.

When the correlation for one pair of time-series data has been determined, the device checks whether the correlation has been determined for every pair of time-series data whose correlation coefficient is to be calculated (S227). The pairs to be processed may be, for example, all pairs of time-series data, or only the top half of the pairs in priority order. The number of pairs to process can be chosen, for example, according to the time that can be allotted to the knowledge extraction process. Because the knowledge extraction device 200 of the present embodiment uses the user input words and the text data to select pairs of time-series data that are likely to be highly correlated and ranks them by priority, highly correlated pairs are likely to be found even when the number of pairs processed is limited.

If it is determined in step S227 that the correlation has been determined for all of the target pairs, the entire process ends. If pairs of time-series data whose correlation coefficients are to be calculated still remain, the process proceeds to step S229, the pair with the next highest priority is read from, for example, the pair list of time-series data (S229), and the process is repeated from step S221.
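The loop over steps S221 to S229 can be sketched as follows. This assumes the pairs have already been sorted by descending correlation score, uses an ordinary Pearson coefficient as a stand-in for Equation 2 (not reproduced in this text), and treats the `budget` parameter and all function names as hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def extract_knowledge(ranked_pairs, series, threshold=0.7, budget=None):
    """Evaluate pairs in priority order and keep the strongly correlated ones.

    ranked_pairs: [(id_a, id_b), ...] sorted by descending correlation score.
    series:       {series_id: list of values}.
    budget:       maximum number of pairs to evaluate (None means all pairs).
    """
    limit = len(ranked_pairs) if budget is None else min(budget, len(ranked_pairs))
    knowledge = []
    for id_a, id_b in ranked_pairs[:limit]:        # S229: next pair by priority
        r = pearson(series[id_a], series[id_b])    # S221: correlation coefficient
        if abs(r) >= threshold:                    # S223: strong enough?
            knowledge.append((id_a, id_b, r))      # S225: output as new knowledge
    return knowledge                               # S227: target pairs exhausted


# Usage: evaluate only the two highest-priority pairs.
series = {"D1": [1, 2, 3, 4], "D2": [2, 4, 6, 8], "D3": [1, -1, 1, -1]}
ranked = [("D1", "D2"), ("D1", "D3"), ("D2", "D3")]
found = extract_knowledge(ranked, series, threshold=0.7, budget=2)
```

Setting `budget` to half the number of pairs corresponds to processing only the top half in priority order.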

Note that, when the correlation coefficient is calculated in step S221, a delay in the time-series data can also be taken into account, as in the first embodiment. When a delay L is considered, the correlation coefficient can be calculated from Equation 3 described above. Taking the delay into account makes it possible to find pairs of time-series data whose strong correlation appears only when one series is delayed, so that highly correlated knowledge can be extracted more accurately.

The knowledge extraction method according to the second embodiment of the present invention has been described above. According to this embodiment, the user inputs words that the user considers to be related to each other, and time-series data pairs that are likely to be highly correlated are prioritized on that basis. Correlation coefficients are then calculated in descending order of priority. Because text data, user input words, and time-series data are analyzed together, more accurate knowledge extraction can be expected. In this embodiment, the user input words strongly influence the priority order for calculating the correlation coefficients of the time-series data pairs, and the text data plays an auxiliary role to the user input words. With only a simple input, the user can limit the data for which co-occurrence scores are calculated, so that correlations between time-series data can be found, and new knowledge extracted, faster than when correlation coefficients are calculated blindly for every time-series data pair.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, but it goes without saying that the present invention is not limited to these examples. It is apparent that those skilled in the art can conceive of various changes and modifications within the scope of the claims, and it is understood that these naturally also belong to the technical scope of the present invention.

For example, in the first embodiment, the presence or absence of time-series data and whether the time-series data IDs differ are checked before the correlation coefficient is calculated, but the present invention is not limited to this example. For example, the correlation coefficient may be calculated only when the absolute value of the co-occurrence score of the constituent words of a word pair is equal to or greater than a predetermined threshold. The threshold can be set to any value; for example, the average of the absolute values of all co-occurrence scores calculated in step S109 can be used. Similarly, in the second embodiment, as a check before calculating the correlation coefficient of a time-series data pair, the correlation coefficient may be calculated only when the absolute value of the correlation score of the time-series data pair is equal to or greater than a predetermined threshold. Here too, the threshold can be set to any value; for example, the average of the correlation scores of the time-series data pairs calculated in step S217 can be used.
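
The threshold check for the first embodiment might look like the following Python sketch. The dictionary layout and the function name are assumptions made for illustration; the default threshold is the mean of the absolute co-occurrence scores, as suggested in the text.

```python
def filter_pairs_by_score(scored_pairs, threshold=None):
    """Keep only word pairs whose |co-occurrence score| reaches the threshold.

    scored_pairs: dict mapping (word_a, word_b) to a co-occurrence score
                  (hypothetical layout)
    threshold:    if None, the mean of the absolute scores is used
    """
    if not scored_pairs:
        return {}
    if threshold is None:
        threshold = sum(abs(s) for s in scored_pairs.values()) / len(scored_pairs)
    return {pair: s for pair, s in scored_pairs.items() if abs(s) >= threshold}
```

Only the pairs that survive this filter would then be passed on to the correlation coefficient calculation, which reduces the number of coefficients that have to be computed.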

In the above embodiments, the time-series data corresponding to the time-series data IDs "D1" and "D2" were described as having the same data acquisition time interval and the like. However, the present invention is not limited to this example, and time-series data with different data acquisition time intervals and the like may be stored in the time-series data storage units 170 and 293. In order to calculate the correlation coefficient, however, the data acquisition time intervals and the like must be matched so that the series are consistent. For this purpose it is necessary, for example, to create data from the time-series data with the shorter acquisition interval that is aligned in time with the time-series data with the longer interval, or to interpolate the time-series data with the longer interval so that it matches the time-series data with the shorter interval.
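
The two alignment options can be sketched in Python as follows. The text does not specify how the aligned data are created, so the block averaging and the linear interpolation used here are assumptions, as are the function names and the integer `factor` relating the two acquisition intervals.

```python
def downsample_mean(values, factor):
    """Align a short-interval series to a longer interval by averaging
    each block of `factor` consecutive samples (assumed aggregation)."""
    return [sum(values[i:i + factor]) / factor
            for i in range(0, len(values) - factor + 1, factor)]

def upsample_linear(values, factor):
    """Align a long-interval series to a shorter interval by inserting
    factor - 1 linearly interpolated points between neighbouring samples."""
    if not values:
        return []
    out = []
    for a, b in zip(values, values[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(values[-1])  # keep the final original sample
    return out
```

For example, a daily series could be reduced to weekly values with `downsample_mean(daily, 7)` before being correlated with a weekly series. Averaging discards short-term variation, while interpolation introduces values that were never measured, so the choice depends on the data.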

Furthermore, in the above embodiments the time-series data is stored in advance in the time-series data storage units 170 and 293, but the present invention is not limited to this example. As with the text data, a time-series data input unit may be provided so that the time-series data is input from outside.

A block diagram showing the schematic configuration of the knowledge extraction device according to the first embodiment of the present invention.
An example of a table showing the appearance frequencies of the words constituting the documents according to the first embodiment.
An example of the configuration of the time-series data storage unit according to the first embodiment.
An example of the configuration of the correspondence table storage unit, which stores a correspondence table associating words with time-series data, according to the first embodiment.
A flowchart showing the knowledge extraction method according to the first embodiment.
An explanatory diagram illustrating the properties of the co-occurrence score.
A block diagram showing the schematic configuration of the knowledge extraction device according to the second embodiment of the present invention.
A flowchart showing the knowledge extraction method according to the second embodiment.
An explanatory diagram of the words input to the model input unit according to the second embodiment.
An explanatory diagram of the processing of the time-series data correlation score calculation unit according to the second embodiment.

Explanation of symbols

100, 200 Knowledge extraction device
110, 211 Text input unit
120, 220 Text analysis unit
122, 221 Text division unit
124, 223 Appearance frequency measurement unit
126 Word pair creation unit
128 Co-occurrence score calculation unit
130, 260 Priority determination unit
140, 270 Correlation coefficient calculation unit
150, 231 Text/time-series data correspondence unit
160, 280 Correlation determination unit
170, 293 Time-series data storage unit
180, 295 Correspondence table storage unit
205 Input device
213 Model input unit
233 Time-series data word acquisition unit
240 Time-series data co-occurrence score calculation unit
250 Time-series data correlation score calculation unit
291 Text data storage unit

Claims

A knowledge extraction device that extracts knowledge using text data and time series data,
A text division unit for dividing text data into words;
A word pair creation unit that creates a word pair by combining the words of the text data;
A co-occurrence score calculating unit that calculates a co-occurrence score indicating a degree of appearance of constituent words constituting the word pair in a plurality of the text data;
A priority determining unit that determines the priority of word pairs based on the co-occurrence score;
A text / time-series data corresponding unit for acquiring time-series data corresponding to the constituent words of the word pair;
A correlation coefficient calculating unit that calculates the correlation coefficient of the word pair using time-series data associated with the constituent words of the word pair according to the determined priority;
A knowledge extraction device comprising:

The knowledge extraction device according to claim 1, wherein the priority order determination unit assigns priorities in descending order of the absolute value of the co-occurrence score.

A time-series data storage unit that stores time-series data and a time-series data ID for specifying the time-series data in association with each other;
A correspondence table storage unit that associates and stores words and time-series data IDs;
Further comprising
The correlation coefficient calculation unit does not calculate the correlation coefficient of the word pair when a time-series data ID corresponding to a constituent word of the word pair is not stored in the correspondence table storage unit. The knowledge extraction device according to claim 1 or 2.

The correlation coefficient calculation unit does not calculate the correlation coefficient of the word pair when the time-series data IDs corresponding to the constituent words of one word pair are the same. The knowledge extraction device according to any one of claims 1 to 3.

The correlation coefficient calculation unit
A plurality of correlation coefficients are calculated for one word pair by sequentially generating a predetermined delay between the time series data and calculating a correlation coefficient for each delay,
The knowledge extraction device according to any one of claims 1 to 4, wherein the correlation coefficient having the maximum absolute value among the plurality of correlation coefficients is used as the correlation coefficient of the word pair.

A correlation determining unit that determines whether or not the constituent words of the word pair have a correlation based on the correlation coefficient of the word pair;
The knowledge extraction apparatus according to claim 1, wherein the correlation determination unit determines that there is a correlation when an absolute value of a correlation coefficient is equal to or greater than a predetermined value.

A computer program that causes a computer to function as a knowledge extraction device that extracts knowledge using text data and time-series data,
Text dividing means for dividing the text data into words;
Word pair creation means for creating a word pair by combining the words of the text data;
Co-occurrence score calculating means for calculating a co-occurrence score indicating the degree of appearance of constituent words constituting the word pair in a plurality of the text data;
Priority determining means for determining the priority of word pairs based on the co-occurrence score;
Text / time-series data correspondence means for obtaining time-series data corresponding to the constituent words of the word pair;
A computer program causing the computer to function as correlation coefficient calculation means for calculating the correlation coefficient of the word pair, using the time-series data associated with the constituent words of the word pair, according to the determined priority.

A knowledge extraction method for extracting knowledge using text data and time series data,
A text splitting step for splitting the text data into words;
A word pair creation step of creating a word pair by combining the divided words;
A co-occurrence score calculating step for calculating a co-occurrence score indicating a degree of appearance of constituent words constituting the created word pair in a plurality of the text data; and
A priority determining step for determining the priority of word pairs based on the co-occurrence score;
Text / time-series data correspondence step for obtaining time-series data corresponding to the constituent words of the word pair;
A correlation coefficient calculating step of calculating a correlation coefficient of the word pair using time-series data associated with the constituent words of the word pair according to the determined priority;
A knowledge extracting method characterized by comprising:

The knowledge extraction method according to claim 8, wherein the priority order determining step assigns priorities in descending order of the absolute value of the co-occurrence score.

In the correlation coefficient calculating step, the correlation coefficient of the word pair is not calculated when a time-series data ID specifying the time-series data corresponding to a constituent word of the word pair is not stored in a correspondence table storage unit that stores words and time-series data IDs in association with each other. The knowledge extraction method according to claim 8 or 9.

In the correlation coefficient calculating step, the correlation coefficient of the word pair is not calculated when the time-series data corresponding to the constituent words of one word pair are the same. The knowledge extraction method according to any one of claims 8 to 10.

The correlation coefficient calculating step includes:
Calculating a plurality of correlation coefficients for one word pair by sequentially generating a predetermined delay between the time series data and calculating a correlation coefficient for each delay;
Of the plurality of calculated correlation coefficients, the correlation coefficient having the maximum absolute value is set as the correlation coefficient of the word pair;
The knowledge extraction method according to claim 8, wherein the knowledge extraction method comprises:

The knowledge extraction method further comprising a correlation determination step of determining, based on the correlation coefficient of the word pair, whether or not the constituent words of the word pair are correlated.

A knowledge extraction device that extracts knowledge using text data, time-series data, and a plurality of user input models that are pieces of information correlated with one another,
A text division unit for dividing text data into constituent words;
An appearance frequency measuring unit that measures the appearance frequency of the constituent words of the text data;
A model input unit to which at least two or more user input models are input from an input device for inputting a user input model;
A time-series data co-occurrence score calculation unit that calculates, for each user input model, a co-occurrence score indicating the degree to which the user input model and the time-series data appear together, based on the appearance frequencies of the constituent words of the text data;
A time-series data correlation score calculation unit that calculates, based on the calculated co-occurrence scores, a correlation score indicating the strength of correlation of each time-series data pair created by combining any two of the time-series data;
A priority order determining unit for determining a priority order for calculating a correlation coefficient of the time series data pair based on a correlation score of the time series data pair;
A correlation coefficient calculating unit for calculating a correlation coefficient of the time series data pair according to the determined priority;
A knowledge extraction device comprising:

The time-series data co-occurrence score calculation unit
Calculates the co-occurrence score between the user input model and the time-series data using the appearance frequency, in the text data, of constituent words identical to the user input model and the appearance frequency of constituent words identical to the words corresponding to the time-series data. The knowledge extraction device according to claim 14.

A time-series data storage unit that stores time-series data and a time-series data ID for specifying the time-series data in association with each other;
A correspondence table storage unit for storing words and time-series data IDs in association with each other;
A time-series data word list acquisition unit that acquires, for each time-series data ID stored in the time-series data storage unit, a word list consisting of one or more words stored in the correspondence table storage unit in association with that time-series data ID;
Further comprising
The time series data co-occurrence score calculation unit
Calculating a co-occurrence score of each word constituting the word list of the time-series data and the user input model,
The knowledge extraction device according to claim 14, wherein the co-occurrence score having the maximum absolute value among the calculated co-occurrence scores is set as the co-occurrence score between the time-series data and the user input model.

The time series data correlation score calculation unit includes:
Combining the user input models, creating a user input model pair consisting of a first user input model and a second user input model which are arbitrary two user input models;
For a time-series data pair consisting of first time-series data and second time-series data that is the target of correlation score calculation, calculating the product of the co-occurrence score between the first user input model and the first time-series data and the co-occurrence score between the second user input model and the second time-series data, and the product of the co-occurrence score between the second user input model and the first time-series data and the co-occurrence score between the first user input model and the second time-series data;
The largest of the products calculated over all possible user input model pairs, each consisting of any two user input models, is set as the correlation score of the time-series data pair that is the target of correlation score calculation. The knowledge extraction device according to any one of claims 14 to 16.

The knowledge extracting device according to any one of claims 14 to 17, wherein the priority determining unit assigns priorities in descending order of correlation scores of the time series data pairs.

The correlation coefficient calculation unit
A plurality of correlation coefficients are calculated for one time-series data pair by sequentially generating a predetermined delay between the time-series data and calculating a correlation coefficient for each delay,
The knowledge extraction device according to any one of claims 14 to 18, wherein a correlation coefficient having a maximum absolute value among the plurality of correlation coefficients is set as a correlation coefficient of the time series data pair.

A correlation determining unit that determines whether or not there is a correlation between the time-series data constituting the time-series data pair based on the correlation coefficient of the time-series data pair;
The knowledge extraction device according to any one of claims 14 to 19, wherein the correlation determination unit determines that there is a correlation when an absolute value of a correlation coefficient is equal to or greater than a predetermined value.

A computer program for causing a computer to function as a knowledge extraction device that extracts knowledge using text data, time-series data, and a plurality of user input models that are pieces of information correlated with one another,
Text dividing means for dividing text data into constituent words;
Appearance frequency measuring means for measuring the appearance frequency of the constituent words of the text data;
Model input means for inputting at least two or more user input models from an input device for inputting a user input model;
Time-series data co-occurrence score calculation means for calculating, for each user input model, a co-occurrence score indicating the degree to which the user input model and the time-series data appear together, based on the appearance frequencies of the constituent words of the text data;
Time-series data correlation score calculation means for calculating, based on the calculated co-occurrence scores, a correlation score indicating the strength of correlation of each time-series data pair created by combining any two of the time-series data;
Priority order determining means for determining a priority order for calculating a correlation coefficient of the time series data pair based on the correlation score of the time series data pair;
A computer program causing the computer to function as correlation coefficient calculation means for calculating the correlation coefficient of the time-series data pair according to the determined priority.

A knowledge extraction method for extracting knowledge using text data, time-series data, and a plurality of user input models that are pieces of information correlated with one another,
A text splitting step for splitting the text data into constituent words;
An appearance frequency measuring step for measuring an appearance frequency of the constituent words of the text data;
A model input step in which at least two or more user input models are input from an input device for inputting a user input model;
A time-series data co-occurrence score calculating step of calculating, for each user input model, a co-occurrence score indicating the degree to which the user input model and the time-series data appear together, based on the appearance frequencies of the constituent words of the text data;
A time-series data correlation score calculating step of calculating, based on the calculated co-occurrence scores, a correlation score indicating the strength of correlation of each time-series data pair created by combining any two of the time-series data;
A priority determining step for determining a priority for calculating a correlation coefficient of the time series data pair based on the correlation score of the time series data pair;
A correlation coefficient calculating step of calculating a correlation coefficient of the time series data pair according to the determined priority;
A knowledge extracting method characterized by comprising:

The time-series data co-occurrence score calculating step includes:
Calculating the co-occurrence score between the user input model and the time-series data using the appearance frequency, in the text data, of constituent words identical to the user input model and the appearance frequency of constituent words identical to the words corresponding to the time-series data. The knowledge extraction method according to claim 22.

A time-series data word list acquisition step of acquiring, for each time-series data ID stored in the time-series data storage unit in association with time-series data in order to specify that time-series data, a word list by acquiring one or more words corresponding to the same time-series data ID from the correspondence table storage unit, which stores words and time-series data IDs in association with each other;
The time series data co-occurrence score calculating step includes:
Calculating a co-occurrence score of each word constituting the word list of the time-series data and the user input model,
The knowledge extraction method according to claim 22 or 23, wherein the co-occurrence score having the maximum absolute value among the calculated co-occurrence scores is used as the co-occurrence score between the time-series data and the user input model.

The time series data correlation score calculating step includes:
A user input model pair creating step of creating a user input model pair composed of a first user input model and a second user input model which are arbitrary two user input models by combining the user input models;
A multiplication step of calculating, for a time-series data pair consisting of first time-series data and second time-series data that is the target of correlation score calculation, the product of the co-occurrence score between the first user input model and the first time-series data and the co-occurrence score between the second user input model and the second time-series data, and the product of the co-occurrence score between the second user input model and the first time-series data and the co-occurrence score between the first user input model and the second time-series data;
A correlation score determination step of setting the largest of the products calculated over all possible user input model pairs, each consisting of any two user input models, as the correlation score of the time-series data pair that is the target of correlation score calculation;
The knowledge extracting method according to any one of claims 22 to 24, characterized by comprising:

The knowledge extraction method according to any one of claims 22 to 25, wherein the priority order determining step assigns priorities in descending order of correlation scores of the time series data pairs.

The correlation coefficient calculating step includes:
Calculating a plurality of correlation coefficients for one time-series data pair by sequentially generating a predetermined delay between the time-series data and calculating a correlation coefficient for each delay;
Of the plurality of calculated correlation coefficients, the correlation coefficient having the maximum absolute value is set as the correlation coefficient of the time-series data pair; and
The knowledge extraction method according to any one of claims 22 to 26, characterized by comprising:

The knowledge extraction method according to any one of claims 22 to 27, further comprising a correlation determination step of determining, based on the correlation coefficient of the time-series data pair, whether or not the time-series data constituting the time-series data pair are correlated.