JP4567025B2

JP4567025B2 - Text classification device, text classification method, text classification program, and recording medium recording the program

Info

Publication number: JP4567025B2
Application number: JP2007128122A
Authority: JP
Inventors: 克人別所; 俊郎内山; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-05-14
Filing date: 2007-05-14
Publication date: 2010-10-20
Anticipated expiration: 2027-05-14
Also published as: JP2008282328A

Description

The present invention relates to a text classification device, method, and program for classifying a text into one of the categories of a predetermined category set, and to a recording medium on which the program is recorded.

As a conventional technique for calculating the degree of relevance between words, there is the method of Non-Patent Document 1. In this method, the co-occurrence frequencies between words in a text are calculated to create a word co-occurrence frequency matrix. Each row vector of the matrix represents the pattern in which the corresponding word co-occurs with other words. Since words with similar meanings tend to co-occur with the same words, their patterns also tend to be similar. The degree of relevance between two words is therefore calculated as the cosine of the corresponding vectors.
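The cosine measure described above can be sketched as follows. This is a minimal illustration, not the implementation of Non-Patent Document 1; the function name `cosine`, the example words, and the counts are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine between two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical rows of a word co-occurrence frequency matrix.
# Each column is a context word; the values are made-up counts.
cooc = {
    "doctor": [4, 3, 0],
    "nurse": [3, 4, 0],
    "engine": [0, 1, 5],
}
sim_close = cosine(cooc["doctor"], cooc["nurse"])
sim_far = cosine(cooc["doctor"], cooc["engine"])
```

With these counts, the two words that share context words score higher than the pair that does not. Note that the cosine is symmetric, so it cannot distinguish a superordinate word from a sibling word.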

In the method of Non-Patent Document 2, a dictionary in which semantic attributes are attached to words is used to calculate the co-occurrence frequencies between words and semantic attributes in a text, and a word/semantic-attribute co-occurrence frequency matrix is created. Each row vector of the matrix represents the pattern in which the corresponding word co-occurs with semantic attributes. Since words with similar meanings tend to co-occur with the same semantic attributes, their patterns also tend to be similar. The degree of relevance between two words is therefore calculated as the cosine of the corresponding vectors.

In both the method of Non-Patent Document 1 and that of Non-Patent Document 2, the accuracy of the degree of relevance between vectors is improved by applying singular value decomposition to the co-occurrence frequency matrix and converting it into a matrix with a reduced number of columns.

Given a set of categories, each associated with a set of words, the task of classifying a text into one of the categories is performed with the vectors generated above: the vector of a category or of a text is calculated as the centroid of the vectors of the words it contains, the cosine between these centroid vectors is taken as the degree of relevance between the text and the category, and a category with a high degree of relevance is selected as the classification destination.
Non-Patent Document 1: H. Schutze, Dimensions of Meaning, Proc. of Supercomputing '92, pp. 786-796, 1992.
Non-Patent Document 2: Katsuhito Bessho, Toshiro Uchiyama, Ryoji Kataoka: Concept base extension method based on co-occurrence between words and semantic attributes, Information Processing Society of Japan Research Report, Vol. SIG-ICS 144, pp. 29-34, 2006.

The above prior art uses the cosine between centroid vectors. With the cosine between vectors, the words derived as highly relevant to a given word include not only words that are superordinate or subordinate concepts of that word but also sibling words that share the same superordinate concept. As a result, a text may be misclassified into a category that is a sibling concept of its content. For example, for the word "mental illness", the cosine also derives sibling words such as "heart disease", which shares the superordinate concept "disease". As a result, a text about "mental illness" could be erroneously classified into a category about "heart disease".

The present invention has been conceived to solve this problem. Its object is to provide a text classification device, a text classification method, a text classification program, and a recording medium on which the program is recorded, which improve the accuracy of the task of classifying a text into one of the categories of a predetermined category set.

To solve the above problem, the text classification device according to claim 1 is a text classification device for classifying a text into one of the categories of a predetermined category set, comprising: word vector generation means that takes a corpus as input and generates, for an arbitrary word A, a word vector in which each coordinate corresponds to a word or a word semantic attribute and the value of the coordinate is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that coordinate; and inter-word relevance calculation means that, for an arbitrary word B, calculates the Kullback-Leibler distance between the word vector of word B and the word vector of an arbitrary word C, both obtained by the word vector generation means, ranks the words C in ascending order of the distance, and calculates the degree of relevance between word B and word C from the rank in the ranking.

The text classification method according to claim 7 is a text classification method for classifying a text into one of the categories of a predetermined category set, comprising: a word vector generation step in which word vector generation means takes a corpus as input and generates, for an arbitrary word A, a word vector in which each coordinate corresponds to a word or a word semantic attribute and the value of the coordinate is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that coordinate; and an inter-word relevance calculation step in which inter-word relevance calculation means, for an arbitrary word B, calculates the Kullback-Leibler distance between the word vector of word B and the word vector of an arbitrary word C, both obtained by the word vector generation means, ranks the words C in ascending order of the distance, and calculates the degree of relevance between word B and word C from the rank in the ranking.

In the text classification device according to claim 3, the word vector generation means comprises: morphological analysis means that morphologically analyzes the text and identifies the words necessary for processing; co-occurrence matrix generation means that, from the analysis result of the morphological analysis means, calculates, for an arbitrary pair of words or an arbitrary pair of a word and a word semantic attribute, the frequency obtained by counting over the entire text the events in which the pair co-occurs within each of one or more predetermined ranges of the text, and generates a co-occurrence frequency matrix by associating with each word a vector in which the value of each coordinate is the frequency calculated for the pair of that word and the word or word semantic attribute corresponding to the coordinate; and relative frequency calculation means that converts the value of each coordinate of each row vector of the co-occurrence frequency matrix generated by the co-occurrence matrix generation means into a relative frequency.

In the text classification method according to claim 9, the word vector generation step comprises: a morphological analysis step in which morphological analysis means morphologically analyzes the text and identifies the words necessary for processing; a co-occurrence matrix generation step in which co-occurrence matrix generation means, from the analysis result of the morphological analysis step, calculates, for an arbitrary pair of words or an arbitrary pair of a word and a word semantic attribute, the frequency obtained by counting over the entire text the events in which the pair co-occurs within each of one or more predetermined ranges of the text, and generates a co-occurrence frequency matrix by associating with each word a vector in which the value of each coordinate is the frequency calculated for the pair of that word and the word or word semantic attribute corresponding to the coordinate; and a relative frequency calculation step in which relative frequency calculation means converts the value of each coordinate of each row vector of the co-occurrence frequency matrix generated in the co-occurrence matrix generation step into a relative frequency.

In the text classification device according to claim 4, the degree of relevance in the inter-word relevance calculation means is calculated as the reciprocal of the rank in the ranking.

In the text classification method according to claim 10, the degree of relevance in the inter-word relevance calculation step is calculated as the reciprocal of the rank in the ranking.

In the text classification device according to claim 5, the degree of relevance in the inter-word relevance calculation means is calculated based on the cosine between the word vectors of words B and C.

In the text classification method according to claim 11, the degree of relevance in the inter-word relevance calculation step is calculated based on the cosine between the word vectors of words B and C.

According to the above configuration, for an arbitrary word B, the words C are ranked in ascending order of the Kullback-Leibler distance from word B to word C, so the words C are ranked in descending order of the degree to which they are a superordinate concept of word B. The magnitude of the Kullback-Leibler distance at the top of the ranking differs from one word B to another, but regardless of this magnitude, the top of the ranking is always occupied by superordinate concepts. Therefore, the degree of relevance calculated from the rank in the ranking accurately represents the degree to which [if B then C] holds.
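The ranking by Kullback-Leibler distance and the reciprocal-rank relevance of claims 4 and 10 can be sketched as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `kl_divergence` and `relatedness_by_rank` are hypothetical, the smoothing constant `eps` that guards against zero coordinates is an assumption, the ranking is assumed to exclude word B itself, and the three-coordinate vectors are made-up relative frequencies.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) for two relative-frequency vectors.

    eps is an assumed smoothing constant that keeps the distance finite
    when q has a zero where p does not.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def relatedness_by_rank(word_b, vectors):
    """Rank every other word C by D(B || C), ascending; relevance is 1 / rank."""
    others = [w for w in vectors if w != word_b]  # assumption: B itself is not ranked
    ranked = sorted(others, key=lambda c: kl_divergence(vectors[word_b], vectors[c]))
    return {c: 1.0 / rank for rank, c in enumerate(ranked, start=1)}

# Made-up word vectors (each row sums to 1).
vectors = {
    "mental illness": [0.8, 0.2, 0.0],
    "heart disease": [0.0, 0.2, 0.8],
    "disease": [0.4, 0.2, 0.4],
}
scores = relatedness_by_rank("mental illness", vectors)
```

In this toy data the broader word "disease" covers the contexts of "mental illness" and so has a small distance from it, while the sibling "heart disease" has a large one. Because the distance is asymmetric, the distance from "disease" to "mental illness" is much larger than the distance in the opposite direction.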

The text classification device according to claim 2 is the text classification device of claim 1 further comprising text-category relevance calculation means that takes as input a set of categories, each associated with a set of words, and an arbitrary text, and calculates, for each category, the degree of relevance between the text and the category from the degrees of relevance, calculated by the inter-word relevance calculation means, between a word D in the text and a word E in the category.

In the text classification device according to claim 6, the text-category relevance calculation means morphologically analyzes each character string Zpq of a category Cp with which OR-combined character strings are associated, and converts each character string Zpq into a set of pairs of a distinct base form of a necessary word in the character string and its frequency; morphologically analyzes an arbitrary text U and converts it into a set of pairs of a distinct base form of a necessary word in the text U and its frequency; calculates the degree of relevance between a word D in the text U and a character string Zpq in the category Cp based on the degree of relevance, calculated by the inter-word relevance calculation means, between the word D in the text U and a word E in the category Cp; and calculates the degree of relevance between the arbitrary text U and an arbitrary category Cp.

The text classification method according to claim 8 is the text classification method of claim 7 further comprising a text-category relevance calculation step in which text-category relevance calculation means takes as input a set of categories, each associated with a set of words, and an arbitrary text, and calculates, for each category, the degree of relevance between the text and the category from the degrees of relevance, calculated by the inter-word relevance calculation means, between a word D in the text and a word E in the category.

In the text classification method according to claim 12, the text-category relevance calculation step morphologically analyzes each character string Zpq of a category Cp with which OR-combined character strings are associated, and converts each character string Zpq into a set of pairs of a distinct base form of a necessary word in the character string and its frequency; morphologically analyzes an arbitrary text U and converts it into a set of pairs of a distinct base form of a necessary word in the text U and its frequency; calculates the degree of relevance between a word D in the text U and a character string Zpq in the category Cp based on the degree of relevance, calculated by the inter-word relevance calculation means, between the word D in the text U and a word E in the category Cp; and calculates the degree of relevance between the arbitrary text U and an arbitrary category Cp.

According to the above configuration, the text-category relevance is calculated from degrees of relevance that represent the degree to which [if D then E] holds between a word D in the text and a word E in the category, so the text-category relevance accurately represents the degree to which [if the text then the category] holds. Classifying texts based on this degree of relevance suppresses misclassification of a text into a category that is a sibling concept of its content, and the classification accuracy is improved.

The text classification program according to claim 13 is a text classification program for causing a computer to function as the word vector generation means and the inter-word relevance calculation means according to any one of claims 1 to 6.

The text classification program according to claim 14 is a text classification program for causing a computer to function as the text-category relevance calculation means according to any one of claims 2 to 6.

The recording medium according to claim 15 is a computer-readable recording medium on which the text classification program according to claim 13 or 14 is recorded.

(1) According to the inventions of claims 1 to 13 and 15, for an arbitrary word B, the words C are ranked in ascending order of the Kullback-Leibler distance from word B to word C, so the words C are ranked in descending order of the degree to which they are a superordinate concept of word B. The magnitude of the Kullback-Leibler distance at the top of the ranking differs from one word B to another, but regardless of this magnitude, the top of the ranking is always occupied by superordinate concepts. Therefore, the degree of relevance calculated from the rank in the ranking accurately represents the degree to which [if B then C] holds.
(2) According to the inventions of claims 2 to 15, the text-category relevance is calculated from degrees of relevance that represent the degree to which [if D then E] holds between a word D in the text and a word E in the category, so the text-category relevance accurately represents the degree to which [if the text then the category] holds. Classifying texts based on this degree of relevance suppresses misclassification of a text into a category that is a sibling concept of its content, and the classification accuracy is improved.

Hereinafter, embodiments of the present invention will be described with reference to the drawings, but the present invention is not limited to the following embodiments.

FIG. 1 shows a configuration example of the text classification device in an embodiment of claims 1 and 2 of the present invention. The device of claim 1 comprises word vector generation means 101 and inter-word relevance calculation means 102; the device of claim 2 adds text-category relevance calculation means 103 to these.

The functions, described later, of the word vector generation means 101, the inter-word relevance calculation means 102, and the text-category relevance calculation means 103 are realized by, for example, a computer.

The word vector generation means 101 takes a corpus as input and generates, for an arbitrary word A, a word vector in which each coordinate corresponds to a word or a word semantic attribute and the value of the coordinate is the relative value of the co-occurrence frequency between word A and the word or word semantic attribute corresponding to that coordinate. FIG. 2 shows a detailed configuration example of the word vector generation means 101, which has morphological analysis means 201, co-occurrence matrix generation means 202, relative frequency calculation means 203, and a word dictionary 204.
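As a rough illustration of the relative-value conversion performed on each row of the co-occurrence frequency matrix, the sketch below divides each count by the row total. That normalization is an assumption: this passage does not spell out the exact formula, and the function name `to_relative_frequencies` is hypothetical.

```python
def to_relative_frequencies(row):
    """Convert a row of co-occurrence counts into relative frequencies.

    Assumption: the relative value is count / row total. An all-zero row
    is returned as zeros rather than raising a division error.
    """
    total = sum(row)
    if total == 0:
        return [0.0] * len(row)
    return [count / total for count in row]

word_vector = to_relative_frequencies([2, 1, 1])
```

Normalizing each row in this way makes it a probability distribution over the coordinates, which is the form a Kullback-Leibler distance between word vectors requires.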

In FIG. 2, the morphological analysis means 201 morphologically analyzes a text. This analysis is performed with reference to the word dictionary 204; FIG. 3 shows an example of the contents of the dictionary. In the figure, one record of the word dictionary holds the information on one word and consists of three items separated by commas. The first item is the surface form of the word, the second item is its part-of-speech information, and the third item is its semantic attributes. When the word vector generation means 101 takes word-to-word co-occurrence, the semantic attributes of the third item may be omitted. For conjugated words, the base form may also be registered.

The semantic attribute of a word represents the semantic category to which the word belongs. A semantic category is, in general, a concept that abstracts things, and is generally obtained by a person examining the meaning of each individual word. The set of semantic categories forms a semantic system such as the one shown in FIG. 4. In FIG. 4, each semantic category is expressed in words, but a semantic category itself is a concept that is not necessarily expressed in words. Each semantic category is given an ID that identifies it. In this embodiment, for convenience, this ID is identified with the semantic attribute.

The semantic system itself represents a system of superordinate and subordinate concepts, but because it is created by hand, creating it takes great effort and the result depends on the arbitrary judgment of its creator, so the system is generally highly incomplete. For example, words such as "mental illness" and "depression" may both be classified under the semantic attribute "disease", so that these words are not placed in a superordinate/subordinate relationship with each other. Conversely, words that belong to semantic attributes with no superordinate/subordinate relationship may in fact be in such a relationship. Thus, because the semantic system is created by hand, it contains valuable information on the meanings of words, but it does not represent superordinate/subordinate relationships accurately or exhaustively.

In the word dictionary of FIG. 3, a content word generally corresponds to one or more semantic attributes. In FIG. 3, multiple semantic attributes are separated by colons. When one word has multiple semantic attributes, they may be listed in order of how often they are used. When a new word is registered in the word dictionary, a person generally examines the part of speech and meaning of the word and assigns the existing semantic attributes that correspond to it.
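A record in the format described above (three comma-separated items, with colon-separated semantic attributes) could be read as in the following sketch. The function name `parse_dictionary_record`, the dictionary keys, and the part-of-speech label `noun` in the sample record are hypothetical; FIG. 3 itself is not reproduced here.

```python
def parse_dictionary_record(line):
    """Split one dictionary record into surface form, part of speech, and semantic attributes.

    The third item is optional, since it may be omitted for
    word-to-word co-occurrence.
    """
    fields = line.strip().split(",")
    surface, pos = fields[0], fields[1]
    attrs = fields[2].split(":") if len(fields) > 2 and fields[2] else []
    return {"surface": surface, "pos": pos, "semantic_attributes": attrs}

# Sample record; the part-of-speech label is a made-up placeholder.
record = parse_dictionary_record("米,noun,11:91")
```

Keeping the attribute list in its original order preserves the "most frequently used first" ordering that the dictionary may carry.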

FIG. 5 shows an example of a text input to the morphological analysis means 201, and FIG. 6 shows an example of the morphological analysis result for the text of FIG. 5. Morphemes are separated by "/". Each morpheme consists of a surface form, a base form, part-of-speech information, semantic attributes, and a necessary-word flag, separated by ",". When the base form is not registered in the word dictionary 204, it is derived from the surface form and the part-of-speech information after morphological analysis. For a word that has no base form, the surface form is used as the base form. The necessary-word flag is null immediately after morphological analysis. When the word vector generation means 101 takes word-to-word co-occurrence, the morphological analysis result need not have a semantic attribute column.
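The output format described above could be parsed as in this sketch. It assumes that every morpheme carries exactly five comma-separated fields and that semantic attributes within a morpheme are colon-separated as in the dictionary; the function name, dictionary keys, sample string, and part-of-speech labels are hypothetical.

```python
def parse_morphemes(analysis):
    """Parse a '/'-separated morphological analysis result into a list of dicts."""
    morphemes = []
    for chunk in analysis.split("/"):
        if not chunk:
            continue
        surface, base, pos, attrs, flag = chunk.split(",")
        morphemes.append({
            "surface": surface,
            "base": base or surface,  # no base form: fall back to the surface form
            "pos": pos,
            "semantic_attributes": attrs.split(":") if attrs else [],
            "necessary": int(flag) if flag else None,  # null right after analysis
        })
    return morphemes

# Made-up analysis result with two morphemes and empty necessary-word flags.
morphemes = parse_morphemes("米,米,noun,11:91,/を,,particle,,")
```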

When taking word/semantic-attribute co-occurrence, the word dictionary for morphological analysis may be kept separate from a dictionary that stores the set of pairs of a word and its semantic attributes; whenever the semantic attributes of a word are needed during processing, the latter dictionary is searched for that word to obtain them. The following description assumes that the semantic attributes are also stored in the word dictionary for morphological analysis and are also output in the morphological analysis result.

Next, by referring to an unnecessary word table and an unnecessary part-of-speech table, it is determined whether each morpheme in the morphological analysis result is a word necessary for the subsequent processing. If it is a necessary word, its necessary-word flag is set to 1; otherwise the flag is set to 0. FIG. 7 shows an example of the unnecessary word table, in which the base form of each unnecessary word is written as one record. FIG. 8 shows an example of the unnecessary part-of-speech table, in which each unnecessary part of speech is written as one record. If the base form of the target morpheme matches a record in the unnecessary word table, or if its part-of-speech information matches a record in the unnecessary part-of-speech table, the morpheme is judged not to be a necessary word. By this processing, the morphological analysis result of FIG. 6 becomes that of FIG. 9.
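The flagging rule above amounts to two set-membership tests, as in the following sketch. The function name `flag_necessary_words`, the dictionary keys, and the sample table contents are hypothetical; the actual tables are those of FIG. 7 and FIG. 8.

```python
def flag_necessary_words(morphemes, unnecessary_words, unnecessary_pos):
    """Set the necessary-word flag to 0 if the base form or the part of speech is listed, else 1."""
    for m in morphemes:
        unneeded = m["base"] in unnecessary_words or m["pos"] in unnecessary_pos
        m["necessary"] = 0 if unneeded else 1
    return morphemes

# Made-up tables and morphemes for illustration.
unnecessary_words = {"する"}
unnecessary_pos = {"particle"}
morphemes = [
    {"base": "米", "pos": "noun", "necessary": None},
    {"base": "を", "pos": "particle", "necessary": None},
    {"base": "する", "pos": "verb", "necessary": None},
]
flags = [m["necessary"] for m in flag_necessary_words(morphemes, unnecessary_words, unnecessary_pos)]
```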

The co-occurrence matrix generation means 202 obtains the set of distinct base forms of necessary words from the morphological analysis result produced by the morphological analysis means 201. Next, when taking word/semantic-attribute co-occurrence, it generates a co-occurrence frequency matrix, as in FIG. 10, between the obtained set of distinct base forms of necessary words (simply called the word set) and the semantic attribute set. Each row of the co-occurrence frequency matrix corresponds to one word and each column to one semantic attribute. Each row vector is a vector for the corresponding word in which each coordinate corresponds to a semantic attribute and the value of the coordinate is the co-occurrence frequency between the word and that semantic attribute. All coordinate values of every row vector are initialized to 0.

When taking word-to-word co-occurrence, each column is made to correspond to a distinct base form of a necessary word. In this case, to reduce the amount of computation in later processing, the set of distinct base forms corresponding to the columns may be limited to high-frequency words in the input text.

Next, when taking word/semantic-attribute co-occurrence, the range of the text over which the frequency of co-occurrence between words and semantic attributes is calculated is determined. The predetermined range may be, for example, one sentence, one paragraph, or a sequence of a predetermined number of words. When the predetermined range is one sentence, the first sentence in the text is processed first. When processing of the current sentence is finished, the next sentence becomes the processing target. When processing of the last sentence is finished, no sentence remains to be processed, and the processing of the co-occurrence matrix generation means 202 ends. The same applies when another predetermined range is used.

The processing for each target range of the text is performed as follows.

First, the frequencies of the semantic attributes of the base forms of the necessary words in the range are counted (multiple occurrences of the same base form are treated as separate items). From the morphological analysis result for one range of text shown in FIG. 9, the frequency hash of FIG. 11, which is a set of pairs of a semantic attribute and its frequency, is obtained.

Here, if the word dictionary 204 lists the semantic attributes of each word in order of how commonly they are used, and the morphemes in the morphological analysis result preserve that order, only a specified number of semantic attributes from the beginning of each morpheme's attribute list may be counted.

Next, the following processing is performed for each necessary-word base form in the range (multiple occurrences of the same base form are treated separately). In the row vector of the co-occurrence frequency matrix corresponding to that base form, the frequency of each semantic attribute in the frequency hash is added to the value of the coordinate corresponding to that semantic attribute.

Here, co-occurrence between a necessary-word base form and the semantic attributes within its own morpheme may be left uncounted. That is, a temporary frequency hash is generated from the frequency hash by subtracting 1 from the value of each counted semantic attribute held by that base form, and the frequencies in this temporary frequency hash are added instead. For example, when processing the morphological analysis result for one range of text shown in FIG. 9, the necessary-word base form "米" holds semantic attributes 11 and 91, so the temporary frequency hash of FIG. 12 is obtained from the frequency hash of FIG. 11. When the co-occurrence frequency matrix is in the state of FIG. 10 and the morphological analysis result shown in FIG. 9 is processed, the co-occurrence frequency matrix of FIG. 13 is obtained.

When all ranges have been processed, a matrix recording the co-occurrence frequency between every word and every semantic attribute in the input text is obtained. Each row vector of this matrix is the vector of the corresponding word that the co-occurrence matrix generation means 202 is to produce.
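The per-range update described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `add_range`, the input layout (a list of `(base_form, semantic_attributes)` pairs per range), and the sample words and attribute IDs are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def add_range(matrix, morphemes, exclude_self=False):
    """Add the co-occurrence counts of one text range to `matrix`.

    `morphemes` lists one (base_form, semantic_attributes) pair per
    necessary-word occurrence in the range (hypothetical input layout).
    """
    # Frequency hash: how often each semantic attribute occurs in the range.
    freq_hash = Counter(attr for _, attrs in morphemes for attr in attrs)
    for base_form, attrs in morphemes:
        temp = Counter(freq_hash)
        if exclude_self:
            # Temporary frequency hash: subtract this occurrence's own attributes.
            for attr in attrs:
                temp[attr] -= 1
        for attr, count in temp.items():
            if count > 0:
                matrix[base_form][attr] += count
    return matrix


# Hypothetical range; the words and attribute IDs are made up for illustration.
sample = [("apple", [11, 91]), ("eat", [20]), ("apple", [11, 91])]
matrix = add_range(defaultdict(Counter), sample, exclude_self=True)
```

With `exclude_self=True`, a base form that occurs m times and holds an attribute occurring n times in the range receives m×(n−1) for that attribute; with `exclude_self=False` it receives m×n.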

The calculation above of the co-occurrence frequency between words and semantic attributes within one range can be carried out without any step that counts word frequencies. The word-to-semantic-attribute co-occurrence frequency matrix can therefore be generated without an intermediate step of generating a word-to-word co-occurrence frequency matrix.

Here, when the base form M of a necessary word appears m times and the semantic attribute N appears n times in one range, the co-occurrence frequency of M and N in that range is taken to be m×n or m×(n−1). Alternatively, the co-occurrence frequency in the range may be set to 1 even when M or N appears multiple times in the same range.

When co-occurrence between words is used, the co-occurrence matrix generation means 202 calculates co-occurrence frequencies as described above for co-occurrence between words and semantic attributes, with "semantic attribute (of a necessary-word base form)" replaced by the distinct necessary-word base form corresponding to each column of the co-occurrence frequency matrix.

The relative frequency calculation means 203 converts the value of each coordinate of the vectors generated by the co-occurrence matrix generation means 202 into a relative frequency.

Suppose that the vector of a word generated by the co-occurrence matrix generation means 202 is (a_1, a_2, …, a_n). The relative frequency calculation means 203 converts this vector into

$$ (x_1, x_2, \ldots, x_n), \qquad x_i = \frac{a_i}{\sum_{j=1}^{n} a_j}. $$

Each x_i (1 ≤ i ≤ n) is the relative frequency of the corresponding coordinate value of the original vector with respect to the sum of all coordinate values.

Since

$$ \sum_{i=1}^{n} x_i = 1, $$

the converted vector can be regarded as a probability distribution in which each coordinate is treated as a random variable and its coordinate value as the probability.
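As a small illustration of this conversion, the following Python sketch divides each coordinate by the sum of all coordinates. The function name and the handling of an all-zero row (left as zeros) are assumptions for the example; the source does not say how such a row is treated.

```python
def to_relative_frequency(vector):
    """Convert raw co-occurrence counts into relative frequencies."""
    total = sum(vector)
    if total == 0:
        # Assumption: a row with no co-occurrences is left as all zeros.
        return [0.0] * len(vector)
    return [a / total for a in vector]


x = to_relative_frequency([2, 1, 1])
```

The coordinates of `x` sum to 1, so the row can be read as a probability distribution over the columns.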

For an arbitrary word X, the inter-word relevance calculation means 102 calculates the Kullback-Leibler distance (a distance between two probability distributions) between the word vector of X and the word vector of an arbitrary word Y, both obtained by the word vector generation means 101, ranks the words Y in ascending order of distance, and calculates the relevance between X and Y from the rank of Y. FIG. 14 is a detailed flowchart of the inter-word relevance calculation means 102.

The following description of the inter-word relevance calculation means 102 assumes the configuration in which each coordinate of a vector corresponds to a semantic attribute. The configuration in which each coordinate corresponds to a distinct necessary-word base form is described in the same way.
(Step S1)
One word X to be processed is selected from the words not yet processed. If there is such a word, the process proceeds to step S2; otherwise, the inter-word relevance calculation means 102 ends its processing.
(Step S2)
One word Y to be processed is selected from the words not yet processed for the current word X. If there is such a word, the process proceeds to step S3; otherwise, the process proceeds to step S4.
(Step S3)
For the word pair X, Y, let the vectors v(X), v(Y) of X and Y be
v(X) := (x_1, x_2, …, x_n)
v(Y) := (y_1, y_2, …, y_n)
Then the Kullback-Leibler distance P(X, Y) from X to Y is calculated as

$$ P(X, Y) = \sum_{i=1}^{n} x_i \log \frac{x_i}{y_i}. $$

Here, the following is defined:

$$ 0 \cdot \log \frac{0}{y_i} = 0, \qquad x_i \log \frac{x_i}{0} = \infty \quad (x_i \neq 0). $$

With this definition, however, P(X, Y) = ∞ whenever there is even one coordinate i with x_i ≠ 0 and y_i = 0. Many word pairs in a superordinate-subordinate relation then have distance ∞, which lowers the recall of superordinate-subordinate word pairs.

When the distance value is always to be finite, the following is used.

P(X, Y) represents the expected information loss when the actual distribution is v(X) and it is approximated by v(Y). The term x_i log(x_i / y_i) becomes particularly large when y_i is 0 or close to 0 and x_i is much larger than y_i. When Y is a superordinate concept of X, Y generally tends to co-occur with the semantic attribute of coordinate i whenever X does, so this situation rarely arises. Consequently, when Y is a superordinate concept of X, the Kullback-Leibler distance P(X, Y) tends to be smaller than when Y is not a superordinate concept of X.

The Euclidean distance between vectors normalized to length 1 is equivalent to cosine similarity. Because this distance measure regards vectors whose coordinate values differ little as close, it judges not only words in a superordinate-subordinate relation but also sibling words to be relatively close.

With the Kullback-Leibler distance, once a word X is specified, a word Y whose Kullback-Leibler distance P(X, Y) is relatively small can be detected as a superordinate-concept word of X.
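The asymmetry that makes this distance useful for detecting superordinate concepts can be seen in a short Python sketch. The function name and the toy vectors are illustrative assumptions; the zero-coordinate handling follows the usual conventions for this distance (a term with x_i = 0 contributes 0, and a term with x_i ≠ 0 and y_i = 0 makes the distance infinite), and the finite-valued variant mentioned in the text is not reproduced here.

```python
import math


def kl_distance(x, y):
    """Kullback-Leibler distance P(X, Y) from distribution x to distribution y."""
    total = 0.0
    for xi, yi in zip(x, y):
        if xi == 0:
            continue  # a term with x_i = 0 contributes nothing
        if yi == 0:
            return math.inf  # x_i != 0 but y_i = 0
        total += xi * math.log(xi / yi)
    return total


# Toy distributions: `narrow` co-occurs with two attributes, `broad` with all four.
narrow = [0.5, 0.5, 0.0, 0.0]
broad = [0.25, 0.25, 0.25, 0.25]
narrow_to_broad = kl_distance(narrow, broad)
broad_to_narrow = kl_distance(broad, narrow)
```

Here the distance from `narrow` to `broad` is finite (log 2), while the reverse direction is infinite, which mirrors the idea that a broader word covers the co-occurrence pattern of a narrower one but not vice versa.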

After step S3 is finished, the process returns to step S2.
(Step S4)
The words Y are ranked in ascending order of P(X, Y). This is the descending order of the degree to which Y is a superordinate concept of the word X. FIG. 15 shows the ranking of the words Y when X is "精神病" (mental illness).

To reduce the amount of computation needed to calculate P(X, Y) for arbitrary word pairs X, Y, the set of words Y may be limited, for example, to the union of the word sets associated with the classification target categories and the set of high-frequency words in the corpus.

After step S4 is finished, the process proceeds to step S5.
(Step S5)
For an arbitrary word Y whose rank is m, the relevance E(X, Y) between X and Y, which represents the degree to which [if X then Y] holds, is calculated, as one example, as

$$ E(X, Y) = \frac{1}{m}. $$

The magnitudes of the Kullback-Leibler distances at the top of the ranking differ from one word X to another. However, the Kullback-Leibler distance P(X, Y') to a word Y' that is a superordinate concept of X is smaller than the Kullback-Leibler distance P(X, Y'') to a word Y'' that is not a superordinate concept of X, so the top of the ranking is always occupied by superordinate concepts, regardless of how the distance magnitudes differ between words X. The rank in the ranking therefore represents the degree to which Y is a superordinate concept of X, and the relevance calculated from this rank accurately represents the degree to which [if X then Y] holds.
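Steps S4 and S5 can be sketched in Python as below, using the reciprocal of the rank as the example relevance. The function name, the sample words, and the distance values are made up for illustration, and ties are broken by insertion order, which is an assumption the source does not address.

```python
def rank_relevance(distances):
    """Map each word Y to 1/m, where m is its rank by ascending P(X, Y).

    `distances` maps a candidate word Y to P(X, Y) for one fixed word X.
    """
    ranked = sorted(distances, key=distances.get)  # smallest distance first
    return {y: 1.0 / m for m, y in enumerate(ranked, start=1)}


# Hypothetical distances from one word X to three candidate words Y.
relevance = rank_relevance({"illness": 0.5, "disease": 0.2, "car": 3.0})
```

Because only the rank is used, the result does not depend on the absolute scale of the distances for a given X.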

Here, for example, the cosine between the vectors v'(X), v'(Y) of X and Y obtained by the method of Non-Patent Document 1 or Non-Patent Document 2,

$$ CO(X, Y) = \frac{v'(X) \cdot v'(Y)}{\lVert v'(X) \rVert \, \lVert v'(Y) \rVert}, $$

may be converted to a value between 0 and 1 as CO'(X, Y) = 0.5·CO(X, Y) + 0.5 and combined with E(X, Y) as
E'(X, Y) = α·E(X, Y) + (1−α)·CO'(X, Y) (0 ≤ α ≤ 1)
Then E'(X, Y) is a relevance that combines the associative tendencies of both E(X, Y) and CO(X, Y). This relevance may be used as the final relevance derived by the inter-word relevance calculation means 102.
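The combination of the rank-based relevance with the rescaled cosine can be sketched in Python as follows. The function names and the default α = 0.5 are illustrative assumptions; the source only requires 0 ≤ α ≤ 1.

```python
import math


def cosine(u, v):
    """Cosine between two vectors; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_relevance(e, co, alpha=0.5):
    """E'(X, Y) = alpha * E(X, Y) + (1 - alpha) * CO'(X, Y)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    co_scaled = 0.5 * co + 0.5  # CO'(X, Y): maps [-1, 1] onto [0, 1]
    return alpha * e + (1.0 - alpha) * co_scaled


# Orthogonal toy vectors give CO = 0, hence CO' = 0.5.
e_prime = combined_relevance(1.0, cosine([1, 0], [0, 1]), alpha=0.5)
```

Setting α = 1 recovers the rank-based relevance alone, and α = 0 uses only the rescaled cosine.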

After step S5 is finished, the process returns to step S1.

The text-category relevance calculation means 103 takes as input a set of categories, each associated with a set of words, and an arbitrary text. For each category, it calculates the relevance between the text and the category from the relevance, calculated by the inter-word relevance calculation means 102, between each word X in the text and each word Y in the category.

Suppose that each category C_p (1 ≤ p ≤ h) is associated with the following OR combination of s_p character strings Z_pq (1 ≤ q ≤ s_p):

$$ C_p = Z_{p1} \ \mathrm{OR} \ Z_{p2} \ \mathrm{OR} \ \cdots \ \mathrm{OR} \ Z_{p s_p} $$

An example is the category "TV program OR radio program".

Each character string Z_pq (1 ≤ q ≤ s_p) is morphologically analyzed.

Each character string Z_pq is converted into a set of pairs of a distinct necessary-word base form Y_pqr (1 ≤ r ≤ t_pq) in it and the frequency TF_pqr (1 ≤ r ≤ t_pq) of that base form, written as follows:

$$ Z_{pq} = (Y_{pq1} : TF_{pq1},\ Y_{pq2} : TF_{pq2},\ \ldots,\ Y_{pq t_{pq}} : TF_{pq t_{pq}}) $$

From the category "TV program OR radio program", "(TV: 1, program: 1) OR (radio: 1, program: 1)" is obtained.

An arbitrary text U is morphologically analyzed and converted into a set of pairs of a distinct necessary-word base form X_ug (1 ≤ g ≤ t_u) in it and the frequency TF_ug (1 ≤ g ≤ t_u) of that base form, written as follows:

$$ U = (X_{u1} : TF_{u1},\ X_{u2} : TF_{u2},\ \ldots,\ X_{u t_u} : TF_{u t_u}) $$

The morphological analysis of each character string Z_pq of the category C_p and of the text U is performed by the morphological analysis means 201 of the word vector generation means 101.

With an arbitrary text U fixed, the relevance E(U, C_p) between U and an arbitrary category C_p, which represents the degree to which [if text U then category C_p] holds, is calculated by the flow of FIG. 16.
(Step S11)
One category C_p to be processed is selected from the categories not yet processed. If there is such a category, the process proceeds to step S12; otherwise, the process ends.
(Step S12)
One character string Z_pq to be processed is selected from the character strings in category C_p not yet processed. If there is such a character string, the process proceeds to step S13; otherwise, the process proceeds to step S16.
(Step S13)
One word X_ug to be processed is selected from the words in the text U not yet processed. If there is such a word, the process proceeds to step S14; otherwise, the process proceeds to step S15.
(Step S14)
The relevance E(X_ug, Z_pq) between X_ug and Z_pq, which represents the degree to which [if X_ug then Z_pq] holds, is calculated by the following equation.

Here, TF_pqr = 0 is used when E(X_ug, Y_pqr) does not exist, and E(X_ug, Z_pq) = 0 is used when the denominator of E(X_ug, Z_pq) is 0.

The relevance E(X_ug, Y_pqr) between a word X_ug in the text U and a word Y_pqr in the category C_p, which represents the degree to which [if X_ug then Y_pqr] holds, is calculated by the inter-word relevance calculation means 102.

After step S14 is finished, the process returns to step S13.
(Step S15)
The relevance E(U, Z_pq) between U and Z_pq, which represents the degree to which [if U then Z_pq] holds, is calculated by the following equation.

Here, E(U, Z_pq) = 0 is used when the denominator of E(U, Z_pq) is 0.

After step S15 is finished, the process returns to step S12.
(Step S16)
The relevance E(U, C_p) between U and C_p, which represents the degree to which [if U then C_p] holds, is calculated by the following equation.

After step S16 is finished, the process returns to step S11.
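The loop structure of steps S11 to S16 can be sketched in Python as below. The equations for steps S14 to S16 are not reproduced in this text, so the aggregation rules used here are assumptions chosen only to be consistent with the stated edge cases: a TF-weighted average for steps S14 and S15 (a missing inter-word relevance drops the term, and a zero denominator yields 0), and the maximum over the OR-joined character strings for step S16. All names and sample values are hypothetical.

```python
def weighted_mean(pairs):
    """TF-weighted mean of (weight, value) pairs; 0.0 when the total weight is 0."""
    pairs = list(pairs)
    total = sum(w for w, _ in pairs)
    return sum(w * v for w, v in pairs) / total if total else 0.0


def text_category_relevance(text_tf, category, word_rel):
    """Relevance of one text to one category (assumed aggregation rules).

    text_tf:  dict word -> TF in the text U
    category: list of dicts (one per OR-joined string Z_pq), word -> TF
    word_rel: dict (x, y) -> E(x, y); pairs not present are skipped
    """
    string_scores = []
    for z in category:  # step S12: each character string of the category
        def word_to_string(x):  # step S14 (assumed TF-weighted average)
            return weighted_mean(
                (tf, word_rel[(x, y)]) for y, tf in z.items() if (x, y) in word_rel
            )

        string_scores.append(  # step S15 (assumed TF-weighted average)
            weighted_mean((tf, word_to_string(x)) for x, tf in text_tf.items())
        )
    return max(string_scores, default=0.0)  # step S16 (assumed maximum over OR)


# Hypothetical inputs for the category "TV program OR radio program".
text_tf = {"drama": 1, "tonight": 1}
category = [{"TV": 1, "program": 1}, {"radio": 1, "program": 1}]
word_rel = {("drama", "TV"): 1.0, ("drama", "program"): 0.5, ("drama", "radio"): 0.25}
score = text_category_relevance(text_tf, category, word_rel)
```

Looping this function over all categories corresponds to step S11; a different aggregation for step S16 would only change the last line of the function.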

The calculation in step S14 is based on the relevance E(X_ug, Y_pqr) between a word X_ug in the text U and a word Y_pqr in the category C_p, which represents the degree to which [if X_ug then Y_pqr] holds (that is, the relevance calculated by the inter-word relevance calculation means 102). Since E(U, C_p) is built from these values E(X_ug, Y_pqr), it accurately represents the degree to which [if U then C_p] holds.

In the above, the TFIDF_i of each distinct word W_i, given by the following equation, may be used in place of its frequency TF_i.

Here, DN is the number of documents in a corpus, and ON_i is the number of documents in that corpus in which the distinct word W_i appears. In step S14, TFIDF_pqr = 0 is used when TFIDF_pqr does not exist, and in step S15, TFIDF_ug = 0 is used when TFIDF_ug does not exist.
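For illustration only, a common TF-IDF weighting is sketched below in Python. The exact equation of this document is not reproduced in the text, so the form TF·log(DN/ON) is an assumption, as are the function name and the rule that a word absent from the corpus (ON = 0) gets weight 0.

```python
import math


def tfidf(tf, dn, on):
    """Assumed weighting: TF * log(DN / ON), or 0.0 when the word is unseen."""
    if on == 0:
        return 0.0  # assumption: no document frequency means no weight
    return tf * math.log(dn / on)


# A word occurring twice, found in 10 of 100 corpus documents.
weight = tfidf(2, 100, 10)
```

Under this assumed form, a word that appears in every document of the corpus receives weight 0, since log(DN/ON) = 0.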

In this way, E(U, C_p) is obtained for each category C_p. The text U can be classified into the category C_p with the largest E(U, C_p), or into every category C_p whose E(U, C_p) is at or above a threshold. Because E(U, C_p) is the degree to which [if U then C_p] holds, misclassification of a text into a category that is a sibling concept of its content is suppressed, and classification accuracy improves.
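The two classification rules can be sketched in Python as follows. The function names, category labels, and scores are hypothetical.

```python
def best_category(scores):
    """Return the category with the largest E(U, C_p)."""
    return max(scores, key=scores.get)


def categories_above(scores, threshold):
    """Return every category whose E(U, C_p) is at or above the threshold."""
    return [c for c, s in scores.items() if s >= threshold]


# Hypothetical relevance values E(U, C_p) for one text U.
scores = {"TV program OR radio program": 0.375, "sports": 0.1, "cooking": 0.02}
top = best_category(scores)
selected = categories_above(scores, 0.05)
```

The first rule always assigns exactly one category, while the threshold rule may assign several categories or none.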

The processing of the embodiment above can be built as a program, installed from a communication line or a storage medium, and executed by means such as a CPU.

That is, a recording medium on which the program is recorded may be supplied to a system or apparatus, and the CPU (MPU) of that system or apparatus may read and execute the program stored on the recording medium. In this case, the program read from the recording medium itself realizes the functions of the embodiment. Examples of recording media for the program include CD-ROM, DVD-ROM, CD-R, CD-RW, MO, and HDD.

The present invention is not limited to the embodiment above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to language processing technology.

FIG. 1 is a configuration diagram of a text classification device according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of the word vector generation means in the text classification device of the embodiment.
FIG. 3 is an explanatory diagram showing an example of the contents of the word dictionary in the word vector generation means of the embodiment.
FIG. 4 is an explanatory diagram showing the set of semantic categories used in the embodiment.
FIG. 5 is an explanatory diagram showing an example of text input to the morphological analysis means in the embodiment.
FIG. 6 is an explanatory diagram showing an example of an intermediate result of morphological analysis in the embodiment.
FIG. 7 is an explanatory diagram showing an example of the unnecessary word table in the embodiment.
FIG. 8 is an explanatory diagram showing an example of the unnecessary part-of-speech table in the embodiment.
FIG. 9 is an explanatory diagram showing an example of the final result of morphological analysis in the embodiment.
FIG. 10 is an explanatory diagram showing an example of the co-occurrence frequency matrix between the word set and the semantic attribute set in the embodiment.
FIG. 11 is an explanatory diagram showing an example of the frequency hash in the embodiment.
FIG. 12 is an explanatory diagram showing another example of the frequency hash in the embodiment.
FIG. 13 is an explanatory diagram showing the co-occurrence frequency matrix between the word set and the semantic attribute set in the embodiment after the morphological analysis result of FIG. 9 has been processed and the frequencies of the frequency hash have been added.
FIG. 14 is a flowchart showing the processing procedure of the inter-word relevance calculation means in the embodiment.
FIG. 15 is an explanatory diagram showing the ranking of the words Y when the word X is "精神病" in the inter-word relevance calculation means of the embodiment.
FIG. 16 is a flowchart showing the processing procedure of the text-category relevance calculation means in the embodiment.

Explanation of symbols

101: word vector generation means, 102: inter-word relevance calculation means, 103: text-category relevance calculation means, 201: morphological analysis means, 202: co-occurrence matrix generation means, 203: relative frequency calculation means, 204: word dictionary.

Claims

A text classification device for classifying text into any category of a predetermined category set, comprising:
word vector generation means for receiving a corpus as input and generating, for an arbitrary word A, a word vector in which each coordinate corresponds to a word or a semantic attribute of a word and the value of the coordinate is a relative value of the co-occurrence frequency of the word A and the word or semantic attribute corresponding to the coordinate; and
inter-word relevance calculation means for calculating, for an arbitrary word B, the Kullback-Leibler distance between the word vector of the word B obtained by the word vector generation means and the word vector of an arbitrary word C obtained by the word vector generation means, ranking the words C in ascending order of the distance, and calculating the relevance between the word B and the word C from the rank in the ranking.

The text classification device according to claim 1, further comprising text-category relevance calculation means for receiving as input a set of categories, each associated with a set of words, and an arbitrary text, and calculating, for each category, the relevance between the text and the category from the relevance, calculated by the inter-word relevance calculation means, between a word D in the text and a word E in the category.

The text classification device according to claim 1, wherein the word vector generation means includes:
morphological analysis means for morphologically analyzing the text and identifying the words necessary for processing;
co-occurrence matrix generation means for calculating, from the analysis result of the morphological analysis means, for any pair of words or any pair of a word and a semantic attribute of a word, the frequency obtained by counting over the entire text the events in which the pair co-occurs in each of one or more predetermined ranges of the text, and generating a co-occurrence frequency matrix in which each word is associated with a vector whose coordinate values are the frequencies calculated for the pairs of that word and the word or semantic attribute corresponding to each coordinate; and
relative frequency calculation means for converting each coordinate value of the row vectors of the co-occurrence frequency matrix generated by the co-occurrence matrix generation means into a relative frequency.

The text classification device according to claim 1, wherein the relevance in the inter-word relevance calculation means is calculated as the reciprocal of the rank in the ranking.

The text classification device according to claim 1, wherein the relevance in the inter-word relevance calculation means is calculated on the basis of the cosine between the word vectors of the words B and C.

The text classification device according to claim 2, wherein the text-category relevance calculation means:
morphologically analyzes each character string Zpq of a category Cp associated with OR-joined character strings, and converts each character string Zpq into a set of pairs of a distinct necessary-word base form in the character string and its frequency;
morphologically analyzes an arbitrary text U, and converts it into a set of pairs of a distinct necessary-word base form in the text U and its frequency;
calculates the relevance between the text U and each character string Zpq of the category Cp on the basis of the relevance, calculated by the inter-word relevance calculation means, between a word D in the text U and a word E in the category Cp; and
calculates the relevance between the arbitrary text U and an arbitrary category Cp.

A text classification method for classifying text into any category of a predetermined category set, comprising:
a word vector generation step in which word vector generation means receives a corpus as input and generates, for an arbitrary word A, a word vector in which each coordinate corresponds to a word or a semantic attribute of a word and the value of the coordinate is a relative value of the co-occurrence frequency of the word A and the word or semantic attribute corresponding to the coordinate; and
an inter-word relevance calculation step in which inter-word relevance calculation means calculates, for an arbitrary word B, the Kullback-Leibler distance between the word vector of the word B obtained by the word vector generation means and the word vector of an arbitrary word C obtained by the word vector generation means, ranks the words C in ascending order of the distance, and calculates the relevance between the word B and the word C from the rank in the ranking.

The text classification method according to claim 7, further comprising a text-category relevance calculation step in which text-category relevance calculation means receives as input a set of categories, each associated with a set of words, and an arbitrary text, and calculates, for each category, the relevance between the text and the category from the relevance, calculated by the inter-word relevance calculation means, between a word D in the text and a word E in the category.

The text classification method according to claim 7 or 8, wherein the word vector generation step includes:
a morphological analysis step in which morphological analysis means morphologically analyzes the text and identifies the words necessary for processing;
a co-occurrence matrix generation step in which co-occurrence matrix generation means calculates, from the analysis result of the morphological analysis step, for any pair of words or any pair of a word and a semantic attribute of a word, the frequency obtained by counting over the entire text the events in which the pair co-occurs in each of one or more predetermined ranges of the text, and generates a co-occurrence frequency matrix in which each word is associated with a vector whose coordinate values are the frequencies calculated for the pairs of that word and the word or semantic attribute corresponding to each coordinate; and
a relative frequency calculation step in which relative frequency calculation means converts each coordinate value of the row vectors of the co-occurrence frequency matrix generated in the co-occurrence matrix generation step into a relative frequency.

The text classification method according to claim 7, wherein the relevance in the inter-word relevance calculation step is calculated as the reciprocal of the rank in the ranking.

The text classification method according to claim 7, wherein the relevance in the inter-word relevance calculation step is calculated on the basis of the cosine between the word vectors of the words B and C.

The text classification method according to any one of claims 8 to 11, wherein the text-category relevance calculation step:
morphologically analyzes each character string Zpq of a category Cp associated with OR-joined character strings, and converts each character string Zpq into a set of pairs of a distinct necessary-word base form in the character string and its frequency;
morphologically analyzes an arbitrary text U, and converts it into a set of pairs of a distinct necessary-word base form in the text U and its frequency;
calculates the relevance between the text U and each character string Zpq of the category Cp on the basis of the relevance, calculated by the inter-word relevance calculation means, between a word D in the text U and a word E in the category Cp; and
calculates the relevance between the arbitrary text U and an arbitrary category Cp.

A text classification program for causing a computer to function as the word vector generation means and the inter-word relevance calculation means according to any one of claims 1 to 6.

A text classification program for causing a computer to function as the text-category relevance calculating means according to any one of claims 2 to 6.

A computer-readable recording medium on which the text classification program according to claim 13 or 14 is recorded.