JP2008084064A

JP2008084064A - Text classification processing method, text classification processing device and text classification processing program

Info

Publication number: JP2008084064A
Application number: JP2006264088A
Authority: JP
Inventors: Takeshi Sadohara; 健佐土原
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2006-09-28
Filing date: 2006-09-28
Publication date: 2008-04-10

Abstract

PROBLEM TO BE SOLVED: To provide a text classification processing method for determining whether a text belongs to a certain category.

SOLUTION: The text classification processing method extracts character strings of a fixed length or less from a text, calculates a feature quantity for each character string, and generates a feature vector whose components are these feature quantities. When a training data set is given in which each text is labeled in advance as belonging or not belonging to a certain category, the texts of the training data set are converted into feature vectors, and the feature vectors are supplied together with their labels to a support vector machine for learning, which produces a text classifier. When a text whose category membership is unknown is given, its feature vector is generated, and the generated text classifier determines whether the text belongs to the category.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a text classification processing method that determines whether a text belongs to a certain category, for example for text containing character strings that are difficult to analyze morphologically, and to a method for constructing a classifier that realizes such classification from a training data set of texts labeled in advance as belonging or not belonging to the category.

Text classification has been the subject of a great deal of research and development, aimed at a variety of applications: for example, removing inappropriate e-mail (e-mail unwanted by the user, called junk mail or spam), automatically routing inquiry e-mails to the responsible staff at a customer center, and organizing news articles by topic.

One way to realize text classification is to use a classifier prepared in advance. Typically, texts are classified according to predefined IF-THEN rules. For example, a rule may state that a text concerns the "stock market" if the words "東証" (Tokyo Stock Exchange) and "終値" (closing price) appear together with numerals. Unfortunately, as the number of texts and the size of the vocabulary grow, it becomes difficult to maintain such classification rules while keeping them consistent.
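The stock-market rule mentioned above can be written as a very small static classifier. The following is only an illustrative sketch: the function name is hypothetical, and reading "numerals" as "any digit character" is an assumption, not something the source specifies.

```python
import re


def is_stock_market_text(text: str) -> bool:
    """Hand-written IF-THEN rule: both keywords and at least one digit must appear."""
    has_keywords = "東証" in text and "終値" in text
    has_number = re.search(r"\d", text) is not None  # assumption: any digit counts
    return has_keywords and has_number


print(is_stock_market_text("東証の終値は15000円だった"))  # True
print(is_stock_market_text("今日は天気が良い"))            # False
```

Every new topic or new vocabulary item requires another hand-written rule of this kind, which is the maintenance burden the paragraph describes.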

In recent years, therefore, methods that learn a classifier dynamically from data as needed have become mainstream, replacing methods that rely on such static classifiers. Classifiers of this kind use various representations, such as IF-THEN rules, neural networks, decision trees, probabilistic models, and separating hyperplanes, and various learning algorithms have been proposed for each representation. These learning methods are generally known and familiar to those skilled in the art, so their description is omitted here.

This section outlines the support vector machine (Non-Patent Document 1) used in the present invention. Suppose that the data are represented as points in an n-dimensional space and that each point carries one of two labels, +1 or -1, indicating whether it belongs to a certain category. A support vector machine computes a hyperplane in the n-dimensional space that separates the labeled data optimally under a certain criterion, and it classifies data with this hyperplane. When data with an unknown label is given, its label can be predicted by checking on which side of the hyperplane it lies. In other words, once a text is represented in some way as a point in an n-dimensional space (called a feature vector), a classifier that determines whether a text belongs to a certain category can be learned from data by using a support vector machine to compute a hyperplane in that space.
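The prediction step described here, checking on which side of the hyperplane a point lies, can be sketched as follows. The weight vector w and offset b are hypothetical values chosen for illustration; in practice they would come out of support vector machine training. Assigning points that lie exactly on the hyperplane to +1 is likewise an assumption.

```python
from typing import Sequence


def predict_label(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1  # assumption: points on the plane count as +1


# Hypothetical hyperplane in a 2-dimensional feature space.
w, b = [1.0, -1.0], 0.0
print(predict_label(w, b, [2.0, 1.0]))  # 1
print(predict_label(w, b, [1.0, 3.0]))  # -1
```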

One advantage of using a support vector machine to learn a text classifier is that, even if the dimension of the hyperplane is very high, a technique called the kernel trick allows the hyperplane to be learned with a computational cost that does not depend on that dimension. Another advantage is that, even when the feature vectors are very high-dimensional, the risk that the classifier fits the training data too closely and loses generality (called overfitting) is empirically known to be smaller than with other learning methods.
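As a rough illustration of the kernel trick, the sketch below compares two ways of computing the same inner product: explicitly, in an expanded space of all pairwise products (d*d dimensions), and implicitly, with a degree-2 polynomial kernel evaluated in the original d dimensions. The choice of this particular kernel is an assumption made for the example; the source does not say which kernel is used.

```python
from itertools import product
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def explicit_features(x: Sequence[float]) -> List[float]:
    """Map x to all pairwise products x_i * x_j (d*d dimensions)."""
    return [xi * xj for xi, xj in product(x, repeat=2)]


def poly2_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Same inner product, computed without leaving the original d dimensions."""
    return dot(x, y) ** 2


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(dot(explicit_features(x), explicit_features(y)))  # 1024.0
print(poly2_kernel(x, y))                               # 1024.0
```

The explicit route needs d*d multiplications per pair of vectors, while the kernel needs only d, which is why the cost of learning need not grow with the dimension of the expanded space.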

For example, Non-Patent Document 2 shows that when a text is represented as a feature vector whose components (features) are feature quantities of the words it contains, a support vector machine can learn a more accurate classifier than other learning methods. It further shows that using more than 10,000 words does not cause overfitting but rather improves learning performance. For more detailed information and theoretical analysis of applying support vector machines to text classification, see Non-Patent Document 4.

Patent Document 1 discloses a technique that extracts words from Japanese text by morphological analysis, analyzes the words on the basis of part-of-speech information and the like, generates feature vectors with the words as features, and then applies a support vector machine.

Patent Document 2 discloses a text classification technique that computes feature vectors using tokens delimited by whitespace or punctuation as basic features, learns a hyperplane with a support vector machine, and uses as the classifier a kind of monotonic function that includes the weight vector of the resulting hyperplane.

As these documents show, text classification often uses feature vectors with words as features. When word order matters, sequences of N consecutive words (called N-grams) are sometimes used as features, as in Patent Document 3. It is also common, as in Patent Document 5, Patent Document 6, and Patent Document 18, to prepare in advance a feature dictionary of the words, phrases, or dependency structures to be used as features, and to extract features and generate feature vectors with this dictionary.

In Japanese text, unlike English text, words are not separated by spaces, so words are usually extracted by morphological analysis, which again relies on a feature (morpheme) dictionary. The drawback of a feature dictionary is that features not registered in it cannot be extracted. In many cases, however, unregistered words can be detected as unknown words from their surrounding context, so the use of a feature dictionary rarely causes problems in standard Japanese text classification. Indeed, all of the patent documents on Japanese text classification cited below use some kind of feature dictionary.

On the other hand, methods that analyze arbitrary character strings in a text without a feature dictionary are also known. For example, Patent Document 4 discloses a technique that extracts character strings of length N appearing in a text (called character N-grams) and calculates their importance in order to determine whether each string is a general expression or a technical expression. Character N-grams have the advantage of not being constrained by a feature dictionary, but they have the drawback that linguistically meaningless strings are also extracted; how to eliminate such strings is one of the key points of the technique disclosed in Patent Document 4. For category classification of text, however, what matters is not whether a string is linguistically meaningful but whether it contributes to classification. Whether it is effective to learn a text classifier from feature vectors that use arbitrary character N-grams as features, and to eliminate automatically during learning the strings that do not contribute to classification, is an interesting open question.

Non-Patent Document 3 is one of the few studies addressing this question. For English text, in which words are separated by spaces, it compares the classification performance of feature vectors whose features are sequences of three consecutive characters with that of feature vectors generated using linguistic knowledge. According to Non-Patent Document 3, with character N-grams performance begins to degrade as soon as more than 1,000 features are used, and better performance is obtained by using linguistic knowledge such as feature selection based on part of speech and stemming, which removes inflectional endings. However, whether character N-gram features are effective when a support vector machine is used, which resists overfitting even with many features, or when the target is Japanese text, in which words are not separated by spaces and such linguistic knowledge does not work effectively, is an interesting open question, and it is the problem that the present invention solves.

The following documents describe prior art related to this kind of text classification.
Patent Document 1: JP 2001-22727 A
Patent Document 2: JP 2002-519766 A (published Japanese translation of a PCT application)
Patent Document 3: JP 2004-348239 A
Patent Document 4: JP H11-272702 A
Patent Document 5: JP 2005-234731 A
Patent Document 6: JP 2005-190284 A
Patent Document 7: JP 2004-240517 A
Patent Document 8: JP 2004-234051 A
Patent Document 9: JP 2003-271616 A
Patent Document 10: JP 2002-7433 A
Patent Document 11: JP 2002-304401 A
Patent Document 12: JP 2001-312501 A
Patent Document 13: JP 2000-172691 A
Patent Document 14: JP H11-328211 A
Patent Document 15: JP H11-296552 A
Patent Document 16: JP H11-167581 A
Patent Document 17: JP H11-161671 A
Patent Document 18: JP H09-26963 A
Patent Document 19: JP 2003-256801 A
Non-Patent Document 1: V. Vapnik: The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
Non-Patent Document 2: T. Joachims: Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proc. of European Conference on Machine Learning, pp. 137-142, 1998.
Non-Patent Document 3: G. Neumann and S. Schmeier: Combining shallow text processing and machine learning in real world applications, Proc. of Machine Learning for Information Filtering, 1999.
Non-Patent Document 4: T. Joachims: Learning to classify text using support vector machines: methods, theory, and algorithms, Kluwer Academic Publishers, 2002.
Non-Patent Document 5: N. Cristianini and J. Shawe-Taylor: An Introduction to Support Vector Machines, Cambridge University Press, 2002.
Non-Patent Document 6: I. Guyon et al.: Gene selection for cancer classification using support vector machines, Machine Learning, Vol. 46, pp. 389-422, 2002.

Conventional text classification techniques cannot be applied as they are to text whose characteristics differ from those they assume, for example text containing character strings that are difficult to analyze morphologically. In particular, conventional techniques do not achieve sufficient classification accuracy when used to identify and remove inappropriate posts on Japanese-language bulletin board sites on the Internet. Posts on bulletin board sites differ from the text found in news articles and e-mail in the following respects.
(a) Many words that are not registered in general dictionaries appear, such as proper nouns (personal names, product names), jargon, words written with masking characters (fuseji), and pictographs (emoji).
(b) Posts are often grammatically incorrect.
(c) Each post is short, while the number of posts is very large.

Unlike English, in which words are separated by spaces, extracting words from Japanese text usually requires language processing such as morphological analysis. Such processing uses linguistic knowledge such as dictionaries and grammars, either explicitly or implicitly. For posts on bulletin board sites, characteristics (a) and (b) above reduce the effectiveness of this linguistic knowledge: text is frequently split into incorrect morphemes, and unregistered words frequently fail to be recognized as unknown words, so the performance of morphological analysis degrades markedly. Consequently, the performance of text classification using feature vectors generated from incorrect morphemes degrades as well.

In addition, using a support vector machine to learn a text classifier that determines whether a bulletin board post is inappropriate raises a problem of computational cost. In general, training a classifier with a support vector machine tends to take longer than other learning methods; empirically, the time required is proportional to between the square and the cube of the number of training examples. The time required for learning therefore grows rapidly as the number of posts increases. Moreover, to make learning efficient, a matrix called the kernel matrix, which stores the inner products between the feature vectors of the training data, is often used. Storing this matrix requires memory proportional to the square of the number of training examples, so memory consumption also grows rapidly as the number of posts increases.
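The quadratic memory growth can be made concrete with a short calculation. This sketch assumes a dense kernel matrix with one 8-byte floating-point value per entry and no use of symmetry; these are illustrative assumptions, not figures from the source.

```python
def kernel_matrix_bytes(n_samples: int, bytes_per_entry: int = 8) -> int:
    """Memory for a dense n x n kernel matrix (assumed 8 bytes per entry)."""
    return n_samples * n_samples * bytes_per_entry


for n in (10_000, 100_000):
    print(f"{n:>7} examples: {kernel_matrix_bytes(n) / 1e9:.1f} GB")
# prints 0.8 GB for 10,000 examples and 80.0 GB for 100,000 examples
```

Under these assumptions, ten times as many posts require one hundred times as much memory.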

The present invention has been made to solve the problems described above. Its object is to provide a text classification processing method, a text classification processing device, and a text classification processing program that generate feature vectors from text alone, without linguistic knowledge such as a feature dictionary; that can learn a text classifier efficiently in terms of computation time and memory even when a large number of category-labeled feature vectors are given as training data; and that, when a text of unknown category is given, generate the corresponding feature vector and determine the category of the text with the learned classifier.

To achieve the above object, in a first aspect the present invention provides a text classification processing method for classifying text in a system comprising a data storage device that stores text, for example text containing character strings that are difficult to analyze morphologically, and a data processing device that performs text classification processing to determine whether the text belongs to a certain category. The data processing device executes: a feature vector generation process that extracts arbitrary character strings of a fixed length or less from the text, calculates a feature quantity for each character string, and generates a feature vector with the feature quantities as features; a classifier generation process that, when a training data set is given in which each text is labeled in advance as belonging or not belonging to a certain category, converts the texts of the training data set into feature vectors by the feature vector generation process and supplies the feature vectors together with their labels to a support vector machine for learning, thereby generating a text classifier; and a category determination process that, when a text whose category membership is unknown is given, generates its feature vector by the feature vector generation process and determines, using the text classifier generated by the classifier generation process, whether the text belongs to the category.

In this case, the classifier generation process divides the training data set into a plurality of subsets and processes them in order: it applies the support vector machine to a subset to learn a temporary classifier, extracts the support vectors from that classifier, mixes the extracted support vectors with the next subset, and supplies the result to the support vector machine again, repeating this procedure.
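The subset-by-subset procedure can be sketched in a few lines. To keep the example self-contained, the trainer below is a toy stand-in and not the solver used in the invention: it assumes 1-dimensional data in which both labels are present and every -1 point lies to the left of every +1 point, so that the maximum-margin boundary is the midpoint between the closest pair of opposite-label points, and those two points are the support vectors. All function names are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (1-D feature value, label in {+1, -1})


def train_1d_hard_margin(samples: List[Sample]) -> Tuple[float, List[Sample]]:
    """Toy stand-in for SVM training (assumes separable 1-D data, both labels present)."""
    neg = max(x for x, y in samples if y == -1)  # rightmost negative point
    pos = min(x for x, y in samples if y == +1)  # leftmost positive point
    threshold = (neg + pos) / 2.0                # maximum-margin boundary
    return threshold, [(neg, -1), (pos, +1)]     # boundary and its support vectors


def train_on_subsets(subsets: List[List[Sample]]) -> Tuple[float, List[Sample]]:
    """Train on each subset in turn, carrying only the support vectors forward."""
    carried: List[Sample] = []
    threshold = 0.0
    for subset in subsets:
        # Mix the support vectors of the temporary classifier with the next subset.
        threshold, carried = train_1d_hard_margin(carried + subset)
    return threshold, carried


subsets = [
    [(-5.0, -1), (4.0, +1)],
    [(-2.0, -1), (6.0, +1)],
    [(-3.0, -1), (1.0, +1)],
]
threshold, support_vectors = train_on_subsets(subsets)
print(threshold, support_vectors)  # -0.5 [(-2.0, -1), (1.0, 1)]
```

Each round sees only one subset plus the carried support vectors, so the trainer never holds the whole training set at once. In this toy setting the result equals the one obtained by training on all data together; for a general support vector machine that equality need not hold exactly.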

In a second aspect, the present invention provides a text classification processing device for classifying text, comprising a data storage device that stores text, for example text containing character strings that are difficult to analyze morphologically, and a data processing device that performs text classification processing to determine whether the text belongs to a certain category. The device comprises: feature vector generation means that extracts arbitrary character strings of a fixed length or less from the text, calculates a feature quantity for each character string, and generates a feature vector with the feature quantities as features; classifier generation means that, when a training data set is given in which each text is labeled in advance as belonging or not belonging to a certain category, converts the texts of the training data set into feature vectors by the feature vector generation means and supplies the feature vectors together with their labels to a support vector machine for learning, thereby generating a text classifier; and category determination means that, when a text whose category membership is unknown is given, generates its feature vector by the feature vector generation means and determines, using the text classifier generated by the classifier generation means, whether the text belongs to the category.

In this case, the classifier generation means divides the training data set into a plurality of subsets and processes them in order: it applies the support vector machine to a subset to learn a temporary classifier, extracts the support vectors from that classifier, mixes the extracted support vectors with the next subset, and supplies the result to the support vector machine again, repeating this procedure.

In a third aspect, the present invention provides a text classification processing program that causes a computer to execute text classification processing to determine whether a text, for example one containing character strings that are difficult to analyze morphologically, belongs to a certain category. The program causes the computer to function as: feature vector generation means that extracts arbitrary character strings of a fixed length or less from the text, calculates a feature quantity for each character string, and generates a feature vector with the feature quantities as features; classifier generation means that, when a training data set is given in which each text is labeled in advance as belonging or not belonging to a certain category, converts the texts of the training data set into feature vectors by the feature vector generation means and supplies the feature vectors together with their labels to a support vector machine for learning, thereby generating a text classifier; and category determination means that, when a text whose category membership is unknown is given, generates its feature vector by the feature vector generation means and determines, using the text classifier generated by the classifier generation means, whether the text belongs to the category.

In this case, the classifier generation means divides the training data set into a plurality of subsets and processes them in order: it applies the support vector machine to a subset to learn a temporary classifier, extracts the support vectors from that classifier, mixes the extracted support vectors with the next subset, and supplies the result to the support vector machine again, repeating this procedure.

According to the text classification processing method, device, and program of the present invention configured as described above, for text on which linguistic knowledge is of little use, feature vectors whose features are arbitrary character strings of a fixed length or less can be generated without linguistic knowledge such as a feature dictionary, and a support vector machine can learn a text classifier from the set of feature vectors labeled as belonging or not belonging to the category. This can be done with few computational resources, and when a text of unknown category is given, its category can be determined (predicted) with high accuracy using the learned classifier.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a flowchart showing an example of the processing flow of the text classification processing method according to the present invention. As shown in FIG. 1, the basic processing consists of a feature vector generation process (P1), which extracts arbitrary character strings of a fixed length or less appearing in a text, calculates their feature quantities, and generates a feature vector with these feature quantities as features; a classifier generation process (P2), which takes a set of category-labeled feature vectors as input and generates a classifier using a support vector machine; and a category determination process (P3), which converts an unclassified text into a feature vector by the feature vector generation process (P1) and then determines its category using the classifier learned in the classifier generation process (P2).

That is, the text classification processing according to the present invention uses a data storage device that stores text, for example text containing character strings that are difficult to analyze morphologically, and a data processing device that performs text classification processing to determine whether the text belongs to a certain category, and the data processing device classifies the text. The processing comprises: the feature vector generation process (P1), which extracts arbitrary character strings of a fixed length or less from the text, calculates a feature quantity for each character string, and generates a feature vector with the feature quantities as features; the classifier generation process (P2), which, when a training data set is given in which each text is labeled in advance as belonging or not belonging to a certain category, converts the texts of the training data set into feature vectors by the feature vector generation process and supplies the feature vectors together with their labels to a support vector machine for learning, thereby generating a text classifier; and the category determination process (P3), which, when a text whose category membership is unknown is given, generates its feature vector by the feature vector generation process (P1) and determines, using the text classifier generated by the classifier generation process, whether the text belongs to the category.

A more detailed description follows. When the text classification processing is executed, the given classified texts are first read into the main memory of the computer (step 101), and the feature vector generation process (P1) calculates feature vectors whose features are the character strings of at most a fixed length that appear in the texts (step 102).

FIG. 2 is a diagram showing the detailed processing flow of the feature vector generation process (P1). In the feature vector generation process (P1), it is first determined whether or not the text has been classified (step 121). If the text has been classified, arbitrary character strings s of length N or less are extracted from the texts, and the frequency TF(s, t_i) with which s appears in each text t_i and the number DF(s) of texts in which s appears are calculated (step 122). The frequencies of all character strings of length N or less appearing in the texts can be calculated efficiently using a data structure called a trie. Calculation using a trie is well known to those skilled in the art, and its description is omitted here.
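The counting in step 122 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it uses Python's hash-based `Counter` in place of the trie mentioned in the text, and the function names `substring_counts` and `tf_and_df` are hypothetical.

```python
from collections import Counter


def substring_counts(text, max_len):
    """Count every substring of `text` whose length is between 1 and max_len."""
    counts = Counter()
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            if start + length > len(text):
                break
            counts[text[start:start + length]] += 1
    return counts


def tf_and_df(texts, max_len):
    """Return (tf, df), where tf[i][s] is TF(s, t_i) and df[s] is DF(s)."""
    tf = [substring_counts(t, max_len) for t in texts]
    df = Counter()
    for counts in tf:
        # Each text contributes at most 1 to the document frequency of s.
        df.update(counts.keys())
    return tf, df
```

For the two texts `"abab"` and `"ba"` with N = 2, the string `"ab"` has TF 2 in the first text and DF 1, while `"ba"` has DF 2. A trie would share storage between strings with common prefixes, which matters for large collections.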

Next, several thousand to several tens of thousands of features are selected from among all the features (step 123), and for an arbitrary text t, a vector whose components are the feature quantities f(s_i, t) is generated (step 124) and used as the feature vector. Feature selection (step 123) is performed because, if the number of features is too large, the risk of overfitting increases and the computational cost of training the classifier may become unacceptable. Known techniques such as mutual information and the chi-square test are used for feature selection.
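As one possible reading of step 123, the sketch below ranks candidate strings by the chi-square statistic of a 2x2 contingency table (string present or absent versus text in or out of the category). The text names the chi-square test only as one known option; the function names, the tie-breaking rule, and the set-based input format are assumptions made for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: in-category texts containing s      b: out-of-category texts containing s
    c: in-category texts without s         d: out-of-category texts without s
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # s occurs in all texts or in none: no discriminative power
    return n * (a * d - b * c) ** 2 / denom


def select_features(doc_features, labels, top_k):
    """Pick the top_k strings by chi-square score.

    doc_features: list of sets of strings, one set per text.
    labels: list of +1 / -1 category labels.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for feats, y in zip(doc_features, labels):
        (pos_df if y == 1 else neg_df).update(feats)
    scores = {}
    for s in set(pos_df) | set(neg_df):
        a, b = pos_df[s], neg_df[s]
        scores[s] = chi_square(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_k]
```

A string that appears in every text scores zero and is dropped first, while a string that appears only in one category scores highest.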

In this way, using the selected features s_1, ..., s_d, each text t_i is converted into a d-dimensional feature vector (f(s_1, t_i), ..., f(s_d, t_i)). Here f(s_j, t_i) is the feature quantity of the feature s_j. For example, TF(s_j, t_i) may simply be used, or a standard measure such as the TF-IDF value, obtained by multiplying TF(s_j, t_i) by log(D / DF(s_j)), where D is the total number of texts. These values can also be binarized according to whether or not they exceed a specific threshold. In that case, the feature vector is a binary vector, which allows an implementation that saves computational resources.
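The conversion of one text into a d-dimensional vector can be sketched as below, using the TF-IDF variant f(s_j, t_i) = TF(s_j, t_i) · log(D / DF(s_j)) with optional binarization. The function name and the `binary_threshold` parameter are hypothetical, and the sketch assumes every selected feature has DF of at least 1, which holds when features are selected from the training texts.

```python
import math


def feature_vector(tf_counts, features, df, num_texts, binary_threshold=None):
    """Build the feature vector of one text.

    tf_counts: mapping string -> TF(s, t) for this text.
    features:  the selected strings s_1, ..., s_d, in a fixed order.
    df:        mapping string -> DF(s) over the training texts.
    num_texts: D, the total number of training texts.
    """
    vec = []
    for s in features:
        value = tf_counts.get(s, 0) * math.log(num_texts / df[s])
        if binary_threshold is not None:
            # Binarize: 1 if the value exceeds the threshold, else 0.
            value = 1 if value > binary_threshold else 0
        vec.append(value)
    return vec
```

Note that a string occurring in every text gets weight log(1) = 0, so it contributes nothing under TF-IDF regardless of its frequency.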

なお、特徴ベクトルの成分は、多くの場合ゼロであるので、スパースベクトル技法を用いて、大きな一つの配列ではなくて、二つの小さな配列を用いるようにした実装も可能である。すなわち、一つ目の配列には、ゼロでない値を持つ成分のインデックスを保持し、二つめの配列にはゼロでない成分の実際の値を保持する。特に、特徴ベクトルが２値ベクトルの場合は、二つ目の配列は不要となる。 Since the component of the feature vector is often zero, it is possible to use the sparse vector technique so that two small arrays are used instead of one large array. That is, the first array holds the index of the component having a non-zero value, and the second array holds the actual value of the non-zero component. In particular, when the feature vector is a binary vector, the second array is unnecessary.
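The two-array sparse representation described above can be sketched as follows. The helper names are hypothetical; the dot product is included because kernel evaluation typically reduces to it, which is an assumption about usage rather than something stated in the text.

```python
def to_sparse(dense):
    """Split a dense vector into (indices, values) of its nonzero components."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return indices, values


def sparse_dot(a_idx, a_val, b_idx, b_val):
    """Dot product of two sparse vectors whose index arrays are sorted."""
    i = j = 0
    total = 0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            total += a_val[i] * b_val[j]
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1
        else:
            j += 1
    return total


def binary_dot(a_idx, b_idx):
    """For binary vectors only the index arrays are needed:
    the dot product is the number of shared nonzero positions."""
    return len(set(a_idx) & set(b_idx))
```

With several tens of thousands of features and only a few hundred distinct substrings per short text, this keeps memory proportional to the text length rather than to the feature count.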

分類済みのテキスト集合は、上述したような処理を経て、分類ラベルが付与された特徴ベクトルの集合に変換され、その後、これを訓練データとして、分類器生成過程（Ｐ２）において分類器が学習される（ステップ１０３）。 The classified text set is converted into a set of feature vectors to which a classification label is assigned through the processing described above, and then the classifier is learned in the classifier generation process (P2) using this as training data. (Step 103).

FIG. 3 is a diagram explaining the detailed processing flow of the classifier generation process (P2). In the classifier generation process (P2), the set of feature vectors with classification labels is first divided into k subsets D_1, ..., D_k (step 131). This division prevents the time and memory required to train the support vector machine from becoming unacceptable when there is too much training data. Each subset should be as large as the available computational resources allow. Therefore, if training on all the training data at once is acceptable, k = 1 is desirable. It is also desirable to divide the training data so that the ratio of data belonging to the category to data not belonging to it is approximately the same in every subset.
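One simple way to obtain subsets with nearly equal class ratios (step 131) is to deal the positive and negative examples round-robin into the k subsets. This is only a sketch under that assumption; the text does not prescribe a particular splitting procedure, and `stratified_split` is a hypothetical name.

```python
def stratified_split(labels, k):
    """Return k lists of example indices with similar positive/negative ratios.

    labels: list of +1 / -1 category labels, one per training example.
    """
    subsets = [[] for _ in range(k)]
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y != 1]
    for group in (positives, negatives):
        # Deal each class round-robin so every subset gets its share.
        for n, idx in enumerate(group):
            subsets[n % k].append(idx)
    return subsets
```

With k = 1 the function returns a single subset containing every example, matching the case where training on all data at once is affordable.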

Next, after the set S of support vectors is initialized to the empty set (step 132), the following processing is repeated for each 1 ≤ i ≤ k (steps 133 to 137). That is, the training data T is set to D_i ∪ S (step 134), a support vector machine is trained on T to compute a classifier (step 135), and the support vectors are extracted from the trained classifier and taken as the new set S of support vectors (step 136).
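The loop of steps 132 to 137 can be sketched independently of any particular SVM solver. In the sketch below, `train_in_chunks` is a hypothetical name, and `train_1d_max_margin` is a toy stand-in for a real solver: it handles only separable one-dimensional data in which negative examples lie to the left of positive ones, where the maximum-margin boundary is the midpoint between the two closest opposite-class points.

```python
def train_in_chunks(subsets, train, initial_support=()):
    """Train on subsets D_1, ..., D_k in turn, carrying support vectors forward.

    subsets: list of lists of (y, x) training examples.
    train:   callable taking a list of (y, x) and returning
             (model, support_vectors), with support_vectors a list of (y, x).
    initial_support: starting set S; empty by default (step 132).
    """
    support = list(initial_support)
    model = None
    for subset in subsets:                        # steps 133 to 137
        training_data = list(subset) + support    # T = D_i ∪ S (step 134)
        model, support = train(training_data)     # steps 135 and 136
    return model, support


def train_1d_max_margin(data):
    """Toy trainer (assumption, not a general SVM): separable 1-D data,
    negatives to the left of positives. Returns (boundary, support_vectors)."""
    highest_negative = max(x for y, x in data if y == -1)
    lowest_positive = min(x for y, x in data if y == +1)
    boundary = (highest_negative + lowest_positive) / 2
    return boundary, [(-1, highest_negative), (+1, lowest_positive)]
```

In this toy setting the chunked result coincides with training on all data at once; with a real SVM the result is in general only an approximation. Passing the support vectors of an existing classifier as `initial_support` gives the incremental variant.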

サポートベクトルマシンの訓練法については、本発明の本質的な部分でないので、説明を省略するが、詳細については、非特許文献１あるいは非特許文献４等に記述されているので参照できる。ここでは、本発明の理解に必要な部分だけを以下に述べるに留める。 The support vector machine training method is not an essential part of the present invention and will not be described. However, details can be referred to because they are described in Non-Patent Document 1, Non-Patent Document 4, and the like. Here, only the portions necessary for understanding the present invention are described below.

When a set of M training examples

$$\{(y_i, x_i) \mid y_i = +1 \text{ or } -1,\ x_i \text{ is a feature vector},\ 1 \le i \le M\}$$

is given, the support vector machine solves a convex quadratic programming problem. Letting the solution be $\alpha_1^*, \ldots, \alpha_M^*$, it outputs the following hyperplane as the classification boundary:

$$d(x) = \sum_{i=1}^{M} \alpha_i^* y_i K(x, x_i) + b^* = 0$$

Here $b^*$ is determined so that $y_i d(x_i) = 1$ for every $i$ with $\alpha_i^* > 0$.

Here, a training example x_i for which α_i^* ≠ 0 is called a support vector. Support vectors lie closest to the classification boundary and are the data essential for defining the hyperplane. In many cases there are far fewer support vectors than training examples. In the classifier generation process (P2) shown in FIG. 3 (steps 134 to 137), instead of passing on each subset of the training data in full, the smaller set of support vectors, which carries the same information, is passed on, thereby approximating the hyperplane that would be obtained using all the training data.

K(x, x_i) here is called a kernel function and represents the similarity between data. Known kernels such as the polynomial kernel and the Gaussian kernel can be used, and when the feature vectors are binary vectors, the Boolean kernel disclosed in Patent Document 19 can also be used.
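The two named kernels and the evaluation of d(x) can be sketched as below. The sketch assumes the decision function has the standard SVM form d(x) = Σ α_i^* y_i K(x, x_i) + b^*, summed over the support vectors. The parameter names `degree`, `c`, and `gamma` and their default values are illustrative choices, and the Boolean kernel of Patent Document 19 is not reproduced here.

```python
import math


def polynomial_kernel(x, z, degree=2, c=1.0):
    """Polynomial kernel (x . z + c) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + c) ** degree


def gaussian_kernel(x, z, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support, bias, kernel):
    """Evaluate d(x) from the support vectors.

    support: list of (alpha, y, x_i) triples with alpha > 0.
    bias:    the offset b*.
    kernel:  callable K(x, x_i).
    """
    return sum(alpha * y * kernel(x, x_i) for alpha, y, x_i in support) + bias
```

As a check, take the one-dimensional points +1 and −1 with labels +1 and −1, a linear kernel, α = 0.5 for both, and b^* = 0. Then d(x) = x, so y_i d(x_i) = 1 holds for both support vectors.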

This method of extracting support vectors and mixing them with training data can also be used for incremental learning of a classifier. That is, if in the initialization of S in FIG. 3 (step 132) S is initialized not to the empty set but to the set of support vectors extracted from an already trained classifier, incremental learning on top of the existing classifier is achieved.

Furthermore, using the feature selection method described in Non-Patent Document 6 or Patent Document 19, it is also possible to analyze the finally obtained hyperplane, remove features that do not contribute to classification, and then retrain a hyperplane with higher classification accuracy.

When unclassified text exists (step 104), the unclassified text is read (step 105), the feature vector generation process (P1) is performed (step 106), and the category determination process (P3) is performed using the learned hyperplane (step 107) to determine the category of the unclassified text. More specifically, the feature vector generation process (P1) converts the unclassified text into a feature vector x, and d(x) is calculated. If this value is greater than or equal to the threshold Dmax, the text is classified into the +1 category; if it is less than or equal to the threshold Dmin, it is classified into the −1 category; and if it is greater than Dmin and less than Dmax, the category is determined to be indefinite.
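The three-way decision on d(x) can be written directly from the thresholds Dmax and Dmin. The function name and the use of `None` for the indefinite case are illustrative choices, not part of the source.

```python
def decide_category(d_value, d_max, d_min):
    """Map the decision value d(x) to +1, -1, or None (indefinite).

    Assumes d_min < d_max, so that the two conditions cannot both hold.
    """
    if d_value >= d_max:
        return +1
    if d_value <= d_min:
        return -1
    return None  # d_min < d(x) < d_max: category is indefinite
```

Widening the gap between Dmin and Dmax trades coverage for precision: more texts are left undecided, for example for manual review, but those that are classified lie further from the boundary.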

FIG. 4 is a block diagram showing an example of the hardware configuration of the text classification processing device according to the present invention. As shown in FIG. 4, the text classification processing device 10 is controlled as a whole by a CPU (Central Processing Unit) 10a, which performs data processing and has a built-in ROM containing a system control program. A RAM (Random Access Memory) 10b, a hard disk drive (HDD) 10c, a graphics processing device 10d, an input interface 10e, and a communication interface 10f are connected to the CPU 10a via a bus 10g.

ＲＡＭ１０ｂには、ＣＰＵ１０ａに実行させるＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）のプログラムや、本発明によるテキスト分類処理プログラムの少なくとも一部が一時的に格納される。また、ＲＡＭ１０ｂには、ＣＰＵ１０ａによる処理に必要な各種データが保存される。ＨＤＤ１０ｃには、上記のＯＳやアプリケーションプログラム、各種データなどが格納される。 The RAM 10b temporarily stores at least a part of an OS (Operating System) program to be executed by the CPU 10a and a text classification processing program according to the present invention. The RAM 10b stores various data necessary for processing by the CPU 10a. The HDD 10c stores the OS, application programs, various data, and the like.

グラフィック処理装置１０ｄには、モニタ１０ｈが接続されている。グラフィック処理装置１０ｄは、ＣＰＵ１０ａからの命令に従って、入出力処理を行うための画像をモニタ１０ｈの表示画面に表示させる。入力インタフェース１０ｅには、キーボード１０ｉと、マウス１０ｊとが接続されている。入力インタフェース１０ｅは、キーボード１０ｉやマウス１０ｊから送られてくる信号を、バス１０ｇを介してＣＰＵ１０ａに送信する。 A monitor 10h is connected to the graphic processing device 10d. The graphic processing device 10d displays an image for performing input / output processing on the display screen of the monitor 10h in accordance with an instruction from the CPU 10a. A keyboard 10i and a mouse 10j are connected to the input interface 10e. The input interface 10e transmits a signal sent from the keyboard 10i or the mouse 10j to the CPU 10a via the bus 10g.

通信インタフェース１０ｆは、ネットワーク３０に接続されて、本発明によるテキスト分類処理装置が、ネットワークシステムの中のサーバとして用いられる構成とされてもよい。もちろん、装置単体として動作するように構成されてもよい。図５に示すように、ネットワークに接続された分類サーバ２０として動作する場合は、ユーザーが利用するクライアント２１は、ネットワーク３０を介して、分類サーバ２０にアクセスし、分類済みテキスト集合を送信して、分類器の構築を要求したり、未分類テキスト集合を送信して、テキストの分類を要求し、分類カテゴリーを受信するようにして、システムを利用することができる。また、この際に、用いられるテキスト集合は、ファイルサーバ２２に格納しておき、クライアント２１は、分類サーバ２０に対してファイルサーバ２２上のファイル名を指定し、分類サーバ２０は、必要に応じて、指定されたファイルをネットワークを介して読み込むことも可能である。 The communication interface 10f may be connected to the network 30 so that the text classification processing device according to the present invention is used as a server in the network system. Of course, it may be configured to operate as a single device. As shown in FIG. 5, when operating as a classification server 20 connected to a network, a client 21 used by a user accesses the classification server 20 via a network 30 and transmits a classified text set. The system can be utilized by requesting the construction of a classifier, sending an unclassified text set, requesting classification of text, and receiving classification categories. At this time, the text set to be used is stored in the file server 22, the client 21 designates the file name on the file server 22 to the classification server 20, and the classification server 20 It is also possible to read the specified file via the network.

Next, the functions of the processing modules of the text classification processing device 10 will be described. FIG. 6 is a functional block diagram of the text classification processing device. The text classification processing device 10 has, as processing modules, feature vector generation means (B1), classifier generation means (B2), and category determination means (B3).

When a set of classified texts is given, the feature vector generation means (B1) extracts arbitrary character strings of at most a fixed length that appear in the texts and then narrows the character strings used as features down to several thousand to several tens of thousands. It then calculates a feature quantity for each text and each feature, and generates feature vectors having these feature quantities as components. When a set of unclassified texts is given, the feature vector generation means (B1) calculates feature quantities for the already selected features and generates feature vectors having them as components.

分類器生成手段（Ｂ２）は、分類ラベルが付与された特徴ベクトル集合から、サポートベクトルマシンを用いて分類器（Ｂ４）を構築し、ＲＡＭ１０ｂあるいはＨＤＤ１０ｃに格納する。この際、既に述べたように、特徴ベクトル集合が大きすぎる場合には、複数の部分集合に分割し、各部分集合に対して順番に、サポートベクトルマシンを適用し、一時的な分類器を学習させた後、その分類器からサポートベクトルを抽出し、抽出されたサポートベクトルと次の部分集合を混合し、再びサポートベクトルマシンの入力とするという処理を繰り返すことができる。 The classifier generation means (B2) constructs a classifier (B4) using a support vector machine from the feature vector set to which the classification label is assigned, and stores it in the RAM 10b or the HDD 10c. At this time, as described above, if the feature vector set is too large, it is divided into a plurality of subsets, and a support vector machine is applied to each subset in turn to learn a temporary classifier. After that, the support vector is extracted from the classifier, the extracted support vector and the next subset are mixed, and the process of inputting the support vector machine again can be repeated.

When an unclassified text is given, the category determination means (B3) generates a feature vector of the text by the feature vector generation means (B1), and determines whether or not the text belongs to the category using the classifier generated by the classifier generation means (B2).

図７は、本発明によるテキスト分類処理を図５で示すような分類サーバ２０上で実行する場合のプログラムのフローチャートである。この処理では、図７に示すように、クライアント２１から学習リクエストがあると、分類済みテキストを読み込み（Ｓ１）、特徴ベクトル生成ステップ（Ｓ２）によってテキストから特徴ベクトルを生成し、分類器生成ステップ（Ｓ３）によって分類器を生成する。そして、生成された分類器をＲＡＭ１０ｂあるいはＨＤＤ１０ｃに格納した後、クライアント２１に処理終了通知（Ｓ５）を行って、再びリクエストを待って待機する。また、分類リクエストがある場合は、未分類テキストを読み込み（Ｓ６）、特徴ベクトル生成ステップ（Ｓ３）によって特徴ベクトルを生成し、ＲＡＭ１０ｂあるいはＨＤＤ１０ｃ上に格納された分類器を読み込んだ後、カテゴリー判定ステップ（Ｓ８）によって未分類テキストのカテゴリーを判別（予測）し、クライアント２１に送信して（Ｓ９）、再び待機する。 FIG. 7 is a flowchart of a program when the text classification process according to the present invention is executed on the classification server 20 as shown in FIG. In this process, as shown in FIG. 7, when there is a learning request from the client 21, the classified text is read (S1), a feature vector is generated from the text by a feature vector generation step (S2), and a classifier generation step ( A classifier is generated by S3). Then, after storing the generated classifier in the RAM 10b or the HDD 10c, the client 21 is notified of the end of processing (S5), and waits for the request again. If there is a classification request, unclassified text is read (S6), a feature vector is generated by a feature vector generation step (S3), a classifier stored in the RAM 10b or HDD 10c is read, and then a category determination step. The category of uncategorized text is determined (predicted) by (S8), transmitted to the client 21 (S9), and waits again.
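The request handling on the classification server can be sketched as a small dispatcher. Everything here is hypothetical: the class name, the dictionary-based request format, and the error reply for a classification request that arrives before any classifier exists are assumptions made for illustration. Network transport and persistence to RAM or HDD are omitted.

```python
class ClassificationServer:
    """Dispatches learning and classification requests to pluggable steps."""

    def __init__(self, make_vectors, train, predict):
        self.make_vectors = make_vectors  # texts -> list of feature vectors
        self.train = train                # (vectors, labels) -> classifier
        self.predict = predict            # (classifier, vector) -> category
        self.classifier = None            # stored classifier, if any

    def handle(self, request):
        kind = request["type"]
        if kind == "learn":
            vectors = self.make_vectors(request["texts"])
            self.classifier = self.train(vectors, request["labels"])
            return {"status": "done"}     # processing-end notification
        if kind == "classify":
            if self.classifier is None:
                # Assumed behaviour: the source does not cover this case.
                return {"status": "error", "reason": "no classifier"}
            vectors = self.make_vectors(request["texts"])
            categories = [self.predict(self.classifier, v) for v in vectors]
            return {"status": "ok", "categories": categories}
        return {"status": "error", "reason": "unknown request"}
```

Keeping the three steps as injected callables mirrors the separation into feature vector generation, classifier generation, and category determination, and lets each be replaced or tested on its own.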

このプログラムがインストールされた分類サーバ２０は、各ステップの処理を実行することにより、特徴ベクトル生成手段、分類器生成手段、カテゴリー判定手段として機能するテキスト分類処理装置を構成する。 The classification server 20 in which this program is installed constitutes a text classification processing device that functions as a feature vector generation unit, a classifier generation unit, and a category determination unit by executing the processing of each step.

As described above, the text classification processing device according to the present invention can be used as a text filtering device that identifies and removes inappropriate posts with high efficiency and high accuracy from texts that are large in volume and to which general linguistic knowledge is difficult to apply, such as posts on Internet bulletin boards.

FIG. 1 is a flowchart showing an example of the processing flow of the text classification processing method according to the present invention.
FIG. 2 is a diagram explaining the feature vector generation process of the text classification processing method according to the present invention.
FIG. 3 is a diagram explaining the classifier generation process of the text classification processing method according to the present invention.
FIG. 4 is a block diagram showing an example of the hardware configuration of the text classification processing device according to the present invention.
FIG. 5 is a system configuration diagram for the case where the text classification processing device according to the present invention is operated on a network as a classification server.
FIG. 6 is a functional block diagram of the text classification processing device according to the present invention.
FIG. 7 is a flowchart showing an example of the processing flow when the text classification processing program according to the present invention is run on a classification server.

Claims

A text classification processing method for classifying text in a system comprising a data storage device that stores text including character strings that are difficult to analyze morphologically, and a data processing device that performs text classification processing for determining whether or not the text belongs to a certain category, the method comprising:
a feature vector generation process of extracting arbitrary character strings of a predetermined length or less from the text, calculating a feature quantity of each character string, and generating a feature vector having the feature quantities as features;
a classifier generation process of, when a training data set is given in which each text is labeled in advance as belonging or not belonging to a certain category, converting the texts of the training data set into feature vectors by the feature vector generation process, supplying the feature vectors together with the labels to a support vector machine, training the support vector machine, and generating a text classifier based on the support vector machine; and
a category determination process of, when a text is given for which it is unknown whether it belongs to the category, generating a feature vector of the text by the feature vector generation process and determining whether or not the text belongs to the category using the text classifier generated by the classifier generation process, wherein the data processing device executes each of these processes.

The text classification processing method according to claim 1,
The classifier generation process includes:
dividing the training data set into a plurality of subsets, and repeating, for each subset in turn, a process of applying the support vector machine to train a temporary classifier, extracting support vectors from that classifier, and mixing the extracted support vectors with the next subset to form the next input to the support vector machine.

A text classification processing device comprising a data storage device that stores text including character strings that are difficult to analyze morphologically, and a data processing device that performs text classification processing for determining whether or not the text belongs to a certain category, the device comprising:
feature vector generation means for extracting arbitrary character strings of a predetermined length or less from the text, calculating a feature quantity of each character string, and generating a feature vector having the feature quantities as features;
classifier generation means for, when a training data set is given in which each text is labeled in advance as belonging or not belonging to a certain category, converting the texts of the training data set into feature vectors by the feature vector generation means, supplying the feature vectors together with the labels to a support vector machine, training the support vector machine, and generating a text classifier based on the support vector machine; and
category determination means for, when a text is given for which it is unknown whether it belongs to the category, generating a feature vector of the text by the feature vector generation means and determining whether or not the text belongs to the category using the text classifier generated by the classifier generation means.

The text classification processing device according to claim 1,
The classifier generating means includes:
divides the training data set into a plurality of subsets, and repeats, for each subset in turn, a process of applying the support vector machine to train a temporary classifier, extracting support vectors from that classifier, and mixing the extracted support vectors with the next subset to form the next input to the support vector machine.

A text classification processing program for causing a computer to execute text classification processing for determining whether or not text including character strings that are difficult to analyze morphologically belongs to a certain category, the program causing the computer to function as:
feature vector generation means for extracting arbitrary character strings of a predetermined length or less from the text, calculating a feature quantity of each character string, and generating a feature vector having the feature quantities as features;
classifier generation means for, when a training data set is given in which each text is labeled in advance as belonging or not belonging to a certain category, converting the texts of the training data set into feature vectors by the feature vector generation means, supplying the feature vectors together with the labels to a support vector machine, training the support vector machine, and generating a text classifier based on the support vector machine; and
category determination means for, when a text is given for which it is unknown whether it belongs to the category, generating a feature vector of the text by the feature vector generation means and determining whether or not the text belongs to the category using the text classifier generated by the classifier generation means.

The text classification processing program according to claim 1,
The classifier generating means includes:
divides the training data set into a plurality of subsets, and repeats, for each subset in turn, a process of applying the support vector machine to train a temporary classifier, extracting support vectors from that classifier, and mixing the extracted support vectors with the next subset to form the next input to the support vector machine.