JP2013131073A

JP2013131073A - Classification model learning method, device, program, and review document classifying method

Info

Publication number: JP2013131073A
Application number: JP2011280545A
Authority: JP
Inventors: Mariko Kawaba; 真理子川場; Toru Hirano; 徹平野; Toshiaki Makino; 俊朗牧野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-12-21
Filing date: 2011-12-21
Publication date: 2013-07-04

Abstract

PROBLEM TO BE SOLVED: To properly classify whether documents are review documents, using only a small amount of learning data. SOLUTION: A sentence division unit 22 divides each of a plurality of learning blog documents, which include review documents describing information about a specific target, into sentences. A morphological analysis unit 23 performs morphological analysis on each sentence, and a feature extraction unit 24 extracts, for each sentence of the learning blog documents, features indicating the characteristics of that sentence. A learning unit 25 uses the features extracted for each sentence of the learning blog documents that are review documents as positive-example learning data, and the features extracted for each sentence of the learning blog documents that are not review documents as negative-example learning data, and thereby learns a classification model for classifying whether an input sentence is a sentence in a review document.

Description

The present invention relates to a classification model learning method, apparatus, and program, and to a review document classification method, and more particularly to a classification model learning method, apparatus, and program and a review document classification method for classifying whether a document is a review document.

There is a classification method that classifies documents by using the distribution of words in documents such as blogs (see, for example, Non-Patent Document 1). This method relies on the fact that some words tend to appear in a specific field while others do not, and uses these distributions to classify documents by field.

Non-Patent Document 1: 平野耕一, 古林紀哉, 高橋淳一 (Koichi Hirano, Kiya Kobayashi, Junichi Takahashi), "Automatic classification of Japanese-language blogs" (「日本語圏ブログの自動分類」), IPSJ Research Report (情報処理学会研究報告), 2005.

However, when classifying whether a document obtained from social media such as a blog is a review document, the documents are written in a wide variety of styles, so it is very difficult to prepare a large amount of learning data for building a classifier. With only a small amount of learning data, on the other hand, it is not possible to appropriately classify whether a document is a review document.

The present invention has been made in view of the above circumstances, and its object is to provide a classification model learning method, apparatus, and program, and a review document classification method, that can appropriately classify whether a document is a review document with a small amount of learning data.

To achieve the above object, the classification model learning method according to the present invention includes: a step in which sentence dividing means divides, into sentences, each of a plurality of learning documents including review documents in which information about a specific target is described; a step in which feature extracting means extracts, for each sentence of the learning documents divided by the sentence dividing means, features indicating the characteristics of the sentence; and a step in which learning means learns, based on the features extracted for each sentence of the plurality of learning documents, a classification model for classifying whether an input sentence is a sentence in a review document.

The classification model learning apparatus according to the present invention includes: sentence dividing means for dividing, into sentences, each of a plurality of learning documents including review documents in which information about a specific target is described; feature extracting means for extracting, for each sentence of the learning documents divided by the sentence dividing means, features indicating the characteristics of the sentence; and learning means for learning, based on the features extracted for each sentence of the plurality of learning documents, a classification model for classifying whether an input sentence is a sentence in a review document.

According to the classification model learning method and the classification model learning apparatus of the present invention, the sentence dividing means divides each of a plurality of learning documents, including review documents in which information about a specific target is described, into sentences. The feature extracting means then extracts, for each sentence of the learning documents divided by the sentence dividing means, features indicating the characteristics of the sentence.

The learning means then learns, based on the features extracted for each sentence of the plurality of learning documents, a classification model for classifying whether an input sentence is a sentence in a review document.

In this way, by dividing each learning document, including the review documents, into sentences and learning the classification model from the features extracted for each sentence of the learning documents, a classification model that can appropriately classify whether a document is a review document is obtained from a small amount of learning data.

The review document classification method according to the present invention includes: a step in which input sentence dividing means divides an input document into sentences; a step in which input feature extracting means extracts the features for each sentence of the document divided by the input sentence dividing means; a step in which classifying means classifies, for each sentence of the document, whether the sentence is a sentence in a review document, based on the classification model learned by the above classification model learning method and on the features of each sentence extracted by the input feature extracting means; and a step in which determining means determines whether the input document is a review document, based on the classification results that the classifying means produced for the sentences of the document.

According to the review document classification method of the present invention, the input sentence dividing means divides an input document into sentences. The input feature extracting means then extracts the features for each sentence of the document divided by the input sentence dividing means.

The classifying means then classifies, for each sentence of the document, whether the sentence is a sentence in a review document, based on the learned classification model and on the features of each sentence extracted by the input feature extracting means. The determining means determines whether the input document is a review document, based on the classification results that the classifying means produced for the sentences of the document.

In this way, by determining whether an input document is a review document with a classification model learned from the features extracted for each sentence of learning documents that include review documents, it is possible to appropriately classify whether a document is a review document with a small amount of learning data.

The program according to the present invention is a program for causing a computer to execute each step of the above classification model learning method or of the above review document classification method.

As described above, the classification model learning method, apparatus, and program of the present invention divide each learning document, including the review documents, into sentences and learn the classification model from the features extracted for each sentence of the learning documents. This has the effect that a classification model that can appropriately classify whether a document is a review document is obtained from a small amount of learning data.

Further, the review document classification method and program of the present invention determine whether an input document is a review document by using a classification model learned from the features extracted for each sentence of learning documents that include review documents. This has the effect that whether a document is a review document can be appropriately classified with a small amount of learning data.

FIG. 1 is a schematic diagram showing the configuration of a review document classification apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an input blog document.
FIG. 3 is a diagram for explaining review sentences and non-review sentences.
FIG. 4 shows (A) an input blog document, (B) the result of dividing it into sentences, and (C) the features extracted for each sentence.
FIG. 5 is a diagram showing the classification result for each sentence.
FIG. 6 is a flowchart showing the learning processing routine of the review document classification apparatus according to the embodiment of the present invention.
FIG. 7 is a flowchart showing the document classification processing routine of the review document classification apparatus according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<System configuration>
The review document classification apparatus 100 according to the embodiment of the present invention receives a blog document obtained from social media (for example, a blog) and outputs a determination result indicating whether the blog document is a review document, that is, a document describing objective or subjective information (for example, opinions such as word-of-mouth information) about a specific target (for example, a store or a product). One blog document is text data consisting of one or more sentences. The review document classification apparatus 100 is implemented by a computer including a CPU, a RAM, and a ROM that stores programs for executing a learning processing routine and a document classification processing routine described later, and is functionally configured as follows. As shown in FIG. 1, the review document classification apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 30.

The input unit 10 accepts a document group consisting of a plurality of blog documents input as learning documents. For example, data such as that shown in FIG. 2 can be input as a blog document. Together with each learning blog document, the input unit 10 accepts teacher information indicating whether that blog document is a review document about a specific target.

The input unit 10 also accepts a blog document input as a classification target.

The input blog documents may already have undergone morphological analysis, in which case the morphological analysis units 23 and 32 described later can be omitted.

The calculation unit 20 includes a document database 21, a sentence division unit 22, a morphological analysis unit 23, a feature extraction unit 24, a learning unit 25, and a model storage unit 26.

The document database 21 stores the document group consisting of the plurality of learning blog documents accepted by the input unit 10, together with the teacher information for each blog document.

The sentence division unit 22 divides each blog document in the document group stored in the document database 21 into sentences. Any known technique may be used for sentence division; for example, a document may be split wherever punctuation marks or line breaks appear. In documents obtained from social media such as blogs, emoticons and emoji (pictograms) are also often used to mark sentence boundaries, so these may be used as sentence delimiters as well.
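The splitting rule described for the sentence division unit can be sketched in Python as follows. This is only an illustrative sketch: the set of emoticons, the emoji code-point ranges, and the function name `split_sentences` are assumptions, since the specification only says that punctuation, line breaks, emoticons, and emoji may serve as delimiters.

```python
import re

# Hypothetical delimiter set (not given in the specification): a few emoticons,
# emoji in common Unicode blocks, and sentence-final punctuation.
_EMOTICONS = [r"\(\^_\^\)", r"\(\^o\^\)", r"\(T_T\)", r"\(;_;\)"]
_EMOJI = r"[\u2600-\u27BF\U0001F300-\U0001FAFF]"
_PUNCT = r"[。．！？!?]"
_DELIM = re.compile("(?:" + "|".join(_EMOTICONS + [_EMOJI, _PUNCT]) + ")+|\n+")


def split_sentences(document: str) -> list[str]:
    """Split a document at punctuation, line breaks, emoticons, and emoji."""
    sentences = []
    start = 0
    for match in _DELIM.finditer(document):
        # Keep the delimiter with the sentence it closes, so that emoticons
        # remain available to a later feature-extraction step.
        piece = document[start:match.end()].strip()
        if piece:
            sentences.append(piece)
        start = match.end()
    tail = document[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Keeping each delimiter attached to the preceding sentence is a design choice of this sketch; it lets emoticons and emoji be reused as features rather than being discarded at split time.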

For each blog document, the morphological analysis unit 23 applies morphological analysis, an existing technique, to each divided sentence, splitting the sentence into words, attaching a part of speech to each word, and outputting the result. For example, suppose the blog document is "表参道に行ったところ、すごい行列だったので覗いてみたら新商品の発売が始まっていました。買うつもりはなかったのですが、ついつられて、A社の新商品を買ってしまいました。・・・" ("When I went to Omotesando there was a huge line, so I took a look and found that a new product had just gone on sale. I had no intention of buying anything, but I got carried away and ended up buying Company A's new product. ..."). The morphological analysis result is then "... (omitted) ... A社 (noun) / の (case particle) / 新商品 (noun) / を (case particle) / 買 (verb stem) / っ (verb inflectional ending) / て (verb suffix) / しま (verb stem) / い (verb inflectional ending) / ました (verb suffix) ... (omitted) ...".
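To make the shape of this output concrete, the following Python sketch parses a slash-separated analysis string into (surface form, part of speech) pairs. It assumes the analyzer's result is serialized as in the example above, with each part of speech in ASCII or full-width parentheses; the function name `parse_analysis` is hypothetical, and a real morphological analyzer would normally expose tokens through its own API instead.

```python
import re

# One token looks like "surface (part of speech)"; both ASCII and full-width
# parentheses are accepted. This serialization is an assumption.
_TOKEN = re.compile(r"^(?P<surface>.+?)\s*[（(](?P<pos>[^（）()]+)[）)]$")


def parse_analysis(line: str) -> list[tuple[str, str]]:
    """Parse 'w1 (pos1) / w2 (pos2) / ...' into [(w1, pos1), (w2, pos2), ...]."""
    tokens = []
    for chunk in line.split("/"):
        match = _TOKEN.match(chunk.strip())
        if match is None:
            raise ValueError(f"unrecognised token: {chunk!r}")
        tokens.append((match.group("surface"), match.group("pos")))
    return tokens
```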

For each blog document, the feature extraction unit 24 uses the morphological analysis results to create, for each divided sentence, features that indicate the characteristics of the sentence and are used for machine learning. For example, the frequency distribution (histogram) of the morphemes in the sentence is used as the features of the sentence.

Review documents often contain many particular sentiment expressions and evaluative expressions. The presence or absence and the type of sentiment and evaluative expressions may therefore be used as features. Emoticons, emoji, and the like may also be used as features.

In social media text such as blogs, an evaluation is sometimes spread across multiple sentences. The evaluative expressions, morpheme frequencies, and so on of the preceding and following sentences may therefore also be used as features.
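The feature types listed above (morpheme histogram, evaluative expressions, and features of neighbouring sentences) could be combined roughly as in the following Python sketch. The lexicon `EVALUATIVE`, the feature-name prefixes, and both function names are illustrative assumptions; the specification does not give a concrete lexicon or encoding.

```python
from collections import Counter

# Hypothetical lexicon of evaluative expressions (not taken from the specification).
EVALUATIVE = {"おいしい", "すごい", "最高", "残念"}


def sentence_features(morphemes: list[tuple[str, str]]) -> Counter:
    """Histogram of (surface, part-of-speech) pairs plus evaluative-expression flags."""
    feats = Counter(f"m:{surface}/{pos}" for surface, pos in morphemes)
    for surface, _ in morphemes:
        if surface in EVALUATIVE:
            feats[f"eval:{surface}"] = 1  # which expression occurred
            feats["has_eval"] = 1         # whether any expression occurred
    return feats


def document_features(sentences: list[list[tuple[str, str]]]) -> list[Counter]:
    """Per-sentence features, augmented with those of the previous and next sentence."""
    own = [sentence_features(s) for s in sentences]
    combined = []
    for i, feats in enumerate(own):
        merged = Counter(feats)
        if i > 0:
            merged.update({f"prev:{k}": v for k, v in own[i - 1].items()})
        if i + 1 < len(own):
            merged.update({f"next:{k}": v for k, v in own[i + 1].items()})
        combined.append(merged)
    return combined
```

Prefixing the neighbour features with `prev:` and `next:` keeps them distinct from the sentence's own features, so a learner can weight "an evaluative expression in the previous sentence" separately from "an evaluative expression in this sentence".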

As shown in FIG. 3, the feature extraction unit 24 treats every sentence in a blog document that is a review document as a review sentence, and stores the features of the review sentences in a memory (not shown) as positive-example learning data. Likewise, the feature extraction unit 24 treats every sentence in a blog document that is a non-review document as a non-review sentence, and stores the features of the non-review sentences in the memory as negative-example learning data. A blog contains roughly 10 sentences per document, so if learning requires about 1,000 sentences, for example, about 100 blog documents are sufficient for appropriate learning.
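The way a document-level label is propagated to every sentence can be sketched as follows. The function name and the +1/-1 label encoding are assumptions made for illustration. With roughly 10 sentences per blog document, 100 labelled documents yield about 100 × 10 = 1,000 sentence-level examples.

```python
def build_training_data(documents):
    """Turn document-level labels into sentence-level learning data.

    documents: iterable of (sentence_feature_list, is_review) pairs.
    Returns a list of (features, label) pairs, where the label is +1 for a
    positive example (review sentence) and -1 for a negative example.
    """
    examples = []
    for sentence_features, is_review in documents:
        label = 1 if is_review else -1
        for features in sentence_features:
            # Every sentence inherits the label of the document it came from.
            examples.append((features, label))
    return examples
```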

The learning unit 25 uses the positive-example learning data (features of review sentences) and the negative-example learning data (features of non-review sentences) obtained from the document group of learning documents to create, by machine learning, a classification model for classifying whether an input sentence is a sentence in a review document, and stores the model in the model storage unit 26. As the machine learning algorithm, for example, a support vector machine (SVM) or a Markov Logic Network (MLN) can be used.

The classification model stored in the model storage unit 26 holds, for example, a numerical weight for each feature.
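As a rough illustration of how a model consisting of one weight per feature might be learned, the sketch below trains a linear classifier by plain subgradient descent on the L2-regularised hinge loss, which is the objective of a linear soft-margin SVM. It is a dependency-free stand-in, not the SVM or MLN implementation used in the embodiment; the function name and the hyperparameters `epochs`, `eta`, and `lam` are assumptions.

```python
def train_linear_svm(examples, epochs=100, eta=0.1, lam=0.01):
    """Learn {feature name: weight} from (feature dict, label) pairs, label in {+1, -1}.

    Minimises lam/2 * ||w||^2 + hinge loss with a fixed step size eta.
    """
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            margin = label * sum(weights.get(k, 0.0) * v for k, v in feats.items())
            # Gradient step for the regularisation term: shrink every weight.
            for k in weights:
                weights[k] *= 1.0 - eta * lam
            # Gradient step for the hinge loss: only when the margin is violated.
            if margin < 1.0:
                for k, v in feats.items():
                    weights[k] = weights.get(k, 0.0) + eta * label * v
    return weights
```

The returned dictionary plays the role of the stored classification model: features typical of review sentences end up with positive weights, and features typical of non-review sentences with negative weights.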

The calculation unit 20 further includes a sentence division unit 31, a morphological analysis unit 32, a feature extraction unit 33, a classification unit 34, and a review document determination unit 35. The sentence division unit 31 is an example of the input sentence dividing means, and the feature extraction unit 33 is an example of the input feature extracting means.

Like the sentence division unit 22, the sentence division unit 31 divides an input blog document to be classified, such as that shown in FIG. 4(A), into sentences as shown in FIG. 4(B).

Like the morphological analysis unit 23, the morphological analysis unit 32 applies morphological analysis to each divided sentence of the blog document to be classified, splitting the sentence into words, attaching a part of speech to each word, and outputting the result.

Like the feature extraction unit 24, the feature extraction unit 33 uses the morphological analysis results to create, for each divided sentence of the blog document to be classified, features indicating the characteristics of the sentence, as shown in FIG. 4(C).

For each sentence of the blog document to be classified, the classification unit 34 builds, for example, a feature vector whose elements are the numerical values of the extracted features multiplied by the corresponding weights of the classification model, and uses the support vector machine algorithm to classify whether the sentence is a sentence in a review document. As a result, as shown in FIG. 5, each sentence is classified as either a review sentence or a non-review sentence.
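One plausible reading of this step is a linear decision function: each feature value is multiplied by its weight, the products are summed, and the sign of the sum decides the class. The Python sketch below follows that reading. Thresholding at zero, the optional `bias` term, and the function name are assumptions; the specification does not spell out the decision rule.

```python
def classify_sentence(features, weights, bias=0.0):
    """Return True if the sentence is classified as a review sentence.

    features: {feature name: value} for one sentence.
    weights:  {feature name: weight} from the stored classification model.
    Features that the model has never seen contribute nothing to the score.
    """
    score = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return score > 0
```

For example, with the hypothetical model `{"おいしい": 0.8, "天気": -0.6}`, a sentence containing "おいしい" once scores 0.8 and is classified as a review sentence, while a sentence containing "天気" twice scores -1.2 and is classified as a non-review sentence.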

For the blog document to be classified, the review document determination unit 35 determines that the document is a review document when the proportion of sentences classified as review sentences is equal to or greater than a threshold, and determines that it is a non-review document when the proportion is below the threshold.
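The document-level decision reduces to a ratio test, sketched below in Python. The default threshold of 0.5 and the handling of an empty document are assumptions; the specification does not state a threshold value.

```python
def is_review_document(sentence_is_review: list[bool], threshold: float = 0.5) -> bool:
    """Decide at document level from per-sentence classification results."""
    if not sentence_is_review:
        return False  # assumption: a document with no sentences is not a review
    ratio = sum(sentence_is_review) / len(sentence_is_review)
    return ratio >= threshold
```

Because the decision depends on a proportion rather than on any single sentence, a few misclassified sentences do not necessarily change the verdict for the whole document.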

The determination result of the review document determination unit 35 is output from the output unit 30.

<Operation of the review document classification apparatus>
Next, the operation of the review document classification apparatus 100 according to the present embodiment will be described. First, when a document group consisting of a plurality of learning blog documents, together with teacher information indicating whether each of those blog documents is a review document, is input to the review document classification apparatus 100, the apparatus stores the input document group and teacher information in the document database 21. The review document classification apparatus 100 then executes the learning processing routine shown in FIG. 6.

First, in step S101, one blog document is taken from the document database 21. In step S102, the sentence division unit 22 divides the blog document taken in step S101 into sentences. In step S103, the morphological analysis unit 23 performs morphological analysis on each sentence of the blog document.

In the next step S104, the feature extraction unit 24 extracts features for each sentence of the blog document based on the morphological analysis results obtained in step S103. In step S105, if the blog document is a review document, the features of each sentence extracted in step S104 are stored in the memory as features of review sentences (positive-example learning data); if the blog document is a non-review document, the features of each sentence extracted in step S104 are stored in the memory as features of non-review sentences (negative-example learning data).

In step S106, it is determined whether steps S101 to S105 have been executed for all blog documents stored in the document database 21. If there is a blog document for which steps S101 to S105 have not yet been executed, the routine returns to step S101 and takes out that blog document. If steps S101 to S105 have been executed for all blog documents, the routine proceeds to step S107.

In step S107, the learning unit 25 learns the classification model by machine learning, using the positive-example and negative-example learning data stored in the memory. In step S108, the classification model is stored in the model storage unit 26, and the learning processing routine ends.
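The control flow of steps S101 to S108 can be summarised in the following Python sketch. The four processing stages are passed in as callables so that the sketch stays independent of any particular splitter, analyzer, feature extractor, or learner; these interfaces and the function name are hypothetical.

```python
def learning_routine(document_db, split, analyze, extract, train):
    """Run the learning routine over (text, is_review) pairs and return the model.

    split(text) -> sentences, analyze(sentence) -> morphemes,
    extract(morphemes) -> features, train(positives, negatives) -> model.
    """
    positives, negatives = [], []
    for text, is_review in document_db:                        # S101, repeated via S106
        sentences = split(text)                                # S102
        analyzed = [analyze(sentence) for sentence in sentences]  # S103
        features = [extract(morphemes) for morphemes in analyzed]  # S104
        if is_review:                                          # S105
            positives.extend(features)
        else:
            negatives.extend(features)
    model = train(positives, negatives)                        # S107
    return model                                               # S108: the caller stores the model
```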

When a blog document to be classified is input to the review document classification apparatus 100, the apparatus executes the document classification processing routine shown in FIG. 7.

First, in step S111, the blog document input through the input unit 10 is accepted. In step S112, the sentence division unit 31 divides the blog document input in step S111 into sentences. In step S113, the morphological analysis unit 32 performs morphological analysis on each sentence of the blog document.

In the next step S114, the feature extraction unit 33 extracts features for each sentence of the blog document. In step S115, the classification unit 34 classifies each sentence of the blog document as a review sentence or a non-review sentence, based on the features extracted in step S114 and the classification model stored in the model storage unit 26.

In step S116, the review document determination unit 35 determines whether the blog document is a review document, based on the proportion of sentences classified as review sentences in step S115. In step S117, the output unit 30 outputs the determination result of step S116, and the document classification processing routine ends.
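Steps S112 to S116 can likewise be sketched as a single Python function. As before, the processing stages are injected as callables with hypothetical interfaces, and the default threshold of 0.5 is an assumption. The function returns the per-sentence results alongside the document-level verdict so that both can be inspected.

```python
def classification_routine(text, split, analyze, extract, classify, threshold=0.5):
    """Classify one input document; return (per-sentence flags, is_review_document)."""
    sentences = split(text)                                # S112
    features = [extract(analyze(s)) for s in sentences]    # S113, S114
    flags = [bool(classify(f)) for f in features]          # S115
    ratio = sum(flags) / len(flags) if flags else 0.0
    return flags, ratio >= threshold                       # S116; output (S117) is left to the caller
```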

As described above, the review document classification apparatus according to the present embodiment divides each learning blog document, including the review documents, into sentences, and learns the classification model using the features extracted for each sentence of the blog documents that are review documents as positive-example learning data and the features extracted for each sentence of the blog documents that are non-review documents as negative-example learning data. A classification model that can appropriately classify whether a document is a review document is thereby obtained from a small amount of learning data.

In addition, by determining whether an input blog document is a review document with a classification model learned from the features extracted for each sentence of learning blog documents that include review documents, it is possible to appropriately classify whether a document is a review document with a small amount of learning data.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, each sentence of the learning blog documents may be labelled by hand as either a review sentence, in which objective or subjective information about a specific target is described, or a non-review sentence. In this case, the learning unit may learn the classification model using the features extracted for each review sentence among the sentences of the learning blog documents as positive-example learning data, and the features extracted for each non-review sentence among the sentences of the learning blog documents as negative-example learning data.

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
21 Document database
22, 31 Sentence division unit
23, 32 Morphological analysis unit
24, 33 Feature extraction unit
25 Learning unit
26 Model storage unit
34 Classification unit
35 Review document determination unit
100 Review document classification apparatus

Claims

1. A classification model learning method comprising:
a step of dividing, by sentence dividing means, each of a plurality of learning documents, including a review document in which information about a specific target is described, into sentences;
a step of extracting, by feature extracting means, for each sentence of the learning documents divided by the sentence dividing means, features indicating the characteristics of the sentence; and
a step of learning, by learning means, based on each of the features extracted for each sentence of the plurality of learning documents, a classification model for classifying whether an input sentence is a sentence in the review document.

2. The classification model learning method according to claim 1, wherein, in the step of learning by the learning means, the classification model is learned using each of the features extracted for each sentence of the learning documents that are the review document as positive-example features, and each of the features extracted for each sentence of the learning documents that are not the review document as negative-example features.

3. The classification model learning method according to claim 1, wherein, in the step of learning by the learning means, the classification model is learned using each of the features extracted for each review sentence, among the sentences of the learning documents, in which information about the specific target is described, as positive-example features, and each of the features extracted for each sentence, among the sentences of the learning documents, that is not a review sentence, as negative-example features.

4. A review document classification method comprising:
a step of dividing, by input sentence dividing means, an input document into sentences;
a step of extracting, by input feature extracting means, the features for each sentence of the document divided by the input sentence dividing means;
a step of classifying, by classifying means, for each sentence of the document, whether the sentence is a sentence in the review document, based on the classification model learned by the classification model learning method according to any one of claims 1 to 3 and on the features of each sentence extracted by the input feature extracting means; and
a step of determining, by determining means, whether the input document is the review document, based on the classification result of each sentence of the document classified by the classifying means.

5. The classification model learning method according to claim 1, wherein the document is a blog document.

6. A classification model learning apparatus comprising:
sentence dividing means for dividing each of a plurality of learning documents, including a review document in which information about a specific target is described, into sentences;
feature extracting means for extracting, for each sentence of the learning documents divided by the sentence dividing means, features indicating the characteristics of the sentence; and
learning means for learning, based on each of the features extracted for each sentence of the plurality of learning documents, a classification model for classifying whether an input sentence is a sentence in the review document.

7. A program for causing a computer to execute the steps of the classification model learning method according to claim 1, 2, 3, or 5, or of the review document classification method according to claim 4.