JP2013131074A

JP2013131074A - Classification model learning method, device, program, and review document classifying method

Info

Publication number: JP2013131074A
Application number: JP2011280546A
Authority: JP
Inventors: Mariko Kawaba (川場真理子); Toru Hirano (平野徹); Toshiaki Makino (牧野俊朗)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-12-21
Filing date: 2011-12-21
Publication date: 2013-07-04

Abstract

PROBLEM TO BE SOLVED: To adjust the balance between positive and negative examples in training data, and to accurately classify whether or not a document is a review document. SOLUTION: An evaluation sentence extraction unit 23 extracts evaluation sentences from each of a plurality of training blog documents, which include review documents, and thereby creates evaluation documents. A feature extraction unit 24 extracts features from each of the evaluation documents created from the training documents. A learning unit 25 uses the features extracted for training blog documents that are review documents as positive-example features, and the features extracted for training blog documents that are not review documents as negative-example features, and thereby learns a classification model for classifying whether or not an input document is a review document.

Description

The present invention relates to a classification model learning method, apparatus, and program, and to a review document classification method, and more particularly to a classification model learning method, apparatus, and program and a review document classification method for classifying whether or not a document is a review document.

There is a classification method that classifies documents by using the distribution of words in documents such as blogs (see, for example, Non-Patent Document 1). This method relies on the fact that some words tend to appear in a particular field while others do not, and uses those distributions to classify documents by field.

Non-Patent Document 1: 平野耕一, 古林紀哉, 高橋淳一, "Automatic classification of Japanese-language blogs" (「日本語圏ブログの自動分類」), IPSJ SIG Technical Report, 2005.

However, when review documents are to be extracted from documents obtained from social media such as blogs, the number of documents that can qualify as review documents is small compared with the number of non-review documents. For example, among blogs about restaurants, about 80% are non-review documents and only about 20% are review documents. Consequently, the positive and negative examples given to a learner are poorly balanced, which can degrade the classification performance of the classification model obtained as the learning result.

The present invention has been made in view of the above circumstances. Its object is to provide a classification model learning method, apparatus, and program and a review document classification method that can adjust the balance between positive and negative examples in training data and can accurately classify whether or not a document is a review document.

To achieve the above object, a classification model learning method according to a first invention includes: a step in which evaluation sentence extraction means extracts evaluation sentences containing a predetermined evaluation expression from each of a plurality of training documents, which include review documents in which information about a specific target is described; a step in which feature extraction means extracts, for each of the training documents from which evaluation sentences were extracted, features representing the characteristics of the document made up of the evaluation sentences extracted from that training document, or features representing the characteristics of the training document itself; and a step in which learning means learns a classification model for classifying whether or not an input document is a review document, using the features extracted for training documents that are review documents as positive-example features and the features extracted for training documents that are not review documents as negative-example features.

A classification model learning apparatus according to a second invention includes: evaluation sentence extraction means for extracting evaluation sentences containing a predetermined evaluation expression from each of a plurality of training documents, which include review documents in which information about a specific target is described; feature extraction means for extracting, for each of the training documents from which evaluation sentences were extracted, features representing the characteristics of the document made up of the evaluation sentences extracted from that training document, or features representing the characteristics of the training document itself; and learning means for learning a classification model for classifying whether or not an input document is a review document, using the features extracted for training documents that are review documents as positive-example features and the features extracted for training documents that are not review documents as negative-example features.

According to the classification model learning method of the first invention and the classification model learning apparatus of the second invention, the evaluation sentence extraction means extracts evaluation sentences containing a predetermined evaluation expression from each of a plurality of training documents, which include review documents in which information about a specific target is described. The feature extraction means then extracts, for each of the training documents from which evaluation sentences were extracted, features representing the characteristics of the document made up of the extracted evaluation sentences, or features representing the characteristics of the training document itself.

The learning means then learns a classification model for classifying whether or not an input document is a review document, using the features extracted for training documents that are review documents as positive-example features and the features extracted for training documents that are not review documents as negative-example features.

By extracting evaluation sentences from the training documents in this way, and learning the classification model from the features extracted for each training document from which evaluation sentences were extracted, the balance between positive and negative examples in the training data can be adjusted, and whether or not a document is a review document can be classified accurately.

A classification model learning method according to a third invention includes: a step in which sentence division means divides each of a plurality of training documents, which include review documents in which information about a specific target is described, into sentences; a step in which evaluation sentence extraction means extracts evaluation sentences containing a predetermined evaluation expression from each of the training documents; a step in which feature extraction means extracts, for each training document from which evaluation sentences were extracted, features of each evaluation sentence extracted from that training document, or features of each sentence of that training document; and a step in which learning means learns, based on the features extracted for each evaluation sentence or each sentence of the training documents, the classification model for classifying whether or not an input sentence is a sentence in a review document.

By dividing each training document, including the review documents, into sentences in this way, extracting evaluation sentences from the training documents, and learning the classification model from the features extracted for each sentence of the training documents from which evaluation sentences were extracted, the balance between positive and negative examples in the training data can be adjusted, and whether or not a document is a review document can be classified accurately.

A review document classification method according to a fourth invention includes: a step in which input feature extraction means extracts features representing the characteristics of the document made up of the evaluation sentences in an input document, or features representing the characteristics of the input document itself; and a step in which classification means classifies whether or not the input document is a review document, based on the classification model learned by the classification model learning method according to the first invention and on the features extracted by the input feature extraction means.

By determining whether or not an input document is a review document with a classification model learned in this way, that is, from the features extracted for those training documents (review documents included) from which evaluation sentences were extracted, whether or not a document is a review document can be classified accurately using training data in which the balance between positive and negative examples has been adjusted.

A review document classification method according to a fifth invention includes: a step in which input sentence division means divides an input document into sentences; a step in which input feature extraction means extracts features of each evaluation sentence extracted from the input document, or features of each sentence of the input document; a step in which classification means classifies, for each evaluation sentence or each sentence of the input document, whether or not it is a sentence in a review document, based on the classification model learned by the classification model learning method according to the third invention and on the features of each evaluation sentence or each sentence extracted by the input feature extraction means; and a step in which determination means determines whether or not the input document is a review document, based on the classification results that the classification means produced for the evaluation sentences or for the sentences of the document.

By determining whether or not an input document is a review document with a classification model learned in this way, that is, from the features extracted for each sentence of those training documents (review documents included) from which evaluation sentences were extracted, whether or not a document is a review document can be classified accurately using training data in which the balance between positive and negative examples has been adjusted.

A program according to a sixth invention is a program for causing a computer to execute each step of the above classification model learning method or of the above review document classification method.

As described above, the classification model learning method, apparatus, and program of the present invention make it possible to adjust the balance between positive and negative examples in training data and to accurately classify whether or not a document is a review document.

In addition, the review document classification method and program of the present invention make it possible to accurately classify whether or not a document is a review document, using training data in which the balance between positive and negative examples has been adjusted.

Schematic diagram showing the configuration of the review document classification apparatus according to the first embodiment of the present invention.
Diagram showing an input blog document.
(A) Diagram showing an input blog document, (B) diagram showing a morphological analysis result, and (C) diagram showing an evaluation document.
Flowchart showing the learning processing routine of the review document classification apparatus according to the first embodiment of the present invention.
Flowchart showing the document classification processing routine of the review document classification apparatus according to the first embodiment of the present invention.
Schematic diagram showing the configuration of the review document classification apparatus according to the second embodiment of the present invention.
Diagram for explaining the training data obtained from review documents and non-review documents.
Diagram showing the features extracted for each evaluation sentence.
Diagram showing the classification result for each evaluation sentence.
Flowchart showing the learning processing routine of the review document classification apparatus according to the second embodiment of the present invention.
Flowchart showing the document classification processing routine of the review document classification apparatus according to the second embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

[First Embodiment]
<System Configuration>
A review document classification apparatus 100 according to the first embodiment of the present invention receives a blog document obtained from social media (for example, a blog) and outputs a determination result indicating whether or not it is a review document, that is, a document describing objective or subjective information (for example, opinions such as word-of-mouth information) about a specific target (for example, a store or a product). A blog document is text data consisting of one or more sentences. The review document classification apparatus 100 is implemented as a computer having a CPU, a RAM, and a ROM that stores a program for executing the learning processing routine and the document classification processing routine described later, and is functionally configured as follows. As shown in FIG. 1, the review document classification apparatus 100 includes an input unit 10, a computation unit 20, and an output unit 30.

The input unit 10 receives a document group consisting of a plurality of blog documents input as training documents. For example, data such as that shown in FIG. 2 can be input as a blog document. Together with each training blog document, the input unit 10 receives teacher information indicating whether or not that blog document is a review document about the specific target.

The input unit 10 also receives a blog document input as a classification target.

The input blog documents may already have undergone morphological analysis, in which case the morphological analysis units 22 and 31 described later can be omitted.

The computation unit 20 includes a document database 21, a morphological analysis unit 22, an evaluation sentence extraction unit 23, a feature extraction unit 24, a learning unit 25, and a model storage unit 26.

The document database 21 stores the document group of training blog documents received by the input unit 10, together with the teacher information for each blog document.

For each blog document, the morphological analysis unit 22 applies morphological analysis, an existing technique, to each sentence obtained by splitting the document: it segments the sentence into words, assigns a part of speech to each word, and outputs the result. For example, when the blog document is 「A社の新商品を買ってしまいました。・・・」 ("I ended up buying Company A's new product. ...") as shown in FIG. 3(A), the morphological analysis result shown in FIG. 3(B) is obtained: 「A社 (noun) / の (case particle) / 新商品 (noun) / を (case particle) / 買 (verb stem) / っ (verb inflectional ending) / て (verb suffix) / しま (verb stem) / い (verb inflectional ending) / ました (verb suffix) ... (omitted) ...」.

For each blog document, the evaluation sentence extraction unit 23 uses the morphological analysis result to extract only the evaluation sentences, that is, the sentences in which a prepared evaluation expression appears, and creates an evaluation document consisting only of evaluation sentences (a document from which the non-evaluation sentences have been removed), as shown in FIG. 3(C). Doing this for every blog document yields a set of evaluation documents. Documents in which no evaluation sentence appears are deleted from this set.

Evaluation sentences tend to appear more often in review documents than in non-review documents.
Removing the non-evaluation sentences from the input training blog documents, and dropping the documents left with no evaluation sentence, therefore makes it possible to reduce the proportion of non-review documents in the set of evaluation documents.
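As a rough illustration of this effect, the following Python sketch compares the share of positive examples before and after documents without evaluation sentences are dropped. The 20:80 split follows the restaurant-blog example given earlier; the number of non-review documents that contain an evaluation sentence (30) is a purely hypothetical figure chosen for illustration, and the function names are likewise assumptions.

```python
def positive_ratio(labels):
    """Fraction of positive (review) examples in a list of booleans."""
    return sum(labels) / len(labels)


def keep_documents_with_evaluation(docs):
    """docs: list of (is_review, has_evaluation_sentence) pairs.

    Returns the review labels of the documents that survive the filter.
    """
    return [is_review for is_review, has_evaluation in docs if has_evaluation]


# Hypothetical corpus: 20 review documents, all containing evaluation sentences,
# and 80 non-review documents, of which only 30 contain an evaluation sentence.
corpus = [(True, True)] * 20 + [(False, True)] * 30 + [(False, False)] * 50

before = positive_ratio([is_review for is_review, _ in corpus])
after = positive_ratio(keep_documents_with_evaluation(corpus))
```

Under these assumed counts the share of positive examples rises from 20% to 40%; the actual shift depends on how often evaluation expressions occur in non-review documents.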

Evaluation expressions are obtained from a dictionary of evaluation expressions created in advance. The dictionary contains, for example, "delicious" (美味しい), "pretty" (綺麗), "beautiful" (美しい), "cute" (可愛い), "stylish" (おしゃれ), "big" (大きい), "small" (小さい), "few" (少ない), "bad attitude" (態度が悪い), "dirty" (汚い), and "tastes bad" (まずい). When the target is social media such as blogs, emoticons, pictograms (emoji), symbols, and the like may also be used as evaluation expressions.
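The extraction of evaluation sentences can be sketched as follows. This is a minimal Python illustration under several assumptions: it matches dictionary entries by plain substring search on raw text instead of using a morphological analysis result, it uses a handful of English stand-in expressions, and all names (`EVALUATION_EXPRESSIONS`, `split_sentences`, and so on) are hypothetical.

```python
import re

# Hypothetical English stand-ins for entries of the evaluation-expression dictionary.
EVALUATION_EXPRESSIONS = {"delicious", "beautiful", "cute", "dirty", "rude"}


def split_sentences(text):
    """Naive splitter: cut after sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def extract_evaluation_document(text, expressions=EVALUATION_EXPRESSIONS):
    """Keep only the sentences that contain at least one evaluation expression."""
    return [s for s in split_sentences(text)
            if any(e in s.lower() for e in expressions)]


def build_evaluation_documents(labeled_docs):
    """Return (evaluation_document, is_review) pairs.

    Documents in which no evaluation sentence appears are dropped from the set.
    """
    result = []
    for text, is_review in labeled_docs:
        sentences = extract_evaluation_document(text)
        if sentences:
            result.append((sentences, is_review))
    return result
```

For instance, a document such as "I bought a new bag. It is really cute!" would be reduced to its second sentence, while a document with no dictionary hit would disappear from the training set altogether.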

For each evaluation document created from a blog document, the feature extraction unit 24 uses the morphological analysis result to create features that represent the characteristics of the document and are used for machine learning. For example, the frequency distribution (histogram) of the morphemes in the document is used as the document's features.

Particular sentiment expressions and evaluation expressions often appear frequently in review documents. The presence or absence and the type of sentiment and evaluation expressions may therefore be used as features. Emoticons, pictograms, and the like may also be used as features.
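A feature extractor along these lines might look like the following Python sketch. It assumes that morphological analysis has already produced `(surface, part_of_speech)` pairs; the feature-name scheme (`morph=...`, `eval=...`) and the function name are illustrative choices, not taken from the description.

```python
from collections import Counter


def extract_features(evaluation_document, evaluation_expressions=frozenset()):
    """Build a morpheme-frequency histogram plus evaluation-expression flags.

    evaluation_document: list of sentences, each a list of
    (surface, part_of_speech) pairs from a morphological analyzer.
    """
    features = Counter()
    for sentence in evaluation_document:
        for surface, pos in sentence:
            # Histogram feature: how often this morpheme occurs in the document.
            features[f"morph={surface}/{pos}"] += 1
            # Binary feature: this evaluation expression is present.
            if surface in evaluation_expressions:
                features[f"eval={surface}"] = 1
    return dict(features)
```

A sparse dictionary keeps only the morphemes that actually occur, which is convenient when the vocabulary is large.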

The feature extraction unit 24 stores the features of evaluation documents created from blog documents that are review documents in a memory (not shown) as positive-example training data, and stores the features of evaluation documents created from blog documents that are non-review documents in the memory as negative-example training data.

Using the positive-example training data (features of evaluation documents created from review documents) and the negative-example training data (features of evaluation documents created from non-review documents) obtained from the training document group, the learning unit 25 creates, by machine learning, a classification model for classifying whether or not an input document is a review document, and stores it in the model storage unit 26. Algorithms such as the support vector machine (SVM) or Markov Logic Network (MLN) can be used as the machine learning algorithm.
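The description names SVM and MLN as candidate learners. The Python sketch below deliberately swaps in a much simpler perceptron, only to show how positive and negative feature sets can be turned into per-feature weights; it is not the learner used in the invention, and the function name and parameters are assumptions.

```python
def train_perceptron(positive, negative, epochs=10):
    """Learn per-feature weights from positive and negative feature dicts.

    positive / negative: lists of {feature_name: value} dictionaries.
    Returns (weights, bias).
    """
    data = [(f, 1) for f in positive] + [(f, -1) for f in negative]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in data:
            score = bias + sum(weights.get(k, 0.0) * v for k, v in features.items())
            if label * score <= 0:  # misclassified or on the boundary: update
                for k, v in features.items():
                    weights[k] = weights.get(k, 0.0) + label * v
                bias += label
    return weights, bias
```

The returned weight dictionary corresponds loosely to a model that stores one numerical weight per feature; a margin-based learner such as an SVM would generally be preferred in practice.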

The classification model stored in the model storage unit 26 holds, for example, a numerical weight for each feature.

The computation unit 20 also includes a morphological analysis unit 31, an evaluation sentence extraction unit 32, a feature extraction unit 33, and a classification unit 34. The feature extraction unit 33 is an example of the input feature extraction means.

Like the morphological analysis unit 22, the morphological analysis unit 31 applies morphological analysis to the blog document to be classified, segments it into words, assigns a part of speech to each word, and outputs the result.

Like the evaluation sentence extraction unit 23, the evaluation sentence extraction unit 32 uses the morphological analysis result to extract, from the blog document to be classified, only the evaluation sentences in which a prepared evaluation expression appears, and creates an evaluation document (a document from which the non-evaluation sentences have been removed). If no evaluation sentence appears in the blog document to be classified, the document may simply be classified as a non-review document without performing the subsequent processing.

For the evaluation document created from the blog document to be classified, the feature extraction unit 33 uses the morphological analysis result to create features representing the characteristics of the document, in the same way as the feature extraction unit 24.

For the evaluation document created from the blog document to be classified, the classification unit 34 builds, for example, a feature vector whose elements are the numerical values of the extracted features multiplied by the corresponding weights of the classification model, and uses the support vector machine algorithm to classify whether or not the document is a review document.
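The weighting step can be sketched in Python as follows. The decision rule used here (sum of the weighted values plus a bias, compared with zero) is an assumption in the style of a linear SVM, and treating features unknown to the model as having weight 0 is likewise an illustrative choice; the function names are hypothetical.

```python
def weighted_feature_vector(features, weights):
    """Multiply each extracted feature value by the model's weight for it.

    Features the model has never seen get weight 0.0 (an assumption).
    """
    return {name: value * weights.get(name, 0.0) for name, value in features.items()}


def is_review(features, weights, bias=0.0):
    """Assumed linear decision rule: a positive total score means 'review'."""
    return sum(weighted_feature_vector(features, weights).values()) + bias > 0
```

For example, with weights `{"delicious": 1.5, "meeting": -2.0}`, a document whose features are `{"delicious": 2, "cake": 1}` scores 3.0 and would be classified as a review document.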

The classification result produced by the classification unit 34 is output from the output unit 30.

<Operation of the Review Document Classification Apparatus>
Next, the operation of the review document classification apparatus 100 according to the first embodiment is described. First, when a document group consisting of a plurality of training blog documents, together with teacher information indicating whether or not each of those blog documents is a review document, is input to the review document classification apparatus 100, the apparatus stores the input document group and teacher information in the document database 21. The review document classification apparatus 100 then executes the learning processing routine shown in FIG. 4.

First, in step S101, one blog document is retrieved from the document database 21. In step S102, the morphological analysis unit 22 performs morphological analysis on the blog document retrieved in step S101.

In the next step S103, the evaluation sentence extraction unit 23 extracts evaluation sentences from the blog document, based on the morphological analysis result obtained in step S102 and the dictionary of evaluation expressions prepared in advance, and creates an evaluation document. In step S104, the feature extraction unit 24 extracts features for the evaluation document created in step S103, based on the morphological analysis result obtained in step S102. In step S105, if the blog document is a review document, the features of the evaluation document extracted in step S104 are stored in the memory as review document features (positive-example training data); if the blog document is a non-review document, the features extracted in step S104 are stored in the memory as non-review document features (negative-example training data).

In step S106, it is determined whether or not the processing of steps S101 to S105 has been executed for all blog documents stored in the document database 21. If a blog document remains for which steps S101 to S105 have not been executed, the routine returns to step S101 and retrieves that blog document. If steps S101 to S105 have been executed for all blog documents, the routine proceeds to step S107.

In step S107, the learning unit 25 learns the classification model by machine learning, using the positive-example and negative-example training data stored in the memory. In step S108, the classification model is stored in the model storage unit 26, and the learning processing routine ends.
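The control flow of steps S101 to S108 can be summarized by the following Python sketch. The four callables passed in (`analyze`, `extract_evaluation`, `extract_features`, `learn`) are hypothetical placeholders for the morphological analysis unit, the evaluation sentence extraction unit, the feature extraction unit, and the learning unit; only the loop structure is meant to mirror the routine.

```python
def learning_routine(document_db, analyze, extract_evaluation, extract_features, learn):
    """Run the learning loop over (text, is_review) pairs and return the model."""
    positives, negatives = [], []
    for text, is_review in document_db:                     # S101, repeated via S106
        morphemes = analyze(text)                           # S102
        evaluation_document = extract_evaluation(morphemes)  # S103
        if not evaluation_document:
            continue  # documents without evaluation sentences leave the set
        features = extract_features(evaluation_document)    # S104
        if is_review:                                       # S105
            positives.append(features)
        else:
            negatives.append(features)
    return learn(positives, negatives)                      # S107; caller stores it (S108)
```

Passing the processing stages in as functions keeps the sketch independent of any particular analyzer or learner.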

When a blog document to be classified is then input to the review document classification apparatus 100, the apparatus executes the document classification processing routine shown in FIG. 5.

First, in step S111, the blog document input through the input unit 10 is received. In step S112, the morphological analysis unit 31 performs morphological analysis on the blog document input in step S111.

次のステップＳ１１３では、評価文抽出部３２によって、上記ステップＳ１１２で得られた形態素解析結果と、予め用意した評価表現の辞書とに基づいて、ブログ文書から評価文を抽出し、評価文書を作成する。 In the next step S113, the evaluation sentence extraction unit 32 extracts an evaluation sentence from the blog document based on the morphological analysis result obtained in step S112 and a dictionary of evaluation expressions prepared in advance, and creates an evaluation document. To do.

次のステップＳ１１４では、素性抽出部３３によって、上記ステップＳ１１３で作成されたブログ文書の評価文書について、素性を抽出する。ステップＳ１１５では、分類部３４によって、上記ステップＳ１１４で抽出された素性と、モデル記憶部２６に記憶された分類モデルとに基づいて、当該ブログ文書が、レビュー文書であるか分類する。 In the next step S114, the feature extraction unit 33 extracts features of the evaluation document of the blog document created in step S113. In step S115, the classification unit 34 classifies whether the blog document is a review document based on the features extracted in step S114 and the classification model stored in the model storage unit 26.

そして、ステップＳ１１６では、上記ステップＳ１１５の分類結果を出力部３０により出力して、文書分類処理ルーチンを終了する。 In step S116, the classification result in step S115 is output by the output unit 30, and the document classification processing routine is terminated.

As described above, the review document classification device according to the first embodiment extracts evaluation sentences from each of a plurality of learning blog documents and learns a classification model based on the features extracted from each evaluation document, where evaluation documents are created only from learning blog documents that contained evaluation sentences. This makes it possible to adjust the balance between positive and negative examples in the learning data and to learn a classification model that can accurately classify whether a document is a review document.

In addition, the classification model learned in this way, that is, from the features of the evaluation documents created from those learning blog documents (including review documents) that contained evaluation sentences, is used to determine whether an input blog document is a review document. The device can therefore accurately classify whether a document is a review document using learning data in which the balance between positive and negative examples has been adjusted.

[Second Embodiment]
<System configuration>
Next, a review document classification device according to the second embodiment will be described. Parts having the same configuration as in the first embodiment are given the same reference numerals, and their description is omitted.

The second embodiment differs from the first embodiment in that each blog document is divided into sentences, features are extracted for each sentence, and the classification model is learned from those sentence-level features.

As shown in FIG. 6, the calculation unit 220 of the review document classification device 200 according to the second embodiment includes the document database 21, a sentence division unit 222, the morphological analysis unit 22, the evaluation sentence extraction unit 23, a feature extraction unit 224, a learning unit 225, and the model storage unit 26.

The sentence division unit 222 divides each blog document of the document group stored in the document database 21 into sentences. A known technique may be used for sentence division; for example, the text may be split wherever punctuation marks or line breaks appear. In documents obtained from social media such as blogs, emoticons and emoji are also often used as sentence delimiters, so they may be used as sentence boundaries as well.
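One way such a sentence splitter could look is sketched below. The delimiter set is an assumption made for illustration: sentence-final punctuation, line breaks, a few hard-coded emoticons, and two common emoji code-point ranges. The specification does not prescribe any particular list, and `split_sentences` is a hypothetical name.

```python
import re

# Assumed delimiters (illustrative only): a handful of emoticons and
# two Unicode ranges that cover many, but not all, emoji.
_EMOTICONS = [r"\(\^_\^\)", r"\(T_T\)", r":-?\)"]
_EMOJI = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
_DELIMITERS = re.compile(
    "|".join([r"[。．.!?！？]+", r"\n+", _EMOJI + "+"] + _EMOTICONS)
)

def split_sentences(text):
    # Split at every delimiter and drop empty fragments.
    parts = _DELIMITERS.split(text)
    return [part.strip() for part in parts if part.strip()]
```

Because an ASCII period is treated as a delimiter, this sketch would also split decimals such as "3.5"; a production splitter would need additional rules for such cases.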

For each blog document, the feature extraction unit 224 uses the morphological analysis result to create, for each evaluation sentence of the created evaluation document, features that characterize the sentence and are used for machine learning. For example, the frequency distribution (histogram) of the morphemes in the sentence is used as the sentence features.

In social media text such as blogs, an evaluation may also span multiple sentences. The evaluation expressions, morpheme frequencies, and so on of the preceding and following sentences may therefore also be used as features.
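A minimal sketch of such sentence-level features follows, assuming the sentences have already been tokenized (the token lists stand in for morphological analysis output). The `cur:`, `prev:`, and `next:` prefixes and the `window` parameter are illustrative choices, not something the specification defines.

```python
from collections import Counter

def sentence_features(tokenized_sentences, index, window=1):
    # Histogram of the tokens of the target sentence.
    features = Counter()
    for token in tokenized_sentences[index]:
        features["cur:" + token] += 1
    # Token frequencies of neighbouring sentences, kept apart by prefix
    # so the learner can weight context differently from the sentence itself.
    for offset in range(1, window + 1):
        if index - offset >= 0:
            for token in tokenized_sentences[index - offset]:
                features["prev:" + token] += 1
        if index + offset < len(tokenized_sentences):
            for token in tokenized_sentences[index + offset]:
                features["next:" + token] += 1
    return features
```

Setting `window=0` reduces this to the plain per-sentence histogram.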

As shown in FIG. 7, the feature extraction unit 224 stores the features of each evaluation sentence extracted from a blog document that is a review document in memory (not shown) as positive-example learning data. It likewise stores the features of each evaluation sentence extracted from a blog document that is a non-review document in memory as negative-example learning data.

The learning unit 225 uses the positive-example learning data (features of evaluation sentences extracted from review documents) and the negative-example learning data (features of evaluation sentences extracted from non-review documents) obtained from the document group of learning documents to create, by machine learning, a classification model for classifying whether an input sentence is a sentence in a review document, and stores the model in the model storage unit 26.
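To make the learning step concrete, the sketch below trains a linear model from positive and negative feature dictionaries. Note the substitution: it uses a simple perceptron as a stand-in, whereas the specification mentions a support vector machine for classification; the perceptron is chosen here only because it fits in a few dependency-free lines. The function names and the `epochs` parameter are hypothetical.

```python
from collections import defaultdict

def train_perceptron(positives, negatives, epochs=20):
    # Each example is a sparse feature dict (name -> value);
    # labels are +1 for review sentences and -1 for non-review sentences.
    data = [(x, 1) for x in positives] + [(x, -1) for x in negatives]
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in data:
            score = bias + sum(weights[n] * v for n, v in features.items())
            if label * score <= 0:  # misclassified: move toward the label
                for n, v in features.items():
                    weights[n] += label * v
                bias += label
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return dict(weights), bias

def predict(features, weights, bias):
    # True means "sentence in a review document".
    return bias + sum(weights.get(n, 0.0) * v for n, v in features.items()) > 0
```

The returned weights and bias play the role of the classification model that would be stored in the model storage unit 26.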

The calculation unit 220 also includes a sentence division unit 231, the morphological analysis unit 31, the evaluation sentence extraction unit 32, a feature extraction unit 233, a classification unit 234, and a review document determination unit 235. The sentence division unit 231 is an example of input sentence dividing means, and the feature extraction unit 233 is an example of input feature extraction means.

Like the sentence division unit 222, the sentence division unit 231 divides the input blog document to be classified into sentences.

Like the morphological analysis unit 22, the morphological analysis unit 31 performs morphological analysis on each divided sentence of the blog document to be classified, segmenting the sentence into words, assigning a part of speech to each word, and outputting the result.

As shown in FIG. 8, for each evaluation sentence extracted from the blog document to be classified, the feature extraction unit 233 uses the morphological analysis result to create features that characterize the sentence, in the same way as the feature extraction unit 224.

For each evaluation sentence of the blog document to be classified, the classification unit 234 classifies whether the evaluation sentence is a sentence in a review document, for example by a support vector machine algorithm using a feature vector whose elements are the numerical values of the extracted features multiplied by the corresponding weights of the classification model. As a result, as shown in FIG. 9, each evaluation sentence is classified as either a review sentence or a non-review sentence.
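The weighted feature vector described here corresponds to the decision function of a linear classifier. The following sketch assumes a linear model with one weight per feature and a bias term; the names `classify_sentence`, `weights`, and `bias`, and the zero decision boundary, are assumptions for illustration rather than details given in the specification.

```python
def classify_sentence(features, weights, bias=0.0):
    # Multiply each feature value by the model weight for that feature;
    # features unknown to the model get weight 0.
    weighted = {name: value * weights.get(name, 0.0)
                for name, value in features.items()}
    # Sum the weighted elements and compare against the decision boundary.
    score = sum(weighted.values()) + bias
    return score > 0  # True: review sentence, False: non-review sentence
```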

For the blog document to be classified, the review document determination unit 235 determines that the document is a review document if the proportion of evaluation sentences classified as review sentences is equal to or greater than a threshold, and that it is a non-review document if the proportion is below the threshold.

The determination result of the review document determination unit 235 is output from the output unit 30.
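The threshold rule of the review document determination unit 235 can be sketched as below. The default threshold of 0.5 and the handling of a document with no evaluation sentences are assumptions; the specification gives neither a threshold value nor a rule for that empty case.

```python
def is_review_document(sentence_labels, threshold=0.5):
    # sentence_labels: one boolean per evaluation sentence,
    # True if the sentence was classified as a review sentence.
    if not sentence_labels:
        return False  # assumption: no evaluation sentences means non-review
    ratio = sum(1 for label in sentence_labels if label) / len(sentence_labels)
    return ratio >= threshold
```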

<Operation of the review document classification device>
Next, the operation of the review document classification device 200 according to the second embodiment will be described. Processing that is the same as in the first embodiment is given the same reference numerals, and its detailed description is omitted.

First, the review document classification device 200 executes the learning processing routine shown in FIG. 10.

In step S101, one blog document is retrieved from the document database 21. In step S201, the sentence division unit 222 divides the blog document retrieved in step S101 into sentences. In step S102, the morphological analysis unit 22 performs morphological analysis on each sentence of the blog document retrieved in step S101.

In the next step S103, the evaluation sentence extraction unit 23 extracts evaluation sentences from the blog document, based on the morphological analysis result obtained in step S102 and a dictionary of evaluation expressions prepared in advance. In step S202, the feature extraction unit 224 extracts features for each evaluation sentence extracted in step S103, based on the morphological analysis result obtained in step S102. In step S203, if the blog document is a review document, the features of the evaluation sentences extracted in step S202 are stored in memory as features of review sentences (positive-example learning data); if the blog document is a non-review document, they are stored in memory as features of non-review sentences (negative-example learning data).

In step S106, it is determined whether the processing of steps S101, S201, S102, S103, S202, and S203 has been executed for all blog documents stored in the document database 21. If any blog document has not yet been processed, the routine returns to step S101 and retrieves that blog document. If the processing has been executed for all blog documents, the routine proceeds to step S107.

In step S107, the learning unit 225 learns a classification model by machine learning, using the positive-example and negative-example learning data stored in memory. In step S108, the classification model is stored in the model storage unit 26, and the learning processing routine ends.

When a blog document to be classified is input to the review document classification device 200, the device executes the document classification processing routine shown in FIG. 11.

First, in step S111, the blog document input via the input unit 10 is received. In step S211, the sentence division unit 231 divides the blog document input in step S111 into sentences. In step S112, the morphological analysis unit 31 performs morphological analysis on each sentence of the blog document input in step S111.

In the next step S113, the evaluation sentence extraction unit 32 extracts evaluation sentences from the blog document, based on the morphological analysis result obtained in step S112 and the dictionary of evaluation expressions prepared in advance.

In the next step S212, the feature extraction unit 233 extracts features for each evaluation sentence of the blog document extracted in step S113. In step S213, for each evaluation sentence of the blog document, the classification unit 234 classifies whether the evaluation sentence is a review sentence or a non-review sentence, based on the features extracted in step S212 and the classification model stored in the model storage unit 26.

In step S214, the review document determination unit 235 determines whether the blog document is a review document, based on the proportion of evaluation sentences classified as review sentences in step S213. In step S215, the determination result of step S214 is output by the output unit 30, and the document classification processing routine ends.

As described above, the review document classification device according to the second embodiment divides each learning blog document (including review documents) into sentences, extracts evaluation sentences from the learning blog documents, and learns a classification model based on the features extracted for each evaluation sentence. This makes it possible to adjust the balance between positive and negative examples in the learning data and to learn a classification model that can accurately classify whether a document is a review document.

In addition, the classification model learned from the features of each evaluation sentence extracted from the learning blog documents, which include review documents, is used to determine whether an input blog document is a review document. The device can therefore accurately classify whether a document is a review document using learning data in which the balance between positive and negative examples has been adjusted.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the non-evaluation sentences need not be removed from learning blog documents that contain evaluation sentences. That is, instead of creating evaluation documents by removing the non-evaluation sentences from the learning blog documents, it is sufficient simply to delete the learning blog documents that contain no evaluation sentence from the set of learning blog documents. When documents are not divided into sentences, as in the first embodiment, document-level features are extracted from the blog documents that contain evaluation sentences and the classification model is learned from them; document-level features are then extracted from the blog document to be classified, and the classification model is used to classify whether that blog document is a review document. When documents are divided into sentences, as in the second embodiment, features are extracted for each sentence (evaluation and non-evaluation sentences alike) of the blog documents that contain evaluation sentences and the classification model is learned from them; features are then extracted for each sentence of the blog document to be classified, the classification model is used to classify whether each sentence is a review sentence, and whether the blog document is a review document is determined based on the proportion of sentences classified as review sentences.
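This variation amounts to filtering the learning set at the document level while leaving each retained document intact. A small sketch follows, again with a hypothetical evaluation-expression dictionary and a regular-expression tokenizer standing in for morphological analysis; the function names are invented for the example.

```python
import re

# Hypothetical evaluation-expression dictionary (assumption, not from the patent).
EVAL_EXPRESSIONS = {"good", "bad", "great", "poor"}

def has_evaluation_sentence(sentences):
    # True if at least one sentence contains an evaluation expression.
    return any(EVAL_EXPRESSIONS & set(re.findall(r"\w+", s.lower()))
               for s in sentences)

def drop_documents_without_evaluation(labelled_docs):
    # Keep whole documents, including their non-evaluation sentences;
    # only documents with no evaluation sentence at all are removed.
    return [(sentences, is_review)
            for sentences, is_review in labelled_docs
            if has_evaluation_sentence(sentences)]
```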

In the second embodiment, each evaluation sentence of the learning blog documents may also be manually labeled as either a review sentence, in which objective or subjective information about a specific object is described, or a non-review sentence. In that case, the learning unit learns the classification model using the features extracted for each review sentence among the evaluation sentences of the learning blog documents as positive-example learning data, and the features extracted for each non-review sentence among the sentences of the learning blog documents as negative-example learning data. If non-evaluation sentences are not removed from the learning blog documents that contain evaluation sentences, each sentence of those learning blog documents may be manually labeled as a review sentence or a non-review sentence.

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

Description of Symbols
10 Input unit
20, 220 Calculation unit
21 Document database
22, 31 Morphological analysis unit
23, 32 Evaluation sentence extraction unit
24, 33, 224, 233 Feature extraction unit
25, 225 Learning unit
26 Model storage unit
34, 234 Classification unit
100, 200 Review document classification device
222, 231 Sentence division unit
235 Review document determination unit

Claims

A classification model learning method comprising:
a step of extracting, by evaluation sentence extraction means, evaluation sentences including a predetermined evaluation expression from each of a plurality of learning documents that include review documents in which information on a specific object is described;
a step of extracting, by feature extraction means, for each of the plurality of learning documents from which an evaluation sentence was extracted by the evaluation sentence extraction means, features indicating the characteristics of a document composed of the evaluation sentences extracted from that learning document, or features indicating the characteristics of that learning document; and
a step of learning, by learning means, a classification model for classifying whether an input document is a review document, using each of the features extracted for the learning documents that are review documents as positive-example features and each of the features extracted for the learning documents that are not review documents as negative-example features.

A classification model learning method comprising:
a step of dividing, by sentence dividing means, each of a plurality of learning documents that include review documents in which information on a specific object is described into sentences;
a step of extracting, by evaluation sentence extraction means, evaluation sentences including a predetermined evaluation expression from each of the plurality of learning documents;
a step of extracting, by feature extraction means, for each of the plurality of learning documents from which an evaluation sentence was extracted by the evaluation sentence extraction means, features indicating the characteristics of each evaluation sentence extracted from that learning document, or features indicating the characteristics of each sentence of that learning document; and
a step of learning, by learning means, a classification model for classifying whether an input sentence is a sentence in a review document, based on each of the features extracted for each evaluation sentence or each sentence of each of the plurality of learning documents.

The classification model learning method according to claim 2, wherein the step of learning by the learning means learns the classification model using each of the features extracted for each evaluation sentence or each sentence of the learning documents that are review documents as positive-example features, and each of the features extracted for each evaluation sentence or each sentence of the learning documents that are not review documents as negative-example features.

The classification model learning method according to claim 2, wherein the step of learning by the learning means learns the classification model using, as positive-example features, each of the features extracted for each review sentence (a sentence in which information on the specific object is described) among the evaluation sentences or the sentences of the learning documents, and using, as negative-example features, each of the features extracted for each evaluation sentence or each sentence of the learning documents that is not a review sentence.

A review document classification method comprising:
a step of extracting, by input feature extraction means, features indicating the characteristics of a document composed of the evaluation sentences in an input document, or features indicating the characteristics of the input document; and
a step of classifying whether the input document is a review document, based on the classification model learned by the classification model learning method according to claim 1 and the features extracted by the input feature extraction means.

A review document classification method comprising:
a step of dividing, by input sentence dividing means, an input document into sentences;
a step of extracting, by input feature extraction means, features of each evaluation sentence extracted from the input document, or features of each sentence of the input document;
a step of classifying, by classification means, whether each evaluation sentence or each sentence of the input document is a sentence in a review document, based on the classification model learned by the classification model learning method according to any one of claims 2 to 4 and the features of each evaluation sentence or each sentence extracted by the input feature extraction means; and
a step of determining, by determination means, whether the input document is a review document, based on the classification result of each evaluation sentence or each sentence of the document classified by the classification means.

A classification model learning device comprising:
evaluation sentence extraction means for extracting evaluation sentences including a predetermined evaluation expression from each of a plurality of learning documents that include review documents in which information on a specific object is described;
feature extraction means for extracting, for each of the plurality of learning documents from which an evaluation sentence was extracted by the evaluation sentence extraction means, features indicating the characteristics of a document composed of the evaluation sentences extracted from that learning document, or features indicating the characteristics of that learning document; and
learning means for learning a classification model for classifying whether an input document is a review document, using each of the features extracted for the learning documents that are review documents as positive-example features and each of the features extracted for the learning documents that are not review documents as negative-example features.

A program for causing a computer to execute each step of the classification model learning method according to any one of claims 1 to 4, or of the review document classification method according to claim 5 or 6.