JP2017107391A

JP2017107391A - Text mining method and text mining program

Info

Publication number: JP2017107391A
Application number: JP2015240490A
Authority: JP
Inventors: 古橋　武 (Takeshi Furuhashi); 大弘吉川 (Ohiro Yoshikawa); 賢治奥山 (Kenji Okuyama)
Original assignee: Nagoya University NUC; Toho Gas Co Ltd
Current assignee: Nagoya University NUC; Toho Gas Co Ltd
Priority date: 2015-12-09
Filing date: 2015-12-09
Publication date: 2017-06-15

Abstract

PROBLEM TO BE SOLVED: To provide a text mining method and a text mining program capable of efficiently and more accurately extracting only the useful information documents that require analysis from a large number of information documents published in text data format in various document forms.

SOLUTION: The text mining method includes a first analysis step and a second analysis step. The first analysis step sorts one extracted information document into a first positive example or a first negative example, makes an information document of the first positive example a target of the second analysis step, and excludes an information document of the first negative example from the second analysis step. The second analysis step sorts one scheduled-analysis information document, extracted from the deemed-positive-example information documents classified as the first positive example, into a second positive example, which satisfies the condition that its meaning includes an evaluation of the specific product under analysis, or a second negative example, which does not satisfy this condition. It then identifies a scheduled-analysis information document of the second positive example as an analysis target and excludes a scheduled-analysis information document of the second negative example from the deemed-positive-example information documents.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique that uses text mining to extract, from a large number of information documents published in text data format in various document forms (for example, evaluation documents, advertisement documents, and documents describing everyday life), only the useful documents that qualify as analysis targets, as information that can be valuable to companies, consumers, and the like.

In recent years, with the spread of the Internet, individuals have become able to easily send and collect information as text documents through social networking services (SNS) such as blogs and Web bulletin boards. Such information documents generally include evaluation documents about a specific target product or the like, which can be valuable information for companies and consumers. They may also include many documents that companies and consumers have no particular need to analyze, such as advertisement articles for the product and documents describing everyday life. Some companies use a text mining device to extract useful information from the many kinds of information documents published in large quantities and analyze those documents, in order to improve the manufacture and sale of their products. One example of such a text mining device is disclosed in Patent Document 1.

Patent Document 1 can display a network in which the dependency relationships between words extracted from the text under analysis are classified into a plurality of categories, and it allows the display information about the categories, nodes, and arcs of the displayed network to be adjusted freely. This is said to enable free, interactive analysis of a text document under analysis, from an overall view of wide-ranging dependency relationships spanning three or more words down to a detailed, narrowed-down display.

Although no prior art document is cited for it, there is also a well-known text mining method besides Patent Document 1. In one example of such a conventional text mining method, one information document is extracted from a large number of information documents published in text data format. A support vector machine determines whether a sentence contained in this information document has a meaning that includes an evaluation of the specific product under analysis, based on a feature vector consisting of keywords related to the specific product and the words that co-occur with those keywords, and the information document is sorted into a positive example or a negative example.

Japanese Patent No. 4876692

However, Patent Document 1 cannot exclude documents that need no analysis, such as advertisement articles, from the many kinds of information documents published in large quantities as described above, and it cannot select only the evaluation documents that are useful to companies and consumers. Therefore, when a company or the like uses the aforementioned SNS or similar media to conduct market research on a specific product for purposes such as new development, improvement, or better sales, a person must read through the large number of information documents published in various document forms one by one and extract only the useful evaluation documents connected to that specific product. This work takes a great deal of time and effort and is inefficient. In addition, the conventional text mining method cannot determine with sufficient accuracy whether an extracted information document is a positive example or a negative example, and companies and the like have wanted text mining with higher determination accuracy to be developed.

The present invention has been made to solve the above problems, and its object is to provide a text mining method and a text mining program that can efficiently and more accurately extract only the useful information documents that require analysis from a large number of information documents published in text data format in various document forms.

A text mining method according to one aspect of the present invention, made to solve the above problems, is characterized by comprising a first analysis step and a second analysis step.

In the first analysis step, one information document is extracted from a plurality of information documents created and published in text data format. For a first selected sentence, which is one sentence selected from the sentences contained in the one information document, the one information document is defined as a first positive example when the sentence has a meaning expressed subjectively, and as a first negative example when the sentence has a meaning expressed objectively about an unspecified product. A classifier discriminates the extracted one information document based on the features of the sentence ending in the final phrase of the first selected sentence and sorts it into the first positive example or the first negative example. The one information document corresponding to the first positive example becomes a target of the analysis performed in the subsequent second analysis step, and the one information document corresponding to the first negative example is excluded from the analysis in the second analysis step.

In the second analysis step, the deemed-positive-example information documents, which are all the information documents classified as the first positive example, are taken as a population, and one scheduled-analysis information document is extracted from them. For a second selected sentence, which is a sentence selected from the one scheduled-analysis information document, the document is defined as a second positive example when it satisfies the condition that the sentence's meaning includes an evaluation of the specific product under analysis, and as a second negative example when it does not satisfy this condition. A classifier discriminates the extracted one scheduled-analysis information document based on a feature vector consisting of keywords related to the specific product and the words that co-occur with those keywords, and sorts it into the second positive example or the second negative example. The one scheduled-analysis information document corresponding to the second positive example is recognized as an analysis information document to be analyzed, and the one scheduled-analysis information document corresponding to the second negative example is excluded from the deemed-positive-example information documents.
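As a rough illustration of the feature vector used in the second analysis step, the following Python sketch counts product keywords and the vocabulary words that appear near them in a tokenized sentence. The function name, the window size, the English tokens, and the vocabulary are all assumptions made for illustration; the source states only that the vector consists of keywords related to the specific product and the words co-occurring with them.

```python
from typing import List, Sequence


def cooccurrence_features(
    tokens: Sequence[str],
    keywords: Sequence[str],
    vocabulary: Sequence[str],
    window: int = 2,
) -> List[int]:
    """Build a count vector: one slot per keyword, then one slot per
    vocabulary word counting how often it occurs within `window` tokens
    of any keyword. The window-based notion of co-occurrence is an
    assumption, not something the source defines."""
    keyword_counts = [list(tokens).count(k) for k in keywords]
    cooc = {word: 0 for word in vocabulary}
    for i, token in enumerate(tokens):
        if token not in keywords:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in cooc:
                cooc[tokens[j]] += 1
    return keyword_counts + [cooc[word] for word in vocabulary]


# Hypothetical, already tokenized sentence and word lists.
tokens = ["the", "stove", "heats", "quickly", "and",
          "the", "grill", "is", "easy", "to", "clean"]
keywords = ["stove", "grill"]
vocabulary = ["heats", "easy", "cheap"]

feature_vector = cooccurrence_features(tokens, keywords, vocabulary, window=2)
print(feature_vector)
```

A vector of this kind could then be passed to the classifier; how the actual method weights or selects co-occurring words is not specified in this part of the source.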

According to this aspect, information documents whose meaning is an objective expression about an unspecified product, such as the advertisement documents that make up a large share of the negative examples, are removed in advance from the plurality of information documents published in various document forms at the time of text analysis. As a result, only the information documents whose meaning includes an evaluation of the specific product can be extracted efficiently and more accurately as the useful information documents that require analysis.

In the above aspect, the first negative example is preferably an advertisement whose content promotes the unspecified product.

According to this aspect, many advertisement documents that need no analysis are excluded in the first analysis step, so the total number of deemed-positive-example information documents analyzed in the second analysis step can be made far smaller than the total number of information documents initially collected. Consequently, the text mining method according to the present invention extracts the information documents to be analyzed with substantially higher accuracy than the conventional text mining method, for example with an increase of 10% in the average accuracy rate.

In the above aspect, the features used in the first analysis step are preferably based on morphemes.

According to this aspect, the information documents can be classified accurately into positive and negative examples even when the documents to be analyzed take various forms, for example evaluation documents about a specific product (a gas stove, as one example), advertisement documents whose content promotes unspecified products (a gas stove or other products, as one example), documents describing everyday life, and documents containing evaluations of things other than the specific product.

In the above aspect, the specific product is preferably a gas appliance.

According to this aspect, when a party whose business handles gas appliances conducts market research on gas appliances, typified by gas stoves, for purposes such as new development, improvement, or better sales, using information media available through SNS such as blogs and Web bulletin boards, it can obtain more reliable market research results based on the classified analysis information documents.

A text mining program according to another aspect of the present invention, made to solve the above problems, is characterized by comprising first to fourth steps.

In the first step, one information document is extracted from a plurality of information documents created and published in text data format. For a first selected sentence, which is one sentence selected from the sentences contained in the one information document, the one information document is defined as a first positive example when the sentence has a meaning expressed subjectively, and as a first negative example when the sentence has a meaning expressed objectively about an unspecified product. A classifier discriminates the extracted one information document based on the features of the sentence ending in the final phrase of the first selected sentence and sorts it into the first positive example or the first negative example.

In the second step, the one information document corresponding to the first positive example is set as a target of the analysis performed in the subsequent second analysis step, and the one information document corresponding to the first negative example is excluded from the analysis in the second analysis step.

In the third step, the deemed-positive-example information documents, which are all the information documents classified as the first positive example, are taken as a population, and one scheduled-analysis information document is extracted from them. For a second selected sentence, which is a sentence selected from the one scheduled-analysis information document, the document is defined as a second positive example when it satisfies the condition that the sentence's meaning includes an evaluation of the specific product under analysis, and as a second negative example when it does not satisfy this condition. A classifier discriminates the extracted one scheduled-analysis information document based on a feature vector consisting of keywords related to the specific product and the words that co-occur with those keywords, and sorts it into the second positive example or the second negative example.

In the fourth step, the one scheduled-analysis information document corresponding to the second positive example is recognized as an analysis information document to be analyzed, and the one scheduled-analysis information document corresponding to the second negative example is excluded from the deemed-positive-example information documents.

According to the text mining method and the text mining program of the present invention, only the useful information documents that require analysis can be extracted efficiently and more accurately from a large number of information documents published in text data format in various document forms.

FIG. 1 is a schematic diagram showing the concept of the text mining method according to this embodiment.
FIG. 2 is a diagram showing blog documents given as examples of an evaluation sentence and an advertisement sentence contained in the published document group.
FIG. 3 is a first explanatory diagram schematically showing the idea used to classify evaluation sentences and advertisement sentences in the first analysis step shown in FIG. 1.
FIG. 4 is a second explanatory diagram explaining, using the evaluation sentence illustrated in FIG. 2, the technique of classifying with sentence-ending expressions as features.
FIG. 5 is a third explanatory diagram concretely illustrating how sentence endings are analyzed by their features in the first analysis step shown in FIG. 1.
FIG. 6 is a fourth explanatory diagram, continuing from FIG. 5, concretely illustrating how sentence endings are analyzed by their features.
FIG. 7 is a fifth explanatory diagram, continuing from FIG. 6, concretely illustrating how sentence endings are analyzed by their features.
FIG. 8 is a diagram showing example blog documents for an evaluation sentence that is an analysis target and for evaluation sentences and the like that are not analysis targets, contained in the published document group.
FIG. 9 is a diagram in which the keywords of interest are highlighted in each blog document illustrated in FIG. 8.
FIG. 10 is a sixth explanatory diagram schematically showing the idea used, in the second analysis step shown in FIG. 1, to classify evaluation sentences that are analysis targets and evaluation sentences and the like that are not.
FIG. 11 is a seventh explanatory diagram concretely illustrating how, in the second analysis step shown in FIG. 1, sentence endings are analyzed for a specific evaluation item in relation to co-occurring words.
FIG. 12 is an eighth explanatory diagram, continuing from FIG. 11, concretely illustrating how sentence endings are analyzed for a specific evaluation item in relation to co-occurring words.
FIG. 13 is a ninth explanatory diagram, continuing from FIG. 12, concretely illustrating how sentence endings are analyzed for a specific evaluation item in relation to co-occurring words.
FIG. 14 is a diagram explaining how to calculate, for each information document in the published document group, the occurrence rate of the specific evaluation item in positive examples when the words co-occurring with the specific evaluation item are nouns, in the second analysis step shown in FIG. 1.
FIG. 15 is a diagram explaining the definitions of the judgment criteria for the analysis results of information documents classified by the text mining method.
FIG. 16 is a table showing the results of the experiment of the example, obtained with the text mining method according to this embodiment.
FIG. 17 is a schematic diagram showing the information documents contained in a document group retrieved by keyword search, divided into their main categories.
FIG. 18 is an explanatory diagram of the function of the classifier used in the text mining method.
FIG. 19 is a schematic diagram showing the concept of the conventional text mining method.
FIG. 20 is a table showing the results of the experiment of the comparative example, obtained with the conventional text mining method.

A text mining method and a text mining program as an example of the present invention will now be described. The text mining method according to the present invention is used to extract, from a large number of information documents (hereinafter referred to as the "document group") published in text data format in various document forms, such as evaluation documents, advertisement documents, and documents describing everyday life, only the useful documents that qualify as analysis targets, as information that can be valuable to companies, consumers, and the like. As described later, a useful document is a document whose meaning includes an evaluation of the specific product under analysis. In this embodiment, the specific product is a gas stove, which is a kind of gas appliance, and a useful document is therefore an evaluation document about this gas stove. One of the applicants (hereinafter referred to as the "information analyst") handles gas stoves in its own business. Like the information analyst, the party "Kō" (甲) appearing in FIG. 2 and the subsequent drawings also handles a gas stove (product name "A") in its own business.

<Document group>
First, the document group will be briefly described. FIG. 17 is a schematic diagram showing the information documents contained in a document group retrieved by keyword search, divided into their main categories. In recent years, mainly individuals have been freely posting information documents that they have written themselves in text data format to information media available through social networking services (SNS), such as blogs and Web bulletin boards. Posted information documents are stored by the information medium and can be viewed by anyone who accesses it.

When a person enters a specific keyword into the information medium, it generally returns at least one information document related to that keyword from among the information documents already posted, and the collection of all the returned information documents is the document group. For example, the document group shown in FIG. 17 was obtained by searching the information medium with the keyword "ガスコンロ＊Ａ" ("gas stove * A"). As FIG. 17 shows, documents relating to advertisements can account for the majority of the enormous number of documents in a collected document group.

<Text mining method according to this embodiment>
Next, the text mining method according to this embodiment will be described. FIG. 1 is a schematic diagram showing the concept of the text mining method according to this embodiment. As shown in FIG. 1, the method first uses a first analysis step S1 to divide the collected document group 1 (a plurality of information documents) broadly into subjective documents 2 (the deemed-positive-example information documents) and objective documents 3. The subsequent second analysis step S2 further classifies the subjective documents 2 separated in the first analysis step S1 and thereby extracts the analysis information documents 4, which are the useful documents. In both the first and second analysis steps S1 and S2, the document group 1 is classified by a classifier, for which a well-known support vector machine (SVM) 50 is used.
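The two-stage flow described above can be sketched in Python as follows. This is only a minimal sketch: the function `two_stage_filter`, the two toy rule-based classifiers, and the English sample documents are hypothetical stand-ins introduced for illustration, whereas the method described here uses a trained SVM 50 in both steps.

```python
from typing import Callable, Iterable, List, Tuple

# A classifier maps a document to True (positive example) or False (negative example).
Classifier = Callable[[str], bool]


def two_stage_filter(
    documents: Iterable[str],
    is_subjective: Classifier,
    evaluates_product: Classifier,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (subjective documents, objective documents, analysis documents)."""
    subjective_docs: List[str] = []
    objective_docs: List[str] = []
    analysis_docs: List[str] = []
    for doc in documents:
        # Step S1: first negative examples never reach step S2.
        if not is_subjective(doc):
            objective_docs.append(doc)
            continue
        subjective_docs.append(doc)
        # Step S2: keep only documents that evaluate the specific product.
        if evaluates_product(doc):
            analysis_docs.append(doc)
    return subjective_docs, objective_docs, analysis_docs


# Toy stand-ins for the two trained classifiers (assumptions for illustration).
def toy_is_subjective(doc: str) -> bool:
    return doc.startswith("I ")


def toy_evaluates_product(doc: str) -> bool:
    return "stove" in doc.lower()


documents = [
    "I think the stove heats quickly.",
    "Stove A now on sale at 20% off.",
    "I went for a walk today.",
]
subjective_docs, objective_docs, analysis_docs = two_stage_filter(
    documents, toy_is_subjective, toy_evaluates_product
)
print(analysis_docs)
```

Because the second classifier only ever sees the output of the first, the advertisement in this toy input is removed before any product-specific judgment is attempted.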

Here, the SVM 50 will be briefly described. The SVM 50 is a pattern classifier that uses linearly separable input elements to divide the objects to be identified into two classes, "positive example" and "negative example". In the SVM 50, the function needed for the computation is constructed by determining the direction vector (w) and the intercept (b) of the separating plane, as shown in Equation (1) below, and the margin is maximized for this function.

$$f(x) = w \cdot x - b \qquad (1)$$

w: direction vector of the separating plane
x: feature vector of a case
b: intercept

Maximizing the margin means maximizing the distance between the separating plane and the support vectors closest to it. For the function of the SVM 50, this amounts to solving the minimization problem of Equation (2) subject to the constraint of Equation (3).

$$\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2} \qquad (2)$$

$$y^{(i)}\left(w \cdot x^{(i)} - b\right) - 1 \ge 0 \quad (\forall i) \qquad (3)$$

Here $x^{(i)}$ is the feature vector of the i-th training case and $y^{(i)}$ is its class label, which in the standard formulation is +1 for a positive example and -1 for a negative example.

In Equation (1), a case is classified into the "positive example" class when f(x) ≥ 0 and into the "negative example" class when f(x) < 0. In general, a "positive example" is an information document that the information analyst wants to focus on, and a "negative example" is any other information document. The precise definitions of "positive example" and "negative example" used in this embodiment are given later.
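The decision rule of Equation (1) and the margin constraint of Equation (3) can be written out as a short Python sketch. The function names and the numeric values of w, b, and the sample points are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence, Tuple


def decision_value(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Equation (1): f(x) = w . x - b."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def classify(w: Sequence[float], x: Sequence[float], b: float) -> str:
    """f(x) >= 0 gives a positive example, f(x) < 0 a negative example."""
    return "positive" if decision_value(w, x, b) >= 0 else "negative"


def satisfies_margin(
    w: Sequence[float],
    b: float,
    samples: List[Tuple[Sequence[float], int]],
) -> bool:
    """Equation (3): y * (w . x - b) - 1 >= 0 for every labelled sample."""
    return all(y * decision_value(w, x, b) - 1 >= 0 for x, y in samples)


# Hypothetical separating plane and cases.
w = [0.5, 0.5]
b = 0.0

print(classify(w, [2.0, 2.0], b))    # f(x) = 2.0, so "positive"
print(classify(w, [-1.0, -3.0], b))  # f(x) = -2.0, so "negative"
print(classify(w, [1.0, -1.0], b))   # f(x) = 0.0, on the plane, so "positive"

well_separated = [([2.0, 2.0], 1), ([-2.0, -2.0], -1)]
too_close = [([1.0, 0.0], 1)]
print(satisfies_margin(w, b, well_separated))
print(satisfies_margin(w, b, too_close))
```

The third call shows that a case lying exactly on the separating plane falls into the positive class under the rule as stated, since the condition is f(x) ≥ 0.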

FIG. 18 is an explanatory diagram of the function of the SVM. Before the SVM 50 is used to classify cases, it is first trained, as shown in FIG. 18(a), on cases that are "positive examples" and cases that are "negative examples", so that a decision criterion separating "positive examples" from "negative examples" is determined and set in the SVM 50. This decision criterion corresponds to the direction vector of the separating plane described above. The information analyst then feeds the information documents to be classified to the SVM 50 as unknown data. As shown in FIG. 18(b), the SVM 50 determines, based on the learned decision criterion, whether each such information document is a "positive example" or a "negative example" and classifies it accordingly.
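The train-then-classify sequence of FIG. 18 can be illustrated with a small pure-Python linear SVM. The source does not say how the SVM 50 is trained, so this sketch swaps in one common generic technique, batch subgradient descent on a regularized hinge loss (a soft-margin variant of Equations (2) and (3)). The hyperparameters, function names, and the two-dimensional toy data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label +1 or -1)


def train_linear_svm(
    samples: List[Sample],
    lam: float = 0.01,   # regularization strength (assumed value)
    lr: float = 0.1,     # learning rate (assumed value)
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Learn w and b for f(x) = w . x - b by subgradient descent on
    (lam / 2) * |w|^2 + mean(max(0, 1 - y * f(x)))."""
    dim = len(samples[0][0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
            if margin < 1:  # the hinge loss is active for this case
                for k in range(dim):
                    grad_w[k] -= y * x[k] / n
                grad_b += y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 (positive example) if f(x) >= 0, otherwise -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - b >= 0 else -1


# FIG. 18(a): learn from labelled positive and negative cases (toy data).
training_data: List[Sample] = [
    ((2.0, 2.0), 1), ((3.0, 3.0), 1), ((2.0, 3.0), 1),
    ((-2.0, -2.0), -1), ((-3.0, -3.0), -1), ((-2.0, -3.0), -1),
]
w, b = train_linear_svm(training_data)

# FIG. 18(b): classify unknown data with the learned criterion.
print(predict(w, b, (4.0, 3.0)))
print(predict(w, b, (-1.0, -4.0)))
```

In practice a library implementation would normally be used instead of a hand-written loop; the point here is only that the learned w and b play the role of the decision criterion applied to unknown documents.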

In the first analysis step S1 of the text mining method according to this embodiment, one information document (for example, a first information document 10A, a second information document 10B, a third information document 10C, or a fourth information document 10D) is extracted from the document group 1 created and collected in text data format. It is then judged whether a first selected sentence, which is one sentence selected from the sentences contained in this information document (described in detail later; for example, a first selected sentence 11A, which is one sentence in the first information document 10A), has a meaning expressed subjectively or a meaning expressed objectively about an unspecified product.

For this judgment, when the one information document extracted from the document group 1 belongs to the subjective documents 2, whose meaning is expressed subjectively, it is defined as a first positive example 20A (the first positive example). This is the "positive example" described above: in this embodiment, it includes documents containing evaluations of gas stoves and is the target of analysis. Conversely, when the one information document extracted from the document group 1 belongs to the objective documents 3, whose meaning is expressed objectively, it is defined as a first negative example 20B (the first negative example), which does not qualify as the first positive example 20A. That is, as exemplified by the second information document 10B, a first negative example 20B is an advertisement document whose content promotes gas stoves or other unspecified products, and it is mainly an advertisement issued by someone other than the information analyst.

The information document extracted from the document group 1 is then discriminated by the SVM 50 on the basis of the sentence-end features, that is, the morphemes, in the final phrase of the first selected sentence, and is sorted into the first positive example 20A or the first negative example 20B. This sorting process corresponds to the first step of the text mining program according to this embodiment.

An information document corresponding to the first positive example 20A becomes a target of the analysis performed in the subsequent second analysis step S2. Conversely, an information document corresponding to the first negative example 20B is excluded from the analysis in the second analysis step S2. This process of sorting documents into targets and non-targets of the analysis in the second analysis step S2 corresponds to the second step of the text mining program according to this embodiment.

In the second analysis step S2, performed after the first analysis step S1, one analysis-scheduled information document 6 is extracted from a population consisting of the subjective documents 2, that is, all information documents classified as the first positive example 20A. It is then determined whether a second selected sentence, selected from the sentences in this analysis-scheduled information document 6 (described in detail later; for example, the second selected sentence 11B, which contains the first selected sentence 11A in the first information document 10A), includes an evaluation of the gas stove (the specific product) to be analyzed.

In this determination, when the analysis-scheduled information document 6 extracted from the subjective documents 2 belongs to the analysis information documents 4, which satisfy the condition that their meaning includes an evaluation of a gas stove, the document is treated as a second positive example 30A. An information document such as the first information document 10A corresponds to the second positive example 30A (see FIG. 2). Conversely, when the analysis-scheduled information document 6 extracted from the subjective documents 2 belongs to the non-target documents 5, which do not satisfy this condition, it is treated as a second negative example 30B. As exemplified by the third information document 10C and the fourth information document 10D, the second negative example 30B covers every document other than the second positive example 30A, such as documents describing everyday life, documents describing evaluations of things other than gas stoves, and other documents (see FIG. 9).

The analysis-scheduled information document 6 extracted from the subjective documents 2 is then discriminated by the SVM 50 on the basis of a feature vector consisting of keywords related to gas stoves and words that co-occur with those keywords, and is sorted into the second positive example 30A or the second negative example 30B. This sorting process corresponds to the third step of the text mining program according to this embodiment.

An information document belonging to the second positive example 30A is recognized as an analysis information document 4, that is, a target of analysis. Conversely, an information document corresponding to the second negative example 30B is removed from the subjective documents 2 and excluded from analysis. This process of sorting the subjective documents 2 into analysis information documents 4 and all other documents corresponds to the fourth step of the text mining program according to this embodiment.
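The two-stage flow described above can be sketched as follows. This is only an illustrative sketch: the function name `two_stage_filter` and the two toy predicates are hypothetical stand-ins for the two trained SVM 50 classifiers, which the text does not specify in code form.

```python
from typing import Callable, List

Document = str
Classifier = Callable[[Document], bool]  # True means "positive example"


def two_stage_filter(
    documents: List[Document],
    is_subjective: Classifier,      # stand-in for the S1 classifier
    evaluates_target: Classifier,   # stand-in for the S2 classifier
) -> List[Document]:
    """Return the documents that survive both analysis steps."""
    # Step S1: keep first positive examples (subjective documents) and
    # drop first negative examples (objective, advertisement-like documents).
    subjective_docs = [d for d in documents if is_subjective(d)]
    # Step S2: among the survivors, keep second positive examples
    # (documents that evaluate the target product).
    return [d for d in subjective_docs if evaluates_target(d)]


# Toy predicates (assumptions for illustration only, not the SVM decision rules).
def toy_is_subjective(doc: Document) -> bool:
    return doc.endswith("ね")


def toy_evaluates_target(doc: Document) -> bool:
    return "Ａ" in doc


docs = [
    "甲のＡは便利ですね",              # evaluation of the target product
    "ショップ乙で新製品を販売中です",  # advertisement-like, removed in S1
    "新居の間取りが良いですね",        # subjective but off-topic, removed in S2
]
selected = two_stage_filter(docs, toy_is_subjective, toy_evaluates_target)
print(selected)
```

Passing the classifiers in as callables keeps the sketch independent of any particular SVM library; a trained model's prediction function could be substituted for either predicate.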

Next, the text mining method according to this embodiment is described in detail using several concrete examples of information documents. FIG. 2 shows blog documents given as examples of an evaluation text and an advertisement text included in the published document group. FIG. 3 is a first explanatory diagram schematically showing the idea used in the first analysis step to classify evaluation texts and advertisement texts.

As shown in FIG. 2, the first information document 10A is an evaluation document about a gas stove with the product name "A" handled by company 甲, and the second information document 10B is an advertising document about unspecified products other than gas stoves from shops 乙 and 丙 (甲, 乙, and 丙 are placeholder names). As shown in FIG. 3, the first information document 10A, an evaluation document, tends to be written subjectively, whereas the second information document 10B, an advertising document, tends to be written objectively. It can therefore be assumed that an evaluation document is an information document written from a subjective standpoint, and that an advertising document is an information document written from an objective standpoint.

As with the document group shown in FIG. 17 described earlier, the majority of the vast number of published documents are advertisement-related documents that are unnecessary for analysis. In view of this, the first analysis step S1 first removes the numerous advertising documents (for example, the second information document 10B) from the document group 1 (see FIG. 1). As a result, the total number of subjective documents 2, which form the population for the second analysis step S2, can be expected to be far smaller than the total number of documents in the document group 1, so performing the first analysis step S1 contributes to improving the accuracy with which the information documents to be analyzed are extracted.

A concrete method for sorting an information document extracted from the document group 1 into the first positive example 20A or the first negative example 20B is described next. FIG. 4 is a second explanatory diagram that uses the evaluation text (first information document) illustrated in FIG. 2 to explain the method of classifying with sentence-end expressions as features. FIG. 5 is a third explanatory diagram showing concretely how the sentence end is analyzed by features in the first analysis step shown in FIG. 1. FIG. 6 is a fourth explanatory diagram that continues from FIG. 5, and FIG. 7 is a fifth explanatory diagram that continues from FIG. 6, each showing concretely how the sentence end is analyzed by features.

As shown in FIGS. 2 and 4, when the first selected sentence 11A "甲のＡは便利ですね" ("甲's A is convenient, isn't it") in the first information document 10A is morphologically analyzed by the SVM 50, the final phrase 12 of the sentence, "便利ですね", consists of the independent word "便利" (convenient) and the sentence end "ですね". Here, an independent word is a word of any part of speech other than auxiliary verbs and particles, and the sentence end is the group of words ranging from the end of the sentence back to, but not including, this independent word.
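The definition of the sentence end given above can be sketched in code. The sketch assumes the sentence has already been tokenized into (surface form, part of speech) pairs by some morphological analyzer and that punctuation has been stripped; the token list, the tag names, and the function name `split_sentence_end` are illustrative assumptions, not part of the described method.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)

# Parts of speech that are NOT independent words: auxiliary verbs and particles.
DEPENDENT_POS = {"助動詞", "助詞"}


def split_sentence_end(tokens: List[Token]) -> Tuple[List[Token], List[Token]]:
    """Split tokens into (everything up to the last independent word, sentence end)."""
    last_independent = -1
    for i, (_, pos) in enumerate(tokens):
        if pos not in DEPENDENT_POS:
            last_independent = i
    # The sentence end is every token after the last independent word.
    return tokens[: last_independent + 1], tokens[last_independent + 1 :]


# Hypothetical analyzer output for "甲のＡは便利ですね".
tokens = [
    ("甲", "名詞"), ("の", "助詞"), ("Ａ", "名詞"), ("は", "助詞"),
    ("便利", "名詞"), ("です", "助動詞"), ("ね", "助詞"),
]
body, ending = split_sentence_end(tokens)
print(ending)
```

For this token list the last independent word is "便利", so the sentence end is the pair of tokens for "です" and "ね".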

In the morphological analysis, as shown in FIG. 5, when the features used are single morphemes, the morphemes in the sentence end "ですね" are the auxiliary verb "です" and the particle "ね". The magnitude of the feature direction vector is counted as 5 for the auxiliary verb "です" and 4 for the particle "ね".

As stated, the single morphemes in the sentence end "ですね" are the auxiliary verb "です", counted with a feature direction vector magnitude of 5, and the particle "ね", counted with a magnitude of 4. When the features used are combinations of morphemes, as shown in FIG. 6, the combination "です＋ね", counted with a feature direction vector magnitude of 2, is added on top of these.

Next, as shown in FIG. 7, when the features used are groups of morphemes, grouped morphemes are used. The grouped morphemes are morphemes learned in advance by the SVM 50 and cover the sentence-final particles that express confirmation or agreement (ね, よね, な, ねぇ, ネ, ねえ, わね). In this case, on top of the morpheme-combination features, namely the auxiliary verb "です" with a feature direction vector magnitude of 5, the particle "ね" with a magnitude of 4, and "です＋ね" with a magnitude of 2, the (sentence-final) particle "ね" with a magnitude of 2 is added.
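The three kinds of sentence-end features (single morphemes, combinations, and groups) can be sketched as below. This is a simplified illustration under stated assumptions: combinations are taken to be adjacent morpheme pairs, each occurrence contributes a count of 1, and the feature names and the function `sentence_end_features` are hypothetical. The magnitudes 5, 4, and 2 shown in the figures are not reproduced here.

```python
from collections import Counter
from typing import List

# Sentence-final particles expressing confirmation or agreement, as listed in the text.
CONFIRMATION_PARTICLES = {"ね", "よね", "な", "ねぇ", "ネ", "ねえ", "わね"}


def sentence_end_features(ending: List[str]) -> Counter:
    """Build a feature count from the morphemes of a sentence end."""
    features: Counter = Counter()
    # 1) Single morphemes.
    for morpheme in ending:
        features[morpheme] += 1
    # 2) Combinations (assumed here to be adjacent pairs, e.g. "です+ね").
    for left, right in zip(ending, ending[1:]):
        features[f"{left}+{right}"] += 1
    # 3) Groups: any confirmation/agreement particle maps to one shared feature.
    for morpheme in ending:
        if morpheme in CONFIRMATION_PARTICLES:
            features["GROUP:confirmation"] += 1
    return features


features = sentence_end_features(["です", "ね"])
print(dict(features))
```

The resulting counts could then be turned into a fixed-length vector (one dimension per known feature) before being passed to a classifier such as an SVM.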

In this way, when the phrase 12 of the first selected sentence 11A is morphologically analyzed by the SVM 50, the first information document 10A is determined to convey its meaning through subjective expression and is classified, as an analysis-scheduled information document 6, into the subjective documents 2, which are the first positive example 20A. For the second information document 10B, on the other hand, when the final phrase of a sentence selected from that document is morphologically analyzed by the SVM 50 in the same manner as for the first information document 10A, the document is determined to convey its meaning through objective expression. Advertising documents such as the second information document 10B are therefore classified into the objective documents 3, which are the first negative example 20B, fall outside the analysis in the second analysis step S2, and are removed from the document group 1.

Next, the analysis-scheduled information document 6 is described using several concrete examples. FIG. 8 shows example blog documents for an evaluation text that is a target of analysis and for evaluation texts and the like that are not targets of analysis, all included in the published document group. FIG. 9 shows the blog documents illustrated in FIG. 8 with the keywords of interest highlighted. FIG. 10 is a sixth explanatory diagram schematically showing the idea used in the second analysis step shown in FIG. 1 to classify evaluation texts that are targets of analysis and evaluation texts and the like that are not.

As described above, the first information document 10A shown in FIG. 8 is an evaluation document about the gas stove with the product name "A" handled by 甲, and, in the state before the SVM 50 executes the second analysis step S2, it is also an analysis-scheduled information document 6. The third information document 10C and the fourth information document 10D are evaluation documents about things other than gas stoves and are also documents describing everyday life.

As shown in FIG. 9, the first information document 10A contains three kinds of elements: adjectives and adjectival verbs (first elements), "便利, 素敵, 素晴らしい, おいしい, 嬉しい" (convenient, lovely, wonderful, delicious, happy; shown in dotted frames); among the nouns that co-occur with these, nouns other than gas stove nouns (second elements), "タイマー機能, クッキー" (timer function, cookie; shown in dash-dot frames); and among the nouns that co-occur with the second elements, nouns related to gas stoves (third elements), "甲, Ａ, トースト, グリル" (甲, A, toast, grill; shown in solid frames). The third information document 10C and the fourth information document 10D, by contrast, contain the first elements "良い, おしゃれな, いい" (good, stylish, nice) and the second elements "間取り, 外観, 自由設計" (floor plan, exterior, custom design), but no third elements.

That is, it can be assumed that the first information document 10A, an evaluation document, differs from the third information document 10C and similar documents in how co-occurring adjectives and adjectival verbs are used with respect to nouns related to gas stoves. In the second analysis step S2, therefore, the analysis-scheduled information documents 6 belonging to the subjective documents 2 are classified on the basis of the co-occurring adjectives and adjectival verbs.

A concrete method for sorting an analysis-scheduled information document 6 extracted from the subjective documents 2 into the second positive example 30A or the second negative example 30B is described next. FIG. 11 is a seventh explanatory diagram showing concretely how, in the second analysis step shown in FIG. 1, the sentence end is analyzed for a specific evaluation item in relation to co-occurring words. FIG. 12 is an eighth explanatory diagram that continues from FIG. 11, and FIG. 13 is a ninth explanatory diagram that continues from FIG. 12, each showing the same analysis. In FIG. 11 and the subsequent figures, the first information document listed in the tables denotes the first information document 10A, and the m-th information document (m > 2) denotes the other information documents.

In the second analysis step S2, the first information document 10A, as one analysis-scheduled information document 6 (see FIG. 1), is morphologically analyzed by the SVM 50. The morphological analysis uses as its feature vector the nouns related to the evaluation of gas stoves together with the connotations of the adjectives and adjectival verbs contained in the first information document 10A, in which those nouns appear.

As shown in FIG. 11, in the second selected sentence 11B contained in the first information document 10A, "甲のＡは便利ですね。とりあえず朝のトーストは素敵に焼けます＾＾タイマー機能も素晴らしいですよ。" ("甲's A is convenient, isn't it. For a start, the morning toast comes out lovely ^^ The timer function is wonderful too."), the words counted as nouns related to the evaluation of gas stoves are, in order from the beginning of the second selected sentence 11B, "Ａ" with a feature direction vector magnitude of 1 and "便利" with a feature direction vector magnitude of 2.

Next, although the noun "トースト" (toast) belongs to the nouns related to the evaluation of gas stoves, it co-occurs with "素敵" (lovely), which is classified among the adjectives and adjectival verbs listed earlier, and the word "素敵" is treated as not co-occurring with a noun related to gas stove evaluation. The word "素敵" therefore has a feature direction vector magnitude of 0 and is not counted in the morphological analysis (see FIG. 12).

As shown in FIG. 13, the noun "タイマー機能" (timer function), which is not a gas stove noun, co-occurs with "素晴らしい" (wonderful), which is classified among the adjectives and adjectival verbs listed earlier, although this is not a combination of a noun related to gas stove evaluation with an adjective or adjectival verb co-occurring with it. Both "タイマー機能" and "素晴らしい" are therefore counted with a feature direction vector magnitude of 1.

FIG. 14 explains how, in the second analysis step shown in FIG. 1, the appearance ratio of a specific evaluation item in the positive examples is calculated for each document in the published document group when the words co-occurring with the specific evaluation item are nouns.

As shown in FIG. 14, the text mining method according to this embodiment is carried out with, as an example, 50 nouns selected as keywords related to the specific product (such as nouns related to the evaluation of gas stoves), and under the condition that the number of appearing documents, that is, the number of information documents in the document group 1 in which a given morpheme occurs, is 10 or more. The number of keywords used to extract the analysis information documents 4 from the document group 1 is 50 only as an example; for reference, this number comes from taking the top 50 words by positive-example appearance ratio, based on past classification analyses. Selecting roughly the top 50 keywords raises the accuracy with which the text mining method according to this embodiment extracts the analysis information documents 4.

The positive-example appearance ratio is the quotient obtained by dividing the positive-example frequency total by the sum of the positive-example frequency total and the negative-example frequency total. The positive-example frequency total is obtained over the analysis information documents 4 determined to be the second positive example 30A in the second analysis step S2: for each keyword, the feature direction vector magnitudes in each information document, such as the first information document 10A, are summed. The negative-example frequency total is obtained over the non-target documents 5 determined to be the second negative example 30B in the second analysis step S2: for the same keywords as in the positive case, the feature direction vector magnitudes in each information document, such as the third information document 10C or the fourth information document 10D, are summed.
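The ratio and the keyword selection described above can be sketched as follows. The function names, the dictionary-based inputs, and the toy numbers are assumptions made for illustration; only the formula (positive total divided by positive plus negative total), the top-50 selection, and the minimum of 10 appearing documents come from the text.

```python
from typing import Dict, List


def positive_ratio(pos_total: float, neg_total: float) -> float:
    """Positive-example appearance ratio: pos / (pos + neg)."""
    total = pos_total + neg_total
    return pos_total / total if total else 0.0


def select_keywords(
    pos_freq: Dict[str, float],   # per-keyword frequency total over positive examples
    neg_freq: Dict[str, float],   # per-keyword frequency total over negative examples
    doc_freq: Dict[str, int],     # number of documents in which each word appears
    top_n: int = 50,
    min_docs: int = 10,
) -> List[str]:
    """Keep words appearing in at least min_docs documents, ranked by positive ratio."""
    candidates = [
        w for w in pos_freq.keys() | neg_freq.keys()
        if doc_freq.get(w, 0) >= min_docs
    ]
    # Sort by descending ratio; ties are broken by the word itself for determinism.
    ranked = sorted(
        candidates,
        key=lambda w: (-positive_ratio(pos_freq.get(w, 0), neg_freq.get(w, 0)), w),
    )
    return ranked[:top_n]


# Toy numbers (hypothetical, not taken from the experiment).
pos_freq = {"グリル": 8, "間取り": 1, "トースト": 3}
neg_freq = {"グリル": 2, "間取り": 9, "トースト": 1}
doc_freq = {"グリル": 12, "間取り": 15, "トースト": 4}
print(select_keywords(pos_freq, neg_freq, doc_freq))
```

In the toy data, "トースト" is dropped because it appears in fewer than 10 documents, and the remaining words are ordered by their ratio.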

<Investigation experiment>
Next, an investigation experiment was conducted to compare the text mining method according to this embodiment with the conventional text mining method described in the background art when extracting useful documents from the document group 1, and to confirm whether the method according to this embodiment offers a significant advantage. The investigation consisted of two experiments: an example in which useful documents were extracted by the text mining method according to this embodiment, and a comparative example in which useful documents were extracted by the conventional text mining method.

<Experiment according to the example>
The information analyst carried out the experiment according to the example with the aim of using the text mining method according to this embodiment to extract gas stove evaluation documents (analysis information documents 4) from the document group 1, which consists of 88 blog documents.

FIG. 19 is a schematic diagram showing the concept of the conventional text mining method. In the conventional method, positive and negative examples are defined as follows: gas stove evaluation documents, whose meaning includes an evaluation of a gas stove, are positive examples 40A, and all other documents (advertising documents, documents describing everyday life, documents describing evaluations of things other than gas stoves, and so on) are negative examples 40B. As shown in FIG. 19, the conventional method is a one-stage classification technique in which the information documents extracted from the collected document group 1 are classified by the SVM 50 into positive examples 40A and negative examples 40B, so that only the useful documents subject to analysis are extracted directly. Specifically, when the SVM 50 classifies an information document, the words that make up a sentence in the document are morphologically analyzed using four parts of speech (nouns, adjectives, adverbs, and auxiliary verbs) as features, on the basis of how frequently these parts of speech appear.

<Experiment according to the comparative example>
The information analyst also carried out the experiment according to the comparative example with the aim of using the conventional text mining method to extract gas stove evaluation documents (corresponding to the analysis information documents 4) from the same document group 1 (88 blog documents) used in the experiment according to the example.

<Preparation before the experiments>
For the two experiments, in this embodiment the information analyst searched information distribution media for information documents using the keywords "ガスコンロ／購入, テーブルコンロ, ビルトインコンロ, 東邦ガス／コンロ, ガスコンロ／評価" (gas stove / purchase, tabletop stove, built-in stove, Toho Gas / stove, gas stove / evaluation) and collected the document group made up of all the information documents returned. "Toho Gas", used in the keywords, is one of the present applicants. The collected document group contains 146 documents in total: 73 "negative examples" and, to match that number, 73 "positive examples". The contents of all 146 information documents in the document group were checked in advance by an operator. The SVM 50 was trained using 73 documents each for the "negative examples" and the "positive examples".

The "positive examples" cover only documents that describe evaluations of gas stoves. The information documents belonging to the "negative examples" break down as follows: 34 advertising documents that promote gas stoves or other unspecified products, 29 documents describing everyday life, 9 documents describing evaluations of things other than gas stoves, and 1 other document.

<Conditions common to the two experiments>
As described above, from the collected document group of 146 documents, the information analyst took the 88 blog documents (information documents) judged by close examination to carry the principal content and used them as the population, the document group 1 (see FIGS. 1 and 19), in both the experiment according to the example and the experiment according to the comparative example.

<Evaluation method for the two experiments>
FIG. 15 explains the definitions of the criteria used to judge the results of analyzing documents classified by a text mining method. As shown in FIG. 15, an information document confirmed in advance by the operator to belong to the "positive examples" is labeled "TP" when it is correctly judged to be the "second positive example 30A" in the experiment according to the example, or correctly judged to be the "positive example 40A" in the experiment according to the comparative example. An information document confirmed in advance to belong to the "positive examples" is labeled "TN" when it is wrongly judged to be the "second negative example 30B" in the experiment according to the example, or wrongly judged to be the "negative example 40B" in the experiment according to the comparative example. An information document confirmed in advance to belong to the "negative examples" is labeled "FP" when it is wrongly judged to be the "second positive example 30A" in the experiment according to the example, or wrongly judged to be the "positive example 40A" in the experiment according to the comparative example. An information document confirmed in advance to belong to the "negative examples" is labeled "FN" when it is correctly judged to be the "second negative example 30B" in the experiment according to the example, or correctly judged to be the "negative example 40B" in the experiment according to the comparative example. Note that the labels "TN" and "FN" as defined here are the reverse of the common convention, in which FN denotes a positive wrongly judged to be negative and TN denotes a negative correctly judged to be negative.

The accuracy A represents the proportion of both "positive examples" and "negative examples" that were judged correctly in each experiment (model), and is given by A = (TP + FN) / (TP + TN + FP + FN). The positive-example accuracy TR is an index of how exhaustively the "positive examples" were extracted from the model's population (the document group 1), and is given by TR = TP / (TP + TN). The negative-example accuracy FR is an index of how exhaustively the "negative examples" were extracted from the model's population (the document group 1), and is given by FR = FN / (FP + FN).
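The three indices can be computed as in the sketch below. The function name `evaluate` and the toy counts are assumptions for illustration; the formulas follow this document's own labeling, in which TN counts positives judged negative and FN counts negatives correctly judged negative.

```python
from typing import Tuple


def evaluate(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (A, TR, FR) using this document's label definitions.

    tp: positives judged positive     tn: positives judged negative
    fp: negatives judged positive     fn: negatives judged negative
    """
    accuracy = (tp + fn) / (tp + tn + fp + fn)   # A
    positive_accuracy = tp / (tp + tn)           # TR
    negative_accuracy = fn / (fp + fn)           # FR
    return accuracy, positive_accuracy, negative_accuracy


# Toy counts (hypothetical, not the experimental data).
a, tr, fr = evaluate(tp=8, tn=2, fp=1, fn=9)
print(f"A={a:.2f} TR={tr:.2f} FR={fr:.2f}")
```

When comparing against tools that use the common confusion-matrix convention, the `tn` and `fn` arguments here would need to be swapped.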

<Experimental results>
FIG. 16 is a table showing the results of the experiment according to the example, obtained with the text mining method according to this embodiment. FIG. 20 is a table showing the results of the experiment according to the comparative example, obtained with the conventional text mining method. In the experiment according to the example, as shown in FIG. 16, the positive-example accuracy TR was 0.81, the negative-example accuracy FR was 0.84, and the average accuracy A was 0.83. In the experiment according to the comparative example, as shown in FIG. 20, the positive-example accuracy TR was 0.57, the negative-example accuracy FR was 0.78, and the average accuracy A was 0.73.

<Discussion of the experiment>
Comparing the experimental results of the example with those of the comparative example, the example outperformed the comparative example on all three measures: the positive-example accuracy rate TR, the negative-example accuracy rate FR, and the average accuracy rate A. In particular, the average accuracy rate A was higher by 10 points (0.83 versus 0.73) and exceeded 80%. This is thought to be due to the following reasons. Unlike a conventional text mining method, which extracts the useful documents targeted for analysis with a one-stage classification technique, the text mining method according to the present embodiment is divided into a first analysis step S1 and a second analysis step S2. When an information analyst extracts from the document group 1 only the useful documents that are targets of analysis, as information that may be valuable to companies, consumers, and the like, the method according to the present embodiment first excludes the advertisement documents, which are considered to be numerous, from the document group 1 in the first analysis step S1. The total number of documents in the subjective document 2, which is the population for the second analysis step S2, is therefore far smaller than the total number of documents in the originally collected document group 1. Because the advertisement documents that would act as noise have been eliminated before the second analysis step S2, the information documents are easier to analyze, and the accuracy of the analysis is thought to improve as a result.

Next, the operation and effects of the text mining method and the text mining program according to the present embodiment will be described. The text mining program according to the present embodiment is characterized by comprising first to fourth steps, defined as follows. One information document (first information document 10A, second information document 10B, etc.) is extracted from the document group 1 created and published in a text data format, and a first selected sentence (for example, first selected sentence 11A), which is one sentence selected from the sentences contained in that information document, has either a meaning expressed subjectively or a meaning expressed objectively about an unspecified product. An information document whose meaning is expressed subjectively is defined as a first positive example 20A, and an information document whose meaning is expressed objectively is defined as a first negative example 20B. In the first step, the SVM 50 discriminates the extracted information document based on the sentence-end features in the final phrase (for example, phrase 12) of the first selected sentence, and assigns it to the first positive example 20A or the first negative example 20B. In the second step, an information document corresponding to the first positive example 20A is set as a target of the analysis performed in the subsequent second analysis step, and an information document corresponding to the first negative example 20B is excluded from the analysis in the second analysis step. The subjective document 2, that is, the set of information documents all classified as first positive examples 20A, is then taken as the population. For one analysis schedule information document 6 extracted from the subjective document 2, a second selected sentence (for example, second selected sentence 11B) selected from that document either does or does not have a meaning that includes an evaluation of the specific product to be analyzed. An analysis schedule information document 6 that satisfies this condition is defined as a second positive example 30A, and one that does not is defined as a second negative example 30B. In the third step, the SVM 50 discriminates the extracted analysis schedule information document 6 based on a feature vector consisting of keywords related to the specific product and words that co-occur with those keywords, and assigns it to the second positive example 30A or the second negative example 30B. In the fourth step, an analysis schedule information document 6 corresponding to the second positive example 30A is recognized as an analysis information document 4 to be analyzed, and an analysis schedule information document 6 corresponding to the second negative example 30B is excluded from the subjective document 2.
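The feature vector built from product keywords and the words that co-occur with them can be pictured with a small sketch. This is an illustration under stated assumptions, not the construction used in the embodiment: documents are assumed to be already split into sentences and words (Japanese text would need a morphological analyzer for that), “co-occurrence” is assumed to mean appearing in the same sentence as a keyword, and the vector is assumed to be a binary presence indicator. The function names `build_vocabulary` and `feature_vector` and the sample data are hypothetical.

```python
from typing import Iterable, List, Sequence, Set

Sentence = Sequence[str]      # one sentence, already split into words
Document = Sequence[Sentence]


def build_vocabulary(docs: Iterable[Document], keywords: Set[str]) -> List[str]:
    """Collect the keywords plus every word sharing a sentence with one."""
    vocab: Set[str] = set(keywords)
    for doc in docs:
        for sentence in doc:
            # Assumption: co-occurrence = same sentence as a keyword.
            if keywords.intersection(sentence):
                vocab.update(sentence)
    return sorted(vocab)


def feature_vector(doc: Document, vocab: Sequence[str]) -> List[int]:
    """Binary vector: 1 if the vocabulary word occurs anywhere in the document."""
    present = {word for sentence in doc for word in sentence}
    return [1 if word in present else 0 for word in vocab]


# Hypothetical, pre-tokenized toy documents.
docs = [
    [["new", "stove", "heats", "fast"], ["weather", "nice"]],
    [["sale", "today"]],
]
vocab = build_vocabulary(docs, keywords={"stove"})
vectors = [feature_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Vectors of this kind would then be passed to the two-class classifier; any weighting scheme other than binary presence is equally an assumption.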

The text mining method according to the present embodiment is characterized by comprising a first analysis step S1 and a second analysis step S2, defined as follows. One information document (first information document 10A, second information document 10B, etc.) is extracted from the document group 1 created and published in a text data format, and a first selected sentence (for example, first selected sentence 11A), which is one sentence selected from the sentences contained in that information document, has either a meaning expressed subjectively or a meaning expressed objectively about an unspecified product. An information document whose meaning is expressed subjectively is defined as a first positive example 20A, and an information document whose meaning is expressed objectively is defined as a first negative example 20B. In the first analysis step S1, the SVM 50 discriminates the extracted information document based on the sentence-end features in the final phrase (for example, phrase 12) of the first selected sentence and assigns it to the first positive example 20A or the first negative example 20B; an information document corresponding to the first positive example 20A is set as a target of the analysis performed in the subsequent second analysis step S2, and an information document corresponding to the first negative example 20B is excluded from the analysis in the second analysis step S2. The subjective document 2, that is, the set of information documents all classified as first positive examples 20A, is then taken as the population. For one analysis schedule information document 6 extracted from the subjective document 2, a second selected sentence (for example, second selected sentence 11B) selected from that document either does or does not have a meaning that includes an evaluation of the specific product to be analyzed. An analysis schedule information document 6 that satisfies this condition is defined as a second positive example 30A, and one that does not is defined as a second negative example 30B. In the second analysis step S2, the SVM 50 discriminates the extracted analysis schedule information document 6 based on a feature vector consisting of keywords related to the specific product and words that co-occur with those keywords, and assigns it to the second positive example 30A or the second negative example 30B; an analysis schedule information document 6 corresponding to the second positive example 30A is recognized as an analysis information document 4 to be analyzed, and an analysis schedule information document 6 corresponding to the second negative example 30B is excluded from the subjective document 2.
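The control flow of the two analysis steps can be sketched as a simple cascade. This is a minimal sketch assuming each step is available as a yes/no function; in the embodiment both decisions are made by the SVM 50, whereas the keyword rules below are toy stand-ins invented for illustration. The names `two_stage_filter`, `toy_s1`, and `toy_s2` and the sample documents are hypothetical.

```python
from typing import Callable, List, Tuple

Classifier = Callable[[str], bool]  # True means "positive example"


def two_stage_filter(
    documents: List[str],
    is_subjective: Classifier,      # stands in for analysis step S1
    evaluates_product: Classifier,  # stands in for analysis step S2
) -> Tuple[List[str], List[str]]:
    """Return (subjective documents, analysis documents)."""
    # S1: keep first positive examples; first negative examples are dropped.
    subjective = [d for d in documents if is_subjective(d)]
    # S2: runs only on the documents that survived S1.
    analysis = [d for d in subjective if evaluates_product(d)]
    return subjective, analysis


def toy_s1(doc: str) -> bool:
    """Toy rule (assumption): first-person openings count as subjective."""
    return doc.startswith("I ")


def toy_s2(doc: str) -> bool:
    """Toy rule (assumption): mentions of the target product count as evaluations."""
    return "gas stove" in doc


docs = [
    "I think this gas stove heats evenly",
    "Big sale on kitchen goods this weekend",
    "I walked the dog this morning",
]
subjective, analysis = two_stage_filter(docs, toy_s1, toy_s2)
print(len(subjective), analysis)
```

The point of the structure is that the second classifier never sees the documents rejected by the first, so its population is smaller than the original document group.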

With these features, the advertisement documents that are contained in large numbers as negative examples are removed in advance, at the time of text analysis, from the large document group 1 published in various document forms. As a result, only information documents whose meaning includes an evaluation of the specific product can be extracted, efficiently and with higher accuracy, as the useful information documents that require analysis.

The text mining method according to the present embodiment is also characterized in that the target of the first negative example 20B is an advertisement whose content promotes an unspecified product. The many advertisement documents that do not need analysis are therefore eliminated in the first analysis step S1, and the total number of documents in the subjective document 2 analyzed in the second analysis step S2 can be made far smaller than the total number of documents in the originally collected document group 1. Consequently, the accuracy with which the method according to the present embodiment extracts the information documents to be analyzed is substantially higher than with a conventional text mining method; for example, as shown by the experimental results of the example, the average accuracy rate rose by 10 points.

The text mining method according to the present embodiment is also characterized in that the features used in the first analysis step S1 are based on morphemes. Therefore, information documents can be classified accurately into positive and negative examples even when the documents to be analyzed take various forms: for example, an evaluation document about a specific product (a gas stove) as illustrated by the first information document 10A, an advertisement document promoting an unspecified product as illustrated by the second information document 10B, or documents describing everyday life or evaluating something other than a gas stove as illustrated by the third information document 10C and the fourth information document 10D.

The text mining method according to the present embodiment is also characterized in that the specific product is a gas appliance. Therefore, when an information analyst whose company handles gas stoves as a business conducts market research on gas stoves for purposes such as new development, product improvement, or sales improvement, using information media such as blogs, Web bulletin boards, and other SNS channels, more reliable market research results can be obtained on the basis of the classified analysis information documents 4.

As is clear from the above description, the text mining method and the text mining program according to the present invention make it possible to extract, efficiently and with higher accuracy, only the useful information documents that require analysis from a large number of information documents published in text data format in various document forms.

The embodiment described above is merely an example and does not limit the present invention in any way. Various improvements and modifications are of course possible without departing from the gist of the invention.

For example, although the classifier in the present embodiment is the SVM 50, the classifier used in the text mining method and the text mining program according to the present invention is not limited to the SVM 50. Any pattern classifier may be used as long as it divides the information documents to be identified into two classes, “positive example” and “negative example”.
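To make the interchangeability concrete, the sketch below shows a perceptron, one well-known two-class linear classifier that could occupy the same role. It is a swapped-in technique chosen here purely for illustration; the original text does not name it, and the class `Perceptron`, its parameters, and the toy training data are assumptions.

```python
from typing import List, Sequence


class Perceptron:
    """A tiny two-class linear classifier (labels +1 / -1)."""

    def __init__(self, n_features: int, epochs: int = 20) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.epochs = epochs

    def _score(self, x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def predict(self, x: Sequence[float]) -> int:
        """Return +1 (positive example) or -1 (negative example)."""
        return 1 if self._score(x) > 0 else -1

    def fit(self, xs: Sequence[Sequence[float]], ys: Sequence[int]) -> None:
        for _ in range(self.epochs):
            for x, y in zip(xs, ys):
                if self.predict(x) != y:  # update only on mistakes
                    self.weights = [w + y * xi for w, xi in zip(self.weights, x)]
                    self.bias += y


# Hypothetical, linearly separable toy data: feature 0 decides the class.
xs = [[1, 0], [1, 1], [0, 1], [0, 0]]
ys = [1, 1, -1, -1]
model = Perceptron(n_features=2)
model.fit(xs, ys)
predictions = [model.predict(x) for x in xs]
print(predictions)
```

Any classifier exposing the same train-then-predict behavior over two classes could replace it in either analysis step.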

1 Document group (plurality of information documents)
2 Subjective document (deemed positive example information document)
5 Analysis information document
6 Analysis schedule information document (one analysis schedule information document)
10A First information document (one information document)
10B Second information document (one information document)
10C Third information document (one information document)
10D Fourth information document (one information document)
11A First selected sentence
11B Second selected sentence
20A First positive example
20B First negative example
30A Second positive example
30B Second negative example
50 SVM (classifier)
S1 First analysis step
S2 Second analysis step

Claims

A text mining method characterized by comprising:
a first analysis step in which, where one information document is extracted from a plurality of information documents created and published in a text data format, a first selected sentence, which is one sentence selected from the sentences contained in the one information document, has either a meaning expressed subjectively or a meaning expressed objectively about an unspecified product, the one information document whose meaning is expressed subjectively is defined as a first positive example, and the one information document whose meaning is expressed objectively is defined as a first negative example,
the extracted one information document is discriminated by a classifier based on sentence-end features in the final phrase of the first selected sentence and assigned to the first positive example or the first negative example, the one information document corresponding to the first positive example is set as a target of the analysis performed in a second analysis step, and the one information document corresponding to the first negative example is excluded from the analysis in the second analysis step; and
the second analysis step in which, where the deemed positive example information document, in which the one information documents are all classified as the first positive example, is taken as a population, a second selected sentence, which is a sentence selected from one analysis schedule information document extracted from the deemed positive example information document, either does or does not have a meaning that includes an evaluation of a specific product to be analyzed, the one analysis schedule information document satisfying the condition that its meaning includes the evaluation is defined as a second positive example, and the one analysis schedule information document not satisfying this condition is defined as a second negative example,
the extracted one analysis schedule information document is discriminated by a classifier based on a feature vector consisting of a keyword related to the specific product and words co-occurring with the keyword and assigned to the second positive example or the second negative example, the one analysis schedule information document corresponding to the second positive example is recognized as an analysis information document to be analyzed, and the one analysis schedule information document corresponding to the second negative example is excluded from the deemed positive example information document.

The text mining method according to claim 1,
characterized in that the target of the first negative example is an advertisement whose content promotes the unspecified product.

The text mining method according to claim 1 or claim 2,
characterized in that the features used in the first analysis step are based on morphemes.

The text mining method according to any one of claims 1 to 3,
characterized in that the specific product is a gas appliance.

A text mining program characterized by comprising:
a first step in which, where one information document is extracted from a plurality of information documents created and published in a text data format, a first selected sentence, which is one sentence selected from the sentences contained in the one information document, has either a meaning expressed subjectively or a meaning expressed objectively about an unspecified product, the one information document whose meaning is expressed subjectively is defined as a first positive example, and the one information document whose meaning is expressed objectively is defined as a first negative example,
the extracted one information document is discriminated by a classifier based on sentence-end features in the final phrase of the first selected sentence and assigned to the first positive example or the first negative example;
a second step in which the one information document corresponding to the first positive example is set as a target of the analysis performed in a second analysis step, and the one information document corresponding to the first negative example is excluded from the analysis in the second analysis step;
a third step in which, where the deemed positive example information document, in which the one information documents are all classified as the first positive example, is taken as a population, a second selected sentence, which is a sentence selected from one analysis schedule information document extracted from the deemed positive example information document, either does or does not have a meaning that includes an evaluation of a specific product to be analyzed, the one analysis schedule information document satisfying the condition that its meaning includes the evaluation is defined as a second positive example, and the one analysis schedule information document not satisfying this condition is defined as a second negative example,
the extracted one analysis schedule information document is discriminated by a classifier based on a feature vector consisting of a keyword related to the specific product and words co-occurring with the keyword and assigned to the second positive example or the second negative example; and
a fourth step in which the one analysis schedule information document corresponding to the second positive example is recognized as an analysis information document to be analyzed, and the one analysis schedule information document corresponding to the second negative example is excluded from the deemed positive example information document.