JP4911599B2

JP4911599B2 - Reputation information extraction device and reputation information extraction method

Info

Publication number: JP4911599B2
Application number: JP2006356021A
Authority: JP
Inventors: 真樹村田; 晃一土井; 雅裕松岡
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2006-12-28
Filing date: 2006-12-28
Publication date: 2012-04-04
Anticipated expiration: 2026-12-28
Also published as: JP2008165599A

Description

The present invention relates to a computer-based method and apparatus for extracting reputation information published on a network about a predetermined target.

With the spread of the Internet, individuals can now publish information easily, unlike with conventional mass media and books. While this has allowed a wide variety of opinions to be expressed, it has also made it easy to disseminate unverified or defamatory information (hereinafter referred to as reputation information).
Moreover, because the number of websites is so large, even finding such information is difficult.

For those who are the subject of such reputation information, the impact is serious. For example, when a user writes about a product defect on an Internet bulletin board or home page, readers are left with the impression that the claim about the product is true, even if the defect is a misunderstanding on the user's part.

In particular, because information is easy to search for on the Internet, the circulation of reputation information seriously harms a company's economic activity when a prospective buyer looks up the product.
In fact, damage to corporate images and attacks on specific individuals are already taking place on the Internet and have become a major social problem.

Dealing with this problem requires manually searching for articles that concern oneself and picking out the inappropriate ones. However, new postings appear on the Internet continuously, so discovering them in real time is extremely difficult. Unless it is discovered promptly, inappropriate information is exposed to many readers, and recovering the lost trust becomes harder.

The conventional manual approach takes enormous effort and cost. At best, only a few very prominent websites can be monitored, and the information flooding the rest of the network effectively has to be tolerated.
For small and medium-sized businesses and individuals in particular, discovery is almost impossible.

The following techniques have been disclosed as methods for finding such reputation information automatically.
First, the technique of Patent Document 1 acquires and stores Web pages from the Internet, analyzes the stored pages and divides them into blocks, determines whether each block contains a predetermined keyword, and, when it does, extracts reputation information for that keyword. It further proposes weighting each item of reputation information and then calculating its importance based on, among other things, how the company's own product names and other companies' product names appear on the Web page.

Next, in the technique of Patent Document 2, when a user computer sends a report output request containing a keyword to an analysis server, the analysis server obtains negative-expression words and queries a database for one or more pieces of text in which a stored word corresponding to the keyword and an obtained negative-expression word appear together. The analysis server obtains the query result from the database and produces an analysis report formatted as tables, graphs, and the like.

JP 2004-70405 A
JP 2005-63242 A

Each of the above conventional techniques extracts a text as reputation information when a keyword, such as a company name, and a word indicative of reputation information are both found in it. For example, they take a rule-based approach in which a site is extracted when the name of product A appears together with a word such as “defect”.
With this approach, a text can be extracted if a predefined negative-expression word appears, but not if the wording differs even slightly. Given that reputation information is expressed in many different ways, it is difficult to extract it accurately with these conventional techniques.
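The limitation described above can be illustrated with a minimal Python sketch of keyword co-occurrence matching. The product name, the list of negative words, and the sample sentences are hypothetical and are not taken from the cited documents.

```python
# Rule-based extraction as characterized above: a text is flagged only when
# the target keyword and a pre-registered negative word both appear in it.
NEGATIVE_WORDS = {"defect", "recall"}  # hypothetical registered expressions


def rule_based_hit(text, target):
    """Return True when the target and any registered negative word co-occur."""
    lowered = text.lower()
    return target.lower() in lowered and any(w in lowered for w in NEGATIVE_WORDS)


# The first sentence uses a registered word; the second expresses the same
# complaint in different words and is therefore missed.
matched = rule_based_hit("Product A has a defect in its hinge.", "Product A")
missed = rule_based_hit("Product A falls apart after a week.", "Product A")
print(matched, missed)
```

Registering more words would catch the second sentence, but only at the cost of matching more unrelated text.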

Extracting without omission would require registering a very large number of words, in which case a large amount of information that is not reputation information would also be extracted, and the effort would end up being almost the same as extracting by hand.

Furthermore, even when the word “defect” appears, the information is not reputation information if it is true, for example when the manufacturer is actually conducting a recall. The conventional techniques cannot effectively exclude such fact-based information.

The present invention was made in view of the above problems of the prior art. Its object is to provide a technique for effectively determining whether information published on a network is reputation information, and in particular a technique that makes this determination by using in combination a plurality of features that contribute to it.

To solve the above problems, the present invention provides the following reputation information extraction device.
The invention according to claim 1 is a computer-based reputation information extraction device that extracts reputation information published on a network about a predetermined target, comprising: data collection means for receiving data published by one or more server devices on the network and storing each item as collected data in collected data storage means; feature table storage means storing a feature table that includes at least words or sets of words serving as features; feature extraction means for extracting features from the collected data with reference to the feature table; and machine learning determination means including a predetermined machine learning module that, when one or more features are input, determines whether the input is reputation information about the predetermined target with reference to machine learning result data stored in learning result storage means. The features extracted by the feature extraction means are input to the machine learning determination means to obtain a determination of whether the collected data is reputation information, and the device further comprises reputation information output means for outputting at least one of: at least part of the collected data determined to be reputation information; the name or network address of the server device on which it is published; and file information of the collected data.

The reputation information extraction device further has a classification vocabulary table in which word meanings are classified into semantic classes by codes and a semantic class is assigned to each of a plurality of words. The semantic classes are included in the feature table as features, and the feature extraction means refers to the feature table to extract, from the collected data, the semantic classes of the words it contains.

In the invention according to claim 2, the reputation information extraction device has an information source reliability database that expresses, as a numerical value, the reliability of the information published under the name or network address of a server device on the network, or under the file information of the collected data. The reliability is included in the feature table as a feature, and the feature extraction means refers to the feature table to extract the numerical reliability value associated with the collected data.

In the invention according to claim 3, the reputation information extraction device comprises data reliability value evaluation means. The data reliability value evaluation means comprises: an evaluation data extraction unit that extracts, from a server device on the network or from an evaluation database accumulated in advance, evaluation data that matches the collected data in at least one of the author of the collected data, the name or network address of the server device on which it is stored, and the file information of the collected data; an evaluation factor table storage unit that stores at least one of positive factors, which raise the reliability of the collected data, and negative factors, which lower it; an evaluation feature extraction unit that extracts these factors from the evaluation data as features; and an evaluation machine learning determination unit including a predetermined machine learning module that, when one or more features are input, classifies the evaluation data according to its reliability with reference to machine learning result data stored in a learning result storage unit. The classification result obtained by inputting the features extracted by the evaluation feature extraction unit into the evaluation machine learning determination unit is output as the reliability of the collected data. This reliability value is input to the machine learning determination means together with the features extracted by the feature extraction means, to obtain a determination of whether the collected data is reputation information.

In the invention according to claim 4, the reputation information extraction device comprises basis information confirmation means. The basis information confirmation means comprises: a basis information database defining basis information sources, each being at least one of the name or network address of a server device that publishes reliable information and the file information of that reliable information; and a similarity determination unit that obtains the data published by the basis information sources defined in the basis information database and determines whether it contains similar data whose topic is similar to that of the collected data. The determination result of the similarity determination unit is input to the machine learning determination means together with the features extracted by the feature extraction means, to obtain a determination of whether the collected data is reputation information.

The present invention can also provide the following reputation information extraction method.
That is, the invention according to claim 5 is a computer-based reputation information extraction method for extracting reputation information published on a network about a predetermined target, comprising: a data collection step in which data collection means of the computer receives data published by one or more server devices on the network and stores each item as collected data in collected data storage means; a feature extraction step in which, with feature table storage means provided that stores a feature table including at least words or sets of words serving as features, feature extraction means of the computer extracts features from the collected data with reference to the feature table; a machine learning determination step in which machine learning determination means of the computer, including a predetermined machine learning module, uses the extracted features and refers to machine learning result data stored in learning result storage means to determine whether the collected data is reputation information about the predetermined target; and a reputation information output step in which reputation information output means of the computer outputs at least one of: at least part of the collected data determined to be reputation information; the name or network address of the server device on which it is published; and file information of the collected data.

In the above configuration, a classification vocabulary table is provided in which word meanings are classified into semantic classes by codes and a semantic class is assigned to each of a plurality of words. The semantic classes are included in the feature table as features, and in the feature extraction step the feature extraction means refers to the feature table to extract, from the collected data, the semantic classes of the words it contains.

The invention according to claim 6 has an information source reliability database that expresses, as a numerical value, the reliability of the information published under the name or network address of a server device on the network, or under the file information of the collected data. The reliability is included in the feature table as a feature, and in the feature extraction step the feature extraction means refers to the feature table to extract the numerical reliability value associated with the collected data.

The invention according to claim 7 has a data reliability evaluation step at some point after the data collection step and before the machine learning determination step of the reputation information extraction method. The data reliability evaluation step includes: an evaluation data extraction process in which an evaluation data extraction unit of data reliability value evaluation means of the computer extracts, from a server device on the network or from an evaluation database accumulated in advance, evaluation data that matches the collected data in at least one of the author of the collected data, the name or network address of the server device on which it is stored, and the file information of the collected data; an evaluation feature extraction process in which, with an evaluation factor table storage unit provided that stores at least one of positive factors, which raise the reliability of the collected data, and negative factors, which lower it, an evaluation feature extraction unit of the data reliability value evaluation means extracts these factors from the evaluation data as features; and an evaluation machine learning determination process in which an evaluation machine learning determination unit of the data reliability value evaluation means, including a predetermined machine learning module, uses the features extracted in the evaluation feature extraction process and refers to machine learning result data stored in a learning result storage unit to classify the evaluation data according to its reliability. The classification result of the evaluation data is output as the reliability of the collected data, and in the machine learning determination step this reliability value is input to the machine learning determination means together with the features extracted by the feature extraction means, to obtain a determination of whether the collected data is reputation information.

In the invention according to claim 8, the reputation information extraction method includes a clustering step in which clustering means of the computer clusters, according to a predetermined clustering formula, the authors or contents contained in at least one of the reputation information data and related information data. In the output step, at least one of the reputation information data and the related information data is output in its clustered state.

With the above configuration, the present invention has the following effects.
According to the present invention, reputation information can be extracted using a variety of features, so whether information is reputation information can be determined by taking into account complex factors that cannot be handled manually.
In addition, using a computer makes it possible to search the vast amount of information circulating on a network for reputation information quickly and comprehensively, so that the damage caused by reputation information can be kept to a minimum.

Embodiments of the present invention are described below based on the examples shown in the drawings. The embodiments are not limited to the following.

FIG. 1 is an overall configuration diagram of a reputation information extraction device (1) according to the present invention (hereinafter referred to as the device). The present invention can readily be implemented on a known personal computer: a CPU (10), which handles arithmetic processing, machine learning, text processing, and so on, executes each step of the invention. As is well known, the CPU (10) operates in cooperation with a memory (not shown). The device also includes input means such as a keyboard and mouse (11), a monitor (12) that displays output results, and an external storage device (13) such as a hard disk.
It further includes a network adapter (14) that connects to a network such as the Internet and serves as data acquisition and input means, for example for acquiring text data.

The CPU (10) is provided with a data collection unit (100), a feature extraction unit (101), a machine learning determination unit (102), and a reputation information output unit (103).
A program written in a known programming language operates the CPU (10) and the hardware that works with it, thereby implementing the functions of the units (100) to (103) described below.

Each process of the present invention is described in detail below with reference to the processing flowchart shown in FIG. 2.
First, the data collection unit (100) receives the data published by each of the many server devices (20) installed on a network such as the Internet (21). (Data collection step: S1)
Specifically, it receives text data published on web server A (for example, text data named a1.txt) and data containing display formatting written in HTML (HyperText Markup Language) or the like (for example, data named a2.html).

As is well known, large amounts of data published on the Internet can be collected with an automatic crawler, for example by following hyperlinks and acquiring pages one after another. Alternatively, predetermined servers may be specified in advance and their contents acquired in order according to the directory structure.
The data collection performed in the present invention can be carried out by any method used in search engines and the like.
The collected data is stored on the hard disk (13), which serves as the data storage means.
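The hyperlink-following collection described above can be sketched in Python as a breadth-first traversal. This is only an illustration under stated assumptions: the `fetch` callable, the `FAKE_WEB` dictionary standing in for the network, and the example URLs are all hypothetical, and relative links are not resolved.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href values of <a> tags found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch, limit=100):
    """Follow hyperlinks breadth-first; fetch(url) returns text or None."""
    collected = {}
    queue = deque([start])
    while queue and len(collected) < limit:
        url = queue.popleft()
        if url in collected:
            continue
        body = fetch(url)
        if body is None:
            continue
        collected[url] = body  # corresponds to storing the collected data
        parser = LinkParser()
        parser.feed(body)
        queue.extend(link for link in parser.links if link not in collected)
    return collected


# Hypothetical in-memory stand-in for servers on the network.
FAKE_WEB = {
    "http://a.example/a2.html": '<a href="http://a.example/a1.txt">text</a>'
                                '<a href="http://b.example/">other</a>',
    "http://a.example/a1.txt": "plain text",
    "http://b.example/": '<a href="http://a.example/a2.html">back</a>',
}
collected = crawl("http://a.example/a2.html", FAKE_WEB.get)
print(sorted(collected))
```

In practice `fetch` would be replaced by a real HTTP client, and the collected pages would be written to storage rather than kept in a dictionary.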

Next, the feature extraction unit (101) extracts, from the collected data, the features that the machine learning determination unit (102), described later, uses to determine whether the data is reputation information. (Feature extraction step: S2)
For this purpose, a feature table is stored on the hard disk (13), and the feature extraction unit (101) extracts the features defined in it.

For example, the features may be information on which words or sets of words (hereinafter collectively referred to as words) the data contains, or information on whether a given word is contained.
Features are usually given in the form of words, their parts of speech, and so on. The feature table of the present invention also stores the morphological dictionary and other resources needed to extract the features.
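As a minimal sketch of word-presence features, the following Python fragment maps each word in a feature table to 1 or 0 depending on whether it occurs in a tokenized text. The table contents and the token list are hypothetical, and the tokens are assumed to have been produced already by a morphological analyzer.

```python
# Hypothetical feature table: the words whose presence is used as a feature.
FEATURE_TABLE = {"loss", "rumor", "defect"}


def extract_word_features(tokens, feature_table):
    """Return {word: 1 or 0} for every word defined in the feature table."""
    present = set(tokens)
    return {word: int(word in present) for word in sorted(feature_table)}


# Tokens as they might come out of morphological analysis (assumed input).
word_features = extract_word_features(["rumor", "says", "loss"], FEATURE_TABLE)
print(word_features)
```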

Examples of the features used in the present invention are given below. The features used in the present invention are not limited to these.
First, Table 1 shows the features defined in the feature table (131).

The word or word-set feature is the minimum feature that must be used in the present invention, and it comes in several types, as shown in Table 2. These are extracted by performing morphological analysis on the collected data. Morphological analysis can be carried out easily using a well-known morphological analysis module such as Chasen (Non-Patent Document 1).

http://chasen.naist.jp/

With Chasen, the following can also be obtained for each analyzed morpheme: part-of-speech information, morpheme occurrence cost, pronunciation information, conjugation type information, conjugated form information, base form information specifying the base (dictionary) form of the headword, other additional information (semantic information), and compound word information.

Of the above, the name of the target is the most important: the word naming the target of the reputation information is used as a feature. That is, when the user specifies 情報通信研究機構 (National Institute of Information and Communications Technology) as the target for which reputation information is to be extracted, the feature extraction unit (101) detects, based on the morphological analysis result, whether that term is contained in the data.
When a target is specified in this way, whether the data contains the target is a precondition for extracting reputation information, so data that does not contain it may be excluded without performing the machine learning determination.

However, reputation information is not always written using the full or official name. For example, data that contains three of the four words 情報 (information), 通信 (communications), 研究 (research), and 機構 (institute) may well be reputation information, so these words may also be used as features in the machine learning determination.
As a result, whereas a rule-based approach cannot extract anything other than the defined phrase itself, the method of the present invention can extract potential reputation information from the combination of words contained in the data.
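The idea of using the component words of the target name as features can be sketched as follows. Plain substring matching is used here as a simplifying assumption in place of the morphological analysis described in the text, and the function name and sample sentence are hypothetical.

```python
TARGET = "情報通信研究機構"
COMPONENTS = ("情報", "通信", "研究", "機構")


def name_features(text):
    """Report whether the full name occurs and how many component words occur."""
    matched = [c for c in COMPONENTS if c in text]
    return {"full_name": TARGET in text, "component_count": len(matched)}


# The full name is absent, but three of its four component words appear.
partial = name_features("通信の研究をしている例の機構")
exact = name_features("情報通信研究機構が発表した")
print(partial, exact)
```

Both values would be passed to the classifier as features rather than used as a hard rule, so a text with a partial match remains a candidate.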

Next, the content of the rumor is what makes up the substance of the reputation information: phrases such as “incurred a loss,” which would also be defined as negative expressions in a rule-based approach.
With the machine learning determination of the present invention, however, multiple phrases are extracted as features for the content as well. For example, if only “incurred a loss” is extracted, the probability that the data is reputation information may be judged to be 0.5; if “according to news reports” and “incurred a loss” are extracted, 0.1; and if “according to rumors,” “incurred a loss,” and “so they say” are extracted, 0.9. This allows fine-grained determinations that a rule-based approach cannot provide.
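How combinations of features can shift the estimated probability may be illustrated with a small logistic scoring function. This is a sketch only: a logistic model is used as a stand-in because this passage does not name the learning method, and the weights and bias are hand-chosen to reproduce the illustrative values 0.5, 0.1, and 0.9 rather than learned from data.

```python
import math

# Hypothetical weights, chosen by hand so the three cases in the text come out
# as 0.5, 0.1, and 0.9. A trained model would learn such values from examples.
WEIGHTS = {
    "incurred_a_loss": 1.0,
    "according_to_news_reports": -math.log(9),
    "according_to_rumors": math.log(9) / 2,
    "so_they_say": math.log(9) / 2,
}
BIAS = -1.0


def rumor_probability(features):
    """Logistic score over the features present in a text."""
    z = BIAS + sum(WEIGHTS.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))


p_loss = rumor_probability(["incurred_a_loss"])
p_news = rumor_probability(["according_to_news_reports", "incurred_a_loss"])
p_rumor = rumor_probability(["according_to_rumors", "incurred_a_loss", "so_they_say"])
print(round(p_loss, 2), round(p_news, 2), round(p_rumor, 2))
```

The same negative phrase thus yields different scores depending on the features that accompany it, which a single keyword rule cannot express.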

As other features, symbols and words that help in judging the reliability of the information can also be used. As in the examples, the presence of an emoticon (a symbol that expresses a facial expression through a combination of ASCII characters) suggests that the information is not official information such as a press release or a news site. Words indicating the type or attributes of a website, such as “アングラ” (underground), and words suggesting that the text was written out of personal feeling, such as “しやがった” (a contemptuous way of saying “did”), can also be used as features.

Furthermore, the name of the server on which the extracted data was stored (the name assigned to the machine, the domain name, the host name, and so on) and file information (file name, directory name, extension, and so on) can also be used as features.
As in the examples, a name containing a string such as bbs suggests a bulletin board on the network, and the information is likely not to be official. Likewise, a name containing “~” (tilde) may indicate a personally run home page, which also makes it a suitable feature for determining whether the data is reputation information.
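The server-name and file-information features can be sketched with Python's standard `urllib.parse` module. The feature names and example URLs are hypothetical, and the checks shown (the string bbs, a tilde, the file extension) are only the cues mentioned in the text.

```python
from urllib.parse import urlparse


def url_features(url):
    """Derive simple features from the host name and path of a URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path
    filename = path.rsplit("/", 1)[-1]
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return {
        "has_bbs": "bbs" in host or "bbs" in path.lower(),  # bulletin board cue
        "has_tilde": "~" in path,                            # personal page cue
        "extension": extension,
    }


personal = url_features("http://www.example.com/~taro/bbs/index.html")
official = url_features("http://news.example.org/press/2006")
print(personal, official)
```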

In addition, morphological analysis of official documents generally yields a relatively low proportion of unknown words, whereas information that contains reputation information often includes slang, in particular new expressions that have recently come into frequent use on the Internet.
Taking advantage of this property, the present invention also proposes identifying unknown words, that is, words that could not be analyzed during morphological analysis because they are not registered in the dictionary, and using the result as a feature.

In this case, the feature may be whether an unknown word is present, or it may be the proportion of unknown words (for example, what percentage of all words they account for).
By using information about unknown words as a feature, information that uses slang such as 「カキコ」 (kakiko, meaning a post on a bulletin board) is reflected in the determination as being likely to be reputation information.
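Both variants of the unknown-word feature (presence and proportion) can be sketched as below. The sketch assumes the text has already been split into words; a plain set lookup stands in for the morphological analyzer's dictionary, and the function name and toy data are hypothetical.

```python
def unknown_word_features(tokens, dictionary):
    """Return presence and proportion of tokens missing from the dictionary."""
    unknown = [t for t in tokens if t not in dictionary]
    ratio = len(unknown) / len(tokens) if tokens else 0.0
    return {"has_unknown": bool(unknown), "unknown_ratio": ratio}

# Toy data: "カキコ" is not in the (hypothetical) dictionary.
tokens = ["昨日", "カキコ", "を", "見た"]
dictionary = {"昨日", "を", "見た"}
result = unknown_word_features(tokens, dictionary)
```

In a real system the analyzer itself would normally flag unknown words, so the dictionary lookup here is only a placeholder for that step.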

Next, formatting information can also be used as a feature in the present invention. Specific examples are shown in Table 3.

Formatting information uses, for example, formatting specified by HTML tags as a feature. For example, a web page with a black background and red text is, in general, often not a website that publishes official information, so using such formatting information as a feature also contributes to accurate extraction of reputation information.
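One way to read such formatting information out of HTML is sketched below using Python's standard `html.parser`. It only looks at the legacy `bgcolor` and `text` attributes of the `body` tag, which is an assumption made for brevity; colors set through CSS would need separate handling. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class BodyColorParser(HTMLParser):
    """Collect the bgcolor/text attributes of the <body> tag."""

    def __init__(self):
        super().__init__()
        self.features = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "body":
            for name, value in attrs:
                if name in ("bgcolor", "text") and value:
                    self.features[f"body_{name}"] = value.lower()

def formatting_features(html):
    parser = BodyColorParser()
    parser.feed(html)
    return parser.features

page = '<html><body bgcolor="#000000" text="#FF0000"><p>...</p></body></html>'
features = formatting_features(page)
```

A black background with red text would then surface as the pair of feature values `#000000` and `#ff0000`.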

Semantic classes can also be used as features.
Here, a semantic class is a classification of words by semantic similarity, and a classification vocabulary table organized by semantic class has been compiled as shown in Table 4.

In the present invention, the classification vocabulary table is stored in the feature table (131). A classification vocabulary table is, in general, a table in which words are organized by meaning, and each word is assigned a number called a classification number. This 10-digit classification number represents a 7-level hierarchy: the top 5 levels are expressed by the first 5 digits, the 6th level by the next 2 digits, and the lowest level by the last 3 digits.

With such a classification vocabulary table, words with similar meanings can be extracted together by restricting the match to the upper digits of the classification number. That is, the feature extraction unit (101) looks up the classification number of each morphologically analyzed word in the classification vocabulary table and extracts that number as a feature. Using the upper 5 or 7 digits of the classification number as the feature makes it possible to cover a wide range of words with similar meanings.
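The prefix-based lookup can be sketched as follows. The table contents are made-up placeholders (the words and 10-digit numbers are not taken from any real classification vocabulary table), and the function name is hypothetical.

```python
def semantic_class_features(words, class_table, prefix_lengths=(5, 7)):
    """Map words to classification-number prefixes used as features."""
    features = set()
    for word in words:
        number = class_table.get(word)
        if number is None:
            continue  # word not listed in the table
        for n in prefix_lengths:
            features.add(f"class{n}:{number[:n]}")
    return features

# Hypothetical 10-digit classification numbers.
class_table = {
    "word_a": "1234567001",
    "word_b": "1234567002",
    "word_c": "1299900001",
}
shared = semantic_class_features(["word_a", "word_b"], class_table)
```

Because `word_a` and `word_b` differ only in the last 3 digits, they yield the same 5-digit and 7-digit features, which is the grouping effect the paragraph describes.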

Table 5 shows examples of semantic class features used in the present invention.

A reliability value based on an information source reliability database, as shown in Table 6, may also be used as a feature in the present invention. Here too, the database is stored in the feature table (131), and the feature extraction unit (101) extracts the reliability by referring to it.

Specifically, the feature extraction unit extracts the domain name of the server from which the data was collected and refers to the information source reliability database, and the resulting reliability value is used by the machine learning determination unit (102). For example, when the domain name of the collected data is www.asahi.com, that is, when the source is a well-known news site, the information is factual reporting and cannot be called reputation information, even if it would ordinarily be likely to be a rumor. Providing a trustworthy information source reliability database in advance and using the reliability values defined in it as features thus contributes to accurate extraction of reputation information.
IP addresses and file information may also be defined in the information source reliability database and used as features. Table 7 shows examples of information source reliability.
In the present invention, the reliability of an information source can also be evaluated automatically, as described later.
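A lookup against an information source reliability database might look like the sketch below. The database is modeled as a plain dictionary keyed by host name; the reliability values, the default for unlisted hosts, and the function name are all assumptions for illustration and do not come from Table 6 or Table 7.

```python
from urllib.parse import urlsplit

def source_reliability(url, reliability_db, default=0.5):
    """Return the reliability value for the URL's host, or a default."""
    host = (urlsplit(url).hostname or "").lower()
    return reliability_db.get(host, default)

# Hypothetical reliability values, for illustration only.
reliability_db = {"www.asahi.com": 0.9, "bbs.example.com": 0.1}
value = source_reliability("http://www.asahi.com/national/", reliability_db)
```

The returned value would be passed to the classifier as one numeric feature alongside the others.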

The present invention is characterized by using the various features described above. Because no single one of these features is enough to identify reputation information with certainty, conventional rule-based reputation information extraction could not make determinations that take them into account.
In the present invention, machine learning is used to evaluate each feature with an appropriate weight and to determine whether the information is reputation information.

In the present invention, the result of machine learning as described above is stored on the hard disk (13) as machine learning result data (132) and is used for determination in the machine learning determination unit (102).
A machine learning method consists of a learning process in a known machine learning module together with a solution estimation process that uses its result. In implementing the present invention, the learning process is not essential, and it is sufficient that the machine learning result data (132) is already available. In view of this characteristic of machine learning, however, known methods for both processes are briefly described below.

A machine learning method prepares many problem-solution pairs, learns from them which solution corresponds to which kind of problem, and uses the learning result to estimate the solution for a new problem (see, for example, Non-Patent Documents 2 to 4 below).

Non-Patent Document 2: Masaki Murata, "Language processing based on machine learning", invited lecture, Faculty of Science and Engineering, Ryukoku University, 2004. http://www2.nict.go.jp/jt/a132/members/murata/ps/rk1-siryou.pdf
Non-Patent Document 3: Masaki Murata, Qing Ma, Kiyotaka Uchimoto, Hitoshi Isahara, "Japanese-English translation of tense, aspect, and modality using support vector machines", IEICE Study Group on Language Understanding and Communication, NLC2000-78, 2001.
Non-Patent Document 4: Masaki Murata, Masao Uchiyama, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara, "CRL's approach to the SENSEVAL2J dictionary task", IEICE Study Group on Language Understanding and Communication, NLC2001-40, 2001.

To tell the machine what kind of problem it faces, features (the individual elements that make up a problem, that is, the information used for analysis) are needed. A problem is expressed by its features. For example, in the problem of estimating the tense of a Japanese sentence-final expression, given
Problem: 「彼が話す。」 ("He speaks.") --- Solution: "present"
one example of the features is 「彼が話す。」, 「が話す。」, 「話す。」, 「す。」, and 「。」.
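The five example features are exactly the character suffixes of the sentence, so they can be generated as in this small sketch (the function name is hypothetical):

```python
def suffix_features(sentence):
    """Return every character suffix of the sentence, longest first."""
    return [sentence[i:] for i in range(len(sentence))]

features = suffix_features("彼が話す。")
# features == ["彼が話す。", "が話す。", "話す。", "す。", "。"]
```

Suffixes are a natural choice here because Japanese tense is marked at the end of the sentence.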

In other words, a machine learning method prepares many pairs of a feature set and a solution, learns from them which solution corresponds to which feature set, and uses the learning result so that, for a new problem, it extracts the feature set from that problem and estimates the solution for those features.

FIG. 3 is a configuration diagram of the apparatus when executing the machine learning process. The CPU (10) includes, as a stage preceding processing by the machine learning determination unit (102), a solution-feature pair extraction unit (141) and a machine learning unit (142). The machine learning process obtains a classification result (solution) indicating how text data distributed as in FIG. 4 is to be classified.
The machine learning unit (142) uses a machine learning method such as the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.

The k-nearest neighbor method uses the k most similar cases, instead of the single most similar case, and determines the classification (solution) by majority vote among those k cases. k is a predetermined integer; in general an odd number between 1 and 9 is used.
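A minimal sketch of k-nearest-neighbor classification over feature sets follows. The similarity measure (shared features divided by the union, i.e. Jaccard overlap) is an assumption chosen for illustration, as are the function names, labels, and toy training cases.

```python
from collections import Counter

def overlap(a, b):
    """Jaccard overlap between two feature sets (assumed similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_classify(query, examples, k=3):
    """Majority vote among the k training cases most similar to query."""
    ranked = sorted(examples, key=lambda ex: overlap(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], votes

examples = [
    ({"bbs", "emoticon"}, "reputation"),
    ({"bbs", "unknown_word"}, "reputation"),
    ({"press", "formal"}, "not_reputation"),
    ({"news", "formal"}, "not_reputation"),
]
label, votes = knn_classify({"bbs", "emoticon", "unknown_word"}, examples, k=3)
```

The vote count for a class (here `votes["reputation"]`) can double as a rough confidence score for that class.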

The simple Bayes method estimates the probability of each classification based on Bayes' theorem and takes the classification with the highest probability as the classification to be output.

In the simple Bayes method, the probability of outputting classification a in context b is given by Equation 1 below.
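The equation images for Equations 1 and 2 are not present in this text. As a hedged reconstruction, the usual simple (naive) Bayes form that agrees with the symbols defined in the following paragraph would be as below; the split into two numbered equations and the product index are assumptions.

```latex
% Assumed form of Equation 1: Bayes' theorem with independent features
p(a \mid b) = \frac{p(a)}{p(b)} \prod_{i} p(f_i \mid a)

% Assumed form of Equation 2: the same expression with estimated probabilities
p(a \mid b) \simeq \frac{\tilde{p}(a)}{p(b)} \prod_{i} \tilde{p}(f_i \mid a)
```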

Here, context b is a set of predefined features $f_j$ ($\in F$, $1 \le j \le k$). $p(b)$ is the probability of occurrence of context b; it does not depend on classification a and is a constant, so it is not computed. $\tilde{p}(a)$ and $\tilde{p}(f_i \mid a)$ are probabilities estimated from the teacher data: the probability of occurrence of classification a, and the probability of having feature $f_i$ given classification a, respectively. If values obtained by maximum likelihood estimation are used for $\tilde{p}(f_i \mid a)$, the value is often zero, so that Equation 2 evaluates to zero and the classification becomes difficult to determine. Smoothing is therefore performed. Here, values smoothed using Equation 3 below are used.

Here, $\mathrm{freq}(f_i, a)$ is the number of cases that have feature $f_i$ and classification a, and $\mathrm{freq}(a)$ is the number of cases whose classification is a.
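The counting and smoothing steps can be sketched as follows. Note that the smoothing used here is plain add-one smoothing, chosen as a stand-in because Equation 3 is not reproduced in this text; it should not be read as the patent's own formula. Function names, labels, and data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count freq(a) and freq(f, a) from (feature_set, label) pairs."""
    class_count = Counter()
    feature_count = defaultdict(Counter)
    for features, label in examples:
        class_count[label] += 1
        for f in set(features):
            feature_count[label][f] += 1
    return class_count, feature_count

def classify_naive_bayes(features, class_count, feature_count):
    """Pick the label maximizing log p(a) + sum of log p(f|a)."""
    total = sum(class_count.values())
    scores = {}
    for label, n in class_count.items():
        log_p = math.log(n / total)
        for f in set(features):
            # Add-one smoothing (assumed stand-in for Equation 3).
            log_p += math.log((feature_count[label][f] + 1) / (n + 2))
        scores[label] = log_p
    return max(scores, key=scores.get), scores

examples = [
    ({"bbs", "emoticon"}, "rep"), ({"bbs"}, "rep"),
    ({"press"}, "not"), ({"press", "formal"}, "not"),
]
model = train_naive_bayes(examples)
label, scores = classify_naive_bayes({"bbs", "emoticon"}, *model)
```

The term $p(b)$ is omitted because, as stated above, it is the same for every classification and does not affect the comparison.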

The decision list method treats pairs of a feature and a classification as rules and stores them in a list in a predetermined priority order. When an input to be classified is given, the input data is compared with the feature of each rule, starting from the highest-priority rule, and the classification of the rule whose feature matches is taken as the classification of the input.

In the decision list method, the probability of each classification is obtained using only one of the predefined features $f_j$ ($\in F$, $1 \le j \le k$) as the context. The probability of outputting classification a in a context b is given by Equation 4 below.

(Equation 4)
$$p(a \mid b) = p(a \mid f_{\max})$$
where $f_{\max}$ is given by Equation 5 below.

$\tilde{p}(a_i \mid f_j)$ is the proportion of occurrences of classification $a_i$ among the cases that have feature $f_j$ in their context.
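A decision list built from such proportions can be sketched as follows. The ordering rule used here (sort rules by the estimated proportion of their best classification, highest first) is an assumption standing in for Equation 5, which is not reproduced in this text; names and data are hypothetical.

```python
from collections import Counter, defaultdict

def build_decision_list(examples):
    """Create (proportion, feature, label) rules sorted by proportion."""
    by_feature = defaultdict(Counter)
    for features, label in examples:
        for f in set(features):
            by_feature[f][label] += 1
    rules = []
    for f, counts in by_feature.items():
        label, n = counts.most_common(1)[0]
        rules.append((n / sum(counts.values()), f, label))
    rules.sort(key=lambda r: r[0], reverse=True)
    return rules

def classify_decision_list(features, rules, default=None):
    """Return the label and proportion of the first rule that matches."""
    for prob, f, label in rules:
        if f in features:
            return label, prob
    return default, 0.0

examples = [
    ({"bbs", "formal"}, "rep"), ({"bbs"}, "rep"),
    ({"formal"}, "not"), ({"formal", "press"}, "not"),
]
rules = build_decision_list(examples)
label, prob = classify_decision_list({"bbs", "formal"}, rules)
```

Only the single highest-priority matching feature decides the outcome, which is what distinguishes this method from the simple Bayes method.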

In the maximum entropy method, with F denoting the set of predefined features $f_j$ ($1 \le j \le k$), the probability distribution $p(a, b)$ that maximizes the entropy expression (Equation 7) while satisfying a prescribed condition (Equation 6) is obtained, and among the probabilities of the classifications derived from that distribution, the classification with the highest probability is taken as the classification to be output.

Here, A and B denote the sets of classifications and contexts, and $g_j(a, b)$ is a function that is 1 when context b contains feature $f_j$ and the classification is a, and 0 otherwise. $\tilde{p}(a, b)$ denotes the proportion of occurrences of (a, b) in the known data.

Equation 6 multiplies the probability p by the function g, which indicates the occurrence of an output-feature pair, to obtain the expected frequency of that pair. Under the constraint that the expected value in the known data (right-hand side) equals the expected value computed from the probability distribution being sought (left-hand side), entropy maximization (smoothing of the probability distribution) is performed to obtain the probability distribution over outputs and contexts. Details of the maximum entropy method are described in Non-Patent Documents 5 and 6 below.
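The equation images for Equations 6 and 7 are not present in this text. As a hedged reconstruction, the standard maximum entropy constraint and objective that match the description above (sought distribution on the left, known data on the right) would be:

```latex
% Assumed form of Equation 6: expectation constraint for every feature f_j
\sum_{a \in A,\, b \in B} p(a, b)\, g_j(a, b)
  = \sum_{a \in A,\, b \in B} \tilde{p}(a, b)\, g_j(a, b)
  \qquad (1 \le j \le k)

% Assumed form of Equation 7: entropy to be maximized
H(p) = - \sum_{a \in A,\, b \in B} p(a, b) \log p(a, b)
```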

Non-Patent Document 5: Eric Sven Ristad, Maximum Entropy Modeling for Natural Language, ACL/EACL Tutorial Program, Madrid, 1997.
Non-Patent Document 6: Eric Sven Ristad, Maximum Entropy Modeling Toolkit, Release 1.6 beta, http://www.mnemonic.com/software/memt, 1998.

The support vector machine method classifies data consisting of two classifications by dividing the space with a hyperplane.

FIG. 5 illustrates the concept of margin maximization in the support vector machine method. In FIG. 5, white circles are positive examples, black circles are negative examples, the solid line is the hyperplane that divides the space, and the broken lines are the surfaces that bound the margin region. FIG. 5(A) is a conceptual diagram of the case where the gap between positive and negative examples is narrow (small margin), and FIG. 5(B) of the case where it is wide (large margin).

Assuming the two classifications consist of positive and negative examples, the larger the gap (margin) between positive and negative examples in the training data, the less likely misclassification on open data is considered to be. As shown in FIG. 5(B), the hyperplane that maximizes this margin is therefore found and used for classification.

This is the basic idea, but in practice an extended version is normally used: the method is extended so that a small number of training cases may lie inside the margin region, and so that the linear hyperplane becomes nonlinear (through the introduction of a kernel function).

This extended method is equivalent to classifying with the discriminant function below (Equation 8); the two classifications are distinguished by whether the output of the discriminant function is positive or negative.

Here, x is the context (set of features) of the case to be classified, and $x_i$ and $y_i$ ($i = 1, \ldots, l$; $y_i \in \{1, -1\}$) are the contexts and classifications of the training data. The function sgn is
$$\operatorname{sgn}(x) = \begin{cases} 1 & (x \ge 0) \\ -1 & (\text{otherwise}) \end{cases}$$
and each $\alpha_i$ is the value that maximizes Equation 9 under the constraints of Equations 10 and 11.
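The equation images for Equations 8 to 11 are not present in this text. As a hedged reconstruction, the standard soft-margin kernel SVM formulation that fits the surrounding definitions ($x_i$, $y_i$, $\alpha_i$, K, and the constant C) would be as follows; the bias term $b$ is not defined in the surviving text and is included here as an assumption.

```latex
% Assumed form of Equation 8: discriminant function
f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right)

% Assumed form of Equation 9: dual objective to be maximized
L(\alpha) = \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

% Assumed forms of Equations 10 and 11: constraints
0 \le \alpha_i \le C \quad (i = 1, \ldots, l), \qquad
\sum_{i=1}^{l} \alpha_i y_i = 0
```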

The function K is called a kernel function. Various kernels are used; this embodiment uses the following polynomial kernel.

(Equation 12)
$$K(x, y) = (x \cdot y + 1)^d$$

C and d are constants set experimentally. For example, C was fixed at 1 throughout all processing, and two values of d, 1 and 2, were tried. An $x_i$ for which $\alpha_i > 0$ is called a support vector, and the summation in Equation 8 is normally computed using only these cases. In other words, only the training cases called support vectors are used in the actual analysis.
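The polynomial kernel of Equation 12 and a kernel-based decision value summed over support vectors can be sketched as below. The support vectors, their coefficients, and the bias `b` are made-up numbers (nothing is trained here), and the function names are hypothetical.

```python
def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (x . y + 1) ** d."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d

def svm_decision(x, support_vectors, b=0.0, d=2):
    """Sum alpha_i * y_i * K(x_i, x) over support vectors, plus bias b."""
    return sum(alpha * y * poly_kernel(sv, x, d)
               for sv, y, alpha in support_vectors) + b

def sgn(value):
    """Return 1 for value >= 0 and -1 otherwise."""
    return 1 if value >= 0 else -1

# Hypothetical support vectors: (vector, label y_i, coefficient alpha_i).
support_vectors = [([1, 0], 1, 0.5), ([0, 1], -1, 0.5)]
score = svm_decision([1, 0], support_vectors, b=0.0, d=1)
predicted = sgn(score)
```

The sign gives the class, while the magnitude of `score` indicates how far the case lies from the separating surface.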

Details of the extended support vector machine method are described in Non-Patent Documents 7 and 8 below.

Non-Patent Document 7: Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods, Cambridge University Press, 2000.
Non-Patent Document 8: Taku Kudoh, TinySVM: Support Vector Machines, http://chasen.org/~taku/software/TinySVM/, 2002.

The support vector machine method handles data with two classifications. To handle cases with three or more classifications, it is therefore usually combined with a technique such as the pairwise method or the one-vs-rest method.

For data with n classifications, the pairwise method generates every pair of two different classifications (n(n-1)/2 pairs), determines for each pair which classification is better using a binary classifier, that is, a support vector machine processing module, and finally decides the classification by majority vote over the results of the n(n-1)/2 binary classifications.

In the one-vs-rest method, when there are, for example, three classifications a, b, and c, three sets are generated: a versus the rest, b versus the rest, and c versus the rest, and a support vector machine is trained on each set. In the estimation process, the learning results of these three support vector machines are used. The candidate to be classified is evaluated by each of the three support vector machines, and the solution is the classification, other than "the rest", for which the candidate lies farthest from the separating plane. For example, if a candidate lies farthest from the separating plane in the support vector machine trained on the set "a versus the rest", its classification is estimated to be a.
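Both multi-class schemes can be sketched on top of an arbitrary binary classifier. In this sketch `binary_winner(a, b, x)` and `margin(c, x)` are hypothetical callables standing in for trained support vector machines, and taking the largest signed margin is a simplification of the one-vs-rest rule described above.

```python
from collections import Counter
from itertools import combinations

def pairwise_predict(x, classes, binary_winner):
    """Majority vote over all n(n-1)/2 pairwise binary decisions."""
    votes = Counter(binary_winner(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

def one_vs_rest_predict(x, classes, margin):
    """Pick the class whose 'class vs rest' margin is largest."""
    return max(classes, key=lambda c: margin(c, x))

# Toy stand-ins for trained classifiers (fixed scores, input ignored).
toy_scores = {"a": 0.2, "b": 1.5, "c": -0.3}
margin = lambda c, x: toy_scores[c]
binary_winner = lambda a, b, x: a if toy_scores[a] >= toy_scores[b] else b

classes = ["a", "b", "c"]
pairwise_label = pairwise_predict(None, classes, binary_winner)
ovr_label = one_vs_rest_predict(None, classes, margin)
```

With three classes the pairwise scheme evaluates 3 binary classifiers per input, whereas one-vs-rest also evaluates 3; the difference grows with n, since pairwise needs n(n-1)/2.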

How the machine learning determination unit (102) obtains the degree to which a given solution (classification) is likely, with respect to whether the data is reputation information, depends on which method the machine learning unit (142) uses as its machine learning method.

For example, in an embodiment of the present invention, when the machine learning unit (142) uses the k-nearest neighbor method, it defines the similarity between cases in the teacher data based on the proportion of features shared between the feature sets extracted from those cases (the proportion of identical features they have), and stores the defined similarity and the cases in the machine learning result data (132) as learning result information.

The machine learning determination unit (102) then refers, for the data processed by the feature extraction unit (101), to the features and to the probability of being reputation information defined in the machine learning result data (132). It selects k features from the cases in the machine learning result data (132) in descending order of the likelihood that the data is reputation information, and estimates as the solution the classification (reputation information or not) decided by majority vote among the selected k features. That is, the machine learning determination unit (102) takes, as the degree to which each extracted data item is likely to fall under a given solution (classification), the number of votes in the majority vote among the selected k features, here the number of votes obtained by the classification "is reputation information".

When the simple Bayes method is used as the machine learning method, the machine learning unit (142) stores, for each case in the teacher data, the pair of the case's solution and its feature set in the machine learning result data (132) as learning result information. When the data collection unit (100) has extracted data, the machine learning determination unit (102) uses these pairs of solution and feature set to calculate, based on Bayes' theorem, the probability of each classification (reputation information or not) for the feature set obtained by the feature extraction unit (101), and estimates the classification with the highest probability as the classification (solution) for that data. That is, the degree to which the feature set of the extracted data is likely to yield a given solution is the probability of each classification, here the probability of the classification "is reputation information".

When the decision list method is used as the machine learning method, the machine learning unit (142) stores in the machine learning result data (132) a list in which rules pairing a feature with a classification, derived from the cases in the teacher data, are arranged in a predetermined priority order. When the data collection unit (100) has extracted data, the machine learning determination unit (102) compares the features of the extracted expression-pair candidate with the feature of each rule, in descending order of priority in the list, and estimates the classification of the rule whose feature matches as the classification (solution) of that candidate. That is, the degree to which the feature set of the extracted data is likely to yield a given solution is the predetermined priority or a corresponding numerical value or measure, here the priority in the list of the probability of the classification "is reputation information".

When the maximum entropy method is used as the machine learning method, the machine learning unit (142) identifies the possible classifications from the cases in the teacher data, obtains the probability distribution over pairs of a feature set and a possible classification that maximizes the entropy expression while satisfying the prescribed condition, and stores it in the machine learning result data (132). When the data collection unit (100) has extracted data, the machine learning determination unit (102) uses this probability distribution to obtain the probability of each possible classification for the feature set of the extracted data, identifies the classification with the highest probability, and estimates that classification as the solution for the candidate. That is, the degree to which the feature set of the extracted data is likely to yield a given solution is the probability of each classification, here the probability of the classification "is reputation information".

When the support vector machine method is used as the machine learning method, the machine learning unit (142) identifies the possible classifications from the cases in the teacher data (140) and divides them into positive and negative examples. Then, in a space whose dimensions are the features of the cases, and following a prescribed function that uses a kernel function, it finds the hyperplane that separates the positive and negative examples while maximizing the gap between them, and stores it in the machine learning result data (132). When the data collection unit (100) has extracted data, the machine learning determination unit (102) uses this hyperplane to identify whether the feature set of the extracted data lies on the positive side or the negative side of the divided space, and estimates the classification determined by that result as the solution for the candidate. That is, the degree to which the feature set of the extracted data is likely to yield a given solution is the distance from the separating plane into the space of positive examples (data that is reputation information). More specifically, when data that is reputation information is treated as positive examples and data that is not as negative examples, data located on the positive side of the separating plane is judged to be "data that is reputation information", and its distance from the separating plane is taken as the degree to which the data is reputation information.

Furthermore, in the present invention, known methods based on neural networks or on multiple regression analysis can also be used as the machine learning method.
For example, if there are two classifications to be distinguished, multiple regression analysis can be used. How to carry out multiple regression analysis on a computer is described in detail in Non-Patent Document 9.

"Excelで学ぶ時系列分析と予測" (Time Series Analysis and Forecasting with Excel), Chapter 3, Ohmsha

In multiple regression analysis, one explanatory variable x is prepared for each feature, and the presence or absence of the feature is expressed by setting that variable to 1 or 0. The objective (dependent) variable is set to 1 for one class and 0 for the other.
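The encoding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the vocabulary and function names are hypothetical, and the least-squares coefficients are fitted by plain gradient descent only to keep the example dependency-free.

```python
def encode(doc_features, vocabulary):
    """One explanatory variable per feature: 1.0 if present, 0.0 if absent."""
    return [1.0 if f in doc_features else 0.0 for f in vocabulary]


def predict(coef, x):
    """Linear prediction; the last coefficient is the intercept."""
    return sum(c * v for c, v in zip(coef, x)) + coef[-1]


def fit_least_squares(rows, targets, steps=2000, rate=0.1):
    """Fit a multiple linear regression by gradient descent on squared error."""
    n = len(rows[0])
    coef = [0.0] * (n + 1)
    for _ in range(steps):
        grad = [0.0] * (n + 1)
        for x, y in zip(rows, targets):
            err = predict(coef, x) - y
            for j in range(n):
                grad[j] += err * x[j]
            grad[n] += err
        coef = [c - rate * g / len(rows) for c, g in zip(coef, grad)]
    return coef


# Hypothetical training data: target 1.0 = reputation information, 0.0 = not.
vocabulary = ["defect", "rumor", "official"]
samples = [{"defect", "rumor"}, {"rumor"}, {"official"}, {"official", "defect"}]
rows = [encode(s, vocabulary) for s in samples]
coef = fit_least_squares(rows, [1.0, 1.0, 0.0, 0.0])
```

A predicted value close to 1 then suggests the first class and a value close to 0 the other.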

As described above, in the present invention the machine learning module (FIG. 3), which may implement any known machine learning method, generates the machine learning result data (153), and the machine learning determination unit (102) then accurately determines whether or not the data is reputation information. (Machine learning determination step: S3)
Depending on the machine learning method, the determination may be output as either "is reputation information" or "is not reputation information", or as a "probability of being reputation information". Results may be output in descending order of that probability, together with the probability itself. They may also be output with formatting that indicates the probability, for example a character color, a character size, or a mark indicating the probability.

The reputation information output unit (103) may output either "is reputation information" or "is not reputation information" as the reputation information (reputation information output step: S4), or it may output the probability of being reputation information as it is.
Furthermore, using a threshold set by the user or defined in advance, it may output "is reputation information" when the probability exceeds the threshold.
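The output options described above (descending order of probability, and a threshold-based verdict) might look like the following sketch. The function name, the tab-separated layout, and the default threshold of 0.5 are assumptions made for illustration only.

```python
def format_results(scored_items, threshold=0.5):
    """Sort (name, probability) pairs by probability and attach a verdict.

    An item is reported as reputation information only when its probability
    exceeds the threshold.
    """
    ranked = sorted(scored_items, key=lambda item: item[1], reverse=True)
    lines = []
    for name, probability in ranked:
        if probability > threshold:
            verdict = "is reputation information"
        else:
            verdict = "is not reputation information"
        lines.append(f"{probability:.2f}\t{verdict}\t{name}")
    return lines
```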

As the output method, besides display on the monitor (12), the result may be transmitted from the network adapter (14) to another terminal device, or stored on the hard disk (13) as a reputation information extraction database.
In doing so, at least one of the following is output together with the result: at least part of the collected data, the name or network address of the server device on which it is published, or the file information of the collected data. In particular, the words that were the basis for the determination as reputation information may be output.

In the above, the name of the target is used as one of the features. Such a name need not be specified by the user; it can also be defined automatically.
That is, the extraction of reputation information performed by the present invention uses proper nouns such as company names, product names, and personal names, and these can be extracted from data contained in sites specified by the user, or in sites found by entering user-specified keywords into a known search engine.

For this purpose, the CPU (10) of the present invention is provided with a determination target noun extraction unit (not shown), which acquires data from a designated site via the network adapter (14) on the basis of site information stored in advance on the hard disk. Alternatively, it accepts a keyword from the user via the keyboard (11), sends the keyword to a search site on the basis of search site information stored on the hard disk, obtains the site search results from that search site, and acquires data from those sites.
Proper nouns are extracted from the acquired data. Note that the nouns used as features in the present invention may be common nouns rather than proper nouns.
For this extraction, named entities can be extracted automatically from the data by using named entity extraction techniques such as the following, and then used as features.

Examples of general named entity extraction techniques are described below.
(1) Method using machine learning
There are methods that extract named entities by machine learning (see, for example, Non-Patent Document 10 below).

Masayuki Asahara, Yuji Matsumoto, "Utilization of Redundant Morphological Analysis in Japanese Named Entity Extraction", Information Processing Society of Japan, Natural Language Processing Study Group, NL153-7, 2002

First, for example, the sentence 「日本の首相は小泉さんです。」 ("The prime minister of Japan is Mr. Koizumi.") is split into individual characters, and a correct-answer tag such as B-LOCATION or I-LOCATION is assigned to each character as shown below. The first column is the character and the second column is its correct-answer tag.

日 B-LOCATION
本 I-LOCATION
の O
首 O
相 O
は O
小 B-PERSON
泉 I-PERSON
さ O
ん O
で O
す O
。 O

In the above, a tag of the form B-??? marks the beginning of a named entity of the type that follows the hyphen. For example, B-LOCATION marks the beginning of a place name and B-PERSON marks the beginning of a person's name. A tag of the form I-??? marks any position other than the beginning of a named entity of the type that follows the hyphen, and O marks everything else. Thus, for example, the character 「日」 is the beginning of a place name, and the place name extends through the character 「本」.

The correct answer for each character is set in this way, the system learns from such data, and the tags are then estimated for new data. From the estimated tags, the beginning and the extent of each named entity are recognized, and the named entities are thereby estimated.
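The last step, recovering named entities from per-character B-/I-/O tags, can be sketched as below. The function name is illustrative, and the handling of an I- tag that does not continue the current entity (it is treated like O) is an assumption, since the text does not specify that case.

```python
def decode_bio(chars, tags):
    """Group characters into (text, type) entities from B-/I-/O tags."""
    entities = []
    current_text, current_type = "", None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current_type is not None:
                entities.append((current_text, current_type))
            current_text, current_type = ch, tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_text += ch
        else:
            # "O", or an I- tag that does not match the open entity.
            if current_type is not None:
                entities.append((current_text, current_type))
            current_text, current_type = "", None
    if current_type is not None:
        entities.append((current_text, current_type))
    return entities


chars = list("日本の首相は小泉さんです。")
tags = ["B-LOCATION", "I-LOCATION", "O", "O", "O", "O",
        "B-PERSON", "I-PERSON", "O", "O", "O", "O", "O"]
entities = decode_bio(chars, tags)
```

For the example sentence this yields the place name 日本 and the person name 小泉.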

When learning from the correct-answer data set for each character, the system uses various kinds of information in the form of features. For example, for the part
日 B-LOCATION
information such as
日本-B 名詞-B
is used. Here 日本-B means the beginning of the word 日本 ("Japan"), and 名詞-B means the beginning of a noun. Words and parts of speech are identified using, for example, the morphological analysis by Chasen mentioned above. Chasen can also estimate the part of speech of each word, so entering 「学校へ行く」 ("go to school") yields the following result.

学校 ガッコウ 学校 名詞-一般
へ ヘ へ 助詞-格助詞-一般
行く イク 行く 動詞-自立 五段・カ行促音便 基本形
EOS
In this way, the input is split so that each line holds one word, and reading and part-of-speech information is attached to each word.

In Non-Patent Document 10 above, for example, the features used are, for each character of the input sentence, the character itself (for example, the character 「小」), the character type (for example, hiragana or katakana), part-of-speech information, and tag information (for example, "B-PERSON").

Learning is performed using these features. The system examines which features appear at the character whose tag is to be estimated and at the surrounding characters, learns which tags are likely when which features appear, and uses the learning result to estimate tags for new data. A support vector machine, for example, is used for the machine learning.

Besides the above, there are various other methods for named entity extraction. For example, there is a method that extracts named entities using a maximum entropy model and rewriting rules (see Non-Patent Document 11).

Kiyotaka Uchimoto, Qing Ma, Masaki Murata, Hiromi Ozaku, Masao Utiyama, Hitoshi Isahara, "Named Entity Extraction Based on a Maximum Entropy Model and Rewriting Rules", Journal of Natural Language Processing, Vol. 7, No. 2, 2000

Also, for example, Non-Patent Document 12 below describes a method of extracting Japanese named entities using a support vector machine.

Hiroyasu Yamada, Taku Kudo, Yuji Matsumoto, "Japanese Named Entity Extraction Using Support Vector Machine", Journal of Information Processing Society of Japan, Vol. 43, No. 1, 2002

(2) Method using hand-crafted rules
Named entities can also be extracted by rules written by hand.
For example:
noun + 「さん」 ("Mr./Ms.") is a person's name
noun + 「首相」 ("Prime Minister") is a person's name
noun + 「株式会社」 ("Co., Ltd.") is a company name
noun + 「町」 ("town") is a place name
noun + 「市」 ("city") is a place name
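The rules listed above could be written as regular expressions along the following lines. This is only a sketch: a real system would take nouns from a morphological analyzer, whereas here a run of kanji or katakana (`NOUN`) is assumed as a crude stand-in for "noun", and the names `RULES` and `extract_by_rules` are hypothetical.

```python
import re

# Crude stand-in for "noun": one or more kanji or katakana characters.
NOUN = r"[一-龠々ァ-ヶー]+"

# (suffix that follows the noun, entity type assigned to the noun)
RULES = [
    ("さん", "PERSON"),
    ("首相", "PERSON"),
    ("株式会社", "COMPANY"),
    ("町", "LOCATION"),
    ("市", "LOCATION"),
]


def extract_by_rules(text):
    """Return (noun, type) pairs for every noun directly followed by a suffix."""
    found = []
    for suffix, kind in RULES:
        # The lookahead keeps the suffix itself out of the matched noun.
        pattern = NOUN + "(?=" + re.escape(suffix) + ")"
        for match in re.finditer(pattern, text):
            found.append((match.group(0), kind))
    return found


result = extract_by_rules("小泉さんは横浜市の山田商事株式会社に勤める。")
```

Results are grouped by rule rather than by position in the text, which is enough when the entities are only used as features.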

Named entities are extracted by the above methods, and from the extracted entities, person names, company names and the like can be extracted in the feature extraction unit (101).

In addition to the above features, the present invention is characterized by extracting data related to the data collected by the data collection unit (100).
As one example, a technique for evaluating the reliability of data is proposed. As shown in FIG. 6, the CPU (10) of the apparatus (1) is provided with an evaluation data extraction unit (104), an evaluation feature extraction unit (105), and an evaluation machine learning determination unit (106); these determine the reliability of the data, and the result is used as a feature in the machine learning determination unit (102).

FIG. 7 is a processing flowchart including the data reliability evaluation step (S5) according to this embodiment.
In this configuration, an evaluation database (133) is stored on the hard disk (13) or on an arbitrary server on the network. The evaluation database (133) contains a large amount of data, and the evaluation data extraction unit (104) extracts evaluation data that matches the data collected by the data collection unit in at least one of the following: the author of the collected data, the name or network address of the server device on which it is stored, or the file information of the collected data. (Evaluation data extraction step: S50)

That is, the evaluation data extraction unit first searches the collected data for its author. For example, when the author is stated explicitly, as in 「文責：○川○夫」 ("Responsible for the text: ○川○夫"), 「○川○夫」 is extracted. Alternatively, hidden author information contained in the HTML may be extracted.
Then, evaluation data containing 「○川○夫」 is extracted from the evaluation database (133).

Besides the author, data whose storing server device name or network address, or whose file information, matches that of the collected data may also be used as evaluation data. For example, if the collected data was collected from the domain www.nhk.or.jp, evaluation data published from the same domain is extracted.

Then, an evaluation factor table (134) such as that shown in Table 8 is stored on the hard disk (13).

The evaluation factors are not limited to the above. For example, when the evaluation data contains the expression 「××株式会社代表取締役○川○夫」 ("○川○夫, Representative Director, ×× Co., Ltd."), it is extracted that this author's employer is ×× Co., Ltd. and that the author's position is representative director. As for published works, by using a library database as evaluation data, any works by 「○川○夫」 can be detected. (Even if this turns out to be a different person with the same name, it is not a serious problem, because the present invention does not conclude from this alone that the author is definitely reliable.)

The topic of the extracted evaluation data can also be extracted by, for example, the summarization technique described later. In addition, inappropriate words in the evaluation data (such as the emoticons and emotional expressions mentioned above) can be extracted.

The evaluation feature extraction unit (105) extracts such evaluation factors from the evaluation data as features for machine learning. (Evaluation feature extraction step: S51)
Using these features, the evaluation machine learning determination unit (106) determines whether the evaluation data is "reliable" or "unreliable", or calculates a reliability expressed as a numerical value. (Evaluation machine learning determination step: S52)
This determination uses machine learning result data (135) prepared for evaluation, which is itself generated by the machine learning methods described above.

With the above configuration, evaluation data can be extracted for the data collected by the data collection unit (100) and its reliability evaluated. In general, when the information published by a certain author or a certain website is highly reliable, other information published by the same source is also considered to be highly reliable.
Therefore, in the present invention, the reliability determined for the evaluation data is input as a feature to the machine learning determination unit (102), which contributes to a more accurate determination of whether or not the data is reputation information.

In the present invention, instead of, or in addition to, being input as a feature to the machine learning determination unit (102), the reliability can also be output as related information when the reputation information output unit (103) produces its output.

The above method extracts evaluation data and uses its reliability in the determination of reputation information. Taking this a step further, a technique can also be provided that checks whether a similar statement appears in highly reliable ground information.

FIG. 8 shows the configuration for this technique. The CPU (10) of the apparatus (1) determines whether information similar to the data collected by the data collection unit (100) exists in the ground information sources listed in a ground information database (136) stored in advance on the hard disk (13).

As shown in Table 9, the ground information database (136) stores at least one of the name or network address of a server device that can serve as ground information, or the file information of the reliable information.

As shown in FIG. 9, in this embodiment a ground information confirmation step (S6) is executed after the data collection step (S1), and within it the similarity determination unit (107) executes the similarity determination step (S61). In accordance with the file information, domain names, site names, and so on contained in the ground information database (136), the similarity determination unit (107) acquires data from each ground information source and determines whether it contains information on the same topic as the collected data.

Here, similar-document retrieval techniques such as the following can be used for the similarity determination.
First, a method is described in which, after data collection, morphological analysis is applied to the collected data to obtain the word group A that constitutes it, and ground information data containing many of the words in A is extracted.

(1) Basic method (TF-IDF method)
(Equation 13)
score(D) = Σ_{w ∈ W} ( tf(w,D) * log(N/df(w)) )
W is the set of keywords, tf(w,D) is the number of occurrences of w in the data D, df(w) is the number of documents, among all documents, in which w appears, and N is the total number of documents.
Document data with a high score(D) under Equation 13 is output as similar ground information data.
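Equation 13 can be sketched directly in code. In the sketch below, documents are assumed to be already tokenized into word lists, the function name is illustrative, and skipping keywords that occur in no document (df = 0) is an added assumption to avoid division by zero.

```python
import math
from collections import Counter


def tfidf_score(keywords, doc_words, all_docs):
    """score(D) = sum over w in W of tf(w, D) * log(N / df(w))."""
    n_docs = len(all_docs)
    tf = Counter(doc_words)
    score = 0.0
    for w in set(keywords):
        # df(w): number of documents in which w appears at least once.
        df = sum(1 for d in all_docs if w in d)
        if df == 0:
            continue  # assumption: ignore keywords found in no document
        score += tf[w] * math.log(n_docs / df)
    return score


docs = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
scores = [tfidf_score(["a", "b"], d, docs) for d in docs]
```

The document with the largest entry in `scores` would be the one output as similar ground information data.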

(2) Okapi weighting by Robertson et al.
This method is described in Non-Patent Document 13.

Masaki Murata, Qing Ma, Kiyotaka Uchimoto, Hiromi Ozaku, Masao Utiyama, Hitoshi Isahara, "Information Retrieval Using Location Information and Field Information", Journal of Natural Language Processing, Vol. 7, No. 2, pp. 141-160, April 2000

Equation 14, given in Non-Patent Document 13, is known to perform well. The product of the tf term and the idf term, before the summation by Σ is taken, is the Okapi weighting, and this value is used as the weight of a word.

With the Okapi formula:
(Equation 14)
score(D) = Σ_{w ∈ W} ( tf(w,D)/(tf(w,D) + length/delta) * log(N/df(w)) )

Here length is the length of the data D and delta is the average length of the data. The length of the data is measured as, for example, the number of bytes in the data or the number of words it contains.
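Equation 14 differs from the TF-IDF score only in the length-normalized tf term. The following sketch assumes tokenized documents and measures length as the number of words (one of the options mentioned above); the function name is illustrative, and keywords with tf = 0 or df = 0 are skipped as an added assumption.

```python
import math
from collections import Counter


def okapi_score(keywords, doc_words, all_docs):
    """score(D) = sum of tf/(tf + length/delta) * log(N/df) over keywords."""
    n_docs = len(all_docs)
    length = len(doc_words)                          # length of data D in words
    delta = sum(len(d) for d in all_docs) / n_docs   # average data length
    tf = Counter(doc_words)
    score = 0.0
    for w in set(keywords):
        df = sum(1 for d in all_docs if w in d)
        if df == 0 or tf[w] == 0:
            continue  # the term contributes nothing
        score += tf[w] / (tf[w] + length / delta) * math.log(n_docs / df)
    return score


docs = [["a", "b", "a"], ["b", "c"], ["c", "d", "e", "f"]]
score0 = okapi_score(["a", "b"], docs[0], docs)
```

Because tf/(tf + length/delta) stays below 1, repeated occurrences of a word in a long document add less than they would under plain tf·idf.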

Furthermore, the following information retrieval methods can also be used.
(Okapi references)
The Okapi formula and the SMART formula disclosed in Non-Patent Documents 14 and 15 can also be used. As a more advanced retrieval method, these Okapi and SMART formulas may be used instead of a formula that uses only tf·idf.

S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, Okapi at TREC-3, TREC-3, 1994
Amit Singhal, AT&T at TREC-6, TREC-6, 1997

These methods use not only tf·idf but also, for example, the length of the article, which enables more accurate information retrieval.

In the present method of extracting articles that contain more of word group A, Rocchio's formula (Non-Patent Document 16) can additionally be used.

J. J. Rocchio, Relevance feedback in information retrieval, The SMART Retrieval System, edited by G. Salton, Prentice Hall, Inc., pp. 313-323, 1971

In this method, instead of log(N/df(w)), the following is used:
(Equation 15)
{E(t) + k_af * (RatioC(t) - RatioD(t))} * log(N/df(w))

E(t) = 1 (for a keyword in the original query)
     = 0 (otherwise)
RatioC(t) is the rate of occurrence of t in data group B
RatioD(t) is the rate of occurrence of t in article group C
score(D) is computed with log(N/df(w)) replaced by the above expression, and articles with larger values are retrieved as articles that contain more of word group A.

The set W of words w summed over in the Σ of score(D) consists of both the original keywords and word group A, with the original keywords and word group A kept disjoint.

Alternatively, the set W of words w summed over in the Σ of score(D) may consist of word group A only, again with the original keywords and word group A kept disjoint.

A relatively complex method based on Rocchio's formula has been described here, but more simply, articles with a larger total number of occurrences of the words in word group A may be retrieved as articles containing more of word group A, or articles in which a larger number of distinct words from word group A appear may be retrieved as such.

By the above method, articles containing word group A can be retrieved and extracted as ground information data.
The method of extracting articles containing word group A described above is one of the processes that can be used in the ground information confirmation step (S6) of the present invention.

Next, as another technique for the ground information confirmation step (S6), a method of extracting articles similar to the collected data group B is described.
A similarity between articles is defined. tf·idf, Okapi, or SMART may be used for this similarity: the article D and the query in those formulas are taken to be the two articles x and y being compared, and w ranges over the words contained in both x and y.

Alternatively, vectors may be created with each word as a dimension and each word's score as the element: the words contained in article x give the vector vector_x, the words contained in article y give the vector vector_y, and the cosine of these vectors, cos(vector_x, vector_y), is used as the similarity of the articles. tf·idf, Okapi, or SMART may be used to calculate the score of each word.

The expression following the Σ in each of those formulas is the expression for the score, and its value is the score of each word.

For tf·idf this expression is
tf(w,D) * log(N/df(w))
and for Okapi it is
tf(w,D)/(tf(w,D) + length/delta) * log(N/df(w))
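The cosine-based article similarity described above can be sketched as follows, using the tf·idf expression as the per-word score. Vectors are represented as word-to-score dictionaries; the function names are illustrative, and returning 0.0 for an all-zero vector is an added assumption.

```python
import math
from collections import Counter


def tfidf_vector(words, all_docs):
    """Map each word of an article to its score tf * log(N / df)."""
    n_docs = len(all_docs)
    vec = {}
    for w, tf in Counter(words).items():
        df = sum(1 for d in all_docs if w in d)
        if df:
            vec[w] = tf * math.log(n_docs / df)
    return vec


def cosine(vec_x, vec_y):
    """Cosine of the angle between two sparse word-score vectors."""
    dot = sum(v * vec_y.get(w, 0.0) for w, v in vec_x.items())
    norm_x = math.sqrt(sum(v * v for v in vec_x.values()))
    norm_y = math.sqrt(sum(v * v for v in vec_y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_x * norm_y)


docs = [["a", "b"], ["b", "c"], ["c"]]
similarity = cosine(tfidf_vector(docs[0], docs), tfidf_vector(docs[1], docs))
```

Swapping `tfidf_vector` for an Okapi-weighted variant changes only how each element is scored; the cosine step stays the same.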

Also, in extracting articles that contain more of word group A, the cosine cos(vector_x, vector_y) may be computed, and an article with a larger value judged to contain more of word group A. In this case, vector_x is built from the words in word group A and vector_y from the words contained in the article.

The similarity between data group B and ground information data x can be defined by, for example, the following methods.
(1) Use the similarity between x and the item of data group B that is most similar to x.
(2) Use the similarity between x and the article of data group B that is least similar to x.
(3) Use the average of the similarities between x and all the articles of data group B.
Other methods may also be used. The similarity between data group B and ground information data x is obtained in this way, and items with high similarity can be retrieved as similar articles.
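The three ways of aggregating per-article similarities listed above correspond to a maximum, a minimum, and a mean. A minimal sketch, assuming the similarities between x and each member of data group B have already been computed (the function and parameter names are hypothetical):

```python
def group_similarity(similarities, method="max"):
    """Combine the similarities between x and each item of data group B.

    method: "max"  -> most similar item     (method 1)
            "min"  -> least similar item    (method 2)
            "mean" -> average over all items (method 3)
    """
    if not similarities:
        raise ValueError("data group B must contain at least one item")
    if method == "max":
        return max(similarities)
    if method == "min":
        return min(similarities)
    if method == "mean":
        return sum(similarities) / len(similarities)
    raise ValueError(f"unknown method: {method}")
```

The maximum is the most permissive choice and the minimum the strictest, since the latter requires x to resemble every item in the group.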

As another method, words that appear disproportionately in data group B may be extracted by the method described earlier and used as well to compute score(D) based on Rocchio's formula, and data with a large score(D) retrieved as similar data.

Similar articles can be retrieved by this method, and in the present invention they may be extracted as ground information data.
Furthermore, both the method of extracting articles containing word group A described above and the method of retrieving similar articles described here may be executed, with articles extracted by each.

Using similar-document extraction techniques such as the above, the similarity determination unit (107) of the present invention determines the similarity and inputs whether or not a similar document exists as a feature to the machine learning determination unit (102).
When the similarity is calculated as a numerical value, the collected data is regarded as having a basis if the ground information data contains a document whose similarity is greater than p times that similarity (p < 1).

Here, the method of extracting articles containing word group A requires normalization. For normalization, a document consisting of exactly the same keyword group as the input keyword group is assumed to exist, and its score is calculated. Dividing the similarity by that score gives the normalized similarity. Any other normalization method may be used.
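The normalization described above can be sketched with the tf·idf score as the underlying similarity. Two points are assumptions of this sketch rather than statements from the text: the hypothetical "ideal" document is not counted in N or df, and keywords that occur in no document are skipped. Function names are illustrative.

```python
import math
from collections import Counter


def tfidf_score(keywords, doc_words, all_docs):
    """score(D) = sum over keywords of tf(w, D) * log(N / df(w))."""
    n_docs = len(all_docs)
    tf = Counter(doc_words)
    score = 0.0
    for w in set(keywords):
        df = sum(1 for d in all_docs if w in d)
        if df:
            score += tf[w] * math.log(n_docs / df)
    return score


def normalized_score(keywords, doc_words, all_docs):
    """Divide a document's score by the score of a document made up of
    exactly the input keyword group."""
    raw = tfidf_score(keywords, doc_words, all_docs)
    ideal = tfidf_score(keywords, list(keywords), all_docs)
    return raw / ideal if ideal else 0.0


docs = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
ratio = normalized_score(["a", "b"], docs[1], docs)
```

Note that under this scheme a document repeating a keyword several times can score above 1, so a threshold such as p would be applied to a value that is not bounded by 1.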

Similarity determination can also be performed using the paraphrasing technique disclosed in Non-Patent Document 17. In this method, a large number of transformation rules for paraphrasing synonymous expressions are first prepared and used to paraphrase the text. Paraphrases that increase the similarity are applied repeatedly, and the similarity is obtained between the sentences at the point where it is highest. The similarity can be calculated as a score by, for example, counting how many of the keywords contained in one text, and of the patterns appearing around them, are also found in the other text; however, the similarity is not limited to this and can be calculated by any well-known method.
Both of the sentences whose similarity is being determined may be paraphrased.

Computing the similarity after paraphrasing in this way brings the sentences into a more comparable form, so the similarity between them can be calculated more accurately.

Masaki Murata and Hitoshi Isahara, "Unified Model of Paraphrasing: Use of Scale-Based Transformation", Natural Language Processing, Vol. 11, No. 5, pp. 113-133, The Association for Natural Language Processing, October 2004

(Method of using machine learning to extract related information)
In the above, related information is extracted by the related information extraction unit (110) of the CPU (10), which, for collected data determined to be reputation information, extracts related information based on its author, its content, the name or network address of the server device where it is stored, file information, and the like.
In the present invention, a machine learning model can additionally be applied to the extraction itself, in order to judge whether an item is valid as related information.

The CPU (10) is provided with a related information similarity calculation unit (not shown) that implements the article similarity calculation method described above, and a related information evaluation unit (not shown) that evaluates validity as related information using a machine learning model.
The machine learning module included in the related information evaluation unit performs machine learning in advance as follows, and stores the result on the hard disk as machine learning result data.

For machine learning, a large amount of reputation information data collected by the data collection unit (100) and related information data collected by the related information extraction unit (110) is prepared and used as the training input data fed to the machine learning module.
Whether each item of related information data is really valid as related information is judged manually, and the result is used as the training output data.

At the same time, the related information similarity calculation unit calculates the related information similarity between the training reputation information data and the training related information data. The similarity is calculated as described above, and may be either a binary result (similar or not similar) or a numerical value indicating the degree of similarity.
The features used in machine learning are the word strings constituting the reputation information data of the input data, the word strings constituting the related information data, and the related information similarity.

With such a related information evaluation unit, the related information data extracted by the related information extraction unit (110) and the corresponding reputation information data are input, and it is determined whether the related information data is valid. Depending on the determination result, the data is output as related information from the reputation/related information output unit (103).
Any of the various machine learning methods described above can be used in the machine learning module. The determination result is output either as "valid" or "not valid", or as a probability of being valid. In the former case, the related information data judged "valid" is output; in the latter case, the data is output when the probability exceeds a predetermined threshold.
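As a rough sketch of the two steps in this passage, the feature set can be assembled from the two word strings plus the similarity, and the output decision can handle both kinds of determination result. The prefixes `rep:`, `rel:` and `sim:`, the rounding of the similarity, and the default threshold of 0.5 are illustrative assumptions; the classifier itself is left out.

```python
def build_features(reputation_words, related_words, similarity):
    """Combine both word strings and the related information similarity into one feature set."""
    features = {f"rep:{w}" for w in reputation_words}
    features |= {f"rel:{w}" for w in related_words}
    # Assumption: a numerical similarity is bucketed to one decimal place.
    features.add(f"sim:{round(similarity, 1)}")
    return features


def should_output(judgment, threshold=0.5):
    """Accept either a valid / not valid flag or a probability of being valid."""
    if isinstance(judgment, bool):
        return judgment
    return judgment > threshold
```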

(Method of extracting related information data based on similarity)
The related information extraction unit (110) of the present invention can also perform extraction as follows.
That is, the CPU (10) is provided with a related information similarity calculation unit (not shown), which performs the article similarity calculation described above.
It calculates the similarity between the reputation information data, collected by the data collection unit (100) and determined to be reputation information, and each article extracted from the network or from the related information DB on the hard disk.

Based on the similarity calculated by the related information similarity calculation unit, articles whose similarity exceeds a predetermined threshold, for example, can be extracted as related information.

(Method of extracting by similarity after paraphrasing)
In the present invention, the CPU (10) can further be provided with a word string replacement unit (not shown) that implements the paraphrasing technique described above, together with the related information similarity calculation unit. The word string replacement unit successively replaces word strings contained in an article extracted from the related information DB or the like, while the related information similarity calculation unit calculates the similarity with the reputation information data. Word strings are replaced so that the similarity always increases, and when the highest similarity reached exceeds a predetermined threshold, the article can be extracted as related information data.
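The replace-and-recompute loop can be sketched as a greedy search. This is only an illustration under stated assumptions: Jaccard overlap of word sets stands in for the similarity measure, paraphrase rules are simple word-for-word pairs, and all names are hypothetical.

```python
def jaccard(a, b):
    """Assumed similarity: overlap of the two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def best_paraphrase_similarity(article_words, reputation_words, rules):
    """Apply (source, target) replacements only while they raise the similarity."""
    words = list(article_words)
    best = jaccard(words, reputation_words)
    improved = True
    while improved:
        improved = False
        for src, dst in rules:
            candidate = [dst if w == src else w for w in words]
            score = jaccard(candidate, reputation_words)
            if score > best:
                words, best, improved = candidate, score, True
    return best, words


def extract_related(articles, reputation_words, rules, threshold):
    """Keep articles whose best similarity after paraphrasing exceeds the threshold."""
    return [a for a in articles
            if best_paraphrase_similarity(a, reputation_words, rules)[0] > threshold]
```

Because a replacement is accepted only when the score strictly increases, the loop always terminates.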

In the present invention, the following techniques can also be provided for the output processing in the reputation information output unit (103).
One of them is to use clustering so that the multiple items determined to be reputation information by the machine learning determination unit (102) are output with related items grouped together.

(A) Description of clustering
There are various clustering methods. Common ones are described below.

(Description of hierarchical clustering (bottom-up clustering))
The closest members are joined together to form clusters. Clusters are likewise joined with other clusters (and clusters with members), always joining the closest pair.
There are various definitions of the distance between clusters, which are described below.

・The distance between cluster A and cluster B is the smallest of the distances between a member of cluster A and a member of cluster B.
・The distance between cluster A and cluster B is the largest of the distances between a member of cluster A and a member of cluster B.
・The distance between cluster A and cluster B is the average of the distances between all members of cluster A and all members of cluster B.
・The position of cluster A is the mean of the positions of all its members, the position of cluster B is the mean of the positions of all its members, and the distance between cluster A and cluster B is the distance between those two positions.
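The four definitions of inter-cluster distance listed above can be written compactly as follows. This is a minimal sketch that assumes members are coordinate tuples and that Euclidean distance is used; the function names are illustrative.

```python
import math


def single_link(a, b):
    """Smallest member-to-member distance."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Largest member-to-member distance."""
    return max(math.dist(p, q) for p in a for q in b)


def average_link(a, b):
    """Average over all member pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid(cluster):
    """Mean position of the members of a cluster."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))


def centroid_link(a, b):
    """Distance between the mean positions of the two clusters."""
    return math.dist(centroid(a), centroid(b))
```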

・There is also a method called the Ward method, which is described below.
(Equation 16)
W = ΣΣ (x(i,j) - ave_x(i)) ^ 2
where ^ denotes exponentiation.

The first Σ is the sum from i = 1 to i = g.
The second Σ is the sum from j = 1 to j = ni.
x(i,j) is the position of the j-th member of the i-th cluster.
ave_x(i) is the mean of the positions of all members of the i-th cluster.
Joining clusters increases the value of W; in the Ward method, clusters are joined so that the value of W grows as little as possible.
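A small sketch of Equation 16 and of the Ward merge rule follows. It assumes members are coordinate tuples, interprets the squared term as squared Euclidean distance, and represents each cluster as a list; the names are illustrative.

```python
def within_ss(cluster):
    """Inner sum of Equation 16 for one cluster: squared distances to the cluster mean."""
    mean = [sum(x) / len(cluster) for x in zip(*cluster)]
    return sum(sum((v - m) ** 2 for v, m in zip(p, mean)) for p in cluster)


def ward_w(clusters):
    """W of Equation 16: the sum over all g clusters."""
    return sum(within_ss(c) for c in clusters)


def ward_best_merge(clusters):
    """Indices (i, j) of the pair of clusters whose merge increases W the least."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            increase = (within_ss(clusters[i] + clusters[j])
                        - within_ss(clusters[i]) - within_ss(clusters[j]))
            if best is None or increase < best[0]:
                best = (increase, i, j)
    return best[1], best[2]
```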

The position of a member is determined as follows. Words are extracted from the article, each distinct word is taken as a dimension of a vector, and the value of the vector element for each word is set to the frequency of the word, the tf・idf of the word (that is, tf(w,D) * log(N/df(w))), or the Okapi formula for the word (that is, tf(w,D)/(tf(w,D)+length/delta) * log(N/df(w))). The resulting vector is used as the position of the member.
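The three weighting choices can be sketched as below. The symbols `length` and `delta` are not defined in this passage; the sketch assumes that `length` is the number of words in document D and `delta` is the average document length, which is a common reading of Okapi-style weighting but is an assumption here. The function name and the sparse dictionary representation are also illustrative.

```python
import math
from collections import Counter


def member_vector(doc_words, all_docs, scheme="tfidf"):
    """Sparse position vector for one article; doc_words must be one of all_docs."""
    n_docs = len(all_docs)                           # N
    delta = sum(len(d) for d in all_docs) / n_docs   # assumed: average document length
    length = len(doc_words)                          # assumed: length of document D
    vector = {}
    for word, tf in Counter(doc_words).items():
        df = sum(1 for d in all_docs if word in d)   # number of documents containing the word
        idf = math.log(n_docs / df)
        if scheme == "tf":
            vector[word] = float(tf)
        elif scheme == "tfidf":
            vector[word] = tf * idf
        elif scheme == "okapi":
            vector[word] = tf / (tf + length / delta) * idf
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return vector
```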

(Description of top-down clustering (non-hierarchical clustering))
Methods of top-down clustering (non-hierarchical clustering) are described below.

(Description of the maximum distance algorithm)
Take one member. Next, take the member farthest from it. These two members become the centers of their respective clusters. For each member, take the minimum of its distances to the cluster centers as that member's distance, and make the member with the largest such distance the center of a new cluster. Repeat this. Stop when a predetermined number of clusters has been reached. Alternatively, stop when the distance between clusters falls to or below a predetermined value. Another option is to evaluate the quality of the clusters with a measure such as the Akaike information criterion (AIC) and use that value to decide when to stop. Each member belongs to the cluster whose center is nearest.
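The procedure above can be sketched as follows, using only the first stopping rule (a predetermined number of clusters k). The sketch assumes coordinate tuples, Euclidean distance, k no larger than the number of points, and the first point as the initial member; these choices and the function name are illustrative.

```python
import math


def max_distance_clustering(points, k):
    """Pick k centers by repeatedly taking the member farthest from all current centers."""
    centers = [points[0]]  # "take one member": here simply the first one
    while len(centers) < k:
        # Each member's distance is its minimum distance to any current center.
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    # Each member belongs to the cluster whose center is nearest.
    labels = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
              for p in points]
    return centers, labels
```

With a single center, the member with the largest minimum distance is simply the member farthest from that center, so the second step of the description is covered by the same loop.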

(Description of the K-means method)
Consider clustering into a predetermined number k of clusters. Select k members at random and use them as the cluster centers. Each member is assigned to the nearest cluster center. The mean of the members in each cluster then becomes the new center of that cluster. These two steps are repeated. The repetition stops when the cluster centers no longer move, or after a predetermined number of iterations. The clusters are obtained using the final cluster centers: each member belongs to the cluster whose center is nearest.
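A compact sketch of the K-means loop described above is given below. It assumes coordinate tuples and Euclidean distance; the optional `init` argument (explicit starting centers) and the fixed `seed` are additions for reproducibility and are not part of the description.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Alternate assignment and mean update until the centers stop moving."""
    centers = list(init) if init is not None else random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, [nearest(p, centers) for p in points]
```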

(Clustering by word groups)
As a document classification method similar to clustering, one or more word groups can be defined in advance for each category, and input information can be assigned to a category depending on whether it contains that category's word groups. In the present invention this document classification method is also regarded as clustering.
When the input information contains word groups of several categories, it may be assigned to the category for which more word groups are contained, or a weight may be given to each word group and the information assigned to the category of the word group with the largest weight.
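One possible reading of this rule is sketched below. It assumes that a word group is "contained" when all of its words occur in the input, that the count-based rule compares the number of contained groups per category, and that the weight-based rule compares the heaviest contained group per category. The data layout and names are illustrative.

```python
def classify_by_word_groups(words, groups, use_weights=False):
    """groups maps a category to a list of (word_set, weight) pairs."""
    present = set(words)
    scores = {}
    for category, word_groups in groups.items():
        matched = [weight for group, weight in word_groups if group <= present]
        if matched:
            scores[category] = max(matched) if use_weights else len(matched)
    return max(scores, key=scores.get) if scores else None
```

The function returns None when no word group of any category is contained in the input.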

Clustering is performed in this way. Various clustering methods other than these are also known, and they may be used instead.
Reputation information often yields many similar items, and if these items are output in random order it is hard to tell which reputation information is really a problem.
In the present invention, clustering in the reputation information output unit (103) allows similar reputation information to be displayed together on the monitor (12) or the like, so that, for example, cases where erroneous information is concentrated can be grasped quickly and accurately.

In the present invention, the display mode used when the reputation information output unit (103) produces output can also be changed as follows.
That is, the reputation information output unit (103) counts, on a daily, weekly, or monthly basis, the number of data items determined to be reputation information, and creates determination count data. For example, weekly announcement data such as that shown in FIG. 10 is created.

In the weekly announcement data shown in FIG. 10, for example, reputation information 1 obtained by the clustering described above has 1 determination in week 3, 5 in week 4, 10 in week 6, and 1 in week 7; reputation information 2 has 5 document announcements in week 1, 3 in week 2, 10 in week 3, and 1 in week 8; and reputation information 3 has 2 determinations in week 4, 4 in week 7, 12 in week 8, 5 in week 9, and 13 in week 10.

The reputation information output unit (103) can also be configured to convert the above periodic announcement data into contour data and use the converted contour data as display data. As shown in FIG. 11, the number of announcements can be represented by contour lines, with darker colors for greater heights.

The monitor (12) displays the display data created by the reputation information output unit (103) on screen. For example, as shown in FIG. 11, the monitor (12) displays a screen on which the number of document announcements for each item of reputation information in each week is shown as contour lines. The display color of the contour lines differs according to the number of announcements. For example, contour lines corresponding to 8 to 10 announcements are displayed in the darkest color.

For the display order in FIG. 11, the weekly mean, mode, and median are obtained for the number of document announcements, and the items are displayed in ascending order of the mean. Arranging them in this way displays first the categories whose announcements are concentrated at an early time, so that it is possible to recognize visually how the reputation information and related information came to be announced.
The items may be ordered by any of the mean, the mode, or the median, and any calculation using them may be used.
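As a worked example of this ordering, the sketch below uses the weekly counts given for FIG. 10 and sorts the three items by mean week. It assumes that the "weekly mean" is the mean week number weighted by the number of announcements in that week, which is one reading of the passage; the variable and function names are illustrative.

```python
# Weekly counts per item, as described for FIG. 10: {week number: count}.
weekly_counts = {
    "reputation information 1": {3: 1, 4: 5, 6: 10, 7: 1},
    "reputation information 2": {1: 5, 2: 3, 3: 10, 8: 1},
    "reputation information 3": {4: 2, 7: 4, 8: 12, 9: 5, 10: 13},
}


def mean_week(counts):
    """Mean week number, weighted by the number of announcements in each week."""
    total = sum(counts.values())
    return sum(week * n for week, n in counts.items()) / total


# Items whose announcements are concentrated earlier come first.
display_order = sorted(weekly_counts, key=lambda name: mean_week(weekly_counts[name]))
```

Under this assumption the mean weeks are 90/17, 49/19 and 307/36, so reputation information 2 is listed first and reputation information 3 last.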

For the contour graph display, a display using multiple line graphs, or a display using one line graph for each category, may also be used.

The monitor (12) can also be configured, for example as shown in FIG. 12, to display the number of data determinations for each item of reputation information in each week as a bubble chart.
A bubble chart generally refers to a diagram in which circles representing events are placed on a chart with two axes. In the bubble chart shown in FIG. 12, the size of each circle indicates the number of determinations.

When reputation information is output in the present invention, it can also be summarized before output. If reputation information is output as long text, the user needs time to grasp its content, which is unsuitable for checking a large amount of reputation information.
The present invention can therefore present the output reputation information in an easily understood form by means of the following summarization processing.

First, various known techniques exist for summarization; for example, the summarization techniques disclosed in Patent Document 2 and Patent Document 3 by the present inventors can be used.
According to the method of Patent Document 2, the summarization apparatus comprises: solution data storage means for storing solution data consisting of pairs of a problem, which is a text and its summary result, and a solution, which is one of a plurality of classes indicating an evaluation of the summary result; and solution-feature pair extraction means that extracts, as features, predetermined information from the text and summary result forming the problem of the solution data, including for example information indicating the smoothness of the sentences of the summary result and information indicating whether the summary result represents the content of the text, and generates pairs of a solution and a set of features.

It further comprises: machine learning means for storing the pairs of solutions and feature sets in learning result storage means as learning results; feature extraction means for extracting, from input text, a set of features of the same kinds as those extracted by the solution-feature pair extraction means; and evaluation estimation means that, based on the pairs of solutions and feature sets forming the learning results, obtains by Bayes' theorem the probability of each class given the feature set obtained from the text by the feature extraction means, and takes the class with the highest probability as the estimated solution.

The method described in Patent Document 3 is a solution data editing apparatus that edits the solution data used in processing that automatically summarizes text by machine learning. It comprises: summary display processing means for displaying the summary result of a text on a display device; evaluation assignment processing means for accepting input of an evaluation of the summary result and using it as the evaluation of the summary result; and solution data output processing means for outputting solution data in which the text and the summary result form the problem and the evaluation forms the solution.

Japanese Patent No. 3682529
Japanese Unexamined Patent Application Publication No. 2003-248676

Both of the methods above aim to improve the accuracy of machine learning by evaluating known summarization methods or feeding back their results, and thereby contribute to effective summarization.
Of course, the machine learning method used here is not limited to the simple Bayes method; any model may be used, such as the k-nearest neighbor method, the decision list method, the maximum entropy method, the support vector machine method, or the neural network method.

The summarization processing in the present invention is not limited to machine learning methods; any known summarization method can be used.
For example, a score can be obtained for each sentence using its position in the document, the number of title words it contains, and the tf・idf values of the words appearing in it, and the sentences with high scores can be used as the summary result.
Since title words are generally important, sentences containing many title words can also be extracted and used as the summary result.
More simply, the beginning of the document, such as its first sentence, may be used as the summary.
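The score-based extraction mentioned above can be sketched as follows. The passage does not say how the three pieces of information are combined; this sketch simply adds a position score of 1/(position + 1), the count of title words, and the sum of tf・idf values, all of which are assumptions, as are the function and parameter names.

```python
def summarize(title_words, sentences, tfidf, top_n=1):
    """sentences is a list of word lists in document order; tfidf maps a word to its value."""
    scored = []
    for pos, words in enumerate(sentences):
        position_score = 1.0 / (pos + 1)                      # earlier sentences score higher
        title_score = sum(1 for w in words if w in title_words)
        tfidf_score = sum(tfidf.get(w, 0.0) for w in words)
        scored.append((position_score + title_score + tfidf_score, pos))
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:top_n]
    # Return the selected sentences in their original order.
    return [sentences[pos] for _, pos in sorted(best, key=lambda s: s[1])]
```

With no title words and no tf・idf values, only the position score remains, so the function falls back to the first sentence, which matches the simplest method in the passage.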

Furthermore, in the present invention the data collection unit (100) can also collect foreign-language data. In that case, the words that are the target of the reputation, the words indicating the content of the reputation, and the other features are first machine-translated by a translation unit, and the feature extraction unit (101) then extracts the translated features from the collected data.

For the machine translation, a known machine translation method may be used to produce a high-accuracy translation, or a foreign-language dictionary may be stored on the hard disk and words simply translated one by one by looking them up in the dictionary.

The data handled by the present invention is not limited to Japanese; by performing machine learning and all the other processing on an arbitrary foreign language, foreign-language reputation information can be extracted.
The foreign-language reputation information extracted in this way may also be translated into Japanese and output by inputting it into a known machine translation apparatus or machine translation program.

With the configuration described above, the present invention can determine reputation information as follows. An experimental example is shown.
First, the following teacher data is prepared.

[Teacher data 1]

Post: "... the bank goes bankrupt"
Sender: reliable
Text color: black
Background: white
Grounds in patents, papers, or newspapers: none
Not reputation information (correct information)

[Teacher data 2]

Post: "... the bank goes bankrupt"
Sender: unknown whether reliable
Text color: black
Background: white
Grounds in patents, papers, or newspapers: present
Not reputation information (correct information)

[Teacher data 3]

Post: "... the bank goes bankrupt"
Sender: unknown whether reliable
Text color: red
Background: black
Grounds in patents, papers, or newspapers: none
Reputation information (incorrect information)

[Teacher data 4]

Post: "... the bank is making a profit"
Sender: unknown whether reliable
Text color: black
Background: white
Grounds in patents, papers, or newspapers: none
Not reputation information (although it is unknown whether it is correct)

Extracting the features from the above teacher data gives the following.

[Teacher data 1]

Word in the post: "bank"
Word in the post: "goes bankrupt"
Sender: reliable
Text color: black
Background: white
Grounds in patents, papers, or newspapers: none

[Teacher data 2]

Word in the post: "bank"
Word in the post: "goes bankrupt"
Sender: unknown whether reliable
Text color: black
Background: white
Grounds in patents, papers, or newspapers: present

[Teacher data 3]

Word in the post: "bank"
Word in the post: "goes bankrupt"
Sender: unknown whether reliable
Text color: red
Background: black
Grounds in patents, papers, or newspapers: none

[Teacher data 4]

Word in the post: "bank"
Word in the post: "is making a profit"
Sender: unknown whether reliable
Text color: black
Background: white
Grounds in patents, papers, or newspapers: none

From this information, the machine learning unit of the machine learning module (FIG. 3) learns which features indicate reputation information and which do not. As a result, it learns, for example, that when

Sender: reliable
or
Grounds in patents or papers: present

the data is not reputation information, and that when the word "goes bankrupt" is present together with

Text color: red
Background: black

the data is likely to be reputation information.

After such learning results have been stored in the machine learning result data, the machine learning determination unit performs the determination. Suppose the features of the data collected by the data collection unit are as follows.

[Features of collected data 1]

Post: "... the company goes bankrupt"
Sender: unknown whether reliable
Text color: red
Background: black
Grounds in patents, papers, or newspapers: none

In this case, because the word "goes bankrupt" is present together with

Text color: red
Background: black

the learned result that such data is likely to be reputation information applies, so the data is determined to be reputation information and is output.
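The experiment can be reproduced in miniature with the simple Bayes method, which the description names as one usable model. The sketch below is an illustration only: the feature strings, the English word labels, the add-one smoothing, and the function names are assumptions, and only the features present in the input are multiplied.

```python
from collections import Counter, defaultdict

# The four teacher data items as feature sets with their labels.
TRAINING = [
    ({"word:bank", "word:bankrupt", "sender:reliable",
      "text:black", "background:white", "grounds:none"}, "not_reputation"),
    ({"word:bank", "word:bankrupt", "sender:unknown",
      "text:black", "background:white", "grounds:present"}, "not_reputation"),
    ({"word:bank", "word:bankrupt", "sender:unknown",
      "text:red", "background:black", "grounds:none"}, "reputation"),
    ({"word:bank", "word:profit", "sender:unknown",
      "text:black", "background:white", "grounds:none"}, "not_reputation"),
]


def train(examples):
    """Count how often each class and each (class, feature) pair occurs."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    for features, label in examples:
        feature_counts[label].update(features)
    return class_counts, feature_counts


def posterior(features, model):
    """Probability of each class given the features, with add-one smoothing."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        p = n / total
        for f in features:
            p *= (feature_counts[label][f] + 1) / (n + 2)
        scores[label] = p
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}


# Features of collected data 1 from the example above.
collected = {"word:company", "word:bankrupt", "sender:unknown",
             "text:red", "background:black", "grounds:none"}
probabilities = posterior(collected, train(TRAINING))
```

Under these assumptions the collected data receives a higher probability for the reputation class than for the other class, in line with the determination described above.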

An overall configuration diagram of the reputation information extraction device of the present invention.
A processing flowchart of the reputation information extraction method of the present invention.
A configuration diagram of the machine learning module in the present invention.
An explanatory diagram of the concept of text classification in machine learning processing.
An explanatory diagram of the concept of the margin in a support vector machine.
A configuration diagram of the means for evaluating data reliability according to the present invention.
A processing flowchart of the method for evaluating data reliability according to the present invention.
A configuration diagram of the ground information confirmation means according to the present invention.
A processing flowchart of the ground information confirmation method according to the present invention.
A diagram showing output mode (1) on the monitor according to the present invention.
A diagram showing output mode (2) on the monitor according to the present invention.
A diagram showing output mode (3) on the monitor according to the present invention.

Explanation of symbols

1 Reputation information extraction device
10 CPU
11 Keyboard and mouse
12 Monitor
13 Hard disk
14 Network adapter
100 Data collection unit
101 Feature extraction unit
102 Machine learning determination unit
103 Reputation information output unit
131 Feature table
132 Machine learning result data

Claims

A reputation information extraction device using a computer that extracts reputation information published on a network for a predetermined target,
Data collection means for receiving data published from one or more server devices on the network and storing the data as collected data in the collected data storage means;
A feature table storage means for storing a feature table including at least a word or a set of words as a feature;
A feature extraction means for extracting features from the collected data with reference to the feature table;
Machine learning determination means provided with a predetermined machine learning module that, when one or more features are input, determines whether or not the data is reputation information for a predetermined target by referring to machine learning result data stored in learning result storage means, wherein the features extracted by the feature extraction means are input to the machine learning determination means to obtain a determination result as to whether or not the collected data is reputation information,
Reputation information output means for outputting at least one part of collected data determined as reputation information, or the name or network address of a server device to which it is disclosed, or file information of the collected data
In a configuration comprising:
The meaning of a word is classified as a semantic class using a code, and a classification vocabulary table in which each semantic class is assigned to a plurality of words is included in the feature table as a feature,
A reputation information extraction device, wherein the feature extraction means extracts, from the collected data, the semantic class of each word included in the collected data with reference to the feature table.
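
The flow recited in the claim above (feature extraction against a feature table that also carries semantic classes, followed by a learned decision) can be sketched roughly as follows. This is a minimal illustration only: the feature words, the semantic class codes, the weights, and the function names are all hypothetical, and a hand-set linear model stands in for whatever machine learning module and learning result data an actual implementation would use.

```python
# Hypothetical feature table: surface words, plus a classification vocabulary
# table that assigns one semantic-class code to several words.
FEATURE_WORDS = {"defect", "recall", "rumor"}
SEMANTIC_CLASS = {  # word -> semantic class code (made-up codes)
    "broken": "C_DAMAGE",
    "faulty": "C_DAMAGE",
    "apparently": "C_HEARSAY",
    "supposedly": "C_HEARSAY",
}

# Stand-in for machine learning result data: weights and bias of a linear
# model. The values are illustrative assumptions, not learned from data.
WEIGHTS = {"w:defect": 0.8, "w:rumor": 1.0, "c:C_DAMAGE": 0.6, "c:C_HEARSAY": 1.2}
BIAS = -1.5


def extract_features(text):
    """Return the set of word features and semantic-class features in text."""
    features = set()
    for token in text.lower().split():
        token = token.strip(".,!?")
        if token in FEATURE_WORDS:
            features.add("w:" + token)
        if token in SEMANTIC_CLASS:
            features.add("c:" + SEMANTIC_CLASS[token])
    return features


def is_reputation_info(features):
    """Linear decision: a positive score is treated as reputation information."""
    score = BIAS + sum(WEIGHTS.get(f, 0.0) for f in features)
    return score > 0


def extract_reputation(collected):
    """collected: list of (server_address, text) pairs. Return flagged pairs."""
    return [(addr, text) for addr, text in collected
            if is_reputation_info(extract_features(text))]
```

Because "faulty" and "broken" share one semantic class in this sketch, a text using either word yields the same class feature, which is the generalization a classification vocabulary table is meant to provide.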

In the reputation information extraction device,
An information source reliability database is provided that numerically represents, for the name or network address of a server device on the network or for the file information of the collected data, the reliability of the information published there, and the reliability is included in the feature table as a feature,
The reputation information extraction device according to claim 1, wherein the feature extraction means extracts a numerical value of reliability related to the collected data with reference to the feature table.
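
As a rough illustration of treating source reliability as a numerical feature, the lookup could be keyed on the host name of the address from which the data was collected. The table contents, the default value, and the function name below are assumptions made for this sketch.

```python
from urllib.parse import urlsplit

# Hypothetical information source reliability database: host -> reliability.
SOURCE_RELIABILITY = {"news.example": 0.9, "bbs.example": 0.2}
DEFAULT_RELIABILITY = 0.5  # assumed value for sources not in the database


def reliability_feature(url):
    """Return the numerical reliability registered for the URL's host."""
    host = urlsplit(url).hostname or ""
    return SOURCE_RELIABILITY.get(host, DEFAULT_RELIABILITY)
```

The returned value would then be passed to the classifier alongside the word and semantic-class features.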

The reputation information extraction device comprises data reliability value evaluation means,
The data reliability value evaluation means includes:
An evaluation data extraction unit that extracts, from a server device on the network or from an evaluation database accumulated in advance, evaluation data that matches at least one of the collected data, the author of the collected data, the name or network address of the server device on which it is stored, or the file information of the collected data;
An evaluation factor table storage unit that stores at least one of an active factor that is a factor that increases the reliability of the collected data and a negative factor that is a factor that decreases the reliability of the collected data;
An evaluation feature extraction unit that extracts the factor as the feature from the evaluation data;
An evaluation machine learning determination unit provided with a predetermined machine learning module that, when one or more features are input, performs classification according to the reliability of the evaluation data by referring to machine learning result data stored in a learning result storage unit, the data reliability value evaluation means being configured to output, as the reliability of the collected data, the classification result of the evaluation data obtained by inputting the features extracted by the evaluation feature extraction unit to the evaluation machine learning determination unit,
The reputation information extraction device according to claim 1 or 2, wherein the reliability value of the collected data is input to the machine learning determination means together with the features extracted by the feature extraction means to obtain a determination result as to whether or not the collected data is reputation information.

The reputation information extraction device comprises basis information confirmation means,
The basis information confirmation means includes:
A basis information database that defines, as a basis information source, at least one of the name or network address of a server device that publishes reliable information or the file information of the reliable information;
A similarity determination unit that acquires data published from a basis information source defined in the basis information database, and determines whether or not it includes similar data whose topic is similar to that of the collected data,
The reputation information extraction device according to any one of claims 1 to 3, wherein the determination result of the similarity determination unit is input to the machine learning determination means together with the features extracted by the feature extraction means to obtain a determination result as to whether or not the collected data is reputation information.
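
The claim above does not fix how topical similarity is measured. One plausible reading, shown here purely as an assumption, is a bag-of-words cosine similarity between the collected data and each document obtained from a basis information source, with a hypothetical threshold deciding whether similar data exists.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts with surrounding punctuation stripped."""
    return Counter(w.strip(".,!?") for w in text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def has_similar_basis(collected_text, basis_texts, threshold=0.5):
    """True if any basis document is at least `threshold` similar in topic."""
    target = bag_of_words(collected_text)
    return any(cosine(target, bag_of_words(t)) >= threshold
               for t in basis_texts)
```

The boolean result would be supplied to the classifier as one more feature, so that a claim with no counterpart in any reliable source can be weighted differently from one that a reliable source also reports.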

A method for extracting reputation information using a computer that extracts reputation information published on a network for a predetermined target,
A data collection step in which the data collection means of the computer receives data published from one or more server devices on the network and stores them as collected data in the collected data storage means;
A feature table storage means for storing a feature table including at least a word or a set of words as a feature;
A feature extraction step in which a feature extraction unit of the computer refers to the feature table to extract a feature from the collected data;
The machine learning determination means of the computer having the predetermined machine learning module uses the extracted feature and refers to the machine learning result data stored in the learning result storage means to determine whether or not the reputation information is for a predetermined object. A machine learning determination step for determining,
A reputation information output step in which the reputation information output means of the computer outputs at least a part of the collected data determined to be reputation information, or the name or network address of the server device on which it is published, or the file information of the collected data;
In a configuration comprising the above steps:
The meaning of a word is classified as a semantic class using a code, and a classification vocabulary table in which each semantic class is assigned to a plurality of words is included in the feature table as a feature,
A reputation information extraction method, wherein, in the feature extraction step, the feature extraction means extracts, from the collected data, the semantic class of each word included in the collected data with reference to the feature table.

An information source reliability database is provided that numerically represents, for the name or network address of a server device on the network or for the file information of the collected data, the reliability of the information published there, and the reliability is included in the feature table as a feature,
The reputation information extraction method according to claim 5, wherein, in the feature extraction step, the feature extraction means extracts a numerical value of reliability related to the collected data with reference to the feature table.

The method includes a data reliability evaluation step at any point after the data collection step of the reputation information extraction method and before the machine learning determination step,
In the data reliability evaluation step,
An evaluation data extraction processing step in which the evaluation data extraction unit in the data reliability value evaluation means of the computer extracts, from a server device on the network or from an evaluation database stored in advance, evaluation data that matches at least one of the collected data, the author of the collected data, the name or network address of the server device on which it is stored, or the file information of the collected data,
An evaluation factor table storage unit that stores at least one of an active factor that is a factor that increases the reliability of the collected data and a negative factor that is a factor that decreases the reliability of the collected data;
An evaluation feature extraction processing step in which an evaluation feature extraction unit in the data reliability value evaluation means extracts the factor as a feature from the evaluation data;
An evaluation machine learning determination processing step in which the evaluation machine learning determination unit, having a predetermined machine learning module in the data reliability value evaluation means, uses the features extracted in the evaluation feature extraction processing step and refers to machine learning result data stored in the learning result storage unit to perform classification according to the reliability of the evaluation data,
are included, and the classification result of the evaluation data is output as the reliability of the collected data,
The method for extracting reputation information according to claim 5 or 6, wherein, in the machine learning determination step, the reliability value of the collected data is input to the machine learning determination means together with the features extracted by the feature extraction means, and a determination result as to whether or not the collected data is reputation information is obtained.

In the reputation information extraction method,
A clustering means of the computer performs,
for at least one of the reputation information data or the related information data,
a clustering processing step of clustering the author or content included in any of the data according to a predetermined clustering formula,
The reputation information extraction method according to claim 5, wherein, in the output step, at least one of the reputation information data and the related information data is output in the clustered state.
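
The claim leaves the "predetermined clustering formula" open. As one possible sketch under that caveat, extracted texts could be grouped greedily by Jaccard similarity of their word sets, so that near-duplicate postings are output together. The similarity measure, the threshold, and the function names are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_texts(texts, threshold=0.4):
    """Greedy clustering: each text joins the first cluster whose
    representative (first member) is similar enough, else starts a new one.
    Returns a list of clusters, each a list of indices into `texts`."""
    word_sets = [set(t.lower().split()) for t in texts]
    clusters = []
    for i, words in enumerate(word_sets):
        for cluster in clusters:
            if jaccard(words, word_sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Clustering by author could follow the same pattern with the author name as the grouping key instead of the word set.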