JP2016099685A

JP2016099685A - Information reliability determination system, information reliability determination method, and information reliability determination program

Info

Publication number: JP2016099685A
Application number: JP2014234089A
Authority: JP
Inventors: 淳伊藤; Atsushi Ito; 浩之戸田; Hiroyuki Toda; 義昌小池; Yoshimasa Koike
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-11-19
Filing date: 2014-11-19
Publication date: 2016-05-30
Anticipated expiration: 2034-11-19
Also published as: JP6321529B2

Abstract

PROBLEM TO BE SOLVED: To enhance the determination accuracy of reliability by adding a new feature amount to the conventional feature amount.

SOLUTION: An author topic extraction unit 201 extracts an author topic if the author topic of a contribution document 203 exists in an author topic DB 108. A document topic extraction unit 102 acquires a word topic of a word included in an input document from a word topic DB 109, and extracts a document topic showing characteristics of the input document. A similarity calculation unit 103 calculates similarity between the author topic and the document topic. A feature amount imparting unit 104 imparts the author topic, the document topic and the similarity between both topics to a feature amount extracted by a feature amount extraction unit 106 as a new feature amount. A reliability determination unit 202 uses a reliability determination device DB 111 and a feature amount imparted by the feature amount imparting unit 104 to determine the reliability of the contribution document, and stores a determination result in a reliability DB 206.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for determining the credibility of information in an electronic document (hereinafter simply "document").

As is well known, the spread of social networking services (SNS) has made it easy for anyone to publish information. As a result, information circulates without its credibility being checked, giving rise to social problems such as reputational damage caused by the spread of false rumors.

If the credibility of information could be determined automatically and presented to users, this would help solve the problem. The method of Non-Patent Document 1 is known as one such credibility determination technique.

Non-Patent Document 1 describes determining credibility using features based on the posted document (such as its length, the presence or absence of a URL, and negative or positive expressions), features based on the author (such as the account creation date, the total number of posts, and the number of friends), features based on trends, and features based on information propagation.

Non-Patent Document 1: Carlos Castillo, Marcelo Mendoza and Barbara Poblete, "Information Credibility on Twitter", in Proceedings of WWW, pp. 675-684, 2011.
Non-Patent Document 2: David M. Blei, Andrew Y. Ng and Michael I. Jordan, "Latent Dirichlet Allocation", The Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003.
Non-Patent Document 3: "MeCab: Yet Another Part-of-Speech and Morphological Analyzer", [Online], [retrieved November 9, 2014], Internet <URL: https://code.google.com/p/mecab/>

However, the method of Non-Patent Document 1 does not consider the following important factors for determining the credibility of information, so its determination accuracy may be low.

(1) First, Non-Patent Document 1 does not consider features related to expertise (the author topic). Consequently, when determining credibility, it cannot distinguish people who are knowledgeable about the topic of a posted document from those who are not.

(2) Second, it does not consider features related to the topic of the posted document (the document topic), so it cannot distinguish topics that can be trusted from topics that should be treated with suspicion.

(3) Third, it does not consider features related to the similarity between expertise and the topic of the posted document, so it cannot distinguish whether a knowledgeable person posted about the topic or a person without such knowledge merely happened to post about it.

The present invention has been made to solve these conventional problems, and its object is to improve the accuracy of credibility determination by adding new features to the conventional features.

The information credibility determination system of the present invention comprises: an author topic extraction unit that refers to a first database storing author topics indicating the author characteristics of a group of past documents and, if an author topic corresponding to an input document exists, extracts that author topic; a document topic extraction unit that acquires, from a second database storing word topics indicating the word characteristics of the group of past documents, the word topics of the words contained in the input document, and extracts a document topic indicating the characteristics of the input document; a feature amount assigning unit that assigns to the input document a feature amount that takes into account the similarity between the author topic and the document topic; a third database that stores a credibility determiner constructed by machine learning using the feature amounts assigned to past documents and teacher data; and a credibility determination unit that determines the credibility of the input document using the credibility determiner in the third database and the feature amount assigned to the input document.

The information credibility determination method of the present invention comprises: an author topic extraction step of referring to a first database storing author topics indicating the author characteristics of a group of past documents and, if an author topic corresponding to an input document exists, extracting that author topic; a document topic extraction step of acquiring, from a second database storing word topics indicating the word characteristics of the group of past documents, the word topics of the words contained in the input document, and extracting a document topic indicating the characteristics of the input document; a feature amount assigning step of assigning to the input document a feature amount that takes into account the similarity between the author topic and the document topic; and a credibility determination step of determining the credibility of the input document using a third database, which stores a credibility determiner constructed by machine learning based on the feature amounts assigned to past documents and teacher data, together with the feature amount assigned to the input document.

In the feature amount assigning unit and the feature amount assigning step, a feature amount that takes into account the author topic and the document topic may be assigned to the input document. Alternatively, a feature amount that takes into account the author topic, the document topic, and the similarity between the author topic and the document topic may be assigned to the input document.

The present invention may also take the form of a program that causes a computer to function as the above system. This program can be provided via a network, a recording medium, or the like.

According to the present invention, new features are added to the conventional features, so the accuracy of credibility determination can be improved.

FIG. 1: Block diagram of the credibility determiner construction device of the information credibility determination system according to an embodiment of the present invention.
FIG. 2: Block diagram of the credibility determination device of the same system.
FIG. 3: Flowchart showing the processing of the credibility determiner construction device.
FIG. 4: Flowchart showing the processing of the credibility determination device.

An information credibility determination system according to an embodiment of the present invention is described below. This system adds the characteristics of the author (author topic), the characteristics of the document (document topic), and the similarity between the two to the conventional features, and constructs a classifier (credibility determiner) using these features. When a new posted document arrives, features are extracted in the same way and passed to the classifier to determine its credibility.

≪System configuration≫
A configuration example of the information credibility determination system is described with reference to FIGS. 1 and 2. The system comprises the credibility determiner construction device 1 of FIG. 1, which constructs a credibility determiner in advance, and the credibility determination device 2, which determines the credibility of a posted document using the credibility determiner constructed by the credibility determiner construction device 1.

The devices 1 and 2 are each implemented by a computer. However, each of the devices 1 and 2 need not be a single computer and may consist of multiple computers; both devices 1 and 2 may also be implemented on the same computer.

Specifically, as shown in FIG. 1, the credibility determiner construction device 1 comprises a topic extraction unit 101, a document topic extraction unit 102, a similarity calculation unit 103, a feature amount assigning unit 104, a construction unit 105, a feature amount extraction unit 106, a posted document DB 107, an author topic DB 108, a word topic DB 109, a teacher DB 110, and a credibility determiner DB 111.

The credibility determination device 2 comprises the document topic extraction unit 102, the similarity calculation unit 103, the feature amount assigning unit 104, the feature amount extraction unit 106, an author topic extraction unit 201, a credibility determination unit 202, the author topic DB 108, the word topic DB 109, the credibility determiner DB 111, and a credibility DB 206.

Accordingly, the devices 1 and 2 share the components 102 to 104, 106, 108, 109, and 111. The DBs 110, 107 to 109, 111, and 206 are built in a computer storage device (main storage such as RAM or ROM, or auxiliary storage such as a hard disk drive or solid-state drive).

First, the processing of the credibility determiner construction device 1 is outlined. The topic extraction unit 101 takes as input the data stored in the posted document DB 107, which holds past posted documents, and extracts author topics and word topics. The results are stored in the author topic DB 108 and the word topic DB 109, respectively.

The document topic extraction unit 102 acquires from the teacher DB 110 past posted documents that have been manually labeled in advance as credible or not (teacher data). It then acquires from the word topic DB 109 the word topics of the words contained in the acquired teacher data, and extracts a document topic using those word topics.

The similarity calculation unit 103 calculates the similarity between the author topic stored in the author topic DB 108 and the extracted document topic. The feature amount extraction unit 106 extracts features from the teacher data in the teacher DB 110.

The feature amount assigning unit 104 appends the author topic, the document topic, and the similarity between the two topics as new features to the features extracted by the feature amount extraction unit 106. The construction unit 105 constructs a credibility determiner from the features output by the feature amount assigning unit 104 and the labels attached to the teacher data indicating the presence or absence of credibility, and stores it in the credibility determiner DB 111.

Next, the processing of the credibility determination device 2 is outlined. The author topic extraction unit 201 takes as input a posted document 203 whose credibility is to be determined, and acquires the author topic of its author from the author topic DB 108. If the author is not present in the author topic DB 108, past posted documents by the same author as the posted document 203 are acquired from a website 204 via the Internet 205.

The word topics of the words contained in these past posted documents are acquired from the word topic DB 109, and the author topic for the input posted document is calculated from them. If no past posted documents can be obtained, a predetermined initial value is used as the author topic.
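The fallback order described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function and parameter names are hypothetical, past posts are assumed to arrive already tokenized, and the author topic is assumed to be computed from word topics by summing and normalizing over topics (the text does not state the formula at this point). Returning the default when no fetched word is found in the word topic DB is also an assumption.

```python
from typing import Callable, Dict, List, Optional

TopicDist = List[float]


def author_topic_for(
    author_id: str,
    author_topic_db: Dict[str, TopicDist],
    word_topic_db: Dict[str, TopicDist],
    fetch_past_words: Callable[[str], Optional[List[str]]],  # hypothetical fetcher
    default_topic: TopicDist,
) -> TopicDist:
    # 1) Known author: reuse the stored author topic.
    if author_id in author_topic_db:
        return author_topic_db[author_id]
    # 2) Unknown author: try to obtain words from the author's past posts.
    words = fetch_past_words(author_id) or []
    known = [word_topic_db[w] for w in words if w in word_topic_db]
    if known:
        # Assumed formula: sum word topics per topic, then normalize over topics.
        summed = [sum(column) for column in zip(*known)]
        total = sum(summed)
        return [value / total for value in summed]
    # 3) Nothing usable: fall back to the predetermined initial value.
    return default_topic
```

A uniform distribution over the topics would be one natural choice for `default_topic`, although the text only says that the value is predetermined.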

The document topic extraction unit 102 acquires from the word topic DB 109 the word topics of the words contained in the posted document 203, and extracts a document topic using them. The similarity calculation unit 103 calculates the similarity between the extracted author topic and the document topic. The feature amount extraction unit 106 extracts features from the posted document 203.

The feature amount assigning unit 104 appends the author topic, the document topic, and the similarity between the two topics as new features to the features extracted by the feature amount extraction unit 106. The credibility determination unit 202 determines the credibility of the posted document 203 using the credibility determiner in the credibility determiner DB 111 and the features output by the feature amount assigning unit 104, and stores the result in the credibility DB 206.

≪Processing of the credibility determiner construction device 1≫
The processing (procedure) of the credibility determiner construction device 1 is described with reference to FIG. 3. In FIG. 3, S101 to S104 are performed by the topic extraction unit 101, and S105 and S106 by the document topic extraction unit 102.

S107 is performed by the similarity calculation unit 103, S108 by the feature amount extraction unit 106, S109 by the feature amount assigning unit 104, and S110 and S111 by the construction unit 105.

S101: When processing starts, the posted document DB 107 is accessed and all posted documents stored in it are acquired.

Table 1 shows an example of the data structure of the posted document DB. Each posted document is stored together with a document ID (a unique ID that identifies the document), an author ID (a unique ID that identifies the author who wrote the posted document), and the posting date and time. Other meta-information may be stored as well.

S102: Author topics and word topics are extracted from the group of posted documents acquired in S101. As an example, an extraction method using Latent Dirichlet Allocation (LDA), an unsupervised machine learning method (see Non-Patent Document 2), is described here.

First, the posted documents are concatenated so that there is exactly one document per author. Because each author's posts are merged into a single document, the relation "number of authors = number of posted documents D" holds.

Next, the concatenated posted documents are morphologically analyzed using a morphological analyzer such as MeCab (Non-Patent Document 3), and the occurrence frequency of morphemes (words) belonging to parts of speech specified manually in advance is counted.

The words of the specified parts of speech may be used as they appear, or may first be converted to their base form. Stop words may also be defined manually in advance and excluded, and words whose occurrence frequency is below a manually predetermined threshold may be excluded.
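The preprocessing described above (one merged document per author, part-of-speech filtering, stop-word removal, and a minimum-frequency cutoff) can be sketched as follows. The sketch assumes that tokens have already been produced by a morphological analyzer as (surface form, part of speech) pairs, so no analyzer API is called. All names are hypothetical, and applying the frequency cutoff to the total count over all authors is an assumption, since the text does not say whether the cutoff is global or per author.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (surface form, part of speech)


def author_word_counts(
    posts: Iterable[Tuple[str, List[Token]]],  # (author ID, tokens of one post)
    allowed_pos: Set[str],
    stop_words: Set[str],
    min_count: int,
) -> Dict[str, Counter]:
    """Merge each author's posts and count the words that pass the filters."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for author_id, tokens in posts:
        for surface, pos in tokens:
            if pos in allowed_pos and surface not in stop_words:
                counts[author_id][surface] += 1
    # Assumed global cutoff: drop words whose total frequency is below min_count.
    totals: Counter = Counter()
    for per_author in counts.values():
        totals.update(per_author)
    return {
        author: Counter({w: n for w, n in c.items() if totals[w] >= min_count})
        for author, c in counts.items()
    }
```

The resulting per-author counters correspond to the rows of the author-by-word frequency matrix that serves as the topic model input.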

Expressing the resulting relationship between word occurrence frequencies and authors as a matrix and using it as the input to LDA yields the author topic θ_dt and the word topic θ_wt as in equations (1) and (2).

Here, d ∈ D denotes a posted document, w ∈ W a word, and t ∈ T a topic, and α and β are predetermined hyperparameters. C_dt^DT is the number of occurrences of topic t among the words contained in the concatenated posted document d, counted excluding the topic assignment of the word w (a morpheme of the specified parts of speech) currently under consideration.
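Equations (1) and (2) themselves are not reproduced in this text. Assuming they are the usual point estimates obtained from collapsed Gibbs sampling for LDA, which is consistent with the count C_dt^DT and the hyperparameters α and β defined above, they would take the following form. The count C_wt^WT (the number of times word w is assigned topic t) is an assumed counterpart that the text does not name, and the pairing of α with equation (1) and β with equation (2) follows the common convention rather than the source.

```latex
\theta_{dt} = \frac{C^{DT}_{dt} + \alpha}{\sum_{t' \in T} C^{DT}_{dt'} + |T|\,\alpha}
\qquad (1)

\theta_{wt} = \frac{C^{WT}_{wt} + \beta}{\sum_{w' \in W} C^{WT}_{w't} + |W|\,\beta}
\qquad (2)
```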

The number of topics |T| is set manually in advance. |W| is the number of distinct words contained in the concatenated posted documents, so it can be obtained automatically.

However, because the author topic θ_dt and the word topic θ_wt are obtained by matrix decomposition, they cannot be computed one author at a time. Instead, probability distributions such as those in Tables 2 and 3, described later, are obtained all at once for the given set of posted documents D and the set of words W they contain.

Unsupervised learning methods other than LDA, such as NMF (Non-negative Matrix Factorization), or supervised learning methods, such as Naive Bayes or a Support Vector Machine (SVM), may also be used. When supervised learning is used, however, concrete topics such as politics and economy must be defined manually in advance, and past posted documents labeled with those topics must be prepared.

S103, S104: The author topic θ_dt calculated by equation (1) in S102 is stored in the author topic DB 108 (S103). The word topic θ_wt calculated by equation (2) in S102 is stored in the word topic DB 109 (S104).

Table 2 shows an example of the data structure of the author topic DB 108. For each author ID (a unique ID that identifies the author), |T| topic probabilities are stored, one per topic.

Table 3 shows an example of the data structure of the word topic DB 109. For each word ID (a unique ID that identifies a word), the word indicated by that ID and |T| topic probabilities, one per topic, are stored.

S105: Teacher data is acquired from the teacher DB 110.

Table 4 shows an example of the data structure of the teacher DB 110. In addition to the fields of the posted document DB 107 shown in Table 1, the teacher DB 110 stores a manually assigned label indicating the presence or absence of credibility.

Here, "1" indicates "credible" and "0" indicates "not credible". The labels are not limited to "0" and "1", however, and may be values of an ordinal or continuous variable.

S106: A document topic is calculated using the teacher data acquired in S105 and the word topics stored in the word topic DB 109. The document topic Ψ_dt is obtained by equation (3).

Here, w ∈ W_d ranges over the words contained in the input posted document. If W_d is not empty, the word topics of the words appearing in the input posted document are summed, and the sum is normalized by dividing it by its total over all topics, which gives the document topic.

Like the author topic, the document topic is a probability distribution over T random variables. That is, a document is not assigned a single topic; each document has a probability for every topic (for example, politics 0.5, economy 0.2, and so on). A document topic can be calculated for each item of teacher data.

If W_d is empty, the document topic cannot be derived from word topics, so the document is assumed to have uniformly distributed topic probabilities.
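The computation described for equation (3), including the uniform fallback, can be sketched as follows. The function and variable names are hypothetical. The sketch assumes that words missing from the word topic DB are skipped and that a repeated word contributes once per occurrence; the text does not settle either point.

```python
from typing import Dict, List, Sequence


def document_topic(
    words: Sequence[str],
    word_topic_db: Dict[str, List[float]],
    num_topics: int,
) -> List[float]:
    """Sum the word topics of a document's words and normalize over topics."""
    known = [word_topic_db[w] for w in words if w in word_topic_db]
    if not known:
        # No usable words: fall back to the uniform distribution.
        return [1.0 / num_topics] * num_topics
    summed = [sum(column) for column in zip(*known)]
    total = sum(summed)
    return [value / total for value in summed]
```

With two topics and word topics (0.8, 0.2) and (0.6, 0.4), the sums are (1.4, 0.6) and the normalized document topic is (0.7, 0.3).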

S107: The similarity between the author topic and the document topic is calculated. As an example, a method using the Jensen-Shannon Divergence (JSD) is described here.

JSD is a measure of the difference between two probability distributions and is defined as in equation (5) using the Kullback-Leibler Divergence (KLD) of equation (6). Whereas the value of KLD changes with the order of its arguments, JSD is defined so that it does not.

JSD takes values in [0, 1] and is closer to 0 the more similar the two probability distributions are. The similarity S can therefore be obtained by subtracting JSD from 1, as in equation (4). Because the author topic and the document topic are probability distributions over T random variables, as shown in Tables 2 and 4, the difference between the two topics measured by JSD is a scalar.
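Equations (4) to (6) are not reproduced in this text. A minimal sketch using the standard definitions is given below: KLD(P‖Q) = Σ p_i log(p_i / q_i), JSD(P, Q) = ½ KLD(P‖M) + ½ KLD(Q‖M) with M = (P + Q) / 2, and S = 1 − JSD. The base-2 logarithm is assumed because it is the base for which JSD falls in [0, 1]. Function names are hypothetical.

```python
import math
from typing import Sequence


def kld(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


def similarity(author_topic: Sequence[float], doc_topic: Sequence[float]) -> float:
    """Similarity S = 1 - JSD between an author topic and a document topic."""
    return 1.0 - jsd(author_topic, doc_topic)
```

Because each KLD term is taken against the mixture M, which is positive wherever P or Q is positive, no division by zero occurs even when the two distributions do not overlap.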

That said, KLD may be used instead of JSD, or the similarity may be obtained using cosine similarity or the like. Instead of comparing the full probability distributions, the K topics (0 < K < T) with the highest probabilities may be selected from each distribution and their similarity computed. The squared error between the probability distributions may also be used.
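The alternatives mentioned above can be sketched as follows. Cosine similarity and squared error follow their standard definitions. For the top-K variant the text does not say how the selected topics are compared, so the Jaccard overlap of the two top-K topic sets used here is purely an illustrative assumption; all function names are hypothetical.

```python
import math
from typing import Sequence, Set


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm


def squared_error(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of squared differences; smaller means more similar."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def top_k_topics(p: Sequence[float], k: int) -> Set[int]:
    """Indices of the k topics with the highest probability."""
    return set(sorted(range(len(p)), key=lambda t: p[t], reverse=True)[:k])


def top_k_overlap(p: Sequence[float], q: Sequence[float], k: int) -> float:
    """Assumed comparison: Jaccard overlap of the two top-k topic sets."""
    a, b = top_k_topics(p, k), top_k_topics(q, k)
    return len(a & b) / len(a | b)
```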

S108: Teacher data is acquired from the teacher DB 110, and features are extracted. A feature here is a vector representation of a teacher data item, built from measurable variables that are expected to vary with the presence or absence of credibility.

As a processing example, consider the teacher data with document ID = 4 and document ID = 5 in Table 4. The teacher data with document ID = 4 contains a URL and is 67 characters long. The teacher data with document ID = 5 contains no URL and is 63 characters long.

Expressing both teacher data items as vectors of two variables, the presence or absence of a URL and the length of the posted document, therefore gives d₄ = (1, 67) and d₅ = (0, 63). These vectors are the features.
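The two-variable example above can be sketched as follows. The regular expression used to detect a URL is an assumption, since the text does not say how URL presence is determined, and the function name is hypothetical.

```python
import re
from typing import List

# Assumed URL detector: any http:// or https:// prefix followed by non-space text.
URL_PATTERN = re.compile(r"https?://\S+")


def basic_features(text: str) -> List[float]:
    """Return [URL present (1 or 0), length of the post in characters]."""
    has_url = 1.0 if URL_PATTERN.search(text) else 0.0
    return [has_url, float(len(text))]
```

A 67-character post containing a URL would map to the vector (1, 67), matching the form of d₄ in the example.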

Beyond these two variables, however, a wide range of features can be used: features based on the posted document (such as its length, the presence or absence of a URL, and negative or positive expressions), features based on the author (such as the account creation date, the total number of posts, and the number of friends), features based on trends, and features based on information propagation. In other words, any variable may be used as long as it is measurable and is expected to vary with the presence or absence of credibility.

S109: The author topic AT (θ_dt), the document topic DT (Ψ_dt), and the similarity S (S(θ_dt, Ψ_dt)) are appended as new features to the original features, that is, the features obtained in S108.

In the example of the teacher data with document ID = 4 and document ID = 5, the original features d₄ = (1, 67) and d₅ = (0, 63) are replaced by d₄ = (1, 67, AT_d4, DT_d4, S(θ_4t, Ψ_4t)) and d₅ = (0, 63, AT_d5, DT_d5, S(θ_5t, Ψ_5t)), respectively. Since AT and DT each contribute |T| values and S contributes one, the dimensionality of the feature increases from 2 to 2|T| + 3.
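The concatenation in S109 can be sketched as follows; the function name and argument layout are hypothetical. The similarity value is passed in as a precomputed scalar, so any of the similarity measures described for S107 could supply it.

```python
from typing import List, Sequence


def augment_features(
    base: Sequence[float],
    author_topic: Sequence[float],
    doc_topic: Sequence[float],
    similarity: float,
) -> List[float]:
    """Append the author topic, document topic and their similarity to the base features."""
    if len(author_topic) != len(doc_topic):
        raise ValueError("author topic and document topic must cover the same topics")
    return list(base) + list(author_topic) + list(doc_topic) + [similarity]
```

With two base features and |T| = 3, the result has 2 + 3 + 3 + 1 = 9 dimensions, which agrees with 2|T| + 3.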

S110, S111: Using supervised machine learning, a credibility determiner is constructed from the features produced in S109 and the credibility labels of the teacher data stored in the teacher DB 110 (S110). Any supervised learning method, such as Naive Bayes or a Support Vector Machine (SVM), may be used. The credibility determiner constructed by the learning is stored in the credibility determiner DB 111 (S111), and the process ends.
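As one possible instance of S110, the sketch below trains a small Gaussian Naive Bayes classifier in plain Python. The text allows any supervised learner, so this choice, the variance floor `eps`, and all names are assumptions made for illustration; storing the model in the credibility determiner DB is not modeled.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# label -> (log prior, per-feature means, per-feature variances)
Model = Dict[int, Tuple[float, List[float], List[float]]]


def train_gaussian_nb(
    features: Sequence[Sequence[float]], labels: Sequence[int], eps: float = 1e-6
) -> Model:
    by_label = defaultdict(list)
    for x, y in zip(features, labels):
        by_label[y].append(x)
    model: Model = {}
    for y, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance positive when a feature is constant within a class.
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps for col, m in zip(columns, means)
        ]
        model[y] = (math.log(n / len(labels)), means, variances)
    return model


def predict(model: Model, x: Sequence[float]) -> int:
    def log_posterior(params: Tuple[float, List[float], List[float]]) -> float:
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )

    return max(model, key=lambda y: log_posterior(model[y]))
```

At determination time, the same feature construction would be applied to a new posted document and the resulting vector passed to `predict`.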

≪Processing of the credibility determination device 2≫
The processing (procedure) of the credibility determination device 2 is described with reference to FIG. 4. Here, a user submits a document 203 whose credibility is to be determined to the credibility determination device 2 from the user's own terminal (smartphone or PC) over a network (which may be the Internet 205). When the credibility determination device 2 accepts the input of the posted document 203, the credibility determination process starts.

In FIG. 4, S201 to S206 are performed by the author topic extraction unit 201, S207 by the document topic extraction unit 102, S208 by the similarity calculation unit 103, S209 by the feature amount extraction unit 106, S210 by the feature amount assigning unit 104, and S211 and S212 by the credibility determination unit 202.

S201: When the process starts, it is checked whether the author of the accepted posted document 203 exists in the author topic DB 108. If the author exists, the process proceeds to S202; otherwise, it proceeds to S203.

S202: The corresponding author topic is acquired from the author topic DB 108, and the process proceeds to S207.

S203, S204: It is checked whether past posted documents by the same author as the posted document 203 can be acquired from a website via the Internet (S203). If they cannot be acquired, the process proceeds to S206; if they can, it proceeds to S204. In S204, the past posted documents confirmed in S203 are acquired from the website via the Internet 205, and the process proceeds to S205.

S205: Using the set of words contained in all the past posted documents acquired in S204 and the word topics stored in the word topic DB 109, a document topic is calculated by Expression (3). The value calculated here is used as the author topic, and the process proceeds to S207.

S206: The author topic is initialized by a predetermined method, for example, with a uniform distribution. In this case, the average of the author topics of all authors stored in the author topic DB 108 is used as the author topic, and the process proceeds to S207. Other means of setting the author topic may also be used.
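The branching in S201 to S206 can be summarized in the following sketch. It is only an outline under stated assumptions: the two callables `fetch_past_documents` and `topic_from_documents` are hypothetical stand-ins for the website access in S203/S204 and the Expression (3) calculation in S205, and the S206 fallback shown here uses the mean over stored author topics.

```python
from typing import Callable, Dict, List, Optional, Sequence

Topic = List[float]


def mean_topic(topics: Sequence[Topic]) -> Topic:
    """Element-wise mean of the stored author topics (fallback for S206)."""
    if not topics:
        raise ValueError("author topic DB is empty; no fallback available")
    return [sum(column) / len(topics) for column in zip(*topics)]


def resolve_author_topic(
    author_id: str,
    author_topic_db: Dict[str, Topic],
    fetch_past_documents: Callable[[str], Optional[List[str]]],
    topic_from_documents: Callable[[Sequence[str]], Topic],
) -> Topic:
    # S201/S202: use the stored author topic when the author is known.
    if author_id in author_topic_db:
        return author_topic_db[author_id]
    # S203/S204: otherwise try to obtain the author's past posts.
    past_documents = fetch_past_documents(author_id)
    if past_documents:
        # S205: the document topic of the past posts serves as the author topic.
        return topic_from_documents(past_documents)
    # S206: nothing available, so fall back to the mean over all authors.
    return mean_topic(list(author_topic_db.values()))


# Hypothetical DB with two authors and |T| = 2 topics.
db = {"a1": [0.8, 0.2], "a2": [0.4, 0.6]}
print(resolve_author_topic("unknown", db, lambda a: None, lambda docs: [0.5, 0.5]))
```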

S207: Using the set of words contained in the posted document 203 and the word topic probabilities stored in the word topic DB 109, the document topic is calculated by Expression (3).

S208: Using the author topic and the document topic, the similarity is calculated by Expression (4).
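Expression (4) is not reproduced in this passage, so the sketch below uses cosine similarity purely as an assumed stand-in to show the shape of the S208 calculation: two topic vectors of length |T| go in, and one scalar comes out. The actual measure defined by Expression (4) may differ.

```python
import math
from typing import Sequence


def cosine_similarity(author_topic: Sequence[float],
                      document_topic: Sequence[float]) -> float:
    """Cosine similarity between two topic vectors (assumed measure)."""
    if len(author_topic) != len(document_topic):
        raise ValueError("topic vectors must have the same length |T|")
    dot = sum(a * b for a, b in zip(author_topic, document_topic))
    norm_a = math.sqrt(sum(a * a for a in author_topic))
    norm_b = math.sqrt(sum(b * b for b in document_topic))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero topic vector as having no similarity
    return dot / (norm_a * norm_b)


# Hypothetical topic vectors with |T| = 3.
print(cosine_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```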

S209: Feature amounts are extracted from the posted document 203. This is the same processing as S108, differing only in its input data.

S210: The author topic from S202, S205, or S206, the document topic from S207, and the similarity from S208 are appended as new feature amounts to the original feature amounts (those from S209). This is the same processing as S109, differing only in its input data.

S211, S212: The credibility of the posted document 203 is determined using the credibility determiner stored in the credibility determiner DB 111 and the feature amounts from S210 (S211). How this determination is carried out depends on the supervised learning method used to construct the credibility determiner in S110. The determination result is stored in the credibility DB 206 (S212).

Table 5 shows an example of the data structure of the credibility DB 206. For each document ID, a unique ID identifying the document, the DB stores an author ID (a unique ID identifying the author of the posted document), the posting date and time, whether the document is credible, and the confidence of the credibility determination. The confidence is the classification probability given by the supervised learning method when credibility was determined in S211. Other metadata may also be stored.
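The record layout described for Table 5 could be represented as follows. This is a sketch only: the field names, the field types, the 0.5 decision threshold, and the choice to store the probability of the predicted class as the confidence are assumptions made for illustration, not details taken from Table 5.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CredibilityRecord:
    document_id: int      # unique ID identifying the document
    author_id: int        # unique ID identifying the author
    posted_at: datetime   # posting date and time
    credible: bool        # presence or absence of credibility
    confidence: float     # classification probability from S211


def make_record(document_id: int, author_id: int, posted_at: datetime,
                p_credible: float, threshold: float = 0.5) -> CredibilityRecord:
    """Build a record from the classifier's probability of 'credible'."""
    credible = p_credible >= threshold
    # Assumed convention: confidence is the probability of the chosen class.
    confidence = p_credible if credible else 1.0 - p_credible
    return CredibilityRecord(document_id, author_id, posted_at, credible, confidence)


# Hypothetical result for one posted document.
record = make_record(4, 10, datetime(2014, 11, 19, 12, 0), 0.8)
print(record)
```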

With the information credibility determination system of this embodiment, new feature amounts are added to the conventional feature amounts when determining the credibility of a posted document, so the accuracy of the credibility determination can be improved.

(1) Specifically, if a feature amount relating to expertise (the author topic) is not considered when determining credibility, a person who is knowledgeable about the topic of the posted document cannot be distinguished from one who is not, and the accuracy of the credibility determination may decrease.

The information credibility determination system therefore adds, in S109 and S210, the author topic obtained from past posted documents as a new feature amount. Expertise is thereby taken into account in the credibility determination (S110, S211), which makes it possible to improve the determination accuracy.

(2) Likewise, if a feature amount relating to the topic of the posted document (the document topic) is not considered, trustworthy topics cannot be distinguished from topics that should be treated with suspicion, and the accuracy of the credibility determination may decrease.

The information credibility determination system therefore adds, in S109 and S210, the document topic as a new feature amount. The topic of the posted document is thereby taken into account in the credibility determination (S110, S211), which makes it possible to improve the determination accuracy.

(3) Furthermore, if a feature amount relating to the similarity between expertise and the topic of the posted document is not considered, it is impossible to distinguish whether a knowledgeable person posted about the topic or a person without such knowledge merely happened to post about it, and the accuracy of the credibility determination may decrease.

The information credibility determination system therefore calculates the similarity between the author topic and the document topic in S107 and S208, and adds it as a new feature amount in S109 and S210. The similarity between expertise and the topic of the posted document is thereby taken into account in the credibility determination (S110, S211), which makes it possible to improve the determination accuracy.

≪Programs, etc.≫
The present invention is not limited to the above embodiment and can be applied and modified within the scope of the claims. For example, it is not necessary to add all of the author topic, document topic, and similarity as new feature amounts in S109 and S210; the author topic and document topic alone may be added, or the similarity alone may be added. In that case, whichever of the effects (1) to (3) correspond to the added feature amounts are obtained.
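The variations described above, adding only the topics or only the similarity, can be expressed with two switches. This is a hypothetical sketch; the function name `select_features` and its flag names are introduced here for illustration only.

```python
from typing import List, Sequence


def select_features(base: Sequence[float],
                    author_topic: Sequence[float],
                    document_topic: Sequence[float],
                    similarity: float,
                    use_topics: bool = True,
                    use_similarity: bool = True) -> List[float]:
    """Append only the chosen subset of the new feature amounts."""
    features = list(base)
    if use_topics:
        # Author topic and document topic are added together.
        features += list(author_topic) + list(document_topic)
    if use_similarity:
        features.append(similarity)
    return features


# Hypothetical inputs with |T| = 3.
base, at, dt, s = [1, 67], [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], 0.97
print(len(select_features(base, at, dt, s)))                        # all three
print(len(select_features(base, at, dt, s, use_similarity=False)))  # topics only
print(len(select_features(base, at, dt, s, use_topics=False)))      # similarity only
```

The same flag settings would have to be used both when constructing the determiner (S109) and when judging a posted document (S210), since the determiner expects a fixed number of dimensions.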

The present invention can also be configured as an information credibility determination program that causes a computer to function as some or all of the components 101 to 111, 201, 202, and 206 of the information credibility determination system (the credibility determiner construction device 1 and the credibility determination device 2). With this program, a computer can be made to execute some or all of S101 to S111 and S201 to S212.

The program can be provided over a network, for example through a website or by e-mail. The program can also be recorded on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, BD-ROM, BD-R, or BD-RE for storage and distribution. The recording medium is read using a recording medium drive, and the program code itself implements the processing of the above embodiment, so the recording medium also constitutes the present invention.

Description of symbols
1 ... Credibility determiner construction device
2 ... Credibility determination device
101 ... Topic extraction unit
102 ... Document topic extraction unit
103 ... Similarity calculation unit
104 ... Feature amount assigning unit
105 ... Construction unit
106 ... Feature amount extraction unit
107 ... Posted document DB
108 ... Author topic DB (first database)
109 ... Word topic DB (second database)
110 ... Teacher DB
111 ... Credibility determiner DB (third database)
201 ... Author topic extraction unit
202 ... Credibility determination unit
203 ... Posted document (input document)
204 ... Website
205 ... Internet
206 ... Credibility DB

Claims

An information credibility determination system comprising:
an author topic extraction unit that refers to a first database storing author topics indicating author characteristics of a past document group and, if an author topic corresponding to an input document exists, extracts that author topic;
a document topic extraction unit that acquires word topics of words included in the input document from a second database storing word topics indicating word characteristics of the past document group, and extracts a document topic indicating a characteristic of the input document;
a feature amount assigning unit that assigns to the input document a feature amount that takes into account the similarity between the author topic and the document topic;
a third database that stores a credibility determiner constructed by machine learning using the feature amounts assigned to past documents and teacher data; and
a credibility determination unit that determines the credibility of the input document using the credibility determiner stored in the third database and the feature amount assigned to the input document.

An information credibility determination system comprising:
an author topic extraction unit that refers to a first database storing author topics indicating author characteristics of a past document group and, if an author topic corresponding to an input document exists, extracts that author topic;
a document topic extraction unit that acquires word topics of words included in the input document from a second database storing word topics indicating word characteristics of the past document group, and extracts a document topic indicating a characteristic of the input document;
a feature amount assigning unit that assigns to the input document a feature amount that takes into account the author topic and the document topic;
a third database that stores a credibility determiner constructed by machine learning using the feature amounts assigned to past documents and teacher data; and
a credibility determination unit that determines the credibility of the input document using the credibility determiner stored in the third database and the feature amount assigned to the input document.

An information credibility determination system having a first database storing author topics indicating author characteristics of a past document group and a second database storing word topics indicating word characteristics of the past document group, the system comprising:
an author topic extraction unit that extracts the author topic of an input document if it exists in the first database and, if it does not, acquires past documents by the same author as the input document via a website and extracts the author topic from them;
a document topic extraction unit that acquires word topics of words included in the input document from the second database, and extracts a document topic indicating a characteristic of the input document;
a feature amount assigning unit that assigns to the input document a feature amount that takes into account the author topic, the document topic, and the similarity between the author topic and the document topic;
a third database that stores a credibility determiner constructed by machine learning using the feature amounts assigned to past documents and teacher data; and
a credibility determination unit that determines the credibility of the input document using the credibility determiner stored in the third database and the feature amount assigned to the input document.

An information credibility determination method in which a computer determines the credibility of an input document, the method comprising:
an author topic extraction step of referring to a first database storing author topics indicating author characteristics of a past document group and, if an author topic corresponding to the input document exists, extracting that author topic;
a document topic extraction step of acquiring word topics of words included in the input document from a second database storing word topics indicating word characteristics of the past document group, and extracting a document topic indicating a characteristic of the input document;
a feature amount assigning step of assigning to the input document a feature amount that takes into account the similarity between the author topic and the document topic; and
a credibility determination step of determining the credibility of the input document using the feature amount assigned to the input document and a credibility determiner stored in a third database, the credibility determiner having been constructed by machine learning from the feature amounts assigned to past documents and teacher data.

An information credibility determination method in which a computer determines the credibility of an input document, the method comprising:
an author topic extraction step of referring to a first database storing author topics indicating author characteristics of a past document group and, if an author topic corresponding to the input document exists, extracting that author topic;
a document topic extraction step of acquiring word topics of words included in the input document from a second database storing word topics indicating word characteristics of the past document group, and extracting a document topic indicating a characteristic of the input document;
a feature amount assigning step of assigning to the input document a feature amount that takes into account the author topic and the document topic; and
a credibility determination step of determining the credibility of the input document using the feature amount assigned to the input document and a credibility determiner stored in a third database, the credibility determiner having been constructed by machine learning from the feature amounts assigned to past documents and teacher data.

An information credibility determination method in which a computer determines the credibility of an input document using a first database that stores author topics extracted from a past document group and a second database that stores word topics acquired from the past document group, the method comprising:
an author topic extraction step of extracting the author topic of the input document if it exists in the first database and, if it does not, acquiring past documents by the same author as the input document via a website and extracting the author topic from them;
a document topic extraction step of acquiring word topics of words included in the input document from the second database, and extracting a document topic indicating a characteristic of the input document;
a feature amount assigning step of assigning to the input document a feature amount that takes into account the author topic, the document topic, and the similarity between the author topic and the document topic; and
a credibility determination step of determining the credibility of the input document using the feature amount assigned to the input document and a credibility determiner stored in a third database, the credibility determiner having been constructed by machine learning from the feature amounts assigned to past documents and teacher data.

An information credibility determination program that causes a computer to function as the information credibility determination system according to any one of claims 1 to 3.