JP6839001B2

JP6839001B2 - Model learning device, information judgment device and their programs

Info

Publication number: JP6839001B2
Application number: JP2017048039A
Authority: JP
Inventors: 宮崎 太郎; 後藤 淳; 武井 友香
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2017-03-14
Filing date: 2017-03-14
Publication date: 2021-03-03
Anticipated expiration: 2037-03-14
Also published as: JP2018151892A

Description

The present invention relates to an information determination technique for determining whether social media information is information of a particular field.

In recent years, the spread of social networking services (SNS) has made it easy for individuals to publish information in real time. For example, many eyewitness reports from people who happened to be at the scene of a fire, an accident, or a similar event are posted on SNS. Some of these posts include photos or video of the scene, and they are often used in news programs as images and footage capturing the moment the fire or accident occurred. Broadcasters therefore extract such information by hand, for example by monitoring SNS around the clock.

Manual extraction of the required information from SNS usually relies on keyword search. However, posts describe the same kind of event in many different ways: some name the line, as in "The ○○ line is delayed", while others name the station, as in "Trains are delayed because of an accident at △△ station". It is therefore difficult to construct keywords that cover all of these expressions.

To address these problems, many methods of extracting posts by machine learning have been studied. For example, one disclosed method uses a neural network to learn words and phrases that can be dangerous expressions depending on a specific theme, and extracts such words and phrases from social big data on SNS (see Patent Document 1). Another disclosed method computes the degree of relevance between the n-grams of SNS posts and weather conditions, and uses machine learning to extract useful posts related to weather events (see Non-Patent Document 1).

Patent Document 1: JP-A-2015-72614

Non-Patent Document 1: Masatsugu Hangyo (萩行正嗣), "Extracting Useful Posts from Social Media Using Selective Weather Information", The Association for Natural Language Processing, Proceedings of the 22nd Annual Meeting, pp. 397-400, March 2016

In conventional machine-learning methods, the accuracy of learning depends heavily on the amount of training data. Improving accuracy therefore requires preparing a large amount of training data. However, creating training data requires labeling each item of social media information (each post) as correct-answer (positive) or incorrect-answer (negative) data. This incurs costs in labor and working time.

Accordingly, an object of the present invention is to provide a model learning device, an information determination device, and programs for them that determine whether social media information is information of the same field as text data whose field is known, such as news manuscripts, by using not only labeled social media training data but also the features of that text data.

To solve the above problem, a model learning device according to the present invention learns a feature extraction model for extracting features from a plurality of manuscript texts whose field is known, and, using posts (social media information) whose field is known as teacher data, learns an information determination model for determining whether a post under evaluation is information of that field. The device comprises vectorization means, feature extraction model learning means, feature extraction means, and information determination model learning means.

In this configuration, the vectorization means of the model learning device averages the word-level distributed representation vectors, stored in advance in storage means, over the words of a manuscript text or of a post serving as teacher data, and thereby generates a sentence-level distributed representation vector, that is, a distributed representation vector per manuscript or per post. The per-word distributed representation vectors are numerical vectors assigned from the distribution of words so that words with more similar meanings receive closer vectors. They can be learned and generated by a method such as word2vec.
The vectorization means thus generates, for each manuscript text or post, a vector that reflects its semantic content.

The feature extraction model learning means of the model learning device then learns a feature extraction model that compresses the dimensionality of the sentence-level distributed representation vectors generated from the manuscript texts and extracts feature vectors. This feature extraction model can be configured as the feature-extracting part of a neural-network autoencoder.
The feature extraction means then uses the feature extraction model to extract a feature vector from the sentence-level distributed representation vector generated from the teacher data.

The information determination model learning means then takes as input the feature vector that the feature extraction means extracted for the teacher data, together with the sentence-level distributed representation vector of the teacher data from which that feature vector was extracted, and learns the information determination model by machine learning.
By using, in addition to the sentence-level distributed representation vector of the teacher data, the feature vector extracted with the feature extraction model learned from manuscript texts, the information determination model learning means can improve the learning effect by drawing on past manuscripts such as news manuscripts.
The model learning device can be implemented by a model learning program that causes a computer to function as each of the means described above.

Also to solve the above problem, an information determination device according to the present invention uses the feature extraction model and the information determination model learned by the model learning device to determine whether unknown data, that is, a social media post whose field is unknown, is information of the learned field. The device comprises vectorization means, feature extraction means, and determination means.

In this configuration, the vectorization means of the information determination device averages the word-level distributed representation vectors, stored in advance in storage means, over the words of the unknown data, and thereby generates a sentence-level distributed representation vector, that is, a distributed representation vector for the post.
The feature extraction means then uses the feature extraction model to extract a feature vector from the sentence-level distributed representation vector generated by the vectorization means. Because the feature extraction model has been pre-trained on manuscripts whose field is known, the feature vector generated from the unknown data reflects the features that should be extracted if the data is information about that field.
The determination means then uses the information determination model to determine, from the sentence-level distributed representation vector generated by the vectorization means and the feature vector extracted by the feature extraction means, whether the unknown data is information of the learned field.

Also to solve the above problem, an information determination device according to the present invention learns a feature extraction model for extracting features from a plurality of manuscript texts whose field is known; learns, using posts (social media information) whose field is known as teacher data, an information determination model for determining whether a post under evaluation is information of that field; and determines whether unknown data, that is, a social media post whose field is unknown, is information of the learned field. The device comprises vectorization means, feature extraction model learning means, feature extraction means, information determination model learning means, and determination means.
The information determination device can be implemented by an information determination program that causes a computer to function as each of the means described above.

The present invention provides the following advantageous effects.
According to the present invention, the features of manuscript texts (text data) whose field is known can be used to determine whether social media information is information of the same field.
As a result, the present invention can improve the learning effect, and hence the accuracy of information determination, by using a large body of existing manuscripts, without having to learn from a large amount of social media information on SNS.
The present invention thereby makes it possible to use the social big data that individuals publish on SNS effectively as a source of information such as news.

A block diagram showing the configuration of an information determination device according to an embodiment of the present invention.
An explanatory diagram of the processing performed by the vectorization means: (a) shows an example of dividing media information into words, and (b) shows an example of computing the distributed representation vector of a manuscript text or post from the distributed representation vectors of its words.
An explanatory diagram of the structure of the feature extraction model learned by the feature extraction model learning means.
An explanatory diagram of the structure of the information determination model learned by the information determination model learning means.
A flowchart showing the operation of the information determination device according to the embodiment of the present invention in the first learning mode.
A flowchart showing the operation of the information determination device according to the embodiment of the present invention in the second learning mode.
A flowchart showing the operation of the information determination device according to the embodiment of the present invention in the evaluation mode.
A block diagram showing the configuration of a model learning device according to another embodiment of the present invention.
A block diagram showing the configuration of an information determination device according to another embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
[Configuration of the information determination device]
First, the configuration of the information determination device 1 according to an embodiment of the present invention is described with reference to FIG. 1.

The information determination device 1 comprises a control unit 10 and a storage unit 20.
The information determination device 1 determines whether information published on SNS (text data per post, such as a Tweet [registered trademark]) is information of a specific field.

As shown in FIG. 1, the control unit 10 comprises word dividing means 11, vectorization means 12, feature extraction model learning means 13, feature extraction means 14, information determination model learning means 15, and determination means 16.
The control unit 10 controls the operation of the information determination device 1 and operates in three modes. The first is the first learning mode, in which a feature extraction model for extracting the features of information is learned from information (text data) whose field is known. The second is the second learning mode, in which an information determination model for determining whether unknown media information is information of a given field is learned from social media information (hereinafter simply "media information") whose field is known. The third is the evaluation mode, in which it is determined whether unknown media information is information of the learned field.

In this embodiment, "news" that can be used in news programs and the like is taken as the example of a specific field. The field may of course be something other than news, such as sports or music.

In this embodiment, in the first learning mode, the control unit 10 receives a large number (for example, several hundred thousand) of news manuscripts as input. For example, the control unit 10 receives news titles one manuscript at a time.
In the second learning mode, the control unit 10 receives as input a large amount (for example, several tens of thousands of items) of media information whose field is known to be news. For example, the control unit 10 receives Tweets [registered trademark] one post at a time.
In the evaluation mode, the control unit 10 receives media information whose field is unknown as input and outputs the result of determining whether it is news information.

The word dividing means 11 divides a news manuscript (manuscript text) or media information (post), both of which are text data, into words. Specifically, the word dividing means 11 divides the text data into words by morphological analysis and outputs the words to the vectorization means 12.

The vectorization means 12 converts the manuscript text or post into a vector, starting from the words produced by the word dividing means 11. Using the per-word distributed representation vectors stored in advance in the distributed representation vector storage means 21, the vectorization means 12 averages the distributed representation vectors of the words that make up the manuscript text or post, and generates a distributed representation vector per manuscript or per post.
A distributed representation vector represents a word as a numerical vector of finite but high dimension (for example, 200 dimensions), such that words with similar meanings (similar distributional characteristics) are mapped to nearby vectors. These vectors are generated by a common method such as word2vec or GloVe (Global Vectors for Word Representation).

The vectorization means 12 reads out the distributed representation vectors corresponding to the divided words and adds them together. It then normalizes the summed vector by dividing it by the number of words contained in the manuscript text or post, and thereby generates the distributed representation vector of the manuscript text or post (the sentence-level distributed representation vector).
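The summation and division described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `sentence_vector`, the lookup table `toy_table`, and its 3-dimensional values are hypothetical, and the sketch assumes every word of the text has an entry in the table (the source does not say how words without a stored vector are handled).

```python
from typing import Dict, List


def sentence_vector(words: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Sum the word vectors of `words`, then divide by the number of words."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for word in words:
        vec = table[word]  # assumption: every word has a stored vector
        for i, value in enumerate(vec):
            total[i] += value
    return [t / len(words) for t in total]


# Hypothetical 3-dimensional word vectors, for illustration only.
toy_table = {
    "線": [0.2, 0.4, 0.6],
    "事故": [0.4, 0.0, 0.2],
    "遅れ": [0.6, 0.2, 0.4],
}
vs = sentence_vector(["線", "事故", "遅れ"], toy_table)
```

Because the sum is divided by the word count, texts of different lengths yield vectors on a comparable scale, which is what allows long manuscripts and short posts to share one model.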

In the first learning mode, the vectorization means 12 outputs the generated sentence-level distributed representation vector to the feature extraction model learning means 13. In the second learning mode, it outputs the vector to the feature extraction means 14 and the information determination model learning means 15. In the evaluation mode, it outputs the vector to the feature extraction means 14 and the determination means 16.
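The mode-dependent routing of the sentence-level vector can be summarized as a lookup table. The sketch below is only an illustration of that routing; the names `Mode`, `ROUTES`, and `destinations`, and the string labels for the means, are hypothetical.

```python
from enum import Enum
from typing import Tuple


class Mode(Enum):
    FIRST_LEARNING = 1
    SECOND_LEARNING = 2
    EVALUATION = 3


# Where the vectorization means 12 sends the sentence-level vector in each mode.
ROUTES = {
    Mode.FIRST_LEARNING: ("feature_extraction_model_learning_means_13",),
    Mode.SECOND_LEARNING: (
        "feature_extraction_means_14",
        "information_determination_model_learning_means_15",
    ),
    Mode.EVALUATION: ("feature_extraction_means_14", "determination_means_16"),
}


def destinations(mode: Mode) -> Tuple[str, ...]:
    """Return the components that receive the sentence-level vector in `mode`."""
    return ROUTES[mode]
```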

The sentence-level distributed representation vector is now described with reference to FIG. 2 (and FIG. 1 where appropriate).
As shown in FIG. 2(a), when the text data (manuscript text or post) is, for example, 「○○線が事故で遅れている。」 ("The ○○ line is delayed due to an accident."), the word dividing means 11 divides it into words as 「○○／線／が／事故／で／遅れ／て／いる／。」.

For each divided word, the vectorization means 12 then reads out the corresponding distributed representation vector from the distributed representation vector storage means 21. For example, as shown in FIG. 2(b), for the word "○○" it reads out an n-dimensional (for example, 200-dimensional) distributed representation vector "0.1, 0.3, 0.4, 0.1, 0.8, 0.9, 0.2, ..., 0.9".
The vectorization means 12 then adds the distributed representation vectors of all the words that make up the text data to obtain the all-word total (in the example of FIG. 2(b), "7.2, 1.8, 2.7, 3.6, 3.6, 7.2, 4.5, ..., 6.3").

The vectorization means 12 then divides the all-word total by the number of words that make up the text data (nine in the example of FIG. 2) to obtain the sentence-level distributed representation vector (in the example of FIG. 2(b), "0.8, 0.2, 0.3, 0.4, 0.4, 0.8, 0.5, ..., 0.7").
In this way, the vectorization means 12 generates a sentence-level distributed representation vector for each manuscript text or post.
Returning to FIG. 1, the description of the configuration of the information determination device 1 continues.

In the first learning mode, the feature extraction model learning means 13 learns a feature extraction model for extracting features from the sentence-level distributed representation vectors generated by the vectorization means 12. The feature of a sentence-level distributed representation vector is obtained by compressing its number of dimensions.

Specifically, the feature extraction model learning means 13 learns a feature extraction model M_1 for extracting features (feature vectors) by means of an autoencoder AE, a neural network composed of the input layer A_L1, hidden layer A_L2, and output layer A_L3 shown in FIG. 3.
In the autoencoder AE shown in FIG. 3, the sentence-level distributed representation vector Vs generated by the vectorization means 12 is input to the input layer A_L1. The input layer A_L1 and the output layer A_L3 have the same number of dimensions as the sentence-level distributed representation vector (for example, 200). The hidden layer A_L2 has fewer dimensions than the input layer A_L1 and the output layer A_L3 (for example, 100).

In the autoencoder AE, the feature extraction model learning means 13 learns parameters, such as the coefficients of the encoding function from the input layer A_L1 to the hidden layer A_L2 and of the decoding function from the hidden layer A_L2 to the output layer A_L3, so that the output layer A_L3 reproduces the same sentence-level distributed representation vector Vs that is given to the input layer A_L1. Because the intermediate layer is a hidden layer A_L2 with fewer dimensions than the input layer A_L1 and the output layer A_L3, the autoencoder AE can extract, in the hidden layer A_L2, a feature that compresses the dimensionality of the sentence-level distributed representation vector.
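One common way to write the encoding function, the decoding function, and the training objective of such an autoencoder is shown below. The source does not give these formulas: the symbols W_e, b_e, W_d, b_d, the activation functions f and g, and the squared-error objective are assumptions made for illustration, with n the dimension of Vs (for example, 200) and m the dimension of the hidden layer (for example, 100).

```latex
\begin{aligned}
h &= f\left(W_e V_s + b_e\right), & h &\in \mathbb{R}^{m} \\
\hat{V}_s &= g\left(W_d h + b_d\right), & \hat{V}_s &\in \mathbb{R}^{n} \\
(W_e, b_e, W_d, b_d) &= \arg\min \sum_{V_s} \left\lVert V_s - \hat{V}_s \right\rVert^{2}, & m &< n
\end{aligned}
```

Under this notation, the pair (W_e, b_e) corresponds to the input-to-hidden parameters that the text keeps as the feature extraction model M_1, and h is the feature vector.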

The feature extraction model learning means 13 trains the autoencoder AE so that each sequentially input sentence-level distributed representation vector is the same at the input layer A_L1 and the output layer A_L3, and learns the parameters from the input layer A_L1 to the hidden layer A_L2 as the feature extraction model M_1. The autoencoder AE is trained using, for example, error backpropagation.
The feature extraction model learning means 13 ends learning when learning on the news manuscripts has been performed a predetermined number of times, or when the parameter error has converged to within a predetermined error.
The feature extraction model learning means 13 writes the learned feature extraction model M_1 to the feature extraction model storage means 22.
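The training procedure just described can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the patented implementation: the tanh hidden layer, linear output layer, squared-error loss, learning rate, and the use of mean reconstruction error as the convergence test are all choices made here, and `train_autoencoder` and `toy_vectors` are hypothetical names. The 4-to-2 compression stands in for the 200-to-100 example in the text.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def train_autoencoder(
    data: List[List[float]],
    hidden_dim: int,
    max_epochs: int = 300,
    tol: float = 1e-4,
    lr: float = 0.05,
    seed: int = 0,
) -> Tuple[Matrix, List[float], List[float]]:
    """Train input -> hidden -> output to reproduce the input; return the encoder."""
    rng = random.Random(seed)
    n = len(data[0])
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(hidden_dim)]
    b_enc = [0.0] * hidden_dim
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(n)]
    b_dec = [0.0] * n
    losses: List[float] = []
    for _ in range(max_epochs):  # stop after a fixed number of passes ...
        total = 0.0
        for x in data:
            # Forward pass (tanh hidden layer and linear output are assumptions).
            h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                 for row, b in zip(w_enc, b_enc)]
            y = [sum(w * v for w, v in zip(row, h)) + b
                 for row, b in zip(w_dec, b_dec)]
            err = [yi - xi for yi, xi in zip(y, x)]
            total += sum(e * e for e in err)
            # Backpropagation of the loss 0.5 * sum(err ** 2).
            grad_pre = [
                sum(err[i] * w_dec[i][j] for i in range(n)) * (1.0 - h[j] ** 2)
                for j in range(hidden_dim)
            ]
            for i in range(n):
                for j in range(hidden_dim):
                    w_dec[i][j] -= lr * err[i] * h[j]
                b_dec[i] -= lr * err[i]
            for j in range(hidden_dim):
                for k in range(n):
                    w_enc[j][k] -= lr * grad_pre[j] * x[k]
                b_enc[j] -= lr * grad_pre[j]
        losses.append(total / len(data))
        if losses[-1] < tol:  # ... or once the error is small enough
            break
    return w_enc, b_enc, losses


# Hypothetical 4-dimensional sentence vectors, compressed to 2 dimensions.
toy_vectors = [
    [0.8, 0.2, 0.3, 0.4],
    [0.1, 0.9, 0.5, 0.7],
    [0.4, 0.4, 0.8, 0.2],
    [0.9, 0.1, 0.2, 0.6],
]
w_enc, b_enc, losses = train_autoencoder(toy_vectors, hidden_dim=2)
```

Only the encoder parameters are returned, mirroring the text's statement that the input-to-hidden parameters are what is stored as the feature extraction model; the decoder exists only to drive training.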

In the second learning mode or the evaluation mode, the feature extraction means 14 uses the feature extraction model stored in the feature extraction model storage means 22 to extract a feature vector (feature) from the sentence-level distributed representation vector generated by the vectorization means 12.
Using the feature extraction model M_1 (see FIG. 3) stored in the feature extraction model storage means 22, the feature extraction means 14 computes a vector of fewer dimensions (for example, 100) from the sentence-level distributed representation vector (for example, 200 dimensions), and thereby extracts the feature vector.
In the second learning mode, the feature extraction means 14 outputs the extracted feature vector to the information determination model learning means 15. In the evaluation mode, it outputs the extracted feature vector to the determination means 16.
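Applying a stored feature extraction model amounts to running only the input-to-hidden half of the autoencoder. The sketch below assumes a tanh encoder; `extract_feature_vector`, `stored_w_enc`, and `stored_b_enc` are hypothetical names, and the parameter values are made up for illustration rather than learned.

```python
import math
from typing import List


def extract_feature_vector(
    vs: List[float], w_enc: List[List[float]], b_enc: List[float]
) -> List[float]:
    """Map a sentence-level vector to a lower-dimensional feature vector."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, vs)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


# Hypothetical stored parameters for a 4 -> 2 compression.
stored_w_enc = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
stored_b_enc = [0.0, 0.0]
vf = extract_feature_vector([0.8, 0.2, 0.3, 0.4], stored_w_enc, stored_b_enc)
```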

In the second learning mode, the information determination model learning means 15 learns, from the sentence-level distributed representation vectors generated by the vectorization means 12, an information determination model for determining whether media information is information of a given field.

Specifically, the information determination model learning means 15 learns an information determination model M_2 by means of a feed-forward neural network (FFNN) composed of the input layer F_L1, hidden layer F_L2, and output layer F_L3 shown in FIG. 4.
In the FFNN shown in FIG. 4, the sentence-level distributed representation vector Vs generated by the vectorization means 12 and the feature vector Vf extracted with the feature extraction model M_1 are input to the input layer F_L1. In the hidden layer F_L2, the FFNN weights the value of each element of the vector given to the input layer F_L1 (Vs together with Vf) and propagates it forward, and the determination result is output from the output layer F_L3. The output layer F_L3 has, for example, two dimensions. One node outputs the normalized probability that the sentence-level distributed representation vector Vs is information of the field learned in the first learning mode. The other node outputs the normalized probability that Vs is not information of the field learned in the first learning mode.
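A forward pass through such a network can be sketched as follows. The sketch reads "Vs together with Vf" as concatenation and uses a tanh hidden layer and a softmax output to obtain normalized probabilities; these are assumptions, since the source does not name the activation functions. All function names and parameter values are hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Normalize scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def ffnn_forward(
    vs: List[float],
    vf: List[float],
    w_hidden: List[List[float]],
    b_hidden: List[float],
    w_out: List[List[float]],
    b_out: List[float],
) -> List[float]:
    """Return [P(in learned field), P(not in learned field)]."""
    x = list(vs) + list(vf)  # input layer: Vs concatenated with Vf (assumption)
    h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
         for row, b in zip(w_hidden, b_hidden)]
    z = [sum(w * v for w, v in zip(row, h)) + b
         for row, b in zip(w_out, b_out)]
    return softmax(z)


# Hypothetical toy parameters: 2-dim Vs, 1-dim Vf, 2 hidden units, 2 outputs.
probs = ffnn_forward(
    vs=[0.8, 0.2],
    vf=[0.5],
    w_hidden=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
    b_hidden=[0.0, 0.0],
    w_out=[[1.0, 0.0], [0.0, 1.0]],
    b_out=[0.0, 0.0],
)
```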

When the teacher data is a positive example, the information determination model learning means 15 learns the weights of each layer as the parameters of the information determination model M_2 so that the output of one node, which is the probability that the sentence-unit distributed representation vector Vs is the vector of a posted sentence in the field learned in the first learning mode, becomes "1" and the output of the other node becomes "0". When the teacher data is a negative example, it learns the weights of each layer as the parameters of the information determination model M_2 so that the output of the one node becomes "0" and the output of the other node becomes "1". The FFNN is trained using, for example, error backpropagation.
The information determination model learning means 15 ends learning when learning with the teacher data has been performed a predetermined number of times or when the parameter error has converged within a predetermined error.
The information determination model learning means 15 writes the learned information determination model to the information determination model storage means 23.
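The target outputs for positive and negative examples can be illustrated with a short sketch. The loss function is not specified in the text, so cross-entropy is an assumption, and the function names are hypothetical. With a softmax output and cross-entropy loss, the error signal that backpropagation starts from at the output layer is simply the predicted probabilities minus the targets.

```python
import math


def target_vector(is_positive_example):
    """Desired outputs of the two nodes of F_L3: (1, 0) for positive, (0, 1) for negative."""
    return [1.0, 0.0] if is_positive_example else [0.0, 1.0]


def cross_entropy(probs, target, eps=1e-12):
    """Assumed loss: small when the probabilities match the target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, target))


def output_error(probs, target):
    """Gradient of the cross-entropy with respect to the pre-softmax scores."""
    return [p - t for p, t in zip(probs, target)]
```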

In the evaluation mode, the determination means 16 determines whether or not media information is information in the field learned in the learning modes (first learning mode and second learning mode).
In the evaluation mode, the determination means 16 receives the sentence-unit distributed representation vector from the vectorization means 12 and receives, from the feature extraction means 14, the feature vector extracted from that sentence-unit distributed representation vector.

The determination means 16 uses the information determination model stored in the information determination model storage means 23 to determine whether or not the input sentence-unit distributed representation vector and feature vector are vectors corresponding to information in the field learned in the learning modes. Specifically, the determination means 16 inputs the sentence-unit distributed representation vector Vs and the feature vector Vf to the input layer F_L1 of the FFNN shown in FIG. 4 and makes the determination based on the result output from the output layer F_L3. In the example of FIG. 4, the determination means 16 takes the output of one node of the output layer F_L3, which is the probability that the input is information on an event in the learned field, and subtracts from it the probability value output from the other node. If the difference is positive, it determines that the media information is information in the learned field. If the difference is negative, the determination means 16 determines that the media information is not information in the learned field.
In this way, the determination means 16 can determine whether or not the media information is information in the learned field. The determination means 16 outputs this determination result to the outside.
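The subtraction rule used by the determination means 16 can be written as a one-line check. The function name is hypothetical, and the text does not say what happens when the two probabilities are equal, so treating a difference of exactly zero as "not in the field" is an assumption of this sketch.

```python
def is_learned_field(p_field, p_not_field):
    """True when P(field) minus P(not field) is positive, as in the FIG. 4 example."""
    # A difference of exactly 0 is treated as "not in the field" (assumption).
    return (p_field - p_not_field) > 0
```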

The storage unit 20 includes the distributed representation vector storage means 21, the feature extraction model storage means 22, and the information determination model storage means 23. The storage unit 20 stores various data used or generated in the operation of the information determination device 1.
Each of these storage means can be configured with a general storage device such as a hard disk or a semiconductor memory. Although the storage means are provided individually in the storage unit 20 here, the storage area of a single storage device may be divided into a plurality of areas that serve as the respective storage means. The storage unit 20 may also be an external storage device and be omitted from the configuration of the information determination device 1.

The distributed representation vector storage means 21 stores distributed representation vectors in association with words. Here, the distributed representation vectors are stored in the distributed representation vector storage means 21 in advance, but the information determination device 1 may instead include, in the control unit 10, a distributed representation vector generation means (not shown). In that case, the distributed representation vector generation means generates a distributed representation vector for each word from a large amount of learning data, such as existing media information, by word2vec or the like, and stores it in the distributed representation vector storage means 21.

The feature extraction model storage means 22 stores the feature extraction model learned by the feature extraction model learning means 13. The feature extraction model stored in the feature extraction model storage means 22 is referred to by the feature extraction means 14.

The information determination model storage means 23 stores the information determination model learned by the information determination model learning means 15. The information determination model stored in the information determination model storage means 23 is referred to by the determination means 16.

By configuring the information determination device 1 as described above, the information determination device 1 can learn, from news manuscripts, a feature extraction model that extracts the features of manuscript sentences in the news field. The information determination device 1 can also learn the information determination model from teacher data, namely media information for which it is known whether or not it is information in a predetermined field (here, the news field).
The information determination device 1 can then use the information determination model to determine whether or not unknown media information is information in the learned field.
The information determination device 1 can be operated by a program (information determination program) that causes a general computer to function as each means of the control unit 10 described above.

[Operation of the information determination device]
Next, the operation of the information determination device 1 according to the embodiment of the present invention will be described with reference to FIGS. 5 to 7. It is assumed that distributed representation vectors are stored in advance in the distributed representation vector storage means 21 in association with words. Here, the operation of the information determination device 1 is described separately for the first learning mode, the second learning mode, and the evaluation mode.

(First learning mode)
First, the operation of the first learning mode, in which the information determination device 1 learns the feature extraction model, will be described with reference to FIG. 5 (see FIG. 1 as appropriate for the configuration).
In step S1, the word dividing means 11 of the information determination device 1 receives news manuscripts (for example, news titles), which are text data, one manuscript at a time.
In step S2, the word dividing means 11 of the information determination device 1 divides the manuscript sentence input in step S1 into words by morphological analysis.

In step S3, the vectorization means 12 of the information determination device 1 reads the distributed representation vectors corresponding to the words divided in step S2 from the distributed representation vector storage means 21 and adds them together over all of the words.
In step S4, the vectorization means 12 of the information determination device 1 divides the summed distributed representation vector from step S3 by the number of words in the manuscript sentence, thereby generating a normalized vector for each manuscript sentence (the sentence-unit distributed representation vector).
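Steps S3 and S4 amount to averaging the word vectors of a sentence. A minimal sketch follows, in which the dictionary `embeddings` stands in for the distributed representation vector storage means 21 and the function name is hypothetical. The text does not say how words missing from the storage are handled, so this sketch simply assumes that every word is present.

```python
def sentence_vector(words, embeddings):
    """Average the word vectors of one sentence (steps S3 and S4)."""
    vectors = [embeddings[w] for w in words]  # assumes every word is stored
    dim = len(vectors[0])
    summed = [0.0] * dim
    for vec in vectors:  # S3: add the vectors over all words
        for i, value in enumerate(vec):
            summed[i] += value
    return [s / len(vectors) for s in summed]  # S4: divide by the number of words


# Toy two-dimensional vectors, made up for illustration only.
toy_embeddings = {"train": [1.0, 0.0], "delayed": [0.0, 1.0]}
print(sentence_vector(["train", "delayed"], toy_embeddings))  # [0.5, 0.5]
```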

In step S5, the feature extraction model learning means 13 of the information determination device 1 learns the feature extraction model by using the sentence-unit distributed representation vector generated in step S4 both as the input to the input layer A_L1 of the autoencoder AE shown in FIG. 3 and as the output from the output layer A_L3.

In step S6, the feature extraction model learning means 13 of the information determination device 1 determines whether or not learning has finished, based on whether learning has been performed a predetermined number of times or the parameter error of the feature extraction model has converged.
If it is determined in step S6 that learning has not finished (No), the information determination device 1 returns to step S1 and continues the learning operation.
If it is determined in step S6 that learning has finished (Yes), the information determination device 1 writes the learned feature extraction model to the feature extraction model storage means 22 in step S7.

Through the above operation, the information determination device 1 can generate, using a large number of news manuscripts as teacher data, a feature extraction model that extracts the feature amounts indicating that text data is information in the news field.
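Steps S5 to S7 can be sketched as a small autoencoder trained to reproduce its own input, stopping after a fixed number of passes or once the error stops changing. Everything below is an assumption made for illustration: the tanh hidden layer, the squared-error loss, plain stochastic gradient descent, the reading of "parameter error converged" as a small change in the reconstruction error, and all names and default values. Only the encoder weights (input layer to hidden layer) are returned as the feature extraction model, matching the wording of the second claim.

```python
import math
import random


def encode(x, w_enc, b_enc):
    """Input layer to hidden layer: the compressed feature vector."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def train_autoencoder(data, hidden_dim, lr=0.1, max_epochs=500, tol=1e-6, seed=0):
    rng = random.Random(seed)
    n = len(data[0])
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(hidden_dim)]
    b_enc = [0.0] * hidden_dim
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(n)]
    b_dec = [0.0] * n
    history = []  # mean reconstruction error per epoch
    for _ in range(max_epochs):  # stop condition 1: predetermined number of passes
        total = 0.0
        for x in data:
            h = encode(x, w_enc, b_enc)
            y = [sum(w * hj for w, hj in zip(row, h)) + b
                 for row, b in zip(w_dec, b_dec)]
            err = [yi - xi for yi, xi in zip(y, x)]  # the target is the input itself
            total += 0.5 * sum(e * e for e in err)
            # Backpropagate: hidden-layer gradient first, using the old decoder weights.
            d_h = [sum(err[i] * w_dec[i][j] for i in range(n)) * (1.0 - h[j] ** 2)
                   for j in range(hidden_dim)]
            for i in range(n):
                for j in range(hidden_dim):
                    w_dec[i][j] -= lr * err[i] * h[j]
                b_dec[i] -= lr * err[i]
            for j in range(hidden_dim):
                for i in range(n):
                    w_enc[j][i] -= lr * d_h[j] * x[i]
                b_enc[j] -= lr * d_h[j]
        history.append(total / len(data))
        # Stop condition 2: the error has converged (change below tol).
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
    return w_enc, b_enc, history


def extract_features(vs, w_enc, b_enc):
    """Use of the stored model by the feature extraction means: Vs in, Vf out."""
    return encode(vs, w_enc, b_enc)
```

Choosing `hidden_dim` smaller than the input dimension is what makes the hidden layer a compressed representation of the sentence vector.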

(Second learning mode)
Next, the operation of the second learning mode, in which the information determination device 1 learns the information determination model, will be described with reference to FIG. 6 (see FIG. 1 as appropriate for the configuration). The second learning mode is performed after the operation of the first learning mode described with reference to FIG. 5.
In step S10, the word dividing means 11 of the information determination device 1 receives media information (teacher data) whose field is known to be news, one post at a time.
In step S11, the word dividing means 11 of the information determination device 1 divides the posted sentence input in step S10 into words by morphological analysis.

In step S12, the vectorization means 12 of the information determination device 1 reads the distributed representation vectors corresponding to the words divided in step S11 from the distributed representation vector storage means 21 and adds them together over all of the words.
In step S13, the vectorization means 12 of the information determination device 1 divides the summed distributed representation vector from step S12 by the number of words in the posted sentence, thereby generating a normalized vector for each posted sentence (the sentence-unit distributed representation vector).

In step S14, the feature extraction means 14 of the information determination device 1 uses the feature extraction model stored in the feature extraction model storage means 22 to extract a feature vector, which is the feature amount, from the sentence-unit distributed representation vector generated in step S13, and takes it as the output of the feature extraction model.
In step S15, the information determination model learning means 15 of the information determination device 1 trains the information determination model by supervised learning, using the sentence-unit distributed representation vector of the teacher data generated in step S13 and the feature vector generated in step S14 as the input to the input layer F_L1 of the FFNN shown in FIG. 4.

In step S16, the information determination model learning means 15 of the information determination device 1 determines whether or not learning has finished, based on whether learning with the teacher data has been performed a predetermined number of times or the parameter error of the information determination model has converged.
If it is determined in step S16 that learning has not finished (No), the information determination device 1 returns to step S10 and continues the learning operation.
If it is determined in step S16 that learning has finished (Yes), the information determination device 1 writes the learned information determination model to the information determination model storage means 23 in step S17.

Through the above operation, the information determination device 1 can generate, from the teacher data, an information determination model for determining whether or not media information whose field is unknown is information in the news field.

(Evaluation mode)
Next, the operation of the evaluation mode of the information determination device 1 will be described with reference to FIG. 7 (see FIG. 1 as appropriate for the configuration). The evaluation mode is performed after the operation of the second learning mode described with reference to FIG. 6.
Steps S20 to S24 are the same as steps S10 to S14 described with reference to FIG. 6, except that the input is media information whose field is unknown rather than media information that is teacher data, so their description is omitted.

In step S25, the determination means 16 of the information determination device 1 uses the information determination model stored in the information determination model storage means 23 to determine, from the sentence-unit distributed representation vector and the feature vector, whether or not the media information whose field is unknown is information in the field learned in the learning modes. The sentence-unit distributed representation vector is the one generated in step S23, and the feature vector is the one generated in step S24.
In step S26, the determination means 16 of the information determination device 1 outputs the result determined in step S25 to the outside.

In step S27, the information determination device 1 determines whether or not the evaluation mode has ended, based on whether further media information is input.
If further media information is input in step S27 and the evaluation mode has not ended (No), the information determination device 1 returns to step S20 and continues the determination operation.
If no new media information is input in step S27 and the evaluation mode has ended (Yes), the information determination device 1 ends the operation.
Through the above operation, the information determination device 1 can determine whether or not unknown media information is information in the field learned in the learning modes.
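The evaluation-mode flow of FIG. 7 can be summarized as a loop over incoming posts. In this sketch the four processing stages are passed in as functions, and all of the names are hypothetical. The stand-ins used at the bottom are toy placeholders rather than a morphological analyzer or trained models. A difference of exactly zero is treated as "not in the field", which the text leaves unspecified.

```python
def evaluate_posts(posts, tokenize, vectorize, extract_features, classify):
    """Run steps S20 to S27 over a sequence of posts and collect the decisions."""
    results = []
    for post in posts:  # S20: next post; the loop ends when no input remains (S27)
        words = tokenize(post)  # S21: word division
        vs = vectorize(words)  # S22, S23: sentence-unit vector Vs
        vf = extract_features(vs)  # S24: feature vector Vf
        p_field, p_other = classify(vs, vf)  # S25: two output probabilities
        results.append((post, (p_field - p_other) > 0))  # S26: output the decision
    return results


# Toy stand-ins, for illustration only.
decisions = evaluate_posts(
    ["train delayed again", "lunch"],
    tokenize=str.split,
    vectorize=lambda words: [float(len(words))],
    extract_features=lambda vs: [vs[0] * 2.0],
    classify=lambda vs, vf: (0.9, 0.1) if vs[0] >= 3 else (0.2, 0.8),
)
print(decisions)  # [('train delayed again', True), ('lunch', False)]
```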

The configuration and operation of the information determination device 1 according to the embodiment of the present invention have been described above, but the present invention is not limited to this embodiment.
Here, a single device, the information determination device 1, performs two operations: the learning operation (first learning mode and second learning mode), which learns the feature extraction model and the information determination model, and the determination operation (evaluation mode), which uses the feature extraction model and the information determination model to determine whether or not unknown media information is information in the learned field.
However, these operations may be performed by separate devices.

Specifically, a device that performs the learning operation of learning the feature extraction model and the information determination model can be configured as the model learning device 3 shown in FIG. 8.
As shown in FIG. 8, the model learning device 3 may be configured by omitting the determination means 16 from the information determination device 1 described with reference to FIG. 1. This configuration performs only the learning operation of learning the feature extraction model and the information determination model, in the same way as the information determination device 1 described with reference to FIG. 1. The operation of the model learning device 3 is the same as the operation described with reference to FIGS. 5 and 6.
The model learning device 3 can be operated by a program (model learning program) that causes a computer to function as each of the means described above.

Similarly, a device that performs the determination operation, which uses the feature extraction model and the information determination model to determine whether or not unknown media information is information in the learned field, can be configured as the information determination device 1B shown in FIG. 9.
As shown in FIG. 9, the information determination device 1B may be configured by omitting the feature extraction model learning means 13 and the information determination model learning means 15 from the information determination device 1 described with reference to FIG. 1. This configuration performs only the determination operation of determining whether or not unknown media information is information in the field learned in advance, in the same way as the information determination device 1 described with reference to FIG. 1. The operation of the information determination device 1B is the same as the operation described with reference to FIG. 7.
The information determination device 1B can be operated by a program (information determination program) that causes a computer to function as each of the means described above.
By performing the learning operation and the determination operation on different devices in this way, the feature extraction model and the information determination model learned by one model learning device 3 can be used by a plurality of information determination devices 1B.
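One way to picture the split between the model learning device 3 and several information determination devices 1B is to serialize the two learned models once and load the same data on each determination device. The text does not describe any storage or transfer format, so the use of JSON, the key names, and the function names below are purely illustrative assumptions.

```python
import json


def export_models(feature_extraction_model, information_determination_model):
    """On the learning side: pack both learned parameter sets into one string."""
    return json.dumps({
        "feature_extraction_model": feature_extraction_model,
        "information_determination_model": information_determination_model,
    })


def import_models(serialized):
    """On each determination side: restore both parameter sets."""
    payload = json.loads(serialized)
    return (payload["feature_extraction_model"],
            payload["information_determination_model"])
```

The same exported string can be passed to `import_models` on any number of devices, each of which then runs only the evaluation-mode steps.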

Here, the information determination model learned by the information determination model learning means 15 is a neural network trained by supervised learning. However, other general machine learning methods can be used for this supervised learning, for example a support vector machine (SVM) or conditional random fields (CRF).

Also, here the word dividing means 11 divides the input news manuscripts (manuscript sentences) and media information (posted sentences) into words by morphological analysis. However, when the input to the information determination device 1 has already been divided into words, the word dividing means 11 can be omitted from the configuration of the information determination device 1.

1, 1B Information determination device
11 Word dividing means
12 Vectorization means
13 Feature extraction model learning means
14 Feature extraction means
15 Information determination model learning means
16 Determination means
21 Distributed representation vector storage means
22 Feature extraction model storage means
23 Information determination model storage means
3 Model learning device

Claims

A model learning device that learns, from a plurality of manuscript sentences whose field is known, a feature extraction model for extracting feature amounts of the information contained in the manuscript sentences, and that learns, using posted sentences that are social media information whose field is known as teacher data, an information determination model for determining whether or not a posted sentence to be judged is information indicating the field, the model learning device comprising:
a vectorization means that generates a sentence-unit distributed representation vector, which is a distributed representation vector per manuscript or per post, by averaging the word-unit distributed representation vectors stored in advance in a storage means for the manuscript sentence or for the posted sentence that is the teacher data;
a feature extraction model learning means that learns a feature extraction model that extracts a feature vector by compressing the dimensions of the sentence-unit distributed representation vector generated from the manuscript sentence;
a feature extraction means that extracts, using the feature extraction model, the feature vector from the sentence-unit distributed representation vector generated from the teacher data; and
an information determination model learning means that learns the information determination model by machine learning, taking as input the feature vector for the teacher data extracted by the feature extraction means and the sentence-unit distributed representation vector of the teacher data from which that feature vector was extracted.

The model learning device according to claim 1, wherein the feature extraction model learning means learns parameters from an input layer to a hidden layer as the feature extraction model by an autoencoder which is a neural network.

An information determination device that uses the feature extraction model and the information determination model learned by the model learning device according to claim 1 or 2 to determine whether or not unknown data, which is a posted sentence of social media information whose field is unknown, is information indicating the learned field, the information determination device comprising:
a vectorization means that generates a sentence-unit distributed representation vector, which is a distributed representation vector per post, by averaging the word-unit distributed representation vectors stored in advance in a storage means for the unknown data;
a feature extraction means that extracts, using the feature extraction model, a feature vector from the sentence-unit distributed representation vector generated by the vectorization means; and
a determination means that determines, using the information determination model, whether or not the unknown data is information indicating the learned field from the sentence-unit distributed representation vector generated by the vectorization means and the feature vector extracted by the feature extraction means.

An information determination device that learns, from a plurality of manuscript sentences whose field is known, a feature extraction model for extracting feature amounts of the information contained in the manuscript sentences, learns, using posted sentences that are social media information whose field is known as teacher data, an information determination model for determining whether or not a posted sentence to be judged is information indicating the field, and determines whether or not unknown data, which is a posted sentence of social media information whose field is unknown, is information indicating the learned field, the information determination device comprising:
a vectorization means that generates a sentence-unit distributed representation vector, which is a distributed representation vector per manuscript or per post, by averaging the word-unit distributed representation vectors stored in advance in a storage means for the manuscript sentence, the teacher data, or the unknown data;
a feature extraction model learning means that learns a feature extraction model that extracts a feature vector by compressing the dimensions of the sentence-unit distributed representation vector generated from the manuscript sentence;
a feature extraction means that extracts, using the feature extraction model, the feature vector from the sentence-unit distributed representation vector generated from the teacher data or the unknown data;
an information determination model learning means that learns the information determination model by machine learning, taking as input the feature vector for the teacher data extracted by the feature extraction means and the sentence-unit distributed representation vector of the teacher data from which that feature vector was extracted; and
a determination means that determines, using the information determination model, whether or not the unknown data is information indicating the learned field from the sentence-unit distributed representation vector generated from the unknown data by the vectorization means and the feature vector extracted from that sentence-unit distributed representation vector by the feature extraction means.

A model learning program for operating a computer as the model learning device according to claim 1 or 2.

An information determination program for causing a computer to function as the information determination device according to claim 3 or 4.