JP2018151892A

JP2018151892A - Model learning apparatus, information determination apparatus, and program therefor

Info

Publication number: JP2018151892A
Application number: JP2017048039A
Authority: JP
Inventors: Taro Miyazaki; Atsushi Goto; Yuka Takei
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2017-03-14
Filing date: 2017-03-14
Publication date: 2018-09-27
Anticipated expiration: 2037-03-14
Also published as: JP6839001B2

Abstract

PROBLEM TO BE SOLVED: To provide an information determination apparatus that determines whether social media information belongs to a given field.

SOLUTION: An information determination apparatus 1 includes: feature extraction model learning means 13 that learns a feature extraction model from sentence-level distributed representation vectors generated from a plurality of manuscripts whose field is known; information determination model learning means 15 that learns an information determination model by machine learning from the sentence-level distributed representation vectors generated from posted messages whose field is known, together with the feature vectors extracted from those vectors by the feature extraction model; and determination means 16 that applies the information determination model to the sentence-level distributed representation vector generated from a posted message whose field is unknown and to the feature vector extracted from that vector by the feature extraction model.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an information determination technique for determining whether social media information belongs to a given field.

In recent years, the growth of social networking services (SNS) has made it possible for individuals to publish information easily and in real time. For example, many eyewitness reports are posted to SNS by people who happen to be at the scene of a fire, an accident, or a similar event. Some of these posts include photos or videos from the scene, and they are often used in news programs as images and footage capturing the moment the fire or accident occurred. Broadcasters therefore extract this information by manual work, such as monitoring SNS around the clock.

Manual extraction of the necessary information from SNS often relies on keyword search. Posts, however, describe the same kind of event in many different ways: some give the line name, as in "the ○○ line is delayed", while others give the station name, as in "trains are delayed because of an accident at △△ station". It is therefore difficult to create keywords that cover all of these expressions.

To address these problems, many methods for extracting posts by machine learning have been studied. For example, one disclosed method uses a neural network to learn words and phrases that can become dangerous expressions depending on a specific theme, and extracts such words and phrases from the social big data of SNS (see Patent Document 1). Another disclosed method calculates the degree of association between the n-grams of SNS posts and weather conditions, and extracts useful posts related to weather events by machine learning (see Non-Patent Document 1).

Patent Document 1: Japanese Patent Laid-Open No. 2015-72614

Non-Patent Document 1: Masatsugu Hangyo (萩行正嗣), "Extracting useful posts from social media using selective weather information", Proceedings of the 22nd Annual Meeting of the Association for Natural Language Processing, pp. 397-400, March 2016

In conventional machine-learning methods, learning accuracy depends heavily on the amount of training data, so a large amount of training data must be prepared to raise accuracy. Creating training data, however, requires each piece of social media information (posted text) to be sorted into correct (positive) or incorrect (negative) examples. This incurs costs in labor and working time.

An object of the present invention is therefore to provide a model learning apparatus, an information determination apparatus, and programs for them that determine whether social media information belongs to the same field as news manuscripts or similar text, using not only training data made by sorting social media information but also the characteristics of text data whose field is known, such as news manuscripts.

To solve the above problem, a model learning apparatus according to the present invention learns a feature extraction model for extracting features from a plurality of manuscript texts whose field is known and, using posted texts (social media information whose field is known) as teacher data, learns an information determination model for determining whether a posted text under evaluation is information indicating that field. The apparatus comprises a vectorization unit, a feature extraction model learning unit, a feature extraction unit, and an information determination model learning unit.

In this configuration, the vectorization unit of the model learning apparatus averages the word-level distributed representation vectors, stored in advance in a storage unit, over the words of a manuscript text or of a posted text serving as teacher data. It thereby generates a sentence-level distributed representation vector, that is, a distributed representation vector for the manuscript or post as a whole. The word-level distributed representation vectors are numerical vectors assigned from the distribution of words so that words with more similar meanings have closer vectors. They can be learned and generated by a method such as word2vec.
The vectorization unit thus generates, for a manuscript text or posted text, a vector that reflects its semantic content.

The feature extraction model learning unit then learns a feature extraction model that extracts a feature vector by compressing the dimensionality of the sentence-level distributed representation vectors generated from the manuscript texts. This feature extraction model can be configured as the feature-extracting part of a neural network autoencoder.
The feature extraction unit then uses the feature extraction model to extract a feature vector from the sentence-level distributed representation vector generated from the teacher data.

The information determination model learning unit then learns the information determination model by machine learning, taking as input the feature vector that the feature extraction unit extracted for the teacher data and the sentence-level distributed representation vector of the teacher data from which that feature vector was extracted.
By using the feature vector extracted with the feature extraction model learned from manuscript texts, in addition to the sentence-level distributed representation vector of the teacher data, the information determination model learning unit can improve learning with past manuscripts such as news manuscripts.
The model learning apparatus can be operated by a model learning program that causes a computer to function as each of the units described above.

To solve the above problem, an information determination apparatus according to the present invention uses the feature extraction model and the information determination model learned by the model learning apparatus to determine whether unknown data, that is, a posted text of social media information whose field is unknown, is information indicating the learned field. The apparatus comprises a vectorization unit, a feature extraction unit, and a determination unit.

In this configuration, the vectorization unit of the information determination apparatus averages the word-level distributed representation vectors, stored in advance in a storage unit, over the words of the unknown data, and thereby generates a sentence-level distributed representation vector for the post.
The feature extraction unit then uses the feature extraction model to extract a feature vector from the sentence-level distributed representation vector generated by the vectorization unit. Because the feature extraction model has been pre-trained on manuscripts whose field is known, the feature vector generated from the unknown data reflects the features that should be extracted from information about that field.
The determination unit then uses the information determination model to determine, from the sentence-level distributed representation vector generated by the vectorization unit and the feature vector extracted by the feature extraction unit, whether the unknown data is information indicating the learned field.

To solve the above problem, another information determination apparatus according to the present invention learns a feature extraction model for extracting features from a plurality of manuscript texts whose field is known; learns, using posted texts (social media information whose field is known) as teacher data, an information determination model for determining whether a posted text under evaluation is information indicating that field; and determines whether unknown data, that is, a posted text of social media information whose field is unknown, is information indicating the learned field. The apparatus comprises a vectorization unit, a feature extraction model learning unit, a feature extraction unit, an information determination model learning unit, and a determination unit.
The information determination apparatus can be operated by an information determination program that causes a computer to function as each of the units described above.

The present invention provides the following advantageous effects.
According to the present invention, the characteristics of manuscript texts (text data) whose field is known can be used to determine whether social media information belongs to the same field.
The present invention can therefore improve learning, and with it the accuracy of information determination, by drawing on a large existing body of manuscripts without having to learn from a large amount of SNS social media information.
The present invention thus makes it possible to use the social big data that individuals publish on SNS effectively as a source of information such as news.

FIG. 1 is a block diagram showing the configuration of an information determination apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram of the processing performed by the vectorization unit: (a) shows an example of dividing media information into words, and (b) shows an example of calculating the distributed representation vector of a manuscript text or posted text from the distributed representation vectors of its words.
FIG. 3 is an explanatory diagram of the structure of the feature extraction model learned by the feature extraction model learning unit.
FIG. 4 is an explanatory diagram of the structure of the information determination model learned by the information determination model learning unit.
FIG. 5 is a flowchart showing the operation of the information determination apparatus according to the embodiment in the first learning mode.
FIG. 6 is a flowchart showing the operation of the information determination apparatus according to the embodiment in the second learning mode.
FIG. 7 is a flowchart showing the operation of the information determination apparatus according to the embodiment in the evaluation mode.
FIG. 8 is a block diagram showing the configuration of a model learning apparatus according to another embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of an information determination apparatus according to another embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
[Configuration of the information determination apparatus]
First, the configuration of the information determination apparatus 1 according to an embodiment of the present invention is described with reference to FIG. 1.

The information determination apparatus 1 consists of a control unit 10 and a storage unit 20.
The information determination apparatus 1 determines whether information published on SNS (per-post text data such as a tweet [registered trademark]) is information in a specific field.

As shown in FIG. 1, the control unit 10 includes a word segmentation unit 11, a vectorization unit 12, a feature extraction model learning unit 13, a feature extraction unit 14, an information determination model learning unit 15, and a determination unit 16.
The control unit 10 controls the operation of the information determination apparatus 1 and operates in three modes. The first is the first learning mode, in which a feature extraction model for extracting features is learned from information (text data) whose field is known. The second is the second learning mode, in which an information determination model is learned from social media information whose field is known (hereinafter simply "media information"); this model determines whether unknown media information is information in that field. The third is the evaluation mode, in which the apparatus determines whether unknown media information is information in the learned field.

In this embodiment, "news" that can be used in news programs and the like is taken as the example field. The field may of course be something other than news, for example sports or music.

In this embodiment, the control unit 10 receives a large number (for example, several hundred thousand) of news manuscripts as input in the first learning mode. For example, it receives news titles, one manuscript at a time.
In the second learning mode, the control unit 10 receives a large number (for example, tens of thousands) of pieces of media information whose field is known to be news. For example, it receives tweets [registered trademark], one post at a time.
In the evaluation mode, the control unit 10 receives media information whose field is unknown and outputs the result of determining whether it is information in the news field.

The word segmentation unit 11 divides text data, either a news manuscript (manuscript text) or media information (posted text), into words. Specifically, it divides the text data into words by morphological analysis and outputs them to the vectorization unit 12.

The vectorization unit 12 vectorizes the manuscript text or posted text from the words produced by the word segmentation unit 11. Using the per-word distributed representation vectors stored in advance in the distributed representation vector storage unit 21, it averages the vectors of the words that make up the manuscript text or posted text and generates a distributed representation vector for the manuscript or post as a whole.
A distributed representation vector represents a word as a numerical vector of finite, high dimensionality (for example, 200 dimensions) in which words with similar meanings (similar distributional characteristics) are mapped to nearby vectors. These vectors are generated by a common method such as word2vec or GloVe (Global Vectors for Word Representation).

The vectorization unit 12 reads the distributed representation vectors corresponding to the segmented words and adds them together. It then normalizes the sum by dividing it by the number of words in the manuscript text or posted text, which yields the distributed representation vector of that text (the sentence-level distributed representation vector).
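The averaging step described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `sentence_vector`, the dictionary `embeddings`, and the choice to treat a word missing from the dictionary as a zero vector are assumptions made for the example; the source does not say how unknown words are handled.

```python
from typing import Dict, List


def sentence_vector(words: List[str],
                    embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the word vectors of `words` element by element."""
    if not words:
        raise ValueError("cannot vectorize an empty sentence")
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        # Assumption: a word without a stored vector contributes zeros.
        vec = embeddings.get(word, [0.0] * dim)
        for i, value in enumerate(vec):
            total[i] += value
    # Divide the sum by the number of words in the sentence.
    return [t / len(words) for t in total]
```

Called with the words produced by morphological analysis and a table of pre-trained word vectors, the function returns one vector per manuscript or post with the same dimensionality as the word vectors.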

In the first learning mode, the vectorization unit 12 outputs the generated sentence-level distributed representation vector to the feature extraction model learning unit 13. In the second learning mode, it outputs the vector to the feature extraction unit 14 and the information determination model learning unit 15. In the evaluation mode, it outputs the vector to the feature extraction unit 14 and the determination unit 16.
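The mode-dependent routing of the sentence-level vector can be written down as a small lookup table. The sketch below is only illustrative; the `Mode` enum, the string labels, and the `destinations` function are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Tuple


class Mode(Enum):
    FIRST_LEARNING = 1
    SECOND_LEARNING = 2
    EVALUATION = 3


# Units that receive the sentence-level vector from the vectorization unit 12.
ROUTES = {
    Mode.FIRST_LEARNING: ("feature_extraction_model_learning_unit_13",),
    Mode.SECOND_LEARNING: ("feature_extraction_unit_14",
                           "information_determination_model_learning_unit_15"),
    Mode.EVALUATION: ("feature_extraction_unit_14",
                      "determination_unit_16"),
}


def destinations(mode: Mode) -> Tuple[str, ...]:
    """Return the units the vector is forwarded to in the given mode."""
    return ROUTES[mode]
```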

The sentence-level distributed representation vector is now described with reference to FIG. 2 (and to FIG. 1 as needed).
As shown in FIG. 2(a), when the text data (a manuscript text or posted text) is, for example, 「○○線が事故で遅れている。」 ("The ○○ line is delayed because of an accident."), the word segmentation unit 11 divides it word by word into 「○○／線／が／事故／で／遅れ／て／いる／。」, which gives nine words.

For each of the resulting words, the vectorization unit 12 then reads the corresponding distributed representation vector from the distributed representation vector storage unit 21. For example, as shown in FIG. 2(b), it reads the n-dimensional (for example, 200-dimensional) distributed representation vector "0.1, 0.3, 0.4, 0.1, 0.8, 0.9, 0.2, ..., 0.9" corresponding to the word "○○".
The vectorization unit 12 then adds up the distributed representation vectors of all the words that make up the text data to obtain the all-word total ("7.2, 1.8, 2.7, 3.6, 3.6, 7.2, 4.5, ..., 6.3" in the example of FIG. 2(b)).

The vectorization unit 12 then divides the all-word total by the number of words in the text data (nine in the example of FIG. 2) to obtain the sentence-level distributed representation vector ("0.8, 0.2, 0.3, 0.4, 0.4, 0.8, 0.5, ..., 0.7" in the example of FIG. 2(b)).
In this way, the vectorization unit 12 generates a sentence-level distributed representation vector for each manuscript text or posted text.
Returning to FIG. 1, the description of the configuration of the information determination apparatus 1 continues.

In the first learning mode, the feature extraction model learning unit 13 learns a feature extraction model for extracting features from the sentence-level distributed representation vectors generated by the vectorization unit 12. The features of a sentence-level distributed representation vector are obtained by compressing its dimensionality.

Specifically, the feature extraction model learning unit 13 learns the feature extraction model M_1, which extracts features (a feature vector), by means of an autoencoder AE: a neural network consisting of the input layer A_L1, hidden layer A_L2, and output layer A_L3 shown in FIG. 3.
In the autoencoder AE of FIG. 3, the sentence-level distributed representation vector Vs generated by the vectorization unit 12 is input to the input layer A_L1. The input layer A_L1 and the output layer A_L3 have the same number of dimensions as the sentence-level distributed representation vector (for example, 200). The hidden layer A_L2 has fewer dimensions than the input layer A_L1 and the output layer A_L3 (for example, 100).

In the autoencoder AE, the feature extraction model learning unit 13 learns parameters, such as the coefficients of the encoding formula from the input layer A_L1 to the hidden layer A_L2 and of the decoding formula from the hidden layer A_L2 to the output layer A_L3, so that the output layer A_L3 reproduces the same sentence-level distributed representation vector Vs that is given to the input layer A_L1. Because the intermediate hidden layer A_L2 has fewer dimensions than the input layer A_L1 and the output layer A_L3, the autoencoder AE can extract, at the hidden layer A_L2, features in which the dimensionality of the sentence-level distributed representation vector has been compressed.

The feature extraction model learning unit 13 trains the autoencoder AE so that each sentence-level distributed representation vector input in turn is the same at the input layer A_L1 and the output layer A_L3, and learns the parameters from the input layer A_L1 to the hidden layer A_L2 as the feature extraction model M_1. The autoencoder AE is trained with, for example, backpropagation.
The feature extraction model learning unit 13 ends learning once it has trained on the news manuscripts a predetermined number of times or once the parameter error has converged to within a predetermined tolerance.
The feature extraction model learning unit 13 writes the learned feature extraction model M_1 to the feature extraction model storage unit 22.
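The training loop described above can be illustrated with a small pure-Python autoencoder. This is a sketch under stated assumptions rather than the method of the source: the tanh hidden activation, the absence of bias terms, the squared-error loss, the learning rate, and the use of mean reconstruction error as the convergence test are all choices made for the example, as are the function names. Only the encoder weights `w_enc` correspond to what the text calls the feature extraction model M_1.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def encode(w_enc: Matrix, x: List[float]) -> List[float]:
    """Input layer -> hidden layer (the part kept as the feature extractor)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_enc]


def decode(w_dec: Matrix, h: List[float]) -> List[float]:
    """Hidden layer -> output layer (used only during training)."""
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_dec]


def reconstruction_error(w_enc: Matrix, w_dec: Matrix,
                         data: List[List[float]]) -> float:
    """Mean squared difference between each input and its reconstruction."""
    total = 0.0
    for x in data:
        y = decode(w_dec, encode(w_enc, x))
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(x)
    return total / len(data)


def train_autoencoder(data: List[List[float]], hidden_dim: int,
                      epochs: int = 200, lr: float = 0.05,
                      tol: float = 1e-4, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Train by backpropagation; stop after `epochs` passes or at tolerance."""
    rng = random.Random(seed)
    dim = len(data[0])
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
             for _ in range(hidden_dim)]
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
             for _ in range(dim)]
    for _ in range(epochs):
        for x in data:
            h = encode(w_enc, x)
            y = decode(w_dec, h)
            err = [yi - xi for yi, xi in zip(y, x)]  # gradient of 0.5*sum(err^2)
            # Error propagated back to the hidden layer (tanh' = 1 - h^2).
            grad_h = [sum(err[i] * w_dec[i][j] for i in range(dim)) * (1 - h[j] ** 2)
                      for j in range(hidden_dim)]
            for i in range(dim):
                for j in range(hidden_dim):
                    w_dec[i][j] -= lr * err[i] * h[j]
            for j in range(hidden_dim):
                for k in range(dim):
                    w_enc[j][k] -= lr * grad_h[j] * x[k]
        if reconstruction_error(w_enc, w_dec, data) < tol:
            break
    return w_enc, w_dec
```

After training, `encode(w_enc, vs)` maps a sentence-level vector to a shorter feature vector. In the embodiment the sizes would be 200 and 100; the sketch works for any sizes, and small ones keep a pure-Python run fast.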

In the second learning mode or the evaluation mode, the feature extraction unit 14 uses the feature extraction model stored in the feature extraction model storage unit 22 to extract a feature vector (features) from the sentence-level distributed representation vector generated by the vectorization unit 12.
Using the feature extraction model M_1 (see FIG. 3) stored in the feature extraction model storage unit 22, the feature extraction unit 14 calculates a vector with fewer dimensions (for example, 100) from the sentence-level distributed representation vector (for example, 200 dimensions) and extracts it as the feature vector.
In the second learning mode, the feature extraction unit 14 outputs the extracted feature vector to the information determination model learning unit 15. In the evaluation mode, it outputs the extracted feature vector to the determination unit 16.

In the second learning mode, the information determination model learning unit 15 learns, from the sentence-level distributed representation vectors generated by the vectorization unit 12, an information determination model that determines whether media information is information in a given field.

Specifically, the information determination model learning unit 15 learns the information determination model M_2 with a feedforward neural network (FFNN) consisting of the input layer F_L1, hidden layer F_L2, and output layer F_L3 shown in FIG. 4.
In the FFNN of FIG. 4, the sentence-level distributed representation vector Vs generated by the vectorization unit 12 and the feature vector Vf extracted with the feature extraction model M_1 are input to the input layer F_L1. In the hidden layer F_L2, the FFNN applies weights to the element values of the input vector (Vs combined with Vf) and propagates them, and the determination result is output from the output layer F_L3. The output layer F_L3 has, for example, two dimensions. One node outputs the normalized probability that the sentence-level distributed representation vector Vs is information in the field learned in the first learning mode; the other node outputs the normalized probability that it is not.
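A forward pass through a network of this shape might look like the sketch below. Several details are assumptions made for illustration: Vs and Vf are joined by concatenation (they have different dimensionalities, so element-wise addition is not possible), the hidden layer uses tanh, bias terms are included, and the "normalized probability" at the two output nodes is computed with a softmax. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def dense(weights: Matrix, bias: List[float], x: List[float]) -> List[float]:
    """Weighted sum plus bias for every node of a layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z: List[float]) -> List[float]:
    """Normalize scores so that they are positive and sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def judge(vs: List[float], vf: List[float],
          w_hidden: Matrix, b_hidden: List[float],
          w_out: Matrix, b_out: List[float]) -> List[float]:
    """Return [P(in field), P(not in field)] for one post."""
    x = vs + vf  # input layer: concatenation of Vs and Vf
    hidden = [math.tanh(a) for a in dense(w_hidden, b_hidden, x)]
    return softmax(dense(w_out, b_out, hidden))
```

With a 200-dimensional Vs and a 100-dimensional Vf, the input layer would have 300 nodes; the size of the hidden layer is not given in this part of the text.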

When the teacher data is a positive example, the information determination model learning means 15 learns the weights of each layer as the parameters of the information determination model M_2 so that the output of the one node, that is, the probability that the sentence unit distributed expression vector Vs is the vector of a posted sentence in the field learned in the first learning mode, becomes "1" and the output of the other node becomes "0". When the teacher data is a negative example, it learns the weights of each layer as the parameters of the information determination model M_2 so that the output of the one node becomes "0" and the output of the other node becomes "1". The FFNN is trained using, for example, error backpropagation.
The information determination model learning means 15 ends learning once learning with the teacher data has been performed a predetermined number of times or the parameter error has converged to within a predetermined error.
The information determination model learning means 15 writes the learned information determination model to the information determination model storage means 23 for storage.
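
The training procedure above can be sketched as follows, purely as an illustration. The targets (1, 0) for positive examples and (0, 1) for negative examples and the use of backpropagation follow the text. The tanh hidden layer, the softmax output with cross-entropy loss, the learning rate, and the use of the mean loss as the convergence measure are assumptions, and the names `train` and `forward` are hypothetical.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(x, params):
    """Return hidden activations and the two output probabilities."""
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = softmax([sum(w * hi for w, hi in zip(row, h)) + b
                 for row, b in zip(W2, b2)])
    return h, y


def train(samples, n_hidden=4, lr=0.2, max_epochs=500, tol=1e-3, seed=0):
    """samples: list of (input_vector, is_positive). Returns (params, epochs)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(2)]
    b2 = [0.0, 0.0]
    params = (W1, b1, W2, b2)
    for epoch in range(1, max_epochs + 1):  # stop after a fixed number of passes
        loss = 0.0
        for x, is_positive in samples:
            t = [1.0, 0.0] if is_positive else [0.0, 1.0]  # target outputs
            h, y = forward(x, params)
            loss -= sum(ti * math.log(yi + 1e-12) for ti, yi in zip(t, y))
            # Backpropagation: output error, then hidden-layer error
            dz2 = [yi - ti for yi, ti in zip(y, t)]
            dh = [sum(W2[k][j] * dz2[k] for k in range(2))
                  for j in range(n_hidden)]
            dz1 = [dh[j] * (1.0 - h[j] ** 2) for j in range(n_hidden)]
            for k in range(2):
                for j in range(n_hidden):
                    W2[k][j] -= lr * dz2[k] * h[j]
                b2[k] -= lr * dz2[k]
            for j in range(n_hidden):
                for i in range(n_in):
                    W1[j][i] -= lr * dz1[j] * x[i]
                b1[j] -= lr * dz1[j]
        if loss / len(samples) < tol:  # or stop once the error is small enough
            break
    return params, epoch


# Toy teacher data: two positive and two negative examples
samples = [([1.0, 0.0], True), ([0.9, 0.1], True),
           ([0.0, 1.0], False), ([0.1, 0.9], False)]
params, epochs = train(samples)
print(epochs, forward([1.0, 0.0], params)[1])
```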

In the evaluation mode, the determination means 16 determines whether media information is information in the field learned in the learning modes (the first learning mode and the second learning mode).
In the evaluation mode, the determination means 16 receives the sentence unit distributed expression vector from the vectorization means 12 and receives, from the feature extraction means 14, the feature vector extracted from that sentence unit distributed expression vector.

The determination means 16 uses the information determination model stored in the information determination model storage means 23 to determine whether the input sentence unit distributed expression vector and feature vector are vectors corresponding to information in the field learned in the learning modes. Specifically, the determination means 16 inputs the sentence unit distributed expression vector Vs and the feature vector Vf to the input layer F_L1 of the FFNN shown in FIG. 4 and makes the determination based on the result output from the output layer F_L3. In the example of FIG. 4, the determination means 16 takes the probability value output from the one node of the output layer F_L3, which is the probability that the input is information on an event in the learned field, and subtracts from it the probability value output from the other node. If the difference is positive, the determination means 16 determines that the media information is information in the learned field. If the difference is negative, the determination means 16 determines that the media information is not information in the learned field.
In this way, the determination means 16 can determine whether the media information is information in the learned field. The determination means 16 outputs this determination result to the outside.
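
The decision rule described above reduces to the sign of a difference. A minimal sketch follows; the function name `is_in_learned_field` is hypothetical, and because the text does not say how a difference of exactly zero is handled, treating it as "not in the field" is an assumption.

```python
def is_in_learned_field(p_in_field, p_not_in_field):
    """Subtract the second node's probability from the first; positive means in-field."""
    difference = p_in_field - p_not_in_field
    # Assumption: a difference of exactly zero is treated as "not in the field".
    return difference > 0


print(is_in_learned_field(0.7, 0.3))
print(is_in_learned_field(0.2, 0.8))
```

Because the two outputs are normalized to sum to 1, this rule is equivalent to checking whether the first node's probability exceeds 0.5.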

The storage unit 20 includes distributed expression vector storage means 21, feature extraction model storage means 22, and information determination model storage means 23. The storage unit 20 stores various data used or generated in the operation of the information determination apparatus 1.
Each of these storage means can be implemented with a general storage device such as a hard disk or a semiconductor memory. Although the storage means are provided individually in the storage unit 20 here, the storage area of a single storage device may instead be partitioned to serve as the respective storage means. The storage unit 20 may also be an external storage device and omitted from the configuration of the information determination apparatus 1.

The distributed expression vector storage means 21 stores distributed expression vectors in association with words. Here, the distributed expression vectors are stored in the distributed expression vector storage means 21 in advance, but the information determination apparatus 1 may instead include, in the control unit 10, distributed expression vector generation means (not shown). In that case, the distributed expression vector generation means generates a distributed expression vector for each word from a large amount of learning data, such as existing media information, using word2vec or the like, and stores it in the distributed expression vector storage means 21.

The feature extraction model storage means 22 stores the feature extraction model learned by the feature extraction model learning means 13. The feature extraction model stored in the feature extraction model storage means 22 is referred to by the feature extraction means 14.

The information determination model storage means 23 stores the information determination model learned by the information determination model learning means 15. The information determination model stored in the information determination model storage means 23 is referred to by the determination means 16.

By configuring the information determination apparatus 1 as described above, the information determination apparatus 1 can learn, from news manuscripts, a feature extraction model for extracting the features of manuscript sentences in the news field. The information determination apparatus 1 can also learn the information determination model from teacher data, that is, media information for which it is known whether it is information in a predetermined field (here, the news field).
The information determination apparatus 1 can then use the information determination model to determine whether unknown media information is information in the learned field.
The information determination apparatus 1 can be implemented by running, on a general computer, a program (information determination program) that causes the computer to function as each means of the control unit 10 described above.

[Operation of the Information Determination Apparatus]
Next, the operation of the information determination apparatus 1 according to the embodiment of the present invention will be described with reference to FIGS. 5 to 7. It is assumed that distributed expression vectors are stored in advance in the distributed expression vector storage means 21 in association with words. Here, the operation of the information determination apparatus 1 is described separately for the first learning mode, the second learning mode, and the evaluation mode.

(First learning mode)
First, the operation of the first learning mode, in which the information determination apparatus 1 learns the feature extraction model, will be described with reference to FIG. 5 (see FIG. 1 as appropriate for the configuration).
In step S1, the word division means 11 of the information determination apparatus 1 receives a news manuscript (for example, a news title), which is text data, one manuscript at a time.
In step S2, the word division means 11 of the information determination apparatus 1 divides the manuscript sentence input in step S1 into words by morphological analysis.

In step S3, the vectorization means 12 of the information determination apparatus 1 reads the distributed expression vectors corresponding to the words obtained in step S2 from the distributed expression vector storage means 21 and adds them together over all the words.
In step S4, the vectorization means 12 of the information determination apparatus 1 divides the distributed expression vector summed in step S3 by the number of words contained in the manuscript sentence, thereby generating a normalized vector for each manuscript sentence (the sentence unit distributed expression vector).
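
Steps S3 and S4 amount to averaging the word vectors of a sentence. The following sketch illustrates this under stated assumptions: the words are already split (morphological analysis is outside the sketch), the lookup table is a plain dictionary with toy two-dimensional vectors, and the name `sentence_vector` is hypothetical. The text does not say how words missing from the table are handled, so the sketch simply lets the lookup fail.

```python
def sentence_vector(words, word_vectors):
    """Sum the word vectors (step S3) and divide by the word count (step S4)."""
    if not words:
        raise ValueError("the sentence contains no words")
    dim = len(word_vectors[words[0]])
    total = [0.0] * dim
    for word in words:
        vec = word_vectors[word]  # KeyError if the word has no stored vector
        for i, value in enumerate(vec):
            total[i] += value
    return [value / len(words) for value in total]


# Toy lookup table standing in for the distributed expression vector storage
word_vectors = {"fire": [1.0, 0.0], "station": [0.0, 1.0]}
vs = sentence_vector(["fire", "station"], word_vectors)
print(vs)  # [0.5, 0.5]
```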

In step S5, the feature extraction model learning means 13 of the information determination apparatus 1 learns the feature extraction model by using the sentence unit distributed expression vector generated in step S4 both as the input to the input layer A_L1 of the autoencoder AE shown in FIG. 3 and as the output from the output layer A_L3.

In step S6, the feature extraction model learning means 13 of the information determination apparatus 1 determines whether learning is finished, based on whether learning has been performed a predetermined number of times or the parameter error of the feature extraction model has converged.
If it is determined in step S6 that learning is not finished (No), the information determination apparatus 1 returns to step S1 and continues the learning operation.
If it is determined in step S6 that learning is finished (Yes), the information determination apparatus 1 writes the learned feature extraction model to the feature extraction model storage means 22 in step S7.
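
Steps S5 to S7 can be illustrated with a small autoencoder that is trained to reproduce its own input and then keeps only the input-to-hidden parameters as the feature extraction model. This is a sketch under assumptions: the tanh hidden layer, linear output layer, squared-error loss, learning rate, stopping threshold, and the names `train_autoencoder` and `encode` are not taken from the text.

```python
import math
import random


def train_autoencoder(vectors, n_hidden, lr=0.1, max_epochs=200, tol=1e-4, seed=0):
    """Return ((W_enc, b_enc), loss_history) for a one-hidden-layer autoencoder."""
    rng = random.Random(seed)
    n = len(vectors[0])
    We = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n_hidden)]
    be = [0.0] * n_hidden
    Wd = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n)]
    bd = [0.0] * n
    history = []
    for _ in range(max_epochs):  # stop after a fixed number of passes
        loss = 0.0
        for x in vectors:
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(We, be)]
            y = [sum(w * hi for w, hi in zip(row, h)) + b
                 for row, b in zip(Wd, bd)]
            err = [yi - xi for yi, xi in zip(y, x)]  # target equals the input
            loss += sum(e * e for e in err) / n
            # Gradients of the squared error (constant factor folded into lr)
            dh = [sum(Wd[i][j] * err[i] for i in range(n))
                  for j in range(n_hidden)]
            dz = [dh[j] * (1.0 - h[j] ** 2) for j in range(n_hidden)]
            for i in range(n):
                for j in range(n_hidden):
                    Wd[i][j] -= lr * err[i] * h[j]
                bd[i] -= lr * err[i]
            for j in range(n_hidden):
                for i in range(n):
                    We[j][i] -= lr * dz[j] * x[i]
                be[j] -= lr * dz[j]
        history.append(loss / len(vectors))
        if history[-1] < tol:  # or stop once the error is small enough
            break
    return (We, be), history


def encode(x, model):
    """Feature extraction: input layer to hidden layer only."""
    We, be = model
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(We, be)]


# Toy sentence unit vectors (4 dimensions) compressed to 2 dimensions
vectors = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0],
           [1.0, 0.0, 1.0, 0.2], [0.0, 1.0, 0.1, 1.0]]
model, history = train_autoencoder(vectors, n_hidden=2)
vf = encode(vectors[0], model)
print(len(vf), history[0], history[-1])
```

Only the encoder half is returned, since the decoder is needed during learning but not when extracting feature vectors.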

Through the above operation, the information determination apparatus 1 can generate, using a large number of news manuscripts as teacher data, a feature extraction model that extracts feature amounts indicating that text data is information in the news field.

(Second learning mode)
Next, the operation of the second learning mode, in which the information determination apparatus 1 learns the information determination model, will be described with reference to FIG. 6 (see FIG. 1 as appropriate for the configuration). The second learning mode is performed after the operation of the first learning mode described with reference to FIG. 5.
In step S10, the word division means 11 of the information determination apparatus 1 receives media information (teacher data) whose field is known to be news, one post at a time.
In step S11, the word division means 11 of the information determination apparatus 1 divides the posted sentence input in step S10 into words by morphological analysis.

In step S12, the vectorization means 12 of the information determination apparatus 1 reads the distributed expression vectors corresponding to the words obtained in step S11 from the distributed expression vector storage means 21 and adds them together over all the words.
In step S13, the vectorization means 12 of the information determination apparatus 1 divides the distributed expression vector summed in step S12 by the number of words contained in the posted sentence, thereby generating a normalized vector for each posted sentence (the sentence unit distributed expression vector).

In step S14, the feature extraction means 14 of the information determination apparatus 1 uses the feature extraction model stored in the feature extraction model storage means 22 to extract, from the sentence unit distributed expression vector generated in step S13, a feature vector, which is the feature amount, as the output of the feature extraction model.
In step S15, the information determination model learning means 15 of the information determination apparatus 1 performs supervised learning of the information determination model, using as inputs to the input layer F_L1 of the FFNN shown in FIG. 4 the sentence unit distributed expression vector of the teacher data generated in step S13 and the feature vector generated in step S14.
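
Steps S14 and S15 can be pictured as assembling one training example per labeled post. In the sketch below, the combined input is formed by concatenating the two vectors, which is an assumed reading of how both are fed to the input layer. The name `build_training_examples` and the stand-in `toy_encode` function are hypothetical and only make the example self-contained; a trained feature extraction model would take the place of `toy_encode`.

```python
def build_training_examples(labeled_vectors, encode):
    """Pair each sentence vector with its feature vector and a two-node target."""
    examples = []
    for vs, is_positive in labeled_vectors:
        vf = encode(vs)                  # step S14: feature vector from Vs
        x = list(vs) + list(vf)          # assumption: Vs and Vf are concatenated
        target = [1.0, 0.0] if is_positive else [0.0, 1.0]
        examples.append((x, target))     # step S15: input and teacher signal
    return examples


def toy_encode(vs):
    """Stand-in for the feature extraction model: one feature, the mean."""
    return [sum(vs) / len(vs)]


labeled_vectors = [([1.0, 3.0], True), ([0.0, 2.0], False)]
examples = build_training_examples(labeled_vectors, toy_encode)
print(examples[0])  # ([1.0, 3.0, 2.0], [1.0, 0.0])
```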

In step S16, the information determination model learning means 15 of the information determination apparatus 1 determines whether learning is finished, based on whether learning with the teacher data has been performed a predetermined number of times or the parameter error of the information determination model has converged.
If it is determined in step S16 that learning is not finished (No), the information determination apparatus 1 returns to step S10 and continues the learning operation.
If it is determined in step S16 that learning is finished (Yes), the information determination apparatus 1 writes the learned information determination model to the information determination model storage means 23 in step S17.

Through the above operation, the information determination apparatus 1 can generate, from the teacher data, an information determination model for determining whether media information whose field is unknown is information in the news field.

(Evaluation mode)
Next, the operation of the information determination apparatus 1 in the evaluation mode will be described with reference to FIG. 7 (see FIG. 1 as appropriate for the configuration). The evaluation mode operation is performed after the operation of the second learning mode described with reference to FIG. 6.
The operations in steps S20 to S24 are the same as those in steps S10 to S14 described with reference to FIG. 6, except that the input is media information whose field is unknown rather than media information serving as teacher data. Their description is therefore omitted.

In step S25, the determination means 16 of the information determination apparatus 1 uses the information determination model stored in the information determination model storage means 23 to determine, from the sentence unit distributed expression vector and the feature vector, whether the media information whose field is unknown is information in the field learned in the learning modes. The sentence unit distributed expression vector is the one generated in step S23, and the feature vector is the one generated in step S24.
In step S26, the determination means 16 of the information determination apparatus 1 outputs the result determined in step S25 to the outside.

In step S27, the information determination apparatus 1 determines whether the evaluation mode operation is finished, based on whether further media information is input.
If further media information is input in step S27 and the evaluation mode operation is not finished (No), the information determination apparatus 1 returns to step S20 and continues the determination operation.
If no new media information is input in step S27 and the evaluation mode operation is finished (Yes), the information determination apparatus 1 ends the operation.
Through the above operation, the information determination apparatus 1 can determine whether unknown media information is information in the field learned in the learning modes.
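
The evaluation mode loop (steps S20 to S27) can be sketched as a generator that processes posts until no more input arrives. The three callables passed in are hypothetical stand-ins for the vectorization means, the feature extraction means, and the learned information determination model. The toy implementations below exist only so the sketch runs and are not a real classifier.

```python
def run_evaluation(posts, vectorize, encode, predict):
    """Yield (post, in_learned_field) for each incoming post."""
    for post in posts:                   # S20; the loop ends when input ends (S27)
        vs = vectorize(post)             # S21 to S23: sentence unit vector
        vf = encode(vs)                  # S24: feature vector
        p_in, p_out = predict(vs, vf)    # S25: two output probabilities
        yield post, (p_in - p_out) > 0   # S26: output the determination result


# Toy stand-ins (assumptions, for illustration only)
def toy_vectorize(post):
    return [1.0] if "fire" in post else [0.0]


def toy_encode(vs):
    return [vs[0]]


def toy_predict(vs, vf):
    return (0.9, 0.1) if vs[0] > 0.5 else (0.2, 0.8)


posts = ["fire near the station", "lunch was great"]
results = list(run_evaluation(posts, toy_vectorize, toy_encode, toy_predict))
print(results)
```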

The configuration and operation of the information determination apparatus 1 according to the embodiment of the present invention have been described above, but the present invention is not limited to this embodiment.
Here, a single apparatus, the information determination apparatus 1, performs two operations: a learning operation (the first learning mode and the second learning mode) that learns the feature extraction model and the information determination model, and a determination operation (the evaluation mode) that uses the feature extraction model and the information determination model to determine whether unknown media information is information in the learned field.
However, these operations may be performed by separate apparatuses.

Specifically, an apparatus that performs the learning operation for learning the feature extraction model and the information determination model can be configured as the model learning apparatus 3 shown in FIG. 8.
As shown in FIG. 8, the model learning apparatus 3 may be configured by omitting the determination means 16 from the information determination apparatus 1 described with reference to FIG. 1. This configuration performs only the learning operation for learning the feature extraction model and the information determination model, which is the same as that of the information determination apparatus 1 described with reference to FIG. 1. The operation of the model learning apparatus 3 is the same as the operation described with reference to FIGS. 5 and 6.
The model learning apparatus 3 can be implemented by a program (model learning program) that causes a computer to function as each of the means described above.

An apparatus that performs the determination operation, using the feature extraction model and the information determination model to determine whether unknown media information is information in the learned field, can be configured as the information determination apparatus 1B shown in FIG. 9.
As shown in FIG. 9, the information determination apparatus 1B may be configured by omitting the feature extraction model learning means 13 and the information determination model learning means 15 from the information determination apparatus 1 described with reference to FIG. 1. This configuration performs only the determination operation for determining whether unknown media information is information in a previously learned field, which is the same as that of the information determination apparatus 1 described with reference to FIG. 1. The operation of the information determination apparatus 1B is the same as the operation described with reference to FIG. 7.
The information determination apparatus 1B can be implemented by a program (information determination program) that causes a computer to function as each of the means described above.
By performing the learning operation and the determination operation on different apparatuses in this way, the feature extraction model and the information determination model learned by one model learning apparatus 3 can be used by a plurality of information determination apparatuses 1B.
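
One way to picture this split is that the learning side exports both models as data and each determination side loads its own copy. The sketch below uses JSON for that exchange. The text does not specify any storage or transfer format, so the format, the key names, and the functions `export_models` and `import_models` are all assumptions.

```python
import json


def export_models(feature_extraction_model, information_determination_model):
    """Learning side: serialize both learned parameter sets to one string."""
    return json.dumps({
        "feature_extraction_model": feature_extraction_model,
        "information_determination_model": information_determination_model,
    })


def import_models(blob):
    """Determination side: restore both parameter sets from the string."""
    data = json.loads(blob)
    return (data["feature_extraction_model"],
            data["information_determination_model"])


# Toy parameters standing in for learned weights
feature_model = {"W": [[0.5, -0.25]], "b": [0.0]}
decision_model = {"W1": [[0.5, 0.5, 0.5]], "b1": [0.0],
                  "W2": [[1.0], [-1.0]], "b2": [0.0, 0.0]}

blob = export_models(feature_model, decision_model)
# Several determination apparatuses can each load the same exported models
loaded = [import_models(blob) for _ in range(3)]
print(loaded[0] == (feature_model, decision_model))
```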

Here, the information determination model learned by the information determination model learning means 15 is a neural network trained by supervised learning. However, other general machine learning methods can be used for this supervised learning, for example, a support vector machine (SVM) or conditional random fields (CRF).

Here, the word division means 11 divides the input news manuscript (manuscript sentence) and media information (posted sentence) into words by morphological analysis. However, when the input to the information determination apparatus 1 has already been divided into words, the word division means 11 can be omitted from the configuration of the information determination apparatus 1.

DESCRIPTION OF SYMBOLS
1, 1B Information determination apparatus
11 Word division means
12 Vectorization means
13 Feature extraction model learning means
14 Feature extraction means
15 Information determination model learning means
16 Determination means
21 Distributed expression vector storage means
22 Feature extraction model storage means
23 Information determination model storage means
3 Model learning apparatus

Claims

A model learning apparatus that learns, from a plurality of manuscript sentences whose field is known, a feature extraction model for extracting feature amounts of the information contained in the manuscript sentences, and that learns, from teacher data consisting of posted sentences of social media information whose field is known, an information determination model for determining whether a posted sentence to be evaluated is information indicating the field, the apparatus comprising:
vectorization means for generating a sentence unit distributed expression vector, which is a distributed expression vector per manuscript or per post, by averaging the word-unit distributed expression vectors stored in advance in the storage unit for the manuscript sentence or for the posted sentence serving as the teacher data;
feature extraction model learning means for learning a feature extraction model that extracts a feature vector by compressing the dimensions of the sentence unit distributed expression vector generated from the manuscript sentence;
feature extraction means for extracting, using the feature extraction model, the feature vector from the sentence unit distributed expression vector generated from the teacher data; and
information determination model learning means for learning the information determination model by machine learning, taking as inputs the feature vector extracted for the teacher data by the feature extraction means and the sentence unit distributed expression vector of the teacher data from which that feature vector was extracted.

The model learning apparatus according to claim 1, wherein the feature extraction model learning means learns, as the feature extraction model, the parameters from the input layer to the hidden layer of an autoencoder, which is a neural network.

An information determination apparatus that uses the feature extraction model and the information determination model learned by the model learning apparatus according to claim 1 or 2 to determine whether unknown data, which is a posted sentence of social media information whose field is unknown, is information indicating the learned field, the apparatus comprising:
vectorization means for generating a sentence unit distributed expression vector, which is a distributed expression vector per post, by averaging the word-unit distributed expression vectors stored in advance in the storage unit for the unknown data;
feature extraction means for extracting, using the feature extraction model, a feature vector from the sentence unit distributed expression vector generated by the vectorization means; and
determination means for determining, using the information determination model, whether the unknown data is information indicating the learned field from the sentence unit distributed expression vector generated by the vectorization means and the feature vector extracted by the feature extraction means.

An information determination apparatus that learns, from a plurality of manuscript sentences whose field is known, a feature extraction model for extracting feature amounts of the information contained in the manuscript sentences, that learns, from teacher data consisting of posted sentences of social media information whose field is known, an information determination model for determining whether a posted sentence to be evaluated is information indicating the field, and that determines whether unknown data, which is a posted sentence of social media information whose field is unknown, is information indicating the learned field, the apparatus comprising:
vectorization means for generating a sentence unit distributed expression vector, which is a distributed expression vector per manuscript or per post, by averaging the word-unit distributed expression vectors stored in advance in the storage unit for the manuscript sentence, the teacher data, or the unknown data;
feature extraction model learning means for learning a feature extraction model that extracts a feature vector by compressing the dimensions of the sentence unit distributed expression vector generated from the manuscript sentence;
feature extraction means for extracting, using the feature extraction model, the feature vector from the sentence unit distributed expression vector generated from the teacher data or the unknown data;
information determination model learning means for learning the information determination model by machine learning, taking as inputs the feature vector extracted for the teacher data by the feature extraction means and the sentence unit distributed expression vector of the teacher data from which that feature vector was extracted; and
determination means for determining, using the information determination model, whether the unknown data is information indicating the learned field from the sentence unit distributed expression vector that the vectorization means generated from the unknown data and the feature vector that the feature extraction means extracted from that sentence unit distributed expression vector.

A model learning program for causing a computer to function as the model learning apparatus according to claim 1.

An information determination program for causing a computer to function as the information determination apparatus according to claim 3.