JP2018124617A

JP2018124617A - Teacher data collection apparatus, teacher data collection method and program

Info

Publication number: JP2018124617A
Application number: JP2017014021A
Authority: JP
Inventors: Mitsutoshi Yoshii; Akira Nakajima
Original assignee: Mitsubishi Heavy Industries Ltd
Current assignee: Mitsubishi Heavy Industries Ltd
Priority date: 2017-01-30
Filing date: 2017-01-30
Publication date: 2018-08-09
Anticipated expiration: 2037-01-30
Also published as: JP6936014B2

Abstract

PROBLEM TO BE SOLVED: To collect high-quality teacher data for machine learning at low cost. SOLUTION: A teacher data collection apparatus collects data relating to a certain field for use as teacher data for machine learning. The apparatus includes: a feature calculation unit that calculates a first feature vector, which is the feature vector of reference data relating to the certain field registered in advance; a generation unit that generates, from the first feature vector, a search condition used for collecting data relating to the certain field; a collection unit that collects data relating to the certain field based on the generated search condition; a similarity calculation unit that, after the feature calculation unit calculates a second feature vector, which is the feature vector of the collected data, calculates the similarity between the second feature vector and the first feature vector; and an extraction unit that extracts, as the teacher data, the collected data whose similarity is within a predetermined range. SELECTED DRAWING: Figure 1

Description

The present invention relates to natural language analysis technology, and more particularly to a teacher data collection device, a teacher data collection method, and a program that automatically acquire the teacher data that is essential to supervised learning.

In the field of information extraction, machine learning methods are commonly used. Machine learning methods for artificial intelligence, including deep learning, can be broadly divided into “supervised learning”, in which a human teaches the system the relationship between input data and output data, and “unsupervised learning”, in which the system itself uses only the input data to derive tendencies or to classify a large amount of data into smaller groups.

With “supervised learning”, any relationship between pieces of information can be learned as long as a large amount of teacher data showing the input/output relationship is available. However, creating teacher data requires manual work and is therefore costly.

“Unsupervised learning”, on the other hand, has the advantage of a low learning cost, but it can be applied only to tasks that can be accomplished without knowing the correct answer.

One technique for reducing the cost of creating teacher data for “supervised learning” is the bootstrap method, a form of semi-supervised learning. Starting from a small number of teacher data given as initial input, the bootstrap method extracts data that conforms to the rules derived from them and adds it to the teacher data, thereby creating a large amount of teacher data (see, for example, Patent Document 1).

A technique has also been studied that creates an associative concept dictionary based on high-accuracy associated words suited to disambiguating polysemous words, and that allows learning data to be collected according to parameters such as the number of items and a quality policy (see, for example, Patent Document 2).

Patent Document 1: JP 2005-222532 A
Patent Document 2: JP 2011-164717 A

However, in the bootstrap method as shown in Patent Document 1, even inappropriate data is added as new teacher data if it conforms to the rules based on the initial teacher data. As a result, the large amount of teacher data created contains a lot of inappropriate data.
In the learning data collection of Patent Document 2, learning data is collected from a corpus in which text data has been accumulated without any organization. Even if the associative concept dictionary improves selection accuracy, the bias of the collected data with respect to field cannot be taken into account in the first place. The resulting data set therefore depends on the population.

In view of the above, an object of the present invention is to provide a teacher data collection device, a teacher data collection method, and a program that can collect high-quality teacher data for machine learning at low cost.

To achieve the above object, the present invention provides a teacher data collection device that collects data relating to a specific field for use as teacher data for machine learning, the device comprising: a feature calculation unit that calculates a first feature vector, which is the feature vector of reference data relating to the specific field registered in advance; a generation unit that generates, from the first feature vector, a search condition used for collecting data relating to the specific field; a collection unit that collects data relating to the specific field based on the generated search condition; a similarity calculation unit that, when the feature calculation unit calculates a second feature vector, which is the feature vector of the collected data, calculates the similarity between the second feature vector and the first feature vector; and an extraction unit that extracts, as the teacher data, the collected data whose similarity is within a predetermined range.

This makes it possible to automatically collect, at low cost, high-quality teacher data for machine learning that is specialized for gathering information on a specific field.

FIG. 1 is a diagram showing an example of the system configuration of the teacher data collection system according to the embodiment.
FIG. 2 is a diagram showing an example of the hardware configuration of the teacher data collection system according to the embodiment.
FIG. 3 is a diagram showing an example of the various tables stored in the storage device according to the embodiment.
FIG. 4 is a flowchart showing an example of the flow of the feature vector calculation process according to the embodiment.
FIG. 5 is a diagram showing a specific example of the feature vector calculation process according to the embodiment.
FIG. 6 is a flowchart showing an example of the flow of the data collection process for a specific field according to the embodiment.
FIG. 7 is a diagram showing a specific example of the process of collecting data related to a specific field according to the embodiment.
FIG. 8 is a flowchart showing an example of the flow of the process of calculating the similarity of feature vectors according to the embodiment.
FIG. 9 is a diagram showing a specific example of the similarity calculation process according to the embodiment.
FIG. 10 is a flowchart showing an example of the flow of the process of extracting the data to be stored as teacher data according to the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(System configuration)
FIG. 1 is a diagram showing an example of the system configuration of a teacher data collection system according to an embodiment of the present invention.

The teacher data collection system 100 extracts only the data related to a specific field from a large amount of information (data).
As shown in FIG. 1, the teacher data collection system 100 includes a storage device 300, which stores reference data relating to a specific field, the feature vectors of that reference data, and the like in a database, and a management computer 200 (teacher data collection device), which manages the database on the storage device 300 and enables operations such as searching it.

The storage device 300 is connected to the management computer 200.

The management computer 200 is also connected to a terminal 400 and an external document database 410 via a network N such as the Internet. The terminal 400 can access the management computer 200; for example, a user can check the teacher data collected by the management computer 200 from the terminal 400. The management computer 200 can access the external document database 410; for example, it can import data stored in the external document database 410.

The management computer 200 includes a data reception unit 210, a feature vector calculation unit 220, a search condition generation unit 230, a data collection unit 240, and a similarity calculation unit 250.

The data reception unit 210 receives data relating to a specific field selected by the user (reference data) through input means such as a mouse, a keyboard, or a touch panel, and stores the received reference data in the reference data storage unit 310 of the storage device 300.

The feature vector calculation unit 220 (feature calculation unit) calculates a first feature vector, which is the feature vector of the reference data, from the reference data stored in the reference data storage unit 310 of the storage device 300, and stores it in the reference data storage unit 310.

The search condition generation unit 230 (generation unit) generates a search condition for data collection from the first feature vector stored in the reference data storage unit 310 of the storage device 300, and outputs it to the data collection unit 240.

The data collection unit 240 (collection unit) collects data matching the search condition (collected data) from the document database 410, based on the search condition generated by the search condition generation unit 230, and stores it in the collected data storage unit 320 of the storage device 300.

The feature vector calculation unit 220 also calculates a second feature vector, which is the feature vector of the collected data, from the collected data stored in the collected data storage unit 320 of the storage device 300, and stores it in the collected data storage unit 320.

The similarity calculation unit 250 compares the second feature vector stored in the collected data storage unit 320 of the storage device 300 with the first feature vector stored in the reference data storage unit 310, calculates the similarity of the second feature vector to the first feature vector, and stores the calculated similarity in the collected data storage unit 320.

The teacher data extraction unit 260 (extraction unit) extracts, as teacher data, the collected data stored in the collected data storage unit 320 of the storage device 300 whose similarity is within a predetermined range, and stores it in the teacher data storage unit 330 of the storage device 300. The teacher data extraction unit 260 also stores the reference data held in the reference data storage unit 310 in the teacher data storage unit 330.

(Hardware configuration)
The management computer 200 according to the embodiment can be implemented using, for example, a general-purpose computer 500. FIG. 2 is a diagram showing an example of the configuration of the computer 500.

The computer 500 includes a CPU (Central Processing Unit) 501, a RAM (Random Access Memory) 502, a ROM (Read Only Memory) 503, a storage device 504, an external I/F (Interface) 505, an input device 506, an output device 507, a communication I/F 508, and the like. These devices exchange signals with one another via a bus B.

The CPU 501 is an arithmetic device that implements the functions of the computer 500 by reading programs and data stored in the ROM 503, the storage device 504, and the like into the RAM 502 and executing processing. The RAM 502 is a volatile memory used as a work area for the CPU 501 and the like. The ROM 503 is a non-volatile memory that retains programs and data even when the power is turned off.
The storage device 504 is implemented by, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores an OS (Operating System), application programs, various data, and the like.
The external I/F 505 is an interface to external devices, such as a recording medium 509. The computer 500 can read from and write to the recording medium 509 via the external I/F 505. Examples of the recording medium 509 include an optical disk, a magnetic disk, a memory card, and a USB (Universal Serial Bus) memory.

The input device 506 consists of, for example, a mouse, a touch panel, and a keyboard, and inputs various operations to the computer 500 in response to instructions from an operator (user).

The output device 507 is implemented by, for example, a liquid crystal display, and displays the results of processing by the CPU 501.

The communication I/F 508 is an interface that connects the computer 500 to a network such as the Internet (for example, the network N in FIG. 1) by wired or wireless communication. The bus B is connected to each of the above devices and carries various control signals and the like between them.

(Description of various tables)
Next, each table stored in the storage device 300 will be described with reference to FIG. 3.

FIG. 3 is a diagram showing an example of the various tables stored in the storage device according to the embodiment of the present invention.
The reference data management table 600 shown in FIG. 3(a), which is stored in the reference data storage unit 310 of the storage device 300, stores reference data relating to a specific field in association with the data identifier assigned to that reference data and the feature vector of that data (first feature vector).

For example, the first row of FIG. 3(a) indicates that the data identifier of the reference data “DA00001” relating to the specific field is “#A00001”, and that the feature vector “XA00001” of the reference data “DA00001” is the first feature vector calculated by the feature vector calculation unit 220.

From the reference data “DA00001” relating to a specific field, which the user entered at the terminal 400 and which is stored in the reference data storage unit 310 of the storage device 300, the feature vector calculation unit 220 calculates the first feature vector “XA00001”, which is the feature vector of the reference data, and stores it in the reference data storage unit 310 of the storage device 300. One example of a method for calculating the feature vector is the TF-IDF (Term Frequency-Inverse Document Frequency) method.
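As an illustration of the TF-IDF method named above, the following is a minimal sketch in plain Python. The function name `tf_idf_vectors`, the pre-tokenized toy documents, and the specific weighting (relative term frequency multiplied by the natural logarithm of the inverse document frequency) are assumptions made for this example; the patent does not specify which TF-IDF variant is used.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (count / document length) * ln(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, already tokenized documents (not taken from the patent).
docs = [
    ["railway", "train", "contract"],
    ["railway", "signal", "train"],
    ["railway", "bridge"],
]
vectors = tf_idf_vectors(docs)
```

With this weighting, a term that occurs in every document (here `railway`) receives a weight of zero, while terms that are rare across the collection are weighted more heavily.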

In this way, the reference data management table 600 of the reference data storage unit 310 stores the feature vectors of the reference data relating to the specific field entered by the user.

The collected data management table 610 shown in FIG. 3(b), which is stored in the collected data storage unit 320 of the storage device 300, stores data collected for a specific field (collected data) in association with the data identifier assigned to that data, the feature vector of that data (second feature vector), and the similarity of that data to the reference data.

For example, the first row of FIG. 3(b) indicates the following: the data identifier of the collected data “DS00001”, which the data collection unit 240 collected based on the search condition generated by the search condition generation unit 230 for collecting data on the specific field, is “#S00001”; the feature vector “XS00001” of the collected data “DS00001” is the second feature vector calculated by the feature vector calculation unit 220; and the similarity of the feature vector (second feature vector) “XS00001” of the collected data “DS00001” to the feature vectors (first feature vectors) of the reference data relating to the specific field is 0.634.

The search condition generation unit 230 generates a search condition (a combination of search terms) for data collection from the first feature vector stored in the reference data storage unit 310 of the storage device 300, and outputs it to the data collection unit 240.
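One plausible way to turn a feature vector into a combination of search terms is to take the most heavily weighted words. The sketch below assumes exactly that; the function name `generate_search_condition`, the `top_k` parameter, and the `AND` query syntax are illustrative assumptions and are not stated in this part of the text. The example vector reuses the word counts given for document A1 in FIG. 5.

```python
def generate_search_condition(feature_vector, top_k=3):
    """Join the top_k highest-weighted terms into a query string.

    Assumption: the search condition is an AND combination of the
    strongest terms; ties are broken alphabetically for determinism.
    """
    ranked = sorted(feature_vector.items(), key=lambda item: (-item[1], item[0]))
    terms = [term for term, _ in ranked[:top_k]]
    return " AND ".join(terms)


# Word counts of document A1 as given in the FIG. 5 example.
xa00001 = {"traffic": 10, "train": 4, "railway": 7, "government": 2}
condition = generate_search_condition(xa00001)
```

Here `condition` is a string that could be passed to whatever search interface the data collection unit uses; a different engine may need a different way of combining terms.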

The data collection unit 240 collects data matching the search condition from the document database 410, based on the search condition (a combination of search terms or the like) generated by the search condition generation unit 230, using, for example, a search engine generally available on the web. The data collection unit 240 then stores the collected data (the data matching the search condition) in the collected data storage unit 320 of the storage device 300 as collected data “DS00001” relating to the specific field.

The feature vector calculation unit 220 calculates the second feature vector “XS00001”, which is the feature vector of the collected data, from the collected data “DS00001” stored in the collected data storage unit 320 of the storage device 300, and stores it in the collected data storage unit 320.

The similarity calculation unit 250 compares the second feature vector “XS00001” stored in the collected data storage unit 320 of the storage device 300 with the first feature vectors “XA00001”, “XA00002”, “XA00003”, and so on stored in the reference data storage unit 310 of the storage device 300, and calculates the similarity of the second feature vector to the first feature vectors (here, 0.634).
Specifically, the similarity calculation unit 250 first compares the second feature vector “XS00001” with the first feature vector “XA00001” and calculates the similarity of “XS00001” to “XA00001”. It does the same for “XA00002”, “XA00003”, and so on.
Next, the similarity calculation unit 250 combines the similarities of the second feature vector “XS00001” to the individual first feature vectors “XA00001”, “XA00002”, “XA00003”, and so on into a single similarity of the second feature vector to the first feature vectors. Examples of how to combine them include taking the average or the maximum of the similarities. The similarity calculation unit 250 stores the calculated similarity (here, 0.634) in the collected data storage unit 320 of the storage device 300.
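The two-stage calculation described above (one similarity per reference vector, then a combination by average or maximum) can be sketched as follows. Cosine similarity is an assumption chosen for illustration, since this part of the text does not name the pairwise measure; the function names and the toy vectors are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


def combined_similarity(second, firsts, method="mean"):
    """Combine the similarities of one second feature vector to all first feature vectors."""
    scores = [cosine_similarity(second, first) for first in firsts]
    if not scores:
        raise ValueError("at least one first feature vector is required")
    if method == "mean":
        return sum(scores) / len(scores)
    if method == "max":
        return max(scores)
    raise ValueError(f"unknown method: {method}")


# Hypothetical vectors for illustration only.
xs = {"train": 1.0, "railway": 1.0}
xa_list = [{"train": 1.0, "railway": 1.0}, {"bridge": 1.0}]
mean_score = combined_similarity(xs, xa_list, method="mean")
max_score = combined_similarity(xs, xa_list, method="max")
```

The choice between the two matters when the reference documents are diverse: the maximum rewards a close match to any single reference document, whereas the average rewards closeness to the reference set as a whole.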

In this way, the collected data management table 610 of the collected data storage unit 320 stores the collected data relating to the specific field, the feature vector of the collected data (second feature vector), and the similarity of that feature vector to the feature vectors of the reference data relating to the specific field (first feature vectors).

The teacher data management table 620 shown in FIG. 3(c), which is stored in the teacher data storage unit 330 of the storage device 300, stores teacher data in association with the data identifier assigned to that data.

For example, the first row of FIG. 3(c) indicates that the data identifier of the collected data “DS00003”, which the teacher data extraction unit 260 extracted as teacher data, is “#S00003”.

The teacher data extraction unit 260 extracts, as teacher data, the collected data “DS00003” stored in the collected data storage unit 320 of the storage device 300 whose similarity is within a predetermined range, and stores it in the teacher data storage unit 330 of the storage device 300 together with its data identifier “#S00003”.
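The extraction step amounts to filtering the collected data by a similarity range. The sketch below assumes a closed interval with hypothetical bounds of 0.7 and 1.0. The similarity 0.634 for “DS00001” comes from the FIG. 3(b) example, whereas the values for “DS00002” and “DS00003” are invented purely so that the filter has something to select.

```python
def extract_teacher_data(collected_rows, lower=0.7, upper=1.0):
    """Keep rows whose similarity lies within [lower, upper].

    The bounds are hypothetical; the text only says "a predetermined range".
    """
    return [row for row in collected_rows if lower <= row["similarity"] <= upper]


collected_rows = [
    {"identifier": "#S00001", "data": "DS00001", "similarity": 0.634},  # from FIG. 3(b)
    {"identifier": "#S00002", "data": "DS00002", "similarity": 0.412},  # invented value
    {"identifier": "#S00003", "data": "DS00003", "similarity": 0.871},  # invented value
]
teacher_rows = extract_teacher_data(collected_rows)
```

Under these assumed values only “DS00003” is kept, which is consistent with it appearing in the first row of the teacher data table in FIG. 3(c).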

As shown in FIG. 3(c), the teacher data extraction unit 260 also stores the reference data held in the reference data storage unit 310 of the storage device 300 (the data manually selected by the user) in the teacher data storage unit 330 of the storage device 300.

In this way, the teacher data management table 620 of the teacher data storage unit 330 stores the teacher data.
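The three tables described in this section can be pictured as simple record types. The following sketch uses Python dataclasses; the class and field names are assumptions chosen for readability, and the feature vector contents are placeholders, since FIG. 3 only shows symbolic names such as “XA00001”.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ReferenceRecord:
    """One row of the reference data management table 600."""
    identifier: str                      # e.g. "#A00001"
    data: str                            # e.g. "DA00001"
    feature_vector: Dict[str, float] = field(default_factory=dict)  # first feature vector


@dataclass
class CollectedRecord:
    """One row of the collected data management table 610."""
    identifier: str                      # e.g. "#S00001"
    data: str                            # e.g. "DS00001"
    feature_vector: Dict[str, float] = field(default_factory=dict)  # second feature vector
    similarity: Optional[float] = None   # filled in by the similarity calculation


@dataclass
class TeacherRecord:
    """One row of the teacher data management table 620."""
    identifier: str
    data: str


reference = ReferenceRecord("#A00001", "DA00001")
collected = CollectedRecord("#S00001", "DS00001", similarity=0.634)
teacher = TeacherRecord("#S00003", "DS00003")
```

Keeping the similarity optional reflects the order of processing: a collected record exists before its similarity to the reference data has been calculated.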

(Process flow)
Next, the feature vector calculation process will be described with reference to FIGS. 4 and 5, using the configuration of FIG. 1 and the tables of FIG. 3 as an example.

FIG. 4 is a flowchart showing an example of the flow of the feature vector calculation process according to the embodiment of the present invention.
FIG. 5 is a diagram showing a specific example of the feature vector calculation process according to the embodiment of the present invention.

As a premise, suppose the user enters into the terminal 400 the reference data “DA00001”, “DA00002”, and “DA00003” (see FIG. 3(a)) relating to a specific field, which the user has judged and selected personally. The terminal 400 then transmits these three pieces of reference data to the management computer 200. In the management computer 200, the data reception unit 210 receives the reference data and stores it in the storage device 300. More specifically, the data reception unit 210 stores the reference data “DA00001” in the reference data management table 600 in association with its data identifier “#A00001”, and likewise stores “DA00002” and “DA00003” in association with their respective data identifiers. The data identifiers of the reference data may be generated by the data reception unit 210 or by a database system or the like running on the storage device 300. Once the reference data has been stored in the storage device 300 in this way, the management computer 200 starts the feature vector calculation process for the reference data, for example in response to an instruction from the user.

In this embodiment, the “specific field” is taken to be the field of “railway systems” as an example. The reference data “DA00001”, “DA00002”, “DA00003”, ... are documents A1, A2, A3, ... relating to railway systems (see FIG. 5) that the user collected (selected) from the web or elsewhere at the user's own discretion. Documents relating to railway systems are, for example, “news about railway construction contracts” and “technical papers on railways”.

First, the feature vector calculation unit 220 reads the reference data from the storage device 300 (step S101). More specifically, it reads the three pieces of reference data “DA00001”, “DA00002”, and “DA00003” stored in the reference data storage unit 310 of the storage device 300 (the reference data management table 600 in FIG. 3(a)).

Next, the feature vector calculation unit 220 calculates, from the reference data “DA00001”, “DA00002”, and “DA00003” that it has read, the feature vectors (first feature vectors) of those reference data (step S102). For example, it calculates the feature vector (first feature vector) “XA00001” for the reference data “DA00001”.

Suppose that the reference data “DA00001” shown in FIG. 3(a) is the document A1 shown in FIG. 5. The feature vector (first feature vector) “XA00001” of the reference data “DA00001” is then expressed as a set of pairs, each consisting of a word i (word_i) contained in the document A1 and its weight value. Here, the “weight value” is the degree to which each word i contributes to characterizing the feature vector; in this embodiment it is expressed by, for example, the “number of occurrences” of each word i. The words i are the nouns that the feature vector calculation unit 220 automatically extracts from the document A1, such as “traffic”, “train”, “railway”, and “government”. In the example shown in FIG. 5, the feature vector (first feature vector) “XA00001” of the reference data “DA00001” (document A1) is expressed as “traffic = 10, train = 4, railway = 7, government = 2, ...”.
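A feature vector of this kind, with the number of occurrences as the weight value, can be sketched as a word count. The text does not say how nouns are identified, so the example below stands in for noun extraction with a hypothetical fixed vocabulary; the function name, the regular-expression tokenizer, and the sample sentence are all assumptions.

```python
import re
from collections import Counter


def count_feature_vector(text, vocabulary=None):
    """Return {word: number of occurrences} for the words in text.

    If vocabulary is given, only those words are counted. This is a
    stand-in for real noun extraction, which would need a part-of-speech
    tagger.
    """
    words = re.findall(r"[a-z]+", text.lower())
    if vocabulary is not None:
        words = [word for word in words if word in vocabulary]
    return dict(Counter(words))


# Hypothetical sample text and noun list (not from the patent).
text = "The government approved the railway. The railway train entered traffic."
nouns = {"traffic", "train", "railway", "government"}
vector = count_feature_vector(text, nouns)
```

For the sample sentence this yields a count of 2 for `railway` and 1 for each of the other three nouns, the same shape as the “traffic = 10, train = 4, ...” representation in FIG. 5.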

The feature vector calculation unit 220 outputs the calculated first feature vector to the storage device 300 (step S103). More specifically, the feature vector calculation unit 220 stores the calculated feature vector (first feature vector) in the reference data storage unit 310 (reference data management table 600) of the storage device 300. For example, the feature vector calculation unit 220 stores the feature vector (first feature vector) "XA00001" in the reference data management table 600 in association with the reference data "DA00001".
For the reference data "DA00002" and "DA00003", the feature vector calculation unit 220 likewise calculates the feature vectors "XA00002" and "XA00003", each expressed by the words i contained in document A2 or document A3 and their weight values (for example, the number of occurrences).
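The feature vector described above (each word paired with its number of occurrences) can be illustrated with a minimal sketch. The embodiment extracts nouns automatically but does not say how; the stop-word filter below is only a stand-in for that step, and the function name `feature_vector`, the stop-word list, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a stand-in for the noun extraction that the
# embodiment performs but does not specify.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}

def feature_vector(text):
    """Return a {word: occurrence count} mapping for one document."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

# Illustrative text only; this is not the content of document A1.
sample = "The government funds the railway. Railway traffic and train traffic grow."
vector = feature_vector(sample)
```

A real implementation would presumably replace the stop-word filter with a part-of-speech tagger so that only nouns are kept.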

Next, the process of collecting data related to a specific field will be described with reference to FIGS. 6 and 7, taking the configuration of FIG. 1 and the tables of FIG. 3 as examples.

FIG. 6 is a flowchart showing an example of the flow of the process of collecting data related to a specific field according to the embodiment of the present invention.
FIG. 7 is a diagram showing a specific example of the process of collecting data related to a specific field according to the embodiment of the present invention.

As a premise, the reference data storage unit 310 of the storage device 300 holds, as a result of the processing described with FIG. 4, a plurality of pieces of reference data "DA00001" to "DA00003" and so on, together with their respective feature vectors (first feature vectors) "XA00001" to "XA00003" and so on. The user inputs data collection instruction information to the terminal 400.

In response, the search condition generation unit 230 reads the feature vectors (first feature vectors) of the reference data related to the specific field that are stored in the reference data storage unit of the storage device 300 (step S201). For example, it reads the feature vectors (first feature vectors) "XA00001" to "XA00003" of the reference data "DA00001" to "DA00003".

Next, the search condition generation unit 230 generates a search condition for data collection from the loaded feature vectors (first feature vectors) "XA00001" to "XA00003" of the reference data (step S202). More specifically, the search condition generation unit 230 generates, from these feature vectors, a search condition such as search terms, a weighting coefficient for each search term, or a combination of search terms. The search condition generation unit 230 outputs the generated search condition to the data collection unit 240 (step S203).

Here, an example in which the "search condition" is a "combination of search terms" will be described with reference to FIG. 7.
First, the search condition generation unit 230 calculates a weight value for each word i of, for example, the reference data "DA00001" (document A1), using its feature vector "XA00001". In the present embodiment, the "weight value" here is, for example, the product (tf × idf) of the term frequency (tf) of the word i and its inverse document frequency (idf). In the example shown in FIG. 7, the term frequency of the word "traffic" in document A1 is calculated as 0.333, and its inverse document frequency is calculated as 0.812. The search condition generation unit 230 likewise calculates a weight value (for example, tf × idf) for each word i of the other reference data "DA00002" and "DA00003" (documents A2 and A3).
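The tf × idf weighting can be sketched as follows. The text does not give the exact tf and idf formulas behind the FIG. 7 values (0.333 and 0.812), so this sketch assumes one common definition: tf as a word's share of all counted words in a document, and a smoothed idf of ln((1 + N) / (1 + df)) + 1. It therefore does not reproduce the figure's numbers. The function names are hypothetical, the first vector uses only the four counts quoted for document A1, and the other two vectors are made up for illustration.

```python
import math

def tf(counts, word):
    """Term frequency: share of the document's counted words that are `word`."""
    total = sum(counts.values())
    return counts.get(word, 0) / total if total else 0.0

def idf(all_counts, word):
    """Smoothed inverse document frequency (an assumed formula)."""
    n_docs = len(all_counts)
    df = sum(1 for counts in all_counts if counts.get(word, 0) > 0)
    return math.log((1 + n_docs) / (1 + df)) + 1.0

def tf_idf(all_counts, doc_index, word):
    """Weight of `word` in the document at `doc_index`."""
    return tf(all_counts[doc_index], word) * idf(all_counts, word)

docs = [
    {"traffic": 10, "train": 4, "railway": 7, "government": 2},  # counts quoted for A1
    {"traffic": 5, "train": 8, "railway": 6},                    # illustrative only
    {"traffic": 3, "railway": 9, "budget": 4},                   # illustrative only
]
weight_traffic_a1 = tf_idf(docs, 0, "traffic")
```

With this definition a word that occurs in every reference document still receives a non-zero idf, which matters here because the words of interest are precisely those shared by all reference documents.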

Next, the search condition generation unit 230 extracts the words i whose weight value (for example, tf × idf) is large across all of the reference data (documents A1, A2, and A3). Specifically, the search condition generation unit 230 determines whether the average of the weight values calculated separately for documents A1, A2, and A3 is equal to or greater than a predetermined determination threshold, and extracts the words i for which it is. The search condition generation unit 230 then creates a search condition that uses the extracted words i as search terms. In this way, the words i that particularly characterize the reference data (documents A1, A2, and A3), that is, the words i that appear especially often in the reference data, are extracted from among the many words i, and the combination of the extracted words i becomes the search condition.
For example, suppose that, based on the calculated averages of the weight values, the search condition generation unit 230 extracts the three words i "traffic", "train", and "railway". In this case, the search condition generation unit 230 uses the combination of the three search terms "traffic", "train", and "railway" as the search condition.
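The selection rule just described (average each word's weight over the reference documents and keep the words at or above a threshold) can be sketched as below. The weight values and the threshold of 0.15 are invented for illustration, the name `select_search_terms` is hypothetical, and a word missing from a document is assumed to contribute a weight of 0 to the average.

```python
def select_search_terms(weights_per_doc, threshold):
    """Return the words whose mean weight across all documents is >= threshold."""
    vocabulary = set().union(*weights_per_doc)
    n_docs = len(weights_per_doc)
    selected = []
    for word in sorted(vocabulary):
        # A word absent from a document counts as weight 0 (an assumption).
        mean = sum(w.get(word, 0.0) for w in weights_per_doc) / n_docs
        if mean >= threshold:
            selected.append(word)
    return selected

# Made-up per-document weights (e.g. tf x idf) for documents A1, A2, A3.
weights = [
    {"traffic": 0.27, "train": 0.20, "railway": 0.25, "government": 0.05},
    {"traffic": 0.22, "train": 0.30, "railway": 0.21, "government": 0.02},
    {"traffic": 0.25, "train": 0.19, "railway": 0.28},
]
search_terms = select_search_terms(weights, threshold=0.15)
```

With these illustrative weights, "government" falls below the threshold and the remaining three words form the search condition.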

Next, the data collection unit 240 collects data from the external document database 410 (step S204), for example by using a search engine generally available on the web, with the search condition (a combination of search terms or the like) from step S203 as the search key. In the above example, the data collection unit 240 collects, through the search engine, documents that contain all three search terms "traffic", "train", and "railway".
Here, it is assumed that, as a result of the search using the above search condition, the data collection unit 240 has collected three pieces of data, for example "DS00001", "DS00002", and "DS00003" (see FIG. 3(b)).
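The "contains all search terms" condition can be illustrated without a real search engine. The sketch below filters a small in-memory corpus, which stands in for the external document database 410; the corpus contents, document identifiers, and function names are hypothetical, and an actual system would pass the terms to a search engine instead.

```python
import re

def contains_all_terms(text, terms):
    """True if every search term occurs as a whole word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(term.lower() in words for term in terms)

def collect(corpus, terms):
    """Return the documents of the corpus that contain all search terms."""
    return {doc_id: text for doc_id, text in corpus.items()
            if contains_all_terms(text, terms)}

# Hypothetical corpus standing in for the external document database.
corpus = {
    "doc1": "Railway traffic rose as a new train line opened.",
    "doc2": "Road traffic was heavy this morning.",
    "doc3": "The train timetable changed; railway traffic is unaffected.",
}
hits = collect(corpus, ["traffic", "train", "railway"])
```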

The data collection unit 240 outputs the data collected in step S204 (the collected data) "DS00001", "DS00002", and "DS00003" to the storage device 300 (step S205). More specifically, the data collection unit 240 stores the collected data "DS00001", "DS00002", and "DS00003" in the collected data storage unit 320 (collected data management table 610) of the storage device 300. For example, the data collection unit 240 stores the collected data "DS00001" in the collected data management table 610 in association with the data identifier "#S00001". The same applies to the collected data "DS00002" and "DS00003".

Next, the similarity calculation process will be described with reference to FIGS. 8 and 9, taking the configuration of FIG. 1 and the tables of FIG. 3 as examples.

FIG. 8 is a flowchart showing an example of the flow of the process of calculating the similarity according to the embodiment of the present invention.
FIG. 9 is a diagram showing a specific example of the process of calculating the similarity according to the embodiment of the present invention.

As a premise, the collected data storage unit 320 of the storage device 300 holds a plurality of pieces of collected data "DS00001" to "DS00003" that were retrieved, by the processing described with FIG. 6, using the three search terms "traffic", "train", and "railway" as the search key. The user inputs similarity calculation instruction information to the terminal 400.

In response, the feature vector calculation unit 220 calculates a feature vector (second feature vector) for each piece of the collected data "DS00001" to "DS00003" stored in the collected data storage unit 320 of the storage device 300 (step S300). The feature vector calculation unit 220 stores the second feature vectors in the collected data storage unit 320 of the storage device 300.

Here, suppose that a search using the search condition generated by the search condition generation unit 230 (the combination of the three search terms "traffic", "train", and "railway") as the search key has collected the three new documents X, Y, and Z shown in FIG. 9 (for example, "news about a railway construction plan in the United States"). Documents X, Y, and Z are, respectively, the collected data "DS00001", "DS00002", and "DS00003" shown in FIG. 3(b).
In this case, the feature vector (second feature vector) "XS00001" of the collected data "DS00001" is expressed as combinations of each word i (word_i) contained in document X and its number of occurrences. The words i are the nouns that the feature vector calculation unit 220 automatically extracts from document X. In the example shown in FIG. 9, the feature vector (second feature vector) "XS00001" for the collected data "DS00001" (document X) is expressed as (traffic = 14, train = 22, railway = 67, government = 98, ...).

Next, the similarity calculation unit 250 reads the first feature vectors (the feature vectors of the reference data) from the storage device 300 (step S301). More specifically, the similarity calculation unit 250 reads the first feature vectors stored in the reference data storage unit 310 of the storage device 300 (reference data management table 600 shown in FIG. 3(a)).

Next, the similarity calculation unit 250 reads the second feature vectors (the feature vectors of the collected data) from the storage device 300 (step S302). More specifically, the similarity calculation unit 250 reads the second feature vectors stored in the collected data storage unit 320 of the storage device 300 (collected data management table 610 shown in FIG. 3(b)).

Next, the similarity calculation unit 250 compares the loaded first feature vectors with the second feature vectors and calculates the similarity of the collected data to the reference data (step S303). The similarity calculation unit 250 stores the calculated similarity in the collected data storage unit 320 of the storage device 300 in association with the data identifier of the collected data (step S304).

Specifically, the similarity calculation unit 250 calculates, for example, the cosine similarity between the feature vector (second feature vector) "XS00001" of the collected data "DS00001" and each of the feature vectors (first feature vectors) "XA00001", "XA00002", and "XA00003" of the three pieces of reference data "DA00001", "DA00002", and "DA00003". The similarity calculation unit 250 then determines the average (or the maximum, or the like) of the cosine similarities between "XS00001" and each of the three reference feature vectors "XA00001", "XA00002", and "XA00003", and stores that value as the similarity in the collected data storage unit 320 of the storage device 300.
The similarity calculation unit 250 likewise calculates the similarity for the collected data "DS00002" and "DS00003" and stores it in the collected data storage unit 320 of the storage device 300 (see FIG. 3(b)).
In the following, the description continues on the assumption that the similarity calculation unit 250 calculates the "cosine similarity" between the first feature vector and the second feature vector, but other embodiments are not limited to this. For example, the similarity calculation unit 250 according to another embodiment may calculate a similarity based on the "Euclidean distance" between the first feature vector and the second feature vector.
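The cosine similarity between two word-count vectors, and its aggregation over several reference vectors by average or maximum, can be sketched as follows. The vectors use only the four word counts quoted in the text for documents A1 and X; the full vectors in the figures contain further words, so the value computed here is not the similarity listed in FIG. 3(b). The function names are hypothetical, and the list of reference vectors is assumed to be non-empty.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

def similarity_to_references(collected, references, aggregate="mean"):
    """Aggregate the similarities to every reference vector (mean or max)."""
    scores = [cosine_similarity(collected, ref) for ref in references]
    return max(scores) if aggregate == "max" else sum(scores) / len(scores)

# Truncated vectors: only the counts quoted in the text.
xa00001 = {"traffic": 10, "train": 4, "railway": 7, "government": 2}
xs00001 = {"traffic": 14, "train": 22, "railway": 67, "government": 98}
score = cosine_similarity(xs00001, xa00001)
```

Because cosine similarity depends only on the direction of the vectors, a long collected document and a short reference document can still score highly if their word proportions are alike.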

Next, the teacher data extraction process will be described with reference to FIG. 10, taking the configuration of FIG. 1 and the tables of FIG. 3 as examples.

FIG. 10 is a flowchart showing an example of the flow of extracting data to be stored as teacher data according to the embodiment of the present invention.

The teacher data extraction unit 260 reads the similarity (cosine similarity) of the collected data from the storage device 300 (step S401). More specifically, the teacher data extraction unit 260 reads the cosine similarities stored in the collected data storage unit 320 of the storage device 300 (collected data management table 610 shown in FIG. 3(b)).

Next, the teacher data extraction unit 260 determines whether each loaded cosine similarity is within a predetermined range (step S402). For example, the teacher data extraction unit 260 makes this determination according to whether the cosine similarity is equal to or greater than a certain value. The teacher data extraction unit 260 evaluates the similarity of all the collected data stored in the collected data storage unit 320 of the storage device 300 (collected data management table 610 in FIG. 3(b)). Based on the determination results, the teacher data extraction unit 260 extracts teacher data candidates and outputs the extracted collected data to the teacher data storage unit 330 of the storage device 300 (teacher data management table 620 shown in FIG. 3(c)) (step S403).
For example, suppose that, as shown in FIG. 3(b), the cosine similarity cosθx for the collected data "DS00001" (document X) is 0.634, the cosine similarity cosθy for the collected data "DS00002" (document Y) is 0.945, and the cosine similarity cosθz for the collected data "DS00003" (document Z) is 0.803. In this case, the teacher data extraction unit 260 determines, for each of the collected data "DS00001", "DS00002", and "DS00003", whether its cosine similarity is equal to or greater than a predetermined determination threshold (for example, 0.9). The teacher data extraction unit 260 then extracts the collected data "DS00002" (document Y), whose similarity is at or above the threshold, as a new teacher data candidate and outputs it to the teacher data storage unit 330 of the storage device 300.
In this way, of the automatically collected data "DS00001", "DS00002", and "DS00003", only the data (documents) whose feature vector (second feature vector) is similar to the feature vectors of the reference data (first feature vectors) are registered in the teacher data management table 620 (FIG. 3(c)). The teacher data extraction unit 260 may also register the reference data "DA00001", "DA00002", and "DA00003" themselves, which were selected at the user's discretion, in the teacher data management table 620.
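The threshold test of steps S402 and S403 amounts to a simple filter. The sketch below uses the three similarity values and the example threshold of 0.9 given in the text; the function name `extract_teacher_data` and the dictionary layout are hypothetical stand-ins for the collected data management table.

```python
def extract_teacher_data(similarities, threshold=0.9):
    """Return the identifiers of collected data whose similarity is >= threshold."""
    return [doc_id for doc_id, sim in similarities.items() if sim >= threshold]

# Similarity values quoted in the text for documents X, Y, and Z.
similarities = {"DS00001": 0.634, "DS00002": 0.945, "DS00003": 0.803}
candidates = extract_teacher_data(similarities)
```

Only "DS00002" passes, which matches the outcome described for document Y.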

(Operation and effects)
As described above, the management computer 200 of the teacher data collection system 100 according to the present embodiment is a teacher data collection device that collects data (documents) relating to a specific field (for example, "railway systems") for use as teacher data for machine learning.
The management computer 200 includes: a feature vector calculation unit 220 that calculates a first feature vector, which is the feature vector of pre-registered data relating to a specific field (reference data); a search condition generation unit 230 that generates, from the first feature vector, a search condition (such as a combination of search terms) used to collect data relating to the specific field; a data collection unit 240 that collects data relating to the specific field (collected data) based on the generated search condition; a similarity calculation unit 250 that, once the feature vector calculation unit 220 has calculated a second feature vector, which is the feature vector of the collected data, calculates the similarity between the second feature vector and the first feature vector; and a teacher data extraction unit 260 that extracts, as teacher data, the collected data whose similarity is within a predetermined range.
With this configuration, feature vectors (first feature vectors) are first calculated for the reference data (documents A1, A2, ...) relating to a specific field, that is, for data (documents) that the user has judged suitable as "teacher data" and registered in advance. Then, based on the feature vectors of the reference data (first feature vectors), a search condition (a combination of search terms) for automatically collecting new teacher data is generated. Data (documents) collected automatically under a search condition generated from the first feature vectors are likely to have feature vectors similar to the first feature vectors. In other words, the data collected in this way (collected data) are likely to have characteristics close to those of the reference data. Data (documents) suitable as teacher data for learning about the "specific field" can therefore be collected automatically with a reasonably high probability.
However, depending on the automatic collection (search) process, data whose feature vector is not similar to the feature vectors of the reference data (first feature vectors), that is, documents that do not belong to the specific field, may happen to match the search condition and be collected. If such data become mixed into the teacher data, the reliability of machine learning for the "specific field" is reduced. The management computer 200 therefore additionally calculates a feature vector (second feature vector) for the data collected automatically under the search condition (collected data) and calculates the similarity between the first feature vector and the second feature vector. The management computer 200 then takes in as teacher data only the collected data whose similarity is equal to or greater than a predetermined value.
In this way, collected data that do not belong to the "specific field" (data that merely happened to match the search condition) are excluded, and only data that are truly suitable as teacher data are taken in as teacher data.
As a result, the teacher data collection system 100 according to the present embodiment makes it possible to automatically collect, at low cost, high-quality teacher data for machine learning that is specialized for collecting information on a specific field.

In addition, in the management computer 200 according to the present embodiment, the search condition generation unit 230 generates, as the search condition, a combination of words whose degree of use in the reference data is equal to or greater than a predetermined value, based on the first feature vector.
Because new data are then collected using, as the search key, a combination of the words (words i) that appear especially often in the reference data, the likelihood that the collected data have characteristics similar to the reference data is increased.

In the management computer 200 according to another embodiment, the data collection unit 240 uses, as the search condition, a combination of words whose weight value (for example, term frequency or tf × idf), calculated for each word i contained in the pre-registered data relating to the specific field (reference data), is equal to or greater than a predetermined value.
By refining how the weight value is calculated, for example so that it takes the structure of the document itself into account, the likelihood of collecting data with the same characteristics as the reference data (documents A1, A2, and A3) can be increased further.
In the above embodiment, the "weight value" has been described as the "number of occurrences" or "tf × idf", but other embodiments are not limited to this. For example, in other embodiments the "weight value" may be the "term frequency (tf)" or the "inverse document frequency (idf)".

As described above, the teacher data collection device, teacher data collection method, and program of the present embodiment make it possible to automatically collect, at low cost, high-quality teacher data for machine learning that is specialized for collecting information on a specific field.
Each process of the management computer 200 described above is stored in the form of a program on a computer-readable recording medium, and the processing is performed when the computer of the management computer 200 reads and executes this program. Here, the computer-readable recording medium refers to a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, or the like. The computer program may also be distributed to a computer over a communication line, and the computer that receives it may execute the program.

The program may implement only some of the functions described above. It may also be a so-called difference file (difference program), which implements the functions described above in combination with a program already recorded in the computer system.
The management computer 200 may consist of a single computer or of a plurality of computers connected so that they can communicate with one another.

In addition, the components of the embodiments described above may be replaced with well-known components as appropriate without departing from the spirit of the present invention. The technical scope of the invention is not limited to the above embodiments, and various modifications may be made without departing from the spirit of the present invention.

100 Teacher data collection system
200 Management computer (teacher data collection device)
210 Data reception unit
220 Feature vector calculation unit (feature calculation unit)
230 Search condition generation unit (generation unit)
240 Data collection unit (collection unit)
250 Similarity calculation unit
260 Teacher data extraction unit (extraction unit)
300 Storage device
310 Reference data storage unit
320 Collected data storage unit
330 Teacher data storage unit
400 Terminal
410 Document database
500 General computer
600 Reference data management table
610 Collected data management table
620 Teacher data management table

Claims

A teacher data collection device that collects data related to a specific field for use as machine learning teacher data,
A feature calculation unit that calculates a first feature vector that is a feature vector of data related to a specific field registered in advance;
A generating unit that generates a search condition used for collecting data related to the specific field from the first feature vector;
A collection unit for collecting data on the specific field based on the generated search condition;
A similarity calculation unit that, when the feature calculation unit has calculated a second feature vector that is a feature vector of the collected data, calculates a similarity between the second feature vector and the first feature vector;
An extraction unit for extracting the collected data having the similarity in a predetermined range as the teacher data;
A teacher data collection device comprising:

The teacher data collection device according to claim 1, wherein the generation unit generates, as the search condition, a combination of words whose degree of use in the pre-registered data related to the specific field is equal to or greater than a predetermined value, based on the first feature vector.

The teacher data collection device according to claim 1, wherein the generation unit uses, as the search condition, a combination of words whose weight value, calculated for each word included in the pre-registered data related to the specific field, is equal to or greater than a predetermined value.

An information processing method for collecting data related to a specific field for use as machine learning teacher data,
A feature calculation step of calculating a first feature vector that is a feature vector of data related to a specific field registered in advance;
Generating a search condition used for collecting data related to the specific field from the first feature vector;
A collection step of collecting data on the specific field based on the generated search condition;
Calculating a second feature vector which is a feature vector of the collected data, and calculating a similarity between the second feature vector and the first feature vector;
An extraction step of extracting the collected data having the similarity in a predetermined range as the teacher data;
A teacher data collection method.

A program for collecting data related to a specific field for use as machine learning teacher data, the program causing a computer to function as:
A feature calculation unit that calculates a first feature vector that is a feature vector of data related to a specific field registered in advance;
A generation unit that generates a search condition used for collecting data related to the specific field from the first feature vector;
A collection unit that collects data related to the specific field based on the generated search condition;
A similarity calculation unit that, when the feature calculation unit has calculated a second feature vector that is a feature vector of the collected data, calculates a similarity between the second feature vector and the first feature vector; and
An extraction unit that extracts the collected data having the similarity within a predetermined range as the teacher data.