JP2019200628A

JP2019200628A - Arithmetic apparatus, data processing method, and data processing system

Info

Publication number: JP2019200628A
Application number: JP2018095170A
Authority: JP
Inventors: Nawel Ariua; Dan Yamamoto; Daisuke Kito; Shunji Kawamura; Kohei Sasaki
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2018-05-17
Filing date: 2018-05-17
Publication date: 2019-11-21

Abstract

PROBLEM TO BE SOLVED: To provide a technique for selecting, from a plurality of data having the same attribute, data representative of that attribute. SOLUTION: An arithmetic device executes: a division process that divides, into a plurality of clusters, a set of points in a vector space, each point indicating the similarity of two data that have the same attribute and are selected from a first plurality of data each having one of a plurality of attributes; a first selection process that selects at least one point from each of the plurality of clusters; and a second selection process that, when a second plurality of data underlying the points selected from the respective clusters includes data identical to any of the second plurality of data, selects the identical data. SELECTED DRAWING: FIG. 5

Description

The present invention relates to an arithmetic device, a data processing method, and a data processing system, and more specifically to a technique for selecting, from a plurality of data having the same attribute, data representative of that attribute.

In recent years, to make effective use of unlabeled data (data to which no label indicating an attribute has been assigned), the attribute of unlabeled data is sometimes determined based on, for example, its similarity to labeled data to which a label has been assigned (see, for example, Patent Document 1).

Patent Document 1: U.S. Patent Application No. 2017/0140035

Incidentally, a plurality of labeled data having a given attribute may include data that is identical to, or hard to distinguish from, data belonging to another attribute. For example, the data “Aoyama” labeled with the attribute “family name” is identical to the data “Aoyama” labeled with the attribute “place name”. In general, to let a computer determine the attribute of data while reducing the amount of data it handles, it is preferable to remove such data in advance.

An object of the present invention is therefore to provide a technique for selecting, from a plurality of data having the same attribute, data representative of that attribute.

An arithmetic device of the present invention that solves the above problem executes: a division process that divides, into a plurality of clusters, a set of points in a vector space, each point indicating the similarity of two data that have the same attribute and are selected from a first plurality of data each having one of a plurality of attributes; a first selection process that selects at least one point from each of the plurality of clusters; and a second selection process that, when a second plurality of data underlying the points selected from the respective clusters includes data identical to any of the second plurality of data, selects the identical data.

According to the present invention, data representative of an attribute can be selected from a plurality of data having the same attribute.

FIG. 1 is a diagram showing the configuration of the labeling system 10.
FIG. 2 is a diagram showing an example of the training data DB 30.
FIG. 3 is a diagram showing an example of the standard data DB 31.
FIG. 4 is a diagram showing functional blocks implemented in the arithmetic device 50.
FIG. 5 is a flowchart showing the process of generating the standard data DB 31.
FIG. 6 is a diagram showing the set of points 300 indicating the similarity of two data having the same attribute.
FIG. 7 is a diagram showing the set of points 300 divided into a plurality of clusters.
FIG. 8 is a diagram showing an example of the data contained in the selected points.
FIG. 9 is a diagram showing functional blocks implemented in the arithmetic device 50.
FIG. 10 is a flowchart showing the process of generating the learned model 32.
FIG. 11 is a diagram showing the calculation results of the calculation unit 110 in the vector space.
FIG. 12 is a diagram showing functional blocks implemented in the arithmetic device 60.
FIG. 13 is a flowchart showing the process of assigning a label.
FIG. 14 is a diagram showing the identification information I and the probability P corresponding to points in the vector space.

--- Labeling system 10 ---
FIG. 1 is a diagram showing the configuration of the labeling system 10. The labeling system 10 (data processing system) determines the attribute of unlabeled data, that is, data to which no label indicating an attribute has been assigned, and assigns a label indicating that attribute to the unlabeled data. The labeling system 10 includes a storage device 20, a data processing device 21, and a labeling device 22.

The storage device 20 is a non-volatile storage area such as a hard disk, and stores various kinds of information such as programs and databases. The storage device 20 stores a training data DB (database) 30, a standard data DB 31, and a learned model 32.

The training data DB 30 (first plurality of data) is a database in which a plurality of data is classified by the attribute that each data item has. FIG. 2 shows an example of the training data DB 30, which is created in advance by a system administrator or the like and stored in the storage device 20. The first column of the training data DB 30 contains “Alice”, “Bob”, and “John”, which are data labeled with the attribute “name” (hereinafter also referred to as data having an attribute). Here, for example, “Alice” is prefixed with “a0” and written as “a0: Alice”; “a0” is simply a shorthand symbol for “Alice”. The same applies to “a1” through “c2” and so on.

The second column of the training data DB 30 contains “Hokkaido”, “Okinawa”, and “Tokyo”, which are data having the attribute “address”, and the third column contains “30s”, “20s”, and “50s”, which are data having the attribute “age group”. The training data DB 30 contains three or more attributes, but only three are shown in FIG. 2 for convenience. Each row of the training data DB 30 gives information about the person in the first column. For example, the second row gives the “address” and “age group” of the person whose “name” is “Alice”.

The standard data DB 31 (fourth plurality of data) is a database that stores the data (standard data, or canonical data) obtained by removing, from the training data DB 30, data that is not representative of its attribute. FIG. 3 shows an example of the standard data DB 31. Here, of “Alice”, “Bob”, and “John” in the first column of the training data DB 30, “Bob” and “John” have been removed as data that is not representative of the attribute “name”. In other words, “Alice” has been selected as the data representative of the attribute “name”. The method of generating the standard data DB 31 is described later.

The learned model 32 is a model that, based on the similarity (degree of similarity) of two data, outputs identification information I indicating whether the two data have the same attribute, together with a probability P that they have the same attribute. The learned model 32 is described in detail later.

The data processing device 21 generates the standard data DB 31 and the learned model 32 by executing predetermined programs. The data processing device 21 includes an arithmetic device 50, a memory 51, a storage device 52, an input device 53, a display device 54, and a communication device 55.

The arithmetic device 50 (first arithmetic device) implements various functions by executing programs stored in the memory 51 and the storage device 52.

The memory 51 is, for example, a RAM (Random Access Memory) and is used as a temporary storage area for programs, data, and the like.

The storage device 52 is a non-volatile storage area such as a hard disk, and stores various kinds of information such as programs and databases. In this embodiment, the storage device 52 stores a program 70 that is executed to generate the standard data DB 31 and a program 71 that is executed to generate the learned model 32.

The input device 53 is, for example, a touch panel or a keyboard, and accepts user operations and input. The display device 54 is, for example, a display, and shows operation results, processing results, and the like.

The communication device 55 is a communication means such as a network interface, and exchanges data with the storage device 20 and the labeling device 22 over a network.

The labeling device 22 determines the attribute of unlabeled data and assigns a label indicating that attribute to the unlabeled data. The labeling device 22 includes an arithmetic device 60 (second arithmetic device), a memory 61, a storage device 62, an input device 63, a display device 64, and a communication device 65. The arithmetic device 60 and the other components of the labeling device 22 are the same as the arithmetic device 50 and the corresponding components of the data processing device 21, so a detailed description of them is omitted. In this embodiment, the storage device 62 stores a program 75 that is executed to assign a label to unlabeled data.

--- Generation of the standard data DB 31 ---
<< Functional blocks of the arithmetic device 50 >>
FIG. 4 is a diagram showing the functional blocks implemented in the arithmetic device 50 when it executes the program 70 for generating the standard data DB 31.

A calculation unit 100, a dividing unit 101, selection units 102 and 103, and an update unit 104 are implemented in the arithmetic device 50.

The calculation unit 100 calculates a set of similarity vectors, each between two data in the training data DB 30. Specifically, for data “x” and data “y” having the same attribute, the calculation unit 100 calculates the Jaccard similarity and the TF-IDF cosine similarity and outputs the result as a two-dimensional vector. Here, the Jaccard similarity between “x” and “y” is denoted sim1(x, y), and the TF-IDF cosine similarity between “x” and “y” is denoted sim2(x, y). The calculation unit 100 calculates the similarity vector (a point in a vector space indicating similarity) for every pair of data having the same attribute using the following equation (1).

s(x, y) = (sim1(x, y), sim2(x, y)) ... (1)
Here, (x, y) is (a0, a1), (a1, a2), (b0, b1), ..., (c1, c2), and so on. Although the calculation unit 100 calculates the Jaccard similarity and the TF-IDF cosine similarity in this example, other similarities (for example, a similarity based on the Levenshtein distance) may be used. Accordingly, s(x, y) may be a vector having three or more dimensions.
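As a rough illustration of equation (1), the following Python sketch computes a two-dimensional similarity vector for a pair of strings. The source does not say how data is tokenized or how the TF-IDF weights are defined, so the character-bigram tokenization, the smoothed IDF formula, and the function names (`bigrams`, `jaccard`, `tfidf_cosine`, `similarity_vector`) are assumptions made only for this example. The values produced here lie in [0, 1]; the scale used in the figures of the source is not stated.

```python
import math
from collections import Counter


def bigrams(text):
    """Split a string into overlapping character bigrams (assumed tokenization)."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def jaccard(x, y):
    """sim1: |A & B| / |A | B| over the bigram sets of x and y."""
    a, b = set(bigrams(x)), set(bigrams(y))
    return len(a & b) / len(a | b) if a | b else 0.0


def tfidf_cosine(x, y, corpus):
    """sim2: cosine of the TF-IDF vectors of x and y; IDF is estimated from corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(bigrams(doc)))

    def weights(text):
        tf = Counter(bigrams(text))
        # Smoothed IDF (an assumption) so that every token keeps a positive weight.
        return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                for t, c in tf.items()}

    wx, wy = weights(x), weights(y)
    dot = sum(wx[t] * wy[t] for t in wx.keys() & wy.keys())
    norm = (math.sqrt(sum(v * v for v in wx.values()))
            * math.sqrt(sum(v * v for v in wy.values())))
    return dot / norm if norm else 0.0


def similarity_vector(x, y, corpus):
    """s(x, y) = (sim1(x, y), sim2(x, y)), following equation (1)."""
    return (jaccard(x, y), tfidf_cosine(x, y, corpus))
```

A third component, such as a Levenshtein-based similarity, could be appended to the returned tuple to obtain a vector with more dimensions.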

The dividing unit 101 divides the set of similarity vectors obtained from equation (1) into n clusters.

The selection unit 102 selects, for each of the n clusters, the m vectors closest to the center of that cluster.

Based on the m × n points (first plurality of points) selected by the selection unit 102, the selection unit 103 selects the standard data DB 31, that is, the training data DB 30 with the data that is not representative of its attribute removed. The selection unit 103 is described in detail later; in outline, if the plurality of data underlying the m × n points (second plurality of data) contains k (a predetermined number) or more instances of the same data, the selection unit 103 selects that data. In addition, the selection unit 103 selects, from the training data DB 30, all data whose attribute differs from the attribute of that repeated data (third plurality of data).

The update unit 104 accesses the storage device 20 and updates the information of the standard data DB 31 stored there.

<< Process S100 for generating the standard data DB 31 >>
The process S100 for generating the standard data DB 31 is described below. Here, the dividing unit 101 divides the set of points into seven clusters (n = 7), and the selection unit 102 selects two points (m = 2) from each cluster. The selection unit 103 treats data as representative of its attribute if two instances (k = 2) of the same data are included.

First, the calculation unit 100 selects two data having the same attribute (for example, (a0, a1)) from the training data DB 30 (S200). The calculation unit 100 then calculates, using equation (1), the similarity vector indicating the similarity of the two selected data (S201). As a result, for example, s(a0, a1) = (sim1(a0, a1), sim2(a0, a1)) is calculated. The calculation unit 100 then determines whether all pairs having the same attribute in the training data DB 30 have been selected (S202). If not all pairs have been selected (S202: No), process S200 is executed again. If all pairs have been selected (S202: Yes), the set of points 300 (first set of points) corresponding to the respective pairs is represented in the vector space shown in FIG. 6. In FIG. 6, only the point corresponding to s(a0, a1) is labeled “s(a0, a1)”, but each of the other points likewise corresponds to another pair (for example, (b0, b1) or (c10, c11)). As noted above, the training data DB 30 in FIG. 2 shows three data for each attribute, but it actually contains many more. For this reason, many points are plotted in the vector space of FIG. 6.
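The loop over S200 to S202 amounts to enumerating every unordered pair of data within each attribute. A minimal Python sketch, assuming the training data DB 30 is held as a dictionary from attribute name to a list of values (a representation chosen only for this example; `same_attribute_pairs` and `build_point_set` are hypothetical names, and `similarity` stands for any function implementing equation (1)):

```python
from itertools import combinations


def same_attribute_pairs(training_db):
    """Yield (attribute, x, y) for every unordered pair of data sharing an attribute."""
    for attribute, values in training_db.items():
        for x, y in combinations(values, 2):
            yield attribute, x, y


def build_point_set(training_db, similarity):
    """Map each same-attribute pair to its similarity vector (the set of points 300)."""
    return {
        (attribute, x, y): similarity(x, y)
        for attribute, x, y in same_attribute_pairs(training_db)
    }
```

Keying each point by its attribute and pair keeps track of which two data the point is based on, which the later selection steps rely on.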

When the calculation unit 100 determines in process S202 that all pairs have been selected (S202: Yes), the dividing unit 101 divides the set of points 300 in the vector space into seven clusters 310 to 316, as shown in FIG. 7 (S203: division process).
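The source does not name the clustering algorithm used in the division process S203. Purely as an illustration, the sketch below uses k-means (Lloyd's algorithm), which is one common way to split points into a fixed number of clusters; the function name `kmeans` and the parameters `iterations` and `seed` are hypothetical.

```python
import math
import random


def kmeans(points, n_clusters, iterations=50, seed=0):
    """Partition points into n_clusters groups with Lloyd's algorithm (assumed method)."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        assignment = [
            min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its members (unchanged if it has none).
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignment
```

With the values of this embodiment, the call would be `kmeans(points, 7)`, where `points` is the list of similarity vectors in the set of points 300.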

The selection unit 102 selects, for each cluster, the two points closest to the center point of each of the seven clusters 310 to 316 (S204: first selection process). For example, if the center point of the cluster 310 is the point P0 and the points closest to P0 are the points P1 and P2, the selection unit 102 selects P1 and P2 for the cluster 310. Because the selection unit 102 performs the same processing on each of the clusters 311 to 316, process S204 selects 14 (= 7 × 2) points in total. Each of the 14 points indicates the similarity of two data having the same attribute (for example, (a0, a1)). Selecting 14 points therefore means that 28 data have been selected.
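The first selection process can be sketched as follows. The example assumes that each cluster is a list of (pair, point) entries and that the cluster center is the arithmetic mean of its points; the source only speaks of a “center point”, so the centroid definition and the names `closest_to_center` and `first_selection` are assumptions.

```python
import math


def closest_to_center(cluster, m):
    """Return the m (pair, point) entries of a cluster nearest to its centroid."""
    points = [point for _, point in cluster]
    centroid = tuple(sum(axis) / len(points) for axis in zip(*points))
    return sorted(cluster, key=lambda entry: math.dist(entry[1], centroid))[:m]


def first_selection(clusters, m):
    """Select m entries per cluster and flatten the selected pairs into single data."""
    selected = [entry for cluster in clusters for entry in closest_to_center(cluster, m)]
    data = [item for pair, _ in selected for item in pair]
    return selected, data
```

Because every selected point stems from a pair, `data` always holds twice as many items as `selected`, which matches the 14 points and 28 data of the example above.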

The selection unit 103 determines whether the 28 data (second plurality of data) underlying the 14 points include two instances of the same data (S205). If so (S205: Yes), the selection unit 103 selects, from the training data DB 30, the data that appears twice and all data whose attribute differs from the attribute of that data (third plurality of data) (S206: second selection process). If not (S205: No), the selection unit 103 selects, for example, the entire training data DB 30 (S207).

The processing executed by the selection unit 103 is now described in detail with reference to FIG. 8. FIG. 8 shows an example of the 28 data underlying the 14 points. For example, if the 14 points include (a0, a1), (a0, a2), (b1, b2), and so on, then, as shown in FIG. 8, the data “a0” appears twice, and each of the data “a1”, “a2”, “b1”, and “b2” appears once. For convenience, FIG. 8 shows only some of the 28 data and omits the rest. In this case, the selection unit 103 selects, from the training data DB 30, the data “a0” and all data having the attributes “address” and “age group”, which differ from the attribute “name” of the data that appears twice. This processing makes it possible to select the representative data “a0” from among the data having the attribute “name”. Although FIG. 8 assumes that “a0” appears twice, if no data appears twice among the 28 data (S205: No), the selection unit 103 selects the training data DB 30 (S207).
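The counting in S205 to S207 can be illustrated with a short Python sketch. It assumes that the selected points are available as (attribute, x, y) triples and that the training data DB 30 is a dictionary from attribute to a list of values; `second_selection` and its parameter `k` (the threshold, 2 in this embodiment) are names introduced for the example. How several attributes with repeated data should be combined is not spelled out in the source, so reducing each such attribute to its own repeated data is an interpretation.

```python
from collections import Counter


def second_selection(training_db, selected_pairs, k=2):
    """Build the standard data from the (attribute, x, y) pairs chosen in S204."""
    counts = Counter()
    for attribute, x, y in selected_pairs:
        counts[(attribute, x)] += 1
        counts[(attribute, y)] += 1

    # Data that occurs k or more times represents its attribute (S205: Yes).
    representatives = {}
    for (attribute, value), count in counts.items():
        if count >= k:
            representatives.setdefault(attribute, []).append(value)

    if not representatives:
        # S205: No, so the whole training DB is selected (S207).
        return {a: list(values) for a, values in training_db.items()}

    # S206: keep the representatives and all data of the other attributes.
    return {a: representatives.get(a, list(values)) for a, values in training_db.items()}
```

With the pairs (a0, a1), (a0, a2), and (b1, b2) from the FIG. 8 example, only “a0” reaches the threshold, so the “name” column shrinks to “a0” while the other columns are kept whole.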

The update unit 104 then stores the data selected in process S206 in the storage device 20 as the standard data DB 31, thereby updating the information stored in the storage device 20 (S208: update process). As a result, for example, the standard data DB 31 shown in FIG. 3 is obtained, in which only “a0” is selected for the attribute “name”. For convenience, FIG. 3 shows only “b0” to “b2” and “c0” to “c2” as data having the attribute “address” or “age group”, but other data such as “b3” and “c3” are also included. In process S206, all data having the attributes “address” and “age group” (third plurality of data) are selected.

If, for example, the data “a1” as well as the data “a0” is found to appear twice in process S205, both “a0” and “a1” are selected in process S206.

--- Generation of the learned model 32 ---
<< Functional blocks of the arithmetic device 50 >>
FIG. 9 is a diagram showing the functional blocks implemented in the arithmetic device 50 when it executes the program 71 for generating the learned model 32.

A calculation unit 110, an identification information adding unit 111, a determination unit 112, and a training unit 113 are implemented in the arithmetic device 50.

The calculation unit 110 calculates a set of similarity vectors (points in a vector space indicating similarity), each between two data in the training data DB 30. Specifically, the calculation unit 110 selects two data (“x”, “y”) from the training data DB 30 and calculates the similarity vector between them using equation (1) above. The calculation unit 110 performs the calculation of equation (1) for all pairs of data in the training data DB 30, regardless of whether their attributes are the same.

The identification information adding unit 111 attaches, to each calculation result of the calculation unit 110, identification information I indicating whether the result is based on two data having the same attribute or on two data having different attributes. For example, when the similarity of the two data “a0” and “a1”, which have the same attribute, is calculated, the identification information adding unit 111 attaches the identification information I “True” (hereinafter, identification information I “T”) to the calculation result. When the similarity of the two data “a0” and “b1”, which have different attributes, is calculated, it attaches the identification information I “False” (hereinafter, identification information I “F”) to the calculation result.

The determination unit 112 determines whether all pairs of data in the training data DB 30 have been selected and the calculation has been performed for them.

The training unit 113 uses the calculation results with identification information I attached (for example, s(a0, a1) = (1, 35) “T”, s(a1, a2) = (5, 58) “T”, s(a0, b0) = (2, 58) “F”, and so on) as learning data (teacher data), and obtains a function (the learned model 32) that reproduces this learning data.

<< Process S110 for generating the learned model 32 >>
The process S110 for generating the learned model 32 is described with reference to FIG. 10. First, the calculation unit 110 selects two data from the training data DB 30 (S210). The calculation unit 110 then calculates the similarity vector of the two selected data using equation (1) (S211). For example, if the two selected data are “a1” and “b1”, s(a1, b1) = (sim1(a1, b1), sim2(a1, b1)) is calculated in process S211.

The identification information adding unit 111 then refers to the attributes of the two data used by the calculation unit 110 and attaches identification information I to the calculation result (S212). For example, when “s(a1, b1)”, the similarity vector of the data “a1” and “b1” having different attributes, is calculated in process S211, the identification information adding unit 111 attaches the identification information I “F” to the calculation result of “s(a1, b1)”.
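Steps S210 to S212 can be pictured as enumerating all pairs of data across the whole training data DB 30 and tagging each pair with I. The sketch below again assumes a dictionary from attribute to a list of values; `labelled_pairs` is a hypothetical name, and the Boolean it yields stands for “T” (True) or “F” (False).

```python
from itertools import combinations


def labelled_pairs(training_db):
    """Yield (x, y, I) for every pair of data; I is True when the attributes match."""
    items = [
        (attribute, value)
        for attribute, values in training_db.items()
        for value in values
    ]
    for (attr_x, x), (attr_y, y) in combinations(items, 2):
        yield x, y, attr_x == attr_y
```

Applying a similarity function to each yielded pair and keeping the Boolean alongside the resulting vector gives learning data of the form (s(x, y), I).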

FIG. 11 is a diagram showing the calculation results of the calculation unit 110 in the vector space. Each point in the set 350 (first set of points) represents a calculation result with the identification information I “T” attached, and each point in the set 351 (second set of points) represents a calculation result with the identification information I “F” attached.

The determination unit 112 also determines whether the calculation has been performed for all pairs of data in the training data DB 30 (S213). If the calculation has not been performed for all pairs (S213: No), process S210 is executed and a new pair of data is selected. If the calculation has been performed for all pairs (S213: Yes), the training unit 113 performs training using the calculation results with identification information I attached as learning data (S214). Specifically, the training unit 113 uses each point shown in the vector space of FIG. 11 as learning data to obtain the learned model 32. Once the training unit 113 has obtained the learned model 32 using all of the learning data derived from the training data DB 30, it stores the learned model 32 in the storage device 20 (S215). The learned model 32 of this embodiment outputs, based on a calculation result indicating the similarity of two data, identification information I indicating whether the two data have the same attribute, together with a probability P. For example, if the learned model 32 determines that the point (10, 35) in the vector space is a calculation result based on data having the same attribute, with a probability of 70%, it outputs the identification information I “T” and the probability P “70%”.
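The source does not state which kind of model the training unit 113 fits. As one possible instance, the sketch below trains a logistic regression by stochastic gradient descent and returns I together with the probability of the returned I. Logistic regression itself, the names `train_logistic` and `predict`, the hyperparameters `epochs` and `lr`, and the assumption that similarity values are scaled to [0, 1] are all choices made for this example.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, epochs=2000, lr=0.1):
    """Fit (w, b) so that sigmoid(w . s + b) approximates P(same attribute | s)."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for s, same in samples:
            p = _sigmoid(sum(wi * si for wi, si in zip(w, s)) + b)
            error = p - (1.0 if same else 0.0)  # gradient of the log loss w.r.t. z
            w = [wi - lr * error * si for wi, si in zip(w, s)]
            b -= lr * error
    return w, b


def predict(model, s):
    """Return (I, P): the predicted identification information and its probability."""
    w, b = model
    p_same = _sigmoid(sum(wi * si for wi, si in zip(w, s)) + b)
    return (True, p_same) if p_same >= 0.5 else (False, 1.0 - p_same)
```

Here `samples` is a list of (similarity vector, I) tuples, for example `[((0.9, 0.8), True), ((0.1, 0.2), False)]`.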

--- Labeling process ---
<< Functional blocks of the arithmetic device 60 >>
FIG. 12 is a diagram showing the functional blocks implemented in the arithmetic device 60 of the labeling device 22 when it executes the program 75 for assigning labels.

A calculation unit 150, an output unit 151, a determination unit 152, and a label assignment unit 153 are implemented in the arithmetic device 60.

The calculation unit 150 calculates, based on Expression (1), a plurality of similarity vectors (a second plurality of points) between each piece of data in the standard data DB 31 and the unlabeled input data “x”.

The output unit 151 uses the learned model 32 to output identification information I and a probability P for each of the calculated points in the vector space. Specifically, using the learned model 32, the output unit 151 outputs identification information I indicating whether or not the input data “x” has the same attribute as the data it is paired with, and a probability P indicating the likelihood of the identification information I (the probability of being “T” or “F”).
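
The pairing of identification information I with probability P can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: it assumes the learned model 32 yields a single score `p_same` (an estimated probability that the two data share an attribute), and the function name `to_identification` is hypothetical.

```python
from typing import Tuple


def to_identification(p_same: float) -> Tuple[str, float]:
    """Convert a same-attribute probability into the pair (I, P).

    Assumption: p_same is the estimated probability that the two data
    have the same attribute. P is the probability of whichever
    identification ("T" or "F") is reported.
    """
    if not 0.0 <= p_same <= 1.0:
        raise ValueError("p_same must lie in [0, 1]")
    if p_same >= 0.5:
        return "T", p_same        # judged to have the same attribute
    return "F", 1.0 - p_same      # judged to have different attributes
```

Under this assumption, a score of 0.7 is reported as “T” with P = 70%, while a score of 0.3 is reported as “F” with P = 70%.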

The determination unit 152 determines the attribute of the input data “x” based on the highest probability P among the probabilities P output by the output unit 151.

The label assignment unit 153 assigns to the input data “x” a label indicating the attribute determined by the determination unit 152, and outputs the labeled data.

<< Label assignment process S120 >>
The process S120 of assigning a label to the unlabeled input data “x” will now be described with reference to FIG. 13. First, the calculation unit 150 calculates, using Expression (1), a plurality of similarity vectors between each piece of data in the standard data DB 31 and the input data “x” (S220). As shown in FIG. 3, the standard data DB 31 includes data “a0”, “b1”, and so on, so the calculation unit 150 calculates s(x, a0), s(x, b0), and so on. As a result, as shown in FIG. 14, a plurality of similarity vectors (a second plurality of points) in the vector space are obtained, such as s(x, a0) = (77.8, 5) and s(x, b0) = (2, 59).

The output unit 151 uses the learned model 32 to output identification information I and a probability P for each of the calculated points (for example, (77.8, 5) and (2, 59)). For example, when the learned model 32 is applied to (77.8, 5), which is the calculation result of s(x, a0), the learned model 32 outputs identification information I “F” and probability P “70%”. That is, the learned model 32 outputs information indicating that the data “x” and the data “a0” have different attributes, with a probability of “70%”. The determination unit 152 then determines the attribute of the input data “x” based on the highest probability P among the probabilities P output by the output unit 151 (S222). In the example shown in FIG. 14, the probability P based on the calculation result of s(x, b2) is the highest, at 95%. In such a case, the determination unit 152 determines that the input data “x” is data having “address”, which is the attribute of the data “b2”. The label assignment unit 153 assigns to the input data “x” a label indicating the attribute determined by the determination unit 152, and outputs the labeled data (S223). Specifically, the label assignment unit 153 attaches a label indicating “address” to the input data “x” and outputs it. By executing such processing, a correct label indicating the attribute of the unlabeled data is assigned to the unlabeled data with high probability. In addition, the process S220 uses the standard data DB 31, which holds less data than the training data DB 30. For this reason, the load on the arithmetic device 60 is reduced in the label assignment process S120.
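
The determination step of the process S120 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the similarity vectors are taken as already computed (Expression (1) is not reproduced here), `model` is a stand-in for the learned model 32, the function name `assign_label` is hypothetical, and the sketch assumes that only pairs identified as “T” compete for the highest probability P.

```python
from typing import Callable, Iterable, Optional, Tuple

Point = Tuple[float, float]


def assign_label(
    points: Iterable[Tuple[str, Point]],
    model: Callable[[Point], Tuple[str, float]],
) -> Optional[str]:
    """Return the attribute to use as the label of the input data x.

    points: (attribute of a standard datum, similarity vector s(x, datum)).
    model:  stand-in for the learned model 32; returns (I, P).
    Returns None when no pair is identified as "T".
    """
    best_attr, best_p = None, -1.0
    for attr, point in points:
        ident, p = model(point)
        # Assumption: only same-attribute ("T") judgments are candidates.
        if ident == "T" and p > best_p:
            best_attr, best_p = attr, p
    return best_attr


# Illustrative values only: the point for b2 and the attribute name of a0
# are made up for this example.
outputs = {(77.8, 5.0): ("F", 0.70), (40.0, 41.0): ("T", 0.95)}
label = assign_label(
    [("attribute of a0", (77.8, 5.0)), ("address", (40.0, 41.0))],
    lambda pt: outputs[pt],
)
print(label)  # address
```

Because the loop runs over the standard data only, its cost grows with the size of the standard data DB 31 rather than with the size of the training data DB 30.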

--- Summary ---
The label assignment system 10 of the present embodiment has been described above. The arithmetic device 50 of the present embodiment selects, from a set of pairs of data having the same attribute, data that is the same as any other data included in the set. Data selected in this way generally carries many of the features of its attribute, so according to the present embodiment, data representative of an attribute can be selected from a plurality of data having that attribute.

In addition, the description in this specification makes at least the following clear. That is, the first selection process may be a process of selecting, from each of the plurality of clusters, at least the point closest to the center point of that cluster.

In the present embodiment, the points included in each of the clusters 310 to 316 shown in FIG. 7 may be selected at random. However, selecting points close to the center point of each of the clusters 310 to 316 makes it possible to select a plurality of data pairs whose inter-data similarities differ greatly from one another. As a result, data representative of an attribute can be selected with higher accuracy from a plurality of data having that attribute.
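
Selecting the points nearest to each cluster center can be sketched as follows. This is a minimal sketch, not the implementation of the embodiment: the cluster membership and the center points are assumed to be given by the division process, Euclidean distance is an assumption, and the function name `nearest_to_center` is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def nearest_to_center(
    clusters: Dict[int, Sequence[Point]],
    centers: Dict[int, Point],
    m: int = 1,
) -> Dict[int, List[Point]]:
    """For each cluster, return the m member points closest to its center.

    Assumptions: distance is Euclidean, and ties keep input order
    (sorted() is stable).
    """
    return {
        cid: sorted(pts, key=lambda p: math.dist(p, centers[cid]))[:m]
        for cid, pts in clusters.items()
    }


clusters = {0: [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)], 1: [(10.0, 10.0), (12.0, 12.0)]}
centers = {0: (1.0, 1.0), 1: (10.0, 11.0)}
picked = nearest_to_center(clusters, centers, m=2)
```

Taking the centers as an argument keeps the sketch neutral about how the center point of a cluster is defined.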

The second selection process may be a process of selecting the same data when the second plurality of data includes the same data a predetermined number of times, the number being two or more.

When a set of pairs of data having the same attribute includes a given piece of data a predetermined number of times, the number being two or more, such data is generally data that carries many of the features of that attribute. For this reason, according to the present embodiment, data representative of an attribute can be selected with higher accuracy from a plurality of data having that attribute.
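
The count-based variant of the second selection process can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: each selected point is assumed to be traceable back to the pair of data it was computed from, and the names `repeated_data` and `min_count` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def repeated_data(
    pairs: Iterable[Tuple[Hashable, Hashable]],
    min_count: int = 2,
) -> List[Hashable]:
    """Return the data that occur at least min_count times across the
    pairs behind the selected points, in first-seen order."""
    counts = Counter(item for pair in pairs for item in pair)
    return [item for item, c in counts.items() if c >= min_count]


# Illustrative pairs only; the identifiers are made up for this example.
pairs = [("b0", "b2"), ("b2", "b5"), ("b1", "b3")]
representatives = repeated_data(pairs)
print(representatives)  # ['b2']
```

Raising `min_count` makes the selection stricter, at the cost of possibly returning no data for an attribute.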

The second selection process may also be a process of selecting, as a fourth plurality of data, the same data together with a third plurality of data that, among the first plurality of data, has an attribute different from the attribute of the same data.

By executing such processing, the standard data DB 31 can be generated from the training data DB 30.

The arithmetic device may also execute an update process of updating the information in a storage device so that the selected fourth plurality of data is stored in the storage device in association with the respective attributes.

Because the information in the standard data DB 31 is updated by executing such processing, the label assignment system 10 can use the latest standard data DB 31.

In addition, the label assignment system 10 including the arithmetic devices 50 and 60 of the present embodiment can assign, with high probability, a correct label indicating the attribute of unlabeled data to that unlabeled data. Furthermore, when assigning labels, the label assignment system 10 uses the standard data DB 31, which holds less data than the training data DB 30. For this reason, the amount of calculation performed by the arithmetic device 60 can be reduced compared with the case where labels are assigned using the training data DB 30.

Note that the above embodiment is intended to facilitate understanding of the present invention and is not intended to limit its interpretation. The present invention may be modified and improved without departing from its gist, and the present invention also includes equivalents thereof.

For example, in the process S100 of generating the standard data DB 31 in FIG. 5, the calculation unit 100 calculates the similarity of pairs having the same attribute and acquires the information of the point set 300 (processes S200 to S202). However, the set 300 is the same as the point set 350 (FIG. 11) obtained in the process S110 of generating the learned model 32 in FIG. 10. Therefore, when the point set 350 is stored in, for example, the storage device 20, the division unit 101 may divide the point set 350 into a plurality of clusters. In this case, the processing executed by the calculation unit 100 can be omitted.

In the present embodiment, the m points closest to the center point of each cluster are selected; however, for example, points close to the center of gravity of the cluster may be selected instead.

The data handled in the present embodiment is text data such as “Alice” and “Hokkaido”, but is not limited to this and may be, for example, image data.

In the present embodiment, the division unit 101 divides the set of points into seven clusters (n = 7), and the selection unit 102 selects two points (m = 2) from each cluster, but the configuration is not limited to this. For example, increasing the number of clusters (n) and the number of points selected from each cluster (m) makes it possible to select data representative of an attribute with higher accuracy.

The first column of the training data DB 30 of the present embodiment has been described as data relating to, for example, a “person”, but is not limited to this. For example, it may relate to a “facility”, an “organization”, a “thing”, or the like. Likewise, as for “age”, if the information in the first column is “facility”, for example, it may be information such as “building age”. In this way, the contents of the training data DB 30 of the present embodiment are merely an example, and any data constituting a database may be used.

It is also possible to supply the program to a computer using a non-transitory computer readable medium with an executable program thereon. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives) and CD-ROMs (Read Only Memory).

10 Label assignment system
20, 52, 62 Storage device
21 Data processing device
22 Label assignment device
30 Training data DB
31 Standard data DB
32 Learned model
50, 60 Arithmetic device
51, 61 Memory
53, 63 Input device
54, 64 Display device
55, 65 Communication device
70, 71, 75 Program
100, 110, 150 Calculation unit
101 Division unit
102, 103 Selection unit
104 Update unit
111 Identification information assignment unit
112, 152 Determination unit
113 Training unit
151 Output unit
153 Label assignment unit
310 to 316 Cluster
300, 350, 351 Set

Claims

A division process of dividing a set of points in a vector space indicating the similarity of two data having the same attribute selected from the first plurality of data having any of the plurality of attributes into a plurality of clusters;
A first selection process for selecting at least one point from each of the plurality of clusters;
A second selection process of selecting, when the second plurality of data on which the plurality of points selected from each of the plurality of clusters are based includes data that is the same as any of the second plurality of data, the same data,
An arithmetic unit characterized by executing

The arithmetic device according to claim 1,
The first selection process is a process of selecting a point closest to a center point of each of the plurality of clusters from each of the plurality of clusters;
An arithmetic unit characterized by the above.

The arithmetic device according to claim 1,
The second selection process is a process of selecting the same data when the second plurality of data includes a predetermined number of two or more of the same data;
An arithmetic unit characterized by the above.

The arithmetic device according to any one of claims 1 to 3,
The second selection process is a process of selecting, as a fourth plurality of data, the same data and a third plurality of data having, among the first plurality of data, an attribute different from the attribute of the same data,
An arithmetic unit characterized by the above.

The arithmetic device according to claim 4,
The arithmetic device executes an update process for updating information in the storage device so that the selected fourth plurality of data is stored in the storage device corresponding to each attribute,
An arithmetic unit characterized by the above.

Dividing a set of points in a vector space indicating the similarity of two data having the same attribute selected from the first plurality of data having any of the plurality of attributes into a plurality of clusters;
Selecting at least one point from each of the plurality of clusters;
When the same data as any of the second plurality of data that is the basis of the plurality of points selected from each of the plurality of clusters is included in the second plurality of data, the same data is selected.
A data processing method.

A data processing system including a first arithmetic device and a second arithmetic device,
The first arithmetic unit includes:
A division process for dividing a first set of points in a vector space indicating similarity between two data having the same attribute selected from the first plurality of data having any of the plurality of attributes into a plurality of clusters;
A first selection process for selecting at least one point from each of the plurality of clusters;
A second selection process of selecting, as a fourth plurality of data, data that is the same as any of the second plurality of data on which the first plurality of points selected from each of the plurality of clusters are based, together with a third plurality of data having, among the first plurality of data, an attribute different from the attribute of the same data;
Run
The second arithmetic unit is
A calculation process for calculating a second plurality of points in the vector space indicating similarity between the input data not having attribute information and each of the fourth plurality of data;
An output process for outputting the identification information for each of the second plurality of points by using a learned model that is obtained using, as learning data, each point included in the first set of points and in a second set of points in the vector space indicating the similarity of two data having different attributes selected from the first plurality of data, and that outputs identification information indicating whether a point in the vector space is a point based on two data having the same attribute or a point based on two data having different attributes;
Executing a determination process for determining an attribute of the input data based on the identification information of each of the second plurality of points;
A data processing system.