JP2006252333A

JP2006252333A - Data processing method, data processor and its program

Info

Publication number: JP2006252333A
Application number: JP2005069921A
Authority: JP
Inventors: Koichi Doi; 晃一土井; Tomohiro Mimori; 智裕三森; Yasushi Fukuda; 安志福田; Hitoshi Jitsui; 仁実井; Maki Murata; 真樹村田
Original assignee: Nara Institute of Science and Technology NUC; National Institute of Information and Communications Technology; Sony Corp
Current assignee: Nara Institute of Science and Technology NUC; National Institute of Information and Communications Technology; Sony Corp
Priority date: 2005-03-11
Filing date: 2005-03-11
Publication date: 2006-09-21
Also published as: WO2006095853A1; CN101138001A

Abstract

PROBLEM TO BE SOLVED: To provide a data processing method capable of enhancing the reliability of machine learning when the machine learning is performed using two or more pieces of learning data.
SOLUTION: A similar learning data generation unit 4 selects, from among n pieces of learning data SDq, similar learning data SSDq that have high similarity to the data to be processed. A machine learning machine 5 performs the machine learning using the similar learning data SSDq.
COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a data processing method, a data processing apparatus, and a program therefor, for processing data to be processed using learning data.

For example, a gene analysis system uses a database in which interactions between genes are represented with gene (molecule) names as nodes and the interactions as links between the nodes.
To construct such a database, gene names are, for example, extracted from published papers and registered in the database as nodes.
Because the number of published papers is enormous, having humans read the papers and extract the gene names is far too burdensome.
It is therefore conceivable to extract gene names from paper data mechanically using a computer or the like. However, it is difficult to extract new gene names mechanically.
A similar problem arises when named entities other than gene names, such as person names, place names, and organization names, are extracted from text data.
To address this problem, there are machine learning devices, for example those based on an SVM, that identify in advance, in learned data (training data), the patterns in which a desired named entity appears in predetermined analysis units (tokens), and use those patterns as learning data to extract the named entity from the data to be processed.
A conventional machine learning device extracts the desired named entity from the data to be processed using, for example, all of the plural pieces of learning data that it holds.
Non-Patent Document 1: Tomohiro Mitsumori, Sevrani Fation, Masaki Murata, Kouichi Doi and Hirohumi Doi, "Gene/protein recognition using Support Vector Machine after dictionary matching", BioCreative Workshop: Critical Assessment for Information Extraction in Biology (BioCreative 2004), Granada, Spain, March 2004
Non-Patent Document 2: Nakano, Hirai, "Use of phrase information in Japanese named entity extraction", IPSJ Journal, Vol. 45, No. 3, pp. 934-941, Mar. 2004
Non-Patent Document 3: Taira, Haruno, "Attribute selection in text classification by Support Vector Machine", IPSJ Journal, Vol. 45, No. 4, pp. 1113-1123, Apr. 2004

However, the conventional machine learning device described above extracts named entities from the data to be processed using all of the learning data it holds, regardless of the attributes of the data to be processed. Learning data with low similarity to the attributes of the data to be processed are therefore also used, which lowers the reliability of named entity extraction.
Similar problems exist in machine learning devices other than the gene analysis system described above.

To solve the problems of the prior art described above, an object of the present invention is to provide a data processing method, a data processing apparatus, and a program therefor that can improve the reliability of processing when data to be processed are processed using a plurality of pieces of learning data.

To solve the problems of the prior art described above and achieve the above object, the data processing method of the first aspect of the invention is a data processing method for performing machine learning processing on data to be processed using learning data generated from learned data, the method comprising: a first step of generating, for each of a plurality of pieces of the learning data, similarity data indicating the similarity between the learned data used to generate that learning data and the data to be processed; a second step of selecting some of the plurality of pieces of learning data based on the similarity data generated in the first step; and a third step of performing machine learning processing on the data to be processed using the learning data selected in the second step.

The data processing method of the second aspect of the invention is a data processing method executed by a data processing apparatus that uses learning data, in which attribute data are attached to each of a plurality of processing unit data constituting learned data, to add the attribute data to data to be processed by machine learning, the method comprising: a first step of generating, for each of a plurality of pieces of the learning data, similarity data indicating the similarity between the processing unit data constituting the learned data of that learning data and the processing unit data constituting the data to be processed; a second step of identifying, based on the similarity data generated in the first step, the learned data whose similarity to the data to be processed satisfies a predetermined criterion from among the learned data of the plurality of pieces of learning data, and selecting the learning data corresponding to the identified learned data; and a third step of adding, by machine learning, the attribute data to the processing unit data constituting the data to be processed, using the learning data selected in the second step.

The data processing apparatus of the third aspect of the invention is a data processing apparatus that performs machine learning processing on data to be processed using learning data generated from learned data, the apparatus comprising: similarity data generation means for generating, for each of a plurality of pieces of the learning data, similarity data indicating the similarity between the learned data used to generate that learning data and the data to be processed; selection means for selecting some of the plurality of pieces of learning data based on the similarity data generated by the similarity data generation means; and processing means for performing machine learning processing on the data to be processed using the learning data selected by the selection means.

The data processing apparatus of the third aspect of the invention operates as follows.
The similarity data generation means generates, for each of the plurality of pieces of learning data, similarity data indicating the similarity between the learned data used to generate that learning data and the data to be processed.
Next, the selection means selects some of the plurality of pieces of learning data based on the similarity data generated by the similarity data generation means.
Next, the processing means performs machine learning processing on the data to be processed using the learning data selected by the selection means.
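The three means described above can be read as a small pipeline: score, select, process. The following Python is a minimal sketch under stated assumptions, not the patented implementation: the word-overlap (Jaccard) similarity, the threshold value, and the nearest-neighbour stand-in for the learner are all chosen here for brevity, and every function name is hypothetical.

```python
def similarity(learned_words, target_words):
    """Jaccard overlap between two word lists (an assumed similarity measure)."""
    a, b = set(learned_words), set(target_words)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_learning_data(learning_data, target_words, threshold):
    """Score each piece of learning data and keep those above the threshold."""
    selected = []
    for words, label in learning_data:
        if similarity(words, target_words) > threshold:
            selected.append((words, label))
    return selected


def process(selected, target_words):
    """Stand-in learner: copy the label of the most similar selected item."""
    if not selected:
        return None
    best = max(selected, key=lambda item: similarity(item[0], target_words))
    return best[1]


# Hypothetical learning data: (words of the learned data, attribute label).
learning_data = [
    (["kinase", "binds", "protein"], "biology"),
    (["protein", "gene", "expression"], "biology"),
    (["stock", "market", "price"], "finance"),
]
target = ["gene", "protein", "binds"]

subset = select_learning_data(learning_data, target, threshold=0.3)
result = process(subset, target)
```

In this toy run the finance item shares no words with the target, so it is dropped before the learner ever sees it, which is the effect the selection means is intended to have.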

The program of the fourth aspect of the invention is a program executed by a data processing apparatus that performs machine learning processing on data to be processed using learning data generated from learned data, the program comprising: a first procedure of generating, for each of a plurality of pieces of the learning data, similarity data indicating the similarity between the learned data used to generate that learning data and the data to be processed; a second procedure of selecting some of the plurality of pieces of learning data based on the similarity data generated in the first procedure; and a third procedure of performing machine learning processing on the data to be processed using the learning data selected in the second procedure.

According to the present invention, it is possible to provide a data processing method, a data processing apparatus, and a program therefor that can improve the reliability of processing when data to be processed are processed using a plurality of pieces of learning data.

A machine learning apparatus according to embodiments of the present invention is described below.
Among the following embodiments, the second embodiment is an example of the present invention in which the machine learning system of the first embodiment is applied to a machine learning system that performs learning processing on papers and the like.
The third embodiment is an example of the present invention in which the machine learning system of the first embodiment is applied to a machine learning system that controls access to content on the Internet.

<First Embodiment>
FIG. 1 is a configuration diagram of a machine learning system according to the first embodiment of the present invention.
As shown in FIG. 1, the machine learning system of the present embodiment includes, for example, a similar learning data generator 2 and a machine learning machine 5.
The similar learning data generator 2 includes, for example, a similarity calculation unit 3 and a similar learning data generation unit 4.
The machine learning system of the present embodiment selects, from a set of correct answer examples (learning data SDq), a subset (similar learning data SSDq) whose similarity to the problem to be solved (problem data TD) satisfies a predetermined condition, and uses that subset as the learning data for the learning machine, thereby improving learning speed and accuracy.
First, the correspondence between the components of the present embodiment and the components of the present invention is described.
The problem data TD shown in FIG. 1 and elsewhere correspond to the data to be processed of the present invention, and the learning data SDq correspond to the learning data of the present invention.
The learned data Rq shown in the figure correspond to the learned data of the present invention.
The words of the present embodiment correspond to the processing unit data of the present invention.
The similarity data BA(q) of the present embodiment correspond to the similarity data of the present invention.
The similarity calculation unit 3 shown in FIG. 1 corresponds to the similarity data generation means of the third aspect of the invention, the similar learning data generation unit 4 corresponds to the selection means of the third aspect, and the machine learning machine 5 corresponds to the processing means of the third aspect.
The functions of the similarity calculation unit 3, the similar learning data generation unit 4, and the machine learning machine 5 shown in FIG. 1 can also be written as a program and executed by a processing circuit, in which case that program corresponds to the program of the fourth aspect of the invention.

[Similarity calculation unit 3]
The similarity calculation unit 3 calculates, for each of the learned data Rq of the n pieces of learning data SDq, the similarity between that learned data Rq and the problem data TD.
In the present embodiment, the learned data Rq and the problem data TD are, for example, POS (Point Of Sale) data, text data, or multimedia data.
Each of these data is composed of a combination of a plurality of processing unit data.
For each of the plurality of learned data Rq, the similarity calculation unit 3 generates similarity data indicating the similarity between the processing unit data constituting that learned data Rq and the processing unit data constituting the problem data TD, and outputs the similarity data to the similar learning data generation unit 4.
Specifically, for each of the learned data Rq and the problem data TD, the similarity calculation unit 3 generates vector data that define the features of the data, based on the processing unit data constituting it, in a predetermined feature evaluation coordinate system.
The similarity calculation unit 3 then generates the similarity data based on the generated vector data.

The similarity calculation unit 3 generates, for example, the above vector data in the form (X, Y, Z). Let conditions A1 to A4 be
A1: d(x,y) ≥ 0
A2: d(x,y) = d(y,x)
A3: d(x,y) = 0 if and only if x = y
A3': d(x,x) = 0
A4: d(x,z) ≤ d(x,y) + d(y,z)
and let the condition sets B1 to B5 be
B1: A1, A2, A3, A4
B2: A1, A2, A3', A4
B3: A1, A2, A3
B4: A1, A2, A3'
B5: A1, A2
Then the similarity calculation unit 3 either uses a function d() giving a measure (measured value) that satisfies one of B1, B2, B3, B4, and B5, or calculates a distance with a similarity formula in which the distance decreases monotonically as the similarity increases, and generates the similarity data indicating that distance.
Here, B1 corresponds to a so-called "distance". For example, in a three-dimensional space it is the Euclidean distance, d(x,y) = {(x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2}^(1/2).
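To make conditions A1 to A4 concrete, the following Python sketch computes the three-dimensional Euclidean distance and checks B1 on a finite sample of points. The function names and sample points are hypothetical, and a check on a finite sample only illustrates the conditions; it does not prove them for all points.

```python
import itertools
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def satisfies_b1(d, points, tol=1e-9):
    """Check conditions A1-A4 (i.e. B1) for d on a finite sample of points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                        # A1: non-negativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # A2: symmetry
            return False
        if (d(x, y) <= tol) != (x == y):       # A3: zero iff identical
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # A4: triangle inequality
            return False
    return True


points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (1, 2, 2)]
print(euclidean((0, 0, 0), (1, 2, 2)))   # 3.0
print(satisfies_b1(euclidean, points))   # True
```

By contrast, the squared Euclidean distance fails A4 on collinear points such as 0, 1, and 2 on one axis (4 > 1 + 1), so it satisfies B3 but not B1.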

The similarity calculation unit 3 may also generate, for each of the plurality of learned data Rq, similarity data that express, in a predetermined coordinate system, the distance between the processing unit data constituting that learned data Rq and the processing unit data constituting the problem data TD.
In this case, the similarity calculation unit 3 uses, as the distance calculation method, the Euclidean distance, the squared Euclidean distance, the standardized Euclidean distance, the Minkowski distance, or an evaluation method based on distance calculation by a kernel method.
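The distance measures named above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the per-coordinate variances are made-up values, and the kernel-induced distance sqrt(k(x,x) - 2k(x,y) + k(y,y)) is one common way to obtain a distance from a kernel, assumed here because the text does not spell out which kernel-based calculation is meant.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def squared_euclidean(x, y):
    """Euclidean distance without the final square root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def standardized_euclidean(x, y, variances):
    """Euclidean distance with each squared difference divided by a variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


def kernel_distance(x, y, kernel):
    """Distance induced by a kernel k: sqrt(k(x,x) - 2 k(x,y) + k(y,y))."""
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding


def dot(x, y):
    """Linear kernel; with it, kernel_distance equals the Euclidean distance."""
    return sum(a * b for a, b in zip(x, y))


x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
d1 = minkowski(x, y, 1)                                 # 3 + 4 + 0
d2 = minkowski(x, y, 2)                                 # sqrt(9 + 16)
dsq = squared_euclidean(x, y)                           # 9 + 16
dstd = standardized_euclidean(x, y, (9.0, 16.0, 1.0))   # sqrt(1 + 1 + 0)
dk = kernel_distance(x, y, dot)                         # same as d2
```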

When a plurality of distances or similarities are obtained for the group of processing unit data of interest in one learned data Rq or in the problem data TD, the similarity calculation unit 3 may convert the distances into similarities using a separately given conversion formula, express the results as a similarity vector, convert that vector into a scalar value with a separately defined selection function, and use the scalar value as the similarity data.
The similarity calculation unit 3 may convert a similarity vector whose elements are a plurality of similarities into a scalar by taking the sum of the elements, the sum of their squares, the maximum value, the minimum value, or the like. The similarity calculation unit 3 may also add a non-zero positive number to the generated distance data and use the reciprocal of the result as the similarity data.
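The reciprocal conversion and the scalar reductions described above can be sketched as follows. The offset of 1.0 and all names are assumptions made for illustration; the text only requires that the added number be positive and non-zero.

```python
def distance_to_similarity(distance, offset=1.0):
    """Add a positive constant to the distance and take the reciprocal."""
    if offset <= 0:
        raise ValueError("offset must be a positive, non-zero number")
    return 1.0 / (distance + offset)


# Candidate selection functions that reduce a similarity vector to a scalar.
REDUCERS = {
    "sum": sum,
    "sum_of_squares": lambda v: sum(s * s for s in v),
    "max": max,
    "min": min,
}


def reduce_similarity(similarity_vector, how="sum"):
    """Collapse a vector of similarities into a single similarity value."""
    return REDUCERS[how](similarity_vector)


distances = [0.0, 1.0, 3.0]
sims = [distance_to_similarity(d) for d in distances]   # [1.0, 0.5, 0.25]
total = reduce_similarity(sims, "sum")                  # 1.75
```

The positive offset keeps the reciprocal finite when the distance is zero, and the resulting similarity decreases monotonically as the distance grows.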

[Similar learning data generation unit 4]
From among the n learned data Rq, the similar learning data generation unit 4 selects the learning data SDq of those learned data Rq for which the similarity indicated by the similarity data generated by the similarity calculation unit 3 exceeds a predetermined threshold, and outputs the selected learning data to the machine learning machine 5 as similar learning data SSDq.
As shown in FIG. 1, the learning data SDq include the learned data Rq and their attribute data PD.
The attribute data PD indicate the attribute of each processing unit data constituting the learned data Rq.
For example, when the learned data Rq and the problem data TD are e-mails, the attribute is information indicating whether or not the e-mail is junk mail; when the learned data Rq and the problem data TD are document data, the attribute is information indicating the part of speech of a word.
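The threshold-based selection can be sketched as below. The dictionary layout, the part-of-speech labels, the similarity values, and the threshold of 0.5 are all hypothetical; the text specifies only that learning data whose similarity exceeds a predetermined threshold are passed on.

```python
def select_similar(learning_data, similarities, threshold):
    """Return the learning data whose similarity exceeds the threshold."""
    if len(learning_data) != len(similarities):
        raise ValueError("one similarity value is required per learning data item")
    return [sd for sd, s in zip(learning_data, similarities) if s > threshold]


# Each item pairs learned data Rq (words) with attribute data PD (one per word).
sd = [
    {"words": ["the", "protein", "binds"], "attributes": ["DET", "NOUN", "VERB"]},
    {"words": ["prices", "rose"], "attributes": ["NOUN", "VERB"]},
    {"words": ["a", "gene", "encodes"], "attributes": ["DET", "NOUN", "VERB"]},
]
ba = [0.8, 0.1, 0.6]   # hypothetical similarity data BA(q), one per item

ssd = select_similar(sd, ba, threshold=0.5)   # similar learning data SSDq
```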

[Machine learning machine 5]
The machine learning machine 5 processes the problem data TD using the similar learning data SSDq input from the similar learning data generation unit 4.
Specifically, the machine learning machine 5 uses the similar learning data SSDq to add attribute data PD to the processing unit data constituting the problem data TD.
The machine learning machine 5 performs supervised learning processing using, for example, an SVM (Support Vector Machine), an artificial neural network, or a genetic algorithm.
The learning rule used in learning by the machine learning machine 5 is, for a Support Vector Machine, the set of parameters describing the hyperplane that separates the data, and, for an artificial neural network, the weight vector for each neuron.
As the machine learning method, the machine learning machine 5 may also use, besides the SVM and the other methods above, a decision list, a similarity-based method, the simple Bayes method, the maximum entropy method, a decision tree, a neural network, discriminant analysis, or the like.

The SVM that the machine learning machine 5 employs as an example is described below.
SVM is disclosed in, for example, Non-Patent Document 3 cited above.
When separating the problem data TD into the positive example set in hyperspace, the machine learning machine 5 performs SVM-based learning processing that obtains the optimal separating hyperplane by maximizing the margin.
SVM is a technique based on structural risk minimization, which seeks the hypothesis that guarantees the smallest generalization error.
In an SVM, for example, with the input vector (problem data TD) denoted x, the function of the following equation (2) represents the hypothesis h.

In equation (2) above, w and b are parameters. The following lemma is known concerning the relationship between the dimension n of the input vector x and the VC dimension λ.

Lemma:
Assume the hyperplane h(x) = sign{w·x + b} as the hypothesis h(x). If there exists a sphere of radius R containing all l pieces of training data (in this embodiment, the similar learning data SSDq) x = xi (i being an integer from 1 to l), and the following equation (3) holds for each xi, then, with ||w|| denoting the norm of w, the following equation (4) holds for the VC dimension λ.
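Equation (2) matches the hyperplane stated in the lemma, but equations (3) and (4) do not survive in this text. For orientation only, the lemma as it is commonly stated in the SVM literature takes the following form; the forms of (3) and (4) below are an assumption about the omitted equations, not a transcription of them.

```latex
% Assumed standard forms; equation numbers follow the references in the text.
h(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b) \tag{2}

|\mathbf{w} \cdot \mathbf{x}_i + b| \ge 1 \quad (i = 1, \dots, l) \tag{3}

\lambda \le \min\left(R^2 \lVert \mathbf{w} \rVert^2,\; n\right) + 1 \tag{4}
```

Under this reading, a smaller ||w|| gives a tighter bound on the VC dimension λ whenever R^2 ||w||^2 is smaller than n.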

From equation (4) above, the VC dimension can depend on ||w||.
The SVM divides the training data into positive and negative examples and identifies the hyperplane for which the margin between the positive and negative examples is largest, that is, for which ||w|| is smallest.
The machine learning machine 5 handles the identification of this hyperplane as a quadratic optimization problem using, for example, Lagrange multipliers.
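The link between a small ||w|| and a large margin can be illustrated without solving the optimization problem. The Python sketch below compares two hand-picked hyperplanes on toy data; the data points, the weights, and the function names are assumptions made for illustration, and the sketch assumes the usual normalization in which the closest examples satisfy |w·x + b| = 1, so that the margin width is 2/||w||. It does not perform the Lagrange-multiplier optimization itself.

```python
import math


def hypothesis(w, b, x):
    """h(x) = sign(w . x + b), returning +1 or -1 (a score of 0 maps to +1)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def margin(w):
    """Width of the band between w.x + b = +1 and w.x + b = -1: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Toy training data and two separating hyperplanes (assumed values).
positives = [(2.0, 2.0), (3.0, 3.0)]
negatives = [(0.0, 0.0), (-1.0, 0.0)]
wide = ((0.5, 0.5), -1.0)     # smaller ||w||, so the margin is larger
narrow = ((1.0, 1.0), -2.0)   # same direction scaled up: smaller margin

wide_margin = margin(wide[0])       # 2 / sqrt(0.5)
narrow_margin = margin(narrow[0])   # 2 / sqrt(2)
```

Both hyperplanes classify the toy data correctly and satisfy |w·x + b| ≥ 1 on every point, but the one with the smaller norm leaves the wider band between the classes, which is the hyperplane the SVM prefers.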

As described above, the machine learning system of the present embodiment selects, from the n pieces of learning data SDq, only those with high similarity to the problem data TD and uses them for the learning processing of the problem data TD in the machine learning machine 5.
Learning data SDq with low similarity to the problem data TD are therefore not used in learning for the problem data TD, and as a result the reliability of the processed data TR is improved.

In addition to improving the reliability of processing, the machine learning system of the present embodiment reduces the amount of data used for learning, which shortens the time required for learning and reduces the machine resources needed.

In the present embodiment, the system is presented for a task in which text data are input as the problem data TD and desired words are extracted from the text data, using part-of-speech information, word-ending spelling, and word type as attribute data for each word. The inputs and task settings are not limited to these, and various applications are clearly possible. For example, as shown in FIG. 4, Point Of Sale data or multimedia data such as music, audio, television programs, and video can be specified as the problem data TD, and the task can be set to analyzing sales patterns, filtering junk mail or news programs, extracting video clips desired by the user, and so on.
The present embodiment is applicable to systems that extract customer trends from POS data, classify text data and multimedia data, and extract information.
The processing unit data of the present embodiment can be: Point Of Sale information including product type and number of units sold, arrival date, sales date, and purchasing customer information such as age, gender, and family composition; documents such as e-mail text, papers, patents, web page (HP) documents, program guides, and lyrics, or such documents broken down into sentences or words; time-series data such as musical score data and music; spectral data such as gas chromatography output; and video information such as news programs, dramas, and video images. In each case, some constituent unit is defined, and the data are either constructed as, or analyzed as being constructed from, combinations, superpositions, compositions, or sequences of that unit. Data obtained by processing such data with some additional procedure can also be added and used.

<Second Embodiment>
The second embodiment is an embodiment in which the machine learning system of the first embodiment is applied to a machine learning system that performs learning processing on papers and the like.

First, the correspondence between the components of the present embodiment and the components of the present invention is described.
The problem data TD shown in FIG. 3 and elsewhere correspond to the data to be processed of the present invention, and the learning data SDq correspond to the learning data of the present invention.
The learned data Rq shown in FIG. 4 and elsewhere correspond to the learned data of the present invention.
The words of the present embodiment correspond to the processing unit data of the present invention.
The similarity data BA(q) of the present embodiment correspond to the similarity data of the present invention.
The index data TF(i, j) shown in equation (6) correspond to the index data of the present invention.
Step ST2 shown in FIG. 13 corresponds to the first step of the first aspect of the invention, step ST3 to the second step, and step ST5 to the third step.
The similarity calculation unit 33 of the similar learning data selection unit 11 shown in FIG. 5 corresponds to the similarity data generation means of the third aspect of the invention, the learning data selection unit 34 corresponds to the selection means of the third aspect, and the IOB determination unit 72 of the IOB addition unit 22 shown in FIG. 12 corresponds to the processing means of the third aspect.

FIG. 3 is an overall configuration diagram of a machine learning device 1 according to an embodiment of the present invention.
As shown in FIG. 3, the machine learning device 1 includes, for example, a memory 9, a similar learning data selection unit 11, a tag addition unit 13, an IOB determination data generation unit 15, a tag addition unit 21, and an IOB addition unit 22.
Each component of the machine learning device 1 is implemented, for example, in hardware such as an electronic circuit.
The elements that make up each of these components, described later, are likewise implemented using hardware such as electronic circuits.
Alternatively, in the present invention, some or all of the components of the machine learning device 1 shown in FIG. 3 and of their constituent elements may be realized by a CPU (Central Processing Unit) executing a program.

First, the learning data SDq (SD1 to SDn), which is stored in the memory 9 shown in FIG. 3 and input to the similar learning data selection unit 11, will be described.
FIG. 4 is a diagram for explaining the learning data SDq (SD1 to SDn).
As shown in FIG. 4, the learning data SDq is data in which attribute tag data IOB has been associated in advance with each of the words in the learned data Rq.
Here, the attribute tag data IOB indicates “B” for the first word of a predetermined term (for example, the name of a protein).
The attribute tag data IOB indicates “I” for a word that follows the first word and is also part of the predetermined term.
The attribute tag data IOB indicates “O” for a word that is not part of the predetermined term.
In the present embodiment, the learned data Rq and the problem data TD are, for example, English-language paper data containing protein names. The English text is divided into words by a rule in which, for example, a space is the delimiter and a sentence-final period is separated from the word immediately preceding it.
In the present embodiment, n pieces of learning data SDq are available to the machine learning device 1.
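
The word-division rule and the B/I/O labelling described above can be illustrated with a minimal sketch. The function names `tokenize` and `iob_tags`, the example sentence, and the way known terms are supplied as word lists are all assumptions made for illustration; the embodiment only states the rule and the meaning of the three tags.

```python
def tokenize(sentence):
    """Split on spaces and separate a sentence-final period from the last word."""
    tokens = sentence.split()
    if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("."):
        tokens[-1:] = [tokens[-1][:-1], "."]
    return tokens


def iob_tags(tokens, terms):
    """Label each token B, I, or O given known terms (each term is a list of words)."""
    tags = ["O"] * len(tokens)
    for term in terms:
        n = len(term)
        if n == 0:
            continue  # ignore empty terms
        for start in range(len(tokens) - n + 1):
            window_free = all(t == "O" for t in tags[start:start + n])
            if tokens[start:start + n] == term and window_free:
                tags[start] = "B"                 # first word of the term
                for k in range(start + 1, start + n):
                    tags[k] = "I"                 # following words of the term
    return tags


# Hypothetical example sentence and term list.
tokens = tokenize("Kinase C phosphorylates the protein.")
tags = iob_tags(tokens, [["Kinase", "C"]])
```

Each (word, tag) pair produced this way corresponds to one row of the learning data SDq illustrated in FIG. 4.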

Hereinafter, each component shown in FIG. 3 will be described.
[Similar learning data selection unit 11]
FIG. 5 is a configuration diagram of the similar learning data selection unit 11 shown in FIG. 3.
As illustrated in FIG. 5, the similar learning data selection unit 11 includes an input unit 31, an input unit 32, a similarity calculation unit 33, a learning data selection unit 34, and an output unit 35.
The input unit 31 receives the n pieces of learning data SDq shown in FIG. 4 from, for example, a memory (not shown) included in the machine learning device 1 or from outside the machine learning device 1.
The input unit 32 receives the problem data TD shown in FIG. 6 from outside the machine learning device 1.

The similarity calculation unit 33 calculates, for each piece of learned data Rq in the n pieces of learning data SDq shown in FIG. 4 input by the input unit 31, the similarity between that learned data Rq and the problem data TD.
Hereinafter, a method for calculating the similarity will be described.
Here, k is the number of distinct words (word types) contained in the n pieces of learned data Rq, which correspond to the n pieces of learning data SDq, and in the problem data TD.
Further, “i” is an integer from 1 to k, and “j” is an identifier assigned to each of the n pieces of learned data Rq and the one piece of problem data TD.

The similarity calculation unit 33 calculates index data TF(i, j) by the following equation (5).

[Equation 5]
TF(i, j) = (number of times word i appears in learned data Rj (or problem data TD)) / (total number of words in learned data Rj (or problem data TD))
... (5)

The similarity calculation unit 33 also determines DF(i) by the following equation (6).

[Equation 6]
DF(i) = (number of data items, among the n pieces of learned data Rq and the problem data TD, in which word i appears)
... (6)

The similarity calculation unit 33 then calculates, for each of the learned data Rq and the problem data TD, a value w(i, j) for every word i by the following equations (7) and (8).

[Equation 7]
IDF(i) = log[(N + 1) / DF(i)]
... (7)

[Equation 8]
w(i, j) = TF(i, j) * IDF(i)
... (8)

The value of IDF(i) decreases monotonically as the number of data items, among the problem data TD and the learned data Rq, that contain word i increases.
Multiplying TF(i, j) by this IDF(i) to generate w(i, j) makes it possible to almost eliminate the influence on the similarity of words such as “a”, “the”, “this”, and “that”, which are not named entities to be extracted and are unrelated to the attributes of the data.
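
Equations (5) to (8) can be sketched as follows. This is an illustration only: the function name `tfidf_vectors` and the toy documents are hypothetical, the problem data TD is placed last in the input list by convention, and N in equation (7) is assumed to equal n, the number of pieces of learned data.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of word lists; the last entry plays the role of the problem data TD.

    Returns the sorted vocabulary (the k word types) and one vector of
    w(i, j) values per document.
    """
    vocab = sorted({w for d in docs for w in d})
    n_learned = len(docs) - 1                       # assumption: N in eq. (7) equals n
    df = Counter(w for d in docs for w in set(d))   # DF(i), eq. (6)
    idf = {w: math.log((n_learned + 1) / df[w]) for w in vocab}  # eq. (7)
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d)
        # TF(i, j) from eq. (5) multiplied by IDF(i), eq. (8)
        vectors.append([counts[w] / total * idf[w] if total else 0.0 for w in vocab])
    return vocab, vectors


# Hypothetical toy data: two pieces of learned data and one piece of problem data.
docs = [["the", "kinase"], ["the", "cell"], ["the", "kinase", "binds"]]
vocab, vectors = tfidf_vectors(docs)
```

In this toy data the word "the" appears in every document, so DF equals N + 1, IDF is log 1 = 0, and its weight vanishes in every vector, which is the effect described for function words.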

Then, the similarity calculation unit 33 defines vectors D(q) and D(M) for the learned data Rq and the problem data TD, respectively, by the following equations (9) and (10).

[Equation 9]
D(q) = (w(1, q), w(2, q), ..., w(k, q))
... (9)

[Equation 10]
D(M) = (w(1, M), w(2, M), ..., w(k, M))
... (10)

Then, the similarity calculation unit 33 calculates the similarity data BA(q) given by equation (11) for every piece of learned data Rq.

The similarity calculation unit 33 outputs the similarity data BA(q) to the learning data selection unit 34.

The learning data selection unit 34 selects, from the n pieces of learning data SDq input by the input unit 31, only those whose similarity data BA(q) input from the similarity calculation unit 33 exceeds a predetermined reference value, and outputs them to the output unit 35 as similar learning data SSDq.
Alternatively, the learning data selection unit 34 may select, from the n pieces of learning data SDq input from the input unit 31, a predetermined number of pieces in descending order of the similarity indicated by the similarity data BA(q), and output them to the output unit 35.
In the example illustrated in FIG. 5, the learning data selection unit 34 outputs the learning data SD1, SD3, and SD10 to the output unit 35 as the similar learning data SSDq.
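
The two selection policies described above (a reference value, or a fixed number of the most similar pieces) can be sketched as follows. The form of BA(q) is not reproduced in this text, so the sketch assumes cosine similarity between D(q) and D(M) purely as one common choice; the function names and sample values are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed form of BA(q))."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_by_threshold(similarities, threshold):
    """Indices q whose similarity exceeds the reference value."""
    return [q for q, s in enumerate(similarities) if s > threshold]


def select_top_k(similarities, k):
    """Indices of the k most similar pieces, returned in index order."""
    order = sorted(range(len(similarities)), key=lambda q: similarities[q], reverse=True)
    return sorted(order[:k])


# Hypothetical similarity values for three pieces of learning data.
ba = [0.9, 0.1, 0.5]
chosen = select_by_threshold(ba, 0.4)
```

A threshold adapts the amount of selected data to how well the stored learning data matches the problem data, whereas a fixed count bounds the training cost; the text allows either.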

The output unit 35 outputs the similar learning data SSDq input from the learning data selection unit 34 to the tag addition unit 13 shown in FIG. 3.

[Tag addition unit 13]
As shown in FIG. 7, the tag addition unit 13 shown in FIG. 3 adds part-of-speech data and suffix data to each word constituting the learned data Rq of the similar learning data SSDq input from the similar learning data selection unit 11, thereby generating new similar learning data SSDAq.

FIG. 8 is a configuration diagram of the tag addition unit 13 shown in FIG. 3.
As illustrated in FIG. 8, the tag addition unit 13 includes, for example, an input unit 41, a part-of-speech tagger unit 42, a Suffix tagger unit 43, and an output unit 44.
The input unit 41 receives the similar learning data SSDq from the similar learning data selection unit 11 shown in FIG. 3 and outputs it to the part-of-speech tagger unit 42.
The part-of-speech tagger unit 42 adds, to each word in the similar learning data SSDq shown in FIG. 7 input from the input unit 41, part-of-speech data indicating its part of speech, and outputs the result to the Suffix tagger unit 43.
As shown in FIG. 7, the Suffix tagger unit 43 further adds suffix data to each word of the part-of-speech-tagged similar learning data input from the part-of-speech tagger unit 42 to generate the similar learning data SSDAq, and outputs it to the output unit 44.
In this embodiment, the Suffix tagger unit 43 adds a 3-gram suffix.
The output unit 44 outputs the similar learning data SSDAq input from the Suffix tagger unit 43 to the IOB determination data generation unit 15 shown in FIG. 3.
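
The suffix tagging step can be sketched as follows, assuming that a "3-gram suffix" means the last three characters of a word (or the whole word if it is shorter). The function names and the Penn-Treebank-style part-of-speech labels in the example are hypothetical; the part-of-speech tagger itself is not modelled here.

```python
def suffix(word, n=3):
    """Return the last n characters of the word (the whole word if shorter)."""
    return word[-n:]


def add_suffix(tagged_words, n=3):
    """tagged_words: list of (word, pos) pairs -> list of (word, pos, suffix) triples."""
    return [(word, pos, suffix(word, n)) for word, pos in tagged_words]


# Hypothetical part-of-speech-tagged input.
rows = add_suffix([("Kinase", "NN"), ("C", "NN"), ("binds", "VBZ")])
```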

[IOB determination data generation unit 15]
The IOB determination data generation unit 15 uses the similar learning data SSDAq shown in FIG. 7 input from the tag addition unit 13 to generate IOB determination data (feature data) SP used for analysis in the IOB addition unit 22, and outputs it to the IOB addition unit 22.

FIG. 9 is a configuration diagram of the IOB determination data generation unit 15 shown in FIG. 3.
As illustrated in FIG. 9, the IOB determination data generation unit 15 includes, for example, an input unit 51, an SVM learning unit 52, and an output unit 53.
The input unit 51 receives the similar learning data SSDAq from the tag addition unit 13 and outputs it to the SVM learning unit 52.
Based on the similar learning data SSDAq shown in FIG. 7 input from the input unit 51, the SVM learning unit 52 generates the IOB determination data SP shown in FIG. 10 by the SVM (Support Vector Machine) method, using, for example, the part-of-speech data and suffix data of the two words before and the two words after each word, so that it can be determined whether the attribute tag data IOB of each word is I, O, or B.
In the learning process by the SVM method, the SVM learning unit 52 uses, for example, a polynomial kernel as the kernel function and the pairwise method as the multi-class extension, and analyzes each sentence from its beginning toward its end.
As the learning process by the SVM learning unit 52, for example, the SVM described in the first embodiment is used.
The SVM learning unit 52 outputs the IOB determination data SP to the output unit 53.
The output unit 53 outputs the IOB determination data SP input from the SVM learning unit 52 to the IOB addition unit 22.
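
The context window used for learning can be sketched as a feature-extraction step. This is an illustration under assumptions, not the embodiment's implementation: the function name `window_features`, the feature-key format, the "[PAD]" placeholder for positions outside the sentence, and the inclusion of the current word's own tags at offset 0 are all hypothetical. The SVM training itself is omitted.

```python
def window_features(rows, i, width=2):
    """Collect part-of-speech and suffix data for position i and its neighbours.

    rows: list of (word, pos, suffix) triples for one sentence.
    Returns a dict keyed by relative offset, e.g. "pos[-2]" ... "pos[+2]".
    """
    feats = {}
    for off in range(-width, width + 1):
        j = i + off
        if 0 <= j < len(rows):
            _, pos, suf = rows[j]
        else:
            pos, suf = "[PAD]", "[PAD]"   # assumed placeholder outside the sentence
        feats[f"pos[{off:+d}]"] = pos
        feats[f"suf[{off:+d}]"] = suf
    return feats


# Hypothetical tagged sentence.
sentence = [("Kinase", "NN", "ase"), ("C", "NN", "C"), ("binds", "VBZ", "nds")]
features = window_features(sentence, 0)
```

One such feature dictionary per word, paired with that word's B, I, or O tag, would form a training example for a multi-class classifier.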

Note that the IOB determination data generation unit 15 may use a learning method other than the SVM method, such as a decision list method, a similarity-based method, a naive Bayes method, a maximum entropy method, a decision tree method, a neural network method, or a discriminant analysis method.

[Tag addition unit 21]
As shown in FIG. 6, the tag addition unit 21 shown in FIG. 3 adds part-of-speech data and suffix data to each word constituting the problem data TD input from outside the machine learning device 1, thereby generating new problem data TDa.

FIG. 11 is a configuration diagram of the tag addition unit 21 shown in FIG. 3.
As illustrated in FIG. 11, the tag addition unit 21 includes, for example, an input unit 61, a part-of-speech tagger unit 62, a Suffix tagger unit 63, and an output unit 64.
The input unit 61 receives the problem data TD from outside the machine learning device 1 shown in FIG. 3 and outputs it to the part-of-speech tagger unit 62.
The part-of-speech tagger unit 62 adds, to each word in the problem data TD shown in FIG. 6 input from the input unit 61, part-of-speech data indicating its part of speech, and outputs the result to the Suffix tagger unit 63.
As shown in FIG. 6, the Suffix tagger unit 63 further adds suffix data to each word of the part-of-speech-tagged data to be processed input from the part-of-speech tagger unit 62 to generate the problem data TDa, and outputs it to the output unit 64.
The output unit 64 outputs the problem data TDa input from the Suffix tagger unit 63 to the IOB addition unit 22 shown in FIG. 3.

[IOB addition unit 22]
FIG. 12 is a configuration diagram of the IOB addition unit 22 shown in FIG. 3.
As illustrated in FIG. 12, the IOB addition unit 22 includes, for example, an input unit 71, an IOB determination unit 72, and an output unit 73.
The input unit 71 outputs the IOB determination data SP input from the IOB determination data generation unit 15 to the IOB determination unit 72.
Based on the IOB determination data SP input from the input unit 71, the IOB determination unit 72 adds attribute tag data IOB to each word of the problem data TDa shown in FIG. 6 input from the tag addition unit 21, thereby generating the processed data TR shown in FIG. 6.
Here, if the problem data TD is regarded as a problem to be solved, the processed data TR is its solution.
The IOB determination unit 72 outputs the processed data TR to the output unit 73.
The output unit 73 outputs the processed data TR input from the IOB determination unit 72 to the outside of the machine learning device 1.

Hereinafter, an operation example of the machine learning device 1 shown in FIG. 3 will be described.
FIG. 13 is a flowchart for explaining the operation example.
Hereinafter, each step shown in FIG. 13 will be described.
Step ST1:
As shown in FIG. 6, the tag addition unit 21 shown in FIG. 3 adds part-of-speech data and suffix data to each word constituting the problem data TD input from outside the machine learning device 1 to generate new problem data TDa, and outputs it to the IOB addition unit 22.

Step ST2:
The similarity calculation unit 33 of the similar learning data selection unit 11 shown in FIG. 5 calculates, for each piece of learned data Rq in the n pieces of learning data SDq shown in FIG. 4 input by the input unit 31, the similarity between that learned data Rq and the problem data TD to generate similarity data BA(q), and outputs it to the learning data selection unit 34.
Step ST3:
The learning data selection unit 34 shown in FIG. 5 selects, from the n pieces of input learning data SDq, only those whose similarity data BA(q) input from the similarity calculation unit 33 exceeds a predetermined reference value, and outputs them to the tag addition unit 13 shown in FIG. 3 as similar learning data SSDq.

Step ST4:
As shown in FIG. 7, the tag addition unit 13 shown in FIG. 3 adds part-of-speech data and suffix data to each word constituting the learned data Rq of the similar learning data SSDq input from the similar learning data selection unit 11 to generate new similar learning data SSDAq, and outputs it to the IOB determination data generation unit 15.
Step ST5:
The IOB determination data generation unit 15 shown in FIG. 3 uses the similar learning data SSDAq shown in FIG. 7 input from the tag addition unit 13 to generate IOB determination data (feature data) SP used for analysis in the IOB addition unit 22, and outputs it to the IOB addition unit 22.
Step ST6:
Based on the IOB determination data SP input in step ST5, the IOB addition unit 22 shown in FIG. 3 adds attribute tag data IOB to each word of the problem data TDa shown in FIG. 6 input from the tag addition unit 21, thereby generating the processed data TR shown in FIG. 6.
The machine learning device 1 then extracts the named entities (gene names) in the problem data TD based on the attribute tag data IOB attached to the processed data TR.
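
The final extraction from the tagged output can be sketched as follows. The function name `extract_terms` and the example data are hypothetical, and ignoring an "I" tag that has no preceding "B" is a design choice of this sketch rather than something the embodiment specifies.

```python
def extract_terms(tokens, tags):
    """Join each B token with the I tokens that follow it into one term."""
    terms, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                terms.append(" ".join(current))
            current = [token]            # start a new term
        elif tag == "I" and current:
            current.append(token)        # continue the open term
        else:
            if current:
                terms.append(" ".join(current))
            current = []                 # "O", or a stray "I", closes any open term
    if current:
        terms.append(" ".join(current))
    return terms


# Hypothetical processed data: words with their assigned IOB tags.
names = extract_terms(["Kinase", "C", "binds", "p53", "."], ["B", "I", "O", "B", "O"])
```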

As described above, the machine learning device 1 selects, from the n pieces of learning data SDq stored in the memory 9, only those having high similarity to the problem data TD, and uses them when the IOB addition unit 22 adds the attribute tag data IOB to the problem data TDa.
Therefore, learning data SDq having low similarity to the problem data TD is not used when adding the attribute tag data IOB to the problem data TDa, and the reliability of the processed data TR increases.
As a result, the desired named entities (gene names) can be extracted from the processed data TR with high reliability.
Further, in addition to improving the reliability of processing, the machine learning device 1 of the present embodiment reduces the amount of data used for learning, which shortens the time required for learning and reduces the machine resources needed.

<Third Embodiment>
The third embodiment is an embodiment in which the machine learning system of the first embodiment is applied to a machine learning system that controls access to content on the Internet.
FIG. 14 is a diagram for explaining the machine learning system 101 according to the third embodiment of the present invention.
In the machine learning system 101, the learning data generation unit 112 downloads a plurality of pieces of Web page data W1 stored on a server (not shown) on the Internet 111.
The learning data generation unit 112 generates learning data (teacher data) SDq by adding tag data TG indicating the content classification (attribute) to the downloaded Web page data W1 according to a predetermined rule, and outputs it to the similar learning data selection unit 115.
The tag data TG indicates information such as, for example, whether viewing is restricted, prohibition for viewers at or below a restricted age, and the presence of violent expressions.

Based on the similarity relationship between the Web page data W2, which is the data to be processed downloaded via the Internet 111, and the Web page data W1 of the learning data SDq, the similar learning data selection unit 115 selects the learning data SDq whose similarity satisfies a predetermined criterion as similar learning data SSDq and outputs it to the machine learning machine 116.
The similarity relationship is determined based on similarity data generated using, for example, the method described in the first embodiment.

The machine learning machine 116 performs learning processing on the Web page data W2 using the similar learning data SSDq input from the similar learning data selection unit 115, and outputs processed Web page data W3, to which tag data TG has been attached, to the cache memory 118 and/or the filter 125.
As the learning process by the machine learning machine 116, for example, the SVM described in the first embodiment is used.

The cache memory 118 stores the processed Web page data W3.
When the cache search unit 123 receives a browsing request issued by the user via, for example, the user interface 121 running on a computer, it reads the processed Web page data W3 corresponding to the browsing request from the cache memory 118 and outputs it to the filter 125.
When the processed Web page data W3 corresponding to the browsing request is not stored in the cache memory 118, the cache search unit 123 outputs to the content loader 131 a download request for the Web page data corresponding to that processed Web page data W3.
The content loader 131 transmits the download request to the server via the Internet 111.
As a result, the Web page data W1 related to the browsing request is downloaded to the learning data generation unit 112.

The filter 125 is incorporated as a function in a predetermined server or in the computer used by the user. In accordance with filter rules held in advance, it verifies the tag data TG of the input processed Web page data W3, and outputs processed Web page data W3 that satisfies a predetermined condition to the user interface 121 after removing its tag data TG.
In the example of FIG. 14, the cache search unit 123 is not essential.
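
The interaction of the cache search and the filter can be sketched as follows. Everything concrete here is an assumption for illustration: the page is modelled as a dictionary with hypothetical keys "content" and "tags", the filter rule is a plain predicate over the tags, and the download request is an arbitrary callback.

```python
def filter_page(page, rule):
    """Return the page content without its tags if the rule accepts the tags, else None."""
    if rule(page["tags"]):
        return page["content"]   # tag data is dropped before output
    return None


def fetch(url, cache, rule, request_download):
    """Look up a tagged page in the cache; on a miss, issue a download request."""
    page = cache.get(url)
    if page is None:
        request_download(url)    # cache miss: ask the loader to fetch the page
        return None
    return filter_page(page, rule)


# Hypothetical cache contents, rule, and download queue.
cache = {
    "page-a": {"content": "hello", "tags": {"violent": False}},
    "page-b": {"content": "fight", "tags": {"violent": True}},
}
requested = []
allow_non_violent = lambda tags: not tags["violent"]
shown = fetch("page-a", cache, allow_non_violent, requested.append)
blocked = fetch("page-b", cache, allow_non_violent, requested.append)
missing = fetch("page-c", cache, allow_non_violent, requested.append)
```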

As described above, in the machine learning system 101, the similar learning data selection unit 115 outputs to the machine learning machine 116, as similar learning data SSDq, only the learning data SDq whose attributes are similar to those of the Web page data W2, which is the data to be processed.
As a result, the machine learning machine 116 can attach highly reliable tag data TG to the Web page data W2, and the filter 125 can perform its filtering appropriately.
Further, in addition to improving the reliability of processing, the machine learning system 101 of the present embodiment reduces the amount of data used for learning, which shortens the time required for learning and reduces the machine resources needed.

The present invention is not limited to the embodiments described above.
In the embodiments described above, paper (literature) data in the field of genetics was given as an example of the data to be processed and the learned data Rq of the present invention, but other data may be used.
For example, the present invention is also applicable to machine learning processing such as extraction of protein expressions, named entity extraction (person names, place names, and the like), translation of modality expressions, case analysis, case conversion, and word sense disambiguation.

The present invention is applicable to a data processing system that uses learning data to add, to the processing unit data constituting the data to be processed, attribute data for extracting a predetermined term.

FIG. 1 is a configuration diagram of a machine learning system according to the first embodiment of the present invention.
FIG. 2 is a diagram for explaining the machine learning system according to the first embodiment of the present invention.
FIG. 3 is a configuration diagram of a machine learning device according to the second embodiment of the present invention.
FIG. 4 is a diagram for explaining the learned data Rq and the learning data SDq according to the second embodiment of the present invention.
FIG. 5 is a configuration diagram of the similar learning data selection unit shown in FIG. 3.
FIG. 6 is a diagram for explaining the data to be processed and related data according to the second embodiment of the present invention.
FIG. 7 is a diagram for explaining the similar learning data according to the second embodiment of the present invention.
FIG. 8 is a configuration diagram of the tag addition unit according to the second embodiment of the present invention.
FIG. 9 is a configuration diagram of the IOB determination data generation unit according to the second embodiment of the present invention.
FIG. 10 is a diagram for explaining the IOB determination data according to the second embodiment of the present invention.
FIG. 11 is a diagram for explaining the tag addition unit according to the second embodiment of the present invention.
FIG. 12 is a diagram for explaining the IOB addition unit according to the second embodiment of the present invention.
FIG. 13 is a diagram for explaining an operation example of the machine learning device shown in FIG. 3.
FIG. 14 is a diagram for explaining the third embodiment of the present invention.

Explanation of symbols

1 ... machine learning device, 2 ... similar learning data generator, 3 ... similarity calculation unit, 4 ... similar learning data generation unit, 11 ... similar learning data selection unit, 13 ... tag addition unit, 15 ... IOB determination data generation unit, 21 ... tag addition unit, 22 ... IOB addition unit, 31 ... input unit, 32 ... input unit, 33 ... similarity calculation unit, 34 ... learning data selection unit, 35 ... output unit, 41 ... input unit, 42 ... part-of-speech tagger unit, 43 ... Suffix tagger unit, 44 ... output unit, 51 ... input unit, 52 ... SVM learning unit, 53 ... output unit, 61 ... input unit, 62 ... part-of-speech tagger unit, 63 ... Suffix tagger unit, 64 ... output unit, 71 ... input unit, 72 ... IOB determination unit, 73 ... output unit

Claims

A data processing method for performing machine learning processing on data to be processed, using learning data generated from learned data, the method comprising:
a first step of generating, for each of a plurality of pieces of learning data, similarity data indicating the similarity between the learned data used to generate that learning data and the data to be processed;
a second step of selecting a part of the plurality of pieces of learning data based on the similarity data generated in the first step; and
a third step of performing machine learning processing on the data to be processed using the learning data selected in the second step.

The data processing method according to claim 1, wherein the learning data defines a correspondence between each of a plurality of pieces of processing unit data constituting the learned data and attribute data indicating its attribute, and
the third step performs processing that defines, based on the correspondence defined by the learning data selected in the second step, a correspondence between the plurality of pieces of processing unit data constituting the data to be processed and the attribute data.

The data processing method according to claim 2, wherein the first step generates, for each of the plurality of pieces of learning data, the similarity data indicating the similarity between the processing unit data constituting the learned data of that learning data and the processing unit data constituting the data to be processed, and
the third step performs processing that adds the attribute data to the processing unit data constituting the data to be processed, using the learning data selected in the second step.

The data processing method according to claim 1, wherein the first step generates, for each of the learned data and the data to be processed, vector data that defines the features of that data in a predetermined feature evaluation coordinate system based on the processing unit data constituting that data, and generates the similarity data based on the vector data.

When the learned data and the processed data are document data, and the processing unit data is word data,
5. The first step generates the vector data having the type of the word data appearing in each data as an element of the vector data and the appearance frequency of the word data of the type as the value of the element. The data processing method described in 1.

The data processing method according to claim 4, wherein the first step generates the vector data denoted by X, Y, and Z, and, where
A1: d(x, y) ≥ 0
A2: d(x, y) = d(y, x)
A3: d(x, y) = 0 if and only if x = y
A3': d(x, x) = 0
A4: d(x, z) ≤ d(x, y) + d(y, z)
and
B1: A1, A2, A3, A4
B2: A1, A2, A3', A4
B3: A1, A2, A3
B4: A1, A2, A3'
B5: A1, A2
calculates the similarity data indicating a distance either by using a function d() representing a measure that satisfies any one of B1, B2, B3, B4, and B5, or by a similarity calculation formula in which the distance decreases monotonically as the similarity increases.

A data processing method executed by a data processing apparatus that uses learning data, in which attribute data is attached to each of a plurality of processing unit data constituting learned data, to add the attribute data to processed data by machine learning, the method comprising:
a first step of generating, for each of a plurality of pieces of learning data, similarity data indicating the degree of similarity between the processing unit data constituting the learned data of that learning data and the processing unit data constituting the processed data;
a second step of identifying, based on the similarity data generated in the first step, the learned data whose similarity to the processed data satisfies a predetermined criterion from among the learned data of the plurality of pieces of learning data, and selecting the learning data corresponding to the identified learned data; and
a third step of adding the attribute data to the processing unit data constituting the processed data by machine learning using the learning data selected in the second step.
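
As an illustration only, and not part of the claims, the three steps might be sketched as below. Several things here are assumptions made for the example: the Jaccard word overlap stands in for the similarity data, a fixed threshold stands in for the predetermined criterion, and a simple majority vote per token stands in for the machine learning (the description mentions SVM-type learners). The names `jaccard`, `select_similar`, and `tag`, and the sample data, are hypothetical.

```python
from collections import Counter, defaultdict


def jaccard(a, b):
    """Assumed similarity: shared word types divided by all word types."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def select_similar(learning_data, processed_tokens, threshold):
    # First step: one similarity value per piece of learning data.
    similarity = [jaccard([tok for tok, _ in item], processed_tokens)
                  for item in learning_data]
    # Second step: keep the learning data that meets the criterion.
    return [item for item, s in zip(learning_data, similarity) if s >= threshold]


def tag(selected, processed_tokens, default="O"):
    # Third step (stand-in learner): majority label seen for each token.
    votes = defaultdict(Counter)
    for item in selected:
        for tok, label in item:
            votes[tok][label] += 1
    return [(tok, votes[tok].most_common(1)[0][0] if tok in votes else default)
            for tok in processed_tokens]


# Each piece of learning data is a list of (processing unit, attribute) pairs.
learning_data = [
    [("p53", "GENE"), ("regulates", "O"), ("apoptosis", "O")],
    [("tokyo", "O"), ("is", "O"), ("a", "O"), ("city", "O")],
]
processed = ["p53", "regulates", "growth"]

selected = select_similar(learning_data, processed, threshold=0.3)
tagged = tag(selected, processed)
```

Only the first piece of learning data shares words with the processed data, so it alone is passed to the third step.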

The data processing method according to claim 7, wherein the first step includes:
identifying the different types of processing unit data included in the processed data and the learned data;
for each of the processed data and the learned data, specifying the number of occurrences of each of the different types of processing unit data, and generating index data by dividing the specified number by the total number of processing unit data constituting that data; and
generating the similarity data based on a combination pattern of the index data of the different types of processing unit data obtained for each of the processed data and the learned data.
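
As an illustration only, and not part of the claims, the index data described above (the count of each type divided by the total number of processing unit data) might be computed as follows. The function name `index_data` and the sample tokens are assumptions for the example.

```python
from collections import Counter


def index_data(tokens):
    """Map each type of processing unit data to count / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}


# Hypothetical data consisting of four processing unit data (words).
tokens = ["gene", "p53", "gene", "binds"]
index = index_data(tokens)
```

Dividing by the total makes the index data comparable between data of different lengths.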

The data processing method according to claim 8, wherein the first step includes:
generating vector data whose elements are the index data of the different types of processing unit data obtained for each of the processed data and the learned data; and
generating the similarity data based on the relationship between the vector data of the processed data and the vector data of the plurality of learned data.

The data processing method according to claim 9, wherein the first step includes:
processing the index data so that the value of the index data of a given type of processing unit data becomes smaller as the number of the processed data and the learned data that include that type of processing unit data increases; and
generating the similarity data based on the combination pattern of the index data after that processing.

The data processing method according to claim 10, wherein
the number of learning data is n and the number of processed data is 1,
the number of types of processing unit data included in the n pieces of learned data corresponding to the n pieces of learning data and in the processed data is k,
i is an integer from 1 to k, and j is an identifier attached to each of the n pieces of learned data and the one piece of processed data,
the index data is
TF(i, j) = (number of times processing unit data i appears in learned data j (processed data j)) / (total number of processing unit data included in learned data j (processed data j)),
and
DF(i) = (number of data, among the n pieces of learned data and the processed data, in which processing unit data i appears),
the first step calculates
IDF(i) = log[(n + 1) / DF(i)]
and
w(i, j) = TF(i, j) * IDF(i),
expresses the vector D(q) of the learned data q (q is an integer from 1 to n) as D(q) = (w(1, q), w(2, q), ..., w(k, q)) and the vector D(M) of the processed data as D(M) = (w(1, M), w(2, M), ..., w(k, M)), and
calculates the similarity data BA(q) by formula (1) for all q from 1 to n.
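
As an illustration only, and not part of the claims, the TF, DF, IDF, and w(i, j) calculations above might be sketched as follows. Formula (1) is not reproduced in this section, so the cosine of the angle between D(q) and D(M) is used here purely as an assumed stand-in for BA(q). The function names and sample data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: the n learned data followed by the one processed data,
    each given as a non-empty list of processing unit data (tokens)."""
    n = len(documents) - 1
    types = sorted({t for doc in documents for t in doc})        # the k types
    df = {t: sum(1 for doc in documents if t in doc) for t in types}
    idf = {t: math.log((n + 1) / df[t]) for t in types}          # IDF(i)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # w(i, j) = TF(i, j) * IDF(i)
        vectors.append([counts[t] / len(doc) * idf[t] for t in types])
    return vectors


def cosine(u, v):
    """Assumed stand-in for formula (1), which this section does not show."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


learned = [["p53", "gene", "binds"], ["tokyo", "city", "gene"]]
processed = ["p53", "binds", "dna"]

vectors = tfidf_vectors(learned + [processed])
similarity = [cosine(v, vectors[-1]) for v in vectors[:-1]]
```

With this weighting, a type that appears in all n + 1 data has IDF(i) = log 1 = 0 and therefore contributes nothing to the vectors.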

The data processing method according to claim 7, wherein
the first step generates the similarity data using the learned data of the plurality of learning data read from a memory and the processed data input from outside the data processing apparatus, and
in the second step, the learning data corresponding to the identified learned data is selected from the learning data read from the memory.

The data processing method according to claim 7, wherein
the data processing apparatus has similarity data generation means, selection means, and attribute data addition means,
the similarity data generation means performs the first step,
the selection means performs the second step, and
the attribute data addition means performs the third step.

A data processing apparatus that performs machine learning processing on processed data using learning data generated based on learned data, the apparatus comprising:
similarity data generation means for generating, for each of a plurality of learning data, similarity data indicating the similarity between the learned data used to generate that learning data and the processed data;
selection means for selecting a part of the plurality of learning data based on the similarity data generated by the similarity data generation means; and
processing means for performing machine learning processing on the processed data using the learning data selected by the selection means.

A program executed by a data processing apparatus that performs machine learning processing on processed data using learning data generated based on learned data, the program comprising:
a first procedure for generating, for each of a plurality of learning data, similarity data indicating the similarity between the learned data used to generate that learning data and the processed data;
a second procedure for selecting a part of the plurality of learning data based on the similarity data generated in the first procedure; and
a third procedure for performing machine learning processing on the processed data using the learning data selected in the second procedure.