JP6862969B2

JP6862969B2 - Information processing method, information processing device and information processing program for estimating data type

Info

Publication number: JP6862969B2
Application number: JP2017054494A
Authority: JP
Inventors: 真平齋藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2017-03-21
Filing date: 2017-03-21
Publication date: 2021-04-21
Anticipated expiration: 2037-03-21
Also published as: JP2018156549A

Description

The present invention relates to an information processing method, an information processing device, and an information processing program for estimating the data type of a set of numerical data from that set.

As data collection, processing, and storage technologies advance, there are more and more opportunities to handle large volumes of widely varied information. Most of the data handled was collected by others for other purposes, or was collected in the past. When data is collected, its meaning and specifications are often not recorded in the table that holds the data or in the data itself. Instead, they are described in the legend of a document that summarizes the table, in system specifications, and so on. Such specification information is often unavailable, out of date, or, in some cases, lost.

In general, when numerical set data is handled, its type, source, units, and so on are known. It is therefore easy to determine how the data should be processed.

However, somewhere between the collection of the data and its arrival with the person who processes it, part of the information needed to process the data correctly may be lost, so that it is no longer apparent at a glance what the data represents. For example, if no documentation of a database's table structure is available and the column names are abbreviated, what the data represents cannot be determined at a glance. In such cases, measures such as analyzing how the data was obtained and how it has been processed are taken. However, identifying the type of the data can take a long time, for example because contact with the person who collected the data is restricted or because specialized knowledge is required.

For example, table data in spreadsheet software often contains terms that only its creator and users understand. Once time has passed and the people involved at the time of creation are no longer available, there may be no clue at all as to what the data represents.

With character data, some meaning, such as an address or a name, can be inferred from the meaning of the words and sentences. With numerical data, however, it is difficult to make such an inference by looking at individual values. For text data, a known method of determining similarity generates a multidimensional vector for each word and determines the similarity between data. For numerical data, however, the values themselves cannot be used as dimensions, so such a method is difficult to apply.

Patent Document 1 describes a method of estimating the type of data based on statistical values such as the mean and variance.

Patent Document 1: Japanese Unexamined Patent Publication No. 2006-99236

However, the invention described in Patent Document 1 uses only statistical values (mean, variance, and so on), so a large amount of computation is needed to obtain a type that is likely to be correct. In addition, while a type such as "length" can be obtained, a lower-level (more detailed) type cannot.

An object of the present invention is to make it possible to estimate a more detailed type for a set of numerical data.

An information processing method according to the present invention is characterized by: calculating statistical values from learning numerical set data with which a known data type is associated, and extracting additional data attached to that numerical set data; generating a feature vector whose elements are the statistical values and the extracted additional data; storing the feature vector in a storage unit in a state layered by the type of the numerical set data; calculating statistical values from target numerical set data whose type is to be estimated, and extracting additional data attached to that numerical set data; generating a feature vector whose elements are those statistical values and that extracted additional data; calculating the distance between the generated feature vector and the feature vectors stored in the storage unit; and estimating the type of the target numerical set data based on the calculated distance.

An information processing device according to the present invention comprises: extraction means for calculating statistical values from learning numerical set data with which a known data type is associated, and for extracting additional data attached to that numerical set data; feature vector generation means for generating a feature vector whose elements are the statistical values and the extracted additional data; and feature vector layered storage means for storing the feature vector in a storage unit in a state layered by the type of the numerical set data. The extraction means calculates statistical values from target numerical set data whose type is to be estimated and extracts additional data attached to that numerical set data, and the feature vector generation means generates a feature vector whose elements are those statistical values and that extracted additional data. The device is characterized by further comprising estimation means for calculating the distance between the generated feature vector and the feature vectors stored in the storage unit, and for estimating the type of the target numerical set data based on the calculated distance.

An information processing program according to the present invention is characterized by causing a computer to execute: a process of calculating statistical values from learning numerical set data with which a known data type is associated, and extracting additional data attached to that numerical set data; a process of generating a feature vector whose elements are the statistical values and the extracted additional data; a process of storing the feature vector in a storage unit in a state layered by the type of the numerical set data; a process of calculating statistical values from target numerical set data whose type is to be estimated, and extracting additional data attached to that numerical set data; a process of generating a feature vector whose elements are those statistical values and that extracted additional data; and a process of calculating the distance between the generated feature vector and the feature vectors stored in the storage unit, and estimating the type of the target numerical set data based on the calculated distance.

According to the present invention, a more detailed type can be estimated for a set of numerical data.

FIG. 1 is a block diagram showing a data type estimation device, which is an example of an information processing device for estimating a data type.
FIG. 2 is an explanatory diagram showing an example of learning data.
FIG. 3 is an explanatory diagram showing an example of estimation target data.
FIG. 4 is a flowchart showing the processing of the learning phase.
FIG. 5 is an explanatory diagram for explaining the feature vector generated by the feature vector generation unit.
FIG. 6 is an explanatory diagram for explaining the hierarchical structure stored in the feature vector layered storage unit.
FIG. 7 is a flowchart showing the processing of the estimation phase.
FIG. 8 is an explanatory diagram showing an example of extracted feature vectors and related information.
FIG. 9 is an explanatory diagram showing an extraction result that includes a parent type.
FIG. 10 is a block diagram showing the main part of an information processing device according to the present invention.
FIG. 11 is a block diagram showing the main part of an information processing device according to another aspect of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram showing a data type estimation device, which is an example of an information processing device for estimating a data type (hereinafter also simply called a "type"). FIG. 1 shows the data type estimation device 10, a learning numerical string data storage device 500, and an estimation target numerical string data storage device 600.

The data type estimation device 10 includes a feature vector extraction unit 100, a learning unit 200, an estimation unit 300, and a feature vector management unit 400.

The feature vector extraction unit 100 receives numerical set data that includes a type (specifically, data indicating the type) and additional information, and outputs a feature vector. The feature vector extraction unit 100 includes a statistical value calculation unit 101, an additional data extraction unit 102, and a feature vector generation unit 103.

The statistical value calculation unit 101 calculates statistical values, such as the mean and variance, of the input numerical set data. The additional data extraction unit 102 extracts the unit and the characteristics of accompanying information that are to be included in the feature vector. The feature vector generation unit 103 combines the calculation result of the statistical value calculation unit 101 with the extraction result of the additional data extraction unit 102 to generate a feature vector for the input data.

The learning unit 200 receives numerical set data for learning (a set of numerical data), that is, learning data, and generates a feature vector. The learning unit 200 includes a learning data input unit 201, a data analysis unit 202, and a feature vector output unit 203.

The learning data input unit 201 reads learning data from the learning numerical string data storage device 500. The data analysis unit 202 uses the feature vector extraction unit 100 to obtain a feature vector for the learning data. The feature vector output unit 203 outputs the obtained feature vector and its type to the feature vector management unit 400.

The estimation unit 300 estimates the type of numerical set data. The estimation unit 300 includes an estimation target data input unit 301, a data analysis unit 302, a similar feature vector search unit 303, and a result display unit 304.

The estimation target data input unit 301 reads the numerical set data whose type is to be estimated from the estimation target numerical string data storage device 600. The data analysis unit 302 uses the feature vector extraction unit 100 to generate a feature vector for that numerical set data. The similar feature vector search unit 303 uses the feature vector management unit 400 to extract, from the feature vector storage unit 403, feature vectors that are close to the generated feature vector. The result display unit 304 displays the estimation result of the similar feature vector search unit 303 on a display unit (not shown). A feature vector stored in the feature vector storage unit 403 by way of the learning unit 200 is referred to as a learned feature vector.

The feature vector management unit 400 manages feature vectors. The feature vector management unit 400 includes a feature vector layered storage unit 401, a feature vector comparison unit 402, and a feature vector storage unit 403.

The feature vector layered storage unit 401 checks the type received from the feature vector output unit 203 against the contents of the feature vector storage unit 403, identifies the layer in which the feature vector should be stored, and stores the feature vector in the feature vector storage unit 403. The feature vector comparison unit 402 calculates the distance between the feature vector received from the similar feature vector search unit 303 and each learned feature vector stored in the feature vector storage unit 403, and extracts the types of the feature vectors that are close, assigning each a probability according to its distance. If the extracted types share a common type (a parent type), that parent type is extracted as well.

At least the statistical value calculation unit 101, the additional data extraction unit 102, the feature vector generation unit 103, the data analysis unit 202, the feature vector output unit 203, the data analysis unit 302, the similar feature vector search unit 303, the feature vector layered storage unit 401, and the feature vector comparison unit 402 can be realized by one or more CPUs (Central Processing Units) executing processing based on a program stored in a program storage unit. However, these and the other blocks (excluding the storage units) may instead be realized in hardware (discrete circuits or LSI (Large Scale Integration)).

Next, the operation of the data type estimation device 10 is described.

The description takes two cases as examples: learning for the estimation of numerical set data using the learning data shown in FIG. 2, and estimating numerical set data using the estimation target data shown in FIG. 3.

The learning data illustrated in FIG. 2 is numerical set data on the systolic blood pressure of adult men in their twenties. The estimation target data shown in FIG. 3 is numerical set data related to blood pressure whose data type is unknown.

The description also takes tabular data as an example. Hereinafter, numerical set data is sometimes called numerical string data. The input data may be a spreadsheet file, a database table, an XML (eXtensible Markup Language) document, a document in CSV (Comma-Separated Values) format, an HTML (Hypertext Markup Language) document, or the like. However, any input format is acceptable as long as the data can be decomposed into a set of numerical values and additional data.

In this embodiment, the data type estimation device 10 executes the processing of a learning phase and the processing of an estimation phase, in which estimation is actually performed. In the learning phase, for numerical set data whose data type is known, the data type estimation device 10 creates a feature vector by calculating statistical values and extracting additional data, and stores it in layered form based on the given data type.

FIG. 4 is a flowchart showing the processing of the learning phase. FIG. 4(A) shows the processing of the learning unit 200. FIG. 4(B) shows the processing of the feature vector extraction unit 100.

In the learning phase, the learning data input unit 201 of the learning unit 200 reads numerical string data for learning from the learning numerical string data storage device 500 as input data (step S11). The data analysis unit 202 uses the feature vector extraction unit 100 to obtain a feature vector for the learning data (step S12). The feature vector extraction unit 100 executes the processing shown in FIG. 4(B).

The statistical value calculation unit 101 calculates statistical values of the input numerical string data (step B11). The statistical values are, for example, the mean, variance, kurtosis, skewness, and type of distribution (for example, a normal distribution, a Poisson distribution, or a long-tailed distribution).
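
As an illustration of step B11, the following is a minimal Python sketch that computes the mean, variance, skewness, and kurtosis of a numerical set. The function name `summary_statistics`, the use of population (divide-by-n) moments, and the choice of excess kurtosis are assumptions made for this example; the patent does not specify formulas, and detection of the distribution type is omitted.

```python
def summary_statistics(values):
    """Return mean, variance, skewness and excess kurtosis of a list of numbers.

    Population moments (dividing by n) are an assumption of this sketch.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    # Second, third and fourth central moments.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # All values identical: shape statistics are undefined, report 0.0.
        skewness, kurtosis = 0.0, 0.0
    else:
        skewness = m3 / m2 ** 1.5
        kurtosis = m4 / m2 ** 2 - 3.0  # excess kurtosis (normal distribution -> 0)
    return {"mean": mean, "variance": m2, "skewness": skewness, "kurtosis": kurtosis}
```

Returning a dictionary keeps each statistic addressable by name when it is later combined with the additional data.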

The additional data extraction unit 102 extracts the unit of the values in the numerical string data (for example, m, g, yen, or °C) and the characteristics of information recorded alongside the values in the numerical string data (additional data) (step B12). If the numerical string data contains data that can identify an individual, such as a name or an ID, the additional data extraction unit 102 sets the individual identification information to "present". Other possible additional data includes, for example, latitude and longitude or time information. In the example shown in FIG. 2, "mmHg" is extracted as the unit, and the individual identification information is set to "present".
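
Step B12 might look like the following sketch. It assumes that a unit appears in parentheses at the end of a column header (for example "Systolic BP (mmHg)") and that identifying columns have headers containing the word "id" or "name". Both conventions, and the names `extract_additional_data` and `ID_COLUMN_HINTS`, are hypothetical; the patent leaves the extraction rules open.

```python
import re

# Assumed header words that mark a column as identifying an individual.
ID_COLUMN_HINTS = ("id", "name")


def extract_additional_data(headers, value_column):
    """Extract the unit of one column and flag whether other columns identify individuals.

    headers: list of column header strings.
    value_column: index of the numerical column being analysed.
    """
    # Unit convention assumed here: trailing "(unit)" in the header text.
    match = re.search(r"\(([^)]+)\)\s*$", headers[value_column])
    unit = match.group(1) if match else None
    has_individual_id = any(
        hint in header.lower().split()
        for i, header in enumerate(headers)
        if i != value_column
        for hint in ID_COLUMN_HINTS
    )
    return {"unit": unit, "has_individual_id": has_individual_id}
```

For headers such as `["ID", "Name", "Systolic BP (mmHg)"]` with `value_column=2`, this sketch yields the unit `"mmHg"` and marks individual identification as present, mirroring the FIG. 2 example described above.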

The feature vector generation unit 103 combines the calculation result of the statistical value calculation unit 101 with the extraction result of the additional data extraction unit 102 to generate a feature vector for the input data, and supplies it to the caller (in this case, the data analysis unit 202) (step B13).

FIG. 5 is an explanatory diagram for explaining the feature vector generated by the feature vector generation unit 103. The right-hand side of the equation shown in FIG. 5 shows the values obtained by the statistical value calculation unit 101 and the additional data extraction unit 102, with the numerical string data illustrated in FIG. 3 as input, collected into a multidimensional quantity (a vector).

The feature vector output unit 203 outputs the obtained feature vector and its type to the feature vector management unit 400.

In the feature vector storage unit 403, feature vectors are stored in a layered (clustered) form with respect to data type. The layering is performed by the feature vector layered storage unit 401, which, when it stores a feature vector, stores the given data type in layered form. In doing so, if there are two inputs such as "blood pressure of men in their twenties" and "blood pressure of women in their twenties", the type they share, "blood pressure of adults in their twenties", may be stored as a child type. Clustering methods include matching against a dictionary for a specific field and performing cluster analysis on the feature vectors. The feature vector layered storage unit 401 may use any known layering method, but it is preferable to select one that yields the desired estimation results.

The feature vector layered storage unit 401 in the feature vector management unit 400 reads hierarchy information indicating the hierarchical structure of data types from the feature vector storage unit 403 (step S13).

From the layers of the type of the numerical string data used as learning data and the hierarchical structure read from the feature vector layered storage unit 401, the feature vector layered storage unit 401 identifies the layer in which the feature vector should be stored (step S14). The feature vector layered storage unit 401 then stores the feature vector in the identified layer of the feature vector storage unit 403 (step S15).
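
Steps S13 to S15 can be pictured with a small tree keyed by type labels. The sketch below is one possible in-memory layout, not the storage structure of the patent, which does not prescribe one; the class name `TypeTree` and its methods are hypothetical. A type is given as a path of labels ordered from the broadest concept to the narrowest.

```python
class TypeTree:
    """A node in a hierarchy of data types; each node may hold feature vectors."""

    def __init__(self):
        self.children = {}  # label -> TypeTree
        self.vectors = []   # feature vectors stored at this layer

    def find_or_create(self, path):
        """Walk down the labels in path, creating missing layers (cf. step S14)."""
        node = self
        for label in path:
            node = node.children.setdefault(label, TypeTree())
        return node

    def store(self, path, vector):
        """Store a feature vector in the layer identified by path (cf. step S15)."""
        self.find_or_create(path).vectors.append(vector)

    def walk(self, prefix=()):
        """Yield (type_path, vector) for every stored feature vector."""
        for vector in self.vectors:
            yield prefix, vector
        for label, child in self.children.items():
            yield from child.walk(prefix + (label,))
```

Storing vectors under `("pressure", "blood pressure", "systolic", "male", "20s")` and under the same path ending in `"30s"` places both below a shared `"male"` node, so the common ancestor of two stored types can be recovered from their paths.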

The left-hand side of the equation shown in FIG. 5 shows the type assigned to the numerical string data illustrated in FIG. 3 in layered form. As shown in FIG. 5, the type is layered in order from the broadest concept ("pressure" in this example) to the narrowest ("20s" in this example).

FIG. 6 is an explanatory diagram for explaining the hierarchical structure stored in the feature vector layered storage unit 401. In the example shown in FIG. 6, the structure is represented as a tree running from the broadest concept in the hierarchy to the narrowest. In the examples shown in FIGS. 5 and 6, the processing of step S14 identifies the layer below "male".

FIG. 7 is a flowchart showing the processing of the estimation phase. FIG. 7(A) shows the processing of the estimation unit 300. FIG. 7(B) shows the processing of the feature vector extraction unit 100.

In the estimation phase, the estimation target data input unit 301 of the estimation unit 300 reads the numerical string data to be estimated from the estimation target numerical string data storage device 600 as input data (step S21). The data analysis unit 302 uses the feature vector extraction unit 100 to obtain a feature vector for the numerical string data to be estimated (step S22). The feature vector extraction unit 100 executes the processing shown in FIG. 7(B), which is the same as the processing shown in FIG. 4(B).

Next, the similar feature vector search unit 303 uses the feature vector management unit 400 to extract, from the feature vector storage unit 403, learned feature vectors that are close to the generated feature vector.

Specifically, the feature vector comparison unit 402 reads a list of feature vectors from the feature vector storage unit 403 (step S23). The feature vector comparison unit 402 receives, via the similar feature vector search unit 303, the feature vector obtained in the processing of step S23, and calculates the distance between that feature vector and each learned feature vector (each feature vector in the list) (step S24). In doing so, it is preferable that the feature vector comparison unit 402 not treat every element of the feature vector equally, but instead give the unit, the presence or absence of individual identification information, and the like more weight than the statistical values. For example, if the type is blood pressure, the unit is never a length (m) or a weight (kg), so it is preferable to make a difference in unit (a unit other than mmHg) contribute strongly to the distance.
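
The weighting idea in step S24 can be sketched as follows. This is an illustration only: the feature representation (a dictionary with four statistics, a unit string, and an identification flag), the function name `weighted_distance`, and the default weights are all assumptions, since the patent gives no concrete formula. In practice the statistics would probably also need to be normalized so that no single one dominates.

```python
import math

# Assumed names of the numeric statistics held in each feature dictionary.
STAT_KEYS = ("mean", "variance", "skewness", "kurtosis")


def weighted_distance(a, b, unit_weight=100.0, id_weight=10.0):
    """Euclidean-style distance in which categorical mismatches carry extra weight.

    a, b: dicts with the STAT_KEYS plus "unit" and "has_individual_id".
    unit_weight, id_weight: hypothetical penalties for a differing unit or flag.
    """
    squared = sum((a[key] - b[key]) ** 2 for key in STAT_KEYS)
    if a["unit"] != b["unit"]:
        squared += unit_weight ** 2  # a unit mismatch pushes the vectors far apart
    if a["has_individual_id"] != b["has_individual_id"]:
        squared += id_weight ** 2
    return math.sqrt(squared)
```

With these defaults, two sets that have identical statistics but different units (say mmHg and kg) are at distance 100.0, far more than typical differences in skewness or kurtosis would produce.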

When handling target data that involves, for example, foreign exchange (¥ ⇔ $) or electric power (W ⇔ VA), it is not always possible to consolidate the data to one unit per type. In such cases the unit may be excluded, or each unit may be given its own weight. Similarly, the presence or absence of individual identification information, position information, and time information is adjusted, for example by weighting it relative to the calculated statistical values, so that favorable estimation results are obtained.

The feature vector comparison unit 402 extracts the n feature vectors with the smallest distances (step S25), where n is a predetermined positive integer. The feature vector comparison unit 402 then outputs the n feature vectors to the similar feature vector search unit 303, together with a probability that corresponds to each distance. If the n feature vectors share a common type (a parent type), the feature vector comparison unit 402 extracts that parent type as well.
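
A sketch of step S25 follows. The patent says only that a probability "according to the distance" is attached, so the mapping `1 / (1 + distance)` used here is an assumption, as are the function names `nearest_types` and `common_parent`. The parent type is taken to be the longest shared prefix of the extracted type paths.

```python
import heapq


def nearest_types(distance_to_query, learned, n=3):
    """Return the n closest learned entries as (type_path, distance, probability).

    distance_to_query: function mapping a learned vector to its distance from the query.
    learned: iterable of (type_path, vector) pairs.
    """
    scored = [(distance_to_query(vector), path) for path, vector in learned]
    nearest = heapq.nsmallest(n, scored, key=lambda item: item[0])
    # Assumed mapping: distance 0 -> probability 1, larger distance -> smaller value.
    return [(path, dist, 1.0 / (1.0 + dist)) for dist, path in nearest]


def common_parent(paths):
    """Return the longest label prefix shared by all type paths (empty if none)."""
    prefix = []
    for labels in zip(*paths):
        if len(set(labels)) != 1:
            break
        prefix.append(labels[0])
    return tuple(prefix)
```

Because each probability is computed from its own distance, the values are not normalized and their sum can exceed 1.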

Various methods of measuring the distance between vectors are known. For example, methods that treat the angle between vectors (cosine similarity) as the distance, of which the MinHash method is representative, can be processed at high speed and are therefore suited to handling large amounts of numerical set data.
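
For reference, an angle-based distance can be written in a few lines. The sketch below computes the cosine distance exactly rather than through a hashing approximation, so it illustrates only the underlying measure and not the high-speed technique the text refers to; the function name `cosine_distance` is chosen for this example.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Vectors that point in the same direction have a distance near 0 regardless of their magnitude, which is one reason scale-sensitive statistics may behave differently under this measure than under a Euclidean distance.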

In this embodiment, the feature vector comparison unit 402 calculates the distances and probabilities, but these functions may instead be included in the similar feature vector search unit 303.

The result display unit 304 presents the n extracted feature vectors to the user by displaying them on the display unit (step S26).

FIG. 8 is an explanatory diagram showing an example (display example) of the extracted feature vectors and related information. As shown in FIG. 8, the types are hierarchical. Because a probability is calculated separately from each individual distance, the probabilities sum to more than 1.
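The source does not state how a distance is converted into a probability. As one assumed mapping, the sketch below uses 1 / (1 + d), applied to each distance independently, which shows why the values are not normalized and can total more than 1.

```python
def distance_to_probability(distance):
    """Map a non-negative distance to a score in (0, 1].

    The 1 / (1 + d) form is an assumption for illustration; the source
    only says the probability corresponds to the distance.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)

# Illustrative distances for two close candidates.
probabilities = [distance_to_probability(d) for d in (0.1, 0.2)]
total = sum(probabilities)  # each value is computed on its own, so no normalization
```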

In the example shown in FIG. 8, "pressure / blood pressure / maximum / male" is the common type (parent type). In such a case, the feature vector comparison unit 402 includes "pressure / blood pressure / maximum / male" in the extraction result as the parent type in the process of step S25. When the extraction result is displayed, the parent type is preferably shown at the top (see FIG. 9).

FIG. 9 is an explanatory diagram showing an extraction result that includes the parent type. In the example shown in FIG. 9, the distances for "20s" and "30s" differ only slightly. However, it can be estimated with near certainty that the data is male blood pressure, so the feature vector comparison unit 402 aggregates the results using the type hierarchy in order to improve estimation accuracy. If the types were not hierarchical, the relationship between feature vectors with small distances could not be identified, and an encompassing type could not be presented.
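One way to derive the parent type from the extracted candidates is to take the longest common prefix of their type paths. The sketch below assumes that each type is written as a "/"-separated path; both the representation and the function name are illustrative.

```python
def common_parent(type_paths, sep="/"):
    """Return the longest shared leading path of the given type paths, or None."""
    split_paths = [path.split(sep) for path in type_paths]
    prefix = []
    # zip stops at the shortest path, so the prefix never overruns any candidate.
    for parts in zip(*split_paths):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return sep.join(prefix) or None

parent = common_parent([
    "pressure/blood pressure/maximum/male/20s",
    "pressure/blood pressure/maximum/male/30s",
])
```

For the two candidates above, `parent` is "pressure/blood pressure/maximum/male", which matches the aggregation described for FIG. 9.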

As described above, in the present embodiment, even if the user has no knowledge of the target numerical set data, the data type estimation device 10 extracts the statistical values of the numerical set data and additional information such as units as a feature vector, and presents the result of comparing that feature vector with the feature vectors of the learned data. The user can therefore easily estimate the data type.

Further, because numerical set data consists of numerical values, it never matches the learning data exactly. However, the data type estimation device 10 calculates the distances between feature vectors and checks them against the hierarchical structure of the learned feature data. This makes it possible to extract the types corresponding to learned feature vectors at a small distance and to estimate the common type (parent type) of the extracted types with high accuracy.

In the present embodiment, blood pressure data is used as the numerical set data, but the present invention can also be applied to other kinds of numerical set data. For example, when referring to past sales data, even if the specifications of the table structure and types used at that time are difficult to obtain, learning the numerical data contained in current sales data in advance makes it possible to estimate automatically which column of the numerical set data is sales, which column is the discount amount, and so on.

FIG. 10 is a block diagram showing the main part of an information processing device according to the present invention. The information processing device 1A shown in FIG. 10 includes: extraction means 2 (realized in the embodiment by the statistical value calculation unit 101 and the additional data extraction unit 102) that calculates statistical values from numerical set data for learning and extracts additional data attached to the numerical set data; feature vector generation means 3 (realized in the embodiment by the feature vector generation unit) that generates a feature vector from the statistical values and the extracted additional data; and feature vector hierarchical storage means 4 (realized in the embodiment by the feature vector hierarchical storage unit 401) that stores the feature vector in a storage unit 5 (realized in the embodiment by the feature vector storage unit 403) in a state of being hierarchized by type of numerical set data.
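The learning-side flow of FIG. 10 (statistics and additional data, then a feature vector, then hierarchical storage) can be sketched as follows. The choice of statistics, the numeric unit code, and the nested-dictionary storage layout are all assumptions made for illustration; this passage does not fix any of them.

```python
import statistics

def make_feature_vector(values, unit_code):
    """Build a feature vector from statistics plus one additional-data element.

    The statistics used here and the numeric `unit_code` are hypothetical.
    """
    return [
        statistics.mean(values),
        statistics.pstdev(values),
        min(values),
        max(values),
        float(unit_code),
    ]

def store_hierarchically(tree, type_path, vector, sep="/"):
    """Store `vector` under the node reached by walking `type_path` in `tree`."""
    node = tree
    for part in type_path.split(sep):
        node = node.setdefault("children", {}).setdefault(part, {})
    node.setdefault("vectors", []).append(vector)
    return tree

# Illustrative learning data; the values do not come from the source.
tree = {}
vector = make_feature_vector([110, 120, 130], unit_code=1)
store_hierarchically(tree, "pressure/blood pressure/maximum", vector)
```

Keeping the vectors inside a tree keyed by type path means the parent of any stored vector can be read off directly during estimation.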

FIG. 11 is a block diagram showing the main part of an information processing device according to another aspect of the present invention. In the information processing device 1B shown in FIG. 11, the extraction means calculates statistical values from the target numerical set data whose type is to be estimated and extracts the additional data attached to that numerical set data, and the feature vector generation means generates a feature vector from those statistical values and the extracted additional data. The device further includes estimation means 6 (realized in the embodiment by the feature vector comparison unit 402) that calculates the distance between the generated feature vector and the feature vectors stored in the storage unit 5 and estimates the type of the target numerical set data based on the calculated distance.

1A, 1B Information processing device
2 Extraction means
3 Feature vector generation means
4 Feature vector hierarchical storage means
5 Storage unit
6 Estimation means
10 Data type estimation device
100 Feature vector extraction unit
101 Statistical value calculation unit
102 Additional data extraction unit
103 Feature vector generation unit
200 Learning unit
201 Learning data input unit
202 Data analysis unit
203 Feature vector output unit
300 Estimation unit
301 Estimation target data input unit
302 Data analysis unit
303 Similar feature vector search unit
304 Result display unit
400 Feature vector management unit
401 Feature vector hierarchical storage unit
402 Feature vector comparison unit
403 Feature vector storage unit
500 Learning numerical sequence data storage device
600 Estimation target numerical sequence data storage device

Claims

An information processing method comprising:
calculating statistical values from numerical set data for learning with which a known data type is associated, and extracting additional data attached to the numerical set data;
generating a feature vector having the statistical values and the extracted additional data as elements;
storing the feature vector in a storage unit in a state of being hierarchized by type of numerical set data;
calculating statistical values from target numerical set data whose type is to be estimated, and extracting additional data attached to that numerical set data;
generating a feature vector having those statistical values and the extracted additional data as elements; and
calculating the distance between the generated feature vector and the feature vector stored in the storage unit, and estimating the type of the target numerical set data based on the calculated distance.

The information processing method according to claim 1, wherein the distance between the generated feature vector and each of a plurality of feature vectors stored in the storage unit is calculated, and
a predetermined number of the feature vectors stored in the storage unit that have the smallest distances are output.

The information processing method according to claim 1 or 2, wherein the numerical set data is tabular data.

An information processing device comprising:
extraction means that calculates statistical values from numerical set data for learning with which a known data type is associated, and extracts additional data attached to the numerical set data;
feature vector generation means that generates a feature vector having the statistical values and the extracted additional data as elements; and
feature vector hierarchical storage means that stores the feature vector in a storage unit in a state of being hierarchized by type of numerical set data,
wherein the extraction means calculates statistical values from target numerical set data whose type is to be estimated, and extracts additional data attached to that numerical set data,
the feature vector generation means generates a feature vector having those statistical values and the extracted additional data as elements, and
the information processing device further comprises estimation means that calculates the distance between the generated feature vector and the feature vector stored in the storage unit and estimates the type of the target numerical set data based on the calculated distance.

The information processing device according to claim 4, wherein the estimation means calculates the distance between the generated feature vector and each of a plurality of feature vectors stored in the storage unit, and outputs a predetermined number of the feature vectors stored in the storage unit that have the smallest distances.

An information processing program that causes a computer to execute:
a process of calculating statistical values from numerical set data for learning with which a known data type is associated, and extracting additional data attached to the numerical set data;
a process of generating a feature vector having the statistical values and the extracted additional data as elements;
a process of storing the feature vector in a storage unit in a state of being hierarchized by type of numerical set data;
a process of calculating statistical values from target numerical set data whose type is to be estimated, and extracting additional data attached to that numerical set data;
a process of generating a feature vector having those statistical values and the extracted additional data as elements; and
a process of calculating the distance between the generated feature vector and the feature vector stored in the storage unit, and estimating the type of the target numerical set data based on the calculated distance.

The information processing program according to claim 6, further causing the computer to execute:
a process of calculating the distance between the generated feature vector and each of a plurality of feature vectors stored in the storage unit; and
a process of outputting a predetermined number of the feature vectors stored in the storage unit that have the smallest distances.