JP6424756B2 - Data processing apparatus and data processing method

Info

Publication number: JP6424756B2
Application number: JP2015139613A
Authority: JP
Inventors: 亮根山; 元裕中村
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2015-07-13
Filing date: 2015-07-13
Publication date: 2018-11-21
Anticipated expiration: 2035-07-13
Also published as: JP2017021634A

Description

The present invention relates to a data processing apparatus, and more particularly to a data processing apparatus that associates the items of new data whose item specifications are unknown with the items of known data whose item specifications are known.

When new data whose item specifications are unknown is imported into a master DB (database) whose item specifications are known, it is desirable to determine which item corresponds to which and to import the data as data of the same item. If item names (also called field names or attribute names) are similar, such as "vehicle speed" (車速), "speed" (速度), and the loanword "speed" (スピード), the items are likely to correspond, but they are not necessarily the same item. Moreover, when an item is named something like "s1", it is difficult to tell from the name alone that it is the same as "vehicle speed".

Patent Document 1 describes finding the items of a master DB whose value features are similar to those of the new data and, when there are several such items, selecting one of them based on the similarity of the item names.

JP 2006-099236 A
JP 2001-155025 A
JP 2011-248661 A

It is not always possible to determine the corresponding item accurately from the features of the data values. When several items in the known data have similar features, it is difficult to determine which of them the new item corresponds to.

In Patent Document 1, when there are several items with high similarity based on data value features, the corresponding item is selected based on item name similarity. However, when some items are hard to tell apart by data value similarity, an erroneous determination is likely. An erroneous determination is also likely when several items have high data value similarity and the item names have not been assigned appropriately.

In view of the above problems, an object of the present invention is to provide a technique that can accurately associate the items of new data whose item specifications are unknown with the items of known data whose item specifications are known.

To achieve this object, a data processing apparatus according to the present invention holds, for each item of the known data, information on the discrimination accuracy obtained from data value features. It computes the data value similarity between the new data and each item of the known data, and switches the method of determining the candidate known-data items corresponding to the new data depending on whether the items with high similarity include an item whose discrimination accuracy by data value is high.

More specifically, a data processing apparatus according to the present invention associates the items of new data whose item specifications are unknown with the items of known data whose item specifications are known, and comprises: storage means for storing information on the discrimination accuracy by data value features for a plurality of items of the known data; acquisition means for acquiring an item name and data values of the new data; and processing means for obtaining candidate items of the known data corresponding to the new data and outputting the item names of the candidates. The processing means obtains, for the plurality of items of the known data, the similarity of data value features to the new data. When the predetermined number of top items with the highest data value similarity include an item whose discrimination accuracy by data value features is high, the processing means outputs the item names of those high-accuracy items together with a ranking according to the similarity of the data value features. When the top items include no such item, the processing means obtains the similarity between the item name of each top item and the item name of the new data, and outputs the item names of the top items together with a ranking according to the item name similarity.

Known data whose item specifications are known is data for which it is known what kind of data is stored, or should be stored, in each item. New data whose item specifications are unknown is data for which the item name (text) is available but what the stored data is (what it represents) is unknown.

The information on the discrimination accuracy by data value features includes information indicating whether the discrimination accuracy by data value is high. It may be, for example, binary data indicating whether the accuracy is high, or data expressing the accuracy as a numerical value.

The features of the data values can be obtained from, for example, one or more of the following: the maximum, minimum, mean, and variance of the data values within a predetermined period, and the maximum, minimum, mean, and variance of the time differences (changes over time) of the data values within that period. The predetermined period may be a fixed period set in advance or a period that varies with circumstances. For example, when the data concerns a vehicle, one trip may be used as the predetermined period.
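As a minimal sketch of such a feature computation, the following Python function builds an eight-element vector from one item's values over one period. The function names and the `lag` parameter (the spacing, in samples, between the two values being differenced) are illustrative assumptions, not part of the specification, and population variance is chosen arbitrarily.

```python
from statistics import mean, pvariance


def summarize(values):
    """Return (max, min, mean, population variance) of a non-empty sequence."""
    return (max(values), min(values), mean(values), pvariance(values))


def feature_vector(series, lag=1):
    """Eight-element feature vector for one item over one period (e.g. one trip).

    The first four entries summarize the raw values; the last four
    summarize the time differences series[i + lag] - series[i].
    """
    if len(series) <= lag:
        raise ValueError("series must be longer than lag")
    diffs = [later - earlier for earlier, later in zip(series, series[lag:])]
    return summarize(series) + summarize(diffs)
```

Any subset of the eight entries could be used instead, since the text allows one or more of these statistics.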

The processing means calculates the similarity between the data value features of the new data and those of the known data, and obtains the predetermined number of top known-data items with the highest similarity. Here, the top items may be a predetermined number of items counted from the highest similarity, or the items whose similarity is at or above a threshold.

The processing means determines whether the top known-data items include an item whose discrimination accuracy by data value features is high. This determination is based on the discrimination accuracy information. If the top items include high-accuracy items, the processing means ranks those high-accuracy items in order of data value similarity and outputs them as candidate known-data items corresponding to the new data. If the top items include no high-accuracy item, the processing means ranks the item names of the top items by item name (text) similarity and outputs them as the candidates.
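The switching rule in this paragraph can be sketched as a single function. This is only an illustration under assumed data structures: the function name, the dictionary of value similarities, the set of high-accuracy items, and the pluggable `name_similarity` callable are all hypothetical.

```python
def decide_candidates(new_item_name, top_items, value_similarity,
                      high_accuracy_items, name_similarity):
    """Rank candidate known-data items for one new-data item.

    top_items: known items already selected as most similar by value.
    value_similarity: dict mapping known item name to value similarity.
    high_accuracy_items: set of known items that values discriminate well.
    name_similarity: callable(new_name, known_name) returning a score.
    """
    reliable = [item for item in top_items if item in high_accuracy_items]
    if reliable:
        # At least one top item is discriminable by value: keep only
        # those items and rank them by value similarity.
        return sorted(reliable, key=lambda item: value_similarity[item],
                      reverse=True)
    # Otherwise keep all top items and rank them by item name similarity.
    return sorted(top_items,
                  key=lambda item: name_similarity(new_item_name, item),
                  reverse=True)
```

The first branch ignores item names entirely, which is what lets an item named "s1" still be matched to "speed" when its values are distinctive.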

With this configuration, for items whose discrimination accuracy by data value is high, candidates can be determined from data values alone, which prevents erroneous determinations caused by relying on item names. For items whose discrimination accuracy by data value is low, candidates can be determined using both data values and item names, which prevents erroneous determinations caused by relying on data values alone. In other words, the present invention makes it possible to accurately determine the candidate known-data items corresponding to new data whose item specifications are unknown.

The processing means can obtain the similarity of the data value features using a learner (classifier, discriminator) trained in advance on data whose item names are known. The learner is created by supervised learning, using data whose corresponding items are known as training data. Any known learning algorithm can be used, including Random Forest and SVM. The resulting learner can also be used to determine, for each item of the known data, whether discrimination based on data values is performed correctly, and the information on the discrimination accuracy by data value features can be generated from that result.

It is also preferable that the apparatus further include input means for receiving an input of the known-data item corresponding to the new data, and that the learner be retrained using this input. The item entered by the user (which may or may not be among the candidates) is the correct answer for the corresponding item, so retraining with it can improve the accuracy of the learner.

The processing means may obtain the item name similarity using an edit distance such as the Levenshtein distance, or using a Euclidean distance.
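For the edit-distance option, a standard Levenshtein distance can be computed with dynamic programming, as sketched below. Normalizing the distance by the longer name's length to get a similarity in [0, 1] is an assumption made here for illustration; the text does not say how a distance is turned into a similarity.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Map the edit distance to [0, 1]; 1.0 means identical names."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```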

The processing means can take the items whose data value similarity is at or above a predetermined threshold as the top items. It is also preferable that, when no item has a data value similarity at or above the threshold, the processing means output a notice that no similar item exists.
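A possible way to select the top items is sketched below. The function and parameter names are hypothetical; an empty result stands for the "no similar item exists" case, and how that notice is shown to the user is left open.

```python
def select_top_items(value_similarity, threshold=None, top_n=None):
    """Return known items ordered by descending value similarity.

    Pass threshold to keep only items whose similarity is at or above it,
    or top_n to keep a fixed number of items. An empty list means that
    no similar item exists.
    """
    ranked = sorted(value_similarity, key=value_similarity.get, reverse=True)
    if threshold is not None:
        ranked = [item for item in ranked
                  if value_similarity[item] >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked
```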

The present invention can be understood as a data processing apparatus comprising at least some of the above means. It can also be understood as a method that executes at least part of the processing performed by those means, as a computer program that causes a computer to execute the method, or as a computer-readable storage medium that stores the program non-transitorily. The above means and processes can be combined with one another wherever possible to constitute the present invention.

According to the present invention, the items of new data whose item specifications are unknown can be accurately associated with the items of known data whose item specifications are known.

FIG. 1 is a diagram showing the functional configuration of the data processing apparatus according to the embodiment.
FIG. 2A is a flowchart showing the flow of the pre-learning process, FIG. 2B is a conceptual diagram explaining discrimination accuracy based on feature quantities, and FIG. 2C shows an example of discrimination accuracy information.
FIG. 3 is a flowchart showing the flow of the data integration process according to the embodiment.
FIG. 4 is a flowchart showing the flow of the corresponding item candidate determination process according to the embodiment.

<System outline>
The data processing apparatus 1 according to the present embodiment has a master DB (database) whose item specifications are known, and supports the association of items when importing new data whose item specifications are unknown. The description below uses data on the state of a vehicle (hereinafter also called vehicle data) as a concrete example, but this does not limit the kinds of data to which the present invention can be applied.

Data whose item specifications are known is data for which the data stored, or to be stored, in each item is known. The specifications of a database that one creates and manages oneself are clear, so the item specifications of the data in the master DB are known.

In contrast, new data whose item specifications are unknown is data for which the text of the item name (also called a field name or attribute name) is available but what kind of data is stored there is unknown. Typically this is data created and managed by a third party.

The data processing apparatus 1 acquires new data and presents to the user (operator) the candidate master DB items corresponding to each item in the new data. This makes the data item association work performed by the user easier. The data processing apparatus 1 stores, for each item in the master DB, whether the accuracy of discrimination based on data value features is high, and uses this information to switch how it obtains the candidate corresponding items.

<Configuration>
The data processing apparatus 1 according to the present embodiment is configured as a general-purpose computer (information processing apparatus) that includes a processor (arithmetic processing unit) such as a CPU (Central Processing Unit) or MPU (Micro Processing Unit); a main storage device such as RAM (Random Access Memory); an auxiliary storage device such as a semiconductor memory, magnetic disk, optical disk, or magneto-optical disk; input devices such as a keyboard, various pointing devices (mouse, touch pad, touch panel, pen tablet, and so on), and a microphone; output devices such as a display device (liquid crystal display, CRT display, projector, and so on) and an audio output device; and a communication device for wired or wireless communication. The data processing apparatus 1 provides the functions described below when a computer program stored in the auxiliary storage device is loaded into the main storage device and executed by the processor. Some or all of these functions may instead be implemented using an ASIC (Application Specific Integrated Circuit), FPGA (Field Programmable Gate Array), DSP (Digital Signal Processor), or the like. The data processing apparatus 1 also need not be a single computer; its functions may be provided by several computers working together.

FIG. 1 shows the functional blocks of the data processing apparatus 1 according to the present embodiment. As shown in FIG. 1, the data processing apparatus 1 has, as its functional units, a new data input unit 10, a data processing unit 20, an input/output unit 30, a master DB 40, and a discrimination accuracy information storage unit 50.

The new data input unit 10 is a functional unit that acquires new data generated by another device. The acquisition method is not particularly limited; the data can be acquired by wired or wireless communication or via a storage medium.

The data processing unit 20 has a function of determining the candidate master DB items corresponding to each item of the new data, and a learning function for learning the data needed to determine those candidates. More specifically, the data processing unit 20 has the following sub-units: a data value similarity calculation unit 21, an item name similarity calculation unit 22, a corresponding item candidate determination unit 23, a learning processing unit 24, and a feature quantity calculation unit 25. The functions of the data processing unit 20 are described in detail below together with the flowcharts.

The input/output unit 30 has a function of presenting to the user the candidate items that the data processing unit 20 has determined for each item of the new data, and a function of accepting the user's selection of the corresponding item.

The master DB 40 stores data whose item specifications are known. One record stored in the master DB 40 has data items with item names such as "steering angle", "accelerator opening", "speed", "yaw rate", "right blinker", "left blinker", "right rear distance sensor value", and "left rear distance sensor value". The specifications of these data items are known to the data processing apparatus 1 (that is, to its manufacturer, administrator, and users). For example, the data item "steering angle" is specified as the rotation angle of the steering wheel expressed as a number in units of one degree, with rightward (clockwise) rotation positive and leftward (counterclockwise) rotation negative.

The information in the master DB 40 is not limited to the data items listed above. For example, it also includes identifiers of the device or user that created the data and the date and time the data was created or registered. Because this embodiment handles vehicle data, each record also includes a trip identifier so that the vehicle data of a single trip can be retrieved. A trip means travel from one point (start point) to another (end point). The start and end points can be determined by various methods.

In this embodiment, the discrimination accuracy information storage unit 50 stores, for each item of the master DB 40, information on whether the item can be accurately discriminated (classified, identified) based on data value features (hereinafter called discrimination accuracy information). This information is generated by the learning process of the data processing unit 20 and is referred to in the corresponding item candidate determination process of the data processing unit 20.

<Processing>
The processing performed by the data processing apparatus 1 according to the present embodiment falls broadly into two parts: a pre-learning process performed before data integration, and a data integration process that includes determining corresponding item candidates using the learning result. Each is described in detail below.

(1. Pre-learning process)
FIG. 2A is a flowchart showing the flow of the pre-learning process performed by the data processing unit 20. In this process, machine learning is used to determine, for each item of the master DB 40, whether the discrimination accuracy based on data value features is high.

In step S10, the data processing unit 20 acquires data from the master DB 40 and uses the feature quantity calculation unit 25 to calculate the feature quantities of the data values for each trip and each item. The feature quantities are expressed as, for example, any one or a combination of the maximum, minimum, mean, and variance of the data values within one trip, and the maximum, minimum, mean, and variance of the changes over time (time differences) of the data values within one trip. A change over time may be the difference between data points adjacent in the time series or the difference between data points a predetermined period apart.
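Before features can be computed per trip and per item, the records have to be regrouped into one series per (trip, item) pair. The sketch below assumes each record is a dictionary holding a trip identifier under a hypothetical key `trip_id` plus one numeric value per item name, and that records arrive in time order; neither assumption comes from the specification.

```python
from collections import defaultdict


def series_by_trip_and_item(records, trip_key="trip_id"):
    """Group record values into one time-ordered list per (trip, item).

    records: iterable of dicts, already in time order, each holding a trip
    identifier under trip_key plus one numeric value per item name.
    """
    series = defaultdict(list)
    for record in records:
        trip = record[trip_key]
        for item, value in record.items():
            if item != trip_key:
                series[(trip, item)].append(value)
    return dict(series)
```

Each resulting list can then be reduced to whichever statistics (maximum, minimum, mean, variance, and the same for time differences) are chosen as feature quantities.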

The data used for learning need not be limited to the data stored in the master DB 40. Any data can be used as long as it is known which item the data belongs to (including the case where it belongs to none of the items).

In step S11, the learning processing unit 24 uses a machine learning algorithm to generate a classifier (discriminator, learner) that identifies, from data value features, which item the data belongs to. The learning algorithm is not particularly limited; any algorithm that can produce such a classifier may be used, for example Random Forest or SVM.
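To make the idea of this step concrete without external libraries, the sketch below trains a nearest-centroid classifier. This is a deliberately simple stand-in and not one of the algorithms named in the text (Random Forest, SVM); the function names and the (feature vector, item name) sample format are assumptions.

```python
from collections import defaultdict
from math import dist


def train_centroids(samples):
    """samples: iterable of (feature_vector, item_name) pairs.

    Returns a dict mapping each item name to its centroid, the mean of
    that item's training feature vectors.
    """
    grouped = defaultdict(list)
    for features, item in samples:
        grouped[item].append(features)
    return {item: tuple(sum(column) / len(vectors) for column in zip(*vectors))
            for item, vectors in grouped.items()}


def classify(centroids, features):
    """Predict the item whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda item: dist(centroids[item], features))
```

In practice the feature quantities would likely need scaling first, since a Euclidean distance is dominated by whichever feature has the largest numeric range.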

In step S12, the learning processing unit 24 acquires evaluation data for each item, calculates the feature quantities of its data values, and evaluates the discrimination accuracy using the generated learner. The evaluation data may be data stored in the master DB 40 or other data, as long as the correct item for the data is known. It is also preferable to use cross-validation, such as k-fold cross-validation, for the evaluation. The evaluation index can be one or a combination of precision, recall, and F-measure, but other indices may also be used.

In step S13, the learning processing unit 24 determines, for each item, whether the evaluation value is at or above a predetermined threshold. Items whose evaluation value is at or above the threshold are judged to be items that can be discriminated accurately, and items whose evaluation value is below the threshold are judged to be items that cannot. The learning processing unit 24 stores this result in the discrimination accuracy information storage unit 50 as the discrimination accuracy information.
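Steps S12 and S13 can be illustrated with a per-item F-measure followed by thresholding. The sketch below takes the correct items and the classifier's predictions as two parallel lists; the function names and the threshold value 0.8 are assumptions chosen only for the example.

```python
def per_item_f_measure(true_items, predicted_items):
    """Return a dict mapping each item to its F-measure."""
    pairs = list(zip(true_items, predicted_items))
    scores = {}
    for item in set(true_items) | set(predicted_items):
        tp = sum(1 for t, p in pairs if t == item and p == item)
        fp = sum(1 for t, p in pairs if t != item and p == item)
        fn = sum(1 for t, p in pairs if t == item and p != item)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[item] = 2 * precision * recall / total if total else 0.0
    return scores


def discrimination_accuracy_info(scores, threshold=0.8):
    """Binary flag per item: True when the score meets the threshold."""
    return {item: score >= threshold for item, score in scores.items()}
```

With this shape, two items that are frequently confused with each other, such as a right and a left blinker, both end up flagged False, while an item that is never confused is flagged True.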

Classification (identification) using the classifier is explained with reference to FIG. 2B. FIG. 2B is a conceptual diagram in which the feature quantities obtained for each trip for the items "steering angle", "right blinker", and "left blinker" are plotted in a feature space. In the figure, triangles indicate steering angle feature quantities, circles indicate right blinker feature quantities, and crosses indicate left blinker feature quantities. The steering angle feature quantities are separated from those of the other data in the feature space, so accurate discrimination based on data value features is possible. In contrast, the right blinker and left blinker feature quantities are intermingled at relatively close positions in the feature space, so accurate discrimination based on data value features is difficult.

FIG. 2C is an example of the discrimination accuracy information stored in the discrimination accuracy information storage unit 50. Here, "steering angle", "accelerator opening", "speed", and "yaw rate" are judged to be accurately discriminable based on data value features, while "right blinker", "left blinker", "right rear distance sensor value", and "left rear distance sensor value" are judged to be difficult to discriminate accurately based on data value features.

(2. Data integration process)
Next, the data integration process by which the data processing apparatus 1 integrates new data into the master DB 40 is described with reference to the flowcharts of FIGS. 3 and 4.

In step S20, the new data to be integrated is acquired through the new data input unit 10. The new data input unit 10 acquires it, for example by communication (wired or wireless), from the device that generated the data or from a device that aggregates (collects) data from multiple devices. The new data is assumed to contain multiple items per record and to be input in a form that allows the records belonging to the same trip to be identified.

In step S21, the data processing unit 20 selects, from the data items in the new data, one data item for which candidate corresponding items of the master DB 40 are to be obtained. Below, the selected data item is called the target data item, and its data is called the target data.

In step S22, the data processing unit 20 determines candidates for the item of the master DB 40 that corresponds to the target data item. The details of this corresponding item candidate determination process are shown in FIG. 4.

In step S30, the data value similarity calculation unit 21 calculates the similarity of data values between the target data item and each item of the master DB 40. Specifically, the feature amount calculation unit 25 obtains the feature amount of the target data, and the classifier generated by the learning processing unit 24 in the prior learning process is used to calculate the similarity of the feature amount to each item of the master DB 40 (hereinafter also referred to as the value similarity). The feature amount of the target data is calculated in the same way as in the prior learning process (step S10), so a detailed description is omitted.

In step S31, the corresponding item candidate determination unit 23 determines whether the top data items ranked by value similarity include an item with high determination accuracy based on data values. The top data items may be defined as the data items whose value similarity is equal to or greater than a predetermined threshold, or as a predetermined number of data items taken in descending order of value similarity. Whether an item has high determination accuracy based on data values can be judged from the determination accuracy information stored in the determination accuracy information storage unit 50.
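As an illustration of the two ways of picking the top data items described for step S31, the following Python sketch selects items either by a similarity threshold or by a fixed count. The function names, the dictionary layout, and the example scores are assumptions made for this illustration and do not appear in the embodiment.

```python
from typing import Dict, List


def top_by_threshold(value_sim: Dict[str, float], threshold: float) -> List[str]:
    """Items whose value similarity is at least `threshold`, best first."""
    hits = [item for item, score in value_sim.items() if score >= threshold]
    return sorted(hits, key=lambda item: value_sim[item], reverse=True)


def top_by_count(value_sim: Dict[str, float], count: int) -> List[str]:
    """The `count` items with the highest value similarity, best first."""
    ranked = sorted(value_sim, key=lambda item: value_sim[item], reverse=True)
    return ranked[:count]


# Hypothetical similarity scores between one target data item and master DB items.
example_sim = {"speed": 0.82, "yaw rate": 0.55, "steering angle": 0.10}
```

With the threshold variant the number of top items varies with the data, so the list can be empty; with the count variant the list always has the requested length as long as enough items exist.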

If the top data items ranked by value similarity include an item with high determination accuracy based on data values (S31-YES), the process proceeds to step S32. In step S32, the corresponding item candidate determination unit 23 determines, as candidates for the item corresponding to the target data item, the data items whose value similarity is equal to or greater than the threshold and whose determination accuracy based on data values is high. If a plurality of data items satisfy this condition, the corresponding item candidate determination unit 23 determines all of them as corresponding item candidates and assigns a higher rank (priority) to items with higher value similarity.

On the other hand, if the top data items ranked by value similarity include no item with high determination accuracy based on data values (S31-NO), the process proceeds to step S33. In step S33, the item name similarity calculation unit 22 calculates the similarity of the item names (text) between the target data item and each item of the master DB 40. As the measure of item name similarity, the Euclidean distance may be used, or an edit distance such as the Levenshtein distance may be used. Alternatively, a dictionary that stores similarities between words may be held, and the item name similarity may be calculated by referring to this dictionary.
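To make the edit-distance option of step S33 concrete, the following Python sketch computes the Levenshtein distance between two item names and converts it into a similarity score. The normalization to the range 0 to 1 is an assumption made here for illustration; the embodiment does not specify how a distance is turned into a similarity.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical names, 0.0 for maximally different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

A name such as "s1" scores low against "speed" under this measure, which matches the observation that item names alone cannot always identify the corresponding item.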

In step S34, the corresponding item candidate determination unit 23 determines the data items whose value similarity is equal to or greater than the threshold as candidates for the item corresponding to the target data item. In doing so, the corresponding item candidate determination unit 23 assigns a higher rank (priority) to candidates with higher item name similarity.

This completes the corresponding item candidate determination process of step S22.
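The branching of steps S31 to S34 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the similarities are taken as precomputed dictionaries, the top items are chosen by threshold, and the function name, the return format, and the default threshold of 0.5 are hypothetical choices that do not come from the embodiment.

```python
from typing import Dict, List, Tuple


def decide_candidates(
    value_sim: Dict[str, float],
    name_sim: Dict[str, float],
    high_accuracy: Dict[str, bool],
    threshold: float = 0.5,  # assumed value, not given in the embodiment
) -> Tuple[str, List[str]]:
    """Return the basis used for ranking ("value" or "name") and the ranked candidates."""
    # S31: top items are those whose value similarity reaches the threshold.
    top = [item for item, score in value_sim.items() if score >= threshold]
    accurate = [item for item in top if high_accuracy.get(item, False)]
    if accurate:
        # S32: keep only the high-accuracy items, ranked by value similarity.
        return "value", sorted(accurate, key=lambda i: value_sim[i], reverse=True)
    # S33 and S34: keep all top items, ranked by item name similarity.
    return "name", sorted(top, key=lambda i: name_sim.get(i, 0.0), reverse=True)
```

An empty candidate list is returned when no item reaches the threshold, which is the case in which the user is told that no candidate exists.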

Returning to the flowchart of FIG. 3, in step S23 the input/output unit 30 presents to the user the corresponding item candidates determined by the corresponding item candidate determination unit 23. The candidates are preferably presented in a form that shows the rank (priority) of each candidate and allows a candidate to be selected. For example, the candidates may be presented in rank order, or together with a numerical value indicating the rank. It is also preferable to present the candidates in a form that shows whether they (and their priorities) were determined based on value similarity (in step S32) or based on item name similarity (in step S34). When the candidates were determined based on value similarity, the value similarity may be displayed alongside them, or candidates with particularly high value similarity may be highlighted (for example, by changing the color, using bold type, or highlighting). Similarly, when the candidates were determined based on item name similarity, the item name similarity may be displayed alongside them, or candidates with particularly high item name similarity or value similarity may be highlighted.

When the item candidates are presented to the user, they are preferably presented in a form that allows the user to select one of them. It is also preferable to allow the user to enter the corresponding item directly. In this way, the data processing apparatus 1 can acquire the corresponding item selected by the user.

Note that the corresponding item candidate determination process of step S22 may find no item that qualifies as a candidate. Specifically, this occurs when no item has a value similarity equal to or greater than the threshold. In such a case, the user should be informed in step S23 that there is no corresponding item candidate.

The user can select one of the corresponding item candidates presented in step S23, or can select an item that was not presented as a candidate. In step S24, the data processing unit 20 acquires from the user, via the input/output unit 30, the input of the item name corresponding to the target data item.

In step S25, the data processing unit 20 imports the target data into the master DB 40 as data of the item entered by the user, thereby integrating the data. If the import requires data format conversion or the like, the conversion method may be acquired from the user and the data may be imported into the master DB 40 after the conversion is applied.

In step S26, it is determined whether the data to be integrated (the new data) contains an unprocessed data item. If an unprocessed data item exists (S26-YES), the process returns to step S21 and the above processing is repeated for the next data item. If all data items have been processed (S26-NO), the process proceeds to step S27.

In step S27, the learning processing unit 24 retrains the classifier using each data item of the new data and the corresponding item entered by the user in step S24. By retraining with the item entered by the user as the correct corresponding item, the learning processing unit 24 can improve the classification accuracy of the classifier. It is also preferable to recalculate the determination accuracy information when the classifier is retrained.
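The loop over data items in steps S21 to S26, and the labeled pairs it produces for the retraining of step S27, can be sketched in Python as follows. The master DB is modeled as a plain dictionary, and the candidate search and the user interaction are passed in as callbacks; these modeling choices and all names are assumptions made for illustration, and format conversion and the retraining itself are left out.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def integrate(
    new_data: Dict[str, Sequence[float]],
    master_db: Dict[str, List[float]],
    find_candidates: Callable[[str, Sequence[float]], List[str]],
    ask_user: Callable[[str, List[str]], str],
) -> List[Tuple[Sequence[float], str]]:
    """Import every item of new_data into master_db and return (values, item) pairs."""
    labeled: List[Tuple[Sequence[float], str]] = []
    for item_name, values in new_data.items():            # S21, repeated via S26
        candidates = find_candidates(item_name, values)   # S22
        chosen = ask_user(item_name, candidates)          # S23 and S24
        master_db.setdefault(chosen, []).extend(values)   # S25
        labeled.append((values, chosen))
    return labeled  # correct answers available for retraining in S27
```

Passing the user interaction as a callback keeps the sketch testable; in the apparatus this role is played by the input/output unit 30.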

<Advantageous effects of this embodiment>
In the data processing apparatus 1 according to this embodiment, the items included in the master DB 40 are classified by prior learning into items whose determination accuracy based on the features of data values is high and items whose determination accuracy is low. The corresponding item candidates can therefore be determined using this information on determination accuracy.

Further, in this embodiment, for items whose determination accuracy based on the features of data values is high, the corresponding item candidates are determined from the features of the data values without considering the item names. That is, item names are not used when items can be determined with sufficient accuracy from the features of the data values. This improves the accuracy of determining corresponding item candidates, because item names are assigned arbitrarily, and taking them into account for items that can be determined accurately from data values may degrade the accuracy.

Further, in this embodiment, if the top items ranked by value similarity include no item whose determination accuracy based on the features of data values is high, the corresponding item candidates are ranked according to item name similarity. That is, item names are used when sufficient accuracy cannot be obtained from the features of the data values alone. In such a situation the corresponding item candidates cannot be determined accurately from the features of the data values, so using the item names improves the accuracy of determining them.

When importing data generated or managed by another person or another company, the data processing apparatus according to this embodiment can present to the user candidates for the item corresponding to the imported data, which reduces the burden on the user of associating data items. Now that techniques for deriving useful information and knowledge from large amounts of data (big data) are advancing, the data processing apparatus according to this embodiment makes it easier to prepare large amounts of data (preprocessing for analysis).

Further, in this embodiment, the item corresponding to the target data is received from the user and the classifier is retrained using this result, so the accuracy of the classifier can be expected to improve as the apparatus is used.

<Modifications>
Although the above description uses vehicle data as an example, it will be apparent to those skilled in the art that the present invention is applicable to any data, regardless of its type.

In the above description, the feature amount is obtained from the data within one trip, but the method of obtaining the feature amount is not particularly limited as long as it expresses the features of the data and allows identification using the feature amount. For example, the feature amount may be calculated from the data within a predetermined period instead of the data within one trip. Alternatively, it may be calculated from a predetermined number of temporally adjacent data points. It is also preferable to choose the feature amount calculation method according to the type of data being handled.
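As a sketch of the variant that uses a predetermined number of temporally adjacent data points, the following Python code computes the maximum, minimum, average, and variance (the statistics named in the claims) for consecutive fixed-size windows. The use of the population variance, the non-overlapping windows, and the function names are assumptions made for this illustration.

```python
from statistics import fmean, pvariance
from typing import Dict, List, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Summary statistics of one window (or one trip) of data values."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": fmean(values),
        "var": pvariance(values),  # population variance, an assumed choice
    }


def features_by_window(values: Sequence[float], window: int) -> List[Dict[str, float]]:
    """Split values into consecutive windows of `window` points and summarize each."""
    return [summarize(values[i:i + window]) for i in range(0, len(values), window)]
```

Calling `summarize` on all values of a trip gives the per-trip variant, so the same helper covers both cases.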

The above description dealt with an example in which the determination accuracy information indicates whether or not accurate determination is possible based on the features of data values. The determination accuracy information need not be binary; it may express the level of determination accuracy in three or more grades, or as a numerical value (for example, 1 to 10 or 1 to 100). In this case, when judging in step S31 whether an item has high determination accuracy, an item whose determination accuracy information is equal to or greater than a predetermined value may be judged to have high determination accuracy. The threshold may be a predetermined fixed value, or a value determined in consideration of the similarity to each item obtained in step S30.
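The graded form of the determination accuracy information can be sketched in Python as follows, here with a fixed threshold on a 1 to 10 scale. The grades assigned to each item and the function name are hypothetical; the variant in which the threshold depends on the similarities from step S30 is not shown because the text does not specify how it would be derived.

```python
from typing import Dict, Set


def high_accuracy_items(accuracy: Dict[str, int], minimum: int) -> Set[str]:
    """Items whose graded determination accuracy is at least `minimum`."""
    return {item for item, grade in accuracy.items() if grade >= minimum}


# Hypothetical grades on a 1 to 10 scale.
example_accuracy = {"steering angle": 9, "speed": 8, "right turn signal": 3}
```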

The above description of the embodiment and modifications is merely an example for explaining embodiments of the present invention and is not intended to limit the present invention to the scope of that disclosure. The elemental techniques described in the embodiment and the modifications can be combined to implement the present invention as long as no technical contradiction arises.

1 Data processing apparatus
10 New data input unit
20 Data processing unit
30 Input/output unit
40 Master database (DB)
50 Determination accuracy information storage unit

Claims

A data processing apparatus that associates items of new data whose item specification is unknown with items of known data whose item specification is known, the apparatus comprising:
storage means for storing information on determination accuracy based on features of data values for a plurality of items of the known data;
acquisition means for acquiring item names and data values of the new data; and
processing means for obtaining candidates for the item of the known data corresponding to the new data and outputting the item names of the candidates,
wherein the processing means
obtains, for the plurality of items of the known data, the similarity of features of data values to the new data,
outputs, when a predetermined number of top items ranked by similarity of data values include an item with high determination accuracy based on features of data values, the item name of the item with high determination accuracy together with a rank according to the similarity of features of data values, and
obtains, when the predetermined number of top items include no item with high determination accuracy based on features of data values, the similarity between the item names of the predetermined number of top items and the item name of the new data, and outputs the item names of the predetermined number of top items together with ranks according to the similarity of the item names.

The processing means obtains the similarity of the features of the data values using a learner trained in advance on data whose item names are known,
The data processing apparatus according to claim 1.

Further comprising input means for receiving an input of the item of the known data corresponding to the new data,
wherein the learner is retrained using the input to the input means,
The data processing apparatus according to claim 2.

The processing means determines the similarity of the item name based on the edit distance of the item name.
The data processing apparatus according to any one of claims 1 to 3.

The processing means obtains, as the predetermined number of top items, the items whose similarity of data values is equal to or greater than a predetermined threshold,
The data processing apparatus according to any one of claims 1 to 4.

The processing means outputs an indication that there is no similar item when no item has a similarity of data values equal to or greater than the predetermined threshold,
The data processing apparatus according to claim 5.

The processing means outputs the item names of the candidates in a user-selectable form,
The data processing apparatus according to any one of claims 1 to 6.

The characteristic of the data value is one of the maximum value, the minimum value, the average value, the variance, or the maximum value, the minimum value, the average value, or the variance within the predetermined period of the data value within the predetermined period. Required based on multiples,
The data processing apparatus according to any one of claims 1 to 7.

The known data and the new data are data related to a vehicle,
The predetermined period is one trip period,
The data processing apparatus according to claim 8.

A data processing method performed by a data processing apparatus that associates items of new data whose item specification is unknown with items of known data whose item specification is known, the method comprising:
a step of storing in advance information on determination accuracy based on features of data values for a plurality of items of the known data;
an acquisition step of acquiring item names and data values of the new data; and
a processing step of obtaining candidates for the item of the known data corresponding to the new data and outputting the item names of the candidates,
wherein, in the processing step, the data processing apparatus
obtains, for the plurality of items of the known data, the similarity of features of data values to the new data,
outputs, when a predetermined number of top items ranked by similarity of data values include an item with high determination accuracy based on features of data values, the item name of the item with high determination accuracy together with a rank according to the similarity of features of data values, and
obtains, when the predetermined number of top items include no item with high determination accuracy based on features of data values, the similarity between the item names of the predetermined number of top items and the item name of the new data, and outputs the item names of the predetermined number of top items together with ranks according to the similarity of the item names.

A program for causing a computer to execute the method according to claim 10.