JP2010152692A

JP2010152692A - Similarity calculation apparatus, similarity calculation method and program

Info

Publication number: JP2010152692A
Application number: JP2008330765A
Authority: JP
Inventors: Hidenori Tsukahara; 英徳塚原; Ryohei Fujimaki; 遼平藤巻
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-12-25
Filing date: 2008-12-25
Publication date: 2010-07-08
Anticipated expiration: 2028-12-25
Also published as: JP5386976B2

Abstract

PROBLEM TO BE SOLVED: To provide a similarity calculation apparatus and a similarity calculation method that can prevent degradation of the reliability of a similarity calculation result when calculating the similarity between data having missing values and other data.

SOLUTION: The similarity calculation apparatus includes: a reception means for receiving first and second data, each having, for each attribute, an area in which an attribute value for that attribute either is or is not described; a storage means for storing, for each attribute, distribution information indicating the distribution of possible values of the attribute value; a first calculation means for calculating, for attributes whose attribute value is described in both the first and second data, the similarity between the attribute values, and, for attributes whose attribute value is missing in at least one of the first and second data, the expected similarity between attribute values from the distribution information; and a second calculation means for calculating the similarity between the first and second data from the similarities and expected similarities calculated by the first calculation means.

COPYRIGHT: (C)2010, JPO&INPIT

Description

本発明は、類似度計算装置、類似度計算方法およびプログラムに関する。 The present invention relates to a similarity calculation device, a similarity calculation method, and a program.

複数の属性（項目とも称される）のそれぞれに１対１で対応する複数の属性データ（以下「属性値」と称する）を持つデータが、さまざまな分野で使用されている。 Data having a plurality of attribute data (hereinafter referred to as “attribute values”) one-to-one corresponding to each of a plurality of attributes (also referred to as items) is used in various fields.

例えば、自動車の複数センサから収集される複数の属性値を持つ車両状態データは、属性として、車両の速度、エンジン回転数、および、シフトポジション等を持つ。また、これらの属性の中には、データが取得される車両の車種や年式、グレード等により、属性値が取得されない属性（属性値が欠損している属性）が存在したり、または、センサの不具合により、車両状態データ内の一部の属性値が欠損してしまう場合もある。 For example, vehicle state data, which holds a plurality of attribute values collected from a plurality of sensors of an automobile, has attributes such as vehicle speed, engine speed, and shift position. Among these attributes, some may have no attribute value acquired (attributes whose attribute values are missing) depending on the model, model year, grade, and so on of the vehicle from which the data is acquired, and some attribute values in the vehicle state data may also be lost because of a sensor malfunction.

複数の属性値のうちの一部の属性値が欠損しているデータ同士の類似度を計算する方法として、欠損している属性値（以下「欠損値」と称する）自体をある代表値、例えば０や平均値などで補完し、欠損値が補完されたデータ同士の類似度を計算する方法がある。 One method of calculating the similarity between data in which some of the attribute values are missing is to complement each missing attribute value (hereinafter referred to as a “missing value”) with a representative value, such as 0 or an average value, and then calculate the similarity between the complemented data.

特許文献１には、欠損値を有する属性（以下「特定属性」と称する）以外の属性に対応する属性値を用いて、欠損値を持たないデータの中から、欠損値を持つデータに類似した類似データを求め、類似データが持つ属性値の中から、特定属性に対応する属性値を特定し、その特定された属性値で欠損値を補完する方法が記載されている。 Patent Document 1 describes a method in which attribute values corresponding to attributes other than the attribute having a missing value (hereinafter referred to as the “specific attribute”) are used to find, among data having no missing values, similar data that resembles the data having the missing value; the attribute value corresponding to the specific attribute is then identified among the attribute values of the similar data, and the missing value is complemented with that attribute value.
特開２００２−２１５６４６号公報 JP 2002-215646 A

欠損値を代表値で補完すると、データ間の類似度の計算の際に、類似度の偏りが生じてしまうという課題があった。 When the missing value is complemented with the representative value, there is a problem that the similarity is biased when calculating the similarity between the data.

また、類似データを用いて欠損値を補完したデータと、その類似データと、の間の類似度を計算すると、そのデータ間が不当に類似してしまう。このため、類似性の高いデータを用いて欠損値を補完する方法は、欠損値を持つデータと他のデータとの類似度を求める際には適した方法ではなかった。 Moreover, if the similarity is calculated between data whose missing value was complemented using similar data and that similar data itself, the two appear unduly similar. For this reason, complementing a missing value using highly similar data is not a suitable method for obtaining the similarity between data having a missing value and other data.

よって、欠損値を持つデータと他のデータとの類似度の計算では、類似度の計算結果の信頼性が低くなるという課題があった。 Therefore, in the calculation of the similarity between data having missing values and other data, there is a problem that the reliability of the calculation result of the similarity is lowered.

本発明の目的は、上記課題を解決可能な類似度計算装置、類似度計算方法およびプログラムを提供することである。 An object of the present invention is to provide a similarity calculation device, a similarity calculation method, and a program that can solve the above problems.

本発明の類似度計算装置は、予め定められた各属性に１対１で関連づけられた各領域に、当該領域に関連づけられた属性に対応する属性値が記載されているかまたは当該属性値が記載されていない第１および第２のデータを、受け付ける受付手段と、前記属性ごとに、当該属性に対応する属性値が取り得る値の分布を表す分布情報を記憶する記憶手段と、前記属性値が前記第１および第２のデータの両方に記載されている属性については、当該属性に対応する属性値同士の類似度を計算し、前記属性値が少なくとも前記第１および第２のデータの一方に存在しない属性については、前記記憶手段内の分布情報を用いて、当該属性に対応する属性値同士の類似度の期待値を計算する第１計算手段と、前記第１計算手段にて計算された類似度および類似度の期待値に基づいて、前記第１および第２のデータの類似度を計算する第２計算手段と、を含む。 The similarity calculation device of the present invention includes: receiving means for receiving first and second data in which each region, associated one-to-one with each predetermined attribute, either describes or does not describe the attribute value corresponding to the attribute associated with that region; storage means for storing, for each attribute, distribution information representing the distribution of values that the attribute value corresponding to that attribute can take; first calculation means for calculating, for an attribute whose attribute value is described in both the first and second data, the similarity between the attribute values corresponding to that attribute, and for calculating, for an attribute whose attribute value is absent from at least one of the first and second data, the expected value of the similarity between the attribute values corresponding to that attribute using the distribution information in the storage means; and second calculation means for calculating the similarity between the first and second data based on the similarities and the expected values of similarity calculated by the first calculation means.

本発明の類似度計算方法は、類似度計算装置が行う類似度計算方法であって、予め定められた各属性に１対１で関連づけられた各領域に、当該領域に関連づけられた属性に対応する属性値が記載されているかまたは当該属性値が記載されていない第１および第２のデータを、受け付け、前記属性ごとに、当該属性に対応する属性値が取り得る値の分布を表す分布情報を記憶手段に記憶し、前記属性値が前記第１および第２のデータの両方に記載されている属性については、当該属性に対応する属性値同士の類似度を計算し、前記属性値が少なくとも前記第１および第２のデータの一方に存在しない属性については、前記記憶手段内の分布情報を用いて、当該属性に対応する属性値同士の類似度の期待値を計算し、前記計算された類似度および類似度の期待値に基づいて、前記第１および第２のデータの類似度を計算する。 The similarity calculation method of the present invention is a similarity calculation method performed by a similarity calculation device, and includes: receiving first and second data in which each region, associated one-to-one with each predetermined attribute, either describes or does not describe the attribute value corresponding to the attribute associated with that region; storing in storage means, for each attribute, distribution information representing the distribution of values that the attribute value corresponding to that attribute can take; calculating, for an attribute whose attribute value is described in both the first and second data, the similarity between the attribute values corresponding to that attribute; calculating, for an attribute whose attribute value is absent from at least one of the first and second data, the expected value of the similarity between the attribute values corresponding to that attribute using the distribution information in the storage means; and calculating the similarity between the first and second data based on the calculated similarities and expected values of similarity.

本発明のプログラムは、コンピュータに、予め定められた各属性に１対１で関連づけられた各領域に、当該領域に関連づけられた属性に対応する属性値が記載されているかまたは当該属性値が記載されていない第１および第２のデータを、受け付ける受付処理と、前記属性ごとに、当該属性に対応する属性値が取り得る値の分布を表す分布情報を記憶手段に記憶する記憶処理と、前記属性値が前記第１および第２のデータの両方に記載されている属性については、当該属性に対応する属性値同士の類似度を計算し、前記属性値が少なくとも前記第１および第２のデータの一方に存在しない属性については、前記記憶手段内の分布情報を用いて、当該属性に対応する属性値同士の類似度の期待値を計算する第１計算処理と、前記計算された類似度および類似度の期待値に基づいて、前記第１および第２のデータの類似度を計算する第２計算処理と、を実行させる。 The program of the present invention causes a computer to execute: a reception process of receiving first and second data in which each region, associated one-to-one with each predetermined attribute, either describes or does not describe the attribute value corresponding to the attribute associated with that region; a storage process of storing in storage means, for each attribute, distribution information representing the distribution of values that the attribute value corresponding to that attribute can take; a first calculation process of calculating, for an attribute whose attribute value is described in both the first and second data, the similarity between the attribute values corresponding to that attribute, and of calculating, for an attribute whose attribute value is absent from at least one of the first and second data, the expected value of the similarity between the attribute values corresponding to that attribute using the distribution information in the storage means; and a second calculation process of calculating the similarity between the first and second data based on the calculated similarities and expected values of similarity.

本発明によれば、欠損値を持つデータと他のデータとの類似度の計算において、類似度の計算結果の信頼性の低下を防止することが可能になる。 According to the present invention, in the calculation of the similarity between data having missing values and other data, it is possible to prevent a decrease in reliability of the calculation result of the similarity.

以下、本発明の実施の形態について図面を参照して説明する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.

（第１の実施の形態） (First embodiment)
図１は、本発明の第１の実施の形態の類似度計算装置１００を示したブロック図である。 FIG. 1 is a block diagram showing a similarity calculation apparatus 100 according to the first embodiment of this invention.

類似度計算装置１００は、入力処理部１１０と、データ分布記憶部１２０と、計算部１３０および１４０と、を含む。計算部１３０は、属性間類似度計算処理部１３０ａと、期待値計算処理部１３０ｂと、を含む。計算部１４０は、類似度統合処理部１４０ａと、出力処理部１４０ｂと、を含む。 Similarity calculation apparatus 100 includes an input processing unit 110, a data distribution storage unit 120, and calculation units 130 and 140. The calculation unit 130 includes an attribute similarity calculation processing unit 130a and an expected value calculation processing unit 130b. The calculation unit 140 includes a similarity integration processing unit 140a and an output processing unit 140b.

入力処理部１１０は、一般的に受付手段と呼ぶことができる。 Input processing unit 110 can generally be referred to as accepting means.

入力処理部１１０は、データ間の類似度を計算する対象となるデータ（以下「計算対象データ」と称する）、あるいは、データ分布記憶部１２０に記憶させるデータ分布を表す分布情報を入力する機能を有する。 The input processing unit 110 has a function of inputting either the data for which the similarity between data is to be calculated (hereinafter referred to as “calculation target data”) or distribution information representing a data distribution to be stored in the data distribution storage unit 120.

入力処理部１１０は、計算対象データとして、第１のデータおよび第２のデータを受け付ける。 The input processing unit 110 receives first data and second data as calculation target data.

第１のデータおよび第２のデータのそれぞれは、予め定められた各属性に１対１で関連づけられた各領域を有する。各領域には、その領域に関連づけられた属性に対応する属性値が記載されているか、または、その属性値が記載されていない。 Each of the first data and the second data has each area associated with each predetermined attribute on a one-to-one basis. In each area, an attribute value corresponding to an attribute associated with the area is described, or the attribute value is not described.

図２は、第１のデータの一例を示した説明図であり、図３は、第２のデータの一例を示した説明図である。図２および３では、属性として、車両の速度、エンジン回転数、および、シフトポジションが用いられている。なお、第１および第２のデータは、図２、３に示したものに限らず適宜変更可能である。 FIG. 2 is an explanatory diagram showing an example of the first data, and FIG. 3 is an explanatory diagram showing an example of the second data. In FIGS. 2 and 3, the vehicle speed, engine speed, and shift position are used as attributes. The first and second data are not limited to those shown in FIGS. 2 and 3 and can be changed as appropriate.

図２に示した第１のデータでは、属性である「速度」に関連づけられた領域２ａには、「速度」に対応する属性値である「ＡＡＡＡ」が記載されている。また、属性である「エンジン回転数」に関連づけられた領域２ｂには、「エンジン回転数」に対応する属性値である「ＸＸＸＸ」が記載されている。また、属性である「シフトポジション」に関連づけられた領域２ｃには、「シフトポジション」に対応する属性値、さらに言えば、なんらの値も記載されていない。 In the first data shown in FIG. 2, the area 2a associated with the attribute “speed” describes “AAAA” as the attribute value corresponding to “speed”. The area 2b associated with the attribute “engine speed” describes “XXXX” as the attribute value corresponding to “engine speed”. The area 2c associated with the attribute “shift position” describes no attribute value corresponding to “shift position”, and indeed no value at all.

図３に示した第２のデータでは、属性である「速度」に関連づけられた領域３ａには、「速度」に対応する属性値である「ＢＢＢＢ」が記載されている。また、属性である「エンジン回転数」に関連づけられた領域３ｂには、「エンジン回転数」に対応する属性値、さらに言えば、なんらの値も記載されていない。また、属性である「シフトポジション」に関連づけられた領域３ｃには、「シフトポジション」に対応する属性値、さらに言えば、なんらの値も記載されていない。 In the second data shown in FIG. 3, the area 3a associated with the attribute “speed” describes “BBBB” as the attribute value corresponding to “speed”. The area 3b associated with the attribute “engine speed” describes no attribute value corresponding to “engine speed”, and indeed no value at all. Likewise, the area 3c associated with the attribute “shift position” describes no attribute value corresponding to “shift position”, and indeed no value at all.
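
As an illustration of the data layout in FIGS. 2 and 3, the two records can be sketched in Python with `None` marking a region that holds no value. The attribute keys and numeric values below are hypothetical placeholders (the figures show only “AAAA”, “XXXX”, and “BBBB”), and `missing_attributes` is a helper name introduced here for illustration, not one taken from the specification.

```python
from typing import Dict, List, Optional, Union

# One record: attribute name -> value; None marks a region that holds no value.
Value = Union[float, str]
Record = Dict[str, Optional[Value]]

# Hypothetical stand-ins for the placeholder values shown in FIGS. 2 and 3.
first_data: Record = {"speed": 42.0, "engine_speed": 1800.0, "shift_position": None}
second_data: Record = {"speed": 40.0, "engine_speed": None, "shift_position": None}


def missing_attributes(a: Record, b: Record) -> List[str]:
    """Return the attributes whose value is missing in at least one record."""
    return [name for name in a if a[name] is None or b[name] is None]


print(missing_attributes(first_data, second_data))
```

With these records, only "speed" is present on both sides, which matches the situation described for FIGS. 2 and 3.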

また、入力処理部１１０は、第１および第２のデータが有する属性ごとに、その属性に対応する属性値が取り得る値の分布（例えば、確率分布または確率密度関数）を表す分布情報を受け付ける。 Further, the input processing unit 110 receives distribution information representing a distribution of values (for example, probability distribution or probability density function) that can be taken by the attribute value corresponding to the attribute for each attribute of the first and second data. .

入力処理部１１０は、キーボード等のように人間から直接データを受け付ける装置のみならず、外部システム等と接続されるインターフェースなどでもよい。 The input processing unit 110 is not limited to a device that directly receives data from a human such as a keyboard, but may be an interface connected to an external system or the like.

データ分布記憶部１２０は、一般的に記憶手段と呼ぶことができる。 Data distribution storage unit 120 can be generally referred to as storage means.

データ分布記憶部１２０は、入力処理部１１０から入力された属性ごとの分布情報を記憶しておく機能を備えている。ただし、分布情報は、入力処理部１１０から入力されず、予め記憶されていてもよい。 The data distribution storage unit 120 has a function of storing distribution information for each attribute input from the input processing unit 110. However, the distribution information may be stored in advance without being input from the input processing unit 110.

計算部１３０は、一般的に第１計算手段と呼ぶことができる。 Calculation unit 130 can be generally referred to as first calculation means.

計算部１３０は、属性値が第１および第２のデータの両方に存在する属性（例えば、図２および３での「速度」）については、その属性に対応する属性値同士の類似度を計算する。 For an attribute whose attribute value exists in both the first and second data (for example, “speed” in FIGS. 2 and 3), the calculation unit 130 calculates the similarity between the attribute values corresponding to that attribute.

また、計算部１３０は、属性値が少なくとも第１および第２のデータの一方に存在しない属性（例えば、図２および３での「エンジン回転数」および「シフトポジション」）については、データ分布記憶部１２０内の分布情報を用いて、その属性に対応する属性値同士の類似度の期待値を計算する。 For an attribute whose attribute value is absent from at least one of the first and second data (for example, “engine speed” and “shift position” in FIGS. 2 and 3), the calculation unit 130 uses the distribution information in the data distribution storage unit 120 to calculate the expected value of the similarity between the attribute values corresponding to that attribute.

例えば、計算部１３０は、属性ごとに、その属性に対応する属性値が、第１および第２のデータの両方に存在するか判定する。計算部１３０は、属性値が両方に存在する場合、その属性値同士の類似度を計算し、属性値が少なくとも第１および第２のデータの一方に存在しない場合、データ分布記憶部１２０内の分布情報を用いて、その属性に対応する属性値同士の類似度の期待値を計算する。 For example, for each attribute, the calculation unit 130 determines whether an attribute value corresponding to the attribute exists in both the first data and the second data. When the attribute value exists in both, the calculation unit 130 calculates the similarity between the attribute values. When the attribute value does not exist in at least one of the first and second data, the calculation unit 130 stores the attribute value in the data distribution storage unit 120. Using the distribution information, an expected value of similarity between attribute values corresponding to the attribute is calculated.

属性間類似度計算処理部１３０ａは、属性値が第１および第２のデータの両方に存在する属性について、その属性に対応する属性値同士の類似度を計算する。 For an attribute whose attribute value exists in both the first and second data, the attribute similarity calculation processing unit 130a calculates the similarity between the attribute values corresponding to that attribute.

期待値計算処理部１３０ｂは、属性値が少なくとも第１および第２のデータの一方に存在しない属性について、データ分布記憶部１２０内の分布情報を用いて、その属性に対応する属性値同士の類似度の期待値を計算する。 For an attribute whose attribute value is absent from at least one of the first and second data, the expected value calculation processing unit 130b uses the distribution information in the data distribution storage unit 120 to calculate the expected value of the similarity between the attribute values corresponding to that attribute.

計算部１４０は、一般的に第２計算手段と呼ぶことができる。 Calculation unit 140 can generally be referred to as second calculation means.

計算部１４０は、計算部１３０にて計算された類似度および類似度の期待値に基づいて、第１および第２のデータの類似度を計算する。 The calculation unit 140 calculates the similarity between the first and second data based on the similarity calculated by the calculation unit 130 and the expected value of the similarity.

例えば、計算部１４０は、計算部１３０にて計算された類似度および類似度の期待値を加算し、その加算結果を、第１および第２のデータの類似度とする。 For example, the calculation unit 140 adds the similarity calculated by the calculation unit 130 and the expected value of the similarity, and sets the addition result as the similarity between the first and second data.

類似度統合処理部１４０ａは、計算部１３０での属性ごとの計算結果を統合し、第１および第２のデータの類似度を計算する。 The similarity integration processing unit 140a integrates the calculation results for each attribute in the calculation unit 130, and calculates the similarity between the first and second data.

出力処理部１４０ｂは、類似度統合処理部１４０ａにて計算されたデータ間類似度を出力する機能を有する。出力処理部１４０ｂは、ディスプレイや音声による出力、あるいは外部システム等と接続されるインターフェースなどでもよい。 The output processing unit 140b has a function of outputting the inter-data similarity calculated by the similarity integration processing unit 140a. The output processing unit 140b may be a display, audio output, an interface connected to an external system, or the like.

次に、動作を説明する。 Next, the operation will be described.

図４は、類似度計算装置１００の動作を説明するためのフローチャートである。以下、図１および図４を参照して類似度計算装置１００の動作を説明する。 FIG. 4 is a flowchart for explaining the operation of the similarity calculation apparatus 100. Hereinafter, the operation of the similarity calculation apparatus 100 will be described with reference to FIGS. 1 and 4.

まず、入力処理部１１０は、類似度を計算する対象となる第１および第２のデータを受け付ける（ステップＡ１）。 First, the input processing unit 110 receives first and second data for which similarity is to be calculated (step A1).

以下説明のために、第１および第２のデータを、それぞれｘ₁、ｘ₂とベクトルで表記する。 Hereinafter, for the sake of explanation, the first and second data are written as the vectors x₁ and x₂, respectively.

ただし、ｘ₁およびｘ₂のｊ番目の属性は、同一の属性を表し、ｊ番目の属性に対応する属性値をｘ_1jおよびｘ_2jと表記する。 Here, the j-th attributes of x₁ and x₂ represent the same attribute, and the attribute values corresponding to the j-th attribute are written x_1j and x_2j.

なお、入力されたデータ内にｊ番目の属性に対応する属性値が存在しない場合、すなわち、ｊ番目の属性に関連する領域内に値が無い（例えば、null）場合には、その属性値は欠損値として扱われる。 If the input data contains no attribute value corresponding to the j-th attribute, that is, if the region associated with the j-th attribute holds no value (for example, null), that attribute value is treated as a missing value.

次に、計算部１３０は、ｘ₁およびｘ₂の各属性に関して、以下に示すステップＡ２からステップＡ６の処理を繰り返し行なう。以下ではｊ番目の属性に関する処理を説明することとする。 Next, for each attribute of x₁ and x₂, the calculation unit 130 repeats the processing of steps A2 to A6 described below. The processing for the j-th attribute is described here.

まず、属性間類似度計算処理部１３０ａは、ｘ_1jおよびｘ_2jの少なくともいずれかが欠損しているかを確認する（ステップＡ２）。 First, the attribute similarity calculation processing unit 130a checks whether at least one of x_1j and x_2j is missing (step A2).

ｘ_1jおよびｘ_2jの両方とも欠損していない場合には（ステップＡ３）、属性間類似度計算処理部１３０ａは、自己に記憶されている属性間類似度を計算するための計算式に基づいて、ｘ_1jとｘ_2jとの属性間類似度を計算する（ステップＡ４）。 When neither x_1j nor x_2j is missing (step A3), the attribute similarity calculation processing unit 130a calculates the similarity between attributes for x_1j and x_2j based on a calculation formula, stored in the unit itself, for calculating the similarity between attributes (step A4).

ここで、属性間類似度は、ｘ_1jとｘ_2jに関して計算される任意の関数で表され、ｊ番目の属性に関する属性間類似度をｇ_j（ｘ_1j，ｘ_2j）と表記する。 Here, the similarity between attributes is represented by an arbitrary function of x_1j and x_2j, and the similarity between attributes for the j-th attribute is written g_j(x_1j, x_2j).

属性間類似度は、データが類似しているほど大きい値をとってもよいし、データが類似しているほど小さい値をとってもよいこととする。 The similarity between attributes may be defined so that it takes a larger value the more similar the data are, or so that it takes a smaller value the more similar the data are.

属性間類似度としては、例えば、ｊ番目の属性に対応する属性値が連続値をとる場合には、２乗距離（ｘ_1j−ｘ_2j）²や絶対値距離｜ｘ_1j−ｘ_2j｜などを利用可能であるし、ｊ番目の属性に対応する属性値が離散値をとる場合には、ハミング距離（ｘ_1jとｘ_2jが一致する場合には０、一致しない場合には１）などを利用することが可能である。 As the similarity between attributes, for example, when the attribute value corresponding to the j-th attribute takes continuous values, the squared distance (x_1j − x_2j)² or the absolute distance |x_1j − x_2j| can be used; when the attribute value takes discrete values, the Hamming distance (0 if x_1j and x_2j match, 1 otherwise) can be used.
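
The three example measures named above can be sketched as plain Python functions. The function names are chosen here for illustration; the specification names the measures but gives no code. All three are distances, so smaller values mean more similar attribute values.

```python
def squared_distance(a: float, b: float) -> float:
    """Squared distance (a - b)^2 for continuous attribute values."""
    return (a - b) ** 2


def absolute_distance(a: float, b: float) -> float:
    """Absolute distance |a - b| for continuous attribute values."""
    return abs(a - b)


def hamming_distance(a: object, b: object) -> int:
    """Hamming distance for discrete values: 0 if they match, 1 otherwise."""
    return 0 if a == b else 1
```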

ｘ_1jあるいはｘ_2jが欠損している場合には（ステップＡ３）、属性間類似度計算処理部１３０ａは、ｇ_j（ｘ_1j，ｘ_2j）を直接計算することができない。このため、属性間類似度計算処理部１３０ａは、期待値計算処理部１３０ｂを動作させる。期待値計算処理部１３０ｂは、データ分布記憶部１２０からｊ番目の属性に関する分布情報を読み込み、分布情報が表す確率分布（あるいは確率密度関数）ｐ_jを用いて、属性間類似度の期待値を計算する。期待値計算処理部１３０ｂは、その計算結果を属性間類似度計算処理部１３０ａに提供する（ステップＡ５)。 When x_1j or x_2j is missing (step A3), the attribute similarity calculation processing unit 130a cannot calculate g_j(x_1j, x_2j) directly. It therefore activates the expected value calculation processing unit 130b. The expected value calculation processing unit 130b reads the distribution information for the j-th attribute from the data distribution storage unit 120 and calculates the expected value of the similarity between attributes using the probability distribution (or probability density function) p_j represented by the distribution information. The expected value calculation processing unit 130b then provides the result to the attribute similarity calculation processing unit 130a (step A5).

このため、属性間類似度計算処理部１３０ａは、属性ごとに、属性間類似度、または、属性間類似度の期待値を有することになる。 For this reason, the attribute similarity calculation processing unit 130a has an attribute similarity or an expected value of the attribute similarity for each attribute.

ここで、ｐ_jとしては、ｊ番目の属性に対応する属性値が離散値やシンボル値（変数の値を、整数などの実際の値ではなく、記号などのシンボルとして表現した値）をとる場合には、例えば、多項分布などの離散値の分布を利用することが可能であり、ｊ番目の属性が実数値をとる場合には、正規分布などの連続値の分布を利用することが可能である。 Here, as p_j, when the attribute value corresponding to the j-th attribute takes discrete values or symbol values (values in which a variable's value is expressed as a symbol, such as a mark, rather than as an actual value such as an integer), a discrete distribution such as a multinomial distribution can be used, for example; when the j-th attribute takes real values, a continuous distribution such as a normal distribution can be used.

以下、期待値の計算の具体的な手順を説明する。 Hereinafter, a specific procedure for calculating the expected value will be described.

まず、ｘ_1jとｘ_2jのうち、片側のみが欠損している場合には、欠損していない属性値を活用するために、期待値計算処理部１３０ｂは、（１）式に従って、属性間類似度の期待値を計算する。 First, when only one of x_1j and x_2j is missing, the expected value calculation processing unit 130b calculates the expected value of the similarity between attributes according to equation (1), so as to make use of the attribute value that is not missing.

なお、（１）式は、ｘ_2jのみが欠損している場合に用いる計算式であるが、ｘ_1jのみが欠損している場合には、期待値計算処理部１３０ｂは、（１）式のｘ_1jをｘ_2jに入れ替えることによって、同様の手法で属性間類似度の期待値を計算可能である。 Note that equation (1) is the formula used when only x_2j is missing; when only x_1j is missing, the expected value calculation processing unit 130b can calculate the expected value of the similarity between attributes in the same way by replacing x_1j in equation (1) with x_2j.

なお、（１）式において、ｘとしては、ｊ番目の属性に対応する属性値が取り得るすべての値が用いられる。 In Equation (1), as x, all values that can be taken by the attribute value corresponding to the j-th attribute are used.
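
Equation (1) itself does not appear in this text. A form consistent with the surrounding description (an expectation of g_j over the distribution p_j of the missing value x_2j, with x ranging over every value the j-th attribute can take) would be the following. This is a reconstruction under that assumption, not a copy of the published formula.

```latex
% Assumed form of equation (1): only x_{2j} is missing.
\mathbb{E}\!\left[ g_j(x_{1j}, x) \right]
  = \int g_j(x_{1j}, x)\, p_j(x)\, dx
```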

次に、ｘ_1jとｘ_2jの両方が欠損している場合には、期待値計算処理部１３０ｂは、データの持つ属性値を利用できないため、期待値計算処理部１３０ｂは、（２）式に従って、属性間類似度の期待値を計算する。 Next, when both x_1j and x_2j are missing, the expected value calculation processing unit 130b cannot use any attribute value held by the data, and therefore calculates the expected value of the similarity between attributes according to equation (2).

なお、（２）式において、ｘおよびｙとしては、ｊ番目の属性に対応する属性値が取り得るすべての値が用いられる。 In Expression (2), as x and y, all values that can be taken by the attribute value corresponding to the j-th attribute are used.
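
Equation (2) is likewise absent from this text. Assuming the two missing values are treated as independent draws from the same distribution p_j (an assumption suggested by, but not stated in, the description of x and y above), a consistent form would be:

```latex
% Assumed form of equation (2): both x_{1j} and x_{2j} are missing.
\mathbb{E}\!\left[ g_j(x, y) \right]
  = \iint g_j(x, y)\, p_j(x)\, p_j(y)\, dx\, dy
```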

（１）式および（２）式は、ｊ番目の属性に対応する属性値が連続値をとる場合も含めるため、期待値の計算手法として積分が用いられているが、ｊ番目の属性に対応する属性値が離散値をとる場合には、期待値計算処理部１３０ｂは、積分を和に置き換えて、属性間類似度の期待値を計算してもよい。 Equations (1) and (2) use integration to calculate the expected value so as to cover the case where the attribute value corresponding to the j-th attribute takes continuous values; when the attribute value takes discrete values, the expected value calculation processing unit 130b may replace the integral with a sum to calculate the expected value of the similarity between attributes.
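
For the discrete case, replacing the integral with a sum can be sketched as follows. The function names, the dictionary representation of p_j, and the example probabilities are all assumptions made for illustration, and the both-missing case assumes the two missing values are independent draws from p_j.

```python
import math
from typing import Callable, Dict, Hashable

Distribution = Dict[Hashable, float]  # possible value -> probability
Measure = Callable[[Hashable, Hashable], float]


def expected_one_missing(g: Measure, observed: Hashable, p: Distribution) -> float:
    """Sum over x of g(observed, x) * p(x): one attribute value is missing."""
    return sum(prob * g(observed, x) for x, prob in p.items())


def expected_both_missing(g: Measure, p: Distribution) -> float:
    """Sum over x and y of g(x, y) * p(x) * p(y): both values are missing."""
    return sum(px * py * g(x, y) for x, px in p.items() for y, py in p.items())


def hamming(a: Hashable, b: Hashable) -> float:
    return 0.0 if a == b else 1.0


# Hypothetical distribution over shift positions.
p_shift: Distribution = {"P": 0.25, "R": 0.25, "D": 0.5}

one = expected_one_missing(hamming, "D", p_shift)   # 0.25 + 0.25
both = expected_both_missing(hamming, p_shift)      # 1 - (0.0625 + 0.0625 + 0.25)
assert math.isclose(one, 0.5) and math.isclose(both, 0.625)
```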

なお、（１）式に従った期待値の計算は、ｊ番目の属性に対応する属性値が有限数の離散値をとる場合には、事前に計算しておくことが可能であり、期待値計算処理部１３０ｂまたはデータ分布記憶部１２０は、その値を予め記憶しておいてもよい。 Note that the expected value according to the equation (1) can be calculated in advance when the attribute value corresponding to the jth attribute takes a finite number of discrete values. The calculation processing unit 130b or the data distribution storage unit 120 may store the value in advance.

（２）式に従った期待値計算は、ｊ番目の属性に対応する属性値の種類に関わらず事前に計算可能であるため、期待値計算処理部１３０ｂまたはデータ分布記憶部１２０は、その値を予め記憶しておいてもよい。 Since the expected value calculation according to the equation (2) can be calculated in advance regardless of the type of the attribute value corresponding to the jth attribute, the expected value calculation processing unit 130b or the data distribution storage unit 120 May be stored in advance.
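
As a worked example of why the both-missing expectation can be stored in advance, assume the squared distance is used as g_j and p_j has mean μ_j and variance σ_j² (a normal distribution is one such case). These choices, the symbols μ_j and σ_j², and the independence of X and Y are assumptions made here. The one-missing expectation then still depends on the observed value, while the both-missing expectation reduces to a constant:

```latex
% One value observed, the other X ~ p_j:
\mathbb{E}\!\left[ (x_{1j} - X)^2 \right] = (x_{1j} - \mu_j)^2 + \sigma_j^2

% Both values missing, X and Y independent draws from p_j:
\mathbb{E}\!\left[ (X - Y)^2 \right] = 2\,\sigma_j^2
```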

期待値の計算に関して、ｊ番目の属性に対応する属性値が離散値をとり、ｇ_j（ｘ_1j，ｘ_2j）としてハミング距離を利用し、ｐ_jとして多項分布を利用する場合について説明すると、（１）式として（３）式を用いることが可能であり、（２）式として（４）式を用いることが可能である。 Regarding the calculation of the expected value, consider the case where the attribute value corresponding to the j-th attribute takes discrete values, the Hamming distance is used as g_j(x_1j, x_2j), and a multinomial distribution is used as p_j: equation (3) can then be used as equation (1), and equation (4) as equation (2).

ただし、（３）式において、ｘ_1jはｊ番目の属性に対応する属性値が取り得る離散値のうちのｋ番目の離散値をとるとし、ｐ^k _jはｊ番目の属性に対応する属性値としてｋ番目の離散値が出現する確率を表すとする。 Here, in equation (3), x_1j is assumed to take the k-th of the discrete values that the attribute value corresponding to the j-th attribute can take, and p_j^k denotes the probability that the k-th discrete value appears as the attribute value corresponding to the j-th attribute.
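
Equations (3) and (4) are not reproduced in this text. Under the stated choices (Hamming distance, discrete distribution with probabilities p_j^k), summing the distance over all possible values gives the following forms. They are derived here from the definitions above and should be read as a reconstruction, with independence of the two missing values assumed for the second form.

```latex
% Assumed form of equation (3): x_{1j} is the k-th value, x_{2j} is missing.
\sum_{l} p_j^{l}\,[\,l \neq k\,] = 1 - p_j^{k}

% Assumed form of equation (4): both values are missing.
\sum_{k}\sum_{l} p_j^{k}\, p_j^{l}\,[\,l \neq k\,] = 1 - \sum_{k} \left( p_j^{k} \right)^2
```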

属性間類似度計算処理部１３０ａは、ｊ番目の属性について、属性間類似度、または、属性間類似度の期待値を得ると、残りの属性があるか確認し、残りの属性がある場合には、計算の対象となる属性を更新して、処理をステップＡ２に戻し、残りの属性がなければ、処理をステップＡ７に進める（ステップＡ６）。 On obtaining the similarity between attributes, or its expected value, for the j-th attribute, the attribute similarity calculation processing unit 130a checks whether any attributes remain. If attributes remain, it updates the attribute to be calculated and returns the process to step A2; if none remain, it advances the process to step A7 (step A6).

ステップＡ７では、類似度統合処理部１４０ａは、ステップＡ２からステップＡ６で計算された各属性に関する属性間類似度あるいはその期待値を統合することによって、第１および第２のデータ間の類似度を計算する。 In step A7, the similarity integration processing unit 140a integrates the similarity between attributes or the expected value for each attribute calculated in steps A2 to A6, thereby calculating the similarity between the first and second data. calculate.

本実施形態では、類似度統合処理部１４０ａは、ステップＡ２からステップＡ６で計算された、各属性に関する属性間類似度あるいはその期待値をｓ_jとすると、（５）式に従って、第１および第２のデータ間の類似度を、ｓ_jに関する任意の関数として定義することが可能である。 In this embodiment, letting s_j denote the similarity between attributes, or its expected value, calculated for each attribute in steps A2 to A6, the similarity integration processing unit 140a can define the similarity between the first and second data as an arbitrary function of s_j according to equation (5).

ただし、ｆは任意の関数、ｄはｘ₁およびｘ₂が有する属性の数である。 Here, f is an arbitrary function, and d is the number of attributes that x₁ and x₂ have.

ｆの具体的な例としては、各属性の属性間類似度の和である（６）式が挙げられる。 As a specific example of f, Expression (6), which is the sum of similarity between attributes of each attribute, can be given.

また、ｓ_jに関する任意の非線形関数をｆとして利用することが可能である。 Any nonlinear function of s_j can also be used as f.
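
Equations (5) and (6) do not appear in this text. From the description (an arbitrary function f of the per-attribute values s_1, ..., s_d, with the sum as a concrete example), consistent forms would be as follows. The symbol S for the similarity between data is introduced here for illustration.

```latex
% Assumed form of equation (5): integration by an arbitrary function f.
S(x_1, x_2) = f(s_1, s_2, \ldots, s_d)

% Assumed form of equation (6): f as the sum over attributes.
f(s_1, s_2, \ldots, s_d) = \sum_{j=1}^{d} s_j
```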

出力処理部１４０ｂは、ステップＡ７で計算されたデータ間類似度を出力する（ステップＡ８）。 The output processing unit 140b outputs the similarity between data calculated in step A7 (step A8).
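
Steps A1 to A8 can be sketched end to end as follows. This is a minimal sketch, assuming every attribute is discrete, the Hamming distance is the per-attribute measure, the integration function f is the sum, and two missing values are independent draws from the same distribution. The function name `record_similarity`, the attribute names, and the probabilities are hypothetical.

```python
import math
from typing import Callable, Dict, Hashable, Iterable, Optional

Record = Dict[str, Optional[Hashable]]
Distribution = Dict[Hashable, float]
Measure = Callable[[Hashable, Hashable], float]


def hamming(a: Hashable, b: Hashable) -> float:
    return 0.0 if a == b else 1.0


def record_similarity(
    x1: Record,
    x2: Record,
    g: Dict[str, Measure],
    p: Dict[str, Distribution],
    f: Callable[[Iterable[float]], float] = sum,
) -> float:
    """Per-attribute loop (steps A2 to A6) followed by integration (step A7)."""
    scores = []
    for name in x1:
        a, b, gj, pj = x1[name], x2[name], g[name], p[name]
        if a is not None and b is not None:
            s = gj(a, b)  # both present: direct calculation (step A4)
        elif a is None and b is None:
            # both missing: expectation over two independent draws (step A5)
            s = sum(px * py * gj(x, y) for x, px in pj.items() for y, py in pj.items())
        else:
            # one missing: expectation over the missing value only (step A5)
            observed = a if a is not None else b
            s = sum(px * gj(observed, x) for x, px in pj.items())
        scores.append(s)
    return f(scores)


# Hypothetical discretized attributes and distributions.
p = {
    "speed_band": {"low": 0.5, "high": 0.5},
    "engine_band": {"low": 0.5, "high": 0.5},
    "shift_position": {"P": 0.25, "R": 0.25, "D": 0.5},
}
g = {name: hamming for name in p}
x1: Record = {"speed_band": "low", "engine_band": "high", "shift_position": None}
x2: Record = {"speed_band": "high", "engine_band": None, "shift_position": None}

result = record_similarity(x1, x2, g, p)  # 1.0 + 0.5 + 0.625
assert math.isclose(result, 2.125)
print(result)
```

Passing a different callable as `f` corresponds to choosing another integration function; because the measures here are distances, a smaller result means more similar records.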

次に、本実施形態の効果を説明する。 Next, the effect of this embodiment will be described.

本実施形態によれば、計算部１３０は、属性値が第１および第２のデータの両方に存在する属性については、その属性に対応する属性値同士の類似度を計算し、属性値が少なくとも第１および第２のデータの一方に存在しない属性については、データ分布記憶部１２０内の分布情報を用いて、その属性に対応する属性値同士の類似度の期待値を計算する。 According to the present embodiment, for an attribute whose attribute value exists in both the first and second data, the calculation unit 130 calculates the similarity between attribute values corresponding to the attribute, and the attribute value is at least For an attribute that does not exist in one of the first and second data, the distribution information in the data distribution storage unit 120 is used to calculate the expected value of the similarity between the attribute values corresponding to the attribute.

このため、類似度が計算される２つのデータの一方あるいは両方に欠損値がある場合であっても、その欠損している属性値の取りうる値の分布を利用して、その属性に関する属性間類似度の期待値を計算し、その期待値を利用して、第１および第２のデータの類似度が計算される。 Therefore, even when one or both of the two data items whose similarity is calculated have a missing value, the expected value of the similarity between attributes for that attribute is calculated using the distribution of values that the missing attribute value can take, and the similarity between the first and second data is calculated using that expected value.

よって、類似度が計算される２つのデータの一方あるいは両方に欠損値がある場合であっても、信頼性の高いデータ間類似度を計算することが可能となる。 Therefore, even when one or both of the two data whose similarity is calculated has a missing value, it is possible to calculate the similarity between data with high reliability.

また、本実施形態では、計算部１４０は、計算部１３０にて計算された類似度および類似度の期待値を加算し、その加算結果を、第１および第２のデータの類似度とする。 In the present embodiment, the calculation unit 140 adds the similarity calculated by the calculation unit 130 and the expected value of the similarity, and uses the addition result as the similarity between the first and second data.

この場合、第１および第２のデータの類似度を容易に計算することができる。 In this case, the similarity between the first and second data can be easily calculated.

（第２の実施の形態） (Second Embodiment)
次に、本発明の第２の実施の形態の類似度計算装置２００を説明する。 Next, a similarity calculation apparatus 200 according to the second embodiment of this invention will be described.

FIG. 5 is a block diagram showing the similarity calculation apparatus 200. In FIG. 5, components identical to those shown in FIG. 1 are denoted by the same reference numerals.

In addition to the configuration of the similarity calculation apparatus 100 shown in FIG. 1, the similarity calculation apparatus 200 includes an input processing unit 210 that inputs data for learning the distribution and a data distribution learning processing unit 220 that learns the distribution of the data.

The input processing unit 210 can generally be referred to as learning data receiving means.

The input processing unit 210 receives a plurality of learning data in which, in each region associated one-to-one with an attribute, the attribute value corresponding to that attribute is either described or not described.

The data distribution learning processing unit 220 can generally be referred to as processing means.

The data distribution learning processing unit 220 generates distribution information for each attribute based on the plurality of learning data received by the input processing unit 210 and stores the distribution information in the data distribution storage unit 120.

Because the similarity calculation apparatus 200 includes the input processing unit 210 and the data distribution learning processing unit 220, distribution information need not necessarily be stored in advance. When the distribution of the data can be calculated beforehand, data for learning the distribution (learning data) can be input, and the distribution can be calculated and learned from that data and then stored.

The operation of the similarity calculation apparatus 200 is substantially the same as that of the similarity calculation apparatus 100 shown in FIG. 4. Only step A5, in which the expected value calculation processing unit 130b calculates the expected value of the inter-attribute similarity for a single attribute, differs.

When calculating the expected value of the inter-attribute similarity for a single attribute, the similarity calculation apparatus 100 uses distribution information stored in advance, whereas the similarity calculation apparatus 200 can use distribution information stored through the operation shown in FIG. 6.

Referring to FIG. 6, the input processing unit 210 first receives a plurality of learning data (step B1).

The data distribution learning processing unit 220 calculates distribution information for each attribute based on the plurality of learning data. The calculated distribution is, for example, a multinomial distribution over discrete values. For attributes whose distribution cannot be calculated, such as those with continuous values, the distribution information can instead be stored in the data distribution storage unit 120 in advance, as in the similarity calculation apparatus 100 (step B2).

The processing of step B2 is performed for all attributes.

The data distribution learning processing unit 220 stores the distribution information calculated for all attributes in the data distribution storage unit 120 (step B3).
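Steps B1 to B3 can be sketched for discrete attributes as follows. This is an illustrative assumption rather than the processing of the data distribution learning processing unit 220 itself: the function name, the record layout, and the use of relative frequencies as the estimate of the multinomial parameters are introduced here, and missing values (`None` or an absent key) are simply skipped when counting.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence

Record = Dict[str, Optional[Hashable]]  # None marks a missing value (assumed)


def learn_distributions(
    records: Sequence[Record], attributes: List[str]
) -> Dict[str, Dict[Hashable, float]]:
    """Estimate one discrete distribution per attribute from learning data."""
    store: Dict[str, Dict[Hashable, float]] = {}
    for attr in attributes:  # step B2 is repeated for every attribute
        # Count only the values that are actually described.
        counts = Counter(
            r[attr] for r in records if r.get(attr) is not None
        )
        n = sum(counts.values())
        # Relative frequencies; empty if no value was ever observed.
        store[attr] = {v: c / n for v, c in counts.items()} if n else {}
    return store  # step B3: the caller keeps this as the stored distributions


learning_data = [  # step B1: learning data, some with missing values
    {"color": "red", "size": "S"},
    {"color": "red", "size": None},
    {"color": "blue", "size": "S"},
    {"color": "red"},
]
print(learn_distributions(learning_data, ["color", "size"]))
```

For the four records shown, `color` is estimated as 0.75 for `"red"` and 0.25 for `"blue"`, and `size` as 1.0 for `"S"`. An attribute with continuous values would need distribution information supplied in advance, as the text notes.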

In the present embodiment, the data distribution learning processing unit 220 generates distribution information for each attribute based on the plurality of learning data received by the input processing unit 210 and stores the distribution information in the data distribution storage unit 120.

Consequently, distribution information need not necessarily be stored in advance. When the distribution information can be calculated beforehand, it can be calculated and learned from the learning data and then stored.

Each of the above embodiments is applicable to a variety of uses, such as calculating the similarity between data having a plurality of attribute values collected from multiple sensors of an automobile, market prediction and finance-related prediction systems that use various input attributes, and calculating the state similarity of various computer-related equipment.

The similarity calculation apparatuses 100 and 200 need not be realized by dedicated hardware. Alternatively, a program for realizing the functions of the input processing unit 110, the data distribution storage unit 120, the calculation unit 130, the calculation unit 140, the input processing unit 210, and the data distribution learning processing unit 220 may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into and executed by a computer system.

A computer-readable recording medium refers to a recording medium such as a flexible disk, a magneto-optical disk, or a CD-ROM (Compact Disk Read Only Memory), or to a storage device such as a hard disk device built into a computer system.

Furthermore, the computer-readable recording medium also includes a medium that holds the program dynamically for a short time (a transmission medium or transmission wave), as when the program is transmitted via the Internet, and a medium that holds the program for a certain period of time, such as volatile memory inside the computer system that serves as the server in that case.

In each of the embodiments described above, the illustrated configuration is merely an example, and the present invention is not limited to that configuration.

FIG. 1 is a block diagram showing the similarity calculation apparatus 100 according to the first embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an example of the first data.
FIG. 3 is an explanatory diagram showing an example of the second data.
FIG. 4 is a flowchart for explaining the operation of the similarity calculation apparatus 100.
FIG. 5 is a block diagram showing the similarity calculation apparatus 200.
FIG. 6 is a flowchart for explaining part of the operation of the similarity calculation apparatus 200.

Explanation of symbols

100 Similarity calculation apparatus
110 Input processing unit
120 Data distribution storage unit
130 Calculation unit
130a Inter-attribute similarity calculation processing unit
130b Expected value calculation processing unit
140 Calculation unit
140a Similarity integration processing unit
140b Output processing unit
200 Similarity calculation apparatus
210 Input processing unit
220 Data distribution learning processing unit

Claims

A similarity calculation apparatus comprising:
receiving means for receiving first and second data in which, in each region associated one-to-one with a predetermined attribute, an attribute value corresponding to the attribute associated with that region is either described or not described;
storage means for storing, for each attribute, distribution information representing a distribution of values that the attribute value corresponding to that attribute can take;
first calculation means for calculating, for an attribute whose attribute value is described in both the first and second data, a similarity between the attribute values corresponding to that attribute and, for an attribute whose attribute value is absent from at least one of the first and second data, an expected value of the similarity between the attribute values corresponding to that attribute by using the distribution information in the storage means; and
second calculation means for calculating a similarity between the first and second data based on the similarity and the expected value of the similarity calculated by the first calculation means.

The similarity calculation apparatus according to claim 1, further comprising:
learning data receiving means for receiving a plurality of learning data in which, in each region, an attribute value corresponding to the attribute associated with that region is either described or not described; and
processing means for generating the distribution information for each attribute based on the plurality of learning data and storing the distribution information in the storage means.

The similarity calculation apparatus according to claim 1 or 2,
wherein the second calculation means adds the similarity and the expected value of the similarity calculated by the first calculation means and uses the sum as the similarity between the first and second data.

A similarity calculation method performed by a similarity calculation apparatus, the method comprising:
receiving first and second data in which, in each region associated one-to-one with a predetermined attribute, an attribute value corresponding to the attribute associated with that region is either described or not described;
storing in storage means, for each attribute, distribution information representing a distribution of values that the attribute value corresponding to that attribute can take;
calculating, for an attribute whose attribute value is described in both the first and second data, a similarity between the attribute values corresponding to that attribute and, for an attribute whose attribute value is absent from at least one of the first and second data, an expected value of the similarity between the attribute values corresponding to that attribute by using the distribution information in the storage means; and
calculating a similarity between the first and second data based on the calculated similarity and the calculated expected value of the similarity.

The similarity calculation method according to claim 4, further comprising:
receiving a plurality of learning data in which, in each region, an attribute value corresponding to the attribute associated with that region is either described or not described; and
generating the distribution information for each attribute based on the plurality of learning data,
wherein, when the distribution information is stored, the generated distribution information is stored in the storage means.

The similarity calculation method according to claim 4 or 5,
wherein, when the similarity between the first and second data is calculated, the calculated similarity and the expected value of the similarity are added and the sum is used as the similarity between the first and second data.

A program for causing a computer to execute:
a reception process of receiving first and second data in which, in each region associated one-to-one with a predetermined attribute, an attribute value corresponding to the attribute associated with that region is either described or not described;
a storage process of storing in storage means, for each attribute, distribution information representing a distribution of values that the attribute value corresponding to that attribute can take;
a first calculation process of calculating, for an attribute whose attribute value is described in both the first and second data, a similarity between the attribute values corresponding to that attribute and, for an attribute whose attribute value is absent from at least one of the first and second data, an expected value of the similarity between the attribute values corresponding to that attribute by using the distribution information in the storage means; and
a second calculation process of calculating a similarity between the first and second data based on the calculated similarity and the calculated expected value of the similarity.

The program according to claim 7, further causing the computer to execute:
a learning data reception process of receiving a plurality of learning data in which, in each region, an attribute value corresponding to the attribute associated with that region is either described or not described; and
a generation process of generating the distribution information for each attribute based on the plurality of learning data,
wherein, in the storage process, the generated distribution information is stored in the storage means.

The program according to claim 7 or 8,
wherein, when the similarity between the first and second data is calculated, the calculated similarity and the expected value of the similarity are added and the sum is used as the similarity between the first and second data.