WO2017064769A1 - Information processing system and computer program - Google Patents
Information processing system and computer program
- Publication number
- WO2017064769A1 (PCT/JP2015/079048)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- field
- rank
- axis
- data
- value
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
Definitions
- the present invention relates to a technique for calculating a distance between data.
- as a technique for calculating the distance between data, a technique is known that holds the correspondence between case data and classifications and classifies data based on the distance between the data to be classified and the case data and on that correspondence (for example, Patent Document 1).
- the distance between the data composed of a plurality of items and the case data is obtained as a weighted sum of the differences between the values of the respective items.
- when this technique is applied to a table, that is, a set of records each composed of a plurality of fields, it amounts to a technique for calculating the distance between individual records. Therefore, even if it is applied as-is to the calculation of the distance between records, it cannot calculate a distance between tables that reflects the characteristics of a table as a whole, which appear across a plurality of records.
- an object of the present invention is to provide an information processing system capable of calculating a distance between sets according to characteristics of the set of records including a plurality of fields as a whole set.
- the present invention provides an information processing system for calculating a distance between a first set of records, each composed of a plurality of fields, and a second set of such records, the system comprising: a rank setting unit that sets a rank for each field of the records; an evaluation value calculation unit that, following the ranks set by the rank setting unit, calculates the mutual information between the first set and the second set for the first-rank field as the evaluation value of that field, and calculates, for the field of each rank other than the first, the conditional mutual information between the first set and the second set given the higher-ranked fields as the evaluation value of that field; and a distance calculation unit that calculates the distance between the first set and the second set using at least part of the evaluation values calculated for the fields.
- the rank setting unit may set the ranks of the fields by first assigning the first rank to the field with the maximum entropy in the first set and then, until every field has a rank, repeatedly assigning the next rank to the field whose conditional entropy in the first set, given the fields already ranked, is the largest.
- in this information processing system, the mutual information and the conditional mutual information of each field of the first set and the second set are obtained as an independent evaluation value for each field, and the distance between the first set and the second set is calculated using the obtained evaluation values.
- because the mutual information and the conditional mutual information of each field reflect characteristics of that field that span the records, the distance between sets of records composed of a plurality of fields can be calculated according to the characteristics of each set as a whole.
- because the evaluation values of the fields, obtained as the mutual information and the conditional mutual information of each field, are independent of one another, the distance between the first set and the second set can easily be evaluated from various viewpoints.
- the mutual information between the two tables is equal to the sum of the mutual information and the conditional mutual information of the individual items, so if the per-item values are left unweighted, processing based on the overall mutual information can be performed without excess or deficiency.
- according to the present invention, it is possible to provide an information processing system capable of calculating the distance between sets of records composed of a plurality of fields according to the characteristics of each set as a whole.
- FIG. 1 shows a configuration of an information processing system according to the present embodiment.
- the information processing system includes a processor 1 and a storage 2.
- the processor 1 includes an axis determination unit 11 and a determination processing unit 12.
- the processor 1 is a general-purpose computer including, for example, a CPU, a memory, and various peripheral devices such as a display device and an input device.
- the axis determination unit 11 and the determination processing unit 12 are implemented as functions of the computer, realized by the computer executing a predetermined program.
- the storage 2 stores reference data, axis information, determination result data, and determination target data.
- the storage 2 may be an external storage connected to the processor 1, or a network storage or a data server that the processor 1 accesses via a network.
- the reference data is a table that is a set of records (each row from 1 to 6 in the figure) composed of a plurality of fields (each column of A, B, and C in the figure).
- the data to be determined is also a table that is a set of records (each row in the figure) including a plurality of fields, and the field configuration of the data to be determined is equal to the field configuration of the reference data.
- the number of records of the determination target data is equal to the number of records of the reference data.
- records having the same order (order) of the reference data and the data to be judged store information on the same target in each field. That is, the nth record of the reference data and the nth record of the data to be judged store information about the nth object in each field.
- the m-th attribute value of the n-th object is stored in the m-th field of the n-th record of both the reference data and the determination target data. More specifically, for example, given N sensors that each detect M states of a different object, the target value of the m-th state detected by the n-th sensor is stored in the m-th field of the n-th record of the reference data, and the value of the m-th state actually detected by the n-th sensor is stored in the m-th field of the n-th record of the determination target data.
- the axis information includes axis number information and axis definition information.
- the number of fields of the reference data G is registered in the axis number information.
- in the axis definition information, a field identifier is registered for each axis from the first axis to the n-th axis, where n is the number of fields of the reference data G, indicating the field to be used as that axis.
- the determination result data includes an entry for the reference data G, an entry for the determination target data T, and an entry for the mutual information amount I.
- in the entry for the reference data G, the entropy HG() of each axis from the first axis to the n-th axis of the reference data G is registered, and in the entry for the determination target data T, the entropy HT() of each axis from the first axis to the n-th axis of the determination target data T is registered.
- in the entry for the mutual information amount I, the mutual information amount I() between the reference data G and the determination target data T is registered for each axis from the first axis to the n-th axis.
- the axis determination unit 11 of the processor 1 performs an axis determination process in which it creates the axis information using the reference data G, calculates the entropy HG of each axis from the first axis to the n-th axis of the reference data G, and registers the result in the determination result data.
- the determination processing unit 12 of the processor 1 performs a data determination process in which it uses the determination target data T, the reference data G, and the axis information to calculate the entropy HT of each axis from the first axis to the n-th axis of the determination target data T and the mutual information amount I() between the reference data G and the determination target data T for each of those axes, registers them in the determination result data, and evaluates the distance between the reference data G and the determination target data T based on the calculated mutual information amount I().
- FIG. 4 shows the procedure of the determination result data creation process.
- the axis determination unit 11 first registers the number of fields of the reference data G in the axis number information (step 400). Then, the entropy H (X) of each field of the reference data G is calculated (step 402).
- H (X) represents the entropy of field X.
- the field having the maximum entropy calculated for the reference data G is set as the first axis, and the field identifier of the field is registered as the field identifier of the field used as the first axis in the axis definition information (step 404).
- the entropy calculated in step 402 for the field set as the first axis is registered as the first-axis entropy HG(1) of the reference data G in the entry for the reference data G of the determination result data (step 406).
- the following processing is then performed in turn for each value of i from 2 to n (steps 408, 416, and 418). First, for each field X other than the fields already set as the first through (i-1)-th axes, the conditional entropy $H_{F_1,F_2,\ldots,F_{i-1}}(X)$ of X given the fields of the first through (i-1)-th axes is calculated for the reference data G (step 410).
- here, $F_j$ denotes the field set as the j-th axis (where $j < i$), and $H_{F_1,F_2,\ldots,F_{i-1}}(X)$ denotes the conditional entropy of field X given the fields $F_1, F_2, \ldots, F_{i-1}$.
- let $F_{s,a}$ be the a-th value among the values appearing in the field $F_s$ set as the s-th axis of the reference data G, let $p(F_{1,a}, F_{2,b}, \ldots, F_{i-1,y})$ be the occurrence probability in the reference data G of the value tuple $(F_{1,a}, F_{2,b}, \ldots, F_{i-1,y})$, and let $p_{F_{1,a},\ldots,F_{i-1,y}}(X_x)$ be the conditional occurrence probability of the x-th value $X_x$ of field X among the records having that value tuple. Then
  $$H_{F_1,F_2,\ldots,F_{i-1}}(X) = -\sum_a \sum_b \cdots \sum_y \Bigl[\, p(F_{1,a}, F_{2,b}, \ldots, F_{i-1,y}) \times \sum_x \bigl[\, p_{F_{1,a},\ldots,F_{i-1,y}}(X_x) \times \log p_{F_{1,a},\ldots,F_{i-1,y}}(X_x) \,\bigr] \Bigr]$$
- the field with the maximum conditional entropy calculated is set to the i-th axis, and the field identifier of that field is registered in the axis definition information as the field identifier of the field used as the i-th axis (step 412).
- the conditional entropy calculated in step 410 for the field set as the i-th axis is registered as the i-th axis entropy HG(i) of the reference data G in the entry for the reference data G of the determination result data (step 414).
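The axis determination procedure of steps 402 to 414 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the patent: the function names `entropy` and `determine_axes`, the dictionary-per-record layout, and the six-record table are all hypothetical, and ties are broken by field order, which the text does not specify.

```python
from collections import Counter
from math import log2

def entropy(rows, field, given=()):
    """Conditional entropy of `field` given the fields in `given`, in bits.

    With an empty `given` this is the plain entropy H(X) of step 402.
    """
    n = len(rows)
    # Group the records by the values of the conditioning fields.
    groups = {}
    for row in rows:
        key = tuple(row[g] for g in given)
        groups.setdefault(key, []).append(row[field])
    h = 0.0
    for values in groups.values():
        p_key = len(values) / n              # occurrence probability of the value tuple
        for count in Counter(values).values():
            p = count / len(values)          # conditional occurrence probability
            h -= p_key * p * log2(p)
    return h

def determine_axes(rows, fields):
    """Greedy ordering: returns [(field, HG(i)), ...] in axis order."""
    axes, remaining = [], list(fields)
    while remaining:
        chosen = [f for f, _ in axes]
        scores = {f: entropy(rows, f, chosen) for f in remaining}
        # Assumption: on a tie, the field that comes first in field order wins.
        best = max(remaining, key=lambda f: scores[f])
        axes.append((best, scores[best]))
        remaining.remove(best)
    return axes

# Hypothetical six-record table (not the data of the patent's figures).
G = [
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A1", "B": "B2", "C": "C1"},
    {"A": "A2", "B": "B1", "C": "C1"},
    {"A": "A2", "B": "B2", "C": "C1"},
    {"A": "A2", "B": "B2", "C": "C2"},
]
axes = determine_axes(G, ["A", "B", "C"])
for field, h in axes:
    print(field, round(h, 4))
```

For this hypothetical table the order comes out as A, B, C, with per-axis entropies of about 1.0, 0.918, and 0.333 bits.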
- the reference data G shown in FIG. 5A is a table including six records having three fields A, B, and C.
- only two values, $A_1$ and $A_2$, appear in field A; only two values, $B_1$ and $B_2$, appear in field B; and only two values, $C_1$ and $C_2$, appear in field C.
- the entropy H(X) of each field X is calculated in step 402 as
  $$H(X) = -\sum_i \bigl[\, p(X_i) \times \log p(X_i) \,\bigr]$$
- here, $X_i$ denotes the i-th value among the values appearing as the value of field X in the reference data G; $p(X_i)$ is the occurrence probability in the reference data G of records having $X_i$ as the value of field X (the number of records of the reference data G having $X_i$ as the value of field X divided by the total number of records of the reference data G); $\log$ is the base-2 logarithm; and $\sum_i$ denotes the sum over i.
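As a worked illustration of the entropy formula, consider a hypothetical field X in a six-record table in which the value $X_1$ appears in four records and $X_2$ in two. These counts are assumed for illustration and are not taken from the patent's figures.

```latex
p(X_1) = \tfrac{4}{6} = \tfrac{2}{3}, \qquad p(X_2) = \tfrac{2}{6} = \tfrac{1}{3}

H(X) = -\Bigl[\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\Bigr]
     = \log_2 3 - \tfrac{2}{3}
     \approx 0.918 \ \text{bits}
```

A field whose two values each appear in three of the six records would instead give the maximum of 1 bit, so it would be preferred as the first axis.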
- if H(A) is the largest of the entropies calculated for fields A, B, and C, field A is set as the first axis.
- the process then proceeds to the second axis: in step 410, with field A as the first axis, the conditional entropies $H_A(B)$ and $H_A(C)$ given field A are calculated for fields B and C, the fields other than field A.
- for each value $A_k$ of field A, the number (frequency) of each value $B_i$ ($B_1$ and $B_2$) of field B in the subset of the reference data G consisting of records having the value $A_k$, and the conditional occurrence probability $p_{A_k}(B_i)$, are obtained as shown in the table in the figure.
- $p_{A_k}(B_i)$ denotes the conditional occurrence probability of $B_i$ given the value $A_k$, that is, the occurrence probability of records having the value $B_i$ within the subset of the reference data G consisting of records having the value $A_k$.
- using the conditional occurrence probabilities $p_{A_k}(B_i)$ of the values of field B obtained as shown in the table of FIG. 6a and the occurrence probabilities $p(A_k)$ of the values of field A, the conditional entropy $H_A(B)$ of field B is calculated as
  $$H_A(B) = -\sum_k \Bigl[\, p(A_k) \times \sum_i \bigl[\, p_{A_k}(B_i) \times \log p_{A_k}(B_i) \,\bigr] \Bigr]$$
  where $\sum_k$ denotes the sum over k.
- similarly, the conditional entropy $H_A(C)$ of field C is calculated as
  $$H_A(C) = -\sum_k \Bigl[\, p(A_k) \times \sum_i \bigl[\, p_{A_k}(C_i) \times \log p_{A_k}(C_i) \,\bigr] \Bigr]$$
- if $H_A(B)$ is the larger of the conditional entropies $H_A(B)$ and $H_A(C)$ of the two fields B and C, field B is set as the second axis.
- the process then proceeds to the third axis: in step 410, with field A as the first axis and field B as the second axis, the conditional entropy $H_{A,B}(C)$ of field C, the remaining field other than fields A and B, given the first axis (field A) and the second axis (field B) is calculated.
- $p_{A_k,B_s}(C_i)$ denotes the conditional occurrence probability of $C_i$ given the value pair $(A_k, B_s)$, that is, the occurrence probability of records having the value $C_i$ within the subset of the reference data G consisting of records having both the value $A_k$ and the value $B_s$.
- for example, the frequency of the value $C_1$ of field C for the pair of the value $A_2$ of field A and the value $B_2$ of field B is 2, because the subset of the reference data G consisting of records whose field A value is $A_2$ and whose field B value is $B_2$ contains two records whose field C value is $C_1$.
- the three per-axis entropies $H(A)$, $H_A(B)$, and $H_{A,B}(C)$ can be regarded as a spectral decomposition of the entropy H of the reference data G along the axes.
- the axes are set in order starting from the field with the maximum entropy or conditional entropy calculated for the reference data G for the following reason: the sum of the entropy and the conditional entropies is the same regardless of the order of the axes.
- noise resistance is higher when processing proceeds from the fields with large entropy or conditional entropy.
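The statement that the sum of the per-axis entropies does not depend on the axis order is the chain rule of entropy: the sum always equals the joint entropy of the whole record. The following sketch checks this numerically for every ordering of a hypothetical three-field table. The function names and the data are assumptions for illustration only.

```python
from collections import Counter
from itertools import permutations
from math import log2

def joint_entropy(rows, fields):
    """Entropy in bits of the tuple of values taken by `fields` across the records."""
    n = len(rows)
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def conditional_entropy(rows, field, given):
    """Conditional entropy of `field` given `given`, computed directly from the definition."""
    n = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[g] for g in given), []).append(row[field])
    h = 0.0
    for values in groups.values():
        for count in Counter(values).values():
            p = count / len(values)
            h -= len(values) / n * p * log2(p)
    return h

# Hypothetical table: each string lists the values of fields A, B, C of one record.
rows = [dict(zip("ABC", r)) for r in ["111", "111", "121", "211", "221", "222"]]

total = joint_entropy(rows, "ABC")
sums = {
    order: sum(conditional_entropy(rows, f, order[:i]) for i, f in enumerate(order))
    for order in permutations("ABC")
}
for order, s in sums.items():
    print("".join(order), round(s, 6))
print("joint", round(total, 6))
```

All six orderings print the same sum as the joint entropy, so the choice of order changes how the total is split across the axes but not the total itself.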
- FIG. 8 shows the procedure of this data determination process.
- in the data determination process, the determination processing unit 12 calculates the entropy of the determination target data T (step 802) for each axis defined in the axis definition information (steps 800, 810, and 814).
- the calculated entropy of each axis of the determination target data T is registered as the entropy HT (i) of the axis in the determination target data T entry of the determination result data (step 804).
- the entropy of each axis in step 802 is calculated as follows: for the first axis, the entropy of the field serving as the first axis is calculated in the same manner as the entropy calculation for the reference data G; for the i-th axis (i of 2 or more), the conditional entropy of the field serving as the i-th axis given the fields of the first through (i-1)-th axes is calculated.
- for example, when the determination target data T is the table shown in FIG. 2b and the axis definition information indicates field A as the first axis, field B as the second axis, and field C as the third axis, the entropy $H(A)$ is calculated for the first axis in the same manner as in FIG. 5b and taken as the entropy of the first axis of the determination target data T; the conditional entropy $H_A(B)$ is calculated for the second axis in the same manner as for the reference data G and taken as the entropy of the second axis; and $H_{A,B}(C)$ is calculated for the third axis in the same manner and taken as the entropy of the third axis.
- for each axis defined in the axis definition information (steps 800, 810, and 814), the determination processing unit 12 calculates the mutual information amount between the reference data G and the determination target data T (step 806) and registers it in the entry for the mutual information amount I of the determination result data as the mutual information amount I(i) of that axis (step 808).
- for the first axis, the mutual information amount of the field serving as the first axis between the reference data G and the determination target data T is obtained as I(1).
- for the i-th axis (i of 2 or more), the conditional mutual information amount of the field serving as the i-th axis, given the fields of the first through (i-1)-th axes, is obtained as I(i).
- a method for calculating the mutual information amount I (i) of each axis will be described later.
- once the mutual information amount I(i) has been calculated for every axis (step 810), the distance between the reference data G and the determination target data T is evaluated based on the mutual information amount I(i) of each axis (step 812), and the data determination process ends.
- this evaluation can be performed, for example, by judging the distance to be shorter the larger the sum of the mutual information amounts I(i) over all or some of the axes is, or the larger a weighted sum of those amounts with appropriate weights is.
- in step 812, the entropy HG(i) calculated for each axis of the reference data G and the entropy HT(i) calculated for each axis of the determination target data T may also be taken into account, in addition to the mutual information amount I(i) of each axis, when determining the relationship between the reference data G and the determination target data T.
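The text leaves the exact evaluation rule of step 812 open and only states that a larger (weighted) sum of I(i) means a shorter distance. One possible reading is sketched below. The function name `closeness_score`, the weights, and the I(i) values are hypothetical; using a weight of zero is simply one way to express "a part of the axes".

```python
def closeness_score(mi_per_axis, weights=None):
    """Weighted sum of the per-axis mutual information; a larger score means a shorter distance."""
    if weights is None:
        weights = [1.0] * len(mi_per_axis)   # unweighted: plain sum over all axes
    if len(weights) != len(mi_per_axis):
        raise ValueError("one weight per axis is required")
    return sum(w * i for w, i in zip(weights, mi_per_axis))

# Hypothetical I(i) values of two candidate tables T1 and T2 against the same G.
i_t1 = [0.9, 0.4, 0.1]
i_t2 = [0.5, 0.5, 0.3]

plain = (closeness_score(i_t1), closeness_score(i_t2))
first_axis_only = (closeness_score(i_t1, [1, 0, 0]), closeness_score(i_t2, [1, 0, 0]))

# Under both evaluations T1 scores higher, so it would be judged closer to G.
print(plain, first_axis_only)
```

How a score is turned into a distance value, or compared against a threshold, is not specified in the text and is left out of the sketch.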
- the reference data G and determination target data T shown in FIG. 9A are tables including six records having three fields A, B, and C.
- only two values, $A_1$ and $A_2$, appear in field A; only two values, $B_1$ and $B_2$, appear in field B; and only two values, $C_1$ and $C_2$, appear in field C.
- field A is set on the first axis
- field B is set on the second axis
- field C is set on the third axis.
- a pair of records at the same position (order) in the reference data G and the determination target data T is defined as a record set RS, and the collection of record sets RS so defined, equal in number to the records of the reference data G and of the determination target data T, is called a record set set. That is, the n-th record set RS_n is composed of the n-th record of the reference data G and the n-th record of the determination target data T. Since the reference data G and the determination target data T each have six records, the record set set also contains six record sets, RS_1 to RS_6.
- since the first axis is field A, the mutual information amount I(1) of the first axis is calculated by focusing on field A of the reference data G and field A of the determination target data T in the record set set.
- $A_i$ denotes the i-th value among the values appearing in field A of the reference data G, and $A_j$ denotes the j-th value among the values appearing in field A of the determination target data T.
- $p(A_i)$ is the appearance probability, within the record set set, of record sets RS that include a record of the reference data G whose field A value is $A_i$, and $p(A_j)$ is the appearance probability, within the record set set, of record sets RS that include a record of the determination target data T whose field A value is $A_j$.
- $p(A_i, A_j)$ is the appearance probability, within the record set set, of record sets RS composed of a record of the reference data G whose field A value is $A_i$ and a record of the determination target data T whose field A value is $A_j$; it is obtained as shown in FIG. 9c.
- the mutual information amount $I(A)$ of field A is obtained from $p(A_i, A_j)$, $p(A_i)$, and $p(A_j)$ and taken as the mutual information amount I(1) of the first axis:
  $$I(1) = I(A) = \sum_{i,j} \Bigl[\, p(A_i, A_j) \times \log \frac{p(A_i, A_j)}{p(A_i)\,p(A_j)} \Bigr]$$
  where $\sum_{i,j}$ denotes the sum over i and j.
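The first-axis calculation can be sketched as follows, assuming the standard definition of mutual information over values paired by record position (each pair corresponding to one record set RS). The function name `mutual_information` and the two example columns are hypothetical and are not the data of FIG. 9.

```python
from collections import Counter
from math import log2

def mutual_information(g_values, t_values):
    """Mutual information in bits between two columns paired by record position."""
    if len(g_values) != len(t_values):
        raise ValueError("G and T must have the same number of records")
    n = len(g_values)
    joint = Counter(zip(g_values, t_values))   # counts behind p(A_i, A_j)
    g_counts = Counter(g_values)               # counts behind p(A_i) in G
    t_counts = Counter(t_values)               # counts behind p(A_j) in T
    return sum(
        c / n * log2((c / n) / ((g_counts[a] / n) * (t_counts[b] / n)))
        for (a, b), c in joint.items()
    )

# Hypothetical field A of the reference data G.
g_a = ["A1", "A1", "A1", "A2", "A2", "A2"]

# T identical to G: I(1) equals the entropy of the column.
identical = mutual_information(g_a, g_a)

# T statistically independent of G: each group of G sees the same mix of T values.
independent = mutual_information(g_a, ["A1", "A1", "A2", "A1", "A1", "A2"])

print(identical, independent)
```

The identical column gives 1 bit and the independent column gives 0, the two extremes between which I(1) falls for this column of G.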
- since the first axis is field A and the second axis is field B, the mutual information amount I(2) of the second axis is calculated by focusing on fields A and B of the reference data G and field B of the determination target data T in the record set set.
- $p(A_k)$ is the appearance probability, within the record set set, of record sets RS that include a record of the reference data G having the value $A_k$ as the value of field A.
- the conditional occurrence probability $p_{A_k}(B_i)$ is the occurrence probability, within the subset of the record set set consisting of record sets RS that include a record of the reference data G having the value $A_k$ as the value of field A, of record sets RS that include a record of the reference data G having $B_i$ as the value of field B; the conditional occurrence probability $p_{A_k}(B_j)$ is the occurrence probability, within the same subset, of record sets RS that include a record of the determination target data T having $B_j$ as the value of field B.
- $p_{A_k}(B_i, B_j)$ is the occurrence probability, within the same subset, of record sets RS that include both a record of the reference data G having $B_i$ as the value of field B and a record of the determination target data T having $B_j$ as the value of field B.
- $p(A_k)$ and $p_{A_k}(B_i, B_j)$ are obtained as shown in FIG. 10B for the reference data G and the determination target data T shown in FIG. 10A.
- for example, $p_{A_1}(B_1, B_2)$ is the occurrence probability, within the subset of the record set set consisting of record sets RS that include a record of the reference data G whose field A value is $A_1$, of record sets RS in which the field B value of the record of the reference data G is $B_1$ and the field B value of the record of the determination target data T is $B_2$.
- the mutual information amount I(2) of the second axis is calculated as the conditional mutual information of field B given field A:
  $$I(2) = \sum_k \Bigl[\, p(A_k) \times \sum_{i,j} \Bigl[\, p_{A_k}(B_i, B_j) \times \log \frac{p_{A_k}(B_i, B_j)}{p_{A_k}(B_i)\,p_{A_k}(B_j)} \Bigr] \Bigr]$$
- since the first axis is field A, the second axis is field B, and the third axis is field C, the mutual information amount I(3) of the third axis is calculated by focusing on fields A, B, and C of the reference data G and field C of the determination target data T in the record set set.
- $p(A_k, B_s)$ is the appearance probability, within the record set set, of record sets RS that include a record of the reference data G having the value $A_k$ as the value of field A and the value $B_s$ as the value of field B.
- the conditional occurrence probability $p_{A_k,B_s}(C_i)$ is the occurrence probability, within the subset of the record set set consisting of record sets RS that include a record of the reference data G having the value $A_k$ as the value of field A and the value $B_s$ as the value of field B, of record sets RS that include a record of the reference data G having $C_i$ as the value of field C; the conditional occurrence probability $p_{A_k,B_s}(C_j)$ is the occurrence probability, within the same subset, of record sets RS that include a record of the determination target data T having $C_j$ as the value of field C.
- $p_{A_k,B_s}(C_i, C_j)$ is the occurrence probability, within the same subset, of record sets RS that include both a record of the reference data G having $C_i$ as the value of field C and a record of the determination target data T having $C_j$ as the value of field C.
- for example, $p_{A_1,B_1}(C_1, C_1)$ is the occurrence probability, within the subset of the record set set consisting of record sets RS that include a record of the reference data G whose field A value is $A_1$ and whose field B value is $B_1$, of record sets RS in which the field C value is $C_1$ in both the record of the reference data G and the record of the determination target data T.
- using $p(A_k, B_s)$, $p_{A_k,B_s}(C_i, C_j)$, $p_{A_k,B_s}(C_i)$, and $p_{A_k,B_s}(C_j)$, the conditional mutual information $I_{A,B}(C)$ of field C given fields A and B, the first and second axes, is calculated and taken as the mutual information amount I(3) of the third axis:
  $$I(3) = I_{A,B}(C) = \sum_{k,s} \Bigl[\, p(A_k, B_s) \times \sum_{i,j} \Bigl[\, p_{A_k,B_s}(C_i, C_j) \times \log \frac{p_{A_k,B_s}(C_i, C_j)}{p_{A_k,B_s}(C_i)\,p_{A_k,B_s}(C_j)} \Bigr] \Bigr]$$
- the method for calculating the mutual information amount I(i) has been described for the case of three fields, but when the number of fields is four or more, the mutual information amount I(n) of the fourth and subsequent axes can likewise be calculated as the conditional mutual information amount given the fields of the first through (n-1)-th axes.
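The general per-axis calculation can be sketched as follows for any number of fields. Following the definitions above, the conditioning is on the values that the reference data G takes on the earlier axes. The function names, the record layout, and the data are hypothetical, and this is an illustration rather than the patent's implementation.

```python
from collections import Counter, defaultdict
from math import log2

def mutual_information(pairs):
    """Mutual information in bits of a list of (G value, T value) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # (c/n) / ((left/n) * (right/n)) simplifies to c*n / (left*right).
    return sum(c / n * log2(c * n / (left[a] * right[b])) for (a, b), c in joint.items())

def axis_mutual_information(g_rows, t_rows, axes):
    """Return [I(1), ..., I(n)] for the fields listed in `axes`, in axis order."""
    n = len(g_rows)
    result = []
    for i, field in enumerate(axes):
        # Split the record sets by the values of G on axes 1..i-1.
        groups = defaultdict(list)
        for g, t in zip(g_rows, t_rows):
            key = tuple(g[f] for f in axes[:i])
            groups[key].append((g[field], t[field]))
        # Weight each subset's mutual information by the subset's probability.
        result.append(sum(len(p) / n * mutual_information(p) for p in groups.values()))
    return result

# Hypothetical table: each string lists the values of fields A, B, C of one record.
G = [dict(zip("ABC", r)) for r in ["111", "111", "121", "211", "221", "222"]]
T = [dict(row) for row in G]   # T identical to G

I = axis_mutual_information(G, T, ["A", "B", "C"])
print([round(x, 4) for x in I])
```

When T is identical to G, each I(i) equals the corresponding per-axis entropy HG(i), which is the largest value I(i) can take for that axis.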
- as described above, in the present embodiment the mutual information amount and the conditional mutual information amount of each field of the reference data G and the determination target data T are obtained, and the distance between the reference data G and the determination target data T is calculated from them.
- because the mutual information amount and the conditional mutual information amount of each field reflect characteristics of that field that span the records, the distance between the reference data G and the determination target data T can be calculated according to the characteristics of the data as a whole.
- because the mutual information amount and the conditional mutual information amount of each field are independent of one another, the distance between the reference data G and the determination target data T can easily be evaluated from various viewpoints.
- the mutual information between the two tables is equal to the sum of the mutual information and the conditional mutual information of the individual items, so if the per-item values are left unweighted, processing based on the overall mutual information can be performed without excess or deficiency.
- in the above embodiment, each axis is set in order by taking the field with the maximum entropy or conditional entropy calculated for the reference data G as the next axis, but the axes may instead be set according to another criterion.
- for example, the fields serving as axes may be set in order according to the field order.
- the above embodiment can also be performed by transposing fields and records.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The purpose of the present invention is to calculate a distance between tables of records, each record consisting of a plurality of fields. In the information processing system according to the present invention, an axis determination unit 11 determines the order of fields of reference data G constituting a table. Following this order of fields, a data determination processing unit 12 first calculates the mutual information between the first field of the reference data G and the first field of target data T, and then calculates conditional mutual information between each subsequent field of the reference data G and the corresponding field of the target data T on the basis of fields preceding the subsequent field. The data determination processing unit then calculates a distance between the reference data G and the target data T on the basis of the mutual information, or the conditional mutual information, calculated for each field.
Description
The present invention relates to a technique for calculating a distance between data.
As a technique for calculating the distance between data, a technique is known that holds the correspondence between case data and classifications and classifies data based on the distance between the data to be classified and the case data and on that correspondence (for example, Patent Document 1). In this technique, the distance between data composed of a plurality of items and the case data is obtained as a weighted sum of the differences between the values of the respective items.
When the above technique for calculating the distance between data is applied to a table, that is, a set of records each composed of a plurality of fields, it amounts to a technique for calculating the distance between individual records. Therefore, even if it is applied as-is to the calculation of the distance between records, it cannot calculate a distance between tables that reflects the characteristics of a table as a whole, which appear across a plurality of records.
Therefore, an object of the present invention is to provide an information processing system capable of calculating the distance between sets of records composed of a plurality of fields according to the characteristics of each set as a whole.
In order to achieve this object, the present invention provides an information processing system for calculating a distance between a first set of records, each composed of a plurality of fields, and a second set of such records, the system comprising: a rank setting unit that sets a rank for each field of the records; an evaluation value calculation unit that, following the ranks set by the rank setting unit, calculates the mutual information between the first set and the second set for the first-rank field as the evaluation value of that field, and calculates, for the field of each rank other than the first, the conditional mutual information between the first set and the second set given the higher-ranked fields as the evaluation value of that field; and a distance calculation unit that calculates the distance between the first set and the second set using at least part of the evaluation values calculated for the fields by the evaluation value calculation unit.
Here, such an information processing system may be configured so that the rank setting unit sets the ranks of the fields by first assigning the first rank to the field with the maximum entropy in the first set and then, until every field has a rank, repeatedly assigning the rank following the most recently assigned one to the field whose conditional entropy in the first set, given the fields already ranked, is the largest.
According to the information processing system described above, the mutual information and the conditional mutual information of each field of the first set and the second set are obtained as an independent evaluation value for each field, and the distance between the first set and the second set is calculated using the obtained evaluation values. Because the mutual information and the conditional mutual information of each field reflect characteristics of that field that span the records, the system can calculate the distance between sets of records composed of a plurality of fields according to the characteristics of each set as a whole. In addition, because the evaluation values of the fields, obtained as the mutual information and the conditional mutual information of each field, are independent of one another, the distance between the first set and the second set can easily be evaluated from various viewpoints. Furthermore, the mutual information between the two tables is equal to the sum of the mutual information and the conditional mutual information of the individual items, so if the per-item values are left unweighted, processing based on the overall mutual information can be performed without excess or deficiency.
As described above, according to the present invention, it is possible to provide an information processing system capable of calculating the distance between sets of records composed of a plurality of fields according to the characteristics of each set as a whole.
Hereinafter, embodiments of an information processing system according to the present invention will be described.
FIG. 1 shows the configuration of the information processing system according to the present embodiment.
As illustrated, the information processing system includes a processor 1 and a storage 2.
The processor 1 includes an axis determination unit 11 and a determination processing unit 12.
Here, the processor 1 is, for example, a general-purpose computer including a CPU, a memory, and various peripheral devices such as a display device and an input device. The axis determination unit 11 and the determination processing unit 12 are implemented as functions of the computer that are realized when the computer executes a predetermined program.
Next, the storage 2 stores reference data, axis information, determination result data, and determination target data.
Here, the storage 2 may be an external storage connected to the processor 1, or a network storage, data server, or the like that the processor 1 accesses via a network.
Next, as shown in FIG. 2a, the reference data is a table, that is, a set of records (rows 1 to 6 in the figure) each composed of a plurality of fields (columns A, B, and C in the figure).
As shown in FIG. 2b, the determination target data is likewise a table that is a set of records (the rows in the figure) each composed of a plurality of fields. The field configuration of the determination target data is the same as that of the reference data, and the number of records in the determination target data is equal to the number of records in the reference data.
Records at the same position in the reference data and the determination target data store, in their fields, information about the same object. That is, the n-th record of the reference data and the n-th record of the determination target data both store information about the n-th object.
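The alignment requirement described above can be checked mechanically. The following is a minimal sketch, assuming each table is held as a Python list of dictionaries keyed by field name; the function name `check_alignment` and the sample values are hypothetical and are not taken from the embodiment.

```python
def check_alignment(reference, target):
    """Raise ValueError unless both tables have the same record count
    and the same field layout in every record position."""
    if len(reference) != len(target):
        raise ValueError("record counts differ")
    for n, (g_rec, t_rec) in enumerate(zip(reference, target), start=1):
        if list(g_rec.keys()) != list(t_rec.keys()):
            raise ValueError(f"field layout differs at record {n}")


# Hypothetical two-record tables: record n of G and record n of T
# describe the same n-th object.
G = [{"A": "A1", "B": "B1", "C": "C1"}, {"A": "A2", "B": "B2", "C": "C1"}]
T = [{"A": "A1", "B": "B2", "C": "C1"}, {"A": "A2", "B": "B2", "C": "C2"}]
check_alignment(G, T)  # passes silently when the tables are aligned
```

Positional pairing is the only link between the two tables in this sketch, so reordering the records of one table without the other would silently change the result of any later comparison.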
That is, for example, the m-th field of the n-th record of both the reference data and the determination target data stores the m-th attribute value of the n-th object.
More specifically, suppose for example that each of N sensors detects M states of a different object. The m-th field of the n-th record of the reference data then stores the target value of the m-th state detected by the n-th sensor, and the m-th field of the n-th record of the determination target data stores the value of the m-th state actually detected by the n-th sensor.
In the following description, "G" denotes the reference data and "T" denotes the determination target data.
Next, as shown in FIG. 3a, the axis information includes axis number information and axis definition information.
The number of fields of the reference data G is registered in the axis number information.
Letting n be the number of fields of the reference data G, the axis definition information registers, for each axis from the first axis to the n-th axis, the field identifier of the field used as that axis.
As shown in FIG. 3b, the determination result data includes an entry for the reference data G, an entry for the determination target data T, and an entry for the mutual information I. The entropy HG() of each axis of the reference data G, from the first axis to the n-th axis, is registered in the entry for the reference data G, and the entropy HT() of each axis of the determination target data T, from the first axis to the n-th axis, is registered in the entry for the determination target data T. The mutual information I() between the reference data G and the determination target data T for each axis, from the first axis to the n-th axis, is registered in the entry for the mutual information I.
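As an illustration only, the axis information and the determination result data could be held in structures like the following. This is a sketch under the assumption that each entry is a simple per-axis list; the class and attribute names (`AxisInfo`, `DeterminationResult`, `hg`, `ht`, `mi`) are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AxisInfo:
    axis_count: int = 0  # axis number information: number of fields of G
    axis_fields: List[str] = field(default_factory=list)  # field identifier per axis


@dataclass
class DeterminationResult:
    hg: List[float] = field(default_factory=list)  # HG(i): entropy of axis i of G
    ht: List[float] = field(default_factory=list)  # HT(i): entropy of axis i of T
    mi: List[float] = field(default_factory=list)  # I(i): mutual information of axis i


# Index 0 of each list corresponds to the first axis.
info = AxisInfo(axis_count=3, axis_fields=["A", "B", "C"])
result = DeterminationResult()
result.hg.append(1.0)  # e.g. an entropy registered for the first axis
```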
Returning to FIG. 1, the axis determination unit 11 of the processor 1 performs an axis determination process in which it creates the axis information using the reference data G, calculates the entropy HG of each axis of the reference data G from the first axis to the n-th axis, and registers the calculated entropies in the determination result data.
The determination processing unit 12 of the processor 1 performs a data determination process in which, using the determination target data T, the reference data G, and the axis information, it calculates the entropy HT of each axis of the determination target data T from the first axis to the n-th axis and the mutual information I() between the reference data G and the determination target data T for each axis from the first axis to the n-th axis, registers them in the determination result data, and evaluates the distance between the reference data G and the determination target data T based on the calculated mutual information I().
Hereinafter, the axis determination process performed by the axis determination unit 11 of the processor 1 will be described.
FIG. 4 shows the procedure of this process.
As illustrated, the axis determination unit 11 first registers the number of fields of the reference data G in the axis number information (step 400).
It then calculates the entropy $H(X)$ of each field of the reference data G (step 402). Here, $H(X)$ denotes the entropy of field X. Letting $X_i$ be the i-th of the values appearing in field X of the reference data G, $p(X_i)$ be the occurrence probability of $X_i$ in the reference data G, Log be the logarithm to base 2, and $\sum_i$ be the sum over i, $H(X)$ is given by
$$H(X) = -\sum_i \left[ p(X_i) \times \mathrm{Log}\{p(X_i)\} \right]$$
where Log 0 is taken to be 0. The base of Log may be a number other than 2.
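The entropy of step 402 can be sketched in a few lines. This is a minimal illustration, assuming the values of one field are given as a Python list and that probabilities are estimated as relative frequencies; the function name `entropy` and the sample column are hypothetical.

```python
import math
from collections import Counter


def entropy(values, base=2.0):
    """H(X) of one field, with p(X_i) taken as count(X_i) / number of records."""
    total = len(values)
    counts = Counter(values)
    # Only values that actually occur are summed, so a zero-probability term
    # (the Log 0 = 0 convention) never has to be evaluated.
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


# Hypothetical column with two equally frequent values.
field_a = ["A1", "A1", "A1", "A2", "A2", "A2"]
h_a = entropy(field_a)  # approximately 1.0 with base 2
```

Passing a different `base` corresponds to the remark that the base of Log need not be 2; it only rescales every entropy by the same factor, so it does not change which field has the maximum.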
A specific example of calculating this entropy will be described later.
Then, the field for which the entropy calculated for the reference data G is the maximum is set as the first axis, and the field identifier of that field is registered in the axis definition information as the field identifier of the field used as the first axis (step 404).
Then, the entropy calculated in step 402 for the field set as the first axis is registered in the entry for the reference data G of the determination result data as the first-axis entropy HG(1) of the reference data G (step 406).
Next, letting n be the number of fields of the reference data G, the following processing is performed in order for each value of i from 2 to n (steps 408, 416, and 418).
First, for the reference data G, the conditional entropy $H_{F_1,F_2,\ldots,F_{i-1}}(X)$ is calculated for each field X other than the fields already set as the first through (i-1)-th axes, conditioned on the fields of the first through (i-1)-th axes (step 410). Here, $F_j$ denotes the field set as the j-th axis (where j < i), and $H_{F_1,F_2,\ldots,F_{i-1}}(X)$ denotes the conditional entropy of field X given the fields $F_1, F_2, \ldots, F_{i-1}$.
Let $F_{sa}$ be the a-th of the values appearing in the field $F_s$ set as the s-th axis of the reference data G, let $p(F_{1a}, F_{2b}, \ldots, F_{(i-1)y})$ be the occurrence probability of the value tuple $(F_{1a}, F_{2b}, \ldots, F_{(i-1)y})$ in the reference data G, and let $p_{F_{1a},F_{2b},\ldots,F_{(i-1)y}}(X_i)$ be the occurrence probability of $X_i$ within the subset of the reference data G consisting of the records having that value tuple. The conditional entropy $H_{F_1,F_2,\ldots,F_{i-1}}(X)$ is then given by
$$H_{F_1,F_2,\ldots,F_{i-1}}(X) = -\sum_a \sum_b \cdots \sum_y \left[ p(F_{1a}, F_{2b}, \ldots, F_{(i-1)y}) \times \sum_i \left[ p_{F_{1a},F_{2b},\ldots,F_{(i-1)y}}(X_i) \times \mathrm{Log}\{ p_{F_{1a},F_{2b},\ldots,F_{(i-1)y}}(X_i) \} \right] \right]$$
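The conditional entropy of step 410 can be computed by grouping the records on the already chosen axes and weighting the entropy within each group by the probability of the group. The sketch below assumes a table held as a list of dictionaries; the function names and the six-record table are hypothetical (FIG. 5a is not reproduced in the text, so the rows were chosen only to agree with the counts quoted in the description).

```python
import math
from collections import Counter, defaultdict


def entropy(values, base=2.0):
    """H(X) estimated from the relative frequencies of the observed values."""
    total = len(values)
    return -sum((c / total) * math.log(c / total, base)
                for c in Counter(values).values())


def conditional_entropy(records, given, target, base=2.0):
    """Entropy of field `target` conditioned on the fields listed in `given`."""
    groups = defaultdict(list)
    for rec in records:
        # One group per distinct value tuple of the conditioning fields.
        groups[tuple(rec[f] for f in given)].append(rec[target])
    n = len(records)
    # Weight each group's entropy by the occurrence probability of its tuple.
    return sum((len(vals) / n) * entropy(vals, base) for vals in groups.values())


# Hypothetical table with fields A, B, C and six records.
ROWS = [("A1", "B1", "C1"), ("A1", "B2", "C2"), ("A1", "B2", "C1"),
        ("A2", "B1", "C1"), ("A2", "B2", "C1"), ("A2", "B2", "C1")]
G = [dict(zip("ABC", row)) for row in ROWS]

h_b_given_a = conditional_entropy(G, ["A"], "B")
h_c_given_ab = conditional_entropy(G, ["A", "B"], "C")
```

With an empty `given` list every record falls into a single group, so the function reduces to the plain entropy of the target field.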
A specific example of calculating this conditional entropy will be described later.
Then, the field for which the calculated conditional entropy is the maximum is set as the i-th axis, and the field identifier of that field is registered in the axis definition information as the field identifier of the field used as the i-th axis (step 412).
In addition, the conditional entropy calculated in step 410 for the field set as the i-th axis is registered in the entry for the reference data G of the determination result data as the i-th axis entropy HG(i) of the reference data G (step 414).
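Steps 402 to 418 amount to a greedy selection: at each step the not yet ranked field with the largest entropy, conditioned on the fields already chosen, becomes the next axis. The following is a minimal sketch of that loop under stated assumptions; the function names, the tie-breaking rule (the earlier field wins), and the six-record table are hypothetical and not taken from the embodiment.

```python
import math
from collections import Counter, defaultdict


def entropy(values, base=2.0):
    """H(X) estimated from the relative frequencies of the observed values."""
    total = len(values)
    return -sum((c / total) * math.log(c / total, base)
                for c in Counter(values).values())


def conditional_entropy(records, given, target, base=2.0):
    """Entropy of `target` conditioned on the fields listed in `given`."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in given)].append(rec[target])
    return sum(len(v) / len(records) * entropy(v, base) for v in groups.values())


def determine_axes(records, fields):
    """Return (axis order, per-axis entropies HG(1)..HG(n)) by greedy selection."""
    axes, hg = [], []
    remaining = list(fields)
    while remaining:
        # With no axis chosen yet this is the plain entropy of step 402.
        scores = {f: conditional_entropy(records, axes, f) for f in remaining}
        best = max(remaining, key=scores.get)  # assumption: ties go to the earlier field
        axes.append(best)          # corresponds to steps 404 and 412
        hg.append(scores[best])    # corresponds to steps 406 and 414
        remaining.remove(best)
    return axes, hg


# Hypothetical table; FIG. 5a is not reproduced in the text.
ROWS = [("A1", "B1", "C1"), ("A1", "B2", "C2"), ("A1", "B2", "C1"),
        ("A2", "B1", "C1"), ("A2", "B2", "C1"), ("A2", "B2", "C1")]
G = [dict(zip("ABC", row)) for row in ROWS]
axes, hg = determine_axes(G, ["A", "B", "C"])
```

For this hypothetical table the order comes out as A, B, C, matching the case assumed in the worked example of the description.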
When registration in the axis definition information and the determination result data has been completed as described above for every axis from the first axis to the n-th axis, the axis determination process ends.
Hereinafter, the entropy calculation performed in step 402 and the conditional entropy calculation performed in step 410 will be described in detail, taking the reference data G shown in FIG. 5a as an example.
The reference data G shown in FIG. 5a is a table containing six records, each having the three fields A, B, and C. Only two values, $A_1$ and $A_2$, appear in field A; only two values, $B_1$ and $B_2$, appear in field B; and only two values, $C_1$ and $C_2$, appear in field C.
The entropy $H(X)$ of each field of the reference data G in step 402 is calculated by
$$H(X) = -\sum_i \left[ p(X_i) \times \mathrm{Log}\{p(X_i)\} \right]$$
Here, $X_i$ denotes the i-th of the values appearing as values of field X in the reference data G, and $p(X_i)$ denotes the occurrence probability in the reference data G of records having $X_i$ as the value of field X (the number of records of the reference data G having $X_i$ as the value of field X, divided by the total number of records of the reference data G). Log denotes the logarithm to base 2, and $\sum_i$ denotes the sum over i.
That is, the frequency and the occurrence probability $p(A_i)$ of each value $A_i$ ($A_1$ and $A_2$) of field A of the reference data G in FIG. 5a are obtained as shown in the table of FIG. 5b.
For example, since the number of records in the reference data G whose field A has the value $A_2$ is 3, the frequency of $A_2$ is 3. Since the number of records of the reference data G is 6, the occurrence probability of $A_2$ is $p(A_2) = 3/6 = 0.5$.
From the occurrence probability $p(A_i)$ of each value $A_i$ of field A, obtained as shown in the table of FIG. 5b, the entropy $H(A)$ of field A is calculated as
$$H(A) = -\sum_i \left[ p(A_i) \times \mathrm{Log}\{p(A_i)\} \right]$$
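As a worked illustration of this formula, assume that every one of the six records holds either $A_1$ or $A_2$ in field A. Since the frequency of $A_2$ is 3, the frequency of $A_1$ is then also 3, so that $p(A_1) = p(A_2) = 0.5$, and with Log taken to base 2:

$$H(A) = -\left[ 0.5 \times \mathrm{Log}\{0.5\} + 0.5 \times \mathrm{Log}\{0.5\} \right] = -\left[ 0.5 \times (-1) + 0.5 \times (-1) \right] = 1$$

This is the largest entropy a two-valued field can have, which is consistent with the case considered in the description in which field A is chosen as the first axis.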
Similarly, for field B and field C, the frequency and occurrence probability of each value of field B are obtained as shown in the table of FIG. 5c, and those of each value of field C as shown in the table of FIG. 5d. The entropy $H(B)$ of field B is calculated as
$$H(B) = -\sum_i \left[ p(B_i) \times \mathrm{Log}\{p(B_i)\} \right]$$
and the entropy $H(C)$ of field C is calculated as
$$H(C) = -\sum_i \left[ p(C_i) \times \mathrm{Log}\{p(C_i)\} \right]$$
Next, the calculation of the conditional entropy performed in step 410 will be described.
Suppose that, among the entropies $H(A)$, $H(B)$, and $H(C)$ obtained as in FIG. 5 for the three fields A, B, and C of the reference data G, $H(A)$ is the maximum. In that case, field A is set as the first axis.
Once the first axis has been set, the process proceeds to the second axis. In step 410, with field A as the first axis, the conditional entropies $H_A(B)$ and $H_A(C)$ of the remaining fields B and C, conditioned on the first axis (field A), are calculated.
In this case, the number (frequency) of occurrences of each value $B_i$ ($B_1$ and $B_2$) of field B within the subset of the reference data G consisting of the records having the value $A_k$, and the conditional occurrence probability $p_{A_k}(B_i)$, are obtained as shown in the table of FIG. 6a. Here, $p_{A_k}(B_i)$ denotes the conditional occurrence probability of $B_i$ given the value $A_k$, that is, the occurrence probability of records having the value $B_i$ within the subset of the reference data G consisting of the records having the value $A_k$.
For example, the subset of the reference data G consisting of the records whose field A has the value $A_2$ contains one record whose field B has the value $B_1$, so the frequency of the value $B_1$ of field B given the value $A_2$ of field A is 1. Since that subset contains three records, the conditional occurrence probability of the value $B_1$ of field B given the value $A_2$ of field A is $p_{A_2}(B_1) = 1/3 \approx 0.33$.
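A table of conditional frequencies and probabilities such as the one in FIG. 6a can be derived by counting. The following sketch assumes a table held as a list of dictionaries; the function name `conditional_probabilities` is hypothetical, and the six-record table is an invented one that merely agrees with the counts quoted in the description (FIG. 5a itself is not reproduced in the text).

```python
from collections import Counter
from fractions import Fraction


def conditional_probabilities(records, given, target):
    """Map (given value, target value) to the probability of the target value
    within the subset of records having that given value."""
    group_sizes = Counter(rec[given] for rec in records)
    pair_counts = Counter((rec[given], rec[target]) for rec in records)
    return {(g, t): Fraction(count, group_sizes[g])
            for (g, t), count in pair_counts.items()}


# Hypothetical table with fields A, B, C and six records.
ROWS = [("A1", "B1", "C1"), ("A1", "B2", "C2"), ("A1", "B2", "C1"),
        ("A2", "B1", "C1"), ("A2", "B2", "C1"), ("A2", "B2", "C1")]
G = [dict(zip("ABC", row)) for row in ROWS]

p_b_given_a = conditional_probabilities(G, "A", "B")
# p_b_given_a[("A2", "B1")] is Fraction(1, 3), matching the value quoted above.
```

Exact fractions are used here only to make the counts easy to compare with the figures; floating-point probabilities would serve equally well for the entropy calculation.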
From the conditional occurrence probabilities $p_{A_k}(B_i)$ of the values of field B, obtained as shown in the table of FIG. 6a, and the occurrence probability $p(A_k)$ of each value of field A in the reference data G, the conditional entropy $H_A(B)$ of field B is calculated as
$$H_A(B) = -\sum_k \left[ p(A_k) \times \sum_i \left[ p_{A_k}(B_i) \times \mathrm{Log}\{p_{A_k}(B_i)\} \right] \right]$$
where $\sum_k$ denotes the sum over k.
Similarly, for field C, the frequency and conditional occurrence probability of each value of field C are obtained as shown in the table of FIG. 6b, and the conditional entropy $H_A(C)$ of field C is calculated as
$$H_A(C) = -\sum_k \left[ p(A_k) \times \sum_i \left[ p_{A_k}(C_i) \times \mathrm{Log}\{p_{A_k}(C_i)\} \right] \right]$$
Next, suppose that, of the conditional entropies $H_A(B)$ and $H_A(C)$ obtained as in FIG. 6 for the two fields B and C, the conditional entropy $H_A(B)$ of field B is the maximum. In that case, field B is set as the second axis.
Once the second axis has been set, the process proceeds to the third axis (i = 3). In step 410, with field A as the first axis and field B as the second axis, the conditional entropy $H_{A,B}(C)$ of the remaining field C, conditioned on the first axis (field A) and the second axis (field B), is calculated.
In this case, the number (frequency) of occurrences of each value $C_i$ ($C_1$ and $C_2$) of field C within the subset of the reference data G consisting of the records having both the value $A_k$ and the value $B_s$, and the conditional occurrence probability $p_{A_k,B_s}(C_i)$, are obtained as shown in the table of FIG. 7.
Here, $p_{A_k,B_s}(C_i)$ denotes the conditional occurrence probability of $C_i$ given the value pair $(A_k, B_s)$, that is, the occurrence probability of records having the value $C_i$ within the subset of the reference data G consisting of the records having both the value $A_k$ and the value $B_s$.
For example, the subset of the reference data G consisting of the records whose field A has the value $A_2$ and whose field B has the value $B_2$ contains two records whose field C has the value $C_1$, so the frequency of the value $C_1$ of field C given the pair of the value $A_2$ of field A and the value $B_2$ of field B is 2. Since that subset contains two records, the conditional occurrence probability of the value $C_1$ of field C given that pair is $p_{A_2,B_2}(C_1) = 2/2 = 1$.
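As a short worked consequence of this example: since only $C_1$ and $C_2$ appear in field C and $p_{A_2,B_2}(C_1) = 1$, it follows that $p_{A_2,B_2}(C_2) = 0$. Using the convention that Log 0 is taken to be 0, the inner sum for the pair $(A_2, B_2)$ is

$$\sum_i \left[ p_{A_2,B_2}(C_i) \times \mathrm{Log}\{p_{A_2,B_2}(C_i)\} \right] = 1 \times \mathrm{Log}\{1\} + 0 \times \mathrm{Log}\{0\} = 0$$

so the pair $(A_2, B_2)$ contributes nothing to the conditional entropy of field C: within that subset, the value of field C is fully determined by the values of fields A and B.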
From the conditional occurrence probabilities $p_{A_k,B_s}(C_i)$ of the values of field C, obtained as shown in the table of FIG. 7, and the occurrence probability $p(A_k, B_s)$ in the reference data G of each pair $(A_k, B_s)$ of a value $A_k$ of field A and a value $B_s$ of field B, the conditional entropy $H_{A,B}(C)$ of field C is calculated as
$$H_{A,B}(C) = -\sum_k \sum_s \left[ p(A_k, B_s) \times \sum_i \left[ p_{A_k,B_s}(C_i) \times \mathrm{Log}\{p_{A_k,B_s}(C_i)\} \right] \right]$$
where $\sum_s$ denotes the sum over s.
HA,B(C)= -ΣkΣs[p(Ak Bs)×Σi[pAk,Bs(Ci) ×Log{pAk,Bs(Ci)}]]
として算出される。なお、Σsは、sについての総和を表す。 Then, as described above, the conditional probability p Ak for each value of field C determined as shown in Table 7, the Bs (C i), each of the values A k and field B Field A From the occurrence probability p (A k, B s ) in each reference data G of the pair (A k , B s ) with the value B s , the conditional entropy H A, B (C) of the field C is
H A, B (C) = -Σ k Σ s [p (A k B s ) × Σ i [p Ak, Bs (C i ) × Log {p Ak, Bs (C i )}]]
Is calculated as Note that Σ s represents the total sum of s.
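The conditional entropy above can be sketched in code as follows. This is a minimal illustration, not the implementation described in the source: the function name `conditional_entropy`, the dict-based record layout, the sample table `G`, and the choice of base-2 logarithms are all assumptions (the text writes only "Log").

```python
from collections import Counter
from math import log2


def conditional_entropy(records, target, given):
    """Entropy of field `target` conditioned on the fields listed in `given`.

    `records` is a list of dicts mapping field name -> value.
    Base-2 logarithms are an assumption; the source does not state the base.
    """
    n = len(records)
    # Frequency of each combination of conditioning values, e.g. (A_k, B_s).
    group_counts = Counter(tuple(r[f] for f in given) for r in records)
    # Frequency of each combination together with the target value C_i.
    joint_counts = Counter((tuple(r[f] for f in given), r[target]) for r in records)

    h = 0.0
    for (group, _value), count in joint_counts.items():
        p_group = group_counts[group] / n     # p(A_k, B_s)
        p_cond = count / group_counts[group]  # p_{A_k,B_s}(C_i)
        h -= p_group * p_cond * log2(p_cond)
    return h


# Hypothetical reference data (not the table from the figures).
G = [
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A1", "B": "B1", "C": "C2"},
    {"A": "A1", "B": "B2", "C": "C1"},
    {"A": "A2", "B": "B1", "C": "C2"},
    {"A": "A2", "B": "B2", "C": "C1"},
    {"A": "A2", "B": "B2", "C": "C1"},
]

h_ab_c = conditional_entropy(G, "C", ["A", "B"])
```

In this sample only the subset with $(A_1, B_1)$ contributes, because every other subset has a single value of field C. Passing an empty `given` list reduces the function to the plain entropy of one field.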
In the case of the reference data G in FIG. 5a, there are three fields, so field C is set as the third axis.
The method of calculating the conditional entropy when the reference data G has three fields has been described above with reference to FIGS. 5, 6, and 7. When the reference data G has four or more fields, the calculation of the conditional entropy and the setting of an axis are repeated in the same manner.
Now, the sum $H(A) + H_A(B) + H_{A,B}(C)$ of the three axis entropies $H(A)$, $H_A(B)$, and $H_{A,B}(C)$ calculated as described above for one set of reference data G is equal to
$$H = -\sum_k \sum_s \sum_i \left[ p(A_k, B_s, C_i) \times \log p(A_k, B_s, C_i) \right]$$
where $p(A_k, B_s, C_i)$ is the occurrence probability of the value combination $(A_k, B_s, C_i)$ in the reference data G.
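The equality between the sum of the three axis entropies and the entropy H of the whole table can be checked numerically. The sketch below does so on a small hypothetical table; the helper names, the tuple-based layout with columns (A, B, C), the sample rows, and the base-2 logarithm are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Entropy (in bits) of the empirical distribution of whole rows."""
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def conditional_entropy(rows, target, given):
    """Entropy of column `target` conditioned on the columns in `given`."""
    n = len(rows)
    groups = Counter(tuple(r[g] for g in given) for r in rows)
    joint = Counter((tuple(r[g] for g in given), r[target]) for r in rows)
    return -sum(
        (groups[key] / n) * (c / groups[key]) * log2(c / groups[key])
        for (key, _value), c in joint.items()
    )


# Hypothetical reference data; columns are (A, B, C).
G = [
    ("A1", "B1", "C1"),
    ("A1", "B1", "C2"),
    ("A1", "B2", "C1"),
    ("A2", "B1", "C2"),
    ("A2", "B2", "C1"),
    ("A2", "B2", "C1"),
]

# H(A) + H_A(B) + H_{A,B}(C)
axis_sum = (
    conditional_entropy(G, 0, [])
    + conditional_entropy(G, 1, [0])
    + conditional_entropy(G, 2, [0, 1])
)
# H computed directly from the value combinations (A_k, B_s, C_i)
joint_entropy = entropy(G)
```

The two quantities agree up to floating-point rounding, which is the chain rule for entropy applied to the three fields.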
Therefore, the three axis entropies $H(A)$, $H_A(B)$, and $H_{A,B}(C)$ can be regarded as a spectral decomposition of the entropy H of the reference data G along the axes.
In the determination result data creation process described above, fields are set as axes in descending order of the entropy or conditional entropy calculated for each set of reference data G, for the following reason.
The sum of the entropy and the conditional entropies is the same regardless of the order of the axes. In an actual system, however, noise resistance is higher when processing starts from the fields with the larger entropy or conditional entropy.
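The ordering rule described above, in which the field with the largest entropy becomes the first axis and each later axis is the remaining field with the largest conditional entropy given the axes already chosen, can be sketched as a greedy loop. The names `axis_order` and `conditional_entropy`, the sample data, the base-2 logarithm, and the tie-breaking behavior (the earliest field in the input list wins) are assumptions; the source does not say how ties are resolved.

```python
from collections import Counter
from math import log2


def conditional_entropy(records, target, given):
    """Entropy of `target` given the fields in `given` (plain entropy if empty)."""
    n = len(records)
    groups = Counter(tuple(r[f] for f in given) for r in records)
    joint = Counter((tuple(r[f] for f in given), r[target]) for r in records)
    h = 0.0
    for (key, _value), c in joint.items():
        h -= (groups[key] / n) * (c / groups[key]) * log2(c / groups[key])
    return h


def axis_order(records, fields):
    """Greedily choose axes by largest (conditional) entropy."""
    order = []
    remaining = list(fields)
    while remaining:
        # Entropy of each remaining field, conditioned on the axes chosen so far.
        best = max(remaining, key=lambda f: conditional_entropy(records, f, order))
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical data: X takes four values, Y varies within each X group, Z is constant.
records = [{"X": x, "Y": y, "Z": 0} for x in range(4) for y in range(2)]
order = axis_order(records, ["Z", "Y", "X"])
```

For this sample the entropies are 2, 1, and 0 bits for X, Y, and Z, so the axes come out as X, Y, Z regardless of the order in which the field names are supplied.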
The axis determination process performed by the axis determination unit 11 of the processor 1 has been described above.
Next, the data determination process performed by the determination processing unit 12 of the processor 1 will be described.
FIG. 8 shows the procedure of this data determination process.
As shown in the figure, in the data determination process the determination processing unit 12 calculates, for each axis defined in the axis definition information (steps 800, 810, and 814), the entropy of the determination target data T (step 802), and registers the calculated entropy of each axis in the entry for the determination target data T of the determination result data as the entropy $H_T(i)$ of that axis (step 804).
Here, the entropy of each axis in step 802 is calculated as follows. For the first axis, the entropy of the field set as the first axis is calculated in the same manner as the entropy calculation shown in FIG. 5. For the second and subsequent axes, taking the axis in question as the i-th axis, the conditional entropy of the field set as the i-th axis given the fields of the first through (i-1)-th axes is calculated, in the same manner as the conditional entropy shown in FIGS. 7 and 8.
That is, if the determination target data T is the table shown in FIG. 2b, and the axis definition information indicates field A as the first axis, field B as the second axis, and field C as the third axis, then for the first axis the entropy $H(A)$ is calculated in the same manner as in FIG. 5b and is taken as the entropy of the first axis of the determination target data T. For the second axis, the conditional entropy $H_A(B)$ is calculated in the same manner as in FIG. 6a and is taken as the entropy of the second axis of the determination target data T. For the third axis, $H_{A,B}(C)$ is calculated in the same manner as in FIG. 7 and is taken as the entropy of the third axis of the determination target data T.
In the data determination process, the determination processing unit 12 also calculates, for each axis defined in the axis definition information (steps 800, 810, and 814), the mutual information between the reference data G and the determination target data T (step 806), and registers the calculated mutual information of each axis in the mutual information I entry of the determination result data as the mutual information $I(i)$ of that axis (step 808).
Here, for the first axis, the mutual information between the reference data G and the determination target data T for the field set as the first axis is obtained as the mutual information $I(1)$ of the first axis. For the i-th axis, from the second axis onward, the conditional mutual information given the fields of the first through (i-1)-th axes is obtained as the mutual information $I(i)$ of the i-th axis. The method of calculating the mutual information $I(i)$ of each axis is described later.
When the entropy $H_T(i)$ of each axis of the determination target data T and the mutual information $I(i)$ of each axis have been calculated and registered in the determination result data as described above (step 810), the distance between the reference data G and the determination target data T is evaluated based on the mutual information $I(i)$ of each axis registered in the determination result data (step 812), and the data determination process ends.
Here, the distance between the reference data G and the determination target data T can be evaluated from the mutual information $I(i)$ of each axis by, for example, judging the distance to be smaller as the sum of the mutual information $I(i)$ over all or some of the axes is larger, or as a weighted sum of the mutual information $I(i)$ over all or some of the axes, using suitable weights, is larger. In step 812, in addition to the mutual information $I(i)$ of each axis, the entropy $H_G(i)$ calculated for each axis of the reference data G and the entropy $H_T(i)$ calculated for each axis of the determination target data T may also be taken into account in determining the relationship between the reference data G and the determination target data T.
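One possible reading of the evaluation in step 812 is a weighted sum of the per-axis mutual information, where a larger score is interpreted as a smaller distance. The sketch below is only an illustration under that reading: the function names, the default of equal weights, the use of a zero weight to exclude an axis, and the sample values are assumptions, and the source does not prescribe particular weights or a formula that converts the sum into a numeric distance.

```python
def closeness_score(mutual_info, weights=None):
    """Weighted sum of the per-axis mutual information I(1), I(2), ...

    A larger score is read as a smaller distance. Setting a weight to 0
    leaves the corresponding axis out of the evaluation.
    """
    if weights is None:
        weights = [1.0] * len(mutual_info)
    if len(weights) != len(mutual_info):
        raise ValueError("one weight is required per axis")
    return sum(w * i for w, i in zip(weights, mutual_info))


def closest(candidates, weights=None):
    """Return the name of the candidate whose score is largest."""
    return max(candidates, key=lambda name: closeness_score(candidates[name], weights))


# Hypothetical per-axis mutual information for two determination targets.
candidates = {
    "T1": [0.50, 0.20, 0.10],
    "T2": [0.40, 0.45, 0.05],
}
nearest_all_axes = closest(candidates)                    # all three axes, equal weights
nearest_first_axis = closest(candidates, [1.0, 0.0, 0.0])  # first axis only
```

With equal weights T2 scores higher, while restricting the evaluation to the first axis favors T1, which shows how the choice of axes and weights can change the outcome.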
The calculation of the mutual information $I(i)$ performed in step 806 is described in detail below, taking the reference data G and the determination target data T shown in FIG. 9a as an example.
The reference data G and the determination target data T shown in FIG. 9a are each a table containing six records that have the three fields A, B, and C. Only the two values $A_1$ and $A_2$ appear in field A, only the two values $B_1$ and $B_2$ appear in field B, and only the two values $C_1$ and $C_2$ appear in field C.
It is also assumed that field A is set as the first axis, field B as the second axis, and field C as the third axis.
In addition, a pair of records at the same position (ordinal number) in the reference data G and the determination target data T is defined as a record set RS, and the collection of record sets RS defined in this way, equal in number to the records of the reference data G and the determination target data T, is called the set of record sets. That is, the n-th record set RS_n consists of the n-th record of the reference data G and the n-th record of the determination target data T. Since the reference data G and the determination target data T each contain six records, the set of record sets contains six record sets, RS_1 through RS_6.
Since the first axis is field A, the mutual information $I(1)$ of the first axis is calculated by focusing on field A of the reference data G and field A of the determination target data T in the set of record sets, as shown in FIG. 9b.
Here, $A_i$ denotes the i-th of the values appearing in field A of the reference data G, and $A_j$ denotes the j-th of the values appearing in field A of the determination target data T. $p(A_i)$ denotes the probability of appearance, in the set of record sets, of a record set RS that includes a record of the reference data G whose field A value is $A_i$, and $p(A_j)$ denotes the probability of appearance, in the set of record sets, of a record set RS that includes a record of the determination target data T whose field A value is $A_j$.
$p(A_i, A_j)$ denotes the probability of appearance, in the set of record sets, of a record set RS consisting of a record of the reference data G whose field A value is $A_i$ and a record of the determination target data T whose field A value is $A_j$.
For the reference data G and the determination target data T shown in FIG. 9a, the probability of appearance $p(A_i, A_j)$ of a record set RS consisting of a record of the reference data G whose field A value is $A_i$ and a record of the determination target data T whose field A value is $A_j$ is obtained as shown in FIG. 9c.
That is, the set of record sets contains six record sets, and the only record set RS in which the field A value of the reference data G is $A_1$ and the field A value of the determination target data T is $A_2$ is the fourth record set RS_4, which consists of the fourth records of the reference data G and the determination target data T. Its count (frequency) is therefore 1, and $p(A_1, A_2) = 1/6 \approx 0.17$.
$p(A_i)$ and $p(A_j)$ are obtained as shown in FIGS. 9d1 and 9d2.
For example, for $i = 1$, the number (frequency) of record sets RS that include a record of the reference data G whose field A value is $A_1$ is 3, and the set of record sets contains six record sets, so $p(A_i) = 3/6 = 0.5$. Similarly, for $j = 2$, the number (frequency) of record sets RS that include a record of the determination target data T whose field A value is $A_2$ is 3, and the set of record sets contains six record sets, so $p(A_j) = 3/6 = 0.5$.
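The counting described above can be sketched as follows. The two field A columns are hypothetical: they are not copied from FIG. 9a, and were merely chosen so that they agree with the counts quoted in the text (six record sets, $A_1$ appearing three times in G, $A_2$ appearing three times in T, and the pair $(A_1, A_2)$ occurring only in the fourth record set). Variable names are likewise assumptions.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical field A columns of the reference data G and the target data T.
g_a = ["A1", "A1", "A2", "A1", "A2", "A2"]
t_a = ["A1", "A1", "A1", "A2", "A2", "A2"]

# The n-th record set pairs the n-th record of G with the n-th record of T.
record_sets = list(zip(g_a, t_a))
n = len(record_sets)

# p(A_i, A_j): relative frequency of each (G value, T value) pair.
p_joint = {pair: Fraction(c, n) for pair, c in Counter(record_sets).items()}
# p(A_i): relative frequency of each field A value in G.
p_g = {value: Fraction(c, n) for value, c in Counter(g_a).items()}
# p(A_j): relative frequency of each field A value in T.
p_t = {value: Fraction(c, n) for value, c in Counter(t_a).items()}
```

Exact fractions are used here so that the results can be compared directly with the ratios 1/6 and 3/6 given in the text; a floating-point version would work equally well for the later logarithms.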
Then, the mutual information $I(A)$ is obtained using $p(A_i, A_j)$, $p(A_i)$, and $p(A_j)$ obtained as shown in FIGS. 9c, 9d1, and 9d2, and the obtained $I(A)$ is taken as the mutual information $I(1)$ of the first axis.
That is, the mutual information $I(1)$ of the first axis is calculated by
$$I(1) = I(A) = -\sum_{i,j} \left[ p(A_i, A_j) \times \log \frac{p(A_i, A_j)}{p(A_i) \times p(A_j)} \right]$$
where $\sum_{i,j}$ denotes the summation over i and j.
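A minimal sketch of this calculation for one field is given below. Two points are assumptions rather than statements from the source: base-2 logarithms are used, and the function returns the mutual information in its standard non-negative form, without the leading minus sign that appears in the formula as printed; negating the result reproduces the printed sign. The function name and the sample columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(g_values, t_values):
    """Mutual information between paired columns of G and T (in bits).

    Element n of each list comes from the n-th record set. The value is
    returned in the standard non-negative form (no leading minus sign).
    """
    if len(g_values) != len(t_values):
        raise ValueError("G and T must contain the same number of records")
    n = len(g_values)
    joint = Counter(zip(g_values, t_values))  # counts for p(A_i, A_j)
    count_g = Counter(g_values)               # counts for p(A_i)
    count_t = Counter(t_values)               # counts for p(A_j)

    total = 0.0
    for (g, t), c in joint.items():
        p_joint = c / n
        total += p_joint * log2(p_joint / ((count_g[g] / n) * (count_t[t] / n)))
    return total


# Hypothetical columns: identical columns share all their information,
# while columns that vary independently share none.
same = mutual_information(["A1", "A2", "A1", "A2"], ["A1", "A2", "A1", "A2"])
unrelated = mutual_information(["A1", "A1", "A2", "A2"], ["A1", "A2", "A1", "A2"])
```

Only pairs that actually occur are visited, so terms with $p(A_i, A_j) = 0$ are skipped rather than evaluated.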
Next, since the first axis is field A and the second axis is field B, the mutual information $I(2)$ of the second axis is calculated by focusing on fields A and B of the reference data G and field B of the determination target data T in the set of record sets, as shown in FIG. 10a.
Here, $p(A_k)$ denotes the probability of appearance, in the set of record sets, of a record set RS that includes a record of the reference data G having the value $A_k$ in field A. The conditional occurrence probability $p_{A_k}(B_i)$ denotes the probability of occurrence, within the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G having the value $A_k$ in field A, of a record set RS that includes a record of the reference data G having $B_i$ as the value of field B. The conditional occurrence probability $p_{A_k}(B_j)$ denotes the probability of occurrence, within the same subset, of a record set RS that includes a record of the determination target data T having $B_j$ as the value of field B.
$p_{A_k}(B_i, B_j)$ denotes the probability of occurrence, within the same subset, of a record set RS that includes both a record of the reference data G having $B_i$ as the value of field B and a record of the determination target data T having $B_j$ as the value of field B.
For the reference data G and the determination target data T shown in FIG. 10a, $p(A_k)$ and $p_{A_k}(B_i, B_j)$ are obtained as shown in FIG. 10b.
For example, the set of record sets contains six record sets, and three of them include a record of the reference data G whose field A value is $A_1$, so $p(A_1) = 3/6 = 0.5$.
Also, for example, the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is $A_1$ contains three record sets, and the number (frequency) of record sets RS in this subset that include both a record of the reference data G having $B_1$ as the value of field B and a record of the determination target data T having $B_2$ as the value of field B is 1, so $p_{A_1}(B_1, B_2) = 1/3 \approx 0.33$.
$p_{A_k}(B_i)$ and $p_{A_k}(B_j)$ are obtained as shown in FIGS. 10c1 and 10c2.
That is, for $k = 1$, $i = 1$, the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is $A_1$ contains three record sets, and three record sets in this subset include a record of the reference data G whose field B value is $B_1$, so $p_{A_k}(B_i) = 3/3 = 1$.
Likewise, for $k = 1$, $j = 1$, the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is $A_1$ contains three record sets, and two record sets in this subset include a record of the determination target data T whose field B value is $B_1$, so $p_{A_k}(B_j) = 2/3 \approx 0.67$.
Then, using $p(A_k)$, $p_{A_k}(B_i, B_j)$, $p_{A_k}(B_i)$, and $p_{A_k}(B_j)$ obtained as shown in FIGS. 10b, 10c1, and 10c2, the conditional mutual information $I_A(B)$ of field B given field A, which is set as the first axis, is obtained, and the obtained $I_A(B)$ is taken as the mutual information $I(2)$ of the second axis.
That is, the mutual information $I(2)$ of the second axis is calculated by
$$I(2) = I_A(B) = -\sum_k \left[ p(A_k) \times \sum_{i,j} \left[ p_{A_k}(B_i, B_j) \times \log \frac{p_{A_k}(B_i, B_j)}{p_{A_k}(B_i) \times p_{A_k}(B_j)} \right] \right]$$
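The conditional form can be sketched by grouping the record sets on the conditioning values taken from the reference data G and averaging the mutual information computed inside each group. As before, this is an illustration under stated assumptions: the function name, the dict-based records, the sample tables, and the base-2 logarithm are hypothetical, and the result is returned in the standard non-negative form, without the leading minus sign of the formula as printed.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_mutual_information(g_records, t_records, field, given):
    """Mutual information of `field` between G and T, given G's `given` fields."""
    n = len(g_records)
    # Split the record sets into subsets by the conditioning values in G.
    groups = defaultdict(list)
    for g, t in zip(g_records, t_records):
        key = tuple(g[f] for f in given)          # e.g. (A_k,)
        groups[key].append((g[field], t[field]))  # (B_i, B_j)

    total = 0.0
    for pairs in groups.values():
        m = len(pairs)
        joint = Counter(pairs)
        count_g = Counter(g for g, _ in pairs)
        count_t = Counter(t for _, t in pairs)
        inner = sum(
            (c / m) * log2((c / m) / ((count_g[g] / m) * (count_t[t] / m)))
            for (g, t), c in joint.items()
        )
        total += (m / n) * inner                  # weight by p(A_k)
    return total


# Hypothetical tables (not the data of the figures).
G = [{"A": a, "B": b} for a, b in
     [("A1", "B1"), ("A1", "B1"), ("A1", "B1"), ("A2", "B1"), ("A2", "B2"), ("A2", "B2")]]
T = [{"A": a, "B": b} for a, b in
     [("A1", "B1"), ("A1", "B1"), ("A2", "B2"), ("A1", "B1"), ("A2", "B2"), ("A2", "B2")]]

i2 = conditional_mutual_information(G, T, "B", ["A"])
```

In this sample the $A_1$ subset contributes nothing, because field B of G is constant there, and the $A_2$ subset contributes the full entropy of a 1/3 to 2/3 split, weighted by $p(A_2) = 0.5$. Passing `["A", "B"]` as `given` and `"C"` as `field` gives the corresponding quantity for a third axis, and an empty `given` list reduces to the unconditional case.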
Next, since the first axis is field A, the second axis is field B, and the third axis is field C, the mutual information $I(3)$ of the third axis is calculated by focusing on fields A, B, and C of the reference data G and field C of the determination target data T in the set of record sets, as shown in FIG. 11a.
Here, $p(A_k, B_s)$ denotes the probability of appearance, in the set of record sets, of a record set RS that includes a record of the reference data G having the value $A_k$ in field A and the value $B_s$ in field B. The conditional occurrence probability $p_{A_k,B_s}(C_i)$ denotes the probability of occurrence, within the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G having the value $A_k$ in field A and the value $B_s$ in field B, of a record set RS that includes a record of the reference data G having $C_i$ as the value of field C. The conditional occurrence probability $p_{A_k,B_s}(C_j)$ denotes the probability of occurrence, within the same subset, of a record set RS that includes a record of the determination target data T having $C_j$ as the value of field C.
$p_{A_k,B_s}(C_i, C_j)$ denotes the probability of occurrence, within the same subset, of a record set RS that includes both a record of the reference data G having $C_i$ as the value of field C and a record of the determination target data T having $C_j$ as the value of field C.
For the reference data G and the determination target data T shown in FIG. 11a, $p(A_k, B_s)$ and $p_{A_k,B_s}(C_i, C_j)$ are obtained as shown in FIG. 11b.
For example, the set of record sets contains six record sets, and one of them includes a record of the reference data G whose field A value is $A_2$ and whose field B value is $B_1$, so $p(A_2, B_1) = 1/6 \approx 0.17$.
Also, for example, the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is $A_1$ and whose field B value is $B_1$ contains three record sets, and the number (frequency) of record sets RS in this subset that include both a record of the reference data G having $C_1$ as the value of field C and a record of the determination target data T having $C_1$ as the value of field C is 2, so $p_{A_1,B_1}(C_1, C_1) = 2/3 \approx 0.67$.
$p_{A_k,B_s}(C_i)$ and $p_{A_k,B_s}(C_j)$ are obtained as shown in FIGS. 12a1 and 12a2.
For example, for $k = 1$, $s = 1$, $i = 1$, the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is $A_1$ and whose field B value is $B_1$ contains three record sets, and two record sets in this subset include a record of the reference data G whose field C value is $C_1$, so $p_{A_k,B_s}(C_i) = 2/3 \approx 0.67$.
Likewise, for $p_{A_k,B_s}(C_j)$ with $k = 1$, $s = 1$, $i = 2$, the subset of the record-set collection made up of the record sets RS that contain a record of the reference data G whose field A value is $A_1$ and whose field B value is $B_1$ contains 3 record sets. Of these, 1 record set contains a record of the judged data T whose field C value is $C_1$, so $p_{A_k,B_s}(C_j) = 1/3 \approx 0.33$ for $k = 1$, $s = 1$, $i = 2$.
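The arithmetic in the examples above can be checked with a short sketch. It assumes only what the text states, namely that each probability is the number of matching record sets divided by the size of the collection or subset. The helper name `frequency_ratio` and the variable names are hypothetical, and the counts are the ones quoted in the text, not data read from the figures.

```python
from fractions import Fraction


def frequency_ratio(count: int, total: int) -> Fraction:
    """Probability estimated as (matching record sets) / (record sets considered)."""
    return Fraction(count, total)


# Counts quoted in the text (hypothetical variable names).
p_a2_b1 = frequency_ratio(1, 6)   # p(A2, B1): 1 of 6 record sets
p_joint = frequency_ratio(2, 3)   # p_{A1,B1}(C1, C1): 2 of the 3 record sets in the subset
p_ref = frequency_ratio(2, 3)     # p_{Ak,Bs}(Ci): reference data G side
p_judged = frequency_ratio(1, 3)  # p_{Ak,Bs}(Cj): judged data T side

for name, p in [("p(A2,B1)", p_a2_b1), ("joint", p_joint),
                ("reference", p_ref), ("judged", p_judged)]:
    print(f"{name} = {p} (approx. {float(p):.2f})")
```

Using exact fractions keeps the counts visible and defers rounding to the point of display.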
Then, using $p(A_k, B_s)$, $p_{A_k,B_s}(C_i, C_j)$, $p_{A_k,B_s}(C_i)$, and $p_{A_k,B_s}(C_j)$ obtained as in FIGS. 11b, 12a1, and 12a2, the conditional mutual information $I_{A_k,B_s}(c)$ of field C given fields A and B, which serve as the first and second axes, is calculated, and the result is taken as the mutual information $I(3)$ of the third axis.
That is, the mutual information $I(3)$ of the third axis is calculated by

$$
I(3) = I_{A,B}(c) = -\sum_{k}\sum_{s}\left[\, p(A_k, B_s) \sum_{i,j}\left[\, p_{A_k,B_s}(C_i, C_j)\, \log \frac{p_{A_k,B_s}(C_i, C_j)}{p_{A_k,B_s}(C_i)\, p_{A_k,B_s}(C_j)} \right]\right].
$$
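As a rough illustration of the third-axis formula, the sketch below evaluates it from probability tables that have already been estimated. The function name, the dictionary layout, and the `sign` parameter are assumptions made for this example. The default `sign=-1.0` reproduces the leading minus sign as it is written in the formula; the textbook definition of conditional mutual information is the same double sum without that sign, which corresponds to `sign=1.0`. Terms whose joint probability is zero are skipped, which is also an assumption (the usual convention that 0·log 0 contributes nothing).

```python
import math


def conditional_mutual_information(p_cond, p_joint, p_ref, p_judged, sign=-1.0):
    """Evaluate the third-axis formula from precomputed tables (hypothetical layout).

    p_cond[(a, b)]             -> p(A_k, B_s)
    p_joint[(a, b)][(ci, cj)]  -> p_{Ak,Bs}(C_i, C_j)
    p_ref[(a, b)][ci]          -> p_{Ak,Bs}(C_i)   (reference data G)
    p_judged[(a, b)][cj]       -> p_{Ak,Bs}(C_j)   (judged data T)
    """
    total = 0.0
    for key, weight in p_cond.items():
        inner = 0.0
        for (ci, cj), pij in p_joint[key].items():
            denom = p_ref[key][ci] * p_judged[key][cj]
            if pij > 0 and denom > 0:  # assumption: zero terms contribute nothing
                inner += pij * math.log(pij / denom)
        total += weight * inner
    return sign * total


# Toy tables: one (A, B) combination, field C perfectly matched between G and T.
key = ("A1", "B1")
value = conditional_mutual_information(
    p_cond={key: 1.0},
    p_joint={key: {("C1", "C1"): 0.5, ("C2", "C2"): 0.5}},
    p_ref={key: {"C1": 0.5, "C2": 0.5}},
    p_judged={key: {"C1": 0.5, "C2": 0.5}},
)
print(value)  # magnitude is log 2; negative because of the sign as written
```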
The details of calculating the mutual information $I(i)$ have been described above.
Although the calculation of the mutual information $I(i)$ has been illustrated for the case of three fields, the same approach applies when there are four or more fields: for each axis $n$ from the fourth axis onward, the mutual information $I(n)$ can be calculated as the conditional mutual information given the fields assigned to axes 1 through $n-1$.
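The extension to any number of axes can be sketched as a loop over the axis order. This is a simplified illustration under stated assumptions, not the embodiment itself: each record set is modelled as exactly one pair (record of reference data G, record of judged data T), probabilities are frequency ratios, and the condition for axis n is taken from the reference record's values on axes 1 through n-1, following the wording of the examples. The function name `axis_evaluation_values` and the `sign` parameter are hypothetical; `sign=-1.0` mirrors the leading minus sign of the third-axis formula and is assumed to apply to every axis, while `sign=1.0` gives the textbook non-negative quantity.

```python
import math
from collections import Counter, defaultdict


def axis_evaluation_values(pairs, axis_order, sign=-1.0):
    """Return [I(1), ..., I(n)] for record sets given as (g_record, t_record) pairs.

    Each record is a tuple of field values; axis_order lists field indices by axis.
    """
    n_sets = len(pairs)
    values = []
    for depth, field in enumerate(axis_order):
        higher = axis_order[:depth]  # fields on axes 1 .. n-1
        groups = defaultdict(list)
        for g, t in pairs:
            # Group record sets by the reference record's values on the higher axes.
            groups[tuple(g[f] for f in higher)].append((g[field], t[field]))
        total = 0.0
        for members in groups.values():
            size = len(members)
            joint = Counter(members)
            ref = Counter(ci for ci, _ in members)
            judged = Counter(cj for _, cj in members)
            inner = sum(
                (c / size) * math.log((c / size) / ((ref[ci] / size) * (judged[cj] / size)))
                for (ci, cj), c in joint.items()
            )
            total += (size / n_sets) * inner  # weight by p(condition)
        values.append(sign * total)
    return values


# Toy data: two fields, G and T records identical in every record set.
pairs = [((0, 0), (0, 0)), ((0, 1), (0, 1)), ((1, 0), (1, 0)), ((1, 1), (1, 1))]
print(axis_evaluation_values(pairs, [0, 1], sign=1.0))
```

For the first axis there are no higher axes, so all record sets fall into a single group and the value reduces to the unconditional mutual information of that field.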
An embodiment of the present invention has been described above.
In the above embodiment, when the denominator in a probability calculation is 0, so that the value would diverge, the probability is taken to be 1 rather than infinity.
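The zero-denominator rule can be written as a small guard. This is a minimal sketch; the function name `safe_probability` is hypothetical, and it assumes probabilities are computed as a count divided by a total as in the worked examples.

```python
def safe_probability(count: int, total: int) -> float:
    """Frequency ratio with the rule from the text: a zero denominator yields 1."""
    if total == 0:
        return 1.0  # treated as probability 1 instead of diverging
    return count / total


print(safe_probability(2, 3))
print(safe_probability(0, 0))
```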
As described above, according to the present embodiment, the mutual information and conditional mutual information of each field of the reference data G and the judged data T are obtained, and the distance between the reference data G and the judged data T is calculated from them. Because the mutual information and conditional mutual information of each field reflect characteristics of that field across records, the distance between the reference data G and the judged data T can be calculated according to the characteristics of the data as a whole. In addition, because the mutual information and conditional mutual information of the fields are independent of one another, the distance between the reference data G and the judged data T can easily be evaluated from various viewpoints. Furthermore, the mutual information between the two tables equals the sum of the mutual information and conditional mutual information of the individual items, so unless the per-item values are weighted, processing is based on the overall mutual information with neither excess nor deficiency.
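The additive property described above can be sketched as follows. The function `combine_evaluation_values` and its `weights` argument are hypothetical; the sketch only shows how per-field evaluation values could be summed, with or without weights, and does not model how the resulting total is mapped to a distance, which this passage does not specify. The numeric values are made up for illustration.

```python
def combine_evaluation_values(values, weights=None):
    """Sum per-axis evaluation values; unit weights give the overall total."""
    if weights is None:
        weights = [1.0] * len(values)  # unweighted: overall mutual information
    if len(weights) != len(values):
        raise ValueError("one weight per evaluation value is required")
    return sum(w * v for w, v in zip(weights, values))


evaluation_values = [0.5, 0.25, 0.125]  # illustrative I(1), I(2), I(3)
overall = combine_evaluation_values(evaluation_values)
first_and_third = combine_evaluation_values(evaluation_values, [1.0, 0.0, 1.0])
print(overall, first_and_third)
```

A weight of 0 drops a field from the evaluation, which is one way to look at the two data sets from a particular viewpoint.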
In the above embodiment, the axes are set one after another by assigning to each axis the field whose entropy or conditional entropy, calculated for the reference data G, is the largest. The axes may, however, be set according to other criteria. For example, the fields may simply be assigned to the axes in field order.
The above embodiment can also be carried out with fields and records transposed.
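The entropy-based axis setting mentioned above can be sketched as a greedy loop. This is a minimal illustration under assumptions: records of the reference data G are tuples of field values, entropies are estimated from value frequencies, and the conditional entropy of a candidate field given the fields already chosen is computed as H(chosen, candidate) - H(chosen). The function names and the toy records are hypothetical, and ties are broken by field order, which the text does not specify.

```python
import math
from collections import Counter


def joint_entropy(records, fields):
    """Entropy of the joint value of the given fields, estimated from frequencies."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    n = len(records)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def order_axes(records):
    """Greedy ordering: largest entropy first, then largest conditional entropy."""
    remaining = list(range(len(records[0])))
    ordered = []
    while remaining:
        base = joint_entropy(records, ordered)  # H(already chosen fields)
        best = max(
            remaining,
            key=lambda f: joint_entropy(records, ordered + [f]) - base,
        )
        ordered.append(best)
        remaining.remove(best)
    return ordered


# Toy reference data G with three fields (illustrative values only).
reference = [
    ("x", "a", "p"), ("x", "a", "q"),
    ("y", "b", "p"), ("y", "b", "q"),
    ("x", "c", "p"), ("y", "d", "p"),
]
print(order_axes(reference))
```

In the toy data the second field has the most distinct values and is chosen first; the first field is fully determined by it, so it is placed last.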
1 ... processor, 2 ... storage, 11 ... axis determination unit, 12 ... determination processing unit.
Claims (4)
- An information processing system for calculating a distance between a first set of records, each composed of a plurality of fields, and a second set of the records, the system comprising:
a rank setting unit that sets a rank for each field of the records;
an evaluation value calculation unit that, according to the ranks set by the rank setting unit, calculates the mutual information between the first set and the second set for the first-ranked field as the evaluation value of the first-ranked field, and calculates, for each field of each rank other than the first, the conditional mutual information between the first set and the second set for that field given the fields ranked higher than that field, as the evaluation value of the field of that rank; and
a distance calculation unit that calculates the distance between the first set and the second set using at least some of the evaluation values calculated for the fields by the evaluation value calculation unit.
- The information processing system according to claim 1, wherein
the rank setting unit sets the rank of each field by assigning the first rank to the field, among the fields of the records, whose entropy in the first set is the largest, and thereafter repeating, until all fields have been ranked, a process of assigning the rank following the most recently assigned rank to the field whose conditional entropy in the first set, given the fields already ranked, is the largest.
- A computer program that is read and executed by a computer, the computer program causing the computer to function as:
a storage unit that stores a first set of records, each composed of a plurality of fields, and a second set of the records;
a rank setting unit that sets a rank for each field of the records;
an evaluation value calculation unit that, according to the ranks set by the rank setting unit, calculates the mutual information between the first set and the second set for the first-ranked field as the evaluation value of the first-ranked field, and calculates, for each field of each rank other than the first, the conditional mutual information between the first set and the second set for that field given the fields ranked higher than that field, as the evaluation value of the field of that rank; and
a distance calculation unit that calculates the distance between the first set and the second set using at least some of the evaluation values calculated for the fields by the evaluation value calculation unit.
- The computer program according to claim 3, wherein
the rank setting unit sets the rank of each field by assigning the first rank to the field, among the fields of the records, whose entropy in the first set is the largest, and thereafter repeating, until all fields have been ranked, a process of assigning the rank following the most recently assigned rank to the field whose conditional entropy in the first set, given the fields already ranked, is the largest.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2017545033A JPWO2017064769A1 (en) | 2015-10-14 | 2015-10-14 | Information processing system and computer program |
PCT/JP2015/079048 WO2017064769A1 (en) | 2015-10-14 | 2015-10-14 | Information processing system and computer program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2015/079048 WO2017064769A1 (en) | 2015-10-14 | 2015-10-14 | Information processing system and computer program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017064769A1 true WO2017064769A1 (en) | 2017-04-20 |
Family
ID=58517445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2015/079048 WO2017064769A1 (en) | 2015-10-14 | 2015-10-14 | Information processing system and computer program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2017064769A1 (en) |
WO (1) | WO2017064769A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004127196A (en) * | 2002-10-07 | 2004-04-22 | Fuji Research Institute Corp | Community formation support system, its terminal, server, and program |
JP2007188343A (en) * | 2006-01-13 | 2007-07-26 | Mitsubishi Electric Corp | Schema integration support device, schema integration support method, and schema integration support program |
JP2008181459A (en) * | 2007-01-26 | 2008-08-07 | Mitsubishi Electric Corp | Table classifying device |
JP2010039593A (en) * | 2008-08-01 | 2010-02-18 | Mitsubishi Electric Corp | Table classification device, table classification method, and table classification program |
-
2015
- 2015-10-14 WO PCT/JP2015/079048 patent/WO2017064769A1/en active Application Filing
- 2015-10-14 JP JP2017545033A patent/JPWO2017064769A1/en active Pending
Non-Patent Citations (1)
Title |
---|
RIKA KASHIMA: "DB System Saikochiku ni Okeru Schema Matching Tekiyo", DAI 77 KAI (HEISEI 27 NEN) ZENKOKU TAIKAI KOEN RONBUNSHU (1) COMPUTE SYSTEM SOFTWARE KAGAKU KOGAKU DATA TO WEB, 17 March 2015 (2015-03-17), pages 1-501 - 1-502 * |
Also Published As
Publication number | Publication date |
---|---|
JPWO2017064769A1 (en) | 2018-08-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15906233 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2017545033 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15906233 Country of ref document: EP Kind code of ref document: A1 |