WO2017064769A1

WO2017064769A1 - Information processing system and computer program

Info

Publication number: WO2017064769A1
Application number: PCT/JP2015/079048
Authority: WO
Inventors: 古庄　晋二
Original assignee: 株式会社ターボデータラボラトリー
Priority date: 2015-10-14
Filing date: 2015-10-14
Publication date: 2017-04-20
Also published as: JPWO2017064769A1

Abstract

The purpose of the present invention is to calculate a distance between tables of records, each record consisting of a plurality of fields. In the information processing system according to the present invention, an axis determination unit 11 determines the order of fields of reference data G constituting a table. Following this order of fields, a data determination processing unit 12 first calculates the mutual information between the first field of the reference data G and the first field of target data T, and then calculates conditional mutual information between each subsequent field of the reference data G and the corresponding field of the target data T on the basis of fields preceding the subsequent field. The data determination processing unit then calculates a distance between the reference data G and the target data T on the basis of the mutual information, or the conditional mutual information, calculated for each field.

Description

Information processing system and computer program

The present invention relates to a technique for calculating a distance between data.

One known technique for calculating the distance between data maintains a correspondence between case data and classifications, and classifies data on the basis of the distance between the data to be classified and the case data together with that correspondence (for example, Patent Document 1). In this technique, the distance between data composed of a plurality of items and the case data is obtained as a weighted sum of the differences between the values of the respective items.

Patent Document 1: JP 2002-149697 A

When this technique for calculating the distance between data is applied to a table, that is, a set of records each composed of a plurality of fields, it amounts to a technique for calculating the distance between individual records. Therefore, even if the technique is applied as it is to the calculation of distances between records, it cannot calculate a distance between tables that reflects the characteristics of an entire table appearing across a plurality of records.

Therefore, an object of the present invention is to provide an information processing system capable of calculating a distance between sets according to characteristics of the set of records including a plurality of fields as a whole set.

In order to achieve this object, the present invention provides an information processing system for calculating a distance between a first set and a second set of records each composed of a plurality of fields, the system including: a rank setting unit that sets a rank for each field of the records; an evaluation value calculation unit that, in accordance with the ranks set by the rank setting unit, calculates, for the field of the first rank, the mutual information of that field between the first set and the second set as the evaluation value of that field, and calculates, for each field of a rank other than the first rank, the conditional mutual information of that field between the first set and the second set, conditioned on the fields ranked higher than that field, as the evaluation value of that field; and a distance calculation unit that calculates the distance between the first set and the second set using at least a part of the evaluation values calculated for the respective fields by the evaluation value calculation unit.

Here, in such an information processing system, the rank setting unit may set the ranks of the fields as follows. It first assigns the first rank to the field having the maximum entropy in the first set. Thereafter, until ranks have been set for all fields, it repeats the following step: among the fields whose rank has not yet been set, the field having the maximum conditional entropy in the first set, conditioned on the fields whose ranks have already been set, is assigned the rank following the most recently set rank.

According to the information processing system described above, the mutual information or conditional mutual information of each field between the first set and the second set is obtained as an independent evaluation value for that field, and the distance between the first set and the second set is calculated using the obtained evaluation values. Because the mutual information and conditional mutual information of each field reflect characteristics of that field across records, the information processing system can calculate a distance between sets of records, each composed of a plurality of fields, according to the characteristics of each set as a whole. In addition, the evaluation values of the fields, obtained as mutual information or conditional mutual information, are independent of one another, so the distance between the first set and the second set can easily be evaluated from various viewpoints. Furthermore, the mutual information between the two tables is equal to the sum of the mutual information and the conditional mutual information of the individual fields, so processing based on the mutual information of the tables as a whole is also possible.

As described above, according to the present invention, it is possible to provide an information processing system capable of calculating the distance between sets according to the characteristics of the set of records including a plurality of fields as the whole set.

FIG. 1 is a block diagram showing the configuration of an information processing system according to an embodiment of the present invention.
FIG. 2 is a diagram showing reference data and determination target data according to the embodiment of the present invention.
FIG. 3 is a diagram showing axis information and determination result data according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the axis determination process according to the embodiment of the present invention.
FIG. 5 is a diagram showing a processing example of the axis determination process according to the embodiment of the present invention.
FIG. 6 is a diagram showing a processing example of the axis determination process according to the embodiment of the present invention.
FIG. 7 is a diagram showing a processing example of the axis determination process according to the embodiment of the present invention.
FIG. 8 is a flowchart showing the data determination process according to the embodiment of the present invention.
FIG. 9 is a diagram showing a processing example of the data determination process according to the embodiment of the present invention.
FIG. 10 is a diagram showing a processing example of the data determination process according to the embodiment of the present invention.
FIG. 11 is a diagram showing a processing example of the data determination process according to the embodiment of the present invention.
FIG. 12 is a diagram showing a processing example of the data determination process according to the embodiment of the present invention.

Hereinafter, embodiments of an information processing system according to the present invention will be described.
FIG. 1 shows a configuration of an information processing system according to the present embodiment.
As illustrated, the information processing system includes a processor 1 and a storage 2.
Further, the processor 1 includes an axis determination unit 11 and a determination processing unit 12.
Here, the processor 1 is, for example, a general-purpose computer including a CPU, a memory, and various peripheral devices such as a display device and an input device. The axis determination unit 11 and the determination processing unit 12 are implemented as functions of the computer, realized by the computer executing a predetermined program.

Next, the storage 2 stores reference data, axis information, determination result data, and determination target data.
Here, the storage 2 may be an external storage connected to the processor 1, or a network storage or a data server that the processor 1 accesses via a network.

Next, as shown in FIG. 2a, the reference data is a table that is a set of records (each row from 1 to 6 in the figure) composed of a plurality of fields (each column of A, B, and C in the figure).
Further, as shown in FIG. 2b, the determination target data is also a table that is a set of records (each row in the figure) composed of a plurality of fields. The field configuration of the determination target data is the same as that of the reference data, and the number of records of the determination target data is equal to the number of records of the reference data.
Further, records at the same position (rank) in the reference data and in the determination target data store information on the same object in their fields. That is, the n-th record of the reference data and the n-th record of the determination target data both store information about the n-th object.

For example, the m-th attribute value of the n-th object is stored in the m-th field of the n-th record of both the reference data and the determination target data.
More specifically, for example, given N sensors that each detect M states of a different object, the m-th field of the n-th record of the reference data stores the target value of the m-th state detected by the n-th sensor, and the m-th field of the n-th record of the determination target data stores the detection value of the m-th state detected by the n-th sensor.

In the following description, "G" denotes the reference data and "T" denotes the determination target data (the data to be determined).
Next, as shown in FIG. 3A, the axis information includes axis number information and axis definition information.
Here, the number of fields of the reference data G is registered in the axis number information.
In the axis definition information, a field identifier indicating a field to be used as each axis from the first axis to the nth axis, where n is the number of fields of the reference data G, is registered.
Further, as shown in FIG. 3b, the determination result data includes an entry for the reference data G, an entry for the determination target data T, and an entry for the mutual information I. The entry for the reference data G holds the entropy HG() of each axis, from the first axis to the n-th axis, of the reference data G. The entry for the determination target data T holds the entropy HT() of each axis, from the first axis to the n-th axis, of the determination target data T. The entry for the mutual information I holds the mutual information I() between the reference data G and the determination target data T for each axis from the first axis to the n-th axis.
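The two structures described above can be pictured as plain records. The following is a minimal sketch in Python; the class names `AxisInfo` and `DeterminationResult` and their attribute names are hypothetical and chosen only for illustration, since the embodiment does not prescribe a storage layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AxisInfo:
    """Axis information: axis number information plus axis definition information."""
    axis_count: int = 0                                   # number of fields of G
    axis_fields: List[str] = field(default_factory=list)  # field identifier of axis 1..n


@dataclass
class DeterminationResult:
    """Determination result data: one value per axis in each entry."""
    hg: List[float] = field(default_factory=list)  # HG(i): entropy of axis i of G
    ht: List[float] = field(default_factory=list)  # HT(i): entropy of axis i of T
    mi: List[float] = field(default_factory=list)  # I(i): mutual information of axis i


# Hypothetical example: three fields A, B, C used as axes 1 to 3.
axis_info = AxisInfo(axis_count=3, axis_fields=["A", "B", "C"])
result = DeterminationResult()
```

In this sketch the lists are indexed from 0, so axis i corresponds to list index i - 1.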

Returning to FIG. 1, the axis determination unit 11 of the processor 1 performs an axis determination process in which it creates the axis information using the reference data G, calculates the entropy HG of each axis from the first axis to the n-th axis of the reference data G, and registers the result in the determination result data.

Further, the determination processing unit 12 of the processor 1 performs a data determination process. Using the determination target data T, the reference data G, and the axis information, it calculates the entropy HT of each axis from the first axis to the n-th axis of the determination target data T, and the mutual information I() between the reference data G and the determination target data T for each axis from the first axis to the n-th axis, and registers them in the determination result data. It then evaluates the distance between the reference data G and the determination target data T on the basis of the calculated mutual information I().

Hereinafter, the axis determination process performed by the axis determination unit 11 of the processor 1 will be described.
FIG. 4 shows the procedure of this axis determination process.
As shown in the drawing, the axis determination unit 11 first registers the number of fields of the reference data G in the axis number information (step 400).
Then, the entropy H(X) of each field of the reference data G is calculated (step 402). Here, H(X) denotes the entropy of field X. With X_i denoting the i-th of the values appearing in field X of the reference data G, p(X_i) the occurrence probability of X_i in the reference data G, Log the logarithm to base 2, and Σ_i the sum over i, H(X) is given by
$$H(X) = -\sum_i \left[\, p(X_i) \times \mathrm{Log}\{p(X_i)\} \,\right]$$
Here, Log 0 is treated as 0. The base of Log may be a number other than 2.
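The entropy of step 402 can be sketched as follows. This is an illustrative Python sketch, not the implementation of the embodiment; the function name `entropy` and the sample column are assumptions. The column is hypothetical and simply contains two values, each occurring in three of six records.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy, in bits, of the values appearing in one field."""
    total = len(values)
    counts = Counter(values)
    # Only values that actually occur are counted, so p is never 0 here.
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical field with two values, each appearing in 3 of 6 records.
field_a = ["A1", "A1", "A1", "A2", "A2", "A2"]
h_a = entropy(field_a)
```

Because the sum runs only over values that occur, the convention for Log 0 never comes into play in this sketch.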

Here, a specific calculation example of the entropy will be described later.
Then, the field having the maximum entropy calculated for the reference data G is set as the first axis, and the field identifier of the field is registered as the field identifier of the field used as the first axis in the axis definition information (step 404).

Then, the entropy calculated in step 402 for the field set as the first axis is registered as the first axis entropy HG(1) of the reference data G in the entry for the reference data G of the determination result data (step 406).

Next, assuming that the number of fields of the reference data G is n, the following processing is sequentially performed for each value of i from 2 to n (steps 408, 416, and 418).
That is, first, for the reference data G, the conditional entropy H_{F1, F2, ..., Fi-1}(X) of each field X, other than the fields set as the first axis to the (i-1)-th axis, is calculated under the fields set as the first axis to the (i-1)-th axis (step 410). Here, Fj denotes the field set as the j-th axis (where j < i), and H_{F1, F2, ..., Fi-1}(X) denotes the conditional entropy of field X under the fields F1, F2, ..., Fi-1.

The conditional entropy H_{F1, F2, ..., Fi-1}(X) is defined as follows. Let Fs_a denote the a-th of the values appearing in the field Fs set as the s-th axis of the reference data G, let p(F1_a, F2_b, ..., Fi-1_y) denote the occurrence probability in the reference data G of the value combination (F1_a, F2_b, ..., Fi-1_y), and let p_{F1a, F2b, ..., Fi-1y}(X_i) denote the occurrence probability of X_i within the subset of the reference data G consisting of the records having the value combination (F1_a, F2_b, ..., Fi-1_y). Then the conditional entropy is given by
$$H_{F1, F2, \ldots, Fi-1}(X) = -\sum_a \sum_b \cdots \sum_y \Big[\, p(F1_a, F2_b, \ldots, Fi\text{-}1_y) \times \sum_i \big[\, p_{F1a, F2b, \ldots, Fi-1y}(X_i) \times \mathrm{Log}\{p_{F1a, F2b, \ldots, Fi-1y}(X_i)\} \,\big] \Big]$$
Note that the summation index i of X_i in this formula is distinct from the axis number i.
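The conditional entropy of step 410 can be sketched by grouping records on the already ranked fields and weighting the entropy within each group by the occurrence probability of the group. The following Python sketch is illustrative only; the function names, the representation of records as dictionaries, and the four-record table are assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Entropy, in bits, of a list of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(records, target, given):
    """Entropy of field `target` under the fields listed in `given`.

    records: list of dicts mapping a field identifier to a value.
    """
    groups = defaultdict(list)
    for r in records:
        key = tuple(r[f] for f in given)  # value combination (F1_a, F2_b, ...)
        groups[key].append(r[target])
    n = len(records)
    # Weight the entropy inside each subset by the subset's occurrence probability.
    return sum((len(vals) / n) * entropy(vals) for vals in groups.values())


# Hypothetical table in which field B is independent of field A.
table = [
    {"A": "A1", "B": "B1"},
    {"A": "A1", "B": "B2"},
    {"A": "A2", "B": "B1"},
    {"A": "A2", "B": "B2"},
]
h_b_under_a = conditional_entropy(table, "B", ["A"])
```

With an empty `given` list all records fall into a single group, so the function reduces to the plain entropy of the field.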

Here, a specific calculation example of the conditional entropy will be described later.
Then, the field with the maximum conditional entropy calculated is set to the i-th axis, and the field identifier of that field is registered in the axis definition information as the field identifier of the field used as the i-th axis (step 412).

Further, the conditional entropy calculated in step 410 for the field set as the i-th axis is registered as the i-th axis entropy HG(i) of the reference data G in the entry for the reference data G of the determination result data (step 414).

If the registration from the first axis to the nth axis in the axis definition information and the determination result data is completed as described above, the axis determination process ends.
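Steps 400 to 418 amount to a greedy ordering of the fields. The following Python sketch illustrates the loop under stated assumptions: the function names, the dictionary representation of records, and the six-record table are hypothetical (the table is not the one in FIG. 5a), and ties are broken by field order, which the embodiment does not specify.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(records, target, given):
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[f] for f in given)].append(r[target])
    n = len(records)
    return sum((len(v) / n) * entropy(v) for v in groups.values())


def determine_axes(records, fields):
    """Return (axis order, HG per axis) by repeatedly taking the field with
    the largest entropy conditioned on the axes chosen so far."""
    remaining = list(fields)
    axes, hg = [], []
    while remaining:
        # With no axes chosen yet this is the plain entropy (step 402).
        scores = {f: conditional_entropy(records, f, axes) for f in remaining}
        best = max(remaining, key=scores.get)  # assumption: first field wins ties
        axes.append(best)
        hg.append(scores[best])
        remaining.remove(best)
    return axes, hg


# Hypothetical reference data G with fields A, B, C.
G = [
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A1", "B": "B2", "C": "C1"},
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A2", "B": "B1", "C": "C1"},
    {"A": "A2", "B": "B2", "C": "C2"},
    {"A": "A2", "B": "B1", "C": "C1"},
]
axes, hg = determine_axes(G, ["A", "B", "C"])
```

For this hypothetical table the fields are ranked A, B, C, and the last axis has zero conditional entropy because field C is fully determined by fields A and B.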
Hereinafter, the details of the entropy calculation performed in step 402 and the conditional entropy calculation performed in step 410 will be described using the reference data G shown in FIG. 5A as an example.
The reference data G shown in FIG. 5a is a table of six records, each having three fields A, B, and C. Only two values, A_1 and A_2, appear in field A; only two values, B_1 and B_2, appear in field B; and only two values, C_1 and C_2, appear in field C.

The entropy H(X) of each field of the reference data G in step 402 is calculated by
$$H(X) = -\sum_i \left[\, p(X_i) \times \mathrm{Log}\{p(X_i)\} \,\right]$$

Here, X_i denotes the i-th of the values appearing in field X of the reference data G; p(X_i) is the occurrence probability in the reference data G of records having X_i as the value of field X (the number of records of the reference data G having X_i as the value of field X, divided by the total number of records of the reference data G); Log denotes the logarithm to base 2; and Σ_i denotes the sum over i.

That is, the frequency and occurrence probability p (A _i ) of each value A _i (A ₁ and A ₂ ) of the field A of the reference data G in FIG. 5a are obtained as shown in the table of FIG. 5b.
For example, since the number of records in which the value of field A in the reference data G is A_2 is 3, the frequency of A_2 is 3. Since the number of records of the reference data G is 6, the occurrence probability of A_2 is p(A_2) = 3/6 = 0.5.

Then, from the occurrence probability p(A_i) of each value A_i of field A obtained as shown in the table of FIG. 5b, the entropy H(A) of field A is calculated as
$$H(A) = -\sum_i \left[\, p(A_i) \times \mathrm{Log}\{p(A_i)\} \,\right]$$
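As a worked instance, the description above gives a frequency of 3 for A_2 among 6 records, and since A_1 is the only other value of field A, it accounts for the remaining 3 records. With Log taken to base 2, the sum then evaluates as follows.

```latex
H(A) = -\left( \tfrac{3}{6}\,\log_2 \tfrac{3}{6} + \tfrac{3}{6}\,\log_2 \tfrac{3}{6} \right)
     = -\left( 0.5 \times (-1) + 0.5 \times (-1) \right)
     = 1
```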

Similarly, for field B and field C, the frequency and occurrence probability of each value of field B are obtained as shown in the table of FIG. 5c, and the frequency and occurrence probability of each value of field C are obtained as shown in the table of FIG. 5d. The entropy H(B) of field B is calculated as
$$H(B) = -\sum_i \left[\, p(B_i) \times \mathrm{Log}\{p(B_i)\} \,\right]$$
and the entropy H(C) of field C is calculated as
$$H(C) = -\sum_i \left[\, p(C_i) \times \mathrm{Log}\{p(C_i)\} \,\right]$$

Next, calculation of conditional entropy performed in step 410 will be described.
Now, as shown in FIG. 5, among the entropies H (A), H (B), and H (C) obtained for the three fields A, B, and C of the reference data G, H (A) is the largest. If so, field A is set to the first axis.

Once the first axis is set, the process moves on to the second axis (i = 2). In step 410, with field A set as the first axis, the conditional entropies H_A(B) and H_A(C) of fields B and C, that is, the fields other than field A, are calculated under the first axis (field A).

In this case, the number (frequency) of each value B_i (B_1 and B_2) of field B within the subset of the reference data G consisting of records having the value A_k, and the conditional occurrence probability p_{Ak}(B_i), are obtained as shown in the table of FIG. 6a. Here, p_{Ak}(B_i) denotes the conditional occurrence probability of B_i given the value A_k (the occurrence probability of records having the value B_i within the subset of the reference data G consisting of records having the value A_k).

For example, the frequency of the value B_1 of field B for the value A_2 of field A is 1, because the subset of the reference data G consisting of records whose field A value is A_2 contains one record whose field B value is B_1. Further, since that subset contains 3 records, the conditional occurrence probability of the value B_1 of field B given the value A_2 of field A is p_{A2}(B_1) = 1/3 ≈ 0.33.

Then, from the conditional occurrence probabilities p_{Ak}(B_i) of the values of field B obtained as shown in the table of FIG. 6a, and the occurrence probabilities p(A_k) of the values of field A, the conditional entropy H_A(B) of field B is calculated as
$$H_A(B) = -\sum_k \Big[\, p(A_k) \times \sum_i \big[\, p_{Ak}(B_i) \times \mathrm{Log}\{p_{Ak}(B_i)\} \,\big] \Big]$$
Note that Σ_k denotes the sum over k.

Similarly, for field C, the frequency and conditional occurrence probability of each value of field C are obtained as shown in the table of FIG. 6b, and the conditional entropy H_A(C) of field C is calculated as
$$H_A(C) = -\sum_k \Big[\, p(A_k) \times \sum_i \big[\, p_{Ak}(C_i) \times \mathrm{Log}\{p_{Ak}(C_i)\} \,\big] \Big]$$

Next, suppose that, as shown in FIG. 6, the conditional entropy H_A(B) of field B is the larger of the conditional entropies H_A(B) and H_A(C) of the two fields B and C. In that case, field B is set as the second axis.

Once the second axis is set, the process moves on to the third axis (i = 3). In step 410, with field A set as the first axis and field B set as the second axis, the conditional entropy H_{A,B}(C) of field C, the remaining field other than fields A and B, is calculated under the first axis (field A) and the second axis (field B).

In this case, the number (frequency) of each value C_i (C_1 and C_2) of field C within the subset of the reference data G consisting of records having both the value A_k and the value B_s, and the conditional occurrence probability p_{Ak, Bs}(C_i), are obtained as shown in the table of FIG. 7.

Here, p_{Ak, Bs}(C_i) denotes the conditional occurrence probability of C_i given the value pair (A_k, B_s) (the occurrence probability of records having the value C_i within the subset of the reference data G consisting of records having both the value A_k and the value B_s).

For example, the frequency of the value C_1 of field C for the pair of the value A_2 of field A and the value B_2 of field B is 2, because the subset of the reference data G consisting of records whose field A value is A_2 and whose field B value is B_2 contains two records whose field C value is C_1. Further, since that subset contains 2 records, the conditional occurrence probability of the value C_1 of field C given the pair of the value A_2 of field A and the value B_2 of field B is p_{A2, B2}(C_1) = 2/2 = 1.

Then, from the conditional occurrence probabilities p_{Ak, Bs}(C_i) of the values of field C obtained as shown in the table of FIG. 7, and the occurrence probabilities p(A_k, B_s) in the reference data G of the pairs (A_k, B_s) of a value A_k of field A and a value B_s of field B, the conditional entropy H_{A,B}(C) of field C is calculated as
$$H_{A,B}(C) = -\sum_k \sum_s \Big[\, p(A_k, B_s) \times \sum_i \big[\, p_{Ak, Bs}(C_i) \times \mathrm{Log}\{p_{Ak, Bs}(C_i)\} \,\big] \Big]$$
Note that Σ_s denotes the sum over s.

In the case of the reference data G in FIG. 5a, since there are three fields, the field C is set to the third axis.
The method of calculating the conditional entropy has been described above with reference to FIGS. 5, 6, and 7 for the case where the reference data G has three fields. When the reference data G has four or more fields, the calculation of the conditional entropy and the setting of the axes are repeated in sequence in the same manner.

Now, the sum H(A) + H_A(B) + H_{A,B}(C) of the entropies of the three axes, H(A), H_A(B), and H_{A,B}(C), calculated for the reference data G as described above, is equal to
$$H = -\sum_k \sum_s \sum_i \left[\, p(A_k, B_s, C_i) \times \mathrm{Log}\{p(A_k, B_s, C_i)\} \,\right]$$
Here, p(A_k, B_s, C_i) is the occurrence probability in the reference data G of the value combination (A_k, B_s, C_i).
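This equality can be checked numerically. The following Python sketch uses a hypothetical six-record table (not the table of FIG. 5a) and hypothetical function names; it compares the sum of the three per-axis entropies with the entropy of whole records.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(items):
    """Entropy, in bits, of a list of hashable items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def conditional_entropy(records, target, given):
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[f] for f in given)].append(r[target])
    n = len(records)
    return sum((len(v) / n) * entropy(v) for v in groups.values())


# Hypothetical reference data G with fields A, B, C.
G = [
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A1", "B": "B2", "C": "C1"},
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A2", "B": "B1", "C": "C1"},
    {"A": "A2", "B": "B2", "C": "C2"},
    {"A": "A2", "B": "B1", "C": "C1"},
]

per_axis = [
    conditional_entropy(G, "A", []),          # H(A)
    conditional_entropy(G, "B", ["A"]),       # H_A(B)
    conditional_entropy(G, "C", ["A", "B"]),  # H_{A,B}(C)
]
# Entropy of the value combinations (A_k, B_s, C_i) of whole records.
joint = entropy([(r["A"], r["B"], r["C"]) for r in G])
```

The same check with the axes in another order, for example C, A, B, gives the same total, which illustrates that only the split across axes depends on the order.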

Therefore, the three-axis entropy H (A), H _A (B), H _{A, B} (C) can be considered as the spectral decomposition of the entropy H of the reference data G along each axis.
In the axis determination process described above, the axes are set in descending order of the entropy or conditional entropy calculated for the reference data G, for the following reason.
The sum of the entropy and the conditional entropies is the same regardless of the order of the axes. In an actual system, however, noise resistance is higher when processing starts from the fields having a large entropy or conditional entropy.

The axis determination process performed by the axis determination unit 11 of the processor 1 has been described above.
Next, data determination processing performed by the determination processing unit 12 of the processor 1 will be described.
FIG. 8 shows the procedure of this data determination process.
As shown in the figure, in the data determination process, the determination processing unit 12 calculates the entropy of the determination target data T (step 802) for each axis defined in the axis definition information (steps 800, 810, and 814). The calculated entropy of each axis of the determination target data T is registered as the entropy HT(i) of that axis in the entry for the determination target data T of the determination result data (step 804).

Here, the entropy of each axis in step 802 is calculated as follows. For the first axis, the entropy of the field set as the first axis is calculated in the same manner as the entropy calculation shown in FIG. 5. For each subsequent i-th axis, the conditional entropy of the field set as the i-th axis, under the fields set as the first axis to the (i-1)-th axis, is calculated in the same manner as the conditional entropy calculation shown in FIGS. 6 and 7.

That is, suppose the determination target data T is the table shown in FIG. 2b, and the axis definition information indicates field A as the first axis, field B as the second axis, and field C as the third axis. For the first axis, the entropy H(A) is calculated in the same manner as in FIG. 5b and taken as the entropy of the first axis of the determination target data T. For the second axis, the conditional entropy H_A(B) is calculated in the same manner as in FIG. 6a and taken as the entropy of the second axis of the determination target data T. For the third axis, the conditional entropy H_{A,B}(C) is calculated in the same manner as in FIG. 7 and taken as the entropy of the third axis of the determination target data T.

Further, in the data determination process, the determination processing unit 12 calculates the mutual information between the reference data G and the determination target data T (step 806) for each axis defined in the axis definition information (steps 800, 810, and 814), and registers the calculated mutual information of each axis as the mutual information I(i) of that axis in the entry for the mutual information I of the determination result data (step 808).

Here, for the first axis, the mutual information between the field set as the first axis in the reference data G and the same field in the determination target data T is obtained as the mutual information I(1) of the first axis. For the second and each subsequent i-th axis, the conditional mutual information of the field set as the i-th axis, under the fields set as the first axis to the (i-1)-th axis, is obtained as the mutual information I(i) of the i-th axis. The method for calculating the mutual information I(i) of each axis will be described later.
Once the entropy HT(i) of each axis of the determination target data T and the mutual information I(i) of each axis have been calculated and registered in the determination result data as described above (step 810), the distance between the reference data G and the determination target data T is evaluated on the basis of the mutual information I(i) of each axis registered in the determination result data (step 812), and the data determination process ends.

Here, the distance between the reference data G and the determination target data T can be evaluated on the basis of the mutual information I(i) of each axis by, for example, judging the distance to be shorter the larger the sum of the mutual information I(i) over all or some of the axes is, or the larger a weighted sum of the mutual information I(i) over all or some of the axes, using appropriate weights, is. In step 812, the relationship between the reference data G and the determination target data T may also be determined by taking into account, in addition to the mutual information I(i) of each axis, the entropy HG(i) calculated for each axis of the reference data G and the entropy HT(i) calculated for each axis of the determination target data T.
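The sum and weighted sum mentioned above can be sketched as a single scoring function. This Python sketch is illustrative only: the function name `closeness_score`, its parameters, and the sample values of I(i) are assumptions, and the embodiment does not fix how the score is mapped to a numeric distance beyond a larger value meaning a shorter distance.

```python
def closeness_score(mi_per_axis, weights=None, axes=None):
    """Weighted sum of the per-axis mutual information I(i).

    A larger score is read as a shorter distance between G and T.
    axes: indices of the axes to include (default: all axes).
    weights: one weight per axis (default: 1.0, giving a plain sum).
    """
    if axes is None:
        axes = range(len(mi_per_axis))
    if weights is None:
        weights = [1.0] * len(mi_per_axis)
    return sum(weights[i] * mi_per_axis[i] for i in axes)


# Hypothetical values of I(1), I(2), I(3).
mi = [0.25, 0.5, 0.125]
plain = closeness_score(mi)                               # sum over all axes
weighted = closeness_score(mi, weights=[1.0, 0.5, 2.0])   # weighted sum
first_two = closeness_score(mi, axes=[0, 1])              # only axes 1 and 2
```

When several candidate tables T are compared against the same reference data G, the candidate with the largest score would be judged closest under this reading.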

Hereinafter, details of the calculation of the mutual information amount I (i) performed in step 806 will be described using the reference data G and the determination target data T illustrated in FIG. 9A as examples.
The reference data G and the determination target data T shown in FIG. 9a are each a table of six records having three fields A, B, and C. Only two values, A_1 and A_2, appear in field A; only two values, B_1 and B_2, appear in field B; and only two values, C_1 and C_2, appear in field C.

It is also assumed that field A is set on the first axis, field B is set on the second axis, and field C is set on the third axis.
In addition, the pair of records at the same position (rank) in the reference data G and the determination target data T is defined as a record set RS, and the collection of the record sets RS so defined, equal in number to the records of the reference data G and the determination target data T, is called a record set set. That is, the n-th record set RS_n consists of the n-th record of the reference data G and the n-th record of the determination target data T. Since the reference data G and the determination target data T each have 6 records, the record set set contains 6 record sets RS, from RS_1 to RS_6.

Now, since the first axis is field A, the mutual information I(1) of the first axis is calculated by focusing on field A of the reference data G and field A of the determination target data T, as shown in FIG. 9.

Here, A_i denotes the i-th of the values appearing in field A of the reference data G, and A_j denotes the j-th of the values appearing in field A of the determination target data T. p(A_i) denotes the appearance probability, in the record set set, of record sets RS that include a record of the reference data G whose field A value is A_i, and p(A_j) denotes the appearance probability, in the record set set, of record sets RS that include a record of the determination target data T whose field A value is A_j.

Further, p(A_i, A_j) denotes the appearance probability, in the record set set, of record sets RS consisting of a record of the reference data G whose field A value is A_i and a record of the determination target data T whose field A value is A_j.

Now, for the reference data G and the determination target data T shown in FIG. 9a, the appearance probability p(A_i, A_j), in the record set set, of record sets RS consisting of a record of the reference data G whose field A value is A_i and a record of the determination target data T whose field A value is A_j is obtained as shown in FIG. 9c.

That is, the number of record sets RS in the record set set is 6, and the only record set RS in which the field A value of the reference data G is A_1 and the field A value of the determination target data T is A_2 is the fourth record set RS_4, consisting of the fourth records of the reference data G and the determination target data T. Its number (frequency) is therefore 1, and p(A_1, A_2) = 1/6 ≈ 0.17.

Further, p (A _i ) and p (A _j ) are obtained as shown in FIGS. 9d1 and d2.
For example, for p(A_i) with i = 1, the number (frequency) of record sets RS including a record of the reference data G whose field A value is A_1 is 3, and the number of record sets RS in the record set set is 6, so p(A_1) = 3/6 = 0.5. Similarly, for p(A_j) with j = 2, the number (frequency) of record sets RS including a record of the determination target data T whose field A value is A_2 is 3, and the number of record sets RS in the record set set is 6, so p(A_2) = 3/6 = 0.5.

Then, the mutual information I(A) is obtained using p(A_i, A_j), p(A_i), and p(A_j) obtained as shown in FIGS. 9c, 9d1, and 9d2, and I(A) is taken as the mutual information I(1) of the first axis.

That is, the mutual information I(1) of the first axis is calculated by
$$I(1) = I(A) = -\sum_{i,j} p(A_i, A_j)\,\log\frac{p(A_i, A_j)}{p(A_i)\,p(A_j)}$$
Note that Σ_{i,j} represents the sum over i and j.
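As a minimal sketch, the expression for I(1) can be evaluated directly from paired field values. Two points are assumptions rather than statements of the source: the logarithm is taken as the natural logarithm, and pairs that never occur are skipped (the usual 0 log 0 = 0 convention). The sketch keeps the leading minus sign as printed; the textbook definition of mutual information has no such sign, so the value returned here is the negation of the textbook quantity. The function name `first_axis_information` is hypothetical.

```python
import math
from collections import Counter

def first_axis_information(g_values, t_values):
    """Evaluate the printed expression for I(1) over paired field values."""
    n = len(g_values)
    joint = Counter(zip(g_values, t_values))
    g_count = Counter(g_values)
    t_count = Counter(t_values)
    total = 0.0
    for (g, t), count in joint.items():  # only pairs that occur, so p > 0
        p_gt = count / n
        p_g = g_count[g] / n
        p_t = t_count[t] / n
        total += p_gt * math.log(p_gt / (p_g * p_t))
    return -total  # leading minus sign as printed in the text

# Identical columns versus statistically independent columns (hypothetical data).
same = first_axis_information(["x", "y", "x", "y"], ["x", "y", "x", "y"])
indep = first_axis_information(["x", "x", "y", "y"], ["u", "v", "u", "v"])
print(same, indep)
```

With identical columns the sum inside equals log 2, and with independent columns every log term is log 1 = 0.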

Next, since the first axis is field A and the second axis is field B, the mutual information I(2) of the second axis is calculated, as shown in the figure, by focusing on field A and field B of the reference data G and field B of the determination target data T in the set of record sets.

Here, p(A_k) represents the probability of appearance, in the set of record sets, of a record set RS that includes a record of the reference data G having the value A_k in field A. The conditional appearance probability p_Ak(B_i) represents the probability of appearance, within the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G having the value A_k in field A, of a record set RS that includes a record of the reference data G having B_i as the value of field B. Likewise, the conditional appearance probability p_Ak(B_j) represents the probability of appearance, within the same subset, of a record set RS that includes a record of the determination target data T having B_j as the value of field B.

Also, p_Ak(B_i, B_j) represents the probability of appearance, within the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G having the value A_k in field A, of a record set RS that includes both a record of the reference data G having B_i as the value of field B and a record of the determination target data T having B_j as the value of field B.

Now, p(A_k) and p_Ak(B_i, B_j) are obtained as shown in FIG. 10B for the reference data G and the determination target data T shown in FIG. 10A.
For example, for p(A_1), the number of record sets RS included in the set of record sets is 6, and the number of record sets RS that include a record of the reference data G whose field A value is A_1 is 3, so it is obtained as p(A_1) = 3/6 = 0.5.

Also, for example, for p_A1(B_1, B_2), the number of record sets RS included in the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is A_1 is 3, and the number (frequency) of record sets RS in this subset that include both a record of the reference data G having B_1 as the value of field B and a record of the determination target data T having B_2 as the value of field B is 1, so it is obtained as p_A1(B_1, B_2) = 1/3 ≈ 0.33.

Further, p_Ak(B_i) and p_Ak(B_j) are obtained as shown in FIGS. 10c1 and 10c2.
That is, for p_Ak(B_i) with k = 1 and i = 1, the number of record sets RS in the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is A_1 is 3, and the number of record sets RS in this subset that include a record of the reference data G whose field B value is B_1 is 3, so it is obtained as p_A1(B_1) = 3/3 = 1.

Further, for p_Ak(B_j) with k = 1 and j = 1, the number of record sets RS in the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is A_1 is 3, and the number of record sets RS in this subset that include a record of the determination target data T whose field B value is B_1 is 2, so it is obtained as p_A1(B_1) = 2/3 ≈ 0.67.
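The subset-based counting for the second axis can be sketched as follows. The triples are hypothetical and are not the data of FIG. 10A; they were chosen only so that the values quoted in the text for the A_1 subset are reproduced. The helper name `conditional_probabilities` is introduced here for illustration and assumes the requested field A value actually occurs.

```python
from collections import Counter

# Hypothetical record sets as (G field A, G field B, T field B) triples.
record_sets = [
    ("A1", "B1", "B1"),
    ("A2", "B1", "B2"),
    ("A2", "B2", "B2"),
    ("A1", "B1", "B2"),
    ("A1", "B1", "B1"),
    ("A2", "B2", "B1"),
]

def conditional_probabilities(record_sets, a_value):
    """Probabilities of field B within the subset where G's field A is a_value."""
    subset = [(g_b, t_b) for g_a, g_b, t_b in record_sets if g_a == a_value]
    m = len(subset)  # size of the subset (assumed non-zero)
    p_a = m / len(record_sets)
    p_joint = {pair: c / m for pair, c in Counter(subset).items()}
    p_g = {v: c / m for v, c in Counter(g_b for g_b, _ in subset).items()}
    p_t = {v: c / m for v, c in Counter(t_b for _, t_b in subset).items()}
    return p_a, p_joint, p_g, p_t

p_a, p_joint, p_g, p_t = conditional_probabilities(record_sets, "A1")
print(p_a, p_joint[("B1", "B2")], p_g["B1"], p_t["B1"])
```

The denominators change from 6 to 3 because every conditional probability is a relative frequency inside the A_1 subset, not over the whole set of record sets.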

Then, using p(A_k), p_Ak(B_i, B_j), p_Ak(B_i), and p_Ak(B_j) obtained as shown in FIGS. 10b, 10c1, and 10c2, the conditional mutual information I_A(B) of field B under field A is obtained, and the obtained I_A(B) is taken as the mutual information I(2) of the second axis.

That is, the mutual information I(2) of the second axis is calculated by
$$I(2) = I_A(B) = -\sum_{k}\left( p(A_k) \sum_{i,j} p_{A_k}(B_i, B_j)\,\log\frac{p_{A_k}(B_i, B_j)}{p_{A_k}(B_i)\,p_{A_k}(B_j)} \right)$$
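A minimal sketch of the I(2) expression follows, under the same assumptions as before: natural logarithm, unobserved pairs skipped, and the leading minus sign kept as printed. The function name `second_axis_information` and the sample columns are hypothetical.

```python
import math
from collections import Counter, defaultdict

def second_axis_information(g_a, g_b, t_b):
    """Evaluate the printed expression for I(2): field B conditioned on G's field A."""
    n = len(g_a)
    groups = defaultdict(list)  # A_k -> list of (G field B, T field B) pairs
    for a, gb, tb in zip(g_a, g_b, t_b):
        groups[a].append((gb, tb))

    total = 0.0
    for pairs in groups.values():
        m = len(pairs)
        joint = Counter(pairs)
        g_count = Counter(gb for gb, _ in pairs)
        t_count = Counter(tb for _, tb in pairs)
        inner = sum(
            (c / m) * math.log((c / m) / ((g_count[gb] / m) * (t_count[tb] / m)))
            for (gb, tb), c in joint.items()
        )
        total += (m / n) * inner  # weight by p(A_k)
    return -total  # leading minus sign as printed in the text

# Field B varies inside each A group, and T copies G.
varies = second_axis_information(["a", "a", "b", "b"], ["x", "y", "x", "y"], ["x", "y", "x", "y"])
# Field B is fully determined by field A, so nothing remains after conditioning.
fixed = second_axis_information(["a", "a", "b", "b"], ["x", "x", "y", "y"], ["x", "x", "y", "y"])
print(varies, fixed)
```

The second call illustrates the effect of conditioning: when field A already determines field B, every subset holds a single value pair and each log term is zero.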

Next, since the first axis is field A, the second axis is field B, and the third axis is field C, the mutual information I(3) of the third axis is calculated, as shown in the figure, by focusing on field A, field B, and field C of the reference data G and field C of the determination target data T in the set of record sets.

Here, p(A_k, B_s) represents the probability of appearance, in the set of record sets, of a record set RS that includes a record of the reference data G having the value A_k in field A and the value B_s in field B. The conditional appearance probability p_{Ak,Bs}(C_i) represents the probability of appearance, within the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G having the value A_k in field A and the value B_s in field B, of a record set RS that includes a record of the reference data G having C_i as the value of field C. Likewise, the conditional appearance probability p_{Ak,Bs}(C_j) represents the probability of appearance, within the same subset, of a record set RS that includes a record of the determination target data T having C_j as the value of field C.

Also, p_{Ak,Bs}(C_i, C_j) represents the probability of appearance, within the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G having the value A_k in field A and the value B_s in field B, of a record set RS that includes both a record of the reference data G having C_i as the value of field C and a record of the determination target data T having C_j as the value of field C.

Now, p(A_k, B_s) and p_{Ak,Bs}(C_i, C_j) are obtained as shown in the figure.

For example, for p(A_2, B_1), the number of record sets RS included in the set of record sets is 6, and the number of record sets RS that include a record of the reference data G whose field A value is A_2 and whose field B value is B_1 is 1, so it is obtained as p(A_2, B_1) = 1/6 ≈ 0.17.

Also, for example, for p_{A1,B1}(C_1, C_1), the number of record sets RS in the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is A_1 and whose field B value is B_1 is 3, and the number (frequency) of record sets RS in this subset that include both a record of the reference data G having C_1 as the value of field C and a record of the determination target data T having C_1 as the value of field C is 2, so it is obtained as p_{A1,B1}(C_1, C_1) = 2/3 ≈ 0.67.

Further, p_{Ak,Bs}(C_i) and p_{Ak,Bs}(C_j) are obtained as shown in the figures.
That is, for example, for p_{Ak,Bs}(C_i) with k = 1, s = 1, and i = 1, the number of record sets RS in the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is A_1 and whose field B value is B_1 is 3, and the number of record sets RS in this subset that include a record of the reference data G whose field C value is C_1 is 2, so it is obtained as p_{A1,B1}(C_1) = 2/3 ≈ 0.67.

In addition, for p_{Ak,Bs}(C_j) with k = 1, s = 1, and j = 2, the number of record sets RS in the subset of the set of record sets consisting of the record sets RS that include a record of the reference data G whose field A value is A_1 and whose field B value is B_1 is 3, and the number of record sets RS in this subset that include a record of the determination target data T whose field C value is C_2 is 1, so it is obtained as p_{A1,B1}(C_2) = 1/3 ≈ 0.33.

Then, using p(A_k, B_s), p_{Ak,Bs}(C_i, C_j), p_{Ak,Bs}(C_i), and p_{Ak,Bs}(C_j), the conditional mutual information I_{A,B}(C) of field C under fields A and B, which are the first and second axes, is obtained, and the obtained I_{A,B}(C) is taken as the mutual information I(3) of the third axis.

That is, the mutual information I(3) of the third axis is calculated by
$$I(3) = I_{A,B}(C) = -\sum_{k}\sum_{s}\left( p(A_k, B_s) \sum_{i,j} p_{A_k,B_s}(C_i, C_j)\,\log\frac{p_{A_k,B_s}(C_i, C_j)}{p_{A_k,B_s}(C_i)\,p_{A_k,B_s}(C_j)} \right)$$
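Because I(3) differs from I(2) only in conditioning on a pair of values instead of a single value, one way to sketch it is to group record sets by a tuple of the conditioning values. The function below is such a sketch, with the same assumptions as before (natural logarithm, unobserved pairs skipped, leading minus sign kept as printed); the name `conditional_information` and the sample columns are hypothetical.

```python
import math
from collections import Counter, defaultdict

def conditional_information(g_conditions, g_field, t_field):
    """Printed expression for one axis, conditioned on zero or more fields of G.

    g_conditions: columns of G on the earlier axes, e.g. [A, B] for the third axis.
    g_field, t_field: the current axis's column in G and in T.
    """
    n = len(g_field)
    # One tuple of conditioning values per record set; () when unconditioned.
    keys = list(zip(*g_conditions)) if g_conditions else [()] * n
    groups = defaultdict(list)
    for key, g, t in zip(keys, g_field, t_field):
        groups[key].append((g, t))

    total = 0.0
    for pairs in groups.values():
        m = len(pairs)
        joint = Counter(pairs)
        g_count = Counter(g for g, _ in pairs)
        t_count = Counter(t for _, t in pairs)
        inner = sum(
            (c / m) * math.log((c / m) / ((g_count[g] / m) * (t_count[t] / m)))
            for (g, t), c in joint.items()
        )
        total += (m / n) * inner  # weight by p(A_k, B_s)
    return -total  # leading minus sign as printed in the text

# Hypothetical columns: two (A, B) subsets, field C copied from G to T.
a = ["A1", "A1", "A1", "A1"]
b = ["B1", "B1", "B2", "B2"]
g_c = ["C1", "C2", "C1", "C2"]
t_c = ["C1", "C2", "C1", "C2"]
i3 = conditional_information([a, b], g_c, t_c)
print(i3)
```

Passing an empty list of conditions reduces the same function to the unconditioned first-axis expression.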

The details of calculating the mutual information amount I(i) have been described above.
Here, the method of calculating the mutual information amount I(i) has been shown taking the case of three fields as an example, but the same applies when the number of fields is four or more: for each axis n from the fourth axis onward, the mutual information amount I(n) can be calculated as the conditional mutual information amount under the fields of the first through (n-1)-th axes.
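Written in the same form as I(2) and I(3), the extension to the n-th axis might look as follows. This is an extrapolation rather than a formula given in the source, and the notation is introduced here for illustration only: X^{m} stands for the field on the m-th axis, k = (k_1, ..., k_{n-1}) indexes one combination of values of the first n-1 axis fields in the reference data G, and p_k denotes a probability taken within the subset of record sets whose reference data record has that combination of values.

```latex
I(n) = -\sum_{k_1,\dots,k_{n-1}}
       p\bigl(X^{1}_{k_1},\dots,X^{n-1}_{k_{n-1}}\bigr)
       \sum_{i,j} p_{\mathbf{k}}\bigl(X^{n}_{i}, X^{n}_{j}\bigr)\,
       \log\frac{p_{\mathbf{k}}\bigl(X^{n}_{i}, X^{n}_{j}\bigr)}
                {p_{\mathbf{k}}\bigl(X^{n}_{i}\bigr)\,p_{\mathbf{k}}\bigl(X^{n}_{j}\bigr)}
```

For n = 3 with X^{1} = A, X^{2} = B, and X^{3} = C, this reduces to the expression for I(3).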

The embodiment of the present invention has been described above.
In the above embodiment, when a denominator becomes 0 in the probability calculation and the result would diverge, the probability is taken to be 1 instead of infinity.
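One possible reading of this rule, sketched below, is a guarded division that returns 1 whenever the denominator is 0. Where exactly the source applies the rule is not spelled out in this passage, so both the helper name `probability` and its placement in the calculation are assumptions.

```python
def probability(count, total):
    """Relative frequency with the zero-denominator rule described above.

    Assumed reading: when the denominator is 0, the probability is taken as 1
    rather than letting the division diverge.
    """
    if total == 0:
        return 1.0
    return count / total

print(probability(3, 6), probability(0, 0))
```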

As described above, according to the present embodiment, the mutual information amount and the conditional mutual information amount of each field of the reference data G and the determination target data T are obtained, and the distance between the reference data G and the determination target data T is calculated using them. Because the mutual information amount and the conditional mutual information amount of each field reflect characteristics of that field across records, the distance between the reference data G and the determination target data T can be calculated in accordance with the characteristics of the data as a whole. In addition, because the mutual information amounts and conditional mutual information amounts of the fields are independent of one another, the distance between the reference data G and the determination target data T can easily be evaluated from various viewpoints. Furthermore, the mutual information amount between the two tables is equal to the sum of the mutual information amounts and conditional mutual information amounts of the individual fields, so processing based on the overall mutual information amount is also possible.

Here, in the above embodiment, the axes are set sequentially by selecting as each axis the field with the maximum entropy or conditional entropy calculated for the reference data G, but the axes may instead be set according to another criterion. For example, the fields may be set as axes in order according to the field order.
Moreover, the above embodiment can also be performed by transposing fields and records.
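The entropy-based axis ordering mentioned above can be sketched as a greedy loop. The sketch assumes the natural logarithm, computes the conditional entropy of a candidate field as H(chosen fields, candidate) minus H(chosen fields), and breaks ties in favor of the earlier field; none of these details is stated in this passage. The names `entropy` and `order_axes` and the sample table are hypothetical.

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy (natural log) of the empirical distribution of rows."""
    n = len(rows)
    return -sum((c / n) * math.log(c / n) for c in Counter(rows).values())

def order_axes(table):
    """Greedy axis order for a table given as {field name: list of values}."""
    if not table:
        return []
    n_rows = len(next(iter(table.values())))
    remaining = list(table)
    chosen = []
    while remaining:
        # Entropy of the fields already chosen; zero when none is chosen yet.
        base_rows = list(zip(*(table[f] for f in chosen))) if chosen else [()] * n_rows
        h_base = entropy(base_rows)

        def conditional_entropy(field):
            joint_rows = list(zip(*(table[f] for f in chosen + [field])))
            return entropy(joint_rows) - h_base  # H(field | chosen fields)

        best = max(remaining, key=conditional_entropy)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical reference data G with three fields.
g = {
    "A": [1, 1, 1, 1, 1, 2],
    "B": [1, 1, 2, 2, 3, 3],
    "C": [1, 2, 1, 2, 1, 2],
}
print(order_axes(g))
```

In this sample, field B has the largest entropy and becomes the first axis; given B, field C still varies within every B group while field A is almost constant, so C is chosen before A.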

1 ... processor, 2 ... storage, 11 ... axis determining unit, 12 ... determination processing unit.

Claims

An information processing system for calculating a distance between a first set of records each including a plurality of fields and a second set of the records, the information processing system comprising:
a rank setting unit that sets a rank for each field of the records;
an evaluation value calculation unit that, in accordance with the ranks set by the rank setting unit, calculates the mutual information amount of the first-ranked field between the first set and the second set as the evaluation value of the first-ranked field, and calculates, for the field of each rank other than the first rank, the conditional mutual information amount of that field between the first set and the second set under the fields ranked higher than that field as the evaluation value of the field of that rank; and
a distance calculation unit that calculates the distance between the first set and the second set using at least a part of the evaluation values calculated for the fields by the evaluation value calculation unit.
The information processing system according to claim 1,
wherein the rank setting unit sets the rank of each field by setting the rank of the field having the maximum entropy in the first set, among the fields of the records, to the first rank, and thereafter repeating, until the ranks of all the fields have been set, a process of setting the rank of the field having the maximum conditional entropy in the first set under the fields whose ranks have already been set to the rank next to the rank set last.
A computer program that is read and executed by a computer, the computer program causing the computer to function as:
a storage unit that stores a first set of records each composed of a plurality of fields and a second set of the records;
a rank setting unit that sets a rank for each field of the records;
an evaluation value calculation unit that, in accordance with the ranks set by the rank setting unit, calculates the mutual information amount of the first-ranked field between the first set and the second set as the evaluation value of the first-ranked field, and calculates, for the field of each rank other than the first rank, the conditional mutual information amount of that field between the first set and the second set under the fields ranked higher than that field as the evaluation value of the field of that rank; and
a distance calculation unit that calculates a distance between the first set and the second set using at least a part of the evaluation values calculated for the fields by the evaluation value calculation unit.
The computer program according to claim 3,
wherein the rank setting unit sets the rank of each field by setting the rank of the field having the maximum entropy in the first set, among the fields of the records, to the first rank, and thereafter repeating, until the ranks of all the fields have been set, a process of setting the rank of the field having the maximum conditional entropy in the first set under the fields whose ranks have already been set to the rank next to the rank set last.