CN109871897A

CN109871897A - Oversampling method using the Hellinger distance as the reference standard

Info

Publication number: CN109871897A
Application number: CN201910147977.9A
Authority: CN
Inventors: 董明刚; 姜振龙; 敬超
Original assignee: Guilin University of Technology
Current assignee: Guilin University of Technology
Priority date: 2019-02-28
Filing date: 2019-02-28
Publication date: 2019-06-11

Abstract

The invention discloses an oversampling method that uses the Hellinger distance as the reference standard. A sample point in a minority class is chosen pseudo-randomly as the reference point, and sample points are synthesized with the SMOTE technique. Before a sample point is synthesized, the Hellinger distances between the minority class containing the reference point and each of the other classes are calculated to form a Hellinger distance matrix, and the minimum of each column of that matrix is taken. Each generated sample point is then placed individually into the minority class, the Hellinger distances between that class and the other classes are recalculated to form a second Hellinger distance matrix, and its column minima are taken. Comparing the two sets of minimum Hellinger distances determines the quality of the synthesized sample point. The invention improves the quality of newly synthesized sample points and avoids sample overlap while affecting the other classes as little as possible. It is suitable for improving the fit and generalization of sample points synthesized by oversampling techniques on specific two-class and multi-class imbalanced data sets.

Description

Oversampling method using the Hellinger distance as the reference standard

Technical field

The present invention relates to oversampling techniques for imbalanced learning on specific data sets, and in particular to an oversampling method that uses the Hellinger distance as the reference standard.

Background art

Imbalanced data (Imbalance Data) are data sets whose classes are unevenly represented. Taking binary classification as an example, when the ratio of majority-class to minority-class samples in a data set exceeds a threshold imbalance ratio IR (Imbalance Ratio), the data are called imbalanced. A threshold of 1.45 or 1.5 is commonly used, and data sets whose IR does not exceed it are regarded as balanced. The imbalance ratio of a data set is IR = N_maj / N_min, where N_maj and N_min are the numbers of majority-class and minority-class samples. For example, with 50 majority-class samples and 20 minority-class samples, IR = 50 / 20 = 2.5, so the data are imbalanced.
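The arithmetic in the example above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `imbalance_ratio` and `is_imbalanced` are hypothetical, and the default threshold of 1.5 is one of the two values mentioned in the text.

```python
def imbalance_ratio(n_majority: int, n_minority: int) -> float:
    """Return IR = (majority-class count) / (minority-class count)."""
    if n_majority <= 0 or n_minority <= 0:
        raise ValueError("class counts must be positive")
    return n_majority / n_minority


def is_imbalanced(n_majority: int, n_minority: int, threshold: float = 1.5) -> bool:
    """Treat the data set as imbalanced when IR exceeds the chosen threshold."""
    return imbalance_ratio(n_majority, n_minority) > threshold


# The example from the text: 50 majority-class and 20 minority-class samples.
print(imbalance_ratio(50, 20))  # 2.5
print(is_imbalanced(50, 20))    # True
```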

Imbalanced learning processes imbalanced data so that classification can achieve higher accuracy. There are many imbalanced-learning methods, for example cost-sensitive learning (Cost-Sensitive), sampling methods (Sampling Method), and ensemble learning (Ensemble Learning).

Oversampling (Oversampling) is one of the sampling methods used in imbalanced learning; it mainly increases the number of minority-class samples. A sample point is chosen at random from the minority-class set as the reference point, a new sample point is synthesized from the reference point, and the new sample point is added to the minority class, increasing its size.

Synthetic Minority Oversampling Technique, abbreviated SMOTE, is a synthetic sampling method that has proved effective in many fields. Working from the existing minority-class samples, it computes the similarity between samples in feature space and then creates artificial synthetic samples.
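A minimal sketch of one SMOTE synthesis may make this concrete. It uses the interpolation formula that appears later in the embodiment, S_n = S + α * (G - S) with α in (0, 1). The function name `smote_point`, the use of Euclidean distance, and the NumPy-based neighbor search are assumptions made for illustration, not details taken from this document.

```python
import numpy as np


def smote_point(minority, ref_idx, k=5, rng=None):
    """Synthesize one point from reference point S = minority[ref_idx].

    Assumption: neighbors are found by Euclidean distance within the
    minority class only, as in the basic SMOTE scheme.
    """
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    s = minority[ref_idx]
    dists = np.linalg.norm(minority - s, axis=1)
    order = np.argsort(dists)
    neighbors = order[order != ref_idx][:k]   # k nearest, excluding S itself
    g = minority[rng.choice(neighbors)]       # target point G
    # uniform() draws from [0, 1); the text specifies the open interval (0, 1),
    # and drawing exactly 0 is vanishingly unlikely.
    alpha = rng.uniform(0.0, 1.0)
    return s + alpha * (g - s)                # S_n = S + alpha * (G - S)
```

The synthesized point always lies on the line segment between S and G, so it stays inside the region spanned by existing minority samples.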

The Hellinger distance (Hellinger Distance) is closely related to the Bhattacharyya distance. In probability and statistics, the Hellinger distance measures the similarity between two probability distributions. In imbalanced learning, it can be used to measure the degree of similarity between two classes and to compute the importance of attributes, which makes the present invention possible.
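For reference, the standard definition of the Hellinger distance between two discrete distributions is given first below. The second expression is the two-term form in terms of a true positive rate and a false positive rate that is commonly used in Hellinger-distance decision trees; that it matches the d_H(tpr, fpr) formula intended in the embodiment is an assumption. Note that the first form is normalized to [0, 1], while the second ranges over [0, √2].

```latex
% General definition for discrete distributions P = (p_1, ..., p_n), Q = (q_1, ..., q_n)
H(P, Q) = \frac{1}{\sqrt{2}}
          \sqrt{\sum_{i=1}^{n} \left(\sqrt{p_i} - \sqrt{q_i}\right)^{2}}

% Two-term form in tpr and fpr (assumed form, without the 1/\sqrt{2} factor)
d_H(\mathrm{tpr}, \mathrm{fpr}) =
    \sqrt{\left(\sqrt{\mathrm{tpr}} - \sqrt{\mathrm{fpr}}\right)^{2}
        + \left(\sqrt{1 - \mathrm{tpr}} - \sqrt{1 - \mathrm{fpr}}\right)^{2}}
```

In both forms a larger value means the two distributions, and hence the two classes, are easier to tell apart.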

Summary of the invention

Sample points synthesized by the current oversampling technique SMOTE can overlap other classes, cause overfitting, and generalize poorly. To address these problems, the present invention provides an oversampling method that uses the Hellinger distance as the reference standard. The method avoids sample overlap and improves the fit and generalization of the sample points synthesized by SMOTE.

Idea of the invention: a sample point in a minority class is chosen pseudo-randomly as the reference point, and a new sample point is generated from it with the Synthetic Minority Oversampling Technique. Before the sample point is synthesized, the Hellinger distances between the minority class containing the reference point and each of the other classes are calculated to form a Hellinger distance matrix, and the minimum of each column of that matrix is taken. Each generated sample point is then placed individually into the minority class, the Hellinger distances between that class and the other classes are calculated again to form a second Hellinger distance matrix, and its column minima are taken. Comparing the minimum Hellinger distances before and after synthesis determines the quality of the synthesized sample point.

Specific steps are as follows:

Step 1: choose a reference point S in the minority class and, using its K nearest neighbors, randomly select a target point G, where the reference point S and the target point G belong to the same class.

Step 2: calculate k₁, the number of K nearest neighbors of the reference point S that belong to the same class as S, and k₂, the number of K nearest neighbors of the target point G that belong to the same class as G. The neighborhood size K_n is chosen according to the data set; K_n = 10 is typical.

Step 3: judge whether the reference point S and the target point G meet the basic requirement. If k₂ > k₁, or if k₁ > K_n/2 and k₂ > K_n/2, execute step 4; otherwise repeat step 1, step 2 and step 3 until the condition is met.

Step 4: calculate the Hellinger distances before synthesis. Before a sample point is synthesized, calculate the Hellinger distances between the minority class containing the reference point S and each of the other classes to form the distance matrix M₁.

Step 5: take the minimum of each column of M₁ to form the row vector R₁.

Step 6: calculate the Hellinger distances after synthesis. Add the newly generated sample point to the current minority class to form a new minority class C₁, and calculate the Hellinger distances between C₁ and each of the other classes to form the distance matrix M₂.

Step 7: take the minimum of each column of M₂ to form the row vector R₂.

Step 8: compare R₁ and R₂ to determine the quality of the newly synthesized sample point.
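Steps 1 to 3 above can be sketched as follows. This is a hedged illustration rather than the patented implementation: the function names are hypothetical, neighbors are assumed to be searched over the whole data set by Euclidean distance, and the condition is read as "k₂ > k₁, or (k₁ > K_n/2 and k₂ > K_n/2)", which is one plausible grouping of the wording in step 3.

```python
import numpy as np


def same_class_neighbors(X, y, idx, k_n):
    """Count how many of the k_n nearest neighbors of X[idx] share its label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = np.linalg.norm(X - X[idx], axis=1)
    d[idx] = np.inf                      # exclude the point itself
    nearest = np.argsort(d)[:k_n]
    return int(np.sum(y[nearest] == y[idx]))


def pair_is_acceptable(k1, k2, k_n):
    """Step 3 (assumed grouping): k2 > k1, or both counts exceed K_n / 2."""
    return k2 > k1 or (k1 > k_n / 2 and k2 > k_n / 2)


# Tiny one-attribute example: three points of class 0, two of class 1.
X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
y = [0, 0, 0, 1, 1]
k1 = same_class_neighbors(X, y, 0, k_n=2)   # neighbors of point 0
k2 = same_class_neighbors(X, y, 1, k_n=2)   # neighbors of point 1
print(k1, k2, pair_is_acceptable(k1, k2, 2))
```

A pair (S, G) that fails the check is simply discarded and a new pair is drawn, as step 3 prescribes.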

In the oversampling method of the present invention, the Hellinger distance measures the degree of similarity between two classes; the more similar two classes are, the harder they are to separate. By comparing the Hellinger distances before and after a sample point is synthesized, the method ensures that, in addition to satisfying the synthesis requirement, the synthesized sample point maintains or reduces the similarity between classes, thereby improving the quality of newly synthesized sample points. The method is suited to avoiding the overlap that follows SMOTE synthesis on specific multi-class imbalanced data sets, and it improves the fit and generalization of the sample points synthesized by SMOTE.
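The comparison of the two distance matrices can be illustrated with a short sketch. The function names are hypothetical and the matrix values are made up for the example. The text only says that R₂ is compared with R₁; reading "R₂ > R₁" as "larger in every column" is an assumption made here.

```python
import numpy as np


def column_minima(m):
    """Row vector R: the minimum of each column of distance matrix M."""
    return np.min(np.asarray(m, dtype=float), axis=0)


def synthetic_point_is_good(m1, m2):
    """Assumed reading of R2 > R1: every column minimum strictly increases."""
    r1 = column_minima(m1)
    r2 = column_minima(m2)
    return bool(np.all(r2 > r1))


# Illustrative values only: rows are other classes, columns are attributes.
m1 = [[0.30, 0.50],
      [0.40, 0.20]]
m2 = [[0.35, 0.55],
      [0.45, 0.25]]
print(column_minima(m1))                 # [0.3 0.2]
print(synthetic_point_is_good(m1, m2))   # True
```

Taking the column minimum focuses the check on the other class that is closest to the minority class on each attribute, which is where overlap is most likely.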

Brief description of the drawings

Fig. 1 is a flow chart of the specific steps of the embodiment of the present invention.

Fig. 2 illustrates the SMOTE technique used in the embodiment of the present invention.

Fig. 3 shows the result after SMOTE synthesizes sample points in the embodiment of the present invention.

Fig. 4 shows the result of the embodiment of the present invention with the Hellinger distance as the reference standard.

Fig. 5 shows the results of the embodiment of the present invention under the precision (Precision) evaluation criterion.

Fig. 6 shows the results of the embodiment of the present invention under the MAUC evaluation criterion.

Detailed description of the embodiment:

The present embodiment uses imbalanced data sets published on UCI (http://mlr.cs.umass.edu/ml/datasets.html) and KEEL (http://sci2s.ugr.es/keel/datasets.php), and the oversampling method uses the Synthetic Minority Oversampling Technique. Because the data sets on these platforms contain too many samples to display conveniently, the present embodiment uses a manually defined data set D to explain the specific steps. As shown in Fig. 2, data set D has 4 attributes and consists of 4 classes: 2 majority classes a and b, with 19 and 20 samples respectively, and 2 minority classes c and d, with 5 and 10 samples respectively. K in the K-nearest-neighbor search is set to 5.

Specific steps are as follows:

Step 1: select the sample points. Choose a sample point S from the first minority class c as the reference point and, using its K nearest neighbors, randomly select a target point G, where the target point G and the reference point S both belong to minority class c, as shown in Fig. 1.

Step 2: calculate k₁, the number of K nearest neighbors of the reference point S that belong to the same class as S, and k₂, the number of K nearest neighbors of the target point G that belong to the same class as G. Here the neighborhood size is K_n = 5.

Step 3: judge whether the reference point S and the target point G meet the basic requirement. If k₂ > k₁, or if k₁ > 3 and k₂ > 3, execute step 4; otherwise repeat step 1, step 2 and step 3 until the condition is met.

Step 4: calculate the Hellinger distances. Before a sample point is synthesized, calculate the Hellinger distances between the minority class containing the reference point S and each of the other classes to form the distance matrix M₁.

Step 4.1: the calculation formula and the construction of M₁ are as follows:

Hellinger distance formula:

The minority class c currently under consideration is taken as the positive class, and the other classes a, b and d are taken as negative classes.

Here d_H(tpr, fpr) denotes the Hellinger distance between the positive class t and the negative class f; tpr is the probability that a positive-class sample is correctly classified as positive, and fpr is the probability that a negative-class sample is incorrectly classified as positive.

Step 4.2: tpr and fpr can be calculated from the confusion matrix. The confusion matrix is as follows:

Step 4.3: calculate the Hellinger distances between minority class c and classes a, b and d. Repeating steps 4.1 and 4.2 gives the Hellinger distances between c and a, b, d:

D(c, a) = [v_a1 v_a2 v_a3 v_a4]

D(c, b) = [v_b1 v_b2 v_b3 v_b4]

D(c, d) = [v_d1 v_d2 v_d3 v_d4]

Here v_a1 denotes the Hellinger distance between the positive class c and the negative class a on the first attribute.

Step 4.4: form the distance matrix M₁. Each attribute has a corresponding Hellinger distance value; data set D has 4 attributes, so each pair of classes has 4 Hellinger distance values. Stacking D(c, a), D(c, b) and D(c, d) as rows gives a matrix M₁ with 3 rows and 4 columns.

Step 4.5: calculate the minimum of each column of M₁ to obtain the row vector R₁.

The calculation gives R₁ = [v_m1 v_m2 v_m3 v_m4], where v_m1 denotes the minimum of the first column of M₁.

Step 4.6: synthesize a new sample point S_n with the Synthetic Minority Oversampling Technique. The synthesis formula is S_n = S + α * (G - S), where α takes values in (0, 1).

Step 4.7: calculate the Hellinger distances after the sample point S_n is synthesized. Put S_n into the positive class c and execute steps 4.1, 4.2, 4.3, 4.4 and 4.5 to form the matrix M₂ and the row vector R₂.

Step 4.8: judge the quality of the synthesized point S_n. Compare R₁ and R₂: if R₂ > R₁, the synthesized point is of good quality and step 5 is executed; otherwise step 4.9 is executed.

Step 4.9: keep S_n as the initial best point, i.e. S_p = S_n, then execute steps 4.6 and 4.7 again to synthesize a sample point S_ni and obtain M_2i and R_2i. If R_2i > R₁, execute step 5; otherwise compare R_2i with R₂, and if R_2i > R₂, set R₂ = R_2i and S_p = S_ni. Step 4.9 is executed at most max{i} times, after which step 5 is executed. Here S_p denotes the best synthesized sample point, S_ni denotes the sample point of the i-th synthesis, M_2i denotes the Hellinger distance matrix obtained after the i-th synthesis, R_2i denotes the row vector formed from the column minima of M_2i, i = 1, 2, 3, ..., 10, and max{i} denotes the maximum value of i.

Step 5: put S_n into D_n, then repeat step 1, step 2, step 3 and step 4 until the number of samples in minority class c equals the number of samples in the largest majority class, i.e. majority class b.

Step 6: select the next minority class and execute step 1, step 2, step 3, step 4 and step 5, until all minority classes have been processed.

Step 7: put the data in D_n into D to form a balanced data set, as shown in Fig. 4.
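The core of step 4 (steps 4.2 and 4.6 to 4.9) can be sketched in Python as below. Several points are assumptions rather than details from this document: the two-term form of d_H(tpr, fpr), the reading of "R_2i > R₁" as "larger in every attribute", and all function names. How tpr and fpr are obtained per attribute is not spelled out in the text, so the sketch takes a caller-supplied `column_minima` function that builds M₂ for a candidate point and returns its row vector R₂.

```python
import math
import numpy as np


def hellinger_from_confusion(tp, fn, fp, tn):
    """d_H(tpr, fpr) from confusion-matrix counts (assumed two-term form)."""
    tpr = tp / (tp + fn)   # positives correctly classified as positive
    fpr = fp / (fp + tn)   # negatives wrongly classified as positive
    return math.sqrt((math.sqrt(tpr) - math.sqrt(fpr)) ** 2
                     + (math.sqrt(1 - tpr) - math.sqrt(1 - fpr)) ** 2)


def better(r_new, r_old):
    """Assumed reading of 'R_new > R_old': strictly larger in every attribute."""
    return bool(np.all(np.asarray(r_new) > np.asarray(r_old)))


def synthesize_with_check(s, g, r1, column_minima, max_retries=10, rng=None):
    """Steps 4.6 to 4.9: return (S_p, accepted).

    column_minima(candidate) is a hypothetical callback that adds the
    candidate to the minority class, rebuilds M2 and returns R2.
    """
    rng = np.random.default_rng() if rng is None else rng
    s, g = np.asarray(s, dtype=float), np.asarray(g, dtype=float)
    best_point, best_r = None, None
    for _ in range(1 + max_retries):            # first attempt plus max{i} retries
        candidate = s + rng.uniform(0.0, 1.0) * (g - s)   # S_n = S + alpha * (G - S)
        r2 = column_minima(candidate)
        if better(r2, r1):                      # good point: accept immediately
            return candidate, True
        if best_r is None or better(r2, best_r):
            best_point, best_r = candidate, r2  # remember the best so far as S_p
    return best_point, False                    # retries exhausted: fall back to S_p
```

For data set D, balancing every minority class against the largest majority class b (20 samples) means synthesizing 20 - 5 = 15 points for class c and 20 - 10 = 10 points for class d, so the routine above would be invoked 15 and 10 times respectively.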

Claims

1. An oversampling method using the Hellinger distance as the reference standard, characterized in that the judgment based on the Hellinger distance comprises the following specific steps:

(1) before a sample point is synthesized, calculating the Hellinger distances between the minority class containing the reference point and each of the other classes to form a distance matrix M₁;

(2) taking the minimum of each column of M₁ to form a row vector R₁;

(3) synthesizing new sample points, adding each newly synthesized sample point individually to the minority class containing the reference point to form a new minority class C₁, and calculating the Hellinger distances between C₁ and each of the other classes to form a distance matrix M₂;

(4) taking the minimum of each column of M₂ to form a row vector R₂;

(5) comparing R₁ and R₂ to determine the quality of the newly generated sample point.