CN109871897A - Oversampling method using the Hellinger distance as the reference standard - Google Patents

Oversampling method using the Hellinger distance as the reference standard

Publication number
CN109871897A
Authority
CN
China
Prior art keywords
sample point
Hellinger distance
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910147977.9A
Other languages
Chinese (zh)
Inventor
董明刚
姜振龙
敬超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Technology
Original Assignee
Guilin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Technology
Priority to CN201910147977.9A
Publication of CN109871897A

Abstract

The invention discloses an oversampling method that uses the Hellinger distance as the reference standard. A sample point in a minority class is chosen pseudo-randomly as the reference point, and new sample points are synthesized with the SMOTE technique. Before a sample point is synthesized, the Hellinger distances between the minority class containing the reference point and every other class are computed to form a Hellinger distance matrix, and the minimum of each column of that matrix is taken. Each newly generated sample point is then added individually to the minority class, the Hellinger distances between that class and every other class are recomputed to form a second Hellinger distance matrix, and the minimum of each of its columns is taken. Comparing the two sets of minimum Hellinger distances determines the quality of the synthesized sample point. The invention improves the quality of newly synthesized sample points and avoids sample overlap while affecting the other classes as little as possible. It is suitable for improving the fitness and generalization of the sample points synthesized by oversampling techniques on specific two-class and multi-class imbalanced datasets.

Description

Oversampling method using the Hellinger distance as the reference standard
Technical field
The present invention relates to the field of oversampling techniques for imbalanced learning on specific datasets, and in particular to an oversampling method that uses the Hellinger distance as the reference standard.
Background art
Imbalanced data (Imbalance Data) are data whose classes are unevenly represented in the dataset. Taking a two-class problem as an example, when the ratio of majority-class to minority-class samples in a dataset exceeds the imbalance ratio threshold IR (Imbalance Ratio), the data are called imbalanced data. A dataset with an IR of 1.45 or 1.5 is generally still regarded as balanced. The imbalance ratio of a dataset is IR = N_maj / N_min, where N_maj and N_min are the numbers of majority-class and minority-class samples. For example, if the majority class has 50 samples and the minority class has 20, then IR = 50 / 20 = 2.5, and the data are imbalanced.
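The ratio in the example above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `imbalance_ratio` and the label strings are assumptions introduced here, and the ratio is taken as the largest class size divided by the smallest.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (size of the largest class) / (size of the smallest class)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# The example from the text: 50 majority-class and 20 minority-class samples.
labels = ["major"] * 50 + ["minor"] * 20
print(imbalance_ratio(labels))  # 2.5
```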
Imbalanced learning processes imbalanced data so that classification can achieve higher accuracy. There are many imbalanced-learning methods, for example cost-sensitive learning (Cost-Sensitive), sampling methods (Sampling Method), and ensemble learning (Ensemble Learning).
Oversampling (Oversampling) is one of the sampling methods in imbalanced learning; it mainly increases the number of samples in the minority class. A sample point is selected at random from the minority class as the reference point, a new sample point is synthesized from the reference point, and the new sample point is added to the minority class, increasing its size.
The Synthetic Minority Oversampling Technique, abbreviated SMOTE, is a synthetic sampling method that has proved effective in many fields. It starts from the existing minority-class samples, computes the similarity between samples in feature space, and then creates artificial synthetic samples.
The Hellinger distance (Hellinger Distance) is closely related to the Bhattacharyya distance. In probability and statistics, the Hellinger distance measures the similarity between two probability distributions. In imbalanced learning, it can be used to measure the degree of similarity between two classes and to compute the importance of attributes, which makes the present invention possible.
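As background, the Hellinger distance between two discrete probability distributions can be sketched as follows. This is the general textbook definition with the 1/sqrt(2) normalization, so that values lie in [0, 1]; it is an assumption that this convention matches the one used in the invention, and the function name `hellinger` is introduced here only for illustration.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions p and q.

    Both inputs are sequences of non-negative probabilities over the same
    outcomes, each summing to 1. The result lies in [0, 1]: 0 for identical
    distributions, 1 for distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # disjoint support
```

A small distance means the two distributions, and hence the two classes they describe, are similar and therefore harder to separate.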
Summary of the invention
Sample points synthesized by the existing oversampling technique SMOTE can lead to overlap, overfitting, and poor generalization. To address this, the present invention provides an oversampling method that uses the Hellinger distance as the reference standard. The method avoids sample overlap and improves the fitness and generalization of the sample points synthesized by SMOTE.
Idea of the invention: a sample point in a minority class is chosen pseudo-randomly as the reference point, and the Synthetic Minority Oversampling Technique is applied to the reference point to generate new sample points. Before a sample point is synthesized, the Hellinger distances between the minority class containing the reference point and every other class are computed to form a Hellinger distance matrix, and the minimum of each column of the matrix is taken. Each generated sample point is then added individually to the minority class, the Hellinger distances between that class and every other class are computed again to form a second Hellinger distance matrix, and the minimum of each of its columns is taken. Comparing the minimum Hellinger distances before and after synthesis determines the quality of the synthesized sample point.
Specific steps are as follows:
Step 1: Choose a reference point S in the minority class and, using the K nearest neighbors of S, randomly select a target point G, where S and G belong to the same class.
Step 2: Compute k1, the number of K nearest neighbors of the reference point S that belong to the same class as S, and k2, the number of K nearest neighbors of the target point G that belong to the same class as G. The number of nearest neighbors Kn is chosen according to the dataset; Kn = 10 is typical.
Step 3: Check whether the reference point S and the target point G meet the basic requirement. If k2 > k1, or k1 > Kn/2 and k2 > Kn/2, execute step 4; otherwise repeat steps 1 to 3 until the condition is met.
Step 4: Compute the Hellinger distances before synthesis: compute the Hellinger distances between the minority class containing the reference point S and each other class before a sample point is synthesized, forming the distance matrix M1.
Step 5: Take the minimum of each column of M1 to form the row vector R1.
Step 6: Compute the Hellinger distances after synthesis: add the newly generated sample point to the current minority class to form a new minority class C1, and compute the Hellinger distances between C1 and each other class, forming the distance matrix M2.
Step 7: Take the minimum of each column of M2 to form the row vector R2.
Step 8: Compare R1 and R2 to determine the quality of the newly synthesized sample point.
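Steps 1 to 3 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: Euclidean distance is assumed for the nearest-neighbor search, the function names are hypothetical, and `max_tries` is an added safeguard (the text simply repeats until the condition is met).

```python
import math
import random

def k_nearest(idx, X, k):
    """Indices of the k points nearest to X[idx] (Euclidean), excluding idx."""
    others = [j for j in range(len(X)) if j != idx]
    others.sort(key=lambda j: math.dist(X[idx], X[j]))
    return others[:k]

def same_class_count(idx, X, y, k):
    """Number of the k nearest neighbors of X[idx] that share its label."""
    return sum(1 for j in k_nearest(idx, X, k) if y[j] == y[idx])

def meets_requirement(k1, k2, kn):
    """Step 3 condition as stated: k2 > k1, or k1 > Kn/2 and k2 > Kn/2."""
    return k2 > k1 or (k1 > kn / 2 and k2 > kn / 2)

def pick_pair(X, y, minority, kn=10, rng=random, max_tries=1000):
    """Repeat steps 1-3 until a (reference, target) index pair qualifies.

    Returns None if no pair qualifies within max_tries (an assumption;
    the text does not bound the number of repetitions).
    """
    members = [i for i, label in enumerate(y) if label == minority]
    for _ in range(max_tries):
        s = rng.choice(members)                      # step 1: reference point S
        same = [j for j in k_nearest(s, X, kn) if y[j] == minority]
        if not same:
            continue                                 # no same-class neighbor to target
        g = rng.choice(same)                         # step 1: target point G
        k1 = same_class_count(s, X, y, kn)           # step 2
        k2 = same_class_count(g, X, y, kn)
        if meets_requirement(k1, k2, kn):            # step 3
            return s, g
    return None
```

The condition favors pairs in which the target point sits in a region at least as dominated by the minority class as the reference point, or in which both points are surrounded mostly by their own class.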
In the oversampling method of the present invention, the Hellinger distance measures the degree of similarity between two classes; the more similar two classes are, the harder they are to separate. By comparing the Hellinger distances before and after a sample point is synthesized, the method ensures that an accepted synthetic point maintains or reduces the similarity between the two classes, which improves the quality of newly synthesized sample points. The method is suitable for avoiding the overlap that SMOTE causes on specific multi-class imbalanced datasets, and it improves the fitness and generalization of the sample points synthesized by SMOTE.
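The construction of a distance matrix and its row vector of column minima can be sketched as below. Note the assumption: the per-attribute Hellinger distance is estimated here from equal-width histograms of each attribute, which is a common stand-in and not necessarily the confusion-matrix-based computation the invention describes. All function names and the `bins` parameter are hypothetical.

```python
import math

def attribute_hellinger(pos, neg, bins=10):
    """Hellinger distance between two 1-D samples via equal-width histograms."""
    values = list(pos) + list(neg)
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant attribute

    def hist(sample):
        counts = [0] * bins
        for v in sample:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(sample) for c in counts]

    p, q = hist(pos), hist(neg)
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)

def distance_matrix(X, y, minority, bins=10):
    """Matrix M: one row per other class, one column per attribute."""
    n_attr = len(X[0])
    pos = [x for x, label in zip(X, y) if label == minority]
    rows = []
    for other in sorted(set(y) - {minority}):
        neg = [x for x, label in zip(X, y) if label == other]
        rows.append([
            attribute_hellinger([x[a] for x in pos], [x[a] for x in neg], bins)
            for a in range(n_attr)
        ])
    return rows

def column_minima(M):
    """Row vector R: the minimum of each column of M."""
    return [min(col) for col in zip(*M)]
```

Calling `column_minima(distance_matrix(X, y, "c"))` before and after adding a candidate point to class c would give the two row vectors that the method compares.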
Brief description of the drawings
Fig. 1 is a flow chart of the specific steps of the embodiment of the present invention.
Fig. 2 is a diagram of the SMOTE technique used in the embodiment of the present invention.
Fig. 3 shows the result after SMOTE synthesizes sample points in the embodiment of the present invention.
Fig. 4 shows the result of the embodiment of the present invention with the Hellinger distance as the reference standard.
Fig. 5 shows the performance of the embodiment of the present invention under the precision (Precision) evaluation criterion.
Fig. 6 shows the performance of the embodiment of the present invention under the MAUC evaluation criterion.
Detailed description of the embodiment:
This embodiment uses the imbalanced datasets published on UCI (http://mlr.cs.umass.edu/ml/dataets.html) and KEEL (http://sci2s.ugr.es/keel/datasets.php); the oversampling method is the Synthetic Minority Oversampling Technique. Because the datasets on these platforms contain too many samples to present conveniently, this embodiment uses an artificially defined dataset D to illustrate the specific steps, as shown in Fig. 2. Dataset D has 4 attributes and consists of 4 classes: 2 majority classes a and b, with 19 and 20 samples respectively, and 2 minority classes c and d, with 5 and 10 samples respectively. K is set to 5 for the K nearest neighbors.
Specific steps are as follows:
Step 1: Sample point selection. Select a sample point S from the first minority class c as the reference point and, using the K nearest neighbors of S, randomly select a target point G, where G and S both belong to minority class c, as shown in Fig. 1.
Step 2: Compute k1, the number of K nearest neighbors of the reference point S that belong to the same class as S, and k2, the number of K nearest neighbors of the target point G that belong to the same class as G. Here the number of nearest neighbors is Kn = 5.
Step 3: Check whether the reference point S and the target point G meet the basic requirement. If k2 > k1, or k1 > 3 and k2 > 3, execute step 4; otherwise repeat steps 1 to 3 until the condition is met.
Step 4: Hellinger distance calculation. Compute the Hellinger distances between the minority class containing the reference point S and each other class before a sample point is synthesized, forming the distance matrix M1.
Step 4.1: The calculation formula and the construction of M1 are as follows.
Hellinger distance formula:
The minority class c currently under consideration is taken as the positive class, and the other classes a, b and d are taken as negative classes.
Here dH(tpr, fpr) denotes the Hellinger distance between the positive class t and the negative class f; tpr is the probability that the positive class is correctly classified as positive, and fpr is the probability that the negative class is correctly classified as negative.
Step 4.2: tpr and fpr can be computed from the confusion matrix. The confusion matrix is as follows:
Step 4.3: Compute the Hellinger distances between minority class c and classes a, b and d. Repeating steps 4.1 and 4.2 gives the Hellinger distances between c and a, b, d:
d(c, a) = [v_a1, v_a2, v_a3, v_a4]
d(c, b) = [v_b1, v_b2, v_b3, v_b4]
d(c, d) = [v_d1, v_d2, v_d3, v_d4]
where v_a1 denotes the Hellinger distance between positive class c and negative class a on the first attribute.
Step 4.4: Form the distance matrix M1 from the three vectors above. Each attribute has one corresponding Hellinger distance value; dataset D has 4 attributes, so each vector contains 4 Hellinger distance values.
Step 4.5: Compute the minimum of each column of M1 to obtain the row vector R1.
After the calculation, R1 = [v_m1, v_m2, v_m3, v_m4], where v_m1 denotes the minimum of the first column of M1.
Step 4.6: Synthesize a new sample point Sn with the Synthetic Minority Oversampling Technique. The synthesis formula is Sn = S + α * (G - S), where α takes values in (0, 1).
Step 4.7: Compute the Hellinger distances after synthesizing Sn. Put Sn into positive class c and execute steps 4.1 to 4.5 to form the matrix M2 and the row vector R2.
Step 4.8: Judge the quality of the synthesized point Sn. Compare R1 and R2: if R2 > R1, the synthesized point is of good quality and step 5 is executed; otherwise step 4.9 is executed.
Step 4.9: Keep Sn as the initial optimum, i.e. Sp = Sn. Execute steps 4.6 and 4.7 again to synthesize a sample point Sni with matrix M2i and row vector R2i. If R2i > R1, execute step 5; otherwise compare R2i with R2, and if R2i > R2, set R2 = R2i and Sp = Sni. Step 4.9 is executed at most max{i} times, after which step 5 is executed. Here Sp denotes the best synthesized sample point, Sni the sample point synthesized on the i-th attempt, M2i the Hellinger distance matrix obtained after the i-th synthesis, R2i the row vector formed by the column minima of M2i, i = 1, 2, 3, ..., 10, and max{i} the maximum value of i.
Step 5: Put Sn into Dn and repeat steps 1 to 4 until the number of samples in minority class c equals the number of samples in the largest majority class, i.e. the size of majority class b.
Step 6: Select the next minority class and execute steps 1 to 5, until all minority classes have been processed.
Step 7: Put the data in Dn into D to form a balanced dataset, as shown in Fig. 4.
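The synthesis and retry logic of steps 4.6 to 4.9 can be sketched as follows. Several points are assumptions made only for this illustration: the comparison "R2 > R1" between row vectors is read as a comparison of their sums (the text does not say how vectors are ordered), `score` is a hypothetical callable that returns the row vector R2 for a candidate point, and all function names are introduced here.

```python
import random

def smote_point(s, g, rng=random):
    """Step 4.6: Sn = S + alpha * (G - S), with alpha drawn from (0, 1)."""
    alpha = rng.random()
    while alpha == 0.0:  # random() can return 0.0; the text asks for (0, 1)
        alpha = rng.random()
    return [si + alpha * (gi - si) for si, gi in zip(s, g)]

def is_better(r_new, r_old):
    """Assumed reading of 'R2 > R1': compare the sums of the row vectors."""
    return sum(r_new) > sum(r_old)

def best_synthetic(s, g, r1, score, max_i=10, rng=random):
    """Steps 4.6-4.9: accept the first candidate whose R2 beats R1;
    otherwise return the best of the candidates tried (at most max_i retries)."""
    s_p = smote_point(s, g, rng)          # steps 4.6 and 4.7
    r2 = score(s_p)
    if is_better(r2, r1):                 # step 4.8
        return s_p
    for _ in range(max_i):                # step 4.9, at most max{i} times
        s_ni = smote_point(s, g, rng)
        r2i = score(s_ni)
        if is_better(r2i, r1):
            return s_ni
        if is_better(r2i, r2):
            r2, s_p = r2i, s_ni
    return s_p
```

Passing the scoring step in as a callable keeps the sketch independent of how the Hellinger distance matrix is actually computed.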

Claims (1)

1. An oversampling method using the Hellinger distance as the reference standard, characterized in that the specific steps of the judgment based on the Hellinger distance are:
(1) Compute the Hellinger distances between the minority class containing the reference point and each other class before a sample point is synthesized, forming the distance matrix M1;
(2) Take the minimum of each column of M1 to form the row vector R1;
(3) Synthesize new sample points; add each newly synthesized sample point individually to the minority class containing the reference point to form a new minority class C1, and compute the Hellinger distances between C1 and each other class, forming the distance matrix M2;
(4) Take the minimum of each column of M2 to form the row vector R2;
(5) Compare R1 and R2 to determine the quality of the newly generated sample point.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910147977.9A CN109871897A (en) 2019-02-28 2019-02-28 Oversampling method using the Hellinger distance as the reference standard

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910147977.9A CN109871897A (en) 2019-02-28 2019-02-28 Oversampling method using the Hellinger distance as the reference standard

Publications (1)

Publication Number Publication Date
CN109871897A (en) 2019-06-11

Family

ID=66919464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910147977.9A Oversampling method using the Hellinger distance as the reference standard 2019-02-28 2019-02-28 Withdrawn CN109871897A (en)

Country Status (1)

Country Link
CN (1) CN109871897A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113819959A (en) * 2021-11-24 2021-12-21 中国空气动力研究与发展中心设备设计与测试技术研究所 Suspension system anomaly detection method based on Hellinger distance and correlation coefficient

Similar Documents

Publication Publication Date Title
Chen et al. Distributed individuals for multiple peaks: A novel differential evolution for multimodal optimization problems
CN110147450B (en) Knowledge complementing method and device for knowledge graph
CN108491874B (en) Image list classification method based on generation type countermeasure network
CN106776351B (en) A kind of combined test use-case prioritization method based on One-test-at-a-time strategy
CN108763857A (en) A kind of process soft-measuring modeling method generating confrontation network based on similarity
García et al. Theoretical analysis of a performance measure for imbalanced data
CN107451619A (en) A kind of small target detecting method that confrontation network is generated based on perception
CN110334580A (en) The equipment fault classification method of changeable weight combination based on integrated increment
CN105095494B (en) The method that a kind of pair of categorized data set is tested
CN104331893B (en) Complex image multi-threshold segmentation method
CN105574547B (en) Adapt to integrated learning approach and device that dynamic adjusts base classifier weight
CN109163911A (en) A kind of fault of engine fuel system diagnostic method based on improved bat algorithm optimization ELM
CN104298893B (en) Imputation method of genetic expression deletion data
CN109871622A (en) A kind of low-voltage platform area line loss calculation method and system based on deep learning
CN109410179A (en) A kind of image abnormity detection method based on generation confrontation network
KR101680055B1 (en) Method for developing the artificial neural network model using a conjunctive clustering method and an ensemble modeling technique
CN110363229A (en) A kind of characteristics of human body's parameter selection method combined based on improvement RReliefF and mRMR
CN108875788A (en) A kind of SVM classifier parameter optimization method based on modified particle swarm optiziation
CN109583594A (en) Deep learning training method, device, equipment and readable storage medium storing program for executing
CN109657802A (en) A kind of Mixture of expert intensified learning method and system
CN106021524A (en) Working method for tree-augmented Navie Bayes classifier used for large data mining based on second-order dependence
CN109829494A (en) A kind of clustering ensemble method based on weighting similarity measurement
CN105740917B (en) The semi-supervised multiple view feature selection approach of remote sensing images with label study
CN109871897A (en) Oversampling method using the Hellinger distance as the reference standard
CN105138527B (en) A kind of data classification homing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190611