CN109871897A - Oversampling method using the Hellinger distance as the reference standard - Google Patents
- Publication number: CN109871897A
- Application number: CN201910147977.9A
- Authority: CN (China)
- Prior art keywords: sample point, Hellinger distance, point
- Legal status: Withdrawn (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses an oversampling method that uses the Hellinger distance as the reference standard. A sample point in a minority class is chosen pseudo-randomly as the reference point, and new sample points are synthesized with the SMOTE technique. Before synthesis, the Hellinger distances between the minority class containing the reference point and each of the other classes are calculated to form a Hellinger distance matrix, and the minimum of each column of the matrix is taken. Each newly generated sample point is then placed into the minority class on its own, the Hellinger distances between that minority class and the other classes are recalculated to form a second Hellinger distance matrix, and the minimum of each column is taken again. Comparing the two sets of minimum Hellinger distances determines the quality of the synthesized sample point. The invention improves the quality of newly synthesized sample points and avoids sample overlap while affecting the other classes as little as possible. It is suitable for improving the fitness and generalization of sample points synthesized by oversampling on specific two-class and multi-class imbalanced data sets.
Description
Technical field
The present invention relates to the field of oversampling techniques in imbalanced learning on specific data sets, and in particular to an oversampling method that uses the Hellinger distance as the reference standard.
Background technique
Imbalanced data (Imbalance Data) are data sets whose classes contain unequal numbers of samples. Taking a two-class problem as an example, when the ratio of majority-class to minority-class samples in a data set exceeds the imbalance ratio IR (Imbalance Ratio) threshold, the data are called imbalanced. A data set with IR equal to 1.45 or 1.5 is generally still regarded as balanced. The imbalance ratio of a data set is $IR = N_{maj} / N_{min}$, where $N_{maj}$ and $N_{min}$ are the numbers of majority-class and minority-class samples. For example, if the majority class has 50 samples and the minority class has 20, then $IR = 50 / 20 = 2.5$, so the data are imbalanced.
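The arithmetic in the example above can be checked with a few lines of Python. This is a minimal sketch: the function names and the default threshold of 1.5 are illustrative choices, not part of the patent text.

```python
def imbalance_ratio(class_counts):
    """Ratio of the largest class size to the smallest class size."""
    largest = max(class_counts.values())
    smallest = min(class_counts.values())
    return largest / smallest


def is_imbalanced(class_counts, threshold=1.5):
    """True when the ratio exceeds the chosen threshold (1.5 is an assumed default)."""
    return imbalance_ratio(class_counts) > threshold


# The example from the text: 50 majority samples and 20 minority samples.
counts = {"majority": 50, "minority": 20}
print(imbalance_ratio(counts), is_imbalanced(counts))
```

For a multi-class data set this sketch compares only the largest and smallest classes, which is one possible convention among several.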
Imbalanced learning processes imbalanced data so that classification can achieve higher prediction accuracy. There are many imbalanced-learning methods, for example cost-sensitive learning (Cost-Sensitive), sampling methods (Sampling Method), and ensemble learning (Ensemble Learning).
Oversampling (Oversampling) is one of the sampling methods in imbalanced learning; it mainly increases the number of minority-class samples. A sample point is chosen at random from the minority-class set as the reference point, a new sample point is synthesized from it, and the new sample point is added to the minority class to increase its size.
Synthetic Minority Oversampling Technique, abbreviated SMOTE, is a synthetic sampling method that has proved effective in many fields. Starting from the existing minority-class samples, it calculates the similarity between samples in feature space and then creates artificial synthetic samples.
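A rough sketch of SMOTE-style synthesis is given here for orientation. It assumes Euclidean distance as the similarity measure and a uniformly drawn interpolation weight; the helper names `k_nearest` and `smote_sample` are hypothetical. The interpolation itself follows the formula Sn = S + α * (G - S) stated later in this document.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance), excluding the point itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote_sample(minority, k=5, rng=random):
    """Synthesize one point between a random minority sample and one of its neighbours."""
    s = rng.choice(minority)                    # reference point S
    g = rng.choice(k_nearest(s, minority, k))   # target point G among the neighbours of S
    alpha = rng.random()                        # drawn from [0, 1); the text specifies (0, 1)
    return [si + alpha * (gi - si) for si, gi in zip(s, g)]


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote_sample(minority, k=2, rng=random.Random(0)))
```

Because the new point lies on the segment between two existing minority samples, it can fall inside a region occupied by another class, which is the overlap problem the invention targets.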
The Hellinger distance (Hellinger Distance) is closely related to the Bhattacharyya distance. In probability and statistics, the Hellinger distance measures the similarity between two probability distributions. In imbalanced learning, it can be used to measure the degree of similarity between two classes and to calculate the importance of attributes, which makes the present invention possible.
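For two discrete probability distributions P and Q over the same outcomes, the standard definition is $H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$. The sketch below implements that definition; the function name is illustrative, and some papers omit the $1/\sqrt{2}$ factor, in which case the maximum value is $\sqrt{2}$ rather than 1.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2.0)


print(hellinger([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # disjoint distributions
```

A value of 0 means the two distributions coincide, and larger values mean they are easier to tell apart.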
Summary of the invention
Sample points synthesized by the current oversampling technique SMOTE can cause overlap, overfitting, and poor generalization. To address this, the present invention provides an oversampling method that uses the Hellinger distance as the reference standard. The method avoids sample overlap and improves the fitness and generalization of the sample points synthesized by SMOTE.
The idea of the invention: a sample point in a minority class is chosen pseudo-randomly as the reference point, and a new sample point is generated from it with the Synthetic Minority Oversampling Technique. Before synthesis, the Hellinger distances between the minority class containing the reference point and each of the other classes are calculated to form a Hellinger distance matrix, and the minimum of each column of the matrix is taken. Each generated sample point is then placed into the minority class on its own, the Hellinger distances between that minority class and the other classes are calculated again to form a second Hellinger distance matrix, and the minimum of each column is taken. The minimum Hellinger distances before and after synthesis are compared to judge the quality of the synthesized sample point.
Specific steps are as follows:
Step 1: Choose a reference point S in a minority class and, using the K nearest neighbors of S, randomly select a target point G, where S and G belong to the same class.
Step 2: Calculate k1, the number of K nearest neighbors that belong to the same class as reference point S, and k2, the number of K nearest neighbors that belong to the same class as target point G. The number of nearest neighbors Kn is chosen according to the data set; Kn = 10 is typical.
Step 3: Judge whether reference point S and target point G meet the basic requirement. If k2 > k1, or k1 > Kn/2 and k2 > Kn/2, go to Step 4; otherwise repeat Steps 1, 2, and 3 until the condition is met.
Step 4: Calculate the Hellinger distances before synthesis: the Hellinger distances between the minority class containing reference point S and each of the other classes, forming the distance matrix M1.
Step 5: Take the minimum of each column of M1 to form the row vector R1.
Step 6: Calculate the Hellinger distances after synthesis: add the newly generated sample point to the current minority class to form a new minority class C1, and calculate the Hellinger distances between C1 and each of the other classes, forming the distance matrix M2.
Step 7: Take the minimum of each column of M2 to form the row vector R2.
Step 8: Compare R1 and R2 to determine the quality of the newly synthesized sample point.
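Steps 2 and 3 above can be sketched as follows. The sketch assumes Euclidean distance, assumes that neighbors are searched over the whole data set, and reads the condition as "k2 > k1, or (k1 > Kn/2 and k2 > Kn/2)"; the text does not state the grouping explicitly. All function names are hypothetical.

```python
import math


def same_class_neighbors(index, points, labels, kn):
    """Count how many of the kn nearest neighbours of points[index] share its label."""
    order = sorted(
        (j for j in range(len(points)) if j != index),
        key=lambda j: math.dist(points[index], points[j]),
    )
    return sum(labels[j] == labels[index] for j in order[:kn])


def pair_is_acceptable(k1, k2, kn):
    """Step 3 check, with the 'and' clause grouped first (an assumption)."""
    return k2 > k1 or (k1 > kn / 2 and k2 > kn / 2)


points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["c", "c", "c", "a", "a", "a"]
k1 = same_class_neighbors(0, points, labels, kn=3)  # reference point S = points[0]
k2 = same_class_neighbors(1, points, labels, kn=3)  # target point G = points[1]
print(k1, k2, pair_is_acceptable(k1, k2, kn=3))
```

A pair that fails the check is discarded and a new reference point and target point are drawn, as Step 3 describes.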
In the oversampling method of the present invention, the Hellinger distance measures the degree of similarity between two classes: the more similar two classes are, the harder they are to separate. Using the Hellinger distances before and after synthesis, the method ensures that a synthesized sample point, in addition to meeting the basic synthesis requirement, maintains or reduces the similarity between classes. This improves the quality of newly synthesized sample points, avoids the overlap that SMOTE produces on specific multi-class imbalanced data sets, and improves the fitness and generalization of the sample points that SMOTE synthesizes.
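The column-minimum and comparison steps (Steps 5, 7, and 8) can be sketched as below. The distance values are made up for illustration. The rule for comparing two row vectors is an assumption, since the text only says that R1 and R2 are compared: here R2 counts as better when no component decreases and at least one increases.

```python
def column_minima(matrix):
    """Minimum of each column of a row-major matrix (one row per other class)."""
    return [min(column) for column in zip(*matrix)]


def improves(r1, r2):
    """Assumed rule: r2 is at least r1 in every component and larger in at least one."""
    pairs = list(zip(r1, r2))
    return all(b >= a for a, b in pairs) and any(b > a for a, b in pairs)


# Illustrative distance matrices: rows are other classes, columns are attributes.
m1 = [[0.40, 0.70, 0.55], [0.50, 0.60, 0.65]]  # before synthesis
m2 = [[0.45, 0.70, 0.55], [0.50, 0.62, 0.70]]  # after adding one synthetic point
r1 = column_minima(m1)
r2 = column_minima(m2)
print(r1, r2, improves(r1, r2))
```

Taking the column minimum focuses on the closest other class for each attribute, so an accepted point does not bring the minority class nearer to its most similar neighbor.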
Brief description of the drawings
Fig. 1 is a flow chart of the specific steps of an embodiment of the present invention.
Fig. 2 shows the SMOTE technique used in an embodiment of the present invention.
Fig. 3 shows the result after SMOTE synthesizes sample points in an embodiment of the present invention.
Fig. 4 shows the result when the Hellinger distance is used as the reference standard in an embodiment of the present invention.
Fig. 5 shows the results of an embodiment of the present invention under the precision (Precision) evaluation criterion.
Fig. 6 shows the results of an embodiment of the present invention under the MAUC evaluation criterion.
Specific embodiment:
This embodiment uses the imbalanced data sets published on UCI (http://mlr.cs.umass.edu/ml/datasets.html) and KEEL (http://sci2s.ugr.es/keel/datasets.php); the oversampling method is the Synthetic Minority Oversampling Technique. Because the data sets on these platforms contain too many samples to display conveniently, the specific steps are illustrated with an artificially defined data set D, as shown in Fig. 2. Data set D has 4 attributes and consists of 4 classes: 2 majority classes, a and b, with 19 and 20 samples respectively, and 2 minority classes, c and d, with 5 and 10 samples respectively. K = 5 is used for the K nearest neighbors.
Specific steps are as follows:
Step 1: Sample point selection. Choose a sample point S from the first minority class c as the reference point and, using the K nearest neighbors of S, randomly select a target point G, where G and S both belong to minority class c, as shown in Fig. 1.
Step 2: Calculate k1, the number of K nearest neighbors that belong to the same class as reference point S, and k2, the number of K nearest neighbors that belong to the same class as target point G. Here Kn = 5.
Step 3: Judge whether reference point S and target point G meet the basic requirement. If k2 > k1, or k1 > 3 and k2 > 3, go to Step 4; otherwise repeat Steps 1, 2, and 3 until the condition is met.
Step 4: Hellinger distance calculation. Calculate the Hellinger distances between the minority class containing reference point S and each of the other classes before the sample point is synthesized, forming the distance matrix M1.
Step 4.1: The calculation formula and the construction of M1 are as follows.
Hellinger distance formula:
The minority class c currently under consideration is taken as the positive class, and the other classes a, b, and d as negative classes.
Here dH(tpr, fpr) denotes the Hellinger distance between positive class t and negative class f; tpr is the probability that the positive class is correctly classified as positive, and fpr is the probability that the negative class is correctly classified as negative.
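A distance on a pair of rates can be sketched as follows. The closed form used here, $d_H = \sqrt{(\sqrt{tpr} - \sqrt{fpr})^2 + (\sqrt{1 - tpr} - \sqrt{1 - fpr})^2}$, is the one commonly used in Hellinger-distance decision tree work. It is an assumption for illustration and is not taken from this text; the function name `hellinger_rates` is likewise hypothetical, and the inputs are simply treated as two rates in [0, 1].

```python
import math


def hellinger_rates(tpr, fpr):
    """Assumed closed form of d_H for two rates in [0, 1] (no 1/sqrt(2) normalisation)."""
    first = (math.sqrt(tpr) - math.sqrt(fpr)) ** 2
    second = (math.sqrt(1.0 - tpr) - math.sqrt(1.0 - fpr)) ** 2
    return math.sqrt(first + second)


print(hellinger_rates(0.5, 0.5))  # equal rates: the classes look the same on this attribute
print(hellinger_rates(0.9, 0.2))  # well-separated rates give a larger distance
```

Under this form the value ranges from 0 for identical rates to $\sqrt{2}$ when one rate is 1 and the other is 0.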
Step 4.2: tpr and fpr are calculated from the confusion matrix, which is as follows:
Step 4.3: Calculate the Hellinger distances between minority class c and classes a, b, and d. Repeating Steps 4.1 and 4.2 gives:
D(c, a) = [va1 va2 va3 va4]
D(c, b) = [vb1 vb2 vb3 vb4]
D(c, d) = [vd1 vd2 vd3 vd4]
where va1 is the Hellinger distance between positive class c and negative class a on the first attribute.
Step 4.4: Form the distance matrix M1 from the rows D(c, a), D(c, b), and D(c, d). Each attribute has one corresponding Hellinger distance value; data set D has 4 attributes, so each row contains 4 Hellinger distance values.
Step 4.5: Take the minimum of each column of M1 to obtain the row vector R1. The result is R1 = [vm1 vm2 vm3 vm4], where vm1 is the minimum of the first column of M1.
Step 4.6: Synthesize a new sample point Sn with the Synthetic Minority Oversampling Technique. The synthesis formula is Sn = S + α * (G - S), where α takes values in (0, 1).
Step 4.7: Calculate the Hellinger distances after synthesizing Sn. Put Sn into positive class c and execute Steps 4.1, 4.2, 4.3, 4.4, and 4.5 to form the matrix M2 and the row vector R2.
Step 4.8: Judge the quality of the synthesized point Sn. Compare R1 and R2; if R2 > R1, the synthesized point is of good quality and Step 5 is executed; otherwise Step 4.9 is executed.
Step 4.9: Keep Sn as the initial best value, i.e. Sp = Sn. Execute Steps 4.6 and 4.7 again to synthesize a sample point Sni with its M2i and R2i. If R2i > R1, execute Step 5; otherwise compare R2i with R2, and if R2i > R2, set R2 = R2i and Sp = Sni. Step 4.9 is executed at most max{i} times, after which Step 5 is executed. Here Sp is the best synthesized sample point, Sni is the sample point from the i-th synthesis, M2i is the Hellinger distance matrix obtained after the i-th synthesis, R2i is the row vector formed by the column minima of M2i, i = 1, 2, 3, ..., 10, and max{i} is the maximum value of i.
Step 5: Put Sn into Dn and repeat Steps 1, 2, 3, and 4 until the number of samples in minority class c equals that of the largest majority class, here majority class b.
Step 6: Select the next minority class and execute Steps 1, 2, 3, 4, and 5, until all minority classes have been processed.
Step 7: Put the data in Dn into D to form the balanced data set, as shown in Fig. 4.
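The accept-or-retry logic of Steps 4.6 to 4.9 can be sketched as follows. This is a minimal sketch under stated assumptions: `score` is a hypothetical callable that returns the row vector R2 for a candidate point (that is, it performs Steps 4.1 to 4.5 with the candidate added to the minority class), and the vector comparison rule in `improves` is assumed because the text only writes R2 > R1.

```python
import random


def interpolate(s, g, alpha):
    """Synthesis formula from Step 4.6: Sn = S + alpha * (G - S)."""
    return [si + alpha * (gi - si) for si, gi in zip(s, g)]


def improves(r_old, r_new):
    """Assumed rule: no component decreases and at least one increases."""
    pairs = list(zip(r_old, r_new))
    return all(n >= o for o, n in pairs) and any(n > o for o, n in pairs)


def pick_synthetic_point(s, g, r1, score, max_tries=10, draw_alpha=random.random):
    """Return the first candidate whose R2 beats R1, else the best of the candidates tried."""
    best_point = interpolate(s, g, draw_alpha())   # first synthesis (Step 4.6)
    best_r2 = score(best_point)                    # Step 4.7
    if improves(r1, best_r2):                      # Step 4.8
        return best_point
    for _ in range(max_tries):                     # Step 4.9, at most max{i} repetitions
        candidate = interpolate(s, g, draw_alpha())
        r2i = score(candidate)
        if improves(r1, r2i):
            return candidate
        if improves(best_r2, r2i):
            best_point, best_r2 = candidate, r2i
    return best_point


# Toy usage with a stand-in score that simply echoes the first coordinate.
point = pick_synthetic_point([0.0], [4.0], [0.5], lambda c: [c[0]], draw_alpha=lambda: 0.25)
print(point)
```

Passing `score` in as a callable keeps the retry logic separate from how the Hellinger distance matrix is actually computed.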
Claims (1)
1. An oversampling method using the Hellinger distance as the reference standard, characterized in that the specific steps of the judgment based on the Hellinger distance are:
(1) calculating, before a sample point is synthesized, the Hellinger distances between the minority class containing the reference point and each of the other classes, forming a distance matrix M1;
(2) taking the minimum of each column of M1 to form a row vector R1;
(3) synthesizing new sample points, and adding each newly synthesized sample point on its own to the minority class containing the reference point to form a new minority class C1; calculating the Hellinger distances between C1 and each of the other classes, forming a distance matrix M2;
(4) taking the minimum of each column of M2 to form a row vector R2;
(5) comparing R1 and R2 to determine the quality of the newly generated sample point.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910147977.9A CN109871897A (en) | 2019-02-28 | 2019-02-28 | Oversampling method using the Hellinger distance as the reference standard |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910147977.9A CN109871897A (en) | 2019-02-28 | 2019-02-28 | Oversampling method using the Hellinger distance as the reference standard |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109871897A true CN109871897A (en) | 2019-06-11 |
Family
ID=66919464
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910147977.9A Withdrawn CN109871897A (en) | 2019-02-28 | 2019-02-28 | Oversampling method using the Hellinger distance as the reference standard |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109871897A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113819959A (en) * | 2021-11-24 | 2021-12-21 | 中国空气动力研究与发展中心设备设计与测试技术研究所 | Suspension system anomaly detection method based on Hellinger distance and correlation coefficient |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20190611 |