CN109726821A - Data balancing method, device, computer readable storage medium and electronic equipment - Google Patents
- Publication number: CN109726821A (application CN201811427339.4A)
- Authority: CN (China)
- Prior art keywords: sample, value, new feature, feature, feature value
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Complex Calculations (AREA)
- Image Analysis (AREA)
Abstract
The present disclosure relates to a data balancing method, a device, a computer-readable storage medium, and an electronic device. The method includes: determining minority-class samples from a plurality of data samples; and oversampling the minority-class samples according to their probability distribution, so that the number of minority-class samples reaches a first threshold. Because samples are added according to the probability distribution of the minority-class samples, the added samples do not distort that distribution or compromise the authenticity of the minority-class samples, which in turn preserves the accuracy of any model subsequently trained on them.
Description
Technical field
The present disclosure relates to the field of data processing, and in particular to a data balancing method, a device, a computer-readable storage medium, and an electronic device.
Background technique
With the recent rapid development of artificial intelligence and machine learning, a variety of machine learning models have emerged. After being trained on large numbers of data samples, these models can be applied to many scenarios, such as prediction and classification, thereby achieving intelligent processing and meeting users' needs. One kind of machine learning model is the classification model: a user can feed the model a large number of data samples, divided into positive samples and negative samples, and train the model on them to obtain a classification model with a certain classification accuracy.
In practical applications, when a classification model is trained, the sample sets are often imbalanced; that is, the number of samples of one class is far smaller than that of the other class, so the model cannot learn the under-represented class in depth. It is therefore usually necessary to oversample the class with fewer samples to increase its count. In the prior art, however, oversampling is mostly performed by generating samples at random, which distorts the sample distribution and compromises the authenticity of the samples.
Summary of the invention
In order to solve the problems in the related art, the present disclosure provides a data balancing method, a device, a computer-readable storage medium, and an electronic device.
To achieve the above objects, a first aspect of the present disclosure provides a data balancing method, comprising:
determining minority-class samples from a plurality of data samples; and
oversampling the minority-class samples according to their probability distribution, so that the number of minority-class samples reaches a first threshold.
Optionally, each sample includes one or more features;
and oversampling the minority-class samples according to their probability distribution comprises:
generating a Gaussian distribution for each feature according to the initial mean and initial variance of that feature in the minority-class samples; and
for each feature in the minority-class samples, performing the following oversampling process:
generating a new feature value for the feature according to its Gaussian distribution, the new feature value serving as a first new feature value;
verifying the validity of the first new feature value;
if the first new feature value is verified to be invalid, deleting it, and otherwise retaining it; and
if the total number of values of the feature has not yet reached the first threshold, re-executing the oversampling process until the total number of values of the feature reaches the first threshold.
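In code, the oversampling loop above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation; `is_valid` is a placeholder for whichever validity check (the t-test/F-test or the preset-range test described below) is adopted.

```python
import random
import statistics

def oversample_feature(values, first_threshold, is_valid):
    """Grow a feature's value list to first_threshold by Gaussian sampling."""
    mu0 = statistics.mean(values)      # initial mean
    sigma0 = statistics.stdev(values)  # initial standard deviation
    values = list(values)
    while len(values) < first_threshold:
        # Draw a candidate (the "first new feature value") from the
        # Gaussian fitted to the original minority-class values.
        candidate = random.gauss(mu0, sigma0)
        if is_valid(values + [candidate], mu0, sigma0):
            values.append(candidate)   # retain the new feature value
        # otherwise the candidate is simply discarded (deleted)
    return values
```

With a validity check that always accepts, the loop simply fills the feature up to the threshold; a stricter check only slows the loop down, it never changes the target count.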
Optionally, verifying the validity of the first new feature value comprises:
calculating the current mean and current variance of the feature;
performing a t-test on the current mean and an F-test on the current variance;
when the total number of values of the feature has not yet reached the first threshold, determining the first new feature value to be invalid if the current mean fails the t-test and the current variance fails the F-test; and
when the total number of values of the feature has reached the first threshold, determining the first new feature value to be invalid if the current mean fails the t-test or the current variance fails the F-test.
Optionally, after the step of deleting the first new feature value if it is verified to be invalid and otherwise retaining it, the oversampling process further comprises:
if the current mean fails the t-test but the current variance passes the F-test, generating another new feature value for the feature, as a second new feature value, according to the following formula:
X = 2(E0 + C) − E1
where X is the second new feature value, E1 denotes the mean of the feature before the second new feature value is generated, E0 denotes the initial mean, and C is a constant; and
if the current mean passes the t-test but the current variance fails the F-test: when the current variance is smaller than the initial variance, deleting, from the generated feature values other than the first new feature value, the value closest to the initial mean and generating a third new feature value farthest from the initial mean; and when the current variance is greater than the initial variance, deleting, from the generated feature values other than the first new feature value, the value farthest from the initial mean and generating a third new feature value closest to the initial mean.
Optionally, verifying the validity of the first new feature value comprises:
determining the first new feature value to be invalid if it falls outside a preset range of the feature's Gaussian distribution, where the preset range is [initial mean − n × initial standard deviation, initial mean + n × initial standard deviation] and n is a number greater than zero.
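The preset-range check translates directly into code; a minimal sketch, with n supplied by the caller:

```python
def in_preset_range(value, initial_mean, initial_std, n=3.0):
    """Valid iff value lies within [mean - n*std, mean + n*std]."""
    low = initial_mean - n * initial_std
    high = initial_mean + n * initial_std
    return low <= value <= high
```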
Optionally, the method further comprises:
determining majority-class samples from the plurality of data samples; and
undersampling the majority-class samples so that the number of majority-class samples reaches a second threshold.
Optionally, undersampling the majority-class samples comprises:
determining the probability density of each of the majority-class samples; and
performing the following undersampling process:
determining a first sample, the first sample being any one of the majority-class samples;
determining, among the majority-class samples other than the first sample, the sample whose probability density is closest to that of the first sample as a second sample;
deleting the second sample; and
if the total number of majority-class samples has not yet reached the second threshold, re-executing the undersampling process until the total number of majority-class samples reaches the second threshold.
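The undersampling steps above can be sketched as follows for a one-dimensional feature. Estimating each sample's probability density with a single fitted Gaussian is an assumption made for illustration; the claim does not fix a density estimator.

```python
import math
import statistics

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def undersample(samples, second_threshold):
    """Shrink a 1-D majority-class sample list down to second_threshold."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    samples = list(samples)
    while len(samples) > second_threshold:
        density = {i: gaussian_pdf(x, mu, sigma) for i, x in enumerate(samples)}
        first = 0  # any sample may serve as the first sample
        # Second sample: the one (other than the first) whose probability
        # density is closest to the first sample's density.
        second = min((i for i in density if i != first),
                     key=lambda i: abs(density[i] - density[first]))
        del samples[second]
    return samples
```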
A second aspect of the present disclosure provides a data balancing device, comprising:
a first determining module, configured to determine minority-class samples from a plurality of data samples; and
an oversampling module, configured to oversample the minority-class samples according to their probability distribution, so that the number of minority-class samples reaches a first threshold.
Optionally, each sample includes one or more features, and the oversampling module includes:
a generating submodule, configured to generate a Gaussian distribution for each feature according to the initial mean and initial variance of that feature in the minority-class samples; and
an oversampling execution submodule, configured to perform, for each feature in the minority-class samples, the following oversampling process:
generating a new feature value for the feature according to its Gaussian distribution, the new feature value serving as a first new feature value;
verifying the validity of the first new feature value;
if the first new feature value is verified to be invalid, deleting it, and otherwise retaining it; and
if the total number of values of the feature has not yet reached the first threshold, re-executing the oversampling process until the total number of values of the feature reaches the first threshold.
Optionally, the device further includes:
a second determining module, configured to determine majority-class samples from the plurality of data samples; and
an undersampling module, configured to undersample the majority-class samples so that the number of majority-class samples reaches a second threshold.
Optionally, the undersampling module includes:
a determining submodule, configured to determine the probability density of each of the majority-class samples; and
an undersampling execution submodule, configured to perform the following undersampling process:
determining a first sample, the first sample being any one of the majority-class samples;
determining, among the majority-class samples other than the first sample, the sample whose probability density is closest to that of the first sample as a second sample;
deleting the second sample; and
if the total number of majority-class samples has not yet reached the second threshold, re-executing the undersampling process until the total number of majority-class samples reaches the second threshold.
A third aspect of the present disclosure provides a computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the steps of the method provided in the first aspect of the present disclosure.
A fourth aspect of the present disclosure provides an electronic device, comprising:
a memory having a computer program stored thereon; and
a processor, configured to execute the computer program in the memory to implement the steps of the method provided in the first aspect of the present disclosure.
Through the above technical solutions, minority-class samples are determined from a plurality of data samples and are oversampled according to their probability distribution so that the number of minority-class samples reaches a first threshold. Because samples are added according to the probability distribution of the minority-class samples, the added samples do not distort that distribution or compromise the authenticity of the minority-class samples, which in turn preserves the accuracy of any model subsequently trained on them.
Other features and advantages of the present disclosure will be described in detail in the detailed description section below.
Detailed description of the invention
The accompanying drawings are provided for a further understanding of the present disclosure and constitute a part of the specification. Together with the detailed description below, they serve to explain the present disclosure but do not limit it. In the drawings:
Fig. 1 is a flowchart of a data balancing method according to an exemplary embodiment.
Fig. 2 is a flowchart of a data balancing method according to another exemplary embodiment.
Fig. 3 is a schematic diagram of the Gaussian distribution of a CPU-usage feature according to an exemplary embodiment.
Fig. 4 is a flowchart of an oversampling process according to an exemplary embodiment.
Fig. 5 is a flowchart of a method for undersampling majority-class samples according to an exemplary embodiment.
Fig. 6 is a block diagram of a data balancing device according to an exemplary embodiment.
Fig. 7 is a block diagram of an electronic device according to an exemplary embodiment.
Specific embodiment
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely intended to describe and explain the present disclosure, not to limit it.
When a classification model is trained, the positive and negative samples may be imbalanced, so their numbers need to be balanced. In practice, data-sample balancing mainly comprises two methods: oversampling and undersampling. Oversampling increases the number of minority-class samples, essentially by generating new samples, while undersampling reduces the number of majority-class samples, essentially by deleting samples. In practical applications, data-sample balancing can therefore be achieved by oversampling the minority-class samples or by undersampling the majority-class samples. In the prior art, only one of oversampling and undersampling is used to balance the data samples. At the initial stage of model training, however, it is difficult for a user to judge whether oversampling or undersampling should be selected to balance the data samples, and performing the balancing with only one of them makes it take longer. To solve this technical problem, the present disclosure provides a data balancing method, a device, a computer-readable storage medium, and an electronic device that balance data samples automatically and quickly.
Please refer to Fig. 1, which is a flowchart of a data balancing method according to an exemplary embodiment. As shown in Fig. 1, the method may include the following steps.
In step 11, minority-class samples and majority-class samples are respectively determined from a plurality of data samples.
In the present disclosure, the plurality of data samples includes two classes of samples. The number of samples of each class is counted; the class with fewer samples is determined to be the minority-class samples, and the class with more samples is determined to be the majority-class samples.
Illustratively, the plurality of data samples may include positive samples and negative samples, where each data sample belonging to the positive class is labeled 0 in advance and each data sample belonging to the negative class is labeled 1 in advance. The minority-class and majority-class samples can then be determined by counting the occurrences of label 0 and label 1. For example, suppose the number of label 0 is n, the number of label 1 is m, and n is less than m; then the positive samples are the minority-class samples and the negative samples are the majority-class samples.
In step 12, the minority-class samples are oversampled so that the number of minority-class samples reaches a first threshold.
As described above, oversampling generates data. Therefore, in the present disclosure, in order to increase the number of minority-class samples and narrow the gap between it and the number of majority-class samples, the minority-class samples can be oversampled so that their number reaches a first threshold. The first threshold may be a value set by the user; it is a positive integer and should be greater than the number of minority-class samples before oversampling.
In step 13, the majority-class samples are undersampled so that the number of majority-class samples reaches a second threshold.
Correspondingly, in order to reduce the number of majority-class samples and narrow the gap between it and the number of minority-class samples, the majority-class samples are undersampled so that their number reaches a second threshold. Similarly, the second threshold may be a value set by the user; it is a positive integer and should be less than the number of majority-class samples before undersampling. Depending on the user's actual needs, the first threshold and the second threshold may be the same value or different values; the present disclosure places no particular limitation on this.
With the above technical solution, the user does not need to decide during data balancing whether to choose oversampling or undersampling: the scheme can combine the two, achieving automatic balancing of the majority-class and minority-class samples and improving the accuracy of the trained model. Moreover, because the minority-class samples can be oversampled and the majority-class samples undersampled at the same time, the balance between the number of minority-class samples and the number of majority-class samples can be reached quickly, reducing the time spent on data balancing.
In one embodiment of step 12, oversampling the minority-class samples may be performed by randomly generating samples among the minority-class samples; illustratively, an existing minority-class sample may be randomly duplicated as the newly generated sample. Generating samples in this way, however, can cause problems. For example, if the duplicated sample is one that occurs with low probability among the minority-class samples, the distribution of the minority-class samples is distorted and their authenticity is compromised, which in turn affects the accuracy of the subsequently trained model.
To avoid this problem, the present disclosure also provides another data balancing method. Please refer to Fig. 2, which is a flowchart of a data balancing method according to another exemplary embodiment. As shown in Fig. 2, the method may include the following steps.
In step 21, minority-class samples are determined from a plurality of data samples.
For a specific implementation of step 21, reference may be made to the method of respectively determining minority-class samples and majority-class samples from a plurality of data samples described above with reference to Fig. 1.
In step 22, the minority-class samples are oversampled according to their probability distribution, so that the number of minority-class samples reaches a first threshold.
Because samples are added according to the probability distribution of the minority-class samples, the added samples do not distort that distribution or compromise the authenticity of the minority-class samples, which in turn preserves the accuracy of the subsequently trained model.
Specifically, in the present disclosure each sample includes one or more features, and the method of oversampling the minority-class samples according to their probability distribution may include the following.
First, a Gaussian distribution is generated for each feature according to the initial mean and initial variance of that feature in the minority-class samples.
In the present disclosure, each of the plurality of data samples includes one or more features, where different features characterize different classes of information in the data sample. For example, when the data samples are used to detect whether a computer has failed, each sample includes at least a CPU (Central Processing Unit) usage and a memory usage; the CPU usage characterizes the usage information of the CPU in that sample, and the memory usage characterizes the usage information of the memory in that sample.
For each feature in the minority-class samples, the mean and variance of the feature are calculated. Since the minority-class samples here are the samples before oversampling, the above mean and variance may be called the initial mean and initial variance, to distinguish them from the mean and variance of each feature in the minority-class samples after new samples are added. A Gaussian distribution is then generated for each feature from its initial mean and initial variance; the Gaussian distribution of a feature reflects the probability of occurrence of each value of that feature in the minority-class samples. Calculating the mean and variance of a feature and generating a Gaussian distribution from the mean and variance are common knowledge and are not described again here.
Suppose the minority-class samples are samples in which a computer has failed; correspondingly, each minority-class sample includes a CPU-usage feature and a memory-usage feature. A Gaussian distribution of the CPU-usage feature can then be generated from the initial mean and initial variance of that feature in the minority-class samples, and a Gaussian distribution of the memory-usage feature can be generated from the initial mean and initial variance of the memory-usage feature. Taking the Gaussian distribution of the CPU-usage feature as an example, the initial mean μ0 and initial variance σ0² of the CPU-usage feature are calculated, and the Gaussian distribution of the feature is generated from μ0 and σ0². The Gaussian distribution of the CPU-usage feature may, for example, be as shown in Fig. 3, where the abscissa indicates CPU usage and the ordinate indicates probability.
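As a concrete sketch of this step (with made-up CPU-usage figures, and NumPy assumed available), the initial mean and variance parameterize the Gaussian from which new candidate values are drawn:

```python
import numpy as np

# Hypothetical CPU-usage values (%) from the minority-class (fault) samples.
cpu_usage = np.array([91.0, 88.5, 94.2, 90.1, 86.7, 92.3])

mu0 = cpu_usage.mean()        # initial mean
var0 = cpu_usage.var(ddof=1)  # initial (sample) variance
sigma0 = np.sqrt(var0)

# The fitted Gaussian; new candidate CPU-usage values are drawn from it.
rng = np.random.default_rng(seed=0)
first_new_value = rng.normal(mu0, sigma0)
```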
Then, after the Gaussian distribution of each feature has been generated, the following oversampling process is performed for each feature in the minority-class samples; as shown in Fig. 4, the oversampling process may include the following steps.
In step 41, a new feature value is generated according to the Gaussian distribution of the feature, the new feature value serving as a first new feature value.
When samples are added to the minority class, the features included in the added samples must be consistent with the features included in every sample of the minority class. For example, when each sample includes a CPU-usage feature and a memory-usage feature, the added samples must also include a CPU-usage feature and a memory-usage feature. In the present disclosure, therefore, samples can be added by separately adding feature values to each feature.
Specifically, for each feature in the minority-class samples, a new feature value is generated at random according to the Gaussian distribution of that feature, and the generated value serves as the first new feature value. For example, a CPU-usage value is generated at random along the abscissa of the Gaussian distribution of the CPU-usage feature shown in Fig. 3, and the generated CPU-usage value is the first new feature value.
In step 42, the validity of the first new feature value is verified.
Although the first new feature value is generated according to the Gaussian distribution, the generated value may differ considerably from the initial mean and could still compromise the authenticity of the feature. Therefore, after the first new feature value is generated, its validity must also be verified; that is, it is verified whether adding the first new feature value to the feature would destroy the feature's authenticity. If adding the first new feature value to the feature would not destroy its authenticity, the first new feature value is verified to be valid; otherwise it is verified to be invalid. Illustratively, whether the first new feature value destroys the feature's authenticity can be verified by judging whether the difference between the feature's current mean, after the first new feature value is added, and its initial mean falls within a preset range. When the difference falls within the preset range, the first new feature value is verified not to destroy the feature's authenticity; otherwise, it is verified to destroy it.
In step 43, if the first new feature value is verified to be invalid, the first new feature value is deleted.
In step 44, if the first new feature value is verified to be valid, the first new feature value is retained.
When adding the first new feature value to a feature of the minority-class samples would destroy the feature's authenticity, the value is verified to be invalid and is deleted. When adding it to the feature would not destroy the feature's authenticity, the value is verified to be valid and is retained.
Finally, it is judged whether the total number of values of the feature has reached the first threshold.
When the minority-class samples are oversampled so that their number reaches the first threshold, the total number of values of each feature in the minority-class samples must also reach the first threshold. Therefore, each time a generated new feature value is retained or deleted, it is judged whether the total number of values of the feature has reached the first threshold. When the first threshold has been reached, generation of new values for the feature stops; otherwise, steps 41–44 of the oversampling process are re-executed until the total number of values of the feature reaches the first threshold.
In addition, in one embodiment, when the first new feature value has been retained and the total number of values of the feature has not yet reached the first threshold, the fact that the Gaussian distribution curve is axially symmetric, with axis of symmetry L = initial mean, can be exploited to generate another new feature value symmetrical to the first new feature value; this other new feature value is also valid.
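The symmetric companion value follows directly from the symmetry of the Gaussian about its mean; as a one-line sketch:

```python
def mirror_value(first_new_value, initial_mean):
    """Reflect a valid new feature value across the Gaussian's axis of
    symmetry (L = initial mean); the mirrored value has the same density."""
    return 2 * initial_mean - first_new_value
```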
It should be noted that when a sample includes multiple features, the oversampling process may proceed by generating one new value in the Gaussian distribution of each feature in turn, until the total number of values of every feature reaches the first threshold; or it may first generate new values in the Gaussian distribution of one feature until that feature's total reaches the first threshold, and then generate new values in the Gaussian distributions of the other features until the totals of all features reach the first threshold. Oversampling may also proceed in other orders; the embodiments of the present disclosure place no particular limitation on this.
With the above method, it can be ensured that adding values to each feature affects neither the distribution nor the authenticity of the feature, which in turn ensures that the added samples do not compromise the authenticity of the minority-class samples.
In one possible embodiment of the present disclosure, step 42 may include the following.
First, the current mean and current variance of the feature are calculated, where the current mean and current variance refer to the mean and variance of the feature after the first new feature value is generated.
Then, a t-test is performed on the current mean and an F-test is performed on the current variance. Specifically, the t-test on the current mean mainly examines whether the difference between the current mean and the initial mean is significant: when the difference is significant, the t-test fails; otherwise it passes. The F-test on the current variance mainly examines whether there is a significant difference between the current variance and the initial variance: when there is a significant difference, the F-test fails; otherwise it passes.
Then, when the total number of values of the feature has not yet reached the first threshold, the first new feature value is verified to be invalid if the current mean fails the t-test and the current variance fails the F-test. When the total number of values of the feature has reached the first threshold, the first new feature value is verified to be invalid if the current mean fails the t-test or the current variance fails the F-test. In other words, after the first new feature value is generated: before the first threshold is reached, the value is invalid only when both tests fail; once the first threshold is reached, the value is invalid when at least one of the t-test and the F-test fails.
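One plausible reading of these two checks — an assumption, since the patent does not pin down the test statistics — is a one-sample t-test of the current values against the initial mean, and a two-sided F-ratio test of current versus initial variance. With SciPy (assumed available):

```python
import numpy as np
from scipy import stats

def mean_passes_t(current_values, initial_mean, alpha=0.05):
    """t-test passes when the current mean is NOT significantly
    different from the initial mean."""
    _, p = stats.ttest_1samp(current_values, popmean=initial_mean)
    return p >= alpha

def variance_passes_f(current_values, initial_values, alpha=0.05):
    """Two-sided F-test on the ratio of sample variances; passes when
    the variances do NOT differ significantly."""
    f = np.var(current_values, ddof=1) / np.var(initial_values, ddof=1)
    dfn, dfd = len(current_values) - 1, len(initial_values) - 1
    p = 2 * min(stats.f.cdf(f, dfn, dfd), 1 - stats.f.cdf(f, dfn, dfd))
    return p >= alpha
```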
In one embodiment, when the total number of values of the feature has not yet reached the first threshold and the first new feature value has been retained in the feature, another new feature value can be generated according to the oversampling process described above with reference to Fig. 2, until the total number of values of the feature reaches the first threshold.
In another embodiment, it is considered that, when the total number of values of the feature has not yet reached the first threshold, the first new feature value is verified to be invalid and deleted only when the current mean fails the t-test and the current variance fails the F-test. When exactly one of the t-test and the F-test fails, the first new feature value is still retained, which may at that point leave the feature's current mean and current variance inconsistent with its initial mean and initial variance, although the difference is not significant.
In the disclosure, in order to further reduce the difference between current mean value, current variance and initial mean value, initial variance
It is different, in the current mean value, current variance and inconsistent initial mean value, initial variance of this feature, by being distributed the method supplied
Keep the current mean value, current variance and initial mean value, initial variance that generate this feature after the second new feature value as consistent as possible,
Eliminate above-mentioned difference.
Specifically, after step 44, the above oversampling process may further include:

If the current mean fails the t-test but the current variance passes the F-test, another new feature value of the feature, serving as the second new feature value, is generated according to the following formula:

X = 2(E0 + C) - E1    (1)

where X is the second new feature value, E1 denotes the mean of the feature before the second new feature value is generated, E0 denotes the initial mean, and C is a constant.
When the t-test fails and the F-test passes, it indicates that the difference between the current mean and the initial mean is significant while the current variance is consistent with the initial variance. In this case, to reduce the difference between the current mean and the initial mean, a second new feature value may be generated to eliminate the influence of the first new feature value on the feature. Illustratively, the second new feature value may be generated by the above formula (1), where C characterizes the difference between the current mean and the initial mean that the user can accept: the smaller the value of C, the smaller the difference between the current mean of the feature and the initial mean after the second new feature value is generated.
In this way, generating the second new feature value by formula (1) ensures that the difference between the current mean of the feature after the second new feature value is generated and the initial mean meets the user's requirement.
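Formula (1) can be illustrated with a small numeric example (the function name is ours, not the disclosure's):

```python
def second_new_feature_value(init_mean, current_mean, c=0.0):
    """Formula (1) of the disclosure: X = 2*(E0 + C) - E1, where E0 is the
    initial mean, E1 the current mean of the feature before this value is
    generated, and C the user-chosen constant bounding the acceptable
    residual mean shift."""
    return 2.0 * (init_mean + c) - current_mean

# Example: the mean drifted from an initial 5.0 up to 5.4. With C = 0 the
# generated value 4.6 mirrors the drift to the other side of E0, pulling
# the current mean back toward the initial mean.
x = second_new_feature_value(5.0, 5.4)
```

Intuitively, the generated value lies as far below E0 + C as the current mean lies above it, so appending it moves the feature's mean back toward the initial mean.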
If the current mean passes the t-test but the current variance fails the F-test, then: when the current variance is less than the initial variance, the feature value other than the first new feature value that is closest to the initial mean is deleted from the generated feature values, and a third new feature value farthest from the initial mean is generated; when the current variance is greater than the initial variance, the feature value other than the first new feature value that is farthest from the initial mean is deleted from the generated feature values, and a third new feature value closest to the initial mean is generated.
If the current mean passes the t-test but the current variance fails the F-test, this indicates that the current mean is almost consistent with the initial mean while the difference between the current variance and the initial variance is significant. In this case, in order to reduce the difference between the current variance and the initial variance, another feature value other than the first new feature value needs to be deleted from the generated feature values, and a third new feature value regenerated.
According to the properties of the Gaussian distribution, the larger the variance, the flatter the Gaussian curve and the more dispersed the probability distribution; the smaller the variance, the steeper the curve and the more concentrated the probability distribution. When the current variance is greater than the initial variance, the probability distribution of the feature after the first new feature value is generated is more dispersed than the probability distribution of the feature before the first new feature value was generated. Therefore, the feature value other than the first new feature value that is farthest from the initial mean may be deleted from the generated feature values, and a third new feature value closest to the initial mean generated, so that the probability distribution of the feature after the third new feature value is generated is more consistent with the probability distribution of the feature before the first new feature value was generated.
When the current variance is less than the initial variance, the probability distribution of the feature after the first new feature value is generated is more concentrated than the probability distribution of the feature before the first new feature value was generated. Therefore, the feature value other than the first new feature value that is closest to the initial mean may be deleted from the generated feature values, and a third new feature value farthest from the initial mean generated, so that the probability distribution of the feature after the third new feature value is generated is more consistent with the probability distribution of the feature before the first new feature value was generated.
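The delete-and-regenerate step can be sketched as below. The disclosure does not say how the "nearest"/"farthest" replacement value is drawn; drawing it by rejection sampling at one initial standard deviation is an illustrative assumption, as is protecting the first new feature value by its index.

```python
import numpy as np

def correct_variance(values, init_mean, init_std, protected_idx, rng=None):
    """Third-new-feature-value step (sketch, with assumed details).

    If the current variance is too small, delete the generated value
    closest to the initial mean and add one far from it; if too large,
    delete the farthest value and add one near it. The first new feature
    value at `protected_idx` is never deleted.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    vals = list(values)
    current_var = np.var(vals, ddof=1)
    cands = [(abs(v - init_mean), i) for i, v in enumerate(vals)
             if i != protected_idx]
    if current_var < init_std ** 2:
        _, idx = min(cands)   # too concentrated: drop the value nearest the mean
        far = True            # ...and add a replacement far from it
    else:
        _, idx = max(cands)   # too dispersed: drop the value farthest from the mean
        far = False           # ...and add a replacement near it
    vals.pop(idx)
    new = rng.normal(init_mean, init_std)
    # rejection-sample until the replacement lands in the desired region
    while (abs(new - init_mean) < init_std) == far:
        new = rng.normal(init_mean, init_std)
    vals.append(new)
    return vals
```

For example, with values [4.9, 5.0, 5.1, 7.0] around an initial mean of 5.0 (index 3 protected), the sample variance exceeds the initial variance of 1.0, so one of the near-mean duplicates is dropped and a replacement within one standard deviation of 5.0 is appended.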
With the above scheme, the probability that a generated first new feature value is retained for the feature is improved and the duration of the oversampling process is reduced; moreover, the distribution-compensation approach eliminates the influence of the first new feature value on the current mean and current variance of the feature, further ensuring the authenticity of the minority-class samples after oversampling.
In addition to verifying the validity of the first new feature value by performing a t-test on the current mean and an F-test on the current variance as described above, in an alternative embodiment the validity of the first new feature value may be verified by judging the relationship between the generated first new feature value and a preset range of the Gaussian distribution of the feature.
Specifically, step 42 above may include: if the first new feature value falls outside the preset range of the Gaussian distribution of the feature, the first new feature value is verified as invalid, where the preset range is [the initial mean - n * the initial standard deviation, the initial mean + n * the initial standard deviation] and n is a value greater than zero.
According to the properties of the Gaussian distribution, the area over an interval of the abscissa reflects the percentage of the total number of feature values falling in that interval, that is, the probability that a feature value falls in the interval. For example, the area over the interval [mean - standard deviation, mean + standard deviation] is about 68.3%, the area over the interval [mean - 2.58 * standard deviation, mean + 2.58 * standard deviation] is about 99.0%, and so on. Therefore, in order to ensure that the generated feature values all lie in the high-probability region of the Gaussian distribution, and to avoid generated feature values that amount to "small-probability events", in the present disclosure a feature value falling outside the interval [initial mean - n * initial standard deviation, initial mean + n * initial standard deviation] may be verified as invalid, where n may be, for example, 3.
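This alternative check is a simple interval test; a minimal sketch (the function name is ours):

```python
def in_preset_range(x, init_mean, init_std, n=3.0):
    """Alternative validity check of the disclosure: a generated feature
    value is valid only if it lies within the preset range
    [init_mean - n*init_std, init_mean + n*init_std], with n = 3 as the
    example value given in the text."""
    return init_mean - n * init_std <= x <= init_mean + n * init_std
```

With an initial mean of 5.0 and initial standard deviation of 1.0, the default range is [2.0, 8.0], so 7.5 is accepted while 8.5 would be verified as invalid.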
By the above method, it can be ensured that adding feature values to each feature does not affect the authenticity of the feature, thereby ensuring that the augmented samples do not affect the authenticity of the minority-class samples.
The oversampling process has been described in detail above. Undersampling is described in detail below with reference to Fig. 5. As shown in Fig. 5, step 13 in Fig. 1 may include the following steps.
In step 131, the probability density of each sample in the majority-class samples is determined.
Specifically, the initial mean and initial variance of each feature in the majority-class samples may be calculated, and the probability density of each feature value in each sample computed from the initial mean, the initial variance, and the probability density formula

f(x) = (1 / (√(2π) · σ0)) · exp(−(x − μ0)² / (2 · σ0²))

The sum of the probability densities of the feature values in a sample is then determined as the probability density of the sample. In the above formula, x denotes a feature value, f(x) denotes the probability density of the feature value, μ0 denotes the initial mean, and σ0² denotes the initial variance.
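The per-sample density defined above (the sum of the per-feature Gaussian densities) can be computed directly; the function and parameter names are ours:

```python
import math

def sample_density(sample, init_means, init_vars):
    """Probability density of a sample as defined in the disclosure: the
    SUM over features of f(x) = exp(-(x - mu0)^2 / (2*sigma0^2))
    / (sqrt(2*pi) * sigma0), each feature evaluated under its own
    initial mean mu0 and initial variance sigma0^2."""
    total = 0.0
    for x, mu, var in zip(sample, init_means, init_vars):
        sigma = math.sqrt(var)
        total += math.exp(-(x - mu) ** 2 / (2 * var)) / (math.sqrt(2 * math.pi) * sigma)
    return total
```

For a single standard-normal feature evaluated at its mean this yields 1/√(2π) ≈ 0.3989; each additional feature sitting at its own initial mean adds another such term.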
After the probability density of each sample is determined, the following undersampling process may be executed for the majority-class samples:
In step 132, a first sample is determined.

The first sample is any sample in the majority-class samples; a sample may therefore be chosen at random from the majority-class samples as the first sample.
In step 133, among the other majority-class samples excluding the first sample, the sample whose probability density is closest to that of the first sample is determined as a second sample.
In step 134, the second sample is deleted.
When undersampling the majority-class samples, part of the samples are deleted from among duplicated samples or samples that are close to one another, so that the distribution of the majority-class samples after deletion remains consistent with the distribution before deletion. Therefore, in the present disclosure, after the first sample is determined, the sample whose probability density is closest to that of the first sample is determined, according to the probability density of the first sample and of each sample in the majority-class samples, from among the other majority-class samples excluding the first sample; that sample is taken as the second sample and deleted.
In step 135, it is judged whether the total number of the majority-class samples has reached the second threshold.

After the second sample is deleted, if the total number of the majority-class samples has reached the second threshold, deletion of samples may stop; otherwise, steps 132-135 are re-executed until the total number of the majority-class samples reaches the second threshold.
Sampling in the above manner avoids changing the distribution of the majority-class samples after samples are deleted, thereby ensuring the authenticity of the majority-class samples.
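Steps 132-135 can be sketched as a loop over parallel lists of samples and their precomputed densities (names and tie-breaking are our assumptions):

```python
import random

def undersample(majority, densities, second_threshold, seed=0):
    """Sketch of the undersampling loop of steps 132-135: pick a random
    first sample, delete the OTHER sample whose probability density is
    closest to the first sample's, and repeat until the majority class
    shrinks to the second threshold. `majority` and `densities` are
    parallel lists."""
    rnd = random.Random(seed)
    samples = list(zip(majority, densities))
    while len(samples) > second_threshold:
        i = rnd.randrange(len(samples))            # step 132: first sample
        _, d_first = samples[i]
        # step 133: second sample = nearest density among the others
        j = min((k for k in range(len(samples)) if k != i),
                key=lambda k: abs(samples[k][1] - d_first))
        samples.pop(j)                             # step 134: delete it
    return [s for s, _ in samples]                 # step 135: threshold reached
```

Note the deleted sample is always the density-nearest neighbor, never the randomly chosen first sample itself, matching the description above.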
Based on the same inventive concept, the present disclosure also provides a data balancing device. Referring to Fig. 6, Fig. 6 is a block diagram of a data balancing device according to an exemplary embodiment. As shown in Fig. 6, the device may include:
a first determining module 61, configured to determine minority-class samples from a plurality of data samples; and

an oversampling module 62, configured to oversample the minority-class samples according to the probability distribution of the minority-class samples, so that the number of the minority-class samples reaches a first threshold.
Optionally, each sample includes one or more features, and the oversampling module may include:

a generating submodule, configured to generate the Gaussian distribution of each feature according to the initial mean and initial variance of each feature in the minority-class samples; and

an oversampling execution submodule, configured to execute the above oversampling process for each feature in the minority-class samples and, if the total number of values of the feature has not yet reached the first threshold, re-execute the oversampling process until the total number of values of the feature reaches the first threshold.
Optionally, the device further includes:

a second determining module, configured to determine majority-class samples from the plurality of data samples; and

an undersampling module, configured to undersample the majority-class samples so that the number of the majority-class samples reaches a second threshold.
Optionally, the undersampling module may include:

a determining submodule, configured to determine the probability density of each sample in the majority-class samples; and

an undersampling execution submodule, configured to execute the above undersampling process and, if the total number of the majority-class samples has not yet reached the second threshold, re-execute the undersampling process until the total number of the majority-class samples reaches the second threshold.
As to the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method, and will not be elaborated here.
Fig. 7 is a block diagram of an electronic device 700 according to an exemplary embodiment. As shown in Fig. 7, the electronic device 700 may include a processor 701 and a memory 702, and may further include one or more of a multimedia component 703, an input/output (I/O) interface 704, and a communication component 705.

The processor 701 is configured to control the overall operation of the electronic device 700 so as to complete all or part of the steps of the above data balancing method. The memory 702 is configured to store various types of data to support operation of the electronic device 700; such data may include, for example, instructions of any application or method operated on the electronic device 700, as well as application-related data such as contact data, transmitted and received messages, pictures, audio, video, and so on. The memory 702 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (Static Random Access Memory, SRAM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), programmable read-only memory (Programmable Read-Only Memory, PROM), read-only memory (Read-Only Memory, ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 703 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is configured to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may further be stored in the memory 702 or transmitted through the communication component 705. The audio component further includes at least one speaker for outputting audio signals. The I/O interface 704 provides an interface between the processor 701 and other interface modules, where the other interface modules may be a keyboard, a mouse, buttons, and so on; the buttons may be virtual buttons or physical buttons. The communication component 705 is configured for wired or wireless communication between the electronic device 700 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (Near Field Communication, NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 705 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic device 700 may be implemented by one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (Digital Signal Processing Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field Programmable Gate Array, FPGA), controllers, microcontrollers, microprocessors, or other electronic elements, for executing the above data balancing method.
In a further exemplary embodiment, a computer-readable storage medium including program instructions is also provided, where the program instructions, when executed by a processor, implement the steps of the above data balancing method. For example, the computer-readable storage medium may be the above memory 702 including program instructions, and the program instructions may be executed by the processor 701 of the electronic device 700 to complete the above data balancing method.
The preferred embodiments of the present disclosure have been described in detail above with reference to the drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, various simple variants may be made to the technical solution of the present disclosure, and these simple variants all belong to the protection scope of the present disclosure.

It is further noted that the specific technical features described in the above specific embodiments may, in the absence of contradiction, be combined in any appropriate manner. In order to avoid unnecessary repetition, the various possible combinations are not further described in the present disclosure.

In addition, the various different embodiments of the present disclosure may also be combined in any manner; as long as such combinations do not depart from the idea of the present disclosure, they should likewise be regarded as content disclosed by the present disclosure.
Claims (10)
1. A data balancing method, characterized by comprising:

determining minority-class samples from a plurality of data samples; and

oversampling the minority-class samples according to the probability distribution of the minority-class samples, so that the number of the minority-class samples reaches a first threshold.
2. The method according to claim 1, wherein each sample includes one or more features, and the oversampling the minority-class samples according to the probability distribution of the minority-class samples comprises:

generating the Gaussian distribution of each feature according to the initial mean and initial variance of each feature in the minority-class samples; and

executing, for each feature in the minority-class samples, the following oversampling process:

generating a new feature value of the feature according to the Gaussian distribution of the feature, the new feature value serving as a first new feature value;

verifying the validity of the first new feature value;

if the first new feature value is verified as invalid, deleting the first new feature value, and otherwise retaining the first new feature value; and

if the total number of values of the feature has not yet reached the first threshold, re-executing the oversampling process until the total number of values of the feature reaches the first threshold.
3. The method according to claim 2, wherein the verifying the validity of the first new feature value comprises:

calculating the current mean and current variance of the feature;

performing a t-test on the current mean and an F-test on the current variance;

when the total number of values of the feature has not yet reached the first threshold, verifying the first new feature value as invalid if the current mean fails the t-test and the current variance fails the F-test; and

when the total number of values of the feature reaches the first threshold, verifying the first new feature value as invalid if the current mean fails the t-test or the current variance fails the F-test.
4. The method according to claim 3, wherein, after the step of deleting the first new feature value if the first new feature value is verified as invalid and otherwise retaining the first new feature value, the oversampling process further comprises:

if the current mean fails the t-test but the current variance passes the F-test, generating another new feature value of the feature according to the following formula, the another new feature value serving as a second new feature value:

X = 2(E0 + C) - E1

where X is the second new feature value, E1 denotes the mean of the feature before the second new feature value is generated, E0 denotes the initial mean, and C is a constant; and

if the current mean passes the t-test but the current variance fails the F-test: when the current variance is less than the initial variance, deleting, from the generated feature values, the feature value other than the first new feature value that is closest to the initial mean, and generating a third new feature value farthest from the initial mean; and when the current variance is greater than the initial variance, deleting, from the generated feature values, the feature value other than the first new feature value that is farthest from the initial mean, and generating a third new feature value closest to the initial mean.
5. The method according to claim 2, wherein the verifying the validity of the first new feature value comprises:

if the first new feature value falls outside a preset range of the Gaussian distribution of the feature, verifying the first new feature value as invalid, where the preset range is [the initial mean - n * the initial standard deviation, the initial mean + n * the initial standard deviation] and n is a value greater than zero.
6. The method according to any one of claims 1-5, wherein the method further comprises:

determining majority-class samples from the plurality of data samples; and

undersampling the majority-class samples so that the number of the majority-class samples reaches a second threshold.
7. The method according to claim 6, wherein the undersampling the majority-class samples comprises:

determining the probability density of each sample in the majority-class samples; and

executing the following undersampling process:

determining a first sample, the first sample being any sample in the majority-class samples;

determining, among the other majority-class samples excluding the first sample, the sample whose probability density is closest to that of the first sample as a second sample;

deleting the second sample; and

if the total number of the majority-class samples has not yet reached the second threshold, re-executing the undersampling process until the total number of the majority-class samples reaches the second threshold.
8. A data balancing device, characterized by comprising:

a first determining module, configured to determine minority-class samples from a plurality of data samples; and

an oversampling module, configured to oversample the minority-class samples according to the probability distribution of the minority-class samples, so that the number of the minority-class samples reaches a first threshold.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
10. An electronic device, characterized by comprising:

a memory on which a computer program is stored; and

a processor, configured to execute the computer program in the memory so as to implement the steps of the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811427339.4A CN109726821B (en) | 2018-11-27 | 2018-11-27 | Data equalization method and device, computer readable storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109726821A true CN109726821A (en) | 2019-05-07 |
CN109726821B CN109726821B (en) | 2021-07-09 |
Family
ID=66294872
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811427339.4A Active CN109726821B (en) | 2018-11-27 | 2018-11-27 | Data equalization method and device, computer readable storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109726821B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111738197A (en) * | 2020-06-30 | 2020-10-02 | 中国联合网络通信集团有限公司 | Training image information processing method and device |
CN112416911A (en) * | 2019-08-23 | 2021-02-26 | 广州虎牙科技有限公司 | Sample data acquisition method, device, equipment and storage medium |
CN115034317A (en) * | 2022-06-17 | 2022-09-09 | 中国平安人寿保险股份有限公司 | Training method and device of policy identification model and policy identification method and device |
CN116451084A (en) * | 2023-06-13 | 2023-07-18 | 北京航空航天大学 | Training sample preprocessing method for driving style recognition model |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1137230A2 (en) * | 2000-03-22 | 2001-09-26 | ELMER S.p.A. | MAP equalizer |
US20030007552A1 (en) * | 2001-07-09 | 2003-01-09 | Intel Corporation | Reduced alphabet equalizer using iterative equalization |
CN101980202A (en) * | 2010-11-04 | 2011-02-23 | 西安电子科技大学 | Semi-supervised classification method of unbalance data |
CN102495901A (en) * | 2011-12-16 | 2012-06-13 | 山东师范大学 | Method for keeping balance of implementation class data through local mean |
CN105654513A (en) * | 2015-12-30 | 2016-06-08 | 电子科技大学 | Moving target detection method based on sampling strategy |
CN106504111A (en) * | 2016-09-19 | 2017-03-15 | 清华大学 | In abnormal power usage mining, class is distributed the solution of imbalance problem |
CN106548196A (en) * | 2016-10-20 | 2017-03-29 | 中国科学院深圳先进技术研究院 | A kind of random forest sampling approach and device for non-equilibrium data |
CN107169518A (en) * | 2017-05-18 | 2017-09-15 | 北京京东金融科技控股有限公司 | Data classification method, device, electronic installation and computer-readable medium |
CN107341497A (en) * | 2016-11-11 | 2017-11-10 | 东北大学 | The unbalanced weighting data streams Ensemble classifier Forecasting Methodology of sampling is risen with reference to selectivity |
CN108319967A (en) * | 2017-11-22 | 2018-07-24 | 中国电子科技集团公司电子科学研究院 | A kind of method and system that unbalanced data are handled |
CN108491474A (en) * | 2018-03-08 | 2018-09-04 | 平安科技(深圳)有限公司 | A kind of data classification method, device, equipment and computer readable storage medium |
CN108647727A (en) * | 2018-05-10 | 2018-10-12 | 广州大学 | Unbalanced data classification lack sampling method, apparatus, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN109726821B (en) | 2021-07-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109726821A (en) | Data balancing method, device, computer readable storage medium and electronic equipment | |
US10943186B2 (en) | Machine learning model training method and device, and electronic device | |
CN106716382B (en) | The method and system of aggregation multiple utility program behavioural analysis for mobile device behavior | |
CN108171335A (en) | Choosing method, device, storage medium and the electronic equipment of modeling data | |
CN112417439A (en) | Account detection method, device, server and storage medium | |
JP2020510917A (en) | Risk management control method and device | |
TW201913441A (en) | Model safety detection method, device and electronic device | |
US20190080327A1 (en) | Method, apparatus, and electronic device for risk feature screening and descriptive message generation | |
CN103927483B (en) | The detection method of decision model and rogue program for detecting rogue program | |
KR20200057903A (en) | Artificial intelligence model platform and operation method thereof | |
TWI674514B (en) | Malicious software recognition apparatus and method | |
CN113221104B (en) | Detection method of abnormal behavior of user and training method of user behavior reconstruction model | |
CN111968625A (en) | Sensitive audio recognition model training method and recognition method fusing text information | |
US20240037408A1 (en) | Method and apparatus for model training and data enhancement, electronic device and storage medium | |
EP3739524A1 (en) | Method and system for protecting a machine learning model against extraction | |
CN109492531A (en) | Face image key point extraction method and device, storage medium and electronic equipment | |
CN111428236A (en) | Malicious software detection method, device, equipment and readable medium | |
CN110096013A (en) | A kind of intrusion detection method and device of industrial control system | |
CN111367773A (en) | Method, system, equipment and medium for detecting network card of server | |
WO2020168874A1 (en) | Classifier robustness test method and device, terminal and storage medium | |
CN114943307A (en) | Model training method and device, storage medium and electronic equipment | |
CN112884569A (en) | Credit assessment model training method, device and equipment | |
CN108805211A (en) | IN service type cognitive method based on machine learning | |
CN105608460A (en) | Method and system for fusing multiple classifiers | |
US11973756B2 (en) | Systems and methods for improving computer identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |