CN109726821A - Data balancing method, device, computer readable storage medium and electronic equipment - Google Patents


Info

Publication number
CN109726821A
Authority
CN
China
Prior art keywords
sample
value
new feature
feature
feature value
Prior art date
Legal status
Granted
Application number
CN201811427339.4A
Other languages
Chinese (zh)
Other versions
CN109726821B (en)
Inventor
刘志鹏
高睿
邹存璐
Current Assignee
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201811427339.4A
Publication of CN109726821A
Application granted
Publication of CN109726821B
Status: Active
Anticipated expiration

Landscapes

  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a data balancing method, apparatus, computer-readable storage medium, and electronic device. The method includes: determining minority-class samples from a plurality of data samples; and, according to the probability distribution of the minority-class samples, oversampling the minority-class samples so that the number of minority-class samples reaches a first threshold. Because new samples are generated according to the probability distribution of the minority class, the added samples do not distort the distribution of the minority-class samples or compromise their authenticity, which in turn preserves the accuracy of the subsequently trained model.

Description

Data balancing method, device, computer readable storage medium and electronic equipment
Technical field
The present disclosure relates to the technical field of data processing, and in particular to a data balancing method, apparatus, computer-readable storage medium, and electronic device.
Background technique
With the rapid development of artificial intelligence and machine learning, a variety of machine learning models have emerged. After being trained on large numbers of data samples, these models can be applied to many scenarios, such as prediction and classification, thereby enabling intelligent processing that meets users' needs. One kind of machine learning model is the classification model. A user can feed such a model a large number of data samples, divided into positive samples and negative samples, and train the model with them to obtain a classifier of a certain accuracy.
In practice, the sample sets used to train classification models are often severely imbalanced: the number of samples of one class is far smaller than that of the other class, so the model cannot learn the under-represented class in depth. It is therefore usually necessary to oversample the under-represented class to increase its sample count. In the prior art, however, oversampling is mostly done by generating samples at random, which distorts the sample distribution and compromises the authenticity of the samples.
Summary of the invention
To solve the problems in the related art, the present disclosure provides a data balancing method, apparatus, computer-readable storage medium, and electronic device.
To achieve the above goals, a first aspect of the present disclosure provides a data balancing method, comprising:
determining minority-class samples from a plurality of data samples; and
according to the probability distribution of the minority-class samples, oversampling the minority-class samples so that the number of minority-class samples reaches a first threshold.
Optionally, each sample includes one or more features;
and oversampling the minority-class samples according to their probability distribution comprises:
generating a Gaussian distribution for each feature according to the initial mean and initial variance of that feature in the minority-class samples; and
for each feature in the minority-class samples, performing the following oversampling process:
generating, according to the Gaussian distribution of the feature, a new feature value of the feature as a first new feature value;
verifying the validity of the first new feature value;
if the first new feature value is verified to be invalid, deleting it, and otherwise retaining it; and
if the total number of feature values of the feature has not yet reached the first threshold, re-executing the oversampling process until the total number of feature values of the feature reaches the first threshold.
Optionally, verifying the validity of the first new feature value comprises:
calculating the current mean and current variance of the feature;
performing a T-test on the current mean and an F-test on the current variance;
when the total number of feature values of the feature has not yet reached the first threshold, verifying the first new feature value to be invalid if the current mean fails the T-test and the current variance fails the F-test; and
when the total number of feature values of the feature reaches the first threshold, verifying the first new feature value to be invalid if the current mean fails the T-test or the current variance fails the F-test.
Optionally, after the step of deleting the first new feature value if it is verified to be invalid and otherwise retaining it, the oversampling process further comprises:
if the current mean fails the T-test but the current variance passes the F-test, generating another new feature value of the feature, as a second new feature value, according to the following formula:
X = 2(E0 + C) - E1
where X is the second new feature value, E1 denotes the mean of the feature before the second new feature value is generated, E0 denotes the initial mean, and C is a constant; and
if the current mean passes the T-test but the current variance fails the F-test: when the current variance is smaller than the initial variance, deleting from the generated feature values, other than the first new feature value, the feature value closest to the initial mean, and generating a third new feature value farthest from the initial mean; and when the current variance is greater than the initial variance, deleting from the generated feature values, other than the first new feature value, the feature value farthest from the initial mean, and generating a third new feature value closest to the initial mean.
Optionally, verifying the validity of the first new feature value comprises:
verifying the first new feature value to be invalid if it falls outside a preset range of the Gaussian distribution of the feature, where the preset range is [initial mean - n * initial standard deviation, initial mean + n * initial standard deviation] and n is a value greater than zero.
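A minimal sketch of this range-based validity check; the function and parameter names are illustrative, not taken from the patent:

```python
def in_preset_range(x, initial_mean, initial_std, n=3.0):
    """Return True if candidate feature value x lies inside the preset range
    [initial_mean - n * initial_std, initial_mean + n * initial_std]."""
    return (initial_mean - n * initial_std) <= x <= (initial_mean + n * initial_std)
```

A candidate drawn far into the tails of the fitted Gaussian is thus rejected, keeping the feature's distribution intact.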
Optionally, the method further comprises:
determining majority-class samples from the plurality of data samples; and
undersampling the majority-class samples so that the number of majority-class samples reaches a second threshold.
Optionally, undersampling the majority-class samples comprises:
determining the probability density of each of the majority-class samples; and
executing the following undersampling process:
determining a first sample, the first sample being any one of the majority-class samples;
determining, among the majority-class samples other than the first sample, the sample whose probability density is closest to that of the first sample as a second sample;
deleting the second sample; and
if the total number of majority-class samples has not yet reached the second threshold, re-executing the undersampling process until the total number of majority-class samples reaches the second threshold.
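The undersampling loop described in this option can be sketched as follows, assuming the probability density of each sample has already been computed; all names are illustrative:

```python
def undersample(majority, densities, second_threshold):
    """Per round, delete the sample whose probability density is closest to
    that of an arbitrarily chosen first sample, until only second_threshold
    samples remain. `densities` maps sample -> precomputed density."""
    samples = list(majority)
    while len(samples) > second_threshold:
        first = samples[0]  # "any sample" -- here simply the first one
        # second sample: density closest to the first sample's density
        second = min(samples[1:], key=lambda s: abs(densities[s] - densities[first]))
        samples.remove(second)
    return samples
```

Because the deleted sample is the one most similar in density to a retained sample, each round removes near-redundant information rather than an arbitrary sample.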
A second aspect of the present disclosure provides a data balancing apparatus, comprising:
a first determining module, configured to determine minority-class samples from a plurality of data samples; and
an oversampling module, configured to oversample the minority-class samples according to the probability distribution of the minority-class samples so that the number of minority-class samples reaches a first threshold.
Optionally, each sample includes one or more features, and the oversampling module includes:
a generating submodule, configured to generate a Gaussian distribution for each feature according to the initial mean and initial variance of that feature in the minority-class samples; and
an oversampling execution submodule, configured to execute the following oversampling process for each feature in the minority-class samples:
generating, according to the Gaussian distribution of the feature, a new feature value of the feature as a first new feature value;
verifying the validity of the first new feature value;
if the first new feature value is verified to be invalid, deleting it, and otherwise retaining it; and
if the total number of feature values of the feature has not yet reached the first threshold, re-executing the oversampling process until the total number of feature values of the feature reaches the first threshold.
Optionally, the apparatus further includes:
a second determining module, configured to determine majority-class samples from the plurality of data samples; and
an undersampling module, configured to undersample the majority-class samples so that the number of majority-class samples reaches a second threshold.
Optionally, the undersampling module includes:
a determining submodule, configured to determine the probability density of each of the majority-class samples; and
an undersampling execution submodule, configured to execute the following undersampling process:
determining a first sample, the first sample being any one of the majority-class samples;
determining, among the majority-class samples other than the first sample, the sample whose probability density is closest to that of the first sample as a second sample;
deleting the second sample; and
if the total number of majority-class samples has not yet reached the second threshold, re-executing the undersampling process until the total number of majority-class samples reaches the second threshold.
A third aspect of the present disclosure provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method provided by the first aspect of the present disclosure.
A fourth aspect of the present disclosure provides an electronic device, comprising:
a memory on which a computer program is stored; and
a processor, configured to execute the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
Through the above technical solutions, minority-class samples are determined from a plurality of data samples, and the minority-class samples are oversampled according to their probability distribution so that the number of minority-class samples reaches a first threshold. Because new samples are generated according to the probability distribution of the minority class, the added samples do not distort the distribution of the minority-class samples or compromise their authenticity, which in turn preserves the accuracy of the subsequently trained model.
Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.
Detailed description of the invention
The accompanying drawings are provided for a further understanding of the present disclosure and constitute part of the specification. Together with the following detailed description, they serve to explain the disclosure but do not limit it. In the drawings:
Fig. 1 is a flowchart of a data balancing method according to an exemplary embodiment.
Fig. 2 is a flowchart of a data balancing method according to another exemplary embodiment.
Fig. 3 is a schematic diagram of the Gaussian distribution of a CPU-usage feature according to an exemplary embodiment.
Fig. 4 is a flowchart of an oversampling process according to an exemplary embodiment.
Fig. 5 is a flowchart of a method for undersampling majority-class samples according to an exemplary embodiment.
Fig. 6 is a block diagram of a data balancing apparatus according to an exemplary embodiment.
Fig. 7 is a block diagram of an electronic device according to an exemplary embodiment.
Specific embodiment
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are intended only to describe and explain the disclosure, not to limit it.
When a classification model is trained, positive and negative samples are often imbalanced, so the numbers of positive and negative samples need to be balanced. In practice, data sample balancing comprises two main methods: oversampling and undersampling. Oversampling increases the number of minority-class samples, essentially by generating new samples; undersampling reduces the number of majority-class samples, essentially by deleting samples. Data balancing can therefore be achieved by oversampling the minority-class samples or by undersampling the majority-class samples. In the prior art, data samples are balanced using only one of oversampling and undersampling. At the initial stage of model training, it is then difficult for the user to judge whether to balance the data with oversampling or with undersampling, and using only one of the two makes balancing take longer. To solve these technical problems, the present disclosure provides a data balancing method, apparatus, computer-readable storage medium, and electronic device that balance data samples automatically and quickly.
Please refer to Fig. 1, which is a flowchart of a data balancing method according to an exemplary embodiment. As shown in Fig. 1, the method may include the following steps.
In step 11, minority-class samples and majority-class samples are respectively determined from a plurality of data samples.
In the present disclosure, the plurality of data samples includes two classes of samples. The numbers of the two classes are counted separately; the class with fewer samples is determined to be the minority class, and the class with more samples is determined to be the majority class.
Illustratively, the plurality of data samples may include positive samples and negative samples, with each positive data sample pre-labeled 0 and each negative data sample pre-labeled 1. The minority-class and majority-class samples can then be determined by counting the labels 0 and 1. For example, if the number of label 0 is n, the number of label 1 is m, and n is less than m, then the positive samples are the minority class and the negative samples are the majority class.
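Under the labeling convention above, the minority/majority split can be sketched as follows; this is a minimal illustration, not the patent's implementation:

```python
from collections import Counter

def split_minority_majority(samples, labels):
    """Count the binary labels (e.g. 0 for positive, 1 for negative) and
    return (minority_class, majority_class) sample lists."""
    counts = Counter(labels)
    minority_label = min(counts, key=counts.get)  # the less frequent label
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    majority = [s for s, y in zip(samples, labels) if y != minority_label]
    return minority, majority
```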
In step 12, the minority-class samples are oversampled so that the number of minority-class samples reaches a first threshold.
As described above, oversampling is used to generate data. Therefore, in the present disclosure, to increase the number of minority-class samples and narrow the gap between it and the number of majority-class samples, the minority-class samples may be oversampled so that their number reaches a first threshold. The first threshold may be a value set by the user; it is a positive integer and should be greater than the number of minority-class samples before oversampling.
In step 13, the majority-class samples are undersampled so that the number of majority-class samples reaches a second threshold.
Correspondingly, to reduce the number of majority-class samples and narrow the gap between it and the number of minority-class samples, the majority-class samples are undersampled so that their number reaches a second threshold. Likewise, the second threshold may be a value set by the user; it is a positive integer and should be less than the number of majority-class samples before undersampling. Depending on the user's actual needs, the first and second thresholds may be the same value or different values; the present disclosure places no particular limitation on this.
With the above technical solution, the user does not need to decide during data balancing whether to choose oversampling or undersampling. The scheme can combine oversampling and undersampling to balance the majority-class and minority-class samples automatically and to improve the accuracy of the trained model. Moreover, because the minority-class samples can be oversampled and the majority-class samples undersampled at the same time, the numbers of the two classes can be balanced quickly, reducing the time required for data balancing.
In one embodiment of step 12, oversampling the minority-class samples may consist of generating samples at random, for example by randomly duplicating existing minority-class samples as new samples. Generating samples in this way, however, can cause problems. For example, when the duplicated sample is one that occurs with low probability in the minority class, the distribution of the minority-class samples is distorted and their authenticity is compromised, which in turn affects the accuracy of the subsequently trained model.
To avoid these problems, the present disclosure also provides another data balancing method. Please refer to Fig. 2, which is a flowchart of a data balancing method according to another exemplary embodiment. As shown in Fig. 2, the method may include the following steps.
In step 21, minority-class samples are determined from a plurality of data samples.
For the specific implementation of step 21, reference may be made to the method of respectively determining minority-class and majority-class samples from a plurality of data samples described above in connection with Fig. 1.
In step 22, according to the probability distribution of the minority-class samples, the minority-class samples are oversampled so that the number of minority-class samples reaches a first threshold.
In this way, samples are added according to the probability distribution of the minority class, so the added samples do not distort the distribution of the minority-class samples or compromise their authenticity, which in turn preserves the accuracy of the subsequently trained model.
Specifically, in the present disclosure each sample includes one or more features, and oversampling the minority-class samples according to their probability distribution may proceed as follows.
First, a Gaussian distribution is generated for each feature according to the initial mean and initial variance of that feature in the minority-class samples.
In the present disclosure, each of the plurality of data samples includes one or more features, and different features characterize different kinds of information in the data sample. For example, when the data samples are used to detect whether a computer is faulty, each sample includes at least a CPU (Central Processing Unit) usage and a memory usage; the CPU usage characterizes the CPU usage information in the sample, and the memory usage characterizes the memory usage information in the sample.
For each feature in the minority-class samples, the mean and variance of the feature are calculated separately. Since the minority-class samples here are the samples before oversampling, this mean and variance may be called the initial mean and initial variance, to distinguish them from the mean and variance of each feature after samples are subsequently added. A Gaussian distribution of each feature is then generated from its initial mean and initial variance; the Gaussian distribution of a feature reflects the probability with which each value of that feature occurs in the minority-class samples. Calculating the mean and variance of a feature and generating a Gaussian distribution from them are common knowledge and are not described again here.
Suppose the minority-class samples are samples in which the computer is faulty; correspondingly, each minority-class sample includes a CPU-usage feature and a memory-usage feature. A Gaussian distribution of the CPU-usage feature can then be generated from the initial mean and initial variance of the CPU-usage feature in the minority-class samples, and a Gaussian distribution of the memory-usage feature from the initial mean and initial variance of the memory-usage feature. Taking the CPU-usage feature as an example, its initial mean μ0 and initial variance σ0² are calculated, and its Gaussian distribution is generated from μ0 and σ0². This Gaussian distribution may be, for example, the one shown in Fig. 3, in which the abscissa denotes CPU usage and the ordinate denotes probability.
Then, after the Gaussian distribution of each feature has been generated, the following oversampling process is performed for each feature in the minority-class samples. As shown in Fig. 4, the oversampling process may include the following steps.
In step 41, a new feature value is generated according to the Gaussian distribution of the feature, as a first new feature value.
When samples are added to the minority class, each added sample needs to contain the same features as every existing minority-class sample. For example, when each sample includes a CPU-usage feature and a memory-usage feature, an added sample also needs to include a CPU-usage feature and a memory-usage feature. Therefore, in the present disclosure, samples can be added by adding feature values to each feature separately.
Specifically, for each feature in the minority-class samples, a new feature value is generated at random according to the Gaussian distribution of the feature, and the generated value serves as the first new feature value. For example, a CPU-usage value is generated at random on the abscissa of the Gaussian distribution of the CPU-usage feature shown in Fig. 3; that value is the first new feature value.
In step 42, the validity of the first new feature value is verified.
Although the first new feature value is generated according to the Gaussian distribution, it may differ greatly from the initial mean and so may still affect the authenticity of the feature. Therefore, after the first new feature value is generated, its validity must also be verified, i.e., it must be checked whether adding the first new feature value to the feature would destroy the authenticity of the feature. If adding the first new feature value would not destroy the feature's authenticity, it is verified to be valid; otherwise it is verified to be invalid. Illustratively, this can be verified by judging whether the difference between the feature's current mean, after the first new feature value is added, and the initial mean is within a preset range: if the difference is within the preset range, adding the first new feature value does not destroy the feature's authenticity; otherwise it does.
In step 43, if the first new feature value is verified to be invalid, it is deleted.
In step 44, if the first new feature value is verified to be valid, it is retained.
When adding the first new feature value to a feature of the minority-class samples would destroy the feature's authenticity, the first new feature value is verified to be invalid and is deleted. When adding it would not destroy the feature's authenticity, it is verified to be valid and is retained.
Finally, it is judged whether the total number of feature values of the feature has reached the first threshold.
When the minority-class samples are oversampled so that their number reaches the first threshold, the total number of feature values of each feature in the minority-class samples also needs to reach the first threshold. Therefore, every time a generated new feature value is retained or deleted, it must be judged whether the total number of feature values of the feature has reached the first threshold. When it has, generation of new feature values for the feature stops; otherwise steps 41-44 of the above oversampling process are re-executed until the total number of feature values of the feature reaches the first threshold.
In addition, in one embodiment, when the total number of feature values of the feature has not reached the first threshold after the first new feature value is retained, another new feature value symmetric to the first new feature value can be generated, exploiting the fact that the Gaussian curve is axially symmetric about the line L = initial mean; this symmetric new feature value is also valid.
It should be noted that when a sample includes multiple features, the oversampling process may proceed in the order of generating one new feature value per feature in turn, cycling until the total number of feature values of every feature reaches the first threshold; or in the order of first generating new feature values from the Gaussian distribution of one feature until that feature's total reaches the first threshold, and then generating new feature values for the other features until every feature's total reaches the first threshold. Oversampling may also proceed in other orders; the embodiments of the present disclosure place no particular limitation on this.
Through the above method, it can be ensured that adding feature values to each feature affects neither the distribution nor the authenticity of the feature, which in turn ensures that the added samples do not compromise the authenticity of the minority-class samples.
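Steps 41-44 and the threshold check can be sketched per feature as follows. This is a minimal sketch: the validity check used here is the simple n-sigma range test rather than the T/F-test variant, and all names and defaults are illustrative:

```python
import random
import statistics

def oversample_feature(values, first_threshold, n=3.0, seed=0):
    """Fit a Gaussian to the feature's existing values, then repeatedly draw
    candidate first new feature values from it, keeping only candidates inside
    [mu0 - n*sigma0, mu0 + n*sigma0], until the feature holds first_threshold
    values in total."""
    rng = random.Random(seed)
    mu0 = statistics.mean(values)        # initial mean
    sigma0 = statistics.pstdev(values)   # initial standard deviation
    out = list(values)
    while len(out) < first_threshold:
        x = rng.gauss(mu0, sigma0)       # step 41: draw from the Gaussian
        if abs(x - mu0) <= n * sigma0:   # step 42: validity (preset range)
            out.append(x)                # step 44: retain; else step 43: drop
    return out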
In a possible embodiment of the present disclosure, step 42 may include the following.
First, the current mean and current variance of the feature are calculated, where the current mean and current variance are the mean and variance of the feature after the first new feature value is generated.
Then, a T-test is performed on the current mean and an F-test on the current variance. Specifically, the T-test on the current mean mainly examines whether the difference between the current mean and the initial mean is significant: when the difference is significant, the T-test fails; otherwise it passes. The F-test on the current variance mainly examines whether there is a significant difference between the current variance and the initial variance: when there is, the F-test fails; otherwise it passes.
Then, when the total number of feature values of the feature has not yet reached the first threshold, the first new feature value is verified to be invalid if the current mean fails the T-test and the current variance fails the F-test. When the total number of feature values of the feature reaches the first threshold, the first new feature value is verified to be invalid if the current mean fails the T-test or the current variance fails the F-test; that is, once the threshold is reached, failing either test is enough to invalidate the value.
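A hedged sketch of this two-part check: the statistics are computed directly and compared against fixed illustrative critical values rather than looked up from t/F tables at a chosen significance level, so treat the thresholds as assumptions:

```python
import statistics

def mean_variance_checks(initial, current, t_crit=2.0, f_crit=2.0):
    """Return (t_pass, f_pass): whether the current mean/variance are still
    consistent with the initial ones. t_stat is a one-sample t-style statistic
    for the mean; f_stat is the larger-over-smaller variance ratio."""
    n = len(current)
    mu0 = statistics.mean(initial)
    var0 = statistics.variance(initial)
    mu1 = statistics.mean(current)
    var1 = statistics.variance(current)
    t_stat = abs(mu1 - mu0) / ((var1 / n) ** 0.5) if var1 > 0 else 0.0
    f_stat = max(var1, var0) / min(var1, var0) if min(var1, var0) > 0 else float("inf")
    return t_stat <= t_crit, f_stat <= f_crit
```

In a full implementation, `t_crit` and `f_crit` would come from the t and F distributions for the chosen significance level and degrees of freedom.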
In one embodiment, when the total number of feature values of the feature has not yet reached the first threshold, after the first new feature value has been retained in the feature, another new feature value can be generated following the oversampling procedure described above in connection with Fig. 2, until the total number of feature values of the feature reaches the first threshold.
In another embodiment, note that when the total number of feature values has not yet reached the first threshold, the first new feature value is verified as invalid, and deleted, only when the current mean fails the T-test and the current variance fails the F-test. When exactly one of the T-test and the F-test fails, the first new feature value is still retained; this may leave the current mean and current variance of the feature inconsistent with the initial mean and initial variance, although the difference is not significant.
In the present disclosure, to further reduce the difference between the current mean and variance and the initial mean and variance, when they are inconsistent a compensating second new feature value is generated, so that after it is added the current mean and current variance of the feature match the initial mean and initial variance as closely as possible, eliminating the difference described above.
Specifically, after step 44, the above oversampling procedure may further include the following.
If the current mean fails the T-test but the current variance passes the F-test, another new feature value of the feature, taken as the second new feature value, is generated according to the following formula:
X = 2(E0 + C) - E1    (1)
where X is the second new feature value, E1 denotes the mean of the feature before the second new feature value is generated, E0 denotes the initial mean, and C is a constant.
When the T-test fails but the F-test passes, the difference between the current mean and the initial mean is significant while the current variance is consistent with the initial variance. In this case, to reduce the difference between the current mean and the initial mean, a second new feature value can be generated to cancel the influence of the first new feature value on the feature; illustratively, it can be generated by formula (1) above. In formula (1), C characterizes the difference between the current mean and the initial mean that the user can accept: the smaller C is, the smaller the difference between the current mean of the feature and the initial mean after the second new feature value is generated.
In this way, generating the second new feature value with formula (1) ensures that the difference between the current mean of the feature and the initial mean after the second new feature value is generated meets the user's requirement.
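As an illustration, formula (1) is a one-line computation; the sketch below, with hypothetical numbers, shows how appending X pulls the running mean back toward the initial mean E0.

```python
def second_new_feature_value(initial_mean, current_mean, c=0.0):
    # Formula (1): X = 2 * (E0 + C) - E1
    return 2 * (initial_mean + c) - current_mean

# Hypothetical example: initial mean E0 = 5.0, drifted current mean E1 = 6.0.
x = second_new_feature_value(5.0, 6.0)    # X = 2 * 5.0 - 6.0 = 4.0
# Appending X to values whose mean is 6.0 moves the mean toward 5.0:
values = [5.5, 6.5]                       # mean 6.0
values.append(x)
new_mean = sum(values) / len(values)      # (5.5 + 6.5 + 4.0) / 3, closer to 5.0
```

With C = 0, X is the mirror image of the drifted mean E1 about the initial mean E0, which is why adding it compensates for the drift.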
If the current mean passes the T-test but the current variance fails the F-test, then: when the current variance is smaller than the initial variance, the feature value closest to the initial mean, other than the first new feature value, is deleted from the generated feature values, and a third new feature value farthest from the initial mean is generated; when the current variance is larger than the initial variance, the feature value farthest from the initial mean, other than the first new feature value, is deleted from the generated feature values, and a third new feature value nearest to the initial mean is generated.
If the current mean passes the T-test but the current variance fails the F-test, the current mean is nearly consistent with the initial mean while the difference between the current variance and the initial variance is significant. In this case, to reduce the difference between the current variance and the initial variance, one of the other feature values, excluding the first new feature value, needs to be deleted from the generated feature values and a third new feature value generated in its place.
In a Gaussian distribution, the larger the variance, the flatter the curve and the more dispersed the probability distribution; the smaller the variance, the steeper the curve and the more concentrated the probability distribution. When the current variance is larger than the initial variance, the probability distribution of the feature after the first new feature value is generated is more dispersed than its probability distribution before the first new feature value was generated. Therefore, the feature value farthest from the initial mean, other than the first new feature value, can be deleted from the generated feature values, and a third new feature value nearest to the initial mean can be generated, so that the probability distribution of the feature after the third new feature value is generated is more consistent with the distribution before the first new feature value was generated.
When the current variance is smaller than the initial variance, the probability distribution of the feature after the first new feature value is generated is more concentrated than its probability distribution before the first new feature value was generated. Therefore, the feature value closest to the initial mean, other than the first new feature value, can be deleted from the generated feature values, and a third new feature value farthest from the initial mean can be generated, so that the probability distribution of the feature after the third new feature value is generated is more consistent with the distribution before the first new feature value was generated.
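The variance-compensation step can be sketched as follows. This is an illustrative Python sketch under assumptions the text leaves open: the text does not say how far "farthest from the initial mean" should be, so the `spread` parameter is a hypothetical scale for the replacement value, and the replacement "nearest to the initial mean" is placed at the initial mean itself.

```python
def compensate_variance(values, first_new_idx, initial_mean, initial_var, spread):
    # Population variance of the current feature values.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # Candidates for deletion: every value except the retained first new value.
    others = [i for i in range(len(values)) if i != first_new_idx]
    if var < initial_var:
        # Distribution too concentrated: drop the value closest to the
        # initial mean and add a third value far from it.
        drop = min(others, key=lambda i: abs(values[i] - initial_mean))
        third = initial_mean + spread
    else:
        # Distribution too dispersed: drop the value farthest from the
        # initial mean and add a third value at the initial mean.
        drop = max(others, key=lambda i: abs(values[i] - initial_mean))
        third = initial_mean
    values = values[:drop] + values[drop + 1:]
    values.append(third)
    return values
```

Deleting and regenerating in opposite directions is what nudges the sample variance back toward the initial variance without touching the retained first new feature value.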
With the above scheme, the probability that the generated first new feature value is retained in the feature is improved and the time of the oversampling procedure is reduced; in addition, the compensation approach cancels the influence of the first new feature value on the current mean and current variance of the feature, further ensuring the authenticity of the minority-class samples after oversampling.
Besides the method described above of verifying the validity of the first new feature value by performing a T-test on the current mean and an F-test on the current variance, in an alternative embodiment the validity of the first new feature value can be verified by checking the relationship between the generated first new feature value and a preset range of the Gaussian distribution of the feature to which it belongs.
Specifically, step 42 may include: if the first new feature value falls outside the preset range of the Gaussian distribution of the feature, the first new feature value is verified as invalid, where the preset range is [initial mean − n × initial standard deviation, initial mean + n × initial standard deviation] and n is a number greater than zero.
By the properties of the Gaussian distribution, the area over an interval of the horizontal axis reflects the percentage of the total number of feature values that fall in that interval, i.e., the probability that a feature value falls in the interval. For example, the area over [mean − standard deviation, mean + standard deviation] is about 68.3%, and the area over [mean − 2.58 × standard deviation, mean + 2.58 × standard deviation] is about 99%. Therefore, to keep the generated feature values in the high-probability region of the Gaussian distribution and avoid generating feature values that amount to "small-probability events", the present disclosure verifies as invalid any feature value outside the interval [initial mean − n × initial standard deviation, initial mean + n × initial standard deviation], where n may be, for example, 3.
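The n-sigma range check is a simple predicate; a minimal sketch, using n = 3 as in the example above:

```python
def in_preset_range(x, initial_mean, initial_std, n=3.0):
    # Valid iff x lies within [initial_mean - n*std, initial_mean + n*std].
    return (initial_mean - n * initial_std) <= x <= (initial_mean + n * initial_std)
```

For a standard Gaussian (mean 0, standard deviation 1) with n = 3, any candidate outside [-3, 3] would be verified as invalid.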
By the above method, it can be ensured that adding feature values to each feature does not affect the authenticity of the feature, which in turn ensures that the augmented samples do not affect the authenticity of the minority-class samples.
The oversampling procedure has been described in detail above. The undersampling procedure is now described in detail with reference to Fig. 5. As shown in Fig. 5, step 13 in Fig. 1 may include the following steps.
In step 131, the probability density of each sample in the majority-class samples is determined.
Specifically, the initial mean and initial variance of each feature in the majority-class samples can be computed separately; the probability density of each feature value in each sample is then calculated from the initial mean, the initial variance, and the probability density formula f(x) = exp(−(x − μ0)² / (2σ0²)) / √(2πσ0²), and the sum of the probability densities of the feature values in a sample is taken as the probability density of that sample. In the formula, x denotes a feature value, f(x) denotes the probability density of that feature value, μ0 denotes the initial mean, and σ0² denotes the initial variance.
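The per-sample density computation follows directly from the formula; a minimal Python sketch (note that summing the per-feature densities follows the text, rather than a standard joint-density formulation):

```python
import math

def gaussian_pdf(x, mu, var):
    # f(x) = exp(-(x - mu)^2 / (2 * var)) / sqrt(2 * pi * var)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def sample_density(sample, means, variances):
    # Per the text, a sample's density is the SUM of the densities of its
    # feature values under each feature's initial Gaussian distribution.
    return sum(gaussian_pdf(x, mu, var)
               for x, mu, var in zip(sample, means, variances))
```

For a standard Gaussian, gaussian_pdf(0, 0, 1) is 1/√(2π), roughly 0.399, which matches the peak of the curve.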
After the probability density of each sample is determined, the following undersampling procedure can be executed on the majority-class samples.
In step 132, a first sample is determined.
The first sample is any sample in the majority-class samples; a sample can therefore be chosen at random from the majority-class samples as the first sample.
In step 133, among the other majority-class samples excluding the first sample, the sample whose probability density is closest to the probability density of the first sample is determined as the second sample.
In step 134, the second sample is deleted.
When undersampling the majority-class samples, part of the samples are deleted from among duplicated samples or samples that lie close to one another, so that the distribution of the majority-class samples after deletion is consistent with their distribution before deletion. Therefore, in the present disclosure, after the first sample is determined, the sample whose probability density is closest to that of the first sample is found among the other majority-class samples, according to the probability density of the first sample and of each sample in the majority class; that sample is determined as the second sample and deleted.
In step 135, it is judged whether the total number of the majority-class samples has reached the second threshold.
After the second sample is deleted, it is judged whether the total number of the majority-class samples has reached the second threshold. If it has, deletion of samples can stop; otherwise, steps 132 to 135 are re-executed until the total number of the majority-class samples reaches the second threshold.
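Steps 132 to 135 can be sketched as a loop. This is an illustrative Python sketch; the pluggable `rng` argument is a convenience assumption (not from the text) so the random choice of the first sample can be made reproducible.

```python
import random

def undersample(samples, densities, second_threshold, rng=None):
    # samples and densities are parallel lists; one "second sample" is
    # deleted per round until the majority class shrinks to the threshold.
    rng = rng or random.Random()
    samples, densities = list(samples), list(densities)
    while len(samples) > second_threshold:
        i = rng.randrange(len(samples))                      # step 132: pick a first sample
        j = min((k for k in range(len(samples)) if k != i),  # step 133: nearest density
                key=lambda k: abs(densities[k] - densities[i]))
        del samples[j]                                       # step 134: delete the second sample
        del densities[j]                                     # step 135 is the loop condition
    return samples
```

Because the deleted sample is always the density-nearest neighbour of a randomly chosen sample, duplicates and near-duplicates are removed first, which is what preserves the overall distribution.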
Sampling in the above manner avoids changing the distribution of the majority-class samples when samples are deleted, thereby ensuring the authenticity of the majority-class samples.
Based on the same inventive concept, the present disclosure further provides a data balancing apparatus. Referring to Fig. 6, Fig. 6 is a block diagram of a data balancing apparatus according to an exemplary embodiment. As shown in Fig. 6, the apparatus may include:
a first determining module 61, configured to determine minority-class samples from a plurality of data samples; and
an oversampling module 62, configured to oversample the minority-class samples according to the probability distribution of the minority-class samples, so that the number of the minority-class samples reaches a first threshold.
Optionally, each sample includes one or more features, and the oversampling module may include:
a generating submodule, configured to generate a Gaussian distribution of each feature according to the initial mean and initial variance of the feature in the minority-class samples; and
an oversampling execution submodule, configured to execute the above oversampling procedure for each feature in the minority-class samples and, if the total number of feature values of the feature has not yet reached the first threshold, to re-execute the oversampling procedure until the total number of feature values of the feature reaches the first threshold.
Optionally, the apparatus further includes:
a second determining module, configured to determine majority-class samples from the plurality of data samples; and
an undersampling module, configured to undersample the majority-class samples so that the number of the majority-class samples reaches a second threshold.
Optionally, the undersampling module may include:
a determining submodule, configured to determine the probability density of each sample in the majority-class samples; and
an undersampling execution submodule, configured to execute the above undersampling procedure and, if the total number of the majority-class samples has not yet reached the second threshold, to re-execute the undersampling procedure until the total number of the majority-class samples reaches the second threshold.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method and will not be elaborated here.
Fig. 7 is a block diagram of an electronic device 700 according to an exemplary embodiment. As shown in Fig. 7, the electronic device 700 may include a processor 701 and a memory 702, and may further include one or more of a multimedia component 703, an input/output (I/O) interface 704, and a communication component 705.
The processor 701 controls the overall operation of the electronic device 700 to complete all or part of the steps of the above data balancing method. The memory 702 stores various types of data to support operation on the electronic device 700; such data may include, for example, instructions of any application or method operated on the electronic device 700, as well as application-related data such as contact data, sent and received messages, pictures, audio, and video. The memory 702 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 703 may include a screen and an audio component. The screen may be, for example, a touch screen; the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may be further stored in the memory 702 or sent through the communication component 705. The audio component further includes at least one speaker for outputting audio signals. The I/O interface 704 provides an interface between the processor 701 and other interface modules, such as a keyboard, a mouse, or buttons; the buttons may be virtual buttons or physical buttons. The communication component 705 provides wired or wireless communication between the electronic device 700 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near-field communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 705 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic device 700 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for executing the above data balancing method.
In a further exemplary embodiment, a computer-readable storage medium including program instructions is also provided; the steps of the above data balancing method are realized when the program instructions are executed by a processor. For example, the computer-readable storage medium may be the above memory 702 including program instructions, which are executable by the processor 701 of the electronic device 700 to complete the above data balancing method.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, various simple variations can be made to the technical solution of the present disclosure, and these simple variations all belong to the protection scope of the present disclosure.
It should further be noted that the specific technical features described in the above specific embodiments can be combined in any suitable manner, provided that there is no contradiction. To avoid unnecessary repetition, the present disclosure does not further describe the various possible combinations.
In addition, the various different embodiments of the present disclosure can also be combined arbitrarily; as long as such combinations do not depart from the idea of the present disclosure, they should likewise be regarded as content disclosed by the present disclosure.

Claims (10)

1. A data balancing method, characterized by comprising:
determining minority-class samples from a plurality of data samples; and
oversampling the minority-class samples according to the probability distribution of the minority-class samples, so that the number of the minority-class samples reaches a first threshold.
2. The method according to claim 1, characterized in that each sample comprises one or more features,
and the oversampling of the minority-class samples according to the probability distribution of the minority-class samples comprises:
generating a Gaussian distribution of each feature according to the initial mean and initial variance of the feature in the minority-class samples; and
for each feature in the minority-class samples, executing the following oversampling procedure:
generating a new feature value of the feature, as a first new feature value, according to the Gaussian distribution of the feature;
verifying the validity of the first new feature value;
if the first new feature value is verified as invalid, deleting the first new feature value, and otherwise retaining the first new feature value; and
if the total number of feature values of the feature has not yet reached the first threshold, re-executing the oversampling procedure until the total number of feature values of the feature reaches the first threshold.
3. The method according to claim 2, characterized in that the verifying of the validity of the first new feature value comprises:
calculating the current mean and current variance of the feature;
performing a T-test on the current mean and an F-test on the current variance;
when the total number of feature values of the feature has not yet reached the first threshold, verifying the first new feature value as invalid if the current mean fails the T-test and the current variance fails the F-test; and
when the total number of feature values of the feature reaches the first threshold, verifying the first new feature value as invalid if the current mean fails the T-test or the current variance fails the F-test.
4. The method according to claim 3, characterized in that, after the step of deleting the first new feature value if it is verified as invalid and otherwise retaining the first new feature value, the oversampling procedure further comprises:
if the current mean fails the T-test but the current variance passes the F-test, generating another new feature value of the feature, as a second new feature value, according to the following formula:
X = 2(E0 + C) - E1
wherein X is the second new feature value, E1 denotes the mean of the feature before the second new feature value is generated, E0 denotes the initial mean, and C is a constant; and
if the current mean passes the T-test but the current variance fails the F-test: when the current variance is smaller than the initial variance, deleting from the generated feature values the feature value, other than the first new feature value, closest to the initial mean, and generating a third new feature value farthest from the initial mean; and when the current variance is larger than the initial variance, deleting from the generated feature values the feature value, other than the first new feature value, farthest from the initial mean, and generating a third new feature value nearest to the initial mean.
5. The method according to claim 2, characterized in that the verifying of the validity of the first new feature value comprises:
if the first new feature value falls outside a preset range of the Gaussian distribution of the feature to which it belongs, verifying the first new feature value as invalid, wherein the preset range is [initial mean − n × initial standard deviation, initial mean + n × initial standard deviation], and n is a number greater than zero.
6. The method according to any one of claims 1-5, characterized in that the method further comprises:
determining majority-class samples from the plurality of data samples; and
undersampling the majority-class samples so that the number of the majority-class samples reaches a second threshold.
7. The method according to claim 6, characterized in that the undersampling of the majority-class samples comprises:
determining the probability density of each sample in the majority-class samples; and
executing the following undersampling procedure:
determining a first sample, the first sample being any sample in the majority-class samples;
among the other majority-class samples excluding the first sample, determining as a second sample the sample whose probability density is closest to the probability density of the first sample;
deleting the second sample; and
if the total number of the majority-class samples has not yet reached the second threshold, re-executing the undersampling procedure until the total number of the majority-class samples reaches the second threshold.
8. A data balancing apparatus, characterized by comprising:
a first determining module, configured to determine minority-class samples from a plurality of data samples; and
an oversampling module, configured to oversample the minority-class samples according to the probability distribution of the minority-class samples, so that the number of the minority-class samples reaches a first threshold.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that the steps of the method according to any one of claims 1-7 are realized when the program is executed by a processor.
10. An electronic device, characterized by comprising:
a memory having a computer program stored thereon; and
a processor, configured to execute the computer program in the memory to realize the steps of the method according to any one of claims 1-7.
CN201811427339.4A 2018-11-27 2018-11-27 Data equalization method and device, computer readable storage medium and electronic equipment Active CN109726821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811427339.4A CN109726821B (en) 2018-11-27 2018-11-27 Data equalization method and device, computer readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811427339.4A CN109726821B (en) 2018-11-27 2018-11-27 Data equalization method and device, computer readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN109726821A true CN109726821A (en) 2019-05-07
CN109726821B CN109726821B (en) 2021-07-09

Family

ID=66294872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811427339.4A Active CN109726821B (en) 2018-11-27 2018-11-27 Data equalization method and device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN109726821B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738197A (en) * 2020-06-30 2020-10-02 中国联合网络通信集团有限公司 Training image information processing method and device
CN112416911A (en) * 2019-08-23 2021-02-26 广州虎牙科技有限公司 Sample data acquisition method, device, equipment and storage medium
CN115034317A (en) * 2022-06-17 2022-09-09 中国平安人寿保险股份有限公司 Training method and device of policy identification model and policy identification method and device
CN116451084A (en) * 2023-06-13 2023-07-18 北京航空航天大学 Training sample preprocessing method for driving style recognition model

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1137230A2 (en) * 2000-03-22 2001-09-26 ELMER S.p.A. MAP equalizer
US20030007552A1 (en) * 2001-07-09 2003-01-09 Intel Corporation Reduced alphabet equalizer using iterative equalization
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification method of unbalance data
CN102495901A (en) * 2011-12-16 2012-06-13 山东师范大学 Method for keeping balance of implementation class data through local mean
CN105654513A (en) * 2015-12-30 2016-06-08 电子科技大学 Moving target detection method based on sampling strategy
CN106504111A (en) * 2016-09-19 2017-03-15 清华大学 In abnormal power usage mining, class is distributed the solution of imbalance problem
CN106548196A (en) * 2016-10-20 2017-03-29 中国科学院深圳先进技术研究院 A kind of random forest sampling approach and device for non-equilibrium data
CN107169518A (en) * 2017-05-18 2017-09-15 北京京东金融科技控股有限公司 Data classification method, device, electronic installation and computer-readable medium
CN107341497A (en) * 2016-11-11 2017-11-10 东北大学 The unbalanced weighting data streams Ensemble classifier Forecasting Methodology of sampling is risen with reference to selectivity
CN108319967A (en) * 2017-11-22 2018-07-24 中国电子科技集团公司电子科学研究院 A kind of method and system that unbalanced data are handled
CN108491474A (en) * 2018-03-08 2018-09-04 平安科技(深圳)有限公司 A kind of data classification method, device, equipment and computer readable storage medium
CN108647727A (en) * 2018-05-10 2018-10-12 广州大学 Unbalanced data classification lack sampling method, apparatus, equipment and medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416911A (en) * 2019-08-23 2021-02-26 广州虎牙科技有限公司 Sample data acquisition method, device, equipment and storage medium
CN111738197A (en) * 2020-06-30 2020-10-02 中国联合网络通信集团有限公司 Training image information processing method and device
CN111738197B (en) * 2020-06-30 2023-09-05 中国联合网络通信集团有限公司 Training image information processing method and device
CN115034317A (en) * 2022-06-17 2022-09-09 中国平安人寿保险股份有限公司 Training method and device of policy identification model and policy identification method and device
CN116451084A (en) * 2023-06-13 2023-07-18 北京航空航天大学 Training sample preprocessing method for driving style recognition model
CN116451084B (en) * 2023-06-13 2023-08-11 北京航空航天大学 Training sample preprocessing method for driving style recognition model

Also Published As

Publication number Publication date
CN109726821B (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN109726821A (en) Data balancing method, device, computer readable storage medium and electronic equipment
US10943186B2 (en) Machine learning model training method and device, and electronic device
CN106716382B (en) The method and system of aggregation multiple utility program behavioural analysis for mobile device behavior
CN108171335A (en) Choosing method, device, storage medium and the electronic equipment of modeling data
CN112417439A (en) Account detection method, device, server and storage medium
JP2020510917A (en) Risk management control method and device
TW201913441A (en) Model safety detection method, device and electronic device
US20190080327A1 (en) Method, apparatus, and electronic device for risk feature screening and descriptive message generation
CN103927483B (en) The detection method of decision model and rogue program for detecting rogue program
KR20200057903A (en) Artificial intelligence model platform and operation method thereof
TWI674514B (en) Malicious software recognition apparatus and method
CN113221104B (en) Detection method of abnormal behavior of user and training method of user behavior reconstruction model
CN111968625A (en) Sensitive audio recognition model training method and recognition method fusing text information
US20240037408A1 (en) Method and apparatus for model training and data enhancement, electronic device and storage medium
EP3739524A1 (en) Method and system for protecting a machine learning model against extraction
CN109492531A (en) Face image key point extraction method and device, storage medium and electronic equipment
CN111428236A (en) Malicious software detection method, device, equipment and readable medium
CN110096013A (en) A kind of intrusion detection method and device of industrial control system
CN111367773A (en) Method, system, equipment and medium for detecting network card of server
WO2020168874A1 (en) Classifier robustness test method and device, terminal and storage medium
CN114943307A (en) Model training method and device, storage medium and electronic equipment
CN112884569A (en) Credit assessment model training method, device and equipment
CN108805211A (en) IN service type cognitive method based on machine learning
CN105608460A (en) Method and system for fusing multiple classifiers
US11973756B2 (en) Systems and methods for improving computer identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant