CN115310524A - Sample generation method, device, equipment and storage medium based on risk control model - Google Patents

Info

Publication number
CN115310524A
Authority
CN
China
Prior art keywords
data
gaussian
current
target
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210886072.5A
Other languages
Chinese (zh)
Inventor
李潇
岳帅
吴艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Fuli Technology Co ltd
Original Assignee
Shanghai Fuli Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Fuli Technology Co ltd
Priority to CN202210886072.5A
Publication of CN115310524A
Legal status: Pending

Abstract

The invention belongs to the technical field of credit risk control, and discloses a sample generation method, device, equipment and storage medium based on a risk control model. The method comprises the following steps: obtaining current expected data according to current sample data; obtaining current Gaussian parameter data according to the current expected data, and determining target Gaussian parameter data according to the current Gaussian parameter data; acquiring a corresponding relation between evaluation data and the Gaussian mixture component quantity, and determining the target Gaussian mixture component quantity according to the corresponding relation; determining a target Gaussian mixture distribution according to the target Gaussian parameter data and the target Gaussian mixture component quantity, and generating target sample data according to the target Gaussian mixture distribution; and training a risk control model according to the current sample data and the target sample data, and performing user detection with the trained risk control model. In this way, data bias caused by sample screening is avoided, the survivorship bias problem is addressed, and the generalization capability of the risk control model is improved.

Description

Sample generation method, device, equipment and storage medium based on risk control model
Technical Field
The invention relates to the technical field of credit risk control, and in particular to a sample generation method, device, equipment and storage medium based on a risk control model.
Background
A risk control model with a rejection attribute generally suffers from a serious problem during iteration: survivorship bias. Only samples that scored above the threshold of the previous model version can enter training of the current model, and these samples carry little or no information about the rejected samples, so the training set gradually drifts away from the true distribution. A model learned from such a biased sample set has difficulty giving accurate results for samples it has not been able to characterize. Over time, as the model iterates, features that were highly discriminative are weakened and may even act on the model in exactly the opposite direction. The traditional solution is reject inference, which adds pseudo-labels and weights to the rejected samples. Its effect depends on the overall negative sample ratio, but the true overall negative sample ratio is often difficult to obtain, so the effect of reject inference is limited and unbiased data remains difficult to obtain.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The main purpose of the invention is to provide a sample generation method, device, equipment and storage medium based on a risk control model, so as to solve the technical problem in the prior art that, although reject inference is used to add pseudo-labels and weights to rejected samples in order to reduce survivorship bias, the true overall negative sample ratio is difficult to obtain, unbiased data remains difficult to obtain, and the application effect of the risk control model is affected.
In order to achieve the above object, the present invention provides a sample generation method based on a risk control model, including the following steps:
obtaining current expected data according to current sample data;
obtaining current Gaussian parameter data according to the current expected data, and determining target Gaussian parameter data according to the current Gaussian parameter data;
acquiring a corresponding relation between evaluation data and the Gaussian mixture component quantity, and determining the target Gaussian mixture component quantity according to the corresponding relation;
determining target Gaussian mixture distribution of the current sample data according to the target Gaussian parameter data and the target Gaussian mixture component quantity, and generating target sample data according to the target Gaussian mixture distribution;
and carrying out risk control model training according to the current sample data and the target sample data, and carrying out user detection according to the risk control model obtained by training.
Optionally, the obtaining of the current expected data according to the current sample data includes:
obtaining initial mean data and initial standard deviation data according to current sample data;
obtaining initial Gaussian parameter data according to the initial mean data and the initial standard deviation data;
acquiring a corresponding relation between expected data and initial Gaussian parameter data;
and obtaining the current expected data according to the initial Gaussian parameter data and the corresponding relation between the expected data and the initial Gaussian parameter data.
Optionally, the obtaining current gaussian parameter data according to the current expected data, and determining target gaussian parameter data according to the current gaussian parameter data include:
acquiring a corresponding relation between current expected data and Gaussian parameter data;
obtaining current Gaussian parameter data according to the current expected data and the corresponding relation between the current expected data and the Gaussian parameter data;
and when the current Gaussian parameter data meet a preset convergence condition, determining target Gaussian parameter data according to the current Gaussian parameter data, wherein the target Gaussian parameter data comprise target mean data, target standard deviation data and target weight data.
Optionally, after obtaining the current gaussian parameter data according to the current expected data and the corresponding relationship between the current expected data and the gaussian parameter data, the method further includes:
when the current Gaussian parameter data do not meet the preset convergence condition, updating the initial Gaussian parameter data according to the current Gaussian parameter data;
and returning to execute the step of obtaining the current expected data according to the initial Gaussian parameter data and the relation between the expected data and the initial Gaussian parameter data.
Optionally, the obtaining a corresponding relationship between the evaluation data and the number of gaussian mixture components, and determining the number of target gaussian mixture components according to the corresponding relationship, includes:
obtaining a current likelihood function;
acquiring a corresponding relation among a likelihood function, the Gaussian mixture component quantity and the evaluation data;
obtaining the corresponding relation between the evaluation data and the Gaussian mixture component quantity according to the current likelihood function and the corresponding relation among the likelihood function, the Gaussian mixture component quantity and the evaluation data;
and when the current evaluation data meet the preset data condition, determining the target Gaussian mixture component quantity according to the current evaluation data and the corresponding relation between the evaluation data and the Gaussian mixture component quantity.
Optionally, the obtaining the current likelihood function includes:
acquiring a corresponding relation among Gaussian parameter data, the Gaussian mixture component quantity and the Gaussian mixture distribution;
determining an initial Gaussian mixture distribution according to the target Gaussian parameter data and the corresponding relation among the Gaussian parameter data, the Gaussian mixture component quantity and the Gaussian mixture distribution;
and obtaining the current likelihood function according to the initial Gaussian mixture distribution.
Optionally, the determining, according to the target Gaussian parameter data and the target Gaussian mixture component quantity, a target Gaussian mixture distribution of current sample data, and generating, according to the target Gaussian mixture distribution, target sample data includes:
determining the target Gaussian mixture distribution of the current sample data according to the target Gaussian parameter data, the target Gaussian mixture component quantity and the corresponding relation among the Gaussian parameter data, the Gaussian mixture component quantity and the Gaussian mixture distribution;
acquiring the quantity of preset sample data;
and generating target sample data according to the target Gaussian mixture distribution and the number of preset sample data.
In addition, in order to achieve the above object, the present invention further provides a sample generation device based on a risk control model, including:
the acquisition module is used for acquiring current expected data according to current sample data;
the acquisition module is further used for obtaining current Gaussian parameter data according to the current expected data and determining target Gaussian parameter data according to the current Gaussian parameter data;
the acquisition module is also used for acquiring the corresponding relation between the evaluation data and the Gaussian mixture component quantity, and determining the target Gaussian mixture component quantity according to the corresponding relation;
the generation module is used for determining the target Gaussian mixture distribution of the current sample data according to the target Gaussian parameter data and the target Gaussian mixture component quantity, and generating target sample data according to the target Gaussian mixture distribution;
and the training module is used for carrying out risk control model training according to the current sample data and the target sample data and carrying out user detection according to the risk control model obtained through training.
In addition, in order to achieve the above object, the present invention further provides sample generation equipment based on a risk control model, including: a memory, a processor, and a sample generation program based on a risk control model that is stored in the memory and executable on the processor, wherein the sample generation program based on the risk control model is configured to implement the steps of the sample generation method based on the risk control model as described above.
In addition, to achieve the above object, the present invention further provides a storage medium, wherein the storage medium stores a sample generation program based on a risk control model, and the sample generation program based on the risk control model, when executed by a processor, implements the steps of the sample generation method based on the risk control model as described above.
According to the invention, current expected data are obtained from current sample data, and current Gaussian parameter data are obtained from the current expected data; target Gaussian parameter data are determined from the current Gaussian parameter data; the target Gaussian mixture component quantity is determined from the corresponding relation between evaluation data and the Gaussian mixture component quantity; the target Gaussian mixture distribution of the current sample data is determined from the target Gaussian parameter data and the target Gaussian mixture component quantity, and target sample data are generated; risk control model training is then carried out on the current sample data and the target sample data, and user detection is performed with the trained risk control model. In the prior art, reject inference cannot readily obtain the true overall negative sample ratio and therefore struggles to solve the survivorship bias problem. By contrast, the invention generates new samples that obey the distribution of the current samples, combines them with the current samples, and then trains the risk control model, so that the trained model has stronger generalization capability. This solves the technical problem that unbiased data are still difficult to obtain through reject inference, avoids the data bias caused by sample screening, addresses survivorship bias, and improves the detection effect of the model.
Drawings
FIG. 1 is a schematic structural diagram of sample generation equipment based on a risk control model in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of a sample generation method based on a risk control model according to the present invention;
FIG. 3 is a schematic diagram illustrating survivorship bias according to an embodiment of the sample generation method based on a risk control model;
FIG. 4 is a schematic diagram of the Gaussian distribution of sample data in an embodiment of the sample generation method based on a risk control model according to the present invention;
FIG. 5 is a schematic diagram of the Gaussian distribution of mixed sample data according to an embodiment of the sample generation method based on a risk control model;
FIG. 6 is a schematic sample distribution diagram of an embodiment of the sample generation method based on a risk control model according to the present invention;
FIG. 7 is a schematic diagram of a Gaussian mixture model according to an embodiment of the sample generation method based on a risk control model;
FIG. 8 is a schematic diagram illustrating new sample generation according to an embodiment of the sample generation method based on a risk control model of the present invention;
FIG. 9 is a schematic flow chart of a second embodiment of a sample generation method based on a risk control model according to the present invention;
FIG. 10 is a schematic flow chart of a third embodiment of a sample generation method based on a risk control model according to the present invention;
FIG. 11 is a schematic diagram of the AIC distribution according to an embodiment of the sample generation method based on a risk control model;
FIG. 12 is a block diagram of a sample generation device based on a risk control model according to a first embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of sample generation equipment based on a risk control model in a hardware operating environment according to an embodiment of the present invention.
As shown in FIG. 1, the sample generation equipment based on the risk control model may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the aforementioned processor 1001.
Those skilled in the art will appreciate that the configuration shown in FIG. 1 does not constitute a limitation of the sample generation equipment based on the risk control model, which may include more or fewer components than shown, combine some components, or arrange the components differently.
As shown in FIG. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and a sample generation program based on a risk control model.
In the sample generation equipment based on the risk control model shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 may be disposed in the sample generation equipment based on the risk control model, which calls the sample generation program based on the risk control model stored in the memory 1005 through the processor 1001 and executes the sample generation method based on the risk control model provided by the embodiment of the present invention.
An embodiment of the present invention provides a sample generation method based on a risk control model. Referring to FIG. 2, FIG. 2 is a schematic flow diagram of a first embodiment of the sample generation method based on a risk control model according to the present invention.
In this embodiment, the sample generation method based on the risk control model includes the following steps:
Step S10: and obtaining the current expected data according to the current sample data.
It should be noted that the execution subject of this embodiment is a computer, which may be any computer capable of running a sample generation program based on a risk control model; this embodiment does not limit this. New samples are generated by the sample generation program based on the risk control model provided on the computer and are used for training the risk control model.
It should be understood that some risk control models have a rejection attribute. For example, an application scorecard in risk control rejects low-scoring customers that do not meet requirements. Each time the model is iterated, the samples used have already been screened by the previous model, and only samples above the score threshold of the previous model can enter training of the current model. As shown in the survivorship bias diagram of FIG. 3, the samples that can be observed are the survivors; they usually carry little or no information about the rejected samples, so the samples gradually deviate from the true distribution and the survivorship bias problem arises. A model learned from a biased sample set has difficulty giving accurate results for samples it has not been able to characterize, and over time, as the model iterates, features that were highly discriminative are weakened and may even act on the model in exactly the opposite direction, for example: the weight coefficient of a certain feature changes from a positive number to a negative number.
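By way of non-limiting illustration, the screening effect described above can be reproduced with a small simulation. The score distribution, the cut-off value `THRESHOLD` and all variable names below are assumptions chosen only for this sketch; they do not come from the embodiment.

```python
import random
import statistics

random.seed(0)

# Hypothetical applicant scores drawn from a single normal distribution.
population = [random.gauss(600, 50) for _ in range(10_000)]

# Only applicants at or above the previous model's cut-off are approved,
# so only they can be observed in the next training set.
THRESHOLD = 620  # assumed cut-off, for illustration only
survivors = [score for score in population if score >= THRESHOLD]

population_mean = statistics.mean(population)
survivor_mean = statistics.mean(survivors)

print(f"population size: {len(population)}, mean score: {population_mean:.1f}")
print(f"survivor size:   {len(survivors)}, mean score: {survivor_mean:.1f}")
```

The survivor set is smaller than the population and its mean is shifted upward, which is the deviation from the true distribution that the embodiment sets out to correct.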
It will be appreciated that, to correct the biased information, the model need only be retrained using unbiased data. The traditional solution is reject inference, which adds pseudo-labels and weights to the rejected samples. The effect of reject inference depends on the overall negative sample ratio, but the true overall negative sample ratio is difficult to obtain, so the effect of reject inference is limited and unbiased data remains difficult to obtain. The Gaussian distribution (also known as the normal distribution) is the most common distribution form and occurs widely in nature. A Gaussian Mixture Model (GMM), a widely used clustering algorithm, is a linear combination of multiple Gaussian distribution functions and is itself a generative model. In theory, a GMM can fit any type of distribution, and new samples can be generated from each distribution component of the model, so this embodiment uses the Gaussian mixture model to address survivorship bias.
In a specific implementation, when the sample data is one-dimensional data, the gaussian distribution obeys the following probability density function:
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
in the formula, the parameter $\mu$ represents the mean, which corresponds to the center of the normal distribution; the parameter $\sigma$ represents the standard deviation, which measures how widely the data are dispersed around the mean; $x$ is the current sample data; and $f(x \mid \mu, \sigma^2)$ is the probability density function. As shown in the schematic diagram of the Gaussian distribution of sample data in FIG. 4, once the mean and the standard deviation have been calculated, the probability density function that the sample data obey can be obtained. When the individual sample values are not known exactly, the parameters can still be estimated from the distribution: if the interval containing the most data lies near 180, the mean can be presumed to be 180 and the standard deviation 5, and the probability density function that the sample data obey can likewise be obtained. If the sample contains two groups of sample data, the Gaussian distribution drawn above is actually the superposition of two Gaussian distributions, as shown in the Gaussian distribution diagram of mixed sample data in FIG. 5. The probability values on the vertical axis of that diagram are calculated on the premise that the group of each sample is known, but in general this information is not available, for example because it was not recorded when the data were collected. In that case not only the parameters of each distribution but also the assignment of samples to groups must be estimated. A plurality of Gaussian models are then combined by linear combination to form a Gaussian mixture model, whose probability density is given by:
$$P(y \mid \theta) = \sum_{k=1}^{K} \alpha_k \, \phi(y \mid \theta_k)$$
in the formula, $y$ is the current sample data, $K$ is the number of sub-Gaussian models in the mixture model, $k = 1, 2, \ldots, K$, and $\alpha_k$ is the probability that the sample data belongs to the $k$-th sub-Gaussian model, i.e., the weight value, with $\alpha_k \ge 0$ and
$$\sum_{k=1}^{K} \alpha_k = 1$$
$\phi(y \mid \theta_k)$ is the Gaussian distribution density function of the $k$-th sub-Gaussian model,
$$\phi(y \mid \theta_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(y - \mu_k)^2}{2\sigma_k^2}\right), \quad \theta_k = (\mu_k, \sigma_k^2)$$
and $P(y \mid \theta)$ is the probability density function. After each parameter in the mixture model has been calculated and the number of sub-Gaussian models has been determined, a Gaussian mixture model that obeys the distribution of several different groups of sample data can be obtained. This embodiment estimates the parameters of the Gaussian mixture model by using the Expectation Maximization (EM) algorithm, and determines the number of sub-Gaussian models by using the Akaike Information Criterion (AIC).
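As a minimal sketch of the density functions discussed above, the one-dimensional Gaussian density and the mixture density can be evaluated as follows. The function names and the two-component parameter values are assumptions made for illustration only.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian N(mu, sigma^2) at x."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


def mixture_pdf(y, weights, means, sigmas):
    """Mixture density: the weighted sum of the sub-Gaussian densities at y."""
    return sum(a * gaussian_pdf(y, m, s) for a, m, s in zip(weights, means, sigmas))


# Illustrative parameters only: two sub-Gaussian models whose weights sum to 1.
weights = [0.4, 0.6]
means = [165.0, 180.0]
sigmas = [5.0, 5.0]

for y in (160.0, 172.5, 180.0):
    print(f"P({y}) = {mixture_pdf(y, weights, means, sigmas):.5f}")
```

Because the weights are non-negative and sum to 1, the mixture density itself integrates to 1, which is what allows it to be used directly as a sampling distribution.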
In this embodiment, the current sample data is the original sample data used for training the risk control model. The EM algorithm is an iterative algorithm that solves for the maximum likelihood estimate of the parameters of a probability model, and the optimal parameters of the Gaussian mixture model can be estimated with it. The current expected data is the probability that each sample data comes from each sub-Gaussian model and is calculated by the EM algorithm; once the current expected data is obtained, the current Gaussian parameter data can be determined, and the two steps are iterated.
Step S20: and obtaining current Gaussian parameter data according to the current expected data, and determining target Gaussian parameter data according to the current Gaussian parameter data.
It should be noted that the current gaussian parameter data is a parameter value of the gaussian mixture model calculated according to the current expected data, and includes current mean value data, current standard deviation data and current weight data, that is, the currently calculated mean value, standard deviation and weight, and the target gaussian parameter data is an optimal parameter of the gaussian mixture model complying with the current sample distribution, and includes target mean value data, target standard deviation data and target weight data, that is, the optimal mean value, optimal standard deviation and optimal weight that are finally required to be calculated.
In this embodiment, current gaussian parameter data can be calculated according to the obtained current expected data, and after multiple iterations of the EM algorithm, when the current gaussian parameter data changes very little, the gaussian parameter data at this time is considered to be the optimal gaussian parameter data, and can be used to establish a required gaussian mixture model.
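The iteration described in steps S10 and S20 can be sketched, under stated assumptions, as a plain EM loop for a one-dimensional mixture. The convergence test (largest parameter change below `tol`), the synthetic data, the initial values and all function names are assumptions of this sketch; the embodiment only requires that the parameters "change very little". No numerical safeguards beyond a floor on the standard deviation are included.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def em_step(data, weights, means, sigmas):
    """One EM iteration; returns the updated (weights, means, sigmas)."""
    k = len(weights)
    # E-step: expected data, i.e. the probability that each sample comes from each sub-model.
    resp = []
    for y in data:
        joint = [weights[i] * gaussian_pdf(y, means[i], sigmas[i]) for i in range(k)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    # M-step: Gaussian parameter data re-estimated from the expected data.
    new_w, new_m, new_s = [], [], []
    for i in range(k):
        nk = sum(r[i] for r in resp)
        mu = sum(r[i] * y for r, y in zip(resp, data)) / nk
        var = sum(r[i] * (y - mu) ** 2 for r, y in zip(resp, data)) / nk
        new_w.append(nk / len(data))
        new_m.append(mu)
        new_s.append(max(math.sqrt(var), 1e-6))  # floor keeps sigma strictly positive
    return new_w, new_m, new_s


def fit_em(data, weights, means, sigmas, tol=1e-6, max_iter=500):
    """Repeat EM steps until the largest parameter change is below tol (assumed criterion)."""
    for _ in range(max_iter):
        updated = em_step(data, weights, means, sigmas)
        old = weights + means + sigmas
        new = updated[0] + updated[1] + updated[2]
        weights, means, sigmas = updated
        if max(abs(a - b) for a, b in zip(new, old)) < tol:
            break
    return weights, means, sigmas


# Synthetic two-group data, for illustration only.
random.seed(1)
data = [random.gauss(160, 4) for _ in range(300)] + [random.gauss(185, 4) for _ in range(300)]
weights, means, sigmas = fit_em(data, [0.5, 0.5], [150.0, 190.0], [10.0, 10.0])
print("weights:", weights)
print("means:  ", means)
print("sigmas: ", sigmas)
```

The values returned when the loop stops play the role of the target mean data, target standard deviation data and target weight data.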
Step S30: and acquiring a corresponding relation between the evaluation data and the Gaussian mixture component quantity, and determining the target Gaussian mixture component quantity according to the corresponding relation.
In a specific implementation, it is necessary to determine how many Gaussian distributions should be used in each modeling run to achieve the best clustering effect. Many parameter estimation problems use the likelihood function as the objective function. When training data are sufficient, model accuracy can be improved continuously, but at the cost of increased model complexity, which brings a problem that is very common in machine learning: overfitting. A method is therefore needed to find the best balance between model complexity and the model's ability to describe the data set. The AIC is a criterion for measuring the goodness of fit of a statistical model. It is founded on the concept of entropy and weighs the complexity of the estimated model against how well it fits the data; by adding a penalty term for model complexity it helps avoid overfitting, so that the optimal Gaussian mixture component quantity of the mixture model, i.e., the optimal number of sub-Gaussian models, can be found.
It can be understood that the evaluation data is an AIC value, the target gaussian mixture component quantity is an optimal gaussian mixture component quantity, namely an optimal gaussian sub-model quantity, and a corresponding relation between the evaluation data and the gaussian mixture component quantity is a calculation expression between the AIC value and the gaussian mixture component quantity.
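The selection of the component quantity can be sketched as follows, assuming the standard definition AIC = 2p - 2 ln L, where L is the maximized likelihood and p is the number of free parameters (taken here as 3K - 1 for a one-dimensional mixture with K components). The exact expression used by the embodiment may differ; the quantile-based initialization, the fixed iteration count, the synthetic data and all names below are assumptions of this sketch.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def fit_gmm(data, k, n_iter=100):
    """Plain EM for a 1-D mixture with k components, run for a fixed number of iterations."""
    ordered = sorted(data)
    n = len(data)
    # Assumed initialization: quantile-based means, a shared spread, equal weights.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    spread = (ordered[-1] - ordered[0]) / (2 * k) or 1.0
    sigmas = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        resp = []
        for y in data:
            joint = [w * gaussian_pdf(y, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        for i in range(k):
            nk = sum(r[i] for r in resp)
            means[i] = sum(r[i] * y for r, y in zip(resp, data)) / nk
            var = sum(r[i] * (y - means[i]) ** 2 for r, y in zip(resp, data)) / nk
            sigmas[i] = max(math.sqrt(var), 1e-3)  # floor keeps sigma strictly positive
            weights[i] = nk / n
    return weights, means, sigmas


def log_likelihood(data, weights, means, sigmas):
    """Log-likelihood of the data under the fitted mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(y, m, s) for w, m, s in zip(weights, means, sigmas)))
        for y in data
    )


def aic(data, k):
    """AIC = 2p - 2 ln L with p = 3k - 1 free parameters (assumed parameter count)."""
    params = fit_gmm(data, k)
    n_free = 3 * k - 1
    return 2 * n_free - 2 * log_likelihood(data, *params)


# Synthetic two-group data, for illustration only.
random.seed(2)
data = [random.gauss(160, 4) for _ in range(200)] + [random.gauss(185, 4) for _ in range(200)]
scores = {k: aic(data, k) for k in range(1, 5)}
best_k = min(scores, key=scores.get)
for k, value in scores.items():
    print(f"K={k}: AIC={value:.1f}")
print("selected component quantity:", best_k)
```

The candidate with the smallest AIC is taken as the target Gaussian mixture component quantity; the candidate range 1 to 4 is arbitrary and would be widened in practice.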
Step S40: and determining the target Gaussian mixture distribution of the current sample data according to the target Gaussian parameter data and the target Gaussian mixture component amount, and generating the target sample data according to the target Gaussian mixture distribution.
The step S40 includes: determining the target Gaussian mixture distribution of the current sample data according to the target Gaussian parameter data, the target Gaussian mixture component quantity and the corresponding relation among the Gaussian parameter data, the Gaussian mixture component quantity and the Gaussian mixture distribution; acquiring the preset sample data quantity; and generating the target sample data according to the target Gaussian mixture distribution and the preset sample data quantity.
It should be noted that the corresponding relation among the Gaussian parameter data, the Gaussian mixture component quantity and the Gaussian mixture distribution is the probability density function of the Gaussian mixture model. The target Gaussian mixture distribution is the finally obtained Gaussian mixture model that obeys the current sample distribution. The preset sample data quantity is the number of samples to be generated by the Gaussian mixture model, i.e., the quantity of target sample data, for example 100 or 200; it may be set in the sample generation program based on the risk control model, and this embodiment does not limit it. The target sample data are the new samples generated by the Gaussian mixture model, and the new samples obey the distribution of the current samples.
In a specific implementation, a Gaussian mixture model that obeys the current sample distribution is established from the calculated parameters of the Gaussian mixture model and the optimal Gaussian mixture component quantity, and new samples obeying the current sample distribution are generated by the established model, so that data bias caused by sample screening is avoided. For example, clustering the sample distribution shown in FIG. 6 with 10 sub-Gaussian models yields the Gaussian mixture model shown in FIG. 7; with the number of new samples set to 200, the obtained Gaussian mixture model generates the new samples shown in FIG. 8, which obey the distribution of the original samples and do not coincide with the original sample points.
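Generating new samples from a fitted mixture can be sketched as a two-stage draw: first pick a sub-Gaussian model with probability equal to its weight, then draw a value from that sub-model. The parameter values below are assumed for illustration and are not the fitted values of FIG. 7; the function name `sample_gmm` is likewise hypothetical.

```python
import random


def sample_gmm(weights, means, sigmas, n_samples, rng):
    """Draw n_samples values from a one-dimensional Gaussian mixture."""
    indices = range(len(weights))
    samples = []
    for _ in range(n_samples):
        k = rng.choices(indices, weights=weights)[0]    # choose a sub-model by its weight
        samples.append(rng.gauss(means[k], sigmas[k]))  # draw from the chosen sub-model
    return samples


# Assumed target Gaussian parameter data, for illustration only.
weights = [0.5, 0.5]
means = [160.0, 185.0]
sigmas = [4.0, 4.0]

rng = random.Random(3)
new_samples = sample_gmm(weights, means, sigmas, n_samples=200, rng=rng)
print(len(new_samples), "new samples, first five:", [round(v, 1) for v in new_samples[:5]])
```

Because the draws are continuous random values, they follow the fitted distribution without reproducing the original sample points.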
Step S50: and carrying out risk control model training according to the current sample data and the target sample data, and carrying out user detection according to the risk control model obtained by training.
In this embodiment, current expected data and current Gaussian parameter data are obtained according to current sample data, so that target Gaussian parameter data are determined; the optimal number of sub-Gaussian models is determined according to the corresponding relation between evaluation data and the Gaussian mixture component quantity; a Gaussian mixture model obeying the current sample data is obtained according to the target Gaussian parameter data and the optimal number of sub-Gaussian models, so that target sample data are generated; risk control model training is performed according to the current sample data and the target sample data, and user detection is performed according to the risk control model obtained through training. The newly generated samples obey the distribution of the current samples and do not coincide with the current sample points. When the new samples are combined with the historical samples, the trained risk control model has stronger generalization capability, the survivorship bias problem is addressed, and the detection effect of the risk control model is improved.
Referring to fig. 9, fig. 9 is a schematic flowchart of a second embodiment of a sample generation method based on a wind control model according to the present invention.
Based on the first embodiment, the step S10 includes:
Step S101: obtain initial mean data and initial standard deviation data according to the current sample data, and obtain initial Gaussian parameter data according to the initial mean data and the initial standard deviation data.
It can be understood that the initial mean data is an initialized mean and the initial standard deviation data is a preset standard deviation; both can be set in the sample generation program based on the wind control model. The initial Gaussian parameter data is the initialized Gaussian parameter data.
In a specific implementation, the current sample data is inspected to estimate a reasonably accurate mean and standard deviation. The mean data and standard deviation data are initialized from these estimates to obtain the initial mean data and initial standard deviation data, from which the Gaussian parameter data is initialized to obtain the initial Gaussian parameter data.
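The text leaves the initialization rule open. One possible sketch, under the assumption that the samples are scalars, places the initial means at evenly spaced quantiles of the data, starts every component with the overall sample standard deviation, and uses equal weights. The function name `init_gaussian_params` and this particular heuristic are assumptions for illustration only.

```python
import statistics


def init_gaussian_params(samples, n_components):
    """Return (weights, means, stds) as a starting point for 1-D EM (sketch)."""
    ordered = sorted(samples)
    n = len(ordered)
    # Assumption: spread the initial means over evenly spaced quantiles.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    # Assumption: every component starts with the overall sample standard deviation.
    overall_std = statistics.pstdev(ordered)
    stds = [overall_std] * n_components
    # Assumption: equal initial weights that sum to 1.
    weights = [1.0 / n_components] * n_components
    return weights, means, stds


weights0, means0, stds0 = init_gaussian_params([1, 2, 3, 10, 11, 12], 2)
```

Any other reasonable starting point can be substituted; EM only needs initial values that are not degenerate, for example a positive standard deviation for each component.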
Step S102: acquire the correspondence between the expected data and the initial Gaussian parameter data.
It should be noted that the expected data is the probability that each sample comes from each sub-Gaussian model, and the Gaussian parameter data are the parameters of the Gaussian mixture model. The correspondence between the expected data and the initial Gaussian parameter data is the calculation expression of the expected data:
$$\hat{\gamma}_{jk} = \frac{\alpha_k \, \phi(y_j \mid \theta_k)}{\sum_{k=1}^{K} \alpha_k \, \phi(y_j \mid \theta_k)}, \quad j = 1, 2, \ldots, N; \; k = 1, 2, \ldots, K$$
In the formula, $y_j$ is the j-th current sample data, $K$ is the number of sub-Gaussian models in the mixture model, $k = 1, 2, \ldots, K$, and $\alpha_k$ is the probability that a sample belongs to the k-th sub-Gaussian model, i.e., its weight, with $\alpha_k \ge 0$ and
$$\sum_{k=1}^{K} \alpha_k = 1.$$
$\phi(y_j \mid \theta_k)$ is the Gaussian density function of the k-th sub-Gaussian model, with $\theta_k = (\mu_k, \sigma_k^2)$:
$$\phi(y_j \mid \theta_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(y_j - \mu_k)^2}{2\sigma_k^2}\right)$$
$\hat{\gamma}_{jk}$ is the probability that the j-th sample comes from the k-th sub-Gaussian model.
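The expectation calculation described here, in which every sample is assigned a probability of coming from each sub-Gaussian model, can be sketched as follows for scalar samples. The names `gaussian_pdf` and `e_step` are illustrative, and the sketch assumes that at least one component gives each sample a non-zero density.

```python
import math


def gaussian_pdf(y, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    return math.exp(-((y - mean) ** 2) / (2.0 * std ** 2)) / (math.sqrt(2.0 * math.pi) * std)


def e_step(samples, weights, means, stds):
    """Return a list of rows; row j holds the probabilities of sample j per component."""
    responsibilities = []
    for y in samples:
        # Weighted density of the sample under every sub-Gaussian model.
        weighted = [a * gaussian_pdf(y, m, s) for a, m, s in zip(weights, means, stds)]
        total = sum(weighted)  # assumed to be non-zero
        # Normalize so that the probabilities of one sample sum to 1.
        responsibilities.append([w / total for w in weighted])
    return responsibilities


resp = e_step([0.0, 5.0], [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

In this small example each sample lies at the mean of one component, so it is assigned almost entirely to that component.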
Step S103: obtain the current expected data according to the initial Gaussian parameter data and the correspondence between the expected data and the initial Gaussian parameter data.
In a specific implementation, the initial Gaussian parameter data is substituted into the calculation expression of the expected data to obtain the current expected data, which is then used to calculate the Gaussian mixture model parameters of the next iteration.
Further, the step S20 includes:
Step S201: obtain the current Gaussian parameter data according to the current expected data and the correspondence between the current expected data and the Gaussian parameter data.
It can be understood that the correspondence between the current expected data and the Gaussian parameter data is the calculation relation for each Gaussian parameter. The current Gaussian parameter data includes current mean data, current standard deviation data and current weight data, calculated as follows:
$$\hat{\mu}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk} \, y_j}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}$$
$$\hat{\sigma}_k^2 = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk} \, (y_j - \mu_k)^2}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}$$
$$\hat{\alpha}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}}{N}$$
In the formulas, $y_j$ is the j-th current sample data, $N$ is the number of current samples, $K$ is the number of sub-Gaussian models in the mixture model, $k = 1, 2, \ldots, K$, $\mu_k$ is the initial mean data, $\hat{\mu}_k$ is the current mean data, $\hat{\sigma}_k$ is the current standard deviation data, and $\hat{\alpha}_k$ is the current weight data.
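A minimal sketch of this parameter update for scalar samples is given below. It takes the per-sample probabilities from the expectation step and returns the current weight, mean and standard deviation of each sub-Gaussian model. The name `m_step` is illustrative, and as an assumption of the sketch the variance is taken around the freshly updated mean, which is the usual textbook form of the update.

```python
import math


def m_step(samples, responsibilities):
    """Re-estimate (weights, means, stds) of a 1-D mixture from responsibilities."""
    n = len(samples)
    n_components = len(responsibilities[0])
    weights, means, stds = [], [], []
    for k in range(n_components):
        resp_k = [row[k] for row in responsibilities]
        resp_sum = sum(resp_k)
        # Current mean: responsibility-weighted average of the samples.
        mean_k = sum(r * y for r, y in zip(resp_k, samples)) / resp_sum
        # Current variance around the updated mean (assumption of this sketch).
        var_k = sum(r * (y - mean_k) ** 2 for r, y in zip(resp_k, samples)) / resp_sum
        weights.append(resp_sum / n)  # current weight
        means.append(mean_k)
        stds.append(math.sqrt(var_k))
    return weights, means, stds


# Hard assignments: the first two samples belong to component 0, the rest to component 1.
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
new_weights, new_means, new_stds = m_step([1.0, 3.0, 10.0, 12.0], hard)
```

With hard assignments the update reduces to the ordinary mean and standard deviation of each group, which makes the result easy to verify by hand.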
Step S202: when the current Gaussian parameter data meets a preset convergence condition, determine the target Gaussian parameter data according to the current Gaussian parameter data.
It should be noted that the preset convergence condition is that the Gaussian parameters change very little after an iteration, i.e. $|\theta_{i+1} - \theta_i| < \varepsilon$, where $\varepsilon$ is a very small positive number. When the change between the (i+1)-th parameter $\theta_{i+1}$ and the i-th parameter $\theta_i$ is small enough, both the expected data and the Gaussian parameter data are considered to have converged, and the current Gaussian parameter data is taken as the required target Gaussian parameter data.
After the step S201, the method further includes: when the current Gaussian parameter data does not meet the preset convergence condition, updating the initial Gaussian parameter data according to the current Gaussian parameter data, and returning to the step of obtaining the current expected data according to the initial Gaussian parameter data and the correspondence between the expected data and the initial Gaussian parameter data.
In a specific implementation, if the obtained current Gaussian parameter data and current expected data do not meet the convergence condition, the initial Gaussian parameter data is replaced by the current Gaussian parameter data and a new iteration is performed. The current expected data and the current Gaussian parameter data are recalculated in this way until the convergence condition is met, at which point the iteration is considered complete.
In this embodiment, a mean and a standard deviation are initialized according to the current sample data, and the expected data is calculated from the initialized mean and standard deviation. The Gaussian mixture model parameters (mean, standard deviation and weight) of a new iteration are obtained from the calculated expected data. If the current Gaussian mixture model parameters have changed little enough relative to the previous parameters, they are considered to meet the convergence condition, and the optimal parameters of the Gaussian mixture model are thereby estimated; otherwise, the iterative calculation continues until convergence. Because the optimal parameter values are estimated with the EM algorithm, the Gaussian mixture model built from them fits the current sample data better, which reduces deviation in the new sample generation process, so that the new samples ultimately generated follow the current distribution and sample generation is more accurate.
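The whole iteration of this embodiment can be sketched as a single loop for scalar samples. The function name `fit_gmm_em`, the tolerance `eps`, the iteration cap `max_iter`, the small variance floor and the toy data are all assumptions made for illustration. The convergence test uses the largest absolute change over all parameters, which is one possible reading of the condition that the parameters change very little.

```python
import math


def fit_gmm_em(samples, weights, means, stds, eps=1e-6, max_iter=500):
    """Alternate expectation and parameter updates until the parameters stop changing."""
    n = len(samples)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Expectation: probability that each sample comes from each sub-Gaussian model.
        resp = []
        for y in samples:
            dens = [a * math.exp(-((y - m) ** 2) / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s)
                    for a, m, s in zip(weights, means, stds)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # Update: re-estimate weight, mean and standard deviation of each component.
        new_weights, new_means, new_stds = [], [], []
        for k in range(len(means)):
            r_k = [row[k] for row in resp]
            r_sum = sum(r_k)
            mu = sum(r * y for r, y in zip(r_k, samples)) / r_sum
            var = sum(r * (y - mu) ** 2 for r, y in zip(r_k, samples)) / r_sum
            new_weights.append(r_sum / n)
            new_means.append(mu)
            new_stds.append(math.sqrt(max(var, 1e-12)))  # assumed variance floor
        # Convergence: largest absolute change over all parameters.
        old = list(weights) + list(means) + list(stds)
        new = new_weights + new_means + new_stds
        change = max(abs(a - b) for a, b in zip(new, old))
        weights, means, stds = new_weights, new_means, new_stds
        if change < eps:
            break
    return weights, means, stds, iteration


# Toy data with two well-separated groups (hypothetical values).
data = [-2.2, -2.0, -1.8, -2.1, -1.9, 2.9, 3.0, 3.1, 3.2, 2.8]
w, m, s, n_iter = fit_gmm_em(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

On this toy data the two estimated means settle near the centers of the two groups and each component receives about half of the total weight.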
Referring to fig. 10, fig. 10 is a schematic flowchart of a third embodiment of a sample generation method based on a wind control model according to the present invention.
Based on the first embodiment, the step S30 includes:
step S301: a current likelihood function is obtained.
The step S301 includes: acquiring corresponding relations among Gaussian parameter data, the Gaussian mixture component quantity and Gaussian mixture distribution, determining initial Gaussian mixture distribution according to the corresponding relations among the target Gaussian parameter data, the Gaussian mixture component quantity and the Gaussian mixture distribution, and acquiring a current likelihood function according to the initial Gaussian mixture distribution.
It should be noted that the corresponding relationship between the gaussian parameter data, the number of gaussian mixture components, and the gaussian mixture distribution is a probability density function of a gaussian mixture model, the initial gaussian mixture distribution is the gaussian mixture model with only the parameter data determined, and the current likelihood function is a likelihood function of the current gaussian mixture model.
In specific implementation, an initial Gaussian mixture model is established according to the obtained target mean data, target standard deviation data and target weight data, and a corresponding likelihood function is obtained according to the initial Gaussian mixture model.
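As a sketch of how a likelihood value can be obtained from a mixture with known parameters, the function below evaluates the log-likelihood ln L of scalar samples under a one-dimensional Gaussian mixture. The name `gmm_log_likelihood` and the example parameters are hypothetical.

```python
import math


def gmm_log_likelihood(samples, weights, means, stds):
    """Sum over samples of the log of the mixture density (1-D sketch)."""
    total = 0.0
    for y in samples:
        # Mixture density: weighted sum of the component densities.
        density = sum(
            a * math.exp(-((y - m) ** 2) / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s)
            for a, m, s in zip(weights, means, stds)
        )
        total += math.log(density)
    return total


points = [-2.0, -2.0, 3.0, 3.0]
ll_one = gmm_log_likelihood(points, [1.0], [0.5], [1.0])
ll_two = gmm_log_likelihood(points, [0.5, 0.5], [-2.0, 3.0], [1.0, 1.0])
```

For these points the two-component mixture centered on the two groups gives a higher log-likelihood than a single Gaussian placed between them.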
Step S302: acquire the correspondence among the likelihood function, the number of Gaussian mixture components and the evaluation data, and obtain the correspondence between the evaluation data and the number of Gaussian mixture components according to the current likelihood function and that correspondence.
It can be understood that the correspondence among the likelihood function, the number of Gaussian mixture components and the evaluation data is the AIC calculation relation:
AIC = 2k - 2 ln L
In the formula, k is the number of sub-Gaussian models and L is the likelihood function. In this embodiment, the number k of sub-Gaussian models with the smallest AIC value is the optimal number of sub-Gaussian models. The AIC value depends on the likelihood function and the number of sub-Gaussian models, and the likelihood function itself depends on the number of sub-Gaussian models, so the correspondence between AIC and the number of sub-Gaussian models can be obtained.
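The selection rule can be sketched as follows: compute AIC = 2k - 2 ln L for every candidate k and keep the k with the smallest value. The function names and the log-likelihood values in the example are hypothetical; in practice each ln L would come from a mixture fitted with that number of sub-Gaussian models.

```python
def aic(n_components, log_likelihood):
    """AIC = 2k - 2 ln L, with k taken as the number of sub-Gaussian models."""
    return 2 * n_components - 2 * log_likelihood


def select_component_count(log_likelihood_by_k):
    """Return the k with the smallest AIC and the AIC value of every candidate."""
    scores = {k: aic(k, ll) for k, ll in log_likelihood_by_k.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Hypothetical log-likelihoods: the fit improves quickly up to k = 2, then flattens.
best_k, scores = select_component_count({1: -120.0, 2: -95.0, 3: -94.5, 4: -94.4})
```

Here the gain in ln L beyond k = 2 is smaller than the penalty of 2 per additional component, so the smallest AIC is reached at k = 2.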
Step S303: when the current evaluation data meets the preset data condition, determine the target number of Gaussian mixture components according to the current evaluation data and the correspondence between the evaluation data and the number of Gaussian mixture components.
It should be understood that the preset data condition is that the evaluation data is as small as possible, with k as small as possible and ln L as large as possible.
In a specific implementation, increasing the model complexity increases the likelihood function L and thus lowers the AIC. When k becomes too large, however, the likelihood function grows more slowly and the AIC rises; the model is then too complex and prone to overfitting. The model with the smallest AIC is therefore selected: AIC rewards goodness of fit (maximum likelihood) while introducing a penalty term that keeps the number of model parameters as small as possible, reducing the possibility of overfitting. As shown in fig. 11, the optimal number of sub-Gaussian models is the value at the inflection point where the AIC reaches its minimum; in this example, using 10 sub-Gaussian models comes close to the optimal solution.
In this embodiment, the correspondence between the AIC value and the number k of sub-Gaussian models is evaluated to find the optimal number of sub-Gaussian models, i.e. the one that keeps both the AIC value and k as small as possible. Building the Gaussian mixture model with this optimal number improves the fit of the mixture model while reducing the possibility of overfitting, thereby improving the accuracy of the new samples.
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores a sample generation program based on a wind control model, and the sample generation program based on the wind control model, when executed by a processor, implements the steps of the sample generation method based on the wind control model as described above.
Referring to fig. 12, fig. 12 is a block diagram illustrating a first embodiment of a sample generation apparatus based on a wind control model according to the present invention.
As shown in fig. 12, a sample generation apparatus based on a wind control model according to an embodiment of the present invention includes:
The obtaining module 10 is configured to obtain current expected data according to current sample data.
The obtaining module 10 is further configured to obtain current Gaussian parameter data according to the current expected data, and determine target Gaussian parameter data according to the current Gaussian parameter data.
The obtaining module 10 is further configured to acquire the correspondence between the evaluation data and the number of Gaussian mixture components, and determine the target number of Gaussian mixture components according to the correspondence.
The generating module 20 is configured to determine a target Gaussian mixture distribution of the current sample data according to the target Gaussian parameter data and the target number of Gaussian mixture components, and generate target sample data according to the target Gaussian mixture distribution.
The training module 30 is configured to perform wind control model training according to the current sample data and the target sample data, and perform user detection with the trained wind control model.
In this embodiment, the current expected data and the current Gaussian parameter data are obtained from the current sample data, from which the target Gaussian parameter data are determined. The optimal number of sub-Gaussian models is determined from the correspondence between the evaluation data and the number of Gaussian mixture components. A Gaussian mixture model that follows the current sample data is then obtained from the target Gaussian parameter data and the optimal number of sub-Gaussian models, and the target sample data are generated from it. Wind control model training is performed according to the current sample data and the target sample data, and user detection is performed with the trained wind control model. The newly generated samples follow the distribution of the current samples without coinciding with the current sample points. Because the new samples are combined with the historical samples, the trained wind control model generalizes better, the survivorship bias problem is addressed, and the detection performance of the wind control model is improved.
In an embodiment, the obtaining module 10 is further configured to obtain initial mean data and initial standard deviation data according to current sample data;
obtaining initial Gaussian parameter data according to the initial mean data and the initial standard deviation data;
acquiring a corresponding relation between expected data and initial Gaussian parameter data;
and obtaining the current expected data according to the initial Gaussian parameter data and the corresponding relation between the expected data and the initial Gaussian parameter data.
In an embodiment, the obtaining module 10 is further configured to obtain a corresponding relationship between current expected data and gaussian parameter data;
obtaining current Gaussian parameter data according to the current expected data and the corresponding relation between the current expected data and the Gaussian parameter data;
and when the current Gaussian parameter data meet a preset convergence condition, determining target Gaussian parameter data according to the current Gaussian parameter data, wherein the target Gaussian parameter data comprise target mean data, target standard deviation data and target weight data.
In an embodiment, the obtaining module 10 is further configured to update the initial gaussian parameter data according to the current gaussian parameter data when the current gaussian parameter data does not satisfy a preset convergence condition;
and returning to execute the step of obtaining the current expected data according to the initial Gaussian parameter data and the relation between the expected data and the initial Gaussian parameter data.
In an embodiment, the obtaining module 10 is further configured to obtain a current likelihood function;
acquiring a corresponding relation among a likelihood function, the Gaussian mixture component quantity and evaluation data;
obtaining the corresponding relation between the evaluation data and the Gaussian mixture component quantity according to the current likelihood function and the corresponding relation between the likelihood function and the Gaussian mixture component quantity and the evaluation data;
and when the current evaluation data meet the preset data condition, determining the target Gaussian mixture component quantity according to the current evaluation data and the corresponding relation between the evaluation data and the Gaussian mixture component quantity.
In an embodiment, the obtaining module 10 is further configured to obtain a corresponding relationship between Gaussian parameter data, a Gaussian mixture component number and a Gaussian mixture distribution;
determining initial Gaussian mixture distribution according to the target Gaussian parameter data and the corresponding relation between the Gaussian parameter data, the Gaussian mixture component quantity and the Gaussian mixture distribution;
and obtaining the current likelihood function according to the initial Gaussian mixture distribution.
In an embodiment, the generating module 20 is further configured to determine a target Gaussian mixture distribution of current sample data according to the target Gaussian parameter data, the target Gaussian mixture component number and a corresponding relationship between the Gaussian parameter data, the Gaussian mixture component number and the Gaussian mixture distribution;
acquiring the number of preset sample data;
and generating target sample data according to the target Gaussian mixed distribution and the preset sample data quantity.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited thereto.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to a sample generation method based on a wind control model provided in any embodiment of the present invention, and are not described herein again.
Further, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages and disadvantages of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.

Claims (10)

1. A sample generation method based on a wind control model is characterized by comprising the following steps:
obtaining current expected data according to current sample data;
obtaining current Gaussian parameter data according to the current expected data, and determining target Gaussian parameter data according to the current Gaussian parameter data;
acquiring a corresponding relation between evaluation data and the Gaussian mixture component quantity, and determining the target Gaussian mixture component quantity according to the corresponding relation;
determining target Gaussian mixture distribution of the current sample data according to the target Gaussian parameter data and the target Gaussian mixture component quantity, and generating target sample data according to the target Gaussian mixture distribution;
and carrying out wind control model training according to the current sample data and the target sample data, and carrying out user detection according to the wind control model obtained by training.
2. The method of claim 1, wherein said obtaining current expected data according to current sample data comprises:
obtaining initial mean data and initial standard deviation data according to current sample data;
obtaining initial Gaussian parameter data according to the initial mean data and the initial standard deviation data;
acquiring a corresponding relation between expected data and initial Gaussian parameter data;
and obtaining the current expected data according to the initial Gaussian parameter data and the corresponding relation between the expected data and the initial Gaussian parameter data.
3. The method of claim 2, wherein said obtaining current Gaussian parameter data according to the current expected data, and determining target Gaussian parameter data according to the current Gaussian parameter data, comprises:
acquiring a corresponding relation between current expected data and Gaussian parameter data;
obtaining current Gaussian parameter data according to the current expected data and the corresponding relation between the current expected data and the Gaussian parameter data;
and when the current Gaussian parameter data meet a preset convergence condition, determining target Gaussian parameter data according to the current Gaussian parameter data, wherein the target Gaussian parameter data comprise target mean data, target standard deviation data and target weight data.
4. The method of claim 3, wherein after obtaining the current Gaussian parameter data according to the current expected data and the corresponding relationship between the current expected data and the Gaussian parameter data, the method further comprises:
when the current Gaussian parameter data do not meet the preset convergence condition, updating the initial Gaussian parameter data according to the current Gaussian parameter data;
and returning to execute the step of obtaining the current expected data according to the initial Gaussian parameter data and the relation between the expected data and the initial Gaussian parameter data.
5. The method of claim 1, wherein said acquiring a corresponding relation between evaluation data and the Gaussian mixture component quantity, and determining the target Gaussian mixture component quantity according to the corresponding relation, comprises:
obtaining a current likelihood function;
acquiring a corresponding relation among a likelihood function, the Gaussian mixture component quantity and evaluation data;
obtaining the corresponding relation between the evaluation data and the Gaussian mixture component quantity according to the current likelihood function and the corresponding relation between the likelihood function and the Gaussian mixture component quantity and the evaluation data;
and when the current evaluation data meet the preset data condition, determining the target Gaussian mixture component quantity according to the current evaluation data and the corresponding relation between the evaluation data and the Gaussian mixture component quantity.
6. The method of claim 5, wherein said obtaining a current likelihood function comprises:
acquiring corresponding relation among Gaussian parameter data, the number of Gaussian mixture components and Gaussian mixture distribution;
determining initial Gaussian mixture distribution according to the target Gaussian parameter data and the corresponding relation between the Gaussian parameter data, the Gaussian mixture component quantity and the Gaussian mixture distribution;
and obtaining the current likelihood function according to the initial Gaussian mixture distribution.
7. The method according to any one of claims 1 to 6, wherein the determining a target Gaussian mixture distribution of current sample data according to the target Gaussian parameter data and a target Gaussian mixture component number, and generating target sample data according to the target Gaussian mixture distribution comprises:
determining target Gaussian mixture distribution of current sample data according to the target Gaussian parameter data, the target Gaussian mixture component quantity and the corresponding relation between the Gaussian parameter data, the Gaussian mixture component quantity and the Gaussian mixture distribution;
acquiring the quantity of preset sample data;
and generating target sample data according to the target Gaussian mixture distribution and the number of preset sample data.
8. A sample generation device based on a wind control model is characterized by comprising:
the acquisition module is used for acquiring current expected data according to current sample data;
the acquisition module is further used for obtaining current Gaussian parameter data according to the current expected data and determining target Gaussian parameter data according to the current Gaussian parameter data;
the acquisition module is also used for acquiring the corresponding relation between the evaluation data and the amount of the Gaussian mixture components, and determining the amount of the target Gaussian mixture components according to the corresponding relation;
the generating module is used for determining the target Gaussian mixture distribution of the current sample data according to the target Gaussian parameter data and the target Gaussian mixture component amount, and generating the target sample data according to the target Gaussian mixture distribution;
and the training module is used for carrying out wind control model training according to the current sample data and the target sample data and carrying out user detection according to the wind control model obtained by training.
9. A sample generation device based on a wind control model, the device comprising: a memory, a processor and a wind control model based sample generation program stored on the memory and executable on the processor, the wind control model based sample generation program being configured to implement the steps of the wind control model based sample generation method according to any one of claims 1 to 7.
10. A storage medium, wherein a sample generation program based on a wind control model is stored on the storage medium, and when executed by a processor, the sample generation program based on the wind control model realizes the steps of the sample generation method based on the wind control model according to any one of claims 1 to 7.
CN202210886072.5A 2022-07-26 2022-07-26 Sample generation method, device and equipment based on wind control model and storage medium Pending CN115310524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210886072.5A CN115310524A (en) 2022-07-26 2022-07-26 Sample generation method, device and equipment based on wind control model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210886072.5A CN115310524A (en) 2022-07-26 2022-07-26 Sample generation method, device and equipment based on wind control model and storage medium

Publications (1)

Publication Number Publication Date
CN115310524A true CN115310524A (en) 2022-11-08

Family

ID=83859130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210886072.5A Pending CN115310524A (en) 2022-07-26 2022-07-26 Sample generation method, device and equipment based on wind control model and storage medium

Country Status (1)

Country Link
CN (1) CN115310524A (en)

Similar Documents

Publication Publication Date Title
JP7266674B2 (en) Image classification model training method, image processing method and apparatus
CN111444951B (en) Sample recognition model generation method, device, computer equipment and storage medium
CN112785005B (en) Multi-objective task assistant decision-making method and device, computer equipment and medium
CN111079780A (en) Training method of space map convolution network, electronic device and storage medium
CN113128671B (en) Service demand dynamic prediction method and system based on multi-mode machine learning
US20230140696A1 (en) Method and system for optimizing parameter intervals of manufacturing processes based on prediction intervals
CN116596095B (en) Training method and device of carbon emission prediction model based on machine learning
CN110795736B (en) Malicious android software detection method based on SVM decision tree
CN114781532A (en) Evaluation method and device of machine learning model, computer equipment and medium
CN111639688B (en) Local interpretation method of Internet of things intelligent model based on linear kernel SVM
Huang et al. Unsupervised nonlinear feature selection from high-dimensional signed networks
Tembine Mean field stochastic games: Convergence, Q/H-learning and optimality
Lim et al. More powerful selective kernel tests for feature selection
CN112950347A (en) Resource data processing optimization method and device, storage medium and terminal
CN116227939A (en) Enterprise credit rating method and device based on graph convolution neural network and EM algorithm
CN111539444A (en) Gaussian mixture model method for modified mode recognition and statistical modeling
CN110837853A (en) Rapid classification model construction method
CN113393023B (en) Mold quality evaluation method, apparatus, device and storage medium
CN115310524A (en) Sample generation method, device and equipment based on wind control model and storage medium
Begum et al. Software Defects Identification: Results Using Machine Learning and Explainable Artificial Intelligence Techniques
CN113239034A (en) Big data resource integration method and system based on artificial intelligence and cloud platform
US7634450B2 (en) System and method for determining difficulty measures for training cases used in developing a solution to a problem
Ivanytska et al. Study of Methods of Complex Data Analysis that Based on Machine Learning Technologies
US11609936B2 (en) Graph data processing method, device, and computer program product
Gluhovsky Multinomial least angle regression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination