CN111563548B - Data preprocessing method, system and related equipment based on reinforcement learning - Google Patents

Data preprocessing method, system and related equipment based on reinforcement learning

Info

Publication number
CN111563548B
CN111563548B
Authority
CN
China
Prior art keywords
classifier
model
training
original
adopting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010363808.1A
Other languages
Chinese (zh)
Other versions
CN111563548A (en)
Inventor
张伟哲
张宾
周颖
束建钢
黄兴森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202010363808.1A priority Critical patent/CN111563548B/en
Publication of CN111563548A publication Critical patent/CN111563548A/en
Application granted granted Critical
Publication of CN111563548B publication Critical patent/CN111563548B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The embodiment of the invention provides a data preprocessing method, system and related equipment based on reinforcement learning, which realize feedback adjustment in the oversampling of original samples based on a reinforcement learning mechanism and improve the rationality of the oversampling of data samples. The method of the embodiment of the invention comprises the following steps: training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model; optimizing the variational autoencoder model based on a reinforcement learning mechanism; and randomly generating new samples with the optimized variational autoencoder model.

Description

Data preprocessing method, system and related equipment based on reinforcement learning
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data preprocessing method, system and related device based on reinforcement learning.
Background
The imbalance problem of data samples refers to the uneven distribution of data of different categories in a data set. For example, in financial risk control the overdue probability is low, so overdue data are far fewer than normal data; as a result, the data mining model loses sensitivity to overdue-risk users and its results become invalid.
A common solution to the imbalance problem of data samples is to oversample the data samples from the data perspective and generate new samples. Existing oversampling algorithms, however, lack a feedback loop, so the rationality of the generated new samples needs to be improved.
In view of this, there is a need for a new data preprocessing method.
Disclosure of Invention
The embodiment of the invention provides a data preprocessing method, system and related equipment based on reinforcement learning, which realize feedback adjustment in the oversampling of original samples based on a reinforcement learning mechanism and improve the rationality of the oversampling of data samples.
A first aspect of the embodiment of the invention provides a data preprocessing method based on reinforcement learning, which may comprise the following steps:
training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
optimizing the variational autoencoder model based on a reinforcement learning mechanism;
and randomly generating new samples with the optimized variational autoencoder model.
Optionally, as a possible implementation manner, in the reinforcement learning-based data preprocessing method of the embodiment of the present invention, optimizing the variational autoencoder model based on the reinforcement learning mechanism may include:
Training a preset classifier model by adopting an original sample in an original training set to obtain a classifier model;
executing a preset number of iterative computations, wherein one iterative computation in the iterative computations comprises:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model with a preset evaluator and the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
Optionally, as a possible implementation manner, in the reinforcement learning-based data preprocessing method of the embodiment of the present invention, training the classifier model with the new samples may include:
if the classifier model is a back-propagation classifier, training the classifier model directly with the new samples; if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set, and training the preset classifier model with the expanded original training set to obtain a new classifier model.
Optionally, as a possible implementation manner, in an embodiment of the present invention, the method may further include: and training the evaluator according to the difference between the environmental reward variable and the estimated reward so as to minimize the difference between the environmental reward variable and the estimated reward.
Optionally, as a possible implementation manner, in an embodiment of the present invention, after the classifier model is trained with the new samples, the method may further include:
if the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
Optionally, as a possible implementation manner, the data preprocessing method based on reinforcement learning in the embodiment of the present invention further includes:
and selecting a new sample of a preset type according to the classification result of the classifier on the new sample, and storing the new sample in the original training set.
A second aspect of an embodiment of the present invention provides a reinforcement learning-based data preprocessing system, which may include:
the training unit is used for training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
an optimization unit for optimizing the variational autoencoder model based on a reinforcement learning mechanism;
and an output unit for randomly generating new samples with the optimized variational autoencoder model.
Optionally, as a possible implementation manner, the optimizing unit in the embodiment of the present invention may include:
the training module is used for training a preset classifier model by adopting an original sample in an original training set to obtain the classifier model;
the processing module is used for executing iterative computation of a preset number, and one iterative computation in the iterative computation comprises the following steps:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model with a preset evaluator and the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
Optionally, as a possible implementation manner, the reinforcement learning-based data preprocessing system in the embodiment of the present invention may further include: a training module for training the evaluator according to the difference between the environment reward variable and the estimated reward so as to minimize the difference between the environment reward variable and the estimated reward.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may include:
the processing submodule is used for training the classifier model directly with the new samples if the classifier model is a back-propagation classifier, and, if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set and training the preset classifier model with the expanded original training set to obtain a new classifier model.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may further include:
the adjusting submodule is used for updating the parameters of the original classifier according to a preset proportion if the classifier model is a back-propagation classifier, and retaining the original classifier according to a preset probability if the classifier model is a non-back-propagation classifier.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may further include:
and the storage sub-module is used for selecting a new sample of a preset type to store in the original training set according to the classification result of the classifier on the new sample.
A third aspect of the embodiments of the present invention provides a computer apparatus comprising a processor for implementing the steps as in any one of the possible implementations of the first aspect and the first aspect when executing a computer program stored in a memory.
A fourth aspect of the embodiments of the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs steps as in any one of the possible implementations of the first aspect and the first aspect.
From the above technical solutions, the embodiment of the present invention has the following advantages:
in the embodiment of the invention, a preset variational autoencoder model is trained with the original samples in an original training set to obtain a trained variational autoencoder model, the variational autoencoder model is then optimized based on a reinforcement learning mechanism, and finally new samples are randomly generated with the optimized variational autoencoder model. Compared with the prior art, feedback adjustment in the oversampling of the original samples is realized based on the reinforcement learning mechanism, and the rationality of the oversampling of the data samples is improved.
Drawings
FIG. 1 is a diagram illustrating an embodiment of a reinforcement learning-based data preprocessing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a reinforcement learning-based data preprocessing method according to an embodiment of the present invention;
FIG. 3 is a schematic architecture diagram of a specific embodiment of a reinforcement learning-based data preprocessing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of a reinforcement learning-based data preprocessing system according to an embodiment of the present invention;
FIG. 5 is a diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a data preprocessing method, system and related equipment based on reinforcement learning, which realize feedback adjustment in the oversampling of original samples based on a reinforcement learning mechanism and improve the rationality of the oversampling of data samples.
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without making any inventive effort shall fall within the scope of the present invention.
The terms first, second, third, fourth and the like in the description and in the claims and in the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Existing oversampling algorithms decouple data generation from the final task: they correct the skewed data set through different oversampling means but do not consider the influence of the oversampling on the downstream task, so the new samples generated by different oversampling algorithms improve different tasks inconsistently. To solve the problems of the existing methods, the invention provides a data preprocessing method based on reinforcement learning that improves the rationality of the generated new samples.
For ease of understanding, a specific flow in the embodiment of the present invention is described below, referring to fig. 1, and an embodiment of a data preprocessing method based on reinforcement learning in the embodiment of the present invention may include:
101. training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
when oversampling of the original training set is needed to generate new samples, the reinforcement learning-based data preprocessing system may train a preset variational autoencoder model with the original samples in the original training set to obtain the trained variational autoencoder model.
A variational autoencoder (VAE) is a type of generative model used to generate data similar to the original data. The variational autoencoder learns the distribution information of the original samples in the original training set, and its decoder can sample from a preset distribution to generate new sample data.
102. Optimizing the variational autoencoder model based on a reinforcement learning mechanism;
in practical applications, data preprocessing and the downstream task influence each other; splitting them apart means that reasonable data cannot be generated in a targeted manner according to the requirements of the task. In view of this, the applicant noted that a feedback correction mechanism may be provided, for example one that measures and corrects new samples based on a reinforcement learning mechanism.
The basic principle of reinforcement learning in machine learning is as follows: if a certain action policy of an actor results in a positive reward from the environment, the tendency of the actor to produce this action policy later is strengthened. Reinforcement learning treats learning as a trial-and-evaluation process: the actor selects an action for the environment, the state of the environment changes after it receives the action, a reinforcement signal (reward or punishment) is generated and fed back to the actor, and the actor selects the next action according to the reinforcement signal and the current state of the environment, the selection principle being to increase the probability of receiving a positive reinforcement.
In practical applications, the behavior of the variational autoencoder model outputting new samples can be used as the action in the reinforcement learning mechanism, and a reasonable environment, state variable and reward can be set according to the task demands of users, so that the optimization of the variational autoencoder model can be realized; the specific implementation of the reinforcement learning mechanism is not limited here.
103. Randomly generating new samples with the optimized variational autoencoder model.
Based on the variational autoencoder model optimized by the reinforcement learning mechanism, new samples can be randomly generated, and new samples related to the task demands of users can be further screened and output, so that feedback adjustment in the oversampling of the original samples is realized and the rationality of the oversampling of the data samples is improved.
In the embodiment of the invention, a preset variational autoencoder model is trained with the original samples in an original training set to obtain a trained variational autoencoder model, the variational autoencoder model is then optimized based on a reinforcement learning mechanism, and finally new samples are randomly generated with the optimized variational autoencoder model. Compared with the prior art, feedback adjustment in the oversampling of the original samples is realized based on the reinforcement learning mechanism, and the rationality of the oversampling of the data samples is improved.
For ease of understanding, the implementation of the reinforcement learning mechanism in embodiments of the present invention will be described in detail below. Referring to fig. 2, another embodiment of a reinforcement learning-based data preprocessing method according to an embodiment of the present invention may include:
201. training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
the variational autoencoder used in the data generation step is divided into an encoder and a decoder, so the original data distribution can be modeled directly, and no distribution assumption on the original data is needed during modeling, which gives the method a wider application range.
When oversampling of the original training set is needed to generate new samples, the reinforcement learning-based data preprocessing system may model the distribution of the original samples in the original training set with a preset variational autoencoder and provide a random oversampling function by constraining the distribution form of the hidden-layer space variables. The loss function is shown in formula (1):
L = ‖X_o − X_f‖² + λKL(P(z|c, X) ‖ N(0, I))   (1)
KL(P‖Q) = E_{x∼P}[log P(x) − log Q(x)]   (2)
where KL denotes the Kullback-Leibler divergence, which measures the difference between two probability distributions and is defined in formula (2); X_o denotes the raw data, X_f denotes the data reconstructed by the variational autoencoder, P(z|c, X) denotes the true hidden-variable distribution of the data set under the encoder mapping, z denotes the hidden variable, and N(0, I) denotes the multi-dimensional standard normal distribution. KL(P‖Q) denotes the KL divergence between the distributions P(x) and Q(x), and E_{x∼P} denotes the expectation over x following the distribution P(x).
After obtaining the original samples in the original training set, the original samples may be used to train the preset variational autoencoder model by minimizing the loss function defined in formula (1), yielding the trained variational autoencoder model.
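For illustration only, the following is a minimal sketch of a variational autoencoder trained with a loss of the form in formula (1). It assumes PyTorch, fully connected layers, a mean-squared reconstruction error and illustrative layer sizes, and it omits the conditional input c of P(z|c, X); it is not the patented implementation.

```python
# Minimal sketch (assumptions: PyTorch, fully connected layers, illustrative sizes).
# The training objective mirrors formula (1): reconstruction error plus a KL term
# pulling the hidden-layer distribution towards N(0, I).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim, z_dim=8, h_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mean of the hidden-variable distribution
        self.logvar = nn.Linear(h_dim, z_dim)    # log-variance of the hidden-variable distribution
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.dec(z), mu, logvar

def vae_loss(x_o, x_f, mu, logvar, lam=1.0):
    recon = F.mse_loss(x_f, x_o, reduction="sum")                   # ‖X_o − X_f‖²
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # KL(P(z|X) ‖ N(0, I))
    return recon + lam * kl
```

Under these assumptions, step 201 amounts to minimising vae_loss over mini-batches of the original samples with an ordinary optimiser such as Adam.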
202. Training a preset classifier model with the original samples in the original training set to obtain a classifier model;
In the embodiment of the invention, the behavior of the variational autoencoder model outputting new samples can be used as the action in the reinforcement learning mechanism, and a reasonable classifier can be set as the environment according to the task requirements of the user. For example, a Bayesian-framework classifier can be used for discrete data, and a multi-layer neural network or a support vector machine can be used for continuous data; the classifier model can be set reasonably according to the user's requirements and can also be replaced by models for other supervised learning tasks, such as linear regression, which is not limited here.
203. Randomly generating new samples with the variational autoencoder model;
after the variational autoencoder model is trained, a batch of hidden-layer variables can be randomly sampled from the normal distribution of the hidden-layer space and mapped to new samples in the sample space by the decoder of the variational autoencoder, as sketched below.
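A short sketch of this sampling step, continuing the assumptions of the previous listing (the function and batch size are illustrative):

```python
# Sketch of step 203: hidden-layer variables are drawn from the standard normal
# prior and mapped to new samples in sample space by the trained decoder.
import torch

def generate_new_samples(vae, n_samples=64, z_dim=8):
    with torch.no_grad():
        z = torch.randn(n_samples, z_dim)   # batch of hidden-layer variables z ~ N(0, I)
        return vae.dec(z)                   # decoder maps z to new samples X_p
```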
204. Training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, and calculating the environment reward variable from the classification index parameter;
after the new samples are randomly generated, the system may train the classifier model with the new samples and classify the original samples in the original training set with the trained new classifier model.
In the embodiment of the invention, the classification index r of the classifier can be used as the reward in the environment; the specific classification index r is determined by the selected classifier and is not limited here. The state in reinforcement learning is set as follows: if the adopted classifier is a neural network classifier capable of back-propagation, the weights of the neural network classifier are directly used as the state, as shown in formula (3), where w_t, b_t are the neural network weights and biases corresponding to the current iteration t; if the adopted classifier is a non-back-propagation classifier, the positive-class classification probability of the classifier on the original data set is taken as the state, as shown in formula (4), which calculates the probability that each training sample is classified as positive under the current classifier, where X_o denotes the training samples and θ_t denotes the classifier corresponding to iteration t.
S_t = (w_t, b_t)   (3)
S_t = P(X_o | θ_t)   (4)
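A short sketch of the two state definitions in formulas (3) and (4); the classifier interfaces assumed here (PyTorch parameters() for a back-propagation classifier, a scikit-learn-style predict_proba for a non-back-propagation classifier) are illustrative, not prescribed by the text.

```python
# Sketch of the state variable S_t: the flattened weights/biases of a
# back-propagation (neural network) classifier as in formula (3), or the
# positive-class probabilities P(X_o | theta_t) on the original training
# samples for a non-back-propagation classifier as in formula (4).
import torch

def get_state(classifier, X_original):
    if isinstance(classifier, torch.nn.Module):                            # formula (3)
        return torch.cat([p.detach().flatten() for p in classifier.parameters()])
    return classifier.predict_proba(X_original)[:, 1]                      # formula (4)
```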
Alternatively, as one possible implementation, the specific process of training the classifier model with the new samples may include:
if the classifier model is a back-propagation classifier, training the classifier model directly with the new samples; if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set and training the preset classifier model with the expanded original training set to obtain a new classifier model, as sketched below.
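A sketch of the two training branches, assuming a PyTorch neural network for the back-propagation case and a scikit-learn-style estimator (e.g. naive Bayes) for the non-back-propagation case. The loss, optimiser and the labels assigned to generated samples are illustrative assumptions, and the generated samples are assumed to be detached from the generator before classifier training.

```python
# Sketch of step 204: train directly on the new batch (back-propagation classifier)
# or refit on the expanded training set (non-back-propagation classifier).
import numpy as np
import torch
import torch.nn.functional as F

def update_classifier(classifier, X_new, y_new, X_orig, y_orig, lr=1e-3):
    if isinstance(classifier, torch.nn.Module):          # back-propagation classifier
        opt = torch.optim.SGD(classifier.parameters(), lr=lr)
        logits = classifier(torch.as_tensor(X_new, dtype=torch.float32))
        loss = F.cross_entropy(logits, torch.as_tensor(y_new, dtype=torch.long))
        opt.zero_grad()
        loss.backward()
        opt.step()
        return classifier
    # non-back-propagation classifier: refit on the expanded training set
    X_ext = np.vstack([X_orig, X_new])
    y_ext = np.concatenate([y_orig, y_new])
    return classifier.fit(X_ext, y_ext)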
205. Calculating an estimated reward of the decoder of the variational autoencoder model with a preset evaluator and the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward;
the reinforcement learning mechanism is provided with an evaluator (implemented as a multi-layer neural network) that can calculate an estimated reward for the decoder of the variational autoencoder model based on the state variable (see formula (5)) and optimize the decoder of the variational autoencoder model according to the estimated reward so as to maximize it.
r_p = critic(S_{t+1}, X_p)   (5)
According to the embodiment of the invention, the evaluator can be trained with its corresponding loss function so that new samples are evaluated more accurately. Each time the environment reward variable and the state variable corresponding to the new samples are obtained, the system can calculate the estimated reward of the evaluator according to formula (5) and fine-tune the decoder according to this estimated reward, so that the estimated reward of subsequently generated new samples is maximized. The specific parameter update of the decoder is shown in formula (6), where critic in formula (5) is the multi-layer neural network corresponding to the evaluator, S_{t+1} denotes the state corresponding to the new classifier, X_p denotes the current new samples, r is the true reward returned by the classifier, r_p is the estimated reward of the evaluator, and the updated decoder parameters include the decoder weights w_D and the decoder biases b_D.
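A sketch of an evaluator (critic) implemented as a multi-layer network, together with the decoder fine-tuning step. Because formula (6) is not reproduced above, the gradient-ascent update on the estimated reward below is an assumption, as is the way the generated batch is summarised (its mean) before entering the critic.

```python
# Sketch of step 205: r_p = critic(S_{t+1}, X_p) as in formula (5), followed by a
# gradient-ascent fine-tune of the decoder that increases the estimated reward.
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, state_dim, x_dim, h_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, 1))

    def forward(self, state, x_batch):
        # the generated batch X_p is summarised by its mean (illustrative assumption)
        return self.net(torch.cat([state, x_batch.mean(dim=0)], dim=-1))

def finetune_decoder(vae, critic, state, z_batch, lr=1e-4):
    opt = torch.optim.Adam(vae.dec.parameters(), lr=lr)   # only the decoder is optimised
    opt.zero_grad()
    x_p = vae.dec(z_batch)
    r_p = critic(state, x_p)          # estimated reward, formula (5)
    (-r_p.mean()).backward()          # ascend on r_p: maximise the estimated reward
    opt.step()
    return r_p.detach()
```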
206. Training the evaluator according to the difference between the environment reward variable and the estimated reward;
optionally, to further improve the accuracy of the evaluator, the embodiment of the present invention may also train the evaluator according to the difference between the environment reward variable and the estimated reward so as to minimize this difference, making subsequent evaluation of new samples more accurate. Each time the environment reward variable and the state variable corresponding to the new samples are obtained, the system can calculate the loss function of the evaluator according to formula (7); the specific parameter update of the evaluator is shown in formula (8), where the updated evaluator parameters w_critic, b_critic are the weights and biases of the neural network model in the evaluator, respectively.
loss_3 = ‖r − r_p‖²   (7)
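A sketch of training the evaluator with the squared error of formula (7); the choice of the Adam optimiser, and creating a fresh optimiser per call, are simplifications for illustration.

```python
# Sketch of step 206: minimise loss_3 = ||r - r_p||^2 so the evaluator's estimate
# tracks the true reward returned by the classifier.
import torch
import torch.nn.functional as F

def train_critic(critic, state, x_p, r_env, lr=1e-3):
    opt = torch.optim.Adam(critic.parameters(), lr=lr)
    r_p = critic(state, x_p.detach())                     # estimated reward
    loss = F.mse_loss(r_p, torch.as_tensor([r_env], dtype=torch.float32))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```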
It will be appreciated that the reinforcement learning process is continuous, and the rationality of data sample oversampling can be further improved through multiple rounds of reinforcement learning. Steps 203 to 206 can be taken as the iterative computation corresponding to one round of reinforcement learning; in practical applications, a preset number of iterative computations can be executed according to the needs of the user, and the specific number of iterations is not limited here. If step 206 is not needed, steps 203 to 205 can be taken as the iterative computation corresponding to one round of reinforcement learning.
207. If the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability;
during multiple rounds of reinforcement learning, the new samples randomly generated in a certain round may be unreasonable, and a new classifier updated based on such unreasonable new samples may have a negative effect on the final task, which is detrimental to the rationality of the oversampling of the data samples. To reduce this negative effect, optionally, as a possible implementation, the embodiment of the present invention may further selectively update the original classifier after each round of reinforcement learning; the specific update process may include: if the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
For example, when the classifier model is a back-propagation classifier, the weights of the original classifier may be updated with a preset proportion τ in the manner of formula (9), where w_{t+1} is the neural network weight of the classifier after the (t+1)-th iteration. When the classifier model is a non-back-propagation classifier, the original classifier can be retained according to a preset proportion λ: a random number P is generated, and if P > λ then θ_{t+1} = θ_t (the original classifier is kept), while if P ≤ λ the new classifier is adopted, where θ_{t+1} is the classifier after the (t+1)-th iteration.
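A sketch of this optional update. Since formula (9) is not reproduced above, the soft-update form w_{t+1} = (1 − τ)·w_t + τ·w_new is an assumed reading of "updating by a preset proportion τ"; the probabilistic-retention branch follows the rule described in the text, with all names illustrative.

```python
# Sketch of step 207: blend weights by the preset proportion tau (back-propagation
# classifier) or keep the original classifier with probability tied to lambda.
import random
import torch

def keep_or_update(old_clf, new_clf, tau=0.1, lam=0.5):
    if isinstance(new_clf, torch.nn.Module):              # back-propagation classifier
        with torch.no_grad():
            for p_old, p_new in zip(old_clf.parameters(), new_clf.parameters()):
                p_old.mul_(1.0 - tau).add_(tau * p_new)   # assumed form of formula (9)
        return old_clf
    # non-back-propagation classifier: keep the original with the preset probability
    return old_clf if random.random() > lam else new_clf
```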
It will be appreciated that the process of updating the original classifier in step 207 is an optional step and may be selectively performed according to the needs of the user.
208. Randomly generating new samples with the optimized variational autoencoder model.
Based on the reinforcement learning mechanism, the variational autoencoder model optimized through the preset number of iterative computations can output new samples related to the task requirements of the user, so that feedback adjustment of the oversampling of the original samples is realized and the rationality of the oversampling of the data samples is improved.
Optionally, the method may also retain the classifier model obtained at the end of training; when the task requires a classifier, the classifier trained during the reinforcement learning process can be used directly, saving system computing resources.
Optionally, the new samples generated in the iterative computation corresponding to each round of reinforcement learning may be selected and stored in the original training set according to the classification result of the classifier on the new samples. For example, if the user needs positive samples, the positive samples may be selected and stored according to the classification result of the classifier on the new samples, as sketched below.
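A sketch of this optional sample-retention step, assuming a scikit-learn-style predict interface and the positive (minority) class as the preset type; the function and label names are illustrative.

```python
# Sketch: keep only generated samples that the current classifier assigns to the
# preset class, and append them to the original training set.
import numpy as np

def keep_positive_samples(classifier, X_new, X_orig, y_orig, positive_label=1):
    pred = classifier.predict(X_new)                      # classification result on the new samples
    keep = X_new[pred == positive_label]                  # select the preset type, e.g. positive class
    X_ext = np.vstack([X_orig, keep])
    y_ext = np.concatenate([y_orig, np.full(len(keep), positive_label)])
    return X_ext, y_ext
```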
Compared with the traditional process of first expanding the data set and then training the classifier, the method combines the final performance of the classifier with the data generation, takes the improvement of classification performance as the goal of the data generation stage, and uses gradient search to purposefully generate specific data, so as to improve the performance of the downstream classifier while keeping the amount of generated data as small as possible. The invention thus improves the correlation between the generated data and the classification performance of the final classifier, and the rationality of the generated data.
For ease of understanding, the reinforcement learning-based data preprocessing method in the embodiment of the present invention will be described below in connection with specific application embodiments.
Referring to fig. 3, the specific steps of the reinforcement learning-based data generation framework are implemented as follows:
the structure of the whole framework is shown in fig. 3. The framework has three parts, each corresponding to a different role in reinforcement learning: the generator corresponds to the actor, and the new samples it generates serve as the actions taken by the actor; the classifier corresponds to the environment, and its classification index serves as the reward in the environment. Specifically:
Generator: the generator includes a variational autoencoder. The variational autoencoder is divided into an encoder and a decoder, so the original data distribution can be modeled directly, and no distribution assumption on the original data is needed during modeling, which gives the method a wider application range.
The original distribution of the data is modeled by the variational autoencoder, and a random oversampling function is provided by constraining the distribution form of the hidden-layer space variables. The loss function is shown in formula (1):
L = ‖X_o − X_f‖² + λKL(P(z|c, X) ‖ N(0, I))   (1)
KL(P‖Q) = E_{x∼P}[log P(x) − log Q(x)]   (2)
where KL denotes the Kullback-Leibler divergence, which measures the difference between two probability distributions and is defined in formula (2); X_o denotes the raw data, X_f denotes the data reconstructed by the variational autoencoder, P(z|c, X) denotes the true hidden-variable distribution of the data set under the encoder mapping, z denotes the hidden variable, and N(0, I) denotes the multi-dimensional standard normal distribution. KL(P‖Q) denotes the KL divergence between the distributions P(x) and Q(x), and E_{x∼P} denotes the expectation over x following the distribution P(x).
The training process of the variational autoencoder is divided into two steps: in the first step, only the original distribution of the data is modeled, and the loss function in formula (1) is optimized directly without considering the downstream classifier and classification criterion; in the second step, the generator is optimized according to the feedback information of the evaluator (critic), so that the generated samples can effectively improve the final classification target.
In this second step, since the loss function of the encoder is independent of the hidden-layer data distribution, only the decoder is optimized, and the evaluator estimates the reward as shown in formula (5).
r_p = critic(S_{t+1}, X_p)   (5)
where critic in formula (5) is the multi-layer neural network corresponding to the evaluator, S_{t+1} denotes the state corresponding to the new classifier, and X_p denotes the currently generated samples.
The decoder parameters are updated as shown in formula (6), using a gradient-ascent algorithm, where w_D are the decoder weights and b_D are the decoder biases.
Classifier: different classifiers can be selected according to the characteristics of different tasks; for example, a Bayesian-framework classifier can be used for discrete data, and a multi-layer neural network or a support vector machine can be adopted for continuous data. The framework can also be used for other supervised learning tasks, such as regression. One advantage of the invention is that different indexes of a variety of downstream tasks can be optimized. Taking the classification task as an example, classifiers are divided into those that can be optimized with back-propagation and those that cannot. Gradients can be accumulated continuously in back-propagation, so the original training set is used for training in the early stage of classifier training, while during the generator fine-tuning phase only the generated data are used to optimize the classifier, in order to evaluate the effect of this batch of generated data on the classifier. When training a classifier that cannot be optimized by back-propagation, its update states cannot be directly superimposed, so the original training set is used in the initial training stage, and in the generator fine-tuning stage the generated data are added to the original training set for expansion, a new classifier is trained on the expanded data set, and whether this batch of data is retained is decided according to the classification result of the new classifier; that is, the data set grows larger and larger.
Depending on the classifier, the state variable in reinforcement learning is set as follows: for a back-propagation classifier, the parameters of the neural network classifier are directly used as the state, as shown in formula (3), where w_t, b_t are the neural network weights and biases corresponding to the current iteration t; for a non-back-propagation classifier, the positive-class classification probability of the classifier on the original data set is taken as the state, as shown in formula (4), where X_o denotes the training samples and θ_t denotes the classifier corresponding to iteration t; formula (4) calculates the probability that each training sample is classified as positive under the current classifier.
S_t = (w_t, b_t)   (3)
S_t = P(X_o | θ_t)   (4)
Evaluator: the evaluator follows the reinforcement learning framework and measures the new samples according to the classification result of the new classifier; the measurement result is used as the estimated reward r_p (calculated by formula (5)) and participates in the optimization of the generator. This setting ensures the correlation between the new samples and the measurement result, and the framework also encapsulates the calculation details of the reward, so different task indexes can be optimized.
To obtain a more accurate evaluator, a multi-layer neural network may be used; its loss function is shown in formula (7), where r is the true reward returned by the classifier and r_p is the reward estimated by the critic. The parameter update is shown in formula (8), where w_critic, b_critic are the weights and biases corresponding to the evaluator, respectively.
loss_3 = ‖r − r_p‖²   (7)
In addition, if the classifier supports back-propagation, each round of reinforcement learning updates a certain proportion of the weights of the new classifier into the original classifier: as shown in formula (9), the parameters of the original classifier are updated with a preset proportion τ, where w_{t+1} is the neural network weight of the classifier after the (t+1)-th iteration. When the classifier model is a non-back-propagation classifier, the original classifier is updated according to a preset proportion λ: a random number P is generated, and if P > λ then θ_{t+1} = θ_t (the original classifier is kept), while if P ≤ λ the new classifier is adopted, where θ_{t+1} is the classifier after the (t+1)-th iteration.
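For reference, the following sketch composes the earlier listings into one training loop corresponding to the framework of fig. 3. It assumes the non-back-propagation classifier branch, binary labels with F1 as the classification index, a minority-class label for generated samples, and a critic built with state_dim equal to the number of original training samples; none of these choices is prescribed by the text.

```python
# End-to-end sketch (assumption-laden) combining VAE, update_classifier, Critic,
# finetune_decoder, train_critic, get_state and keep_or_update from the listings above.
import copy
import numpy as np
import torch
from sklearn.metrics import f1_score

def rl_oversampling(vae, classifier, critic, X_orig, y_orig,
                    n_iter=100, batch=64, z_dim=8, minority_label=1):
    for t in range(n_iter):
        # action: the generator produces a batch of new samples
        z = torch.randn(batch, z_dim)
        X_new = vae.dec(z)
        y_new = np.full(batch, minority_label)             # assumed label of generated samples
        # environment: refit the classifier on the expanded set, classify the original set
        new_clf = update_classifier(copy.deepcopy(classifier),
                                    X_new.detach().numpy(), y_new, X_orig, y_orig)
        state = torch.as_tensor(get_state(new_clf, X_orig), dtype=torch.float32)  # formula (4)
        r = f1_score(y_orig, new_clf.predict(X_orig))       # classification index as environment reward
        # actor update: fine-tune the decoder towards a higher estimated reward (formula (5))
        finetune_decoder(vae, critic, state, z)
        # evaluator update: pull the estimated reward towards the real reward (formula (7))
        train_critic(critic, state, X_new, r)
        # optional step 207: keep or replace the working classifier
        classifier = keep_or_update(classifier, new_clf)
    return vae, classifier
```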
Compared with the traditional oversampling process, the method combines the final performance of the classifier with the data generation, takes the improvement of classification performance as the goal of the data generation stage, and purposefully generates specific data by gradient search to improve the performance of the downstream classifier while reducing the amount of generated data as much as possible. The invention improves the correlation between the generated data and the classification performance of the final classifier. This embodiment uses a reinforcement learning method to fit the influence of new data on the classifier and feeds the prediction result back to the data generator. The classifier serves as the environment in reinforcement learning, so the type of classifier need not be restricted; the classification performance serves as the reward, so the type of classification index need not be restricted.
It will be appreciated that the above implementation of reinforcement learning is merely exemplary. In practical applications, the task type may be modified, the classifier may be replaced with other models in supervised learning as required, and the classification evaluation index used when computing the environment reward, such as precision, F1 value or AUC value, may be set according to the selected classifier type, which is not limited here.
Referring to fig. 4, the embodiment of the invention further provides a data preprocessing system based on reinforcement learning, which may include:
a training unit 401, configured to train a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
an optimization unit 402, configured to optimize the variational autoencoder model based on a reinforcement learning mechanism;
an output unit 403, configured to randomly generate new samples with the optimized variational autoencoder model.
Optionally, as a possible implementation manner, the optimizing unit in the embodiment of the present invention may include:
the training module is used for training a preset classifier model by adopting an original sample in an original training set to obtain the classifier model;
the processing module is used for executing iterative computation with preset quantity, and one iterative computation in the iterative computation comprises the following steps:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model according to the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
Optionally, as a possible implementation manner, the system in the embodiment of the present invention may include: and the training module is used for training the evaluator according to the difference between the environmental rewards variable and the estimated rewards so as to minimize the difference between the environmental rewards variable and the estimated rewards.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may include:
the processing submodule is used for training the classifier model directly with the new samples if the classifier model is a back-propagation classifier, and, if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set and training the preset classifier model with the expanded original training set to obtain a new classifier model.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may further include:
the adjusting submodule is used for updating the parameters of the original classifier according to a preset proportion if the classifier model is a back-propagation classifier, and retaining the original classifier according to a preset probability if the classifier model is a non-back-propagation classifier.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may further include:
and the storage sub-module is used for selecting a new sample of a preset type to store in the original training set according to the classification result of the classifier on the new sample.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The reinforcement learning-based data preprocessing system in the embodiment of the present invention is described above from the point of view of the modularized functional entity, and referring to fig. 5, the computer device in the embodiment of the present invention is described below from the point of view of hardware processing:
the computer device 1 may include a memory 11, a processor 12, and an input-output bus 13. The steps in the reinforcement learning-based data preprocessing method embodiment shown in fig. 1 described above, such as steps 101 to 103 shown in fig. 1, are implemented when the processor 12 executes a computer program. In the alternative, the processor may implement the functions of the modules or units in the above-described embodiments of the apparatus when executing the computer program.
In some embodiments of the present invention, the processor is specifically configured to implement the following steps:
training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
optimizing the variational autoencoder model based on a reinforcement learning mechanism;
and randomly generating new samples with the optimized variational autoencoder model.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
training a preset classifier model by adopting an original sample in an original training set to obtain a classifier model;
executing a preset number of iterative computations, wherein one iterative computation in the iterative computations comprises:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model according to the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps: and training an evaluator to minimize the difference between the environmental rewards variable and the predicted rewards based on the difference between the environmental rewards variable and the predicted rewards.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
if the classifier model is a back-propagation classifier, training the classifier model directly with the new samples; if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set, and training the preset classifier model with the expanded original training set to obtain a new classifier model.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
if the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
and selecting a new sample of a preset type according to the classification result of the classifier on the new sample, and storing the new sample in the original training set.
The memory 11 includes at least one type of readable storage medium including flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the computer device 1, such as a hard disk of the computer device 1. The memory 11 may also be an external storage device of the computer apparatus 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer apparatus 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the computer apparatus 1. The memory 11 may be used not only for storing application software installed in the computer apparatus 1 and various types of data, for example, codes of the computer program 01, but also for temporarily storing data that has been output or is to be output.
The processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip for executing program code or processing data stored in the memory 11, e.g. executing a computer program 01 or the like.
The input/output bus 13 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus, or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc.
Further, the computer apparatus may also comprise a wired or wireless network interface 14, and the network interface 14 may optionally comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the computer apparatus 1 and other electronic devices.
Optionally, the computer device 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the computer device 1 and for displaying a visual user interface.
Fig. 5 shows only a computer device 1 with components 11-14 and a computer program 01, it being understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the computer device 1, and may comprise fewer or more components than shown, or may combine certain components, or a different arrangement of components.
The present invention also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor, can implement the steps of:
training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
optimizing the variational autoencoder model based on a reinforcement learning mechanism;
and randomly generating new samples with the optimized variational autoencoder model.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
training a preset classifier model by adopting an original sample in an original training set to obtain a classifier model;
executing a preset number of iterative computations, wherein one iterative computation in the iterative computations comprises:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model according to the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps: and training an evaluator to minimize the difference between the environmental rewards variable and the predicted rewards based on the difference between the environmental rewards variable and the predicted rewards.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
if the classifier model is a back-propagation classifier, training the classifier model directly with the new samples; if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set, and training the preset classifier model with the expanded original training set to obtain a new classifier model.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
If the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
and selecting a new sample of a preset type according to the classification result of the classifier on the new sample, and storing the new sample in the original training set.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, randomAccess Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
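To tie the preceding steps together, the following sketch illustrates one decoder update consistent with the optimization described above, in which the decoder is adjusted to maximize the evaluator's estimated reward; the toy state construction, network sizes, and all names are assumptions, not the patented implementation.

# Illustrative only: one decoder update of the reinforcement-learning optimization.
# Latent codes are decoded into new samples, a toy state is derived from the batch,
# and the decoder is adjusted by gradient ascent on the evaluator's estimated reward.
import torch
import torch.nn as nn

latent_dim, feature_dim, state_dim = 8, 16, 16
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, feature_dim))
evaluator = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 1))
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def decoder_update(batch_size: int = 64) -> float:
    z = torch.randn(batch_size, latent_dim)         # random latent codes
    new_samples = decoder(z)                        # randomly generated new samples
    state = new_samples.mean(dim=0, keepdim=True)   # toy stand-in for the state variable
    estimated_reward = evaluator(state)             # evaluator's estimated reward
    loss = -estimated_reward.mean()                 # maximize the estimated reward
    decoder_optimizer.zero_grad()
    loss.backward()
    decoder_optimizer.step()
    return -loss.item()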

Claims (8)

1. A reinforcement learning-based data preprocessing method, comprising:
training a preset variational self-encoder model by adopting an original sample in an original training set to obtain a variational self-encoder model; the original sample is data related to overdue-risk users in the field of financial risk control;
optimizing the variational self-encoder model based on a reinforcement learning mechanism;
randomly generating a new sample according to the optimized variational self-encoder model;
wherein the optimizing the variational self-encoder model based on the reinforcement learning mechanism comprises:
training a preset classifier model by adopting an original sample in an original training set to obtain a classifier model;
executing a preset number of iterative computations, wherein one of the iterative computations comprises:
randomly generating a new sample by adopting the variational self-encoder model;
training the classifier model by adopting the new sample, classifying the original sample in the original training set by adopting the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
calculating an estimated reward of the decoder of the variational self-encoder model by adopting a preset evaluator and the state variable, and optimizing the decoder of the variational self-encoder model according to the estimated reward so as to maximize the estimated reward;
after training the classifier model with the new sample, the method further comprises:
if the classifier model is a back-propagation classifier, updating parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
2. The method as recited in claim 1, further comprising:
training the evaluator according to the difference between the environment reward variable and the estimated reward, so as to minimize the difference between the environment reward variable and the estimated reward.
3. The method of claim 1, wherein the training the classifier model with the new sample comprises:
if the classifier model is a back-propagation classifier, directly training the classifier model by adopting the new sample; if the classifier model is a non-back-propagation classifier, adding the new sample to the original training set, and training the preset classifier model by adopting the expanded original training set to obtain a new classifier model.
4. The method as recited in claim 1, further comprising:
selecting a new sample of a preset type according to the classification result of the classifier on the new sample, and storing the selected new sample in the original training set.
5. A reinforcement learning-based data preprocessing system, comprising:
a training unit, used for training a preset variational self-encoder model by adopting an original sample in an original training set to obtain a variational self-encoder model;
an optimization unit, used for optimizing the variational self-encoder model based on a reinforcement learning mechanism;
an output unit, used for randomly generating a new sample according to the optimized variational self-encoder model;
wherein the optimization unit comprises:
a training module, used for training a preset classifier model by adopting an original sample in an original training set to obtain the classifier model;
a processing module, used for executing a preset number of iterative computations, wherein one of the iterative computations comprises:
randomly generating a new sample by adopting the variational self-encoder model;
training the classifier model by adopting the new sample, classifying the original sample in the original training set by adopting the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
calculating an estimated reward of the decoder of the variational self-encoder model by adopting a preset evaluator and the state variable, and optimizing the decoder of the variational self-encoder model according to the estimated reward so as to maximize the estimated reward;
wherein the processing module comprises:
a processing submodule, used for directly training the classifier model by adopting the new sample if the classifier model is a back-propagation classifier, and, if the classifier model is a non-back-propagation classifier, adding the new sample to the original training set and training the preset classifier model by adopting the expanded original training set to obtain a new classifier model.
6. The system of claim 5, further comprising:
a training module, used for training the evaluator according to the difference between the environment reward variable and the estimated reward, so as to minimize the difference between the environment reward variable and the estimated reward.
7. A computer device comprising a processor for implementing the steps of the method according to any one of claims 1 to 4 when executing a computer program stored in a memory.
8. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
CN202010363808.1A 2020-04-30 2020-04-30 Data preprocessing method, system and related equipment based on reinforcement learning Active CN111563548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010363808.1A CN111563548B (en) 2020-04-30 2020-04-30 Data preprocessing method, system and related equipment based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111563548A CN111563548A (en) 2020-08-21
CN111563548B true CN111563548B (en) 2024-02-02

Family

ID=72074602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010363808.1A Active CN111563548B (en) 2020-04-30 2020-04-30 Data preprocessing method, system and related equipment based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111563548B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11593660B2 (en) * 2018-09-18 2023-02-28 Insilico Medicine Ip Limited Subset conditioning using variational autoencoder with a learnable tensor train induced prior

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274029A (en) * 2017-06-23 2017-10-20 Shenzhen Weitesi Technology Co., Ltd. Future prediction method for interactive media in dynamic scenes
CN108334497A (en) * 2018-02-06 2018-07-27 Beihang University Method and apparatus for automatically generating text
CN109886388A (en) * 2019-01-09 2019-06-14 Ping An Technology (Shenzhen) Co., Ltd. Training sample data expansion method and device based on a variational autoencoder
CN110046712A (en) * 2019-04-04 2019-07-23 Tianjin University of Science and Technology Decision search learning method based on latent space modeling of generative models
CN110728314A (en) * 2019-09-30 2020-01-24 Xi'an Jiaotong University Method for detecting active users of a large-scale scheduling-free system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Huaiyuan; Chen Qifan. Transient stability assessment method based on cost-sensitive stacked variational autoencoders. Proceedings of the CSEE, (07), full text. *

Also Published As

Publication number Publication date
CN111563548A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
US10460230B2 (en) Reducing computations in a neural network
US10460236B2 (en) Neural network learning device
CN110476172A (en) Neural framework for convolutional neural networks is searched for
CN108021983A (en) Neural framework search
CN111080397A (en) Credit evaluation method and device and electronic equipment
US11704570B2 (en) Learning device, learning system, and learning method
US10592777B2 (en) Systems and methods for slate optimization with recurrent neural networks
CN112884236B (en) Short-term load prediction method and system based on VDM decomposition and LSTM improvement
CN114072809A (en) Small and fast video processing network via neural architectural search
US11763151B2 (en) System and method for increasing efficiency of gradient descent while training machine-learning models
CN114817571A (en) Method, medium, and apparatus for predicting achievement quoted amount based on dynamic knowledge graph
CN110007371A (en) Wind speed forecasting method and device
Gullapalli Associative reinforcement learning of real-valued functions
CN111563548B (en) Data preprocessing method, system and related equipment based on reinforcement learning
CN112381591A (en) Sales prediction optimization method based on LSTM deep learning model
Ortega-Zamorano et al. FPGA implementation of neurocomputational models: comparison between standard back-propagation and C-Mantec constructive algorithm
WO2022222230A1 (en) Indicator prediction method and apparatus based on machine learning, and device and storage medium
CN110322055B (en) Method and system for improving grading stability of data risk model
CN113191527A (en) Prediction method and device for population prediction based on prediction model
CN111898626A (en) Model determination method and device and electronic equipment
US11928128B2 (en) Construction of a meta-database from autonomously scanned disparate and heterogeneous sources
US20230351169A1 (en) Real-time prediction of future events using integrated input relevancy
US11822564B1 (en) Graphical user interface enabling interactive visualizations using a meta-database constructed from autonomously scanned disparate and heterogeneous sources
US20230368013A1 (en) Accelerated model training from disparate and heterogeneous sources using a meta-database
US20230351491A1 (en) Accelerated model training for real-time prediction of future events

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant