CN111563548B - Data preprocessing method, system and related equipment based on reinforcement learning - Google Patents

Data preprocessing method, system and related equipment based on reinforcement learning

Info

Publication number
CN111563548B
CN111563548B
Authority
CN
China
Prior art keywords
classifier
model
training
original
adopting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010363808.1A
Other languages
Chinese (zh)
Other versions
CN111563548A (en)
Inventor
张伟哲
张宾
周颖
束建钢
黄兴森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202010363808.1A priority Critical patent/CN111563548B/en
Publication of CN111563548A publication Critical patent/CN111563548A/en
Application granted granted Critical
Publication of CN111563548B publication Critical patent/CN111563548B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The embodiment of the invention provides a data preprocessing method, system and related equipment based on reinforcement learning, which realize feedback adjustment in the oversampling of original samples based on a reinforcement learning mechanism and improve the rationality of the oversampling of data samples. The method of the embodiment of the invention comprises the following steps: training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model; optimizing the variational autoencoder model based on a reinforcement learning mechanism; and randomly generating new samples with the optimized variational autoencoder model.

Description

Data preprocessing method, system and related equipment based on reinforcement learning
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data preprocessing method, system and related device based on reinforcement learning.
Background
The imbalance problem of data samples refers to the uneven distribution of data of different categories in a data set. For example, in financial risk control the overdue probability is low, so overdue data are far fewer than normal data; as a result, the data mining model loses sensitivity to overdue-risk users and its results become invalid.
A common solution to the imbalance problem of data samples is to oversample the data samples from the data perspective and generate new samples. Existing oversampling algorithms, however, lack a feedback loop, so the rationality of the generated new samples needs to be improved.
In view of this, there is a need for a new data preprocessing method.
Disclosure of Invention
The embodiment of the invention provides a data preprocessing method, system and related equipment based on reinforcement learning, which realize feedback adjustment in the oversampling of original samples based on a reinforcement learning mechanism and improve the rationality of the oversampling of data samples.
A first aspect of the embodiment of the invention provides a data preprocessing method based on reinforcement learning, which may comprise the following steps:
training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
optimizing the variational autoencoder model based on a reinforcement learning mechanism;
and randomly generating new samples with the optimized variational autoencoder model.
Optionally, as a possible implementation manner, in the reinforcement learning-based data preprocessing method of the embodiment of the present invention, optimizing the variational autoencoder model based on the reinforcement learning mechanism may include:
Training a preset classifier model by adopting an original sample in an original training set to obtain a classifier model;
executing a preset number of iterative computations, wherein one iterative computation in the iterative computations comprises:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model with a preset evaluator and the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
Optionally, as a possible implementation manner, in the reinforcement learning-based data preprocessing method of the embodiment of the present invention, training the classifier model with the new samples may include:
if the classifier model is a back-propagation classifier, training the classifier model directly with the new samples; if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set, and training the preset classifier model with the expanded original training set to obtain a new classifier model.
Optionally, as a possible implementation manner, in an embodiment of the present invention, the method may further include: and training the evaluator according to the difference between the environmental reward variable and the estimated reward so as to minimize the difference between the environmental reward variable and the estimated reward.
Optionally, as a possible implementation manner, in an embodiment of the present invention, after the classifier model is trained with the new samples, the method may further include:
if the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
Optionally, as a possible implementation manner, the data preprocessing method based on reinforcement learning in the embodiment of the present invention further includes:
and selecting a new sample of a preset type according to the classification result of the classifier on the new sample, and storing the new sample in the original training set.
A second aspect of an embodiment of the present invention provides a reinforcement learning-based data preprocessing system, which may include:
the training unit is used for training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
an optimization unit for optimizing the variational autoencoder model based on a reinforcement learning mechanism;
and an output unit for randomly generating new samples with the optimized variational autoencoder model.
Optionally, as a possible implementation manner, the optimizing unit in the embodiment of the present invention may include:
the training module is used for training a preset classifier model by adopting an original sample in an original training set to obtain the classifier model;
the processing module is used for executing iterative computation of a preset number, and one iterative computation in the iterative computation comprises the following steps:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model with a preset evaluator and the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
Optionally, as a possible implementation manner, the reinforcement learning-based data preprocessing system in the embodiment of the present invention may further include: a training module for training the evaluator according to the difference between the environment reward variable and the estimated reward so as to minimize the difference between the environment reward variable and the estimated reward.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may include:
the processing submodule is used for training the classifier model directly with the new samples if the classifier model is a back-propagation classifier, and, if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set and training the preset classifier model with the expanded original training set to obtain a new classifier model.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may further include:
the adjusting submodule is used for updating the parameters of the original classifier according to a preset proportion if the classifier model is a back-propagation classifier, and retaining the original classifier according to a preset probability if the classifier model is a non-back-propagation classifier.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may further include:
and the storage sub-module is used for selecting a new sample of a preset type to store in the original training set according to the classification result of the classifier on the new sample.
A third aspect of the embodiments of the present invention provides a computer apparatus comprising a processor for implementing the steps as in any one of the possible implementations of the first aspect and the first aspect when executing a computer program stored in a memory.
A fourth aspect of the embodiments of the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs steps as in any one of the possible implementations of the first aspect and the first aspect.
From the above technical solutions, the embodiment of the present invention has the following advantages:
in the embodiment of the invention, a preset variational autoencoder model is trained with the original samples in an original training set to obtain a trained variational autoencoder model, the variational autoencoder model is then optimized based on a reinforcement learning mechanism, and finally new samples are randomly generated with the optimized variational autoencoder model. Compared with the prior art, feedback adjustment in the oversampling of the original samples is realized based on the reinforcement learning mechanism, and the rationality of the oversampling of the data samples is improved.
Drawings
FIG. 1 is a diagram illustrating an embodiment of a reinforcement learning-based data preprocessing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a reinforcement learning-based data preprocessing method according to an embodiment of the present invention;
FIG. 3 is a schematic architecture diagram of a specific embodiment of a reinforcement learning-based data preprocessing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of a reinforcement learning-based data preprocessing system according to an embodiment of the present invention;
FIG. 5 is a diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a data preprocessing method, system and related equipment based on reinforcement learning, which realize feedback adjustment in the oversampling of original samples based on a reinforcement learning mechanism and improve the rationality of the oversampling of data samples.
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without making any inventive effort shall fall within the scope of the present invention.
The terms first, second, third, fourth and the like in the description and in the claims and in the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Existing oversampling algorithms decouple data generation from the final task: they correct the skewed data set through different oversampling means but do not consider the influence of the oversampling on the downstream task, so the new samples generated by different oversampling algorithms improve different tasks inconsistently. To solve the problems of the existing methods, the invention provides a data preprocessing method based on reinforcement learning that improves the rationality of the generated new samples.
For ease of understanding, a specific flow in the embodiment of the present invention is described below, referring to fig. 1, and an embodiment of a data preprocessing method based on reinforcement learning in the embodiment of the present invention may include:
101. training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
when oversampling of the original training set is needed to generate new samples, the reinforcement learning-based data preprocessing system may train a preset variational autoencoder model with the original samples in the original training set to obtain the trained variational autoencoder model.
A variational autoencoder (VAE) is a type of generative model used to generate data similar to the original data. The variational autoencoder learns the distribution information of the original samples in the original training set, and its decoder can sample from a preset distribution to generate new sample data.
102. Optimizing the variational autoencoder model based on a reinforcement learning mechanism;
in practical applications, data preprocessing and the downstream task influence each other; splitting them apart means that reasonable data cannot be generated in a targeted manner according to the requirements of the task. In view of this, the applicant noted that a feedback correction mechanism may be provided, for example one that measures and corrects new samples based on a reinforcement learning mechanism.
The basic principle of reinforcement learning in machine learning is as follows: if a certain action policy of an actor results in a positive reward from the environment, the tendency of the actor to produce this action policy later is strengthened. Reinforcement learning treats learning as a trial-and-evaluation process: the actor selects an action for the environment, the state of the environment changes after it receives the action, a reinforcement signal (reward or punishment) is generated and fed back to the actor, and the actor selects the next action according to the reinforcement signal and the current state of the environment, the selection principle being to increase the probability of receiving a positive reinforcement.
In practical applications, the behavior of the variational autoencoder model outputting new samples can be used as the action in the reinforcement learning mechanism, and a reasonable environment, state variable and reward can be set according to the task demands of users, so that the optimization of the variational autoencoder model can be realized; the specific implementation of the reinforcement learning mechanism is not limited here.
103. Randomly generating new samples with the optimized variational autoencoder model.
Based on the variational autoencoder model optimized by the reinforcement learning mechanism, new samples can be randomly generated, and new samples related to the task demands of users can be further screened and output, so that feedback adjustment in the oversampling of the original samples is realized and the rationality of the oversampling of the data samples is improved.
In the embodiment of the invention, a preset variational autoencoder model is trained with the original samples in an original training set to obtain a trained variational autoencoder model, the variational autoencoder model is then optimized based on a reinforcement learning mechanism, and finally new samples are randomly generated with the optimized variational autoencoder model. Compared with the prior art, feedback adjustment in the oversampling of the original samples is realized based on the reinforcement learning mechanism, and the rationality of the oversampling of the data samples is improved.
For ease of understanding, the implementation of the reinforcement learning mechanism in embodiments of the present invention will be described in detail below. Referring to fig. 2, another embodiment of a reinforcement learning-based data preprocessing method according to an embodiment of the present invention may include:
201. training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
the variational autoencoder used in the data generation step is divided into an encoder and a decoder, so the original data distribution can be modeled directly, and no distribution assumption on the original data is needed during modeling, which gives the method a wider application range.
When oversampling of the original training set is needed to generate new samples, the reinforcement learning-based data preprocessing system may model the distribution of the original samples in the original training set with a preset variational autoencoder and provide a random oversampling function by constraining the distribution form of the hidden-layer space variables. The loss function is shown in formula (1):
L = ‖X_o − X_f‖² + λKL(P(z|c, X) ‖ N(0, I))   (1)
KL(P‖Q) = E_{x∼P}[log P(x) − log Q(x)]   (2)
where KL denotes the Kullback-Leibler divergence, which measures the difference between two probability distributions and is defined in formula (2); X_o denotes the raw data, X_f denotes the data reconstructed by the variational autoencoder, P(z|c, X) denotes the true hidden-variable distribution of the data set under the encoder mapping, z denotes the hidden variable, and N(0, I) denotes the multi-dimensional standard normal distribution. KL(P‖Q) denotes the KL divergence between the distributions P(x) and Q(x), and E_{x∼P} denotes the expectation over x following the distribution P(x).
After obtaining the original samples in the original training set, the original samples may be used to train the preset variational autoencoder model by minimizing the loss function defined in formula (1), yielding the trained variational autoencoder model.
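For illustration only, the following is a minimal sketch of a variational autoencoder trained with a loss of the form in formula (1). It assumes PyTorch, fully connected layers, a mean-squared reconstruction error and illustrative layer sizes, and it omits the conditional input c of P(z|c, X); it is not the patented implementation.

```python
# Minimal sketch (assumptions: PyTorch, fully connected layers, illustrative sizes).
# The training objective mirrors formula (1): reconstruction error plus a KL term
# pulling the hidden-layer distribution towards N(0, I).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim, z_dim=8, h_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mean of the hidden-variable distribution
        self.logvar = nn.Linear(h_dim, z_dim)    # log-variance of the hidden-variable distribution
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.dec(z), mu, logvar

def vae_loss(x_o, x_f, mu, logvar, lam=1.0):
    recon = F.mse_loss(x_f, x_o, reduction="sum")                   # ‖X_o − X_f‖²
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # KL(P(z|X) ‖ N(0, I))
    return recon + lam * kl
```

Under these assumptions, step 201 amounts to minimising vae_loss over mini-batches of the original samples with an ordinary optimiser such as Adam.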
202. Training a preset classifier model with the original samples in the original training set to obtain a classifier model;
In the embodiment of the invention, the behavior of the variational autoencoder model outputting new samples can be used as the action in the reinforcement learning mechanism, and a reasonable classifier can be set as the environment according to the task requirements of the user. For example, a Bayesian-framework classifier can be used for discrete data, and a multi-layer neural network or a support vector machine can be used for continuous data; the classifier model can be set reasonably according to the user's requirements and can also be replaced by models for other supervised learning tasks, such as linear regression, which is not limited here.
203. Randomly generating new samples with the variational autoencoder model;
after the variational autoencoder model is trained, a batch of hidden-layer variables can be randomly sampled from the normal distribution of the hidden-layer space and mapped to new samples in the sample space by the decoder of the variational autoencoder, as sketched below.
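A short sketch of this sampling step, continuing the assumptions of the previous listing (the function and batch size are illustrative):

```python
# Sketch of step 203: hidden-layer variables are drawn from the standard normal
# prior and mapped to new samples in sample space by the trained decoder.
import torch

def generate_new_samples(vae, n_samples=64, z_dim=8):
    with torch.no_grad():
        z = torch.randn(n_samples, z_dim)   # batch of hidden-layer variables z ~ N(0, I)
        return vae.dec(z)                   # decoder maps z to new samples X_p
```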
204. Training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, and calculating the environment reward variable from the classification index parameter;
after the new samples are randomly generated, the system may train the classifier model with the new samples and classify the original samples in the original training set with the trained new classifier model.
In the embodiment of the invention, the classification index r of the classifier can be used as the reward in the environment; the specific classification index r is determined by the selected classifier and is not limited here. The state in reinforcement learning is set as follows: if the adopted classifier is a neural network classifier capable of back-propagation, the weights of the neural network classifier are directly used as the state, as shown in formula (3), where w_t, b_t are the neural network weights and biases corresponding to the current iteration t; if the adopted classifier is a non-back-propagation classifier, the positive-class classification probability of the classifier on the original data set is taken as the state, as shown in formula (4), which calculates the probability that each training sample is classified as positive under the current classifier, where X_o denotes the training samples and θ_t denotes the classifier corresponding to iteration t.
S_t = (w_t, b_t)   (3)
S_t = P(X_o | θ_t)   (4)
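A short sketch of the two state definitions in formulas (3) and (4); the classifier interfaces assumed here (PyTorch parameters() for a back-propagation classifier, a scikit-learn-style predict_proba for a non-back-propagation classifier) are illustrative, not prescribed by the text.

```python
# Sketch of the state variable S_t: the flattened weights/biases of a
# back-propagation (neural network) classifier as in formula (3), or the
# positive-class probabilities P(X_o | theta_t) on the original training
# samples for a non-back-propagation classifier as in formula (4).
import torch

def get_state(classifier, X_original):
    if isinstance(classifier, torch.nn.Module):                            # formula (3)
        return torch.cat([p.detach().flatten() for p in classifier.parameters()])
    return classifier.predict_proba(X_original)[:, 1]                      # formula (4)
```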
Alternatively, as one possible implementation, the specific process of training the classifier model with the new samples may include:
if the classifier model is a back-propagation classifier, training the classifier model directly with the new samples; if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set and training the preset classifier model with the expanded original training set to obtain a new classifier model, as sketched below.
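A sketch of the two training branches, assuming a PyTorch neural network for the back-propagation case and a scikit-learn-style estimator (e.g. naive Bayes) for the non-back-propagation case. The loss, optimiser and the labels assigned to generated samples are illustrative assumptions, and the generated samples are assumed to be detached from the generator before classifier training.

```python
# Sketch of step 204: train directly on the new batch (back-propagation classifier)
# or refit on the expanded training set (non-back-propagation classifier).
import numpy as np
import torch
import torch.nn.functional as F

def update_classifier(classifier, X_new, y_new, X_orig, y_orig, lr=1e-3):
    if isinstance(classifier, torch.nn.Module):          # back-propagation classifier
        opt = torch.optim.SGD(classifier.parameters(), lr=lr)
        logits = classifier(torch.as_tensor(X_new, dtype=torch.float32))
        loss = F.cross_entropy(logits, torch.as_tensor(y_new, dtype=torch.long))
        opt.zero_grad()
        loss.backward()
        opt.step()
        return classifier
    # non-back-propagation classifier: refit on the expanded training set
    X_ext = np.vstack([X_orig, X_new])
    y_ext = np.concatenate([y_orig, y_new])
    return classifier.fit(X_ext, y_ext)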
205. Calculating an estimated reward of the decoder of the variational autoencoder model with a preset evaluator and the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward;
the reinforcement learning mechanism is provided with an evaluator (implemented as a multi-layer neural network) that can calculate an estimated reward for the decoder of the variational autoencoder model based on the state variable (see formula (5)) and optimize the decoder of the variational autoencoder model according to the estimated reward so as to maximize it.
r_p = critic(S_{t+1}, X_p)   (5)
According to the embodiment of the invention, the evaluator can be trained with its corresponding loss function so that new samples are evaluated more accurately. Each time the environment reward variable and the state variable corresponding to the new samples are obtained, the system can calculate the estimated reward of the evaluator according to formula (5) and fine-tune the decoder according to this estimated reward, so that the estimated reward of subsequently generated new samples is maximized. The specific parameter update of the decoder is shown in formula (6), where critic in formula (5) is the multi-layer neural network corresponding to the evaluator, S_{t+1} denotes the state corresponding to the new classifier, X_p denotes the current new samples, r is the true reward returned by the classifier, r_p is the estimated reward of the evaluator, and the updated decoder parameters include the decoder weights w_D and the decoder biases b_D.
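A sketch of an evaluator (critic) implemented as a multi-layer network, together with the decoder fine-tuning step. Because formula (6) is not reproduced above, the gradient-ascent update on the estimated reward below is an assumption, as is the way the generated batch is summarised (its mean) before entering the critic.

```python
# Sketch of step 205: r_p = critic(S_{t+1}, X_p) as in formula (5), followed by a
# gradient-ascent fine-tune of the decoder that increases the estimated reward.
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, state_dim, x_dim, h_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, 1))

    def forward(self, state, x_batch):
        # the generated batch X_p is summarised by its mean (illustrative assumption)
        return self.net(torch.cat([state, x_batch.mean(dim=0)], dim=-1))

def finetune_decoder(vae, critic, state, z_batch, lr=1e-4):
    opt = torch.optim.Adam(vae.dec.parameters(), lr=lr)   # only the decoder is optimised
    opt.zero_grad()
    x_p = vae.dec(z_batch)
    r_p = critic(state, x_p)          # estimated reward, formula (5)
    (-r_p.mean()).backward()          # ascend on r_p: maximise the estimated reward
    opt.step()
    return r_p.detach()
```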
206. Training the evaluator according to the difference between the environment reward variable and the estimated reward;
optionally, to further improve the accuracy of the evaluator, the embodiment of the present invention may also train the evaluator according to the difference between the environment reward variable and the estimated reward so as to minimize this difference, making subsequent evaluation of new samples more accurate. Each time the environment reward variable and the state variable corresponding to the new samples are obtained, the system can calculate the loss function of the evaluator according to formula (7); the specific parameter update of the evaluator is shown in formula (8), where the updated evaluator parameters w_critic, b_critic are the weights and biases of the neural network model in the evaluator, respectively.
loss_3 = ‖r − r_p‖²   (7)
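A sketch of training the evaluator with the squared error of formula (7); the choice of the Adam optimiser, and creating a fresh optimiser per call, are simplifications for illustration.

```python
# Sketch of step 206: minimise loss_3 = ||r - r_p||^2 so the evaluator's estimate
# tracks the true reward returned by the classifier.
import torch
import torch.nn.functional as F

def train_critic(critic, state, x_p, r_env, lr=1e-3):
    opt = torch.optim.Adam(critic.parameters(), lr=lr)
    r_p = critic(state, x_p.detach())                     # estimated reward
    loss = F.mse_loss(r_p, torch.as_tensor([r_env], dtype=torch.float32))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```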
It will be appreciated that the reinforcement learning process is continuous, and the rationality of data sample oversampling can be further improved through multiple rounds of reinforcement learning. Steps 203 to 206 can be taken as the iterative computation corresponding to one round of reinforcement learning; in practical applications, a preset number of iterative computations can be executed according to the needs of the user, and the specific number of iterations is not limited here. If step 206 is not needed, steps 203 to 205 can be taken as the iterative computation corresponding to one round of reinforcement learning.
207. If the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability;
during multiple rounds of reinforcement learning, the new samples randomly generated in a certain round may be unreasonable, and a new classifier updated based on such unreasonable new samples may have a negative effect on the final task, which is detrimental to the rationality of the oversampling of the data samples. To reduce this negative effect, optionally, as a possible implementation, the embodiment of the present invention may further selectively update the original classifier after each round of reinforcement learning; the specific update process may include: if the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
For example, when the classifier model is a back-propagation classifier, the weights of the original classifier may be updated with a preset proportion τ in the manner of formula (9), where w_{t+1} is the neural network weight of the classifier after the (t+1)-th iteration. When the classifier model is a non-back-propagation classifier, the original classifier can be retained according to a preset proportion λ: a random number P is generated, and if P > λ then θ_{t+1} = θ_t (the original classifier is kept), while if P ≤ λ the new classifier is adopted, where θ_{t+1} is the classifier after the (t+1)-th iteration.
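A sketch of this optional update. Since formula (9) is not reproduced above, the soft-update form w_{t+1} = (1 − τ)·w_t + τ·w_new is an assumed reading of "updating by a preset proportion τ"; the probabilistic-retention branch follows the rule described in the text, with all names illustrative.

```python
# Sketch of step 207: blend weights by the preset proportion tau (back-propagation
# classifier) or keep the original classifier with probability tied to lambda.
import random
import torch

def keep_or_update(old_clf, new_clf, tau=0.1, lam=0.5):
    if isinstance(new_clf, torch.nn.Module):              # back-propagation classifier
        with torch.no_grad():
            for p_old, p_new in zip(old_clf.parameters(), new_clf.parameters()):
                p_old.mul_(1.0 - tau).add_(tau * p_new)   # assumed form of formula (9)
        return old_clf
    # non-back-propagation classifier: keep the original with the preset probability
    return old_clf if random.random() > lam else new_clf
```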
It will be appreciated that the process of updating the original classifier in step 207 is an optional step and may be selectively performed according to the needs of the user.
208. Randomly generating new samples with the optimized variational autoencoder model.
Based on the reinforcement learning mechanism, the variational autoencoder model optimized through the preset number of iterative computations can output new samples related to the task requirements of the user, so that feedback adjustment of the oversampling of the original samples is realized and the rationality of the oversampling of the data samples is improved.
Optionally, the method may also retain the classifier model obtained at the end of training; when the task requires a classifier, the classifier trained during the reinforcement learning process can be used directly, saving system computing resources.
Optionally, the new samples generated in the iterative computation corresponding to each round of reinforcement learning may be selected and stored in the original training set according to the classification result of the classifier on the new samples. For example, if the user needs positive samples, the positive samples may be selected and stored according to the classification result of the classifier on the new samples, as sketched below.
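A sketch of this optional sample-retention step, assuming a scikit-learn-style predict interface and the positive (minority) class as the preset type; the function and label names are illustrative.

```python
# Sketch: keep only generated samples that the current classifier assigns to the
# preset class, and append them to the original training set.
import numpy as np

def keep_positive_samples(classifier, X_new, X_orig, y_orig, positive_label=1):
    pred = classifier.predict(X_new)                      # classification result on the new samples
    keep = X_new[pred == positive_label]                  # select the preset type, e.g. positive class
    X_ext = np.vstack([X_orig, keep])
    y_ext = np.concatenate([y_orig, np.full(len(keep), positive_label)])
    return X_ext, y_ext
```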
Compared with the traditional process of first expanding the data set and then training the classifier, the method combines the final performance of the classifier with the data generation, takes the improvement of classification performance as the goal of the data generation stage, and uses gradient search to purposefully generate specific data, so as to improve the performance of the downstream classifier while keeping the amount of generated data as small as possible. The invention thus improves the correlation between the generated data and the classification performance of the final classifier, and the rationality of the generated data.
For ease of understanding, the reinforcement learning-based data preprocessing method in the embodiment of the present invention will be described below in connection with specific application embodiments.
Referring to fig. 3, the specific steps of the reinforcement learning-based data generation framework are implemented as follows:
the structure of the whole framework is shown in fig. 3. The framework has three parts, each corresponding to a different role in reinforcement learning: the generator corresponds to the actor, and the new samples it generates serve as the actions taken by the actor; the classifier corresponds to the environment, and its classification index serves as the reward in the environment. Specifically:
Generator: the generator includes a variational autoencoder. The variational autoencoder is divided into an encoder and a decoder, so the original data distribution can be modeled directly, and no distribution assumption on the original data is needed during modeling, which gives the method a wider application range.
The original distribution of the data is modeled by the variational autoencoder, and a random oversampling function is provided by constraining the distribution form of the hidden-layer space variables. The loss function is shown in formula (1):
L = ‖X_o − X_f‖² + λKL(P(z|c, X) ‖ N(0, I))   (1)
KL(P‖Q) = E_{x∼P}[log P(x) − log Q(x)]   (2)
where KL denotes the Kullback-Leibler divergence, which measures the difference between two probability distributions and is defined in formula (2); X_o denotes the raw data, X_f denotes the data reconstructed by the variational autoencoder, P(z|c, X) denotes the true hidden-variable distribution of the data set under the encoder mapping, z denotes the hidden variable, and N(0, I) denotes the multi-dimensional standard normal distribution. KL(P‖Q) denotes the KL divergence between the distributions P(x) and Q(x), and E_{x∼P} denotes the expectation over x following the distribution P(x).
The training process of the variational autoencoder is divided into two steps: in the first step, only the original distribution of the data is modeled, and the loss function in formula (1) is optimized directly without considering the downstream classifier and classification criterion; in the second step, the generator is optimized according to the feedback information of the evaluator (critic), so that the generated samples can effectively improve the final classification target.
In this second step, since the loss function of the encoder is independent of the hidden-layer data distribution, only the decoder is optimized, and the evaluator estimates the reward as shown in formula (5).
r_p = critic(S_{t+1}, X_p)   (5)
where critic in formula (5) is the multi-layer neural network corresponding to the evaluator, S_{t+1} denotes the state corresponding to the new classifier, and X_p denotes the currently generated samples.
The decoder parameters are updated as shown in formula (6), using a gradient-ascent algorithm, where w_D are the decoder weights and b_D are the decoder biases.
Classifier: different classifiers can be selected according to the characteristics of different tasks; for example, a Bayesian-framework classifier can be used for discrete data, and a multi-layer neural network or a support vector machine can be adopted for continuous data. The framework can also be used for other supervised learning tasks, such as regression. One advantage of the invention is that different indexes of a variety of downstream tasks can be optimized. Taking the classification task as an example, classifiers are divided into those that can be optimized with back-propagation and those that cannot. Gradients can be accumulated continuously in back-propagation, so the original training set is used for training in the early stage of classifier training, while during the generator fine-tuning phase only the generated data are used to optimize the classifier, in order to evaluate the effect of this batch of generated data on the classifier. When training a classifier that cannot be optimized by back-propagation, its update states cannot be directly superimposed, so the original training set is used in the initial training stage, and in the generator fine-tuning stage the generated data are added to the original training set for expansion, a new classifier is trained on the expanded data set, and whether this batch of data is retained is decided according to the classification result of the new classifier; that is, the data set grows larger and larger.
Depending on the classifier, the state variable in reinforcement learning is set as follows: for a back-propagation classifier, the parameters of the neural network classifier are directly used as the state, as shown in formula (3), where w_t, b_t are the neural network weights and biases corresponding to the current iteration t; for a non-back-propagation classifier, the positive-class classification probability of the classifier on the original data set is taken as the state, as shown in formula (4), where X_o denotes the training samples and θ_t denotes the classifier corresponding to iteration t; formula (4) calculates the probability that each training sample is classified as positive under the current classifier.
S_t = (w_t, b_t)   (3)
S_t = P(X_o | θ_t)   (4)
Evaluator: the evaluator follows the reinforcement learning framework and measures the new samples according to the classification result of the new classifier; the measurement result is used as the estimated reward r_p (calculated by formula (5)) and participates in the optimization of the generator. This setting ensures the correlation between the new samples and the measurement result, and the framework also encapsulates the calculation details of the reward, so different task indexes can be optimized.
To obtain a more accurate evaluator, a multi-layer neural network may be used; its loss function is shown in formula (7), where r is the true reward returned by the classifier and r_p is the reward estimated by the critic. The parameter update is shown in formula (8), where w_critic, b_critic are the weights and biases corresponding to the evaluator, respectively.
loss_3 = ‖r − r_p‖²   (7)
In addition, if the classifier supports back-propagation, each round of reinforcement learning updates a certain proportion of the weights of the new classifier into the original classifier: as shown in formula (9), the parameters of the original classifier are updated with a preset proportion τ, where w_{t+1} is the neural network weight of the classifier after the (t+1)-th iteration. When the classifier model is a non-back-propagation classifier, the original classifier is updated according to a preset proportion λ: a random number P is generated, and if P > λ then θ_{t+1} = θ_t (the original classifier is kept), while if P ≤ λ the new classifier is adopted, where θ_{t+1} is the classifier after the (t+1)-th iteration.
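For reference, the following sketch composes the earlier listings into one training loop corresponding to the framework of fig. 3. It assumes the non-back-propagation classifier branch, binary labels with F1 as the classification index, a minority-class label for generated samples, and a critic built with state_dim equal to the number of original training samples; none of these choices is prescribed by the text.

```python
# End-to-end sketch (assumption-laden) combining VAE, update_classifier, Critic,
# finetune_decoder, train_critic, get_state and keep_or_update from the listings above.
import copy
import numpy as np
import torch
from sklearn.metrics import f1_score

def rl_oversampling(vae, classifier, critic, X_orig, y_orig,
                    n_iter=100, batch=64, z_dim=8, minority_label=1):
    for t in range(n_iter):
        # action: the generator produces a batch of new samples
        z = torch.randn(batch, z_dim)
        X_new = vae.dec(z)
        y_new = np.full(batch, minority_label)             # assumed label of generated samples
        # environment: refit the classifier on the expanded set, classify the original set
        new_clf = update_classifier(copy.deepcopy(classifier),
                                    X_new.detach().numpy(), y_new, X_orig, y_orig)
        state = torch.as_tensor(get_state(new_clf, X_orig), dtype=torch.float32)  # formula (4)
        r = f1_score(y_orig, new_clf.predict(X_orig))       # classification index as environment reward
        # actor update: fine-tune the decoder towards a higher estimated reward (formula (5))
        finetune_decoder(vae, critic, state, z)
        # evaluator update: pull the estimated reward towards the real reward (formula (7))
        train_critic(critic, state, X_new, r)
        # optional step 207: keep or replace the working classifier
        classifier = keep_or_update(classifier, new_clf)
    return vae, classifier
```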
Compared with the traditional oversampling process, the method combines the final performance of the classifier with the data generation, takes the improvement of classification performance as the goal of the data generation stage, and purposefully generates specific data by gradient search to improve the performance of the downstream classifier while reducing the amount of generated data as much as possible. The invention improves the correlation between the generated data and the classification performance of the final classifier. This embodiment uses a reinforcement learning method to fit the influence of new data on the classifier and feeds the prediction result back to the data generator. The classifier serves as the environment in reinforcement learning, so the type of classifier need not be restricted; the classification performance serves as the reward, so the type of classification index need not be restricted.
It will be appreciated that the above implementation of reinforcement learning is merely exemplary. In practical applications, the task type may be modified, the classifier may be replaced with other models in supervised learning as required, and the classification evaluation index used when computing the environment reward, such as precision, F1 value or AUC value, may be set according to the selected classifier type, which is not limited here.
Referring to fig. 4, the embodiment of the invention further provides a data preprocessing system based on reinforcement learning, which may include:
a training unit 401, configured to train a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
an optimization unit 402, configured to optimize the variational autoencoder model based on a reinforcement learning mechanism;
an output unit 403, configured to randomly generate new samples with the optimized variational autoencoder model.
Optionally, as a possible implementation manner, the optimizing unit in the embodiment of the present invention may include:
the training module is used for training a preset classifier model by adopting an original sample in an original training set to obtain the classifier model;
the processing module is used for executing iterative computation with preset quantity, and one iterative computation in the iterative computation comprises the following steps:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model according to the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
Optionally, as a possible implementation manner, the system in the embodiment of the present invention may include: and the training module is used for training the evaluator according to the difference between the environmental rewards variable and the estimated rewards so as to minimize the difference between the environmental rewards variable and the estimated rewards.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may include:
the processing submodule is used for training the classifier model directly with the new samples if the classifier model is a back-propagation classifier, and, if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set and training the preset classifier model with the expanded original training set to obtain a new classifier model.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may further include:
the adjusting submodule is used for updating the parameters of the original classifier according to a preset proportion if the classifier model is a back-propagation classifier, and retaining the original classifier according to a preset probability if the classifier model is a non-back-propagation classifier.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention may further include:
and the storage sub-module is used for selecting a new sample of a preset type to store in the original training set according to the classification result of the classifier on the new sample.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The reinforcement learning-based data preprocessing system in the embodiment of the present invention is described above from the point of view of the modularized functional entity, and referring to fig. 5, the computer device in the embodiment of the present invention is described below from the point of view of hardware processing:
the computer device 1 may include a memory 11, a processor 12, and an input-output bus 13. The steps in the reinforcement learning-based data preprocessing method embodiment shown in fig. 1 described above, such as steps 101 to 103 shown in fig. 1, are implemented when the processor 12 executes a computer program. In the alternative, the processor may implement the functions of the modules or units in the above-described embodiments of the apparatus when executing the computer program.
In some embodiments of the present invention, the processor is specifically configured to implement the following steps:
training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
optimizing the variational autoencoder model based on a reinforcement learning mechanism;
and randomly generating new samples with the optimized variational autoencoder model.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
training a preset classifier model by adopting an original sample in an original training set to obtain a classifier model;
executing a preset number of iterative computations, wherein one iterative computation in the iterative computations comprises:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model according to the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps: and training an evaluator to minimize the difference between the environmental rewards variable and the predicted rewards based on the difference between the environmental rewards variable and the predicted rewards.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
if the classifier model is a back-propagation classifier, training the classifier model directly with the new samples; if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set, and training the preset classifier model with the expanded original training set to obtain a new classifier model.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
if the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
and selecting a new sample of a preset type according to the classification result of the classifier on the new sample, and storing the new sample in the original training set.
The memory 11 includes at least one type of readable storage medium including flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the computer device 1, such as a hard disk of the computer device 1. The memory 11 may also be an external storage device of the computer apparatus 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer apparatus 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the computer apparatus 1. The memory 11 may be used not only for storing application software installed in the computer apparatus 1 and various types of data, for example, codes of the computer program 01, but also for temporarily storing data that has been output or is to be output.
The processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip for executing program code or processing data stored in the memory 11, e.g. executing a computer program 01 or the like.
The input/output bus 13 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus, or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc.
Further, the computer apparatus may also comprise a wired or wireless network interface 14, and the network interface 14 may optionally comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the computer apparatus 1 and other electronic devices.
Optionally, the computer device 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the computer device 1 and for displaying a visual user interface.
Fig. 5 shows only a computer device 1 with components 11-14 and a computer program 01, it being understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the computer device 1, and may comprise fewer or more components than shown, or may combine certain components, or a different arrangement of components.
The present invention also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor, can implement the steps of:
training a preset variational autoencoder model with the original samples in an original training set to obtain a trained variational autoencoder model;
optimizing the variational autoencoder model based on a reinforcement learning mechanism;
and randomly generating new samples with the optimized variational autoencoder model.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
training a preset classifier model by adopting an original sample in an original training set to obtain a classifier model;
executing a preset number of iterative computations, wherein one iterative computation in the iterative computations comprises:
randomly generating new samples with the variational autoencoder model;
training the classifier model with the new samples, classifying the original samples in the original training set with the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
and calculating an estimated reward of the decoder of the variational autoencoder model according to the state variable, and optimizing the decoder of the variational autoencoder model according to the estimated reward so as to maximize the estimated reward.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps: and training an evaluator to minimize the difference between the environmental rewards variable and the predicted rewards based on the difference between the environmental rewards variable and the predicted rewards.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
if the classifier model is a back-propagation classifier, training the classifier model directly with the new samples; if the classifier model is a non-back-propagation classifier, adding the new samples to the original training set, and training the preset classifier model with the expanded original training set to obtain a new classifier model.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
If the classifier model is a back-propagation classifier, updating the parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
In the alternative, as a possible implementation, the processor may be further configured to implement the following steps:
and selecting a new sample of a preset type according to the classification result of the classifier on the new sample, and storing the new sample in the original training set.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, randomAccess Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
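To tie the preceding steps together, the following sketch illustrates one decoder update consistent with the optimization described above, in which the decoder is adjusted to maximize the evaluator's estimated reward; the toy state construction, network sizes, and all names are assumptions, not the patented implementation.

# Illustrative only: one decoder update of the reinforcement-learning optimization.
# Latent codes are decoded into new samples, a toy state is derived from the batch,
# and the decoder is adjusted by gradient ascent on the evaluator's estimated reward.
import torch
import torch.nn as nn

latent_dim, feature_dim, state_dim = 8, 16, 16
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, feature_dim))
evaluator = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 1))
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def decoder_update(batch_size: int = 64) -> float:
    z = torch.randn(batch_size, latent_dim)         # random latent codes
    new_samples = decoder(z)                        # randomly generated new samples
    state = new_samples.mean(dim=0, keepdim=True)   # toy stand-in for the state variable
    estimated_reward = evaluator(state)             # evaluator's estimated reward
    loss = -estimated_reward.mean()                 # maximize the estimated reward
    decoder_optimizer.zero_grad()
    loss.backward()
    decoder_optimizer.step()
    return -loss.item()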

Claims (8)

1. A reinforcement learning-based data preprocessing method, comprising:
training a preset variational self-encoder model by adopting an original sample in an original training set to obtain a variational self-encoder model; the original sample is data related to overdue-risk users in the field of financial risk control;
optimizing the variational self-encoder model based on a reinforcement learning mechanism;
randomly generating a new sample according to the optimized variational self-encoder model;
wherein the optimizing the variational self-encoder model based on the reinforcement learning mechanism comprises:
training a preset classifier model by adopting an original sample in an original training set to obtain a classifier model;
executing a preset number of iterative computations, wherein one of the iterative computations comprises:
randomly generating a new sample by adopting the variational self-encoder model;
training the classifier model by adopting the new sample, classifying the original sample in the original training set by adopting the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
calculating an estimated reward of the decoder of the variational self-encoder model by adopting a preset evaluator and the state variable, and optimizing the decoder of the variational self-encoder model according to the estimated reward so as to maximize the estimated reward;
after training the classifier model with the new sample, the method further comprises:
if the classifier model is a back-propagation classifier, updating parameters of the original classifier according to a preset proportion; if the classifier model is a non-back-propagation classifier, retaining the original classifier according to a preset probability.
2. The method as recited in claim 1, further comprising:
training the evaluator according to the difference between the environment reward variable and the estimated reward, so as to minimize the difference between the environment reward variable and the estimated reward.
3. The method of claim 1, wherein the training the classifier model with the new sample comprises:
if the classifier model is a back-propagation classifier, directly training the classifier model by adopting the new sample; if the classifier model is a non-back-propagation classifier, adding the new sample to the original training set, and training the preset classifier model by adopting the expanded original training set to obtain a new classifier model.
4. The method as recited in claim 1, further comprising:
selecting a new sample of a preset type according to the classification result of the classifier on the new sample, and storing the selected new sample in the original training set.
5. A reinforcement learning-based data preprocessing system, comprising:
a training unit, used for training a preset variational self-encoder model by adopting an original sample in an original training set to obtain a variational self-encoder model;
an optimization unit, used for optimizing the variational self-encoder model based on a reinforcement learning mechanism;
an output unit, used for randomly generating a new sample according to the optimized variational self-encoder model;
wherein the optimization unit comprises:
a training module, used for training a preset classifier model by adopting an original sample in an original training set to obtain the classifier model;
a processing module, used for executing a preset number of iterative computations, wherein one of the iterative computations comprises:
randomly generating a new sample by adopting the variational self-encoder model;
training the classifier model by adopting the new sample, classifying the original sample in the original training set by adopting the trained new classifier model, calculating a classification index parameter and a state variable, and taking the classification index parameter as an environment reward variable;
calculating an estimated reward of the decoder of the variational self-encoder model by adopting a preset evaluator and the state variable, and optimizing the decoder of the variational self-encoder model according to the estimated reward so as to maximize the estimated reward;
wherein the processing module comprises:
a processing submodule, used for directly training the classifier model by adopting the new sample if the classifier model is a back-propagation classifier, and, if the classifier model is a non-back-propagation classifier, adding the new sample to the original training set and training the preset classifier model by adopting the expanded original training set to obtain a new classifier model.
6. The system of claim 5, further comprising:
a training module, used for training the evaluator according to the difference between the environment reward variable and the estimated reward, so as to minimize the difference between the environment reward variable and the estimated reward.
7. A computer device comprising a processor for implementing the steps of the method according to any one of claims 1 to 4 when executing a computer program stored in a memory.
8. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
CN202010363808.1A 2020-04-30 2020-04-30 Data preprocessing method, system and related equipment based on reinforcement learning Active CN111563548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010363808.1A CN111563548B (en) 2020-04-30 2020-04-30 Data preprocessing method, system and related equipment based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111563548A CN111563548A (en) 2020-08-21
CN111563548B true CN111563548B (en) 2024-02-02

Family

ID=72074602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010363808.1A Active CN111563548B (en) 2020-04-30 2020-04-30 Data preprocessing method, system and related equipment based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111563548B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11593660B2 (en) * 2018-09-18 2023-02-28 Insilico Medicine Ip Limited Subset conditioning using variational autoencoder with a learnable tensor train induced prior

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274029A (en) * 2017-06-23 2017-10-20 Shenzhen Weitesi Technology Co., Ltd. Future prediction method for interactive media in dynamic scenes
CN108334497A (en) * 2018-02-06 2018-07-27 Beihang University Method and apparatus for automatically generating text
CN109886388A (en) * 2019-01-09 2019-06-14 Ping An Technology (Shenzhen) Co., Ltd. Training sample data expansion method and device based on a variational autoencoder
CN110046712A (en) * 2019-04-04 2019-07-23 Tianjin University of Science and Technology Decision search learning method based on latent space modeling of generative models
CN110728314A (en) * 2019-09-30 2020-01-24 Xi'an Jiaotong University Method for detecting active users of a large-scale scheduling-free system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Huaiyuan; Chen Qifan. Transient stability assessment method based on cost-sensitive stacked variational autoencoders. Proceedings of the CSEE, (07), full text. *

Also Published As

Publication number Publication date
CN111563548A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
US10460230B2 (en) Reducing computations in a neural network
US10460236B2 (en) Neural network learning device
CN110476172A (en) Neural framework for convolutional neural networks is searched for
CN108021983A (en) Neural framework search
CN111080397A (en) Credit evaluation method and device and electronic equipment
US11704570B2 (en) Learning device, learning system, and learning method
US10592777B2 (en) Systems and methods for slate optimization with recurrent neural networks
CN112884236B (en) Short-term load prediction method and system based on VDM decomposition and LSTM improvement
CN114072809A (en) Small and fast video processing network via neural architectural search
US11763151B2 (en) System and method for increasing efficiency of gradient descent while training machine-learning models
CN114817571A (en) Method, medium, and apparatus for predicting achievement quoted amount based on dynamic knowledge graph
CN110007371A (en) Wind speed forecasting method and device
Gullapalli Associative reinforcement learning of real-valued functions
CN111563548B (en) Data preprocessing method, system and related equipment based on reinforcement learning
CN112381591A (en) Sales prediction optimization method based on LSTM deep learning model
Ortega-Zamorano et al. FPGA implementation of neurocomputational models: comparison between standard back-propagation and C-Mantec constructive algorithm
WO2022222230A1 (en) Indicator prediction method and apparatus based on machine learning, and device and storage medium
CN110322055B (en) Method and system for improving grading stability of data risk model
CN113191527A (en) Prediction method and device for population prediction based on prediction model
CN111898626A (en) Model determination method and device and electronic equipment
US11928128B2 (en) Construction of a meta-database from autonomously scanned disparate and heterogeneous sources
US20230351169A1 (en) Real-time prediction of future events using integrated input relevancy
US11822564B1 (en) Graphical user interface enabling interactive visualizations using a meta-database constructed from autonomously scanned disparate and heterogeneous sources
US20230368013A1 (en) Accelerated model training from disparate and heterogeneous sources using a meta-database
US20230351491A1 (en) Accelerated model training for real-time prediction of future events

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant