CN113743603A - Control method, control device, storage medium and electronic equipment

Control method, control device, storage medium and electronic equipment

Info

Publication number
CN113743603A
CN113743603A
Authority
CN
China
Prior art keywords
preset
model
sample set
data sample
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010478055.9A
Other languages
Chinese (zh)
Inventor
张玥
詹仙园
朱翔宇
霍雨森
殷宏磊
郑宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong City Beijing Digital Technology Co Ltd
Original Assignee
Jingdong City Beijing Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong City Beijing Digital Technology Co Ltd
Priority to CN202010478055.9A
Publication of CN113743603A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The application provides a control method, a control device, a storage medium and an electronic device. In the control method provided by the embodiments of the application, state detection parameters representing the physical state of the device to be controlled are acquired and input into a preset deep reinforcement learning model to determine the control instructions corresponding to those parameters. The preset deep reinforcement learning model is trained on a first data sample set formed from actual measurement values and on a second data sample set formed from simulation values determined according to a preset simulation model and the first data sample set. Because real data and simulation data are combined, the distribution of the learning strategy determined by the deep reinforcement learning model is closer to the real strategy distribution, which improves the degree to which the control instructions determined by the model match the actual situation.

Description

Control method, control device, storage medium and electronic equipment
Technical Field
The present application relates to the field of device control technologies, and in particular, to a control method, an apparatus, a storage medium, and an electronic device.
Background
With the development of the intelligent control field, efficiency needs to be improved as much as possible while ensuring the safe and stable operation of the electronic equipment to be controlled.
Deep reinforcement learning algorithms have become popular in many fields, and the emergence of deep reinforcement learning has made reinforcement learning technology genuinely practical, making it possible to solve complex control problems of electronic equipment in real scenes.
In existing training of deep reinforcement learning models for control, an imperfect simulation environment is mainly built from offline data and then treated as the real environment during training. However, an imperfect simulation environment built from offline data cannot fully reflect the real environment, so the control instructions determined by the deep reinforcement learning model easily deviate substantially from the actual situation.
Disclosure of Invention
The embodiment of the application provides a control method, a control device, a storage medium and electronic equipment, so as to solve the technical problem that the deviation between a control instruction determined according to a deep reinforcement learning model and an actual situation is large.
In a first aspect, an embodiment of the present application provides a control method, including:
acquiring state detection parameters, wherein the state detection parameters are used for representing the physical state of the equipment to be controlled;
and determining a control instruction according to the state detection parameters and a preset reinforced deep learning model, wherein the control instruction is used for controlling the equipment to be controlled, the preset reinforced deep learning model is obtained by training according to a first data sample set and a second data sample set, the first data sample set is a sample set formed by actual measurement values corresponding to the state detection parameters, and the second data sample set is a sample set formed by simulation values determined according to a preset simulation model.
In one possible design, the control method further includes:
obtaining the first set of data samples;
determining the second data sample set according to the first data sample set and the preset simulation model;
and determining a mixed sample pool according to the first data sample set and the second data sample set, wherein the mixed sample pool is used for training the preset reinforced deep learning model.
In one possible design, the determining the second set of data samples from the first set of data samples and the pre-set simulation model includes:
determining a data sample set to be selected according to the first data sample set and the preset simulation model;
and screening samples meeting preset joint distribution limiting conditions from the to-be-selected data sample set to determine the second data sample set.
In one possible design, the determining a candidate data sample set according to the first data sample set and the preset simulation model includes:
determining the data sample set to be selected according to the first data sample set and a preset joint condition probability distribution model, wherein the joint condition probability distribution model is a deep neural network model;
correspondingly, the step of screening the samples meeting the preset joint distribution limiting conditions from the to-be-selected data sample set includes:
and screening out samples with joint conditional probability greater than a preset probability threshold from the to-be-selected data sample set.
In one possible design, the determining a candidate data sample set according to the first data sample set and the preset simulation model includes:
determining the data sample set to be selected according to the first data sample set and a preset countermeasure network;
correspondingly, the step of screening the samples meeting the preset joint distribution limiting conditions from the to-be-selected data sample set includes:
and screening out samples with discrimination values larger than a preset discrimination threshold value from the to-be-selected data sample set, wherein the discrimination values are determined according to a discrimination network.
In one possible design, after the determining a mixed sample pool from the first set of data samples and the second set of data samples, the method further includes:
training a preset reference strategy model by using the first data sample set;
determining the strategy distribution difference between the trained reference strategy model and a preset learning strategy model;
training the preset learning strategy model, the preset reward value model and the preset safety value model by using the mixed sample pool, wherein the strategy distribution difference is used as a regularization item of the preset learning strategy model, the preset reward value model and the preset safety value model;
and when the training steps meet the preset conditions, taking the trained learning strategy model as the preset reinforced deep learning model.
In one possible design, the determining a difference in strategy distribution between the trained reference strategy model and a preset learning strategy model includes:
determining the strategy distribution difference between the trained reference strategy model and the preset learning strategy model according to a preset maximum mean difference algorithm; or,
and determining the strategy distribution difference between the trained reference strategy model and the preset learning strategy model according to a preset divergence algorithm.
In one possible design, the predetermined reference strategy model is a variational encoder.
In one possible design, a preset quadratic constraint form is used to determine a training target during the training process of the preset learning strategy model, the preset reward value model and the preset safety value model.
In one possible design, the preset simulation model is any one or a combination of a deep neural network model, a recurrent neural network model and a convolutional neural network model.
In a second aspect, an embodiment of the present application further provides a control device, including:
the parameter acquisition module is used for acquiring state detection parameters, and the state detection parameters are used for representing the physical state of the equipment to be controlled;
and the control processing module is used for determining a control instruction according to the state detection parameters and a preset reinforcement deep learning model, wherein the control instruction is used for controlling the equipment to be controlled, the preset reinforcement deep learning model is obtained by training according to a first data sample set and a second data sample set, the first data sample set is a sample set formed by actual measurement values corresponding to the state detection parameters, and the second data sample set is a sample set formed by simulation values determined according to a preset simulation model.
In one possible design, the control device further includes:
a sample acquisition module that acquires the first set of data samples;
the sample simulation module is used for determining the second data sample set according to the first data sample set and the preset simulation model;
and the sample generation module is used for determining a mixed sample pool according to the first data sample set and the second data sample set, wherein the mixed sample pool is used for training the preset reinforced deep learning model.
In one possible design, the sample simulation module is specifically configured to:
determining a data sample set to be selected according to the first data sample set and the preset simulation model;
and screening samples meeting preset joint distribution limiting conditions from the to-be-selected data sample set to determine the second data sample set.
In one possible design, the sample simulation module is specifically configured to:
determining the data sample set to be selected according to the first data sample set and a preset joint condition probability distribution model, wherein the joint condition probability distribution model is a deep neural network model;
and screening out samples with joint conditional probability greater than a preset probability threshold from the to-be-selected data sample set.
In one possible design, the sample simulation module is specifically configured to:
determining the data sample set to be selected according to the first data sample set and a preset countermeasure network;
and screening out samples with discrimination values larger than a preset discrimination threshold value from the to-be-selected data sample set, wherein the discrimination values are determined according to a discrimination network.
In one possible design, the control device further includes:
the model training module is used for training a preset benchmark strategy model by utilizing the first data sample set;
the difference determining module is used for determining the strategy distribution difference between the trained reference strategy model and a preset learning strategy model;
the model training module is further configured to train the preset learning strategy model, the preset reward value model and the preset safety value model by using the mixed sample pool, wherein the strategy distribution difference is used as a regularization item of the preset learning strategy model, the preset reward value model and the preset safety value model, and when the training step number meets a preset condition, the trained learning strategy model is used as the preset reinforcement deep learning model.
In one possible design, the difference determining module is specifically configured to:
determining the strategy distribution difference between the trained reference strategy model and the preset learning strategy model according to a preset maximum mean difference algorithm; or,
and determining the strategy distribution difference between the trained reference strategy model and the preset learning strategy model according to a preset divergence algorithm.
In one possible design, the predetermined reference strategy model is a variational encoder.
In one possible design, a preset quadratic constraint form is used to determine a training target during the training process of the preset learning strategy model, the preset reward value model and the preset safety value model.
In one possible design, the preset simulation model is any one or a combination of a deep neural network model, a recurrent neural network model and a convolutional neural network model.
In a third aspect, an embodiment of the present application further provides an electronic device, including:
a processor; and,
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any one of the control methods of the first aspect via execution of the executable instructions.
In a fourth aspect, the present application further provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement any one of the control methods in the first aspect.
The control method, the control device, the storage medium and the electronic equipment provided by the embodiments of the application acquire state detection parameters representing the physical state of the device to be controlled and input them into a preset deep reinforcement learning model to determine the control commands corresponding to those parameters. The preset deep reinforcement learning model is trained on a first data sample set formed from actual measurement values and on a second data sample set formed from simulation values determined according to a preset simulation model and the first data sample set. Because real data and simulation data are combined, the distribution of the learning strategy determined by the model is closer to the real strategy distribution, which improves the degree to which the control commands determined by the model match the actual situation.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a diagram of an application scenario of a control method illustrated in the present application according to an example embodiment;
FIG. 2 is a flow chart diagram illustrating a control method according to an example embodiment;
FIG. 3 is a schematic flow diagram illustrating the construction of a mixed sample pool shown herein according to an example embodiment;
FIG. 4 is a flow diagram illustrating a hybrid training method according to an example embodiment;
FIG. 5 is a schematic flow diagram illustrating a training method for an enhanced deep learning model according to an example embodiment;
FIG. 6 is a flow diagram illustrating a security constraint and policy value regularization off-policy reinforcement learning approach according to an example embodiment;
FIG. 7 is a schematic diagram of a control device shown in the present application according to an example embodiment;
FIG. 8 is a schematic diagram of a control device shown in the present application according to another example embodiment;
fig. 9 is a schematic structural diagram of an electronic device shown in the present application according to an example embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With the development of the intelligent control field, efficiency needs to be improved as much as possible while ensuring the safe and stable operation of the electronic equipment to be controlled. Deep reinforcement learning algorithms have become popular in many fields, and the emergence of deep reinforcement learning has made reinforcement learning technology genuinely practical, making it possible to solve complex control problems of electronic equipment in real scenes. In existing training of deep reinforcement learning models for control, an imperfect simulation environment is mainly built from offline data and then treated as the real environment during training. However, an imperfect simulation environment built from offline data cannot fully reflect the real environment, so the control instructions determined by the deep reinforcement learning model easily deviate substantially from the actual situation.
In addition, it is worth noting that deep reinforcement learning is the product of combining deep learning with reinforcement learning: it integrates the strong comprehension ability of deep learning on perception problems such as vision with the decision-making ability of reinforcement learning, enabling end-to-end learning. The emergence of deep reinforcement learning has made reinforcement learning technology genuinely practical, so that complex problems in real scenes can be solved. In recent years, deep reinforcement learning algorithms have become popular in different fields, for example defeating top human players in video games and board games, controlling complex machines, allocating network resources, saving substantial energy for data centers, and even automatically tuning the parameters of machine learning algorithms.
The control field is one of the origins of reinforcement learning ideas and is also the field where reinforcement learning technology is most maturely applied. One common example is using artificial intelligence to tune machines and equipment, which previously required expert operators. Reinforcement learning techniques from DeepMind in the UK have helped Google significantly reduce the energy consumption of its data centers. In the field of automatic driving, driving is a sequential decision-making process and is therefore naturally suited to reinforcement learning. From ALVINN in the 1980s and TORCS to today's CARLA, the industry has attempted to use reinforcement learning to solve both the autonomous driving of a single vehicle and the traffic scheduling of multiple vehicles. Similar ideas are also widely applied to various aircraft and unmanned underwater vehicles. Compared with tasks in computer vision, many tasks in natural language processing are multi-turn, i.e., the optimal solution must be sought through multiple rounds of iterative interaction, for example in dialogue systems; moreover, the feedback signal of a task often arrives only after a series of decisions, for example in machine writing. Such problems are naturally suited to reinforcement learning, which has therefore been applied in recent years to a variety of natural language processing tasks such as text generation, text summarization, sequence labeling, dialogue robots (text/speech), machine translation, relation extraction and knowledge-graph reasoning. In addition, reinforcement learning is widely applied to commodity recommendation, news recommendation, online advertising, finance, communication, production scheduling, planning, resource access control and other operations-research fields, and attempts have even been made in education and training, health care, medicine and similar fields.
In addition, the basic idea of reinforcement learning is that the agent directs its behavior through rewards gained by interacting with the environment, with the goal of finding an optimal strategy that obtains the maximum cumulative reward. The Markov decision process (MDP) in reinforcement learning is defined by a quadruple M = (S, A, R, T), where S is the state space of the environment, A is the action space of the agent, R(s, a) is the reward function whose return value is the reward obtained by performing action a in state s, and T(s' | s, a) is the state transition probability function giving the probability that the environment transitions to state s' after action a is performed in state s. The goal of reinforcement learning is to find a mapping from environment states to actions, i.e., the strategy π(a | s), that maximizes the expectation of future rewards:
J(π) = E_π[ Σ_t γ^t · R_t ]
where R_t is the reward at time t and γ is the discount factor. The agent selects an action a in the current state s according to the strategy π, executes the action, observes the reward r and the next state s' fed back by the environment, adjusts and updates the strategy π based on the feedback, and iterates continuously until an optimal strategy is found, so that positive feedback is obtained to the greatest extent.
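The interaction loop described above can be illustrated with a short sketch. The env and policy objects and their interfaces are hypothetical stand-ins, not something defined in this application:

```python
def run_episode(env, policy, gamma=0.99, max_steps=200):
    """Roll out one episode and accumulate the discounted return sum_t gamma^t * r_t."""
    s = env.reset()
    episode_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)                    # select action a from the strategy pi(a|s)
        s_next, r, done = env.step(a)    # environment feeds back reward r and next state s'
        episode_return += discount * r
        discount *= gamma
        s = s_next                       # the strategy would be updated from such feedback
        if done:
            break
    return episode_return
```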
Deep reinforcement learning training involves two strategies: a behavior strategy and a target strategy. The behavior strategy is the strategy used to interact with the environment and generate data, i.e., to make decisions during exploration; the target strategy is continuously learned and optimized from the data generated by the behavior strategy, i.e., it is the strategy that is trained and ultimately applied in practice. Reinforcement learning can therefore be divided into two major classes of algorithms, on-policy and off-policy. In an on-policy algorithm the behavior strategy and the target strategy are the same strategy, which is simple and direct and lets the data be used to optimize the strategy immediately, but it easily falls into local optima because exploration and exploitation cannot be well balanced. An off-policy algorithm separates the target strategy from the behavior strategy and can better reach the global optimum while maintaining exploration, but its learning process is more tortuous and its convergence is slower.
In many real-world application scenarios, the real environment cannot be accessed directly because of practical factors such as high experimental cost and safety considerations, and only some offline experience data sets can be collected. In the prior art, an imperfect simulation environment is mainly built from such offline data and then treated as the real environment during training. However, the collected offline data is usually a fixed-scale offline experience data set: its volume cannot meet the requirements of a general reinforcement learning algorithm, and its distribution covers only part of the real sample space and may not cover the real optimal solution. The existing algorithms therefore have the following disadvantages:
1. An imperfect simulation environment built from offline data cannot fully reflect the real environment. In regions of the sample space where offline data is dense, the simulation environment's feedback closely matches the real environment's feedback, but in regions where historical data is sparse the simulation feedback deviates from reality. When a sparsely covered region of the sample space is explored, the simulation environment cannot give realistic feedback, so the strategy overfits the simulation environment and cannot converge to the real optimal solution.
2. Practical application scenarios, such as industrial control, generally require producing as much economic benefit as possible while ensuring safe and stable operation. Actual production environments often have clear and stringent safety restrictions, and the collected offline data cannot represent these specific restrictions. Moreover, the state and action spaces are usually high-dimensional and continuous, and the performance of reinforcement learning algorithms is very sensitive to the distribution of the experience data set when strategies are learned on such offline data sets. A reinforcement learning strategy trained in the simulation environment may therefore deviate greatly from the offline data.
In view of the above technical problems, the present application provides a control method, an apparatus, a storage medium and an electronic device. State detection parameters representing the physical state of the device to be controlled are acquired and input into a preset deep reinforcement learning model to determine the control instructions corresponding to those parameters. The preset deep reinforcement learning model is trained on a first data sample set formed from actual measurement values and on a second data sample set formed from simulation values determined according to a preset simulation model and the first data sample set. Because real data and simulation data are combined, the distribution of the learning strategy determined by the model is closer to the real strategy distribution, which improves the degree to which the control instructions determined by the model match the actual situation.
Fig. 1 is a diagram of an application scenario of a control method according to an example embodiment. As shown in fig. 1, the control method provided in this embodiment may be used to control a device to be controlled, where the device to be controlled may be an automobile, an unmanned aerial vehicle, an aircraft, a smart phone or any other electronic device with a control function. The device to be controlled may include an input component, a control component and an execution component. In this example the device to be controlled is an automobile. With continued reference to fig. 1, the input component may be a set of sensors, for example a speed sensor 101, a temperature sensor 102 and an impact sensor 103; the control component may be an onboard controller 200; and the execution component may be an actuator 300, for example a motor, an airbag ignition mechanism or an automatic steering system.
The vehicle acquires state parameters via the sensors, where the state parameters characterize a physical state of the vehicle, for example the current vehicle speed. After acquiring the vehicle speed, the controller 200 uses a preset deep reinforcement learning model to determine a control command, which may be an acceleration command or a deceleration command, so as to dynamically adjust the speed of the vehicle.
It should be noted that the preset deep reinforcement learning model may be obtained by training on a first data sample set and a second data sample set, where the first data sample set is a sample set formed from actual measurement values corresponding to the state detection parameters, for example the actually detected vehicle speed and the corresponding action, and the second data sample set is a sample set formed from simulation values determined according to the preset simulation model.
Fig. 2 is a flow chart diagram illustrating a control method according to an example embodiment. As shown in fig. 2, the control method provided in this embodiment includes:
step 101, obtaining a state detection parameter.
Specifically, the device to be controlled may measure the current state detection parameters through its input component, and one or more state detection parameters may be detected. The state detection parameters characterize the physical state of the device to be controlled, for example speed, temperature, altitude, humidity or level status.
Step 102, determining a control instruction according to the state detection parameters and a preset deep reinforcement learning model.
After the state detection parameters of the device to be controlled are obtained, the controller of the device to be controlled may process them to output a corresponding control decision. Specifically, the controller may make a control decision based on a single state detection parameter or a combined control decision based on several state detection parameters. After the decision is completed, a control instruction is output to the execution component of the device to be controlled so as to realize the corresponding control function.
In this step, the processing logic performed in the controller may be implemented based on a preset enhanced deep learning model. The preset reinforcement deep learning model is obtained by training according to a first data sample set and a second data sample set, the first data sample set is a sample set formed by actual measurement values corresponding to state detection parameters, and the second data sample set is a sample set formed by simulation values determined according to the preset simulation model. Therefore, the first data sample set collected from the real environment is fully utilized, and an effective control decision strategy is learned on the premise of ensuring safety and stability.
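As a rough sketch of how this processing logic could be applied at inference time (assuming a trained PyTorch policy network; the function and variable names here are illustrative and not part of this application):

```python
import torch

def decide_control_command(policy_model, state_detection_params):
    """Map state detection parameters to a control action with the trained model.

    `policy_model` is assumed to be a trained torch.nn.Module; mapping the
    returned action vector to a concrete control instruction is device-specific.
    """
    policy_model.eval()
    with torch.no_grad():
        state = torch.as_tensor(state_detection_params, dtype=torch.float32).unsqueeze(0)
        action = policy_model(state).squeeze(0)
    return action  # e.g. an acceleration or deceleration setpoint for the vehicle
```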
Step 103, controlling the device to be controlled according to the control instruction.
In this embodiment, state detection parameters characterizing the physical state of the device to be controlled are obtained and input into a preset deep reinforcement learning model to determine the control instruction corresponding to those parameters. The preset model is trained on a first data sample set formed from actual measurement values and on a second data sample set formed from simulation values determined according to a preset simulation model and the first data sample set, so that the distribution of the learning strategy determined by the model, obtained by combining real data and simulation data, is closer to the real strategy distribution, and the degree to which the control instruction determined by the model matches the actual situation is improved.
FIG. 3 is a schematic diagram illustrating a construction flow of a mixed sample pool according to an example embodiment of the present application. As shown in fig. 3, the control method provided in this embodiment includes:
step 201, a first data sample set is obtained.
Where off-line real data collected from the real-world environment may be used as the first set of data samples.
Step 202, determining a second data sample set according to the first data sample set and a preset simulation model.
The quantity of offline real data collected from the real environment is often limited, and using the offline data alone cannot meet the requirements of a general reinforcement learning algorithm. On the other hand, if only simulated data generated by an imperfect simulation environment is used, the trained reinforcement learning strategy may deviate considerably from the historical data, which is detrimental to the safety and stability required of control systems in the real world. A hybrid training method combining simulated data and real data may therefore be adopted: adding simulation data, namely the second data sample set, solves the problem of insufficient offline data, while adding real offline data alleviates the deviation introduced by the simulation environment.
The simulation model is trained using real offline data; the model input is the current state s and the executed action a, and the output is the next state s', the reward feedback r and the safety limit c. Depending on the practical application requirements, the simulation model has several options, including a deep neural network (DNN), a recurrent neural network (RNN), a convolutional neural network (CNN), or a composite neural network formed by combining any of the three. For an environment with strong temporal correlation, such as industrial control, an RNN that captures the temporal correlation can be selected; for board games with weak temporal correlation, a DNN can be selected; for a video game, where visual features must be extracted and the temporal correlation is strong, a combination of RNN and CNN can be selected.
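As a minimal sketch of the DNN option (the layer sizes and other hyperparameters are assumed here, not specified by this application), a simulation model taking (s, a) as input and outputting s', r and c could look like this:

```python
import torch
import torch.nn as nn

class SimulationModel(nn.Module):
    """DNN simulation model: input (s, a), output next state s', reward r and safety score c."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden_dim, state_dim)
        self.reward_head = nn.Linear(hidden_dim, 1)
        self.safety_head = nn.Linear(hidden_dim, 1)

    def forward(self, state, action):
        h = self.backbone(torch.cat([state, action], dim=-1))
        return self.next_state_head(h), self.reward_head(h), self.safety_head(h)
```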
Step 203, determining a mixed sample pool according to the first data sample set and the second data sample set.
In addition, since the state and action spaces are usually high-dimensional and continuous, the data distribution in such a high-dimensional space is usually very sparse, and simulated data generated by exploring an imperfect simulation environment may deviate from the real data distribution, creating safety hazards in practical applications. Fig. 4 is a flowchart illustrating a hybrid training method according to an exemplary embodiment of the present application. As shown in fig. 4, to address the problem of simulation data deviation, reliable simulation data may also be obtained by screening based on a joint distribution limiting strategy.
With continued reference to FIG. 4, reliable simulation data is generated using the joint distribution limiting strategy. First, a restricted exploration is performed: a sample is randomly selected from the offline data as the starting point of a trajectory (a time-series state-action sequence s_1, a_1, s_2, a_2, …, s_t, a_t). The strategy model in the reinforcement learning algorithm is used to obtain the action a of the next moment, which is then input into the simulation model to obtain the reward, the safety score and the state of the next moment, yielding a group of single-step transition data (s, a, r, c, s'). The joint distribution limiting strategy then screens whether this group of data is reliable: if not, the group of single-step transition data is discarded and the exploration terminates; if it is reliable, the group of single-step transition data is added to the mixed sample pool and the next exploration step is performed, until the maximum number of exploration steps is reached and the exploration terminates. A specific joint distribution limiting strategy can be realized in the following ways:
in one possible design, the joint conditional probability distribution model may be trained using real off-line data. The method may include determining a candidate data sample set according to the first data sample set and a preset simulation model, and screening samples meeting a preset joint distribution limiting condition from the candidate data sample set to determine a second data sample set. The data sample set to be selected can be determined according to the first data sample set and a preset joint condition probability distribution model, wherein the joint condition probability distribution model is a deep neural network model, and samples with joint condition probabilities larger than a preset probability threshold value are screened out from the data sample set to be selected.
Specifically, the joint conditional probability distribution may be fitted to the real offline data distribution. It has several options, including conventional continuous probability distribution models such as the normal distribution, the exponential distribution and the beta distribution. The model uses a deep neural network (DNN) to fit the joint conditional probability distribution directly from the real offline data distribution; the input of the model is the current state s and the executed action a, and the output is the joint conditional probability p(a | s). A reasonable probability threshold may be selected, and a sample is taken as reliable if its probability exceeds the threshold.
In another possible design, the candidate data sample set may be determined according to the first data sample set and a preset adversarial network, and samples whose discrimination value is greater than a preset discrimination threshold are then screened out of the candidate data sample set, the discrimination value being determined by a discrimination network. Specifically, a generative adversarial network (GAN) may be trained using the real offline data. A reasonable threshold is selected, (s, a) is evaluated with the discrimination network, and samples whose discrimination value exceeds the threshold are taken as reliable data.
Finally, real data is extracted in a certain proportion and added to the mixed sample pool, and a certain number of samples are drawn from the pool to train the reinforcement learning algorithm. Training stops once the number of training steps reaches the preset maximum number of training steps.
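Under assumptions about the interfaces involved, the restricted exploration and mixing procedure described above can be sketched as follows; policy, sim_model and is_reliable are hypothetical stand-ins (the last representing either joint distribution limiting option), and none of these names come from this application:

```python
import random

def build_mixed_sample_pool(offline_data, policy, sim_model, is_reliable,
                            num_rollouts=1000, max_steps=20, real_ratio=0.3):
    """Construct the mixed sample pool from screened simulated transitions plus real data.

    Each entry of `offline_data` is assumed to be a transition tuple (s, a, r, c, s').
    `is_reliable(s, a)` stands for the joint distribution limiting check, i.e. a joint
    conditional probability or discriminator score compared against a preset threshold.
    """
    pool = []
    for _ in range(num_rollouts):
        s = random.choice(offline_data)[0]          # start from a state in the offline data
        for _ in range(max_steps):
            a = policy(s)                           # next action from the learning strategy
            s_next, r, c = sim_model(s, a)          # simulated next state, reward, safety score
            if not is_reliable(s, a):               # unreliable sample: discard and stop exploring
                break
            pool.append((s, a, r, c, s_next))
            s = s_next
    num_real = min(int(real_ratio * len(pool)), len(offline_data))
    pool.extend(random.sample(offline_data, num_real))  # mix in real offline transitions
    return pool
```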
In the embodiment, the problem of learning strategy deviation is effectively alleviated by screening simulation data generated in an imperfect simulation environment based on a hybrid training method of a joint distribution constraint strategy.
FIG. 5 is a flow diagram illustrating a training method of the reinforcement deep learning model according to an example embodiment of the present application. As shown in fig. 5, the control method provided in this embodiment includes:
step 301, a first set of data samples is obtained.
Step 302, determining a second data sample set according to the first data sample set and a preset simulation model.
Step 303, determining a mixed sample pool according to the first data sample set and the second data sample set.
It should be noted that, for the detailed description of steps 301 to 303 in this embodiment, reference may be made to the description of steps 201 to 203 in the embodiment shown in fig. 3, and details are not repeated here.
Step 304, training a preset reference strategy model by using the first data sample set.
In an actual application environment, certain safety limiting conditions often have to be satisfied, and the simulation model cannot accurately evaluate the safety risk of a strategy. Fig. 6 is a schematic flow chart of an off-policy reinforcement learning method with security constraints and strategy value regularization according to an example embodiment of the present application. As shown in fig. 6, in this embodiment a safety value model may be introduced to evaluate the safety risk of the current strategy, so that the strategy is optimized while the safety requirements are met. Meanwhile, strategy value regularization is introduced to further correct the strategy distribution deviation and value evaluation deviation caused by the distribution deviation of the simulation data.
Specifically, a certain number of samples may be drawn from the real data to train the benchmark strategy model. The benchmark strategy model is trained only on real data and is not an optimal strategy, but it reflects the distribution of the real data. A distribution constraint is computed from the distribution difference between the benchmark strategy and the learning strategy, and the deviation caused by the simulation data can be corrected by using it to regularize the learning strategy model and the value models.
For the benchmark strategy model, a variational auto-encoder (VAE) may be used. The VAE consists of two networks: an encoder, which maps a high-dimensional input to a low-dimensional latent variable, and a decoder, which maps the low-dimensional latent variable back to the high-dimensional input. The method comprises a training stage and a sampling stage. In the training stage the model input is the current state s and the executed action a, the output is the reconstructed action a, and the encoder and decoder parts are trained jointly; in the sampling stage only the decoder part is used, to sample a certain number of samples in batch.
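A minimal PyTorch sketch of such a VAE benchmark strategy model (the latent and hidden dimensions are assumed, not specified by this application) could look like this:

```python
import torch
import torch.nn as nn

class BenchmarkPolicyVAE(nn.Module):
    """VAE benchmark strategy: encode (s, a) to a latent z, decode (s, z) back to an action."""

    def __init__(self, state_dim, action_dim, latent_dim=32, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),          # mean and log-variance of z
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, state, action):
        mean, log_var = self.encoder(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * log_var).exp()   # reparameterization trick
        recon_action = self.decoder(torch.cat([state, z], dim=-1))
        return recon_action, mean, log_var

    def sample(self, state):
        # Sampling stage: only the decoder is used, with z drawn from the prior.
        z = torch.randn(state.shape[0], self.latent_dim, device=state.device)
        return self.decoder(torch.cat([state, z], dim=-1))
```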
Step 305, determining the strategy distribution difference between the trained reference strategy model and a preset learning strategy model.
The strategy distribution difference between the trained reference strategy model and the preset learning strategy model may be determined according to a maximum mean discrepancy (MMD) algorithm, or according to a preset divergence algorithm.
Specifically, for the preset maximum mean discrepancy algorithm, samples are drawn from the strategies π_θ and π_b in order to compute an estimate of the MMD distance between π_θ and π_b:
MMD²(π_θ, π_b) = E_{x,x'∼π_θ}[K(x, x')] − 2·E_{x∼π_θ, y∼π_b}[K(x, y)] + E_{y,y'∼π_b}[K(y, y')]
where K is a kernel function, which may be a Gaussian kernel or a Laplace kernel; π_θ is the learning strategy model, π_b is the reference strategy model, E is the mathematical expectation, x is a sampled value of the learning strategy model at state s, x' is a sampled value of the learning strategy model at state s different from x, y is a sampled value of the reference strategy model at state s, and y' is a sampled value of the reference strategy model at state s different from y.
For the preset divergence algorithm, the KL divergence may be used. Directly estimating the KL divergence between π_θ and π_b requires the explicit probability densities of π_θ and π_b, so a dual form of the KL divergence may be used here.
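For illustration, a batch estimate of the squared MMD distance with a Gaussian kernel (the kernel bandwidth is an assumed hyperparameter) could be computed as follows:

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for all pairs of rows of x and y."""
    diff = x.unsqueeze(1) - y.unsqueeze(0)            # shape (n, m, d)
    return torch.exp(-diff.pow(2).sum(-1) / (2.0 * sigma ** 2))

def mmd_squared(samples_theta, samples_b, sigma=1.0):
    """Estimate MMD^2 between learning-strategy samples and reference-strategy samples."""
    k_xx = gaussian_kernel(samples_theta, samples_theta, sigma).mean()
    k_yy = gaussian_kernel(samples_b, samples_b, sigma).mean()
    k_xy = gaussian_kernel(samples_theta, samples_b, sigma).mean()
    return k_xx - 2.0 * k_xy + k_yy
```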
Step 306, training a preset learning strategy model, a preset reward value model and a preset safety value model by using the mixed sample pool.
In this step, a preset learning strategy model, a preset reward value model and a preset safety value model may be trained by using the mixed sample pool, wherein the strategy distribution difference is used as a regularization term of the preset learning strategy model, the preset reward value model and the preset safety value model. And in the training process of the preset learning strategy model, the preset reward value model and the preset safety value model, a preset quadratic term constraint form is used for determining a training target.
A certain number of samples are drawn from the mixed sample pool to train the learning strategy model, the reward value model and the safety value model. The training uses the classical actor-critic method: an action value function is learned by minimizing the Bellman error on single-step transition data (s, a, r, s'), and the strategy is then updated by maximizing the action value function. A quadratic constraint form is used to increase the stability of model training. The learning objective of the strategy maximizes the action value function while constraining, in quadratic form, the distribution difference between the reference strategy and the learning strategy, where a' is the action at the current moment, a'' is the action at the next moment, and s' is the state at the next moment. The objective of the reward value function minimizes the Bellman error under the same regularization, and the objective of the safety value function takes the same form. Here π_b is the reference strategy, π_θ is the learning strategy, and Q_ψ is the reward value function or safety value function to be learned; the objectives also involve an experience pool, a target value function, and the distribution difference between the reference strategy and the learning strategy, with λ and ρ as adjustable parameters. The models are trained by alternating iterative updates of these objective functions.
Step 307, when the number of training steps meets the preset condition, taking the trained learning strategy model as the preset deep reinforcement learning model.
The training may be stopped when the number of training steps reaches the maximum number of training steps, and the trained learning strategy model is then used as the preset deep reinforcement learning model.
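Since the exact objective formulas are given only as figures in the original publication, the following sketch is an assumed reading of one alternating actor-critic update on the mixed sample pool: the critic (reward value model) minimizes the Bellman error, and the actor (learning strategy model) maximizes the action value regularized by the strategy distribution difference, here the mmd_squared sketch above. All names and shapes, and the batch-wise treatment of the regularizer, are illustrative simplifications; the quadratic constraint term is not reproduced, and the safety value model would be updated analogously to the critic.

```python
import random
import torch
import torch.nn.functional as F

def train_step(pool, actor, critic, target_critic, benchmark_policy,
               actor_opt, critic_opt, batch_size=256, gamma=0.99, lam=1.0):
    """One alternating update of the reward value model (critic) and the learning strategy (actor).

    Each pool entry is assumed to hold torch tensors (s, a, r, c, s'), with r and c of shape (1,).
    """
    batch = random.sample(pool, batch_size)
    s, a, r, c, s_next = map(torch.stack, zip(*batch))  # c would feed the safety value model

    # Critic: minimize the Bellman error on single-step transition data (s, a, r, s').
    with torch.no_grad():
        target_q = r + gamma * target_critic(s_next, actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize the action value, regularized by the strategy distribution difference.
    a_pi = actor(s)
    dist_diff = mmd_squared(a_pi, benchmark_policy.sample(s).detach())
    actor_loss = -critic(s, a_pi).mean() + lam * dist_diff
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```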
In this embodiment, a safe and effective strategy is learned through joint distribution restrictive exploration, safety constraints, strategy value regularization and related methods. First, a simulation model is trained on real data, and joint distribution restrictive exploration of the simulation environment is performed to generate reliable simulation data. Then, a strategy distribution network is trained on real data to obtain the real strategy distribution, which is used for strategy value regularization of the reinforcement learning algorithm. Finally, the simulation data and the real data are mixed to generate a mixed sample pool for strategy learning by the reinforcement learning algorithm, the strategy regularization produced in the second step is introduced, and the errors of the value evaluation are corrected while the distribution of the learning strategy is constrained, so that the distribution of the learning strategy is closer to the real strategy distribution.
Specifically, the safety limiting method based on the safety value model solves, by introducing the safety value model, the problem that the simulation environment cannot accurately evaluate potential safety hazards, so that the safety and stability requirements of the actual application environment are met. In addition, the strategy value regularization based on the strategy distribution difference introduces the distribution difference between the learning strategy and the reference strategy as a regularization term of the value models and the learning strategy model, so that the strategy distribution deviation and value estimation deviation brought by the simulation data can be removed, while the quadratic constraint form further improves the stability of training.
Fig. 7 is a schematic structural diagram of a control device shown in the present application according to an example embodiment. As shown in fig. 7, the control device 400 provided in this embodiment includes:
a parameter obtaining module 401, configured to obtain a state detection parameter, where the state detection parameter is used to represent a physical state of a device to be controlled;
a control processing module 402, configured to determine a control instruction according to the state detection parameter and a preset reinforcement deep learning model, where the control instruction is used to control the device to be controlled, the preset reinforcement deep learning model is obtained by training according to a first data sample set and a second data sample set, the first data sample set is a sample set formed by actual measurement values corresponding to the state detection parameter, and the second data sample set is a sample set formed by simulation values determined according to a preset simulation model.
On the basis of the embodiment shown in fig. 7, fig. 8 is a schematic structural diagram of a control device shown in the present application according to another exemplary embodiment. As shown in fig. 8, the control device 400 further includes:
a sample obtaining module 403, obtaining the first data sample set;
a sample simulation module 404, configured to determine the second data sample set according to the first data sample set and the preset simulation model;
a sample generating module 405, configured to determine a mixed sample pool according to the first data sample set and the second data sample set, where the mixed sample pool is used to train the preset reinforcement deep learning model.
In one possible design, the sample simulation module 404 is specifically configured to:
determining a data sample set to be selected according to the first data sample set and the preset simulation model;
and screening samples meeting preset joint distribution limiting conditions from the to-be-selected data sample set to determine the second data sample set.
In one possible design, the sample simulation module 404 is specifically configured to:
determining the data sample set to be selected according to the first data sample set and a preset joint condition probability distribution model, wherein the joint condition probability distribution model is a deep neural network model;
and screening out samples with joint conditional probability greater than a preset probability threshold from the to-be-selected data sample set.
In one possible design, the sample simulation module 404 is specifically configured to:
determining the data sample set to be selected according to the first data sample set and a preset countermeasure network;
and screening out samples with discrimination values larger than a preset discrimination threshold value from the to-be-selected data sample set, wherein the discrimination values are determined according to a discrimination network.
In one possible design, the control device 400 further includes:
a model training module 406, configured to train a preset reference strategy model by using the first data sample set;
a difference determining module 407, configured to determine a policy distribution difference between the trained reference policy model and a preset learning policy model;
the model training module 406 is further configured to train the preset learning strategy model, the preset reward value model, and the preset safety value model by using the mixed sample pool, where the strategy distribution difference is used as a regularization term of the preset learning strategy model, the preset reward value model, and the preset safety value model, and when the training step number meets a preset condition, the trained learning strategy model is used as the preset enhanced deep learning model.
In one possible design, the difference determining module 407 is specifically configured to:
determining the strategy distribution difference between the trained reference strategy model and the preset learning strategy model according to a preset maximum mean difference algorithm; or,
and determining the strategy distribution difference between the trained reference strategy model and the preset learning strategy model according to a preset divergence algorithm.
In one possible design, the predetermined reference strategy model is a variational encoder.
In one possible design, a preset quadratic constraint form is used to determine a training target during the training process of the preset learning strategy model, the preset reward value model and the preset safety value model.
In one possible design, the preset simulation model is any one or a combination of a deep neural network model, a recurrent neural network model and a convolutional neural network model.
The present embodiment provides a control device that may be used to perform the above-described method embodiments. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 9 is a schematic structural diagram of an electronic device shown in the present application according to an example embodiment. As shown in fig. 9, the present embodiment provides an electronic device 500, including:
a processor 501; and,
a memory 502 for storing executable instructions of the processor, which may also be a flash memory;
wherein the processor 501 is configured to perform the steps of the above-described method via execution of the executable instructions. Reference may be made in detail to the foregoing description of the function of various components of the streaming data processing system.
Alternatively, the memory 502 may be separate or integrated with the processor 501.
When the memory 502 is a device independent from the processor 501, the electronic device 500 may further include:
a bus 503 for connecting the processor 501 and the memory 502.
This embodiment also provides a readable storage medium in which a computer program is stored; when at least one processor of the electronic device executes the computer program, the electronic device performs the control method provided in the above-described embodiments.
This embodiment also provides a program product comprising a computer program stored in a readable storage medium. At least one processor of the electronic device can read the computer program from the readable storage medium, and execution of the computer program by the at least one processor causes the electronic device to implement the control method provided by the embodiments described above.
Those of ordinary skill in the art will understand that all or part of the steps of the above-described method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above-described method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A control method, comprising:
acquiring state detection parameters, wherein the state detection parameters are used for representing the physical state of the equipment to be controlled;
and determining a control instruction according to the state detection parameters and a preset deep reinforcement learning model, wherein the control instruction is used for controlling the equipment to be controlled, the preset deep reinforcement learning model is obtained by training according to a first data sample set and a second data sample set, the first data sample set is a sample set formed by actual measurement values corresponding to the state detection parameters, and the second data sample set is a sample set formed by simulation values determined according to a preset simulation model.
2. The control method according to claim 1, further comprising:
acquiring the first data sample set;
determining the second data sample set according to the first data sample set and the preset simulation model;
and determining a mixed sample pool according to the first data sample set and the second data sample set, wherein the mixed sample pool is used for training the preset deep reinforcement learning model.
3. The control method according to claim 2, wherein the determining the second data sample set according to the first data sample set and the preset simulation model comprises:
determining a data sample set to be selected according to the first data sample set and the preset simulation model;
and screening samples meeting preset joint distribution limiting conditions from the to-be-selected data sample set to determine the second data sample set.
4. The control method according to claim 3, wherein the determining a data sample set to be selected according to the first data sample set and the preset simulation model comprises:
determining the data sample set to be selected according to the first data sample set and a preset joint condition probability distribution model, wherein the joint condition probability distribution model is a deep neural network model;
correspondingly, the step of screening the samples meeting the preset joint distribution limiting conditions from the to-be-selected data sample set includes:
and screening out samples with joint conditional probability greater than a preset probability threshold from the to-be-selected data sample set.
5. The control method according to claim 3, wherein the determining a data sample set to be selected according to the first data sample set and the preset simulation model comprises:
determining the data sample set to be selected according to the first data sample set and a preset countermeasure network;
correspondingly, the step of screening the samples meeting the preset joint distribution limiting conditions from the to-be-selected data sample set includes:
and screening out samples with discrimination values larger than a preset discrimination threshold value from the to-be-selected data sample set, wherein the discrimination values are determined according to a discrimination network.
6. The control method according to any one of claims 2 to 5, further comprising, after the determining a mixed sample pool according to the first data sample set and the second data sample set:
training a preset reference strategy model by using the first data sample set;
determining the strategy distribution difference between the trained reference strategy model and a preset learning strategy model;
training the preset learning strategy model, the preset reward value model and the preset safety value model by using the mixed sample pool, wherein the strategy distribution difference is used as a regularization item of the preset learning strategy model, the preset reward value model and the preset safety value model;
and when the number of training steps meets a preset condition, taking the trained learning strategy model as the preset deep reinforcement learning model.
7. The control method according to claim 6, wherein the determining a strategy distribution difference between the trained reference strategy model and a preset learning strategy model comprises:
determining the strategy distribution difference between the trained reference strategy model and the preset learning strategy model according to a preset maximum mean difference algorithm; or,
and determining the strategy distribution difference between the trained reference strategy model and the preset learning strategy model according to a preset divergence algorithm.
8. The control method according to claim 6, wherein the preset reference strategy model is a variational auto-encoder.
9. The control method according to claim 6, wherein a preset quadratic constraint form is used to determine the training target during the training of the preset learning strategy model, the preset reward value model and the preset safety value model.
10. The control method according to any one of claims 2 to 5, wherein the preset simulation model is any one or a combination of a deep neural network model, a recurrent neural network model and a convolutional neural network model.
11. A control device, comprising:
the parameter acquisition module is used for acquiring state detection parameters, and the state detection parameters are used for representing the physical state of the equipment to be controlled;
and the control processing module is used for determining a control instruction according to the state detection parameters and a preset deep reinforcement learning model, wherein the control instruction is used for controlling the equipment to be controlled, the preset deep reinforcement learning model is obtained by training according to a first data sample set and a second data sample set, the first data sample set is a sample set formed by actual measurement values corresponding to the state detection parameters, and the second data sample set is a sample set formed by simulation values determined according to a preset simulation model.
12. An electronic device, comprising:
a processor; and
a memory for storing a computer program for the processor;
wherein the processor is configured to implement the control method of any one of claims 1 to 10 by executing the computer program.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the control method of any one of claims 1 to 10.
CN202010478055.9A 2020-05-29 2020-05-29 Control method, control device, storage medium and electronic equipment Pending CN113743603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010478055.9A CN113743603A (en) 2020-05-29 2020-05-29 Control method, control device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010478055.9A CN113743603A (en) 2020-05-29 2020-05-29 Control method, control device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113743603A true CN113743603A (en) 2021-12-03

Family

ID=78724994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010478055.9A Pending CN113743603A (en) 2020-05-29 2020-05-29 Control method, control device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113743603A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114147718A (en) * 2021-12-09 2022-03-08 申江万国数据信息股份有限公司 Multitask execution control method, multitask execution control device, multitask execution control equipment and multitask execution control medium
CN114611897A (en) * 2022-02-28 2022-06-10 华南理工大学 Intelligent production line self-adaptive dynamic scheduling strategy selection method
WO2023225941A1 (en) * 2022-05-26 2023-11-30 Robert Bosch Gmbh A computer-implemented method and an apparatus for reinforcement learning

Similar Documents

Publication Publication Date Title
CN110569443B (en) Self-adaptive learning path planning system based on reinforcement learning
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
Huang et al. Efficient deep reinforcement learning with imitative expert priors for autonomous driving
Chen et al. Model-free deep reinforcement learning for urban autonomous driving
Yang et al. Hierarchical deep reinforcement learning for continuous action control
Hausknecht Cooperation and communication in multiagent deep reinforcement learning
Badgwell et al. Reinforcement learning–overview of recent progress and implications for process control
CN109348707A (en) For the method and apparatus of the Q study trimming experience memory based on deep neural network
CN113743603A (en) Control method, control device, storage medium and electronic equipment
CN112937564A (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
Berkenkamp Safe exploration in reinforcement learning: Theory and applications in robotics
Xu et al. Hierarchical approximate policy iteration with binary-tree state space decomposition
Levine Exploring deep and recurrent architectures for optimal control
Andersen et al. Towards safe reinforcement-learning in industrial grid-warehousing
Yuan et al. Prioritized experience replay-based deep q learning: Multiple-reward architecture for highway driving decision making
Shao et al. Visual navigation with actor-critic deep reinforcement learning
Wang et al. Reinforcement learning in few-shot scenarios: A survey
Ge et al. Deep reinforcement learning navigation via decision transformer in autonomous driving
CN115009291A (en) Automatic driving aid decision-making method and system based on network evolution replay buffer area
Youssef et al. Optimal Combination of Imitation and Reinforcement Learning for Self-driving Cars.
Toubeh et al. Risk-aware planning by confidence estimation using deep learning-based perception
Sun et al. Unmanned aerial vehicles control study using deep deterministic policy gradient
Goecks et al. Cyber-human approach for learning human intention and shape robotic behavior based on task demonstration
Steckelmacher et al. An empirical comparison of neural architectures for reinforcement learning in partially observable environments
Wang et al. Knowledge transfer enabled reinforcement learning for efficient and safe autonomous ship collision avoidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination