CN115453880A - Training method of a generative model for state prediction based on an adversarial neural network - Google Patents

Training method of a generative model for state prediction based on an adversarial neural network

Info

Publication number
CN115453880A
CN115453880A (application CN202211156355.0A)
Authority
CN
China
Prior art keywords
sequence
target system
control action
state
target
Prior art date
Legal status
Pending
Application number
CN202211156355.0A
Other languages
Chinese (zh)
Inventor
杨安东
李玮
胡瑜
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202211156355.0A
Publication of CN115453880A
Legal status: Pending

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B 13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B 13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology


Abstract

The invention provides a training method for a generative model for state prediction based on an adversarial neural network, wherein the adversarial neural network comprises the generative model and a discriminator, both of which comprise multi-layer fully connected networks. The method comprises the following steps: S1, sampling a plurality of control action sequences of the prediction time domain length from all possible control actions under the constraint conditions of a target system, sending each control action sequence into the real environment of the target system for execution to obtain the target state sequence of the target system corresponding to each control action sequence, and sending each control action sequence into the generative model to obtain the future state prediction sequence of the target system corresponding to each control action sequence; S2, forming a sample from the current state of the target system, each control action sequence and the corresponding target state sequence of the target system obtained after the control action sequence is executed, thereby generating a training set; and S3, training the generative adversarial network multiple times with the training set until convergence.

Description

Training method of a generative model for state prediction based on an adversarial neural network
Technical Field
The invention relates to the field of automatic control, in particular to the field of continuous trajectory control based on MPC in automatic control, and more particularly to a training method of a generative model for state prediction based on an adversarial neural network, and an MPC control method and device based on the training method.
Background
With the development of technology, automatic control has been applied to many aspects of production and daily life and has greatly facilitated people's lives; it is used, for example, in robotic arms, unmanned aerial vehicles and automatic driving. Among automatic control methods, Model Predictive Control (MPC), a multivariable control method, is widely used because it can handle various constraints on state variables and can be applied in more complex nonlinear environments. To achieve continuous trajectory control, MPC requires a multi-step prediction process. MPC is a process control method: according to the current state of the automatically controlled target system and the control action sequence planned to be executed in the future, it predicts, with a system model, the state sequence of the target system over a future period of time (i.e., performs multi-step prediction), and selects a control action for the target system to execute according to the degree of difference between the predicted state sequence and the target state sequence, so that the target system reaches the target state by executing the control action. Therefore, a system model based on an efficient multi-step prediction framework with accurate prediction is very important for the MPC implementation.
In the prior art, due to limitations of computation cost and model capability, most MPC methods complete the multi-step prediction process with an iterative framework. Under the iterative framework, the system model only predicts the system state one time step ahead at each call, and the future states at subsequent moments are then predicted sequentially in an iterative manner until the required prediction time domain length is reached. As shown in Fig. 1, let s_t denote the state of the target system at time t and let the prediction time domain length be N (N denotes the prediction time domain length throughout this text). The system model under the iterative framework predicts the state s_{t+1}' of the target system at time t+1 from the state s_t at time t and the future control action (denoted by u) planned to be executed at time t; it then predicts the state s_{t+2}' of the target system at time t+2 from the predicted state s_{t+1}' and the future control action planned to be executed at time t+1, and so on, until the state s_{t+N}' of the target system at time t+N is obtained; concatenating all predicted states yields the complete future state prediction sequence of the target system. In the iterative process, a single model performs single-step prediction, which reduces the requirements on the system model, but the iterative process in a complex environment brings high computational complexity, and the iterative framework additionally suffers from the problem of accumulated errors. For example, assume the deviation of the system model in the iterative framework is δ(s_t, u_t); at time t+1, the predicted output of the system model is s_{t+1}' = f(s_t, u_t) + δ(s_t, u_t). It can be seen that a small error introduced by the system model in the iterative framework is amplified continuously during prediction, accumulates through the iterations, and grows severe as the prediction time domain length N becomes larger, resulting in a huge accumulated error. Under the iterative framework, the system model input contains no context information, so the amount of system information available to a system model built under this framework is limited, which further limits the capability of the system model. Meanwhile, the time complexity of this process is O(N), where N is the prediction time domain (prediction horizon) length; since MPC is generally used in environments such as robots, a very fast control-action solving speed is required, and this time complexity dictates that the system model cannot be too complex, which further limits the performance of the system model.
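For ease of understanding only, a minimal Python sketch of this iterative framework is given below; the single-step model interface predict_one_step and the data shapes are illustrative assumptions and do not form part of the described prior art.

```python
import numpy as np

def iterative_multi_step_prediction(predict_one_step, s_t, planned_actions):
    """Iterative framework: the single-step system model is called N times in sequence.

    predict_one_step(s, u) is an assumed single-step model interface returning the
    next-state estimate; planned_actions holds the N future control actions
    [u_t, ..., u_{t+N-1}].  The loop makes O(N) model calls, and any per-step
    deviation delta(s, u) is fed back into the next call, so errors accumulate
    as the prediction time domain length N grows.
    """
    predictions = []
    s = np.asarray(s_t, dtype=float)
    for u in planned_actions:          # one model call per future moment
        s = predict_one_step(s, u)     # s'_{t+k+1} = f(s'_{t+k}, u_{t+k}) + delta(...)
        predictions.append(s)
    return np.stack(predictions)       # [s'_{t+1}, ..., s'_{t+N}]
```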
To alleviate the accumulated-error problem of system models under the iterative framework, researchers have proposed a multi-model synchronous prediction framework, which uses a plurality of system models; each model predicts the state at one fixed future moment from the current state and the planned future control actions, and the prediction results of all models are concatenated to obtain the complete sequence of future states. As shown in Fig. 2, with s_t denoting the state of the target system at time t and the prediction time domain length being N, system model 1 under the multi-model synchronous prediction framework predicts the state s_{t+1}' of the target system at time t+1 from the state s_t at time t and the planned future control action at time t; simultaneously, system model 2 predicts the state s_{t+2}' of the target system at time t+2 from the state s_t and the future control actions planned to be executed at times t and t+1; and so on, until system model N predicts the state s_{t+N}' of the target system at time t+N from the state s_t and all future control actions planned to be executed from time t to time t+N-1. Concatenating all predicted states yields the complete future state prediction sequence of the target system. Under the multi-model synchronous prediction framework, each system model is different, which greatly alleviates the accumulated-error problem of the single model under the iterative framework and speeds up the multi-step prediction process; however, designing multiple models is difficult, and the computation required during prediction multiplies. This is because the model corresponding to each moment needs to be designed separately, and as the prediction time domain length increases, the later models require all control actions planned at the preceding prediction moments and therefore correspond to a larger amount of computation.
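For illustration only, a minimal sketch of the multi-model synchronous prediction framework follows; the per-moment model interface models[k](s_t, actions) is an illustrative assumption.

```python
import numpy as np

def multi_model_synchronous_prediction(models, s_t, planned_actions):
    """Multi-model synchronous framework: model k predicts s'_{t+k+1} directly.

    models is an assumed list of N separately designed models, where
    models[k](s_t, actions) maps the current state and the first k+1 planned
    actions to the state k+1 steps ahead.  All N predictions are made without
    iteration, but the k-th model consumes k+1 actions, so the later models
    require more computation and every model must be designed separately.
    """
    s_t = np.asarray(s_t, dtype=float)
    return np.stack([
        models[k](s_t, planned_actions[:k + 1])   # state at time t+k+1
        for k in range(len(models))
    ])                                            # [s'_{t+1}, ..., s'_{t+N}]
```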
In summary, existing system models for multi-step prediction suffer from high computation cost and inaccurate prediction; how to design a system model based on an efficient multi-step prediction framework with accurate prediction therefore remains a problem to be solved in MPC.
Disclosure of Invention
Therefore, an object of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a new system model modeling method and an MPC control method and apparatus based thereon.
According to a first aspect of the present invention, there is provided a training method for a generative model for state prediction based on an adversarial neural network, the adversarial neural network comprising the generative model and a discriminator, the generative model and the discriminator each comprising a multi-layer fully connected network; the generative model is used for selecting a control action sequence to be executed by the target system at a plurality of future moments according to the current state of the target system and a target state sequence of the target system. The method comprises the following steps: S1, sampling a plurality of control action sequences of the prediction time domain length from all possible control actions under the constraint conditions of the target system, sending each control action sequence into the real environment of the target system for execution to obtain the target state sequence of the target system corresponding to each control action sequence, and sending each control action sequence into the generative model to obtain the future state prediction sequence of the target system corresponding to each control action sequence; S2, forming a sample from the current state of the target system, each control action sequence and the corresponding target state sequence of the target system obtained after the control action sequence is executed, thereby generating a training set; and S3, training the generative adversarial network multiple times with the training set until convergence.
Preferably, the generative model and the discriminator each comprise a 4-layer fully connected network.
Preferably, in step S3, the discriminator loss and the L2 loss are used to train the generative model as follows:

L_G = (1/m) · Σ_{i=1}^{m} [ α · log(1 − D_ω(s_t, U_i, S_i′)) + β · ||S_i − S_i′||² ]

where α is the discriminator loss weight, β is the L2 loss weight, α + β = 1, m is the batch size of the sampled control action sequences in each training round, s_t is the current state of the target system at the time of the current training step, U_i is the i-th control action sequence, S_i′ denotes the future state prediction sequence of the target system after executing the i-th control action sequence from the current state, S_i denotes the corresponding target state sequence, and D_ω(s_t, U_i, S_i′) denotes the probability that the state S_i′ predicted by the generative model from the current state s_t and the action sequence U_i comes from the real environment.
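For illustration only, a sketch of how this weighted loss could be computed with PyTorch is given below; the discriminator interface, tensor shapes and the use of log(1 − D) as the discriminator-loss term for a probability-valued discriminator are assumptions for illustration.

```python
import torch

def generator_loss(discriminator, s_t, U, S_target, S_pred, alpha=0.1, beta=0.9):
    """Weighted sum of the discriminator loss and the L2 loss for the generative model.

    s_t: (m, state_dim) current states; U: (m, N, action_dim) control action
    sequences; S_target / S_pred: (m, N, state_dim) real and predicted future
    state sequences.  alpha + beta = 1.  discriminator(s_t, U, S) is assumed to
    return the probability that S comes from the real environment.
    """
    d_fake = discriminator(s_t, U, S_pred)                       # D_w(s_t, U_i, S_i')
    adv_term = torch.log(1.0 - d_fake + 1e-8).mean()             # discriminator loss term
    l2_term = ((S_target - S_pred) ** 2).sum(dim=(1, 2)).mean()  # supervised L2 term
    return alpha * adv_term + beta * l2_term
```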
According to a second aspect of the present invention, there is provided an MPC control method applied in an automatically controlled target system for selecting a control action sequence to be executed by the target system at a plurality of future moments according to the current state of the target system and a target state sequence of the target system, the method comprising: T1, sampling a preset number of control action sequences of the prediction time domain length corresponding to the target system, wherein the control action at each moment in each control action sequence is sampled from all feasible control actions under the constraint conditions of the target system; T2, using a generative model trained by the method of the first aspect of the invention to predict, from each sampled control action sequence and the current state of the target system, the future state sequence of the target system after executing that control action sequence, to obtain a preset number of future state prediction sequences of the target system of the prediction time domain length; and T3, selecting, from all sampled control action sequences corresponding to the target system, the control action sequence corresponding to the future state prediction sequence of the target system that is closest to the target state sequence of the target system.
Preferably, in the step T1, a CEM strategy is adopted for sampling.
Preferably, in step T3, the control action sequence corresponding to the future state prediction sequence of the target system closest to the target state sequence of the target system is selected from all sampled control action sequences, and the first two control actions of that control action sequence are selected for execution.
Preferably, step T3 comprises: T31, calculating the loss between the future state prediction sequence of the target system corresponding to each sampled control action sequence and the target state sequence of the target system; and T32, selecting the control action sequence for which the loss between the corresponding future state prediction sequence and the target state sequence of the target system is minimal.
In some embodiments of the present invention, in step T31, the loss between the future state prediction sequence and the target state sequence corresponding to each control action sequence is calculated by:

L_i = (S_tar − S_i′)^T Q (S_tar − S_i′) + U_i^T R U_i

where L_i denotes the loss between the future state prediction sequence of the target system corresponding to the i-th control action sequence and the target state sequence of the target system, S_tar denotes the target state sequence of the target system, S_i′ denotes the future state prediction sequence of the target system after executing the i-th control action sequence, U_i denotes the i-th control action sequence, Q denotes the weight of the state L2 loss, R denotes the weight of the control action penalty term, and (·)^T denotes vector transposition.
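As an illustration of this loss, a short numpy sketch follows; the flattened sequence shapes and weight-matrix shapes are assumptions for illustration.

```python
import numpy as np

def mpc_cost(S_tar, S_pred, U, Q, R):
    """L_i = (S_tar - S_i')^T Q (S_tar - S_i') + U_i^T R U_i.

    S_tar and S_pred are the flattened target and predicted state sequences,
    U is the flattened control action sequence, and Q, R are weight matrices
    chosen according to the specific task.
    """
    e = np.asarray(S_tar, dtype=float) - np.asarray(S_pred, dtype=float)
    U = np.asarray(U, dtype=float)
    return float(e @ Q @ e + U @ R @ U)
```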
Preferably, for each control action sequence, the corresponding future state prediction sequence of the target system of the prediction time domain length is the vector sum of a vector composed of a preset number of copies of the current state of the target system and the difference sequence output by the pre-trained system model.
According to a third aspect of the present invention, there is provided an MPC control apparatus configured in an automatically controlled target system for selecting a control action sequence to be executed by the target system at a plurality of future moments according to the current state of the target system and a target state sequence of the target system, the apparatus comprising: a sampler for sampling a plurality of control action sequences of the prediction time domain length corresponding to the target system, wherein the control action at each moment in each control action sequence is sampled from all feasible control actions under the constraint conditions of the target system; a generative model constructed by the method of the first aspect of the invention, for predicting, from each control action sequence sampled by the sampler and the current state of the target system, the future state sequence of the target system after executing that control action sequence, to obtain a preset number of future state prediction sequences of the target system of the prediction time domain length; and an optimizer for selecting, from all sampled control action sequences, the control action sequence corresponding to the future state prediction sequence of the target system closest to the target state sequence of the target system.
Compared with the prior art, the invention has the following advantages. Under the single-model synchronous prediction framework provided by the invention, one model completes the multi-step prediction process in a single pass, which reduces the time required by the multi-step prediction process, improves the prediction accuracy, and finally enhances the capability of MPC in complex environments and improves the MPC decision effect. The system model based on the generative adversarial network uses the generator as the system model network, so that the system model can learn training data of different distributions with the help of the discriminator, fit complex environmental characteristics and constraints that are difficult to define manually, alleviate training errors caused by non-uniformly distributed training data, and improve model accuracy. During training, a system model network training loss is designed in which the supervised learning loss and the discriminator loss are fused: in the initial training stage, the more obvious gradient direction of supervised learning accelerates the training of the system model network, and in the later training stage, the discriminator helps the system model network learn various environmental constraints that are difficult to define manually, improving model accuracy, accelerating the training process and improving the final training effect. Based on the scheme of the invention, the time complexity of the multi-step prediction process is reduced from O(N) to O(1), and, since there is no iterative process, the accumulated-error problem of the iterative framework is eliminated.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a diagram illustrating multi-step prediction based on an iterative framework in the prior art;
FIG. 2 is a schematic diagram of multi-step prediction based on a multi-model synchronous prediction framework in the prior art;
FIG. 3 is a multi-step prediction schematic of a single model simultaneous prediction framework according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the principle of modeling the system model based on an adversarial neural network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the framework of an MPC device implemented with the system model modeled by an adversarial neural network, according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating the principle of multi-step synchronous prediction based on the system model modeled by an adversarial neural network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the background, the multi-step prediction process in MPC still has some problems in the prior art, mainly that the system model for realizing multi-step prediction is still not efficient and accurate enough. In the multi-step prediction process, the system models used are largely classified into conventional mathematical models and learning-based models.
The effect of the conventional system models currently in use depends greatly on manual design; they are usually used in linear or simple nonlinear environments, it is difficult to manually design a system model that fits a complex nonlinear environment well, and such models cannot be learned and optimized automatically. A typical conventional mathematical model is the manually defined linear model of the form s(t+1) = A·s(t) + B·u(t) + w, where s(t) denotes the target system state at time t, u(t) denotes the control action taken at time t, and A, B, w are fixed parameters. Linear models work well in linear or simpler nonlinear environments; however, in more complex environments there are many properties that are very difficult to represent by manual definition, such as context information between states and state sequence patterns in different environments. Because the environments at different moments differ and the correlations between preceding and succeeding states differ, it is difficult under the iterative framework to manually define a linear model that can accurately predict different moments in a complex environment to realize multi-step prediction; similarly, under the multi-model synchronous prediction framework, it is even harder to manually define the prediction models at different moments to realize multi-model synchronous prediction in a complex environment.
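As a toy illustration of such a manually defined linear model, a short numpy sketch follows; the parameter values are purely illustrative and are not taken from any real system.

```python
import numpy as np

# Toy instance of the manually defined linear model s(t+1) = A s(t) + B u(t) + w.
# A, B and w are fixed, hand-chosen parameters; the values below are illustrative only.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
w = np.array([0.0, -0.01])

def linear_step(s, u):
    """Single-step prediction of the hand-defined linear system model."""
    return A @ s + B @ u + w

s_next = linear_step(np.array([0.0, 1.0]), np.array([0.5]))  # one predicted step
```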
Learning-based methods can learn from acquired data by themselves without manual design and can fit linear and nonlinear environments by guiding the model to train in the expected direction through a specified loss function. With the development of the robotics field, more complex tasks such as mobile robot path tracking and autonomous navigation of quadrotor UAVs have appeared. In such complex nonlinear system environments there are many features or noises whose mathematical expressions are difficult to define manually with conventional methods. A learning-based model such as a neural network can fit these noises and therefore has advantages. A typical learning-based system model is a fully connected model: control actions are randomly sampled and sent to the target system for execution, the state after each control action is executed is recorded to form a data set, and the data set is then used to train a fully connected neural network to fit the system model. However, learning-based models require a loss function to be designed to guide the direction of model optimization. The commonly used supervised learning loss (L2 loss) helps the model determine the optimization direction by comparing the output of the network model in the same environment with the true value; this loss cannot capture the characteristic differences between different inputs, such as the differences between grassland, mud and other terrains in an automatic driving control task, so that the model fitting result always deviates from the true value to some extent. In addition, for learning-based models, training data needs to be collected; in the data collection stage, the collected data are the environment states and the decisions of human experts in the environment, such as paths and control quantities. The collected data set provides the basis for the model to learn how the system changes with the control quantity in the training stage, and the training data collected in such complex environments usually contain distribution deviations that are inevitably caused by different expert strategies. If the currently mainstream supervised-learning-style loss is used, the model is optimized in different directions due to the different distributions of the data, which affects the training effect of the model and finally the multi-step prediction effect. The influence is mainly reflected in the fact that context information of the environment cannot be directly reflected; in particular, severe state changes rarely occur in a normal environment, so it is difficult for the model to fully fit the real environment, and errors occur.
Because the environments at different moments differ and the correlations between preceding and succeeding states differ, a model that satisfies the correlations between different states of a complex environment cannot be trained under the iterative framework to realize multi-step prediction; similarly, under the multi-model synchronous prediction framework, it is difficult to train the corresponding model for each moment to realize multi-model synchronous prediction, and even if this could be realized, the number of models would increase with the prediction time domain, bringing higher computation overhead and reducing the prediction speed.
While conducting MPC research, the inventors found that the above deficiencies of the prior art are caused by the long time consumed by the multi-step prediction process of the system model and by inaccurate prediction results. The reasons are twofold. First, the existing iterative framework needs to iterate many times and repeatedly call the system model for prediction, which is slow and suffers from accumulated errors, while the multi-model synchronous prediction framework needs to design one model for each future moment, which is difficult, and the large amount of computation of multiple models makes prediction slow. Second, among the currently used system models, the linear model based on the conventional method is manually defined and can fit a linear environment or a simple nonlinear environment well, but it is difficult to define and fit accurately in a complex nonlinear environment and the error is large; the loss function used by learning-based models only considers the difference between the prediction result and the real result, without considering characteristics that are difficult to reflect with a fixed mathematical expression (such as differences between expert strategies during data acquisition) and the inherent additional constraints of the environment (such as context information between states, and the state sequence patterns of different surfaces such as grass and mud), so that the learned model always deviates from the real environment to some extent, which affects the accuracy of the multi-step prediction process and finally the effect of the model predictive control method.
To solve the above problems, the inventors propose a scheme that improves the multi-step prediction framework and construct a single-model synchronous prediction framework, in which improvements in two directions are mainly considered: 1) abandoning the iterative mode and realizing synchronous prediction by means of the multi-head output characteristic of the neural network, where the multi-head output characteristic refers to the ability to produce multiple corresponding outputs from multiple inputs by configuring the structure of the neural network; this characteristic is known to those skilled in the art and is not described in more detail in the present invention; 2) since synchronous prediction saves computation time and can support a more complex system model, a single learning-based neural network model is used as the system model instead of multiple manually defined models that synchronously predict the future states at different moments. When multi-step prediction is realized based on the system model scheme under the single-model synchronous prediction framework provided by the invention, one model can complete the multi-step prediction process in a single pass. As shown in Fig. 3, under the single-model synchronous prediction framework, one system model predicts, in one pass from the state s_t at time t and the planned future control action sequence of length N (the prediction time domain length), the future state sequence of the target system of length N: s_{t+1}', s_{t+2}', ..., s_{t+N}'. The time complexity is therefore O(1); the reduced time complexity brought by the synchronous prediction framework means that the real-time performance of MPC can be ensured even if a more complex neural network is fitted as the system model. Since the prediction process is completed with one model, the use of multiple models is avoided, the complexity of model design is reduced, and the amount of computation required is also reduced.
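For illustration only, a minimal PyTorch sketch of such a single-model synchronous predictor follows; the layer sizes, dimensions and class name are illustrative assumptions and not the exact network of the embodiment.

```python
import torch
from torch import nn

class SynchronousPredictor(nn.Module):
    """Single-model synchronous prediction sketch: one fully connected network maps
    the current state s_t and the whole planned action sequence of length N to all
    N future states (or state differences) in a single forward pass, so multi-step
    prediction needs O(1) model calls instead of O(N).
    """

    def __init__(self, state_dim, action_dim, horizon, hidden=256):
        super().__init__()
        self.horizon = horizon
        self.state_dim = state_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + horizon * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * state_dim),   # multi-head output: N states at once
        )

    def forward(self, s_t, U):
        # s_t: (batch, state_dim); U: (batch, horizon, action_dim)
        x = torch.cat([s_t, U.flatten(1)], dim=1)
        return self.net(x).view(-1, self.horizon, self.state_dim)
```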
Based on the single-model synchronous prediction framework, at each sampling moment the MPC samples a plurality of control action sequences of length N with a sampler according to the current state of the target system, predicts a future state sequence of the target system for each control action sequence with the system model under the single-model synchronous prediction framework, selects with an optimizer the predicted future state sequence closest to the target state sequence, and sends the first action, the first two actions or the first several actions of the control action sequence corresponding to that future state prediction sequence to the target system for execution as the actions to be executed in the future. Because the system model realizes multi-step prediction in one pass, the prediction efficiency can be greatly improved.
To make full use of the characteristics of the single-model synchronous prediction framework, a sufficiently strong system model needs to be designed. Since the real interaction between a complex environment and systems such as quadrotors and mobile robots is difficult to describe, a conventional system model is difficult to design, while a system model based on pure supervised learning deteriorates very rapidly in the face of training data with various distributions. According to one embodiment of the invention, the system model is modeled with a generative adversarial network; being based on game theory, the generative adversarial network can learn characteristics or constraints that cannot be expressed by a pure supervised learning loss function. According to one embodiment of the invention, the generative model in a generative adversarial network is adopted as the system model, so that the system model learns training data of different distributions by means of the discriminator in the generative adversarial network and fits complex environmental features and constraints that are difficult to define. This can also be understood as adding a discriminator on the basis of the original system model network, so that the model can fit characteristics in the complex environment that are difficult to define manually, improving the accuracy of the model. Finally, in the process of training the system model, an L2 loss based on supervised learning is added to help the network converge rapidly in the initial training stage and accelerate the training process, so that the speed and accuracy of the multi-step prediction process, a key step in model predictive control, are improved, and the efficiency and control effect of model predictive control are finally improved.
For a better understanding of the present invention, the technical idea of modeling the system model under the single-model synchronous prediction framework of the invention is first introduced: T1, design a fully connected network as the generative model (system model network) for predicting future states, design a discriminator with the same network structure, and form a generative adversarial network from the generative model and the discriminator, where the output of the generative model is the predicted states and the output of the discriminator is the probability that the state sequence input to the discriminator comes from the real environment; T2, sample a series of future control action sequences planned for execution based on the current target system state; T3, send the sampled control action sequences into the real environment of the target system to obtain the actual future state sequences as true values; T4, send the sampled control action sequences into the generative model and output the predicted future states; T5, calculate the L2 loss between the actual future states and the predicted future states, and at the same time use the discriminator to calculate the probability that the predicted future states belong to the real environment and calculate the discrimination loss based on this probability; and T6, train with a weighted sum of the discrimination loss and the L2 loss calculated in step T5 as the loss for training the system model. Preferably, as shown in Fig. 4, the invention takes the generative model as the system model G_θ(s_t, U), which represents the mapping from the current state s_t and the control action sequence U to the future state space, with network parameters θ; that is, the system model can obtain the future predicted state sequence S' from the current state s_t and the control action sequence. In the modeling (training) process of the system model, an additional discriminator D_ω(s_t, U, S') is added to estimate the probability that, given the current state s_t and the control action sequence U, a state segment S' comes from the real future states rather than from the system model network G_θ(s_t, U), and the training effect of the system model is judged based on this probability: if the probability is 0.5, the output of the system model network is almost indistinguishable from the real future states and the training effect is good; if the probability is biased towards 0 or 1, the system model network does not fit the real environment well and is easy to distinguish from the real environment. Thus the discriminator can provide a loss for training the system model network and help it fit various constraints that are difficult to define manually, context information that cannot be directly represented by a manually designed loss, and situations such as the rarity of severe state changes in a normal environment. In addition, the discriminator is only used in the process of training the system network and does not increase the complexity at test time.
In the training process of the adversarial network, the system model could in most cases directly take the future state sequence of the target states as its output, but this training mode may make the prediction accuracy of the system model depend excessively on the network training effect. To avoid this excessive dependence, the difference prediction sequence between the future states and the current state is taken as the output of the system model for training.
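For illustration only, a short sketch of this difference-prediction output scheme follows; the generator interface follows the sketch above and is an assumption.

```python
import torch

def predict_with_difference_output(generator, s_t, U, horizon):
    """The generative model outputs the differences between each future state and the
    current state; replicating s_t N times (S_base) and adding it to the difference
    prediction sequence recovers the predicted future states S' = G_theta(s_t, U) + S_base.
    """
    S_base = s_t.unsqueeze(1).repeat(1, horizon, 1)   # current state copied N times
    delta = generator(s_t, U)                         # difference prediction sequence
    return delta + S_base                             # predicted future state sequence
```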
According to one embodiment of the invention, the pseudo code of the training process is shown in Table 1; the pseudocode is explained line by line below.
wherein:
line 1: initializing generative model G, discriminator D, prediction time domain length N, generative model and discriminant training number ratio k, hyperparameters alpha, beta, c (where alpha is discriminant loss weight, beta is L2 loss weight, and alpha + beta =1,c is training gradient threshold)
Line 2: randomly sampling action sequences U with the length of N in a real environment for execution, and obtaining action sequences corresponding to each action sequenceState S in the real environment, recording actions and states, forming a data set P = { (S) i ,U i ) ,.. } (the state in the real environment corresponding to the action sequence is the target state, which is the target to be reached by generating the model prediction output)
Line 3: while not converged do (the generative model and the discriminator are trained adversarially for multiple rounds until convergence).
Line 4: for do (note that in each training round, the discriminator is first trained k times in lines 5 to 10, and then the generative model is trained once in lines 11 to 15).
Line 5: sampling m state-action pairs from a data set P
Figure BDA0003855826050000111
Figure BDA0003855826050000112
(samples for training the arbiter are sampled from the dataset, m is the batch size sampled during training, the batch size is related to the computing power of the hardware during training, the sequence of states contains the current state
Figure BDA0003855826050000113
And a series of future target states
Figure BDA0003855826050000114
)
Line 6: will be provided with
Figure BDA0003855826050000115
After N times of copying, the vector S is superposed base (embodiments of the invention use the sequence of the difference predictions of the future state and the current state as the output of the generative model, so that the current state needs to be replicated multiple times for superposition with the output of the generative model to obtain each future predicted state)
Line 7: predicting future state sequences using generative models
Figure BDA0003855826050000116
Figure BDA0003855826050000117
Representing generative models based on current state
Figure BDA0003855826050000118
And an action sequence U i The difference between the output future state and the current state is predicted sequence, and the predicted future state sequence is formed by adding the current state to each difference in the difference prediction sequence
Line 8: calculating gradients
Figure BDA0003855826050000119
Is the gradient of the discriminator, where D ω (s t ,U i ,S i ') indicates s based on the current state t And action sequence U i State S predicted by generative model i ' probability from real Environment, D ω (s t ,U i ,S i ) Represents the target state S i Probability from real environment)
Line 9: after gradient truncation, reverse transmission is carried out
Figure BDA00038558260500001110
D Represents the parameters of the discriminator, clip represents the gradient cut-off, which will not be described in detail here since the gradient cut-off is a technique known to those skilled in the art
Line 10: end for
Line 11: sampling m state-action pairs from a data set P
Figure BDA00038558260500001111
Figure BDA00038558260500001112
(samples from the dataset are sampled for training the generative model, m is the batch size sampled during training, batch size is related to the computational power of the hardware during training, and the sequence of states contains the current state
Figure BDA00038558260500001113
And a series of future target states
Figure BDA00038558260500001114
)
Line 12: will be provided with
Figure BDA00038558260500001115
After N times of copying, the vector S is superposed base (embodiments of the invention use the sequence of the difference predictions of the future state and the current state as the output of the generative model, so that the current state needs to be replicated multiple times for superposition with the output of the generative model to obtain each future predicted state)
Line 13: predicting future state sequences
Figure BDA00038558260500001116
The representation generation model is based on the current state
Figure BDA00038558260500001117
And an action sequence U i The difference between the output future state and the current state is predicted sequence, and the predicted future state sequence is formed by adding the current state to each difference in the difference prediction sequence
Line 14: calculating gradients
Figure BDA0003855826050000121
(
Figure BDA0003855826050000122
Is the gradient of the generative model, wherein D ω (s t ,U i ,S i ') indicates s based on the current state t And action sequence U i State S predicted by generative model i ' the probability from the real environment,
Figure BDA0003855826050000123
the loss of discrimination is represented by the loss of discrimination,
Figure BDA0003855826050000124
representing the L2 loss of the generative model, α is the discriminator loss weight, β is the L2 loss weight, and α + β =1,m is the batch size sampled during training)
Line 15: gradient counter-propagation
Figure BDA0003855826050000125
G Is a parameter of the generative model and updated based on gradient back propagation
Line 16: end while
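To summarise the pseudocode above, a condensed Python/PyTorch sketch of the training loop is given below; the optimizers, learning rates, the dataset.sample interface, the use of gradient value clipping for "clip", and the exact discriminator objective are illustrative assumptions rather than the reference implementation of the embodiment.

```python
import torch

def train_adversarial_system_model(generator, discriminator, dataset, horizon,
                                   k=5, alpha=0.1, beta=0.9, clip=0.5,
                                   m=64, rounds=1000):
    """Condensed sketch of the adversarial training loop (Table 1)."""
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    for _ in range(rounds):                                 # "while not converged"
        for _ in range(k):                                  # lines 5-10: train discriminator k times
            s_t, U, S_real = dataset.sample(m)              # assumed dataset interface
            S_base = s_t.unsqueeze(1).repeat(1, horizon, 1)
            with torch.no_grad():
                S_fake = generator(s_t, U) + S_base         # predicted future states
            loss_d = -(torch.log(discriminator(s_t, U, S_real) + 1e-8).mean()
                       + torch.log(1 - discriminator(s_t, U, S_fake) + 1e-8).mean())
            opt_d.zero_grad()
            loss_d.backward()
            torch.nn.utils.clip_grad_value_(discriminator.parameters(), clip)  # "clip" step
            opt_d.step()
        # lines 11-15: train the generative model once
        s_t, U, S_real = dataset.sample(m)
        S_base = s_t.unsqueeze(1).repeat(1, horizon, 1)
        S_fake = generator(s_t, U) + S_base
        loss_g = (alpha * torch.log(1 - discriminator(s_t, U, S_fake) + 1e-8).mean()
                  + beta * ((S_real - S_fake) ** 2).sum(dim=(1, 2)).mean())
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
```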
It should be noted that the structure of the system model may differ for different target systems. For example, when applying MPC to a path tracking task in automatic driving, the invention employs a system model consisting of a 4-layer fully connected network and a discriminator, with 128, 256, 256 and 128 neurons in the successive layers.
As can be seen from the pseudo code, compared with conventional adversarial training, the method adds an L2-based supervised learning loss to the training loss; in the initial training stage, the more obvious gradient direction of supervised learning accelerates the training process. The supervised learning loss and the discriminator loss are added with certain weights to obtain the system model loss function used by the invention:
L_G = (1/m) · Σ_{i=1}^{m} [ α · log(1 − D_ω(s_t, U_i, S_i′)) + β · ||S_i − S_i′||² ]

where α is the discriminator loss weight, β is the L2 loss weight, and α + β = 1; the specific values of α and β are determined by the specific task, and it is generally ensured that, when training tends to be stable, the L2 loss is smaller than the discriminator loss; m is the batch size of the sampled control action sequences in each training round, s_t is the current state of the target system at the time of the current training step, U_i is the i-th control action sequence, S_i′ denotes the future state prediction sequence of the target system after executing the i-th control action sequence from the current state, S_i denotes the corresponding target state sequence, and D_ω(s_t, U_i, S_i′) denotes the probability that the state S_i′ predicted by the generative model from the current state s_t and the action sequence U_i comes from the real environment.
The system model obtained through the training process can be used for multi-step prediction under a single-model synchronous prediction framework. The following describes how to implement MPC based on the system model under the single model synchronous prediction framework in detail with reference to the drawings and the embodiments.
In the MPC implementation, the main task at time t is to find the optimal control action sequence U_t* for the current state s_t and send it to the real environment of the target system for execution, where the multi-step prediction process predicts a future state sequence S of length N from the current state s_t and a given control action sequence U_t, and the action sequence whose predicted states are closest to the target states is selected as the optimal control action sequence U_t*.
In an embodiment of the present invention, based on the MPC architecture implemented with the system model under the single-model synchronous prediction framework of the invention as shown in Fig. 5, the sampler 1 first generates a batch of control action sequences according to the sampling strategy; the generated control action sequences are simultaneously input to the system model 3, generated based on the above-described embodiment, under the single-model synchronous prediction framework 2 to predict the states at the times t to t+N. The optimizer 4 selects the best state (the state closest to the target state) from the series of predicted future states according to the task target and takes the control action sequence corresponding to the best state as the optimal control action sequence U*, which is fed as the decision result to the real environment 5 of the target system. In practice, to ensure the real-time performance of MPC, usually only the leading actions of the optimal control action sequence U* (such as the first action or the first two actions) are sent to the real environment of the target system for execution; this process is repeated at every moment to continuously obtain the actions executed by the controlled target system in the real environment, thereby realizing real-time control.
According to one embodiment of the invention, the sampler adopts a CEM strategy for sampling and uses a normal distribution as the action sampling distribution, where actions refer to all possible actions under the constraint conditions of the target system. For example, in automatic driving, suppose the constraint on the acceleration (in km/s) is [5, 10] and the constraint on the angular velocity (in degrees/second) is [3, 6]; then any action satisfying both constraints is a feasible action, and action sampling means sampling an acceleration-angular velocity pair for each moment, yielding an action sequence composed of acceleration-angular velocity pairs at multiple moments. Briefly, in the sampling process the expectation and variance of the action sampling distribution are initialized, a number of action sequences of length N are sampled from this distribution, the sampled action sequences are sent to the multi-step prediction process and the optimizer to obtain several better action sequences, and the expectation and variance of the normal distribution are recalculated according to the feedback (expressed by the environment reward) of these action sequences in the real environment and used as the action sampling distribution for the next round (according to an embodiment of the invention, the expectation here refers to finding the optimal action sequence U* = argmin_{U_i} L_i, where L_i is a predefined prediction loss, L_i = (S_tar − S_i')^T Q (S_tar − S_i') + U_i^T R U_i, S_tar is the target state, S_i' is the predicted state, (S_tar − S_i') gives the L2 loss of the state, U_i is the sampled action sequence, and Q and R are the weights of the state L2 loss and the action penalty term respectively, determined by the specific target system task); if the difference between two successive expectations is less than a certain value, a good action expectation has been found. Since the CEM strategy is a technique known to those skilled in the art, it is not described in more detail.
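For illustration only, a numpy sketch of one round of such CEM-style sampling follows; the evaluate_cost interface and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def cem_sample_round(evaluate_cost, mu, sigma, n_samples=200, n_elite=20):
    """One round of CEM-style action sampling: draw action sequences from a normal
    distribution, score them with the prediction loss L_i, and refit the
    distribution to the lowest-cost ("elite") sequences.
    """
    mu = np.asarray(mu, dtype=float)          # (horizon, action_dim) mean
    sigma = np.asarray(sigma, dtype=float)    # (horizon, action_dim) standard deviation
    U = np.random.normal(mu, sigma, size=(n_samples,) + mu.shape)
    costs = np.array([evaluate_cost(u) for u in U])    # L_i for each sampled sequence
    elite = U[np.argsort(costs)[:n_elite]]             # best action sequences
    new_mu, new_sigma = elite.mean(axis=0), elite.std(axis=0)
    best_U = U[np.argmin(costs)]                       # current optimal sequence U*
    return new_mu, new_sigma, best_U
```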
As shown in Fig. 6, during each prediction the sampler samples multiple action sequences of length N (denoted by U) based on the sampling strategy, and the system model under the single-model synchronous prediction framework of the invention predicts, from the current state s_t and each action sequence, the future states corresponding to that action sequence, i.e., predicts all future states (denoted by S') corresponding to all action sequences. Based on the future states corresponding to the action sequences, the optimizer selects from all action sequences the one whose corresponding future states are closest to the target state S_tar as the optimal action sequence U*, and the best actions in the optimal action sequence are sent to the environment for execution.
According to an embodiment of the present invention, the pseudo code of the MPC based on the system model under the single-model synchronous prediction framework of the invention is shown in Table 2; the pseudocode is explained line by line below.
Wherein:
line 1: initializing a system model G by using trained generative model parameters, setting a prediction time domain length N, a maximum iteration number K and a hyperparameter Q, R (initializing the system model by using the generative model trained based on the generative countermeasure network before so that the system model can realize single-model synchronous prediction on a multi-step prediction process, wherein the maximum iteration number represents the number of action sequence items sampled during synchronous prediction or is called as the sampling number, the more action sequences are sampled, the more optimal action sequence is obtained, and Q and R respectively represent the L2 loss weight and the action penalty weight of a state)
Line 2: set target state sequence S sll-tar = { s1, s2, ·. } (sequence of all target states in the future period of time that the target system needs to predict when the target state sequence)
Line 3: while the final target state is not reached do (before the final target state is reached, the future states in each prediction time domain are predicted through lines 4 to 9).
Line 4: updating the Current State s t (in each prediction time domain, the initial current state is different, so the corresponding initial current state for each prediction time domain needs to be updated in each prediction)
Line 5: based on the current state s t With the target state sequence S sll-tar Determining the target state S of the next prediction time domain N tar (all target states included in the target state sequence include a plurality of target states corresponding to time domains, the target states corresponding to each time domain are different, and during each prediction, the target state of the next time domain needs to be determined based on the current state, and then the future state of the next time domain is predicted through the lines 6 to 9.)
Line 6: for i = 1, ..., K do (K action sequences are sampled to predict the future states of the next time domain and find the best action sequence among them).
Line 7: sampling action sequences according to a CEM strategy
Figure BDA0003855826050000151
(CEM strategy is a technique known to those skilled in the art and the present invention is not described in detail)
Line 8: will s t After N times of copying, the vector S is superposed base (embodiments of the invention use the sequence of the difference predictions of the future state and the current state as the output of the generative model, so that the current state needs to be replicated multiple times for superposition with the output of the generative model to obtain each future predicted state in the next prediction horizon)
Line 9: predicting a future state sequence S i ′=G θ (s t ,U i )+S base
Figure BDA0003855826050000152
Representing generative models based on current state
Figure BDA0003855826050000153
And an action sequence U i The difference between the output future state and the current state is predicted sequence, and the predicted future state sequence is formed by adding the current state to each difference in the difference prediction sequence
Line 10: calculating the loss L i =(S tar -S′ i ) T Q(S tar -S′ i )+U i T RU i ((S tar -S′ i ) L2 loss representing state, Q and R respectively representing L2 loss weight and action penalty weight of state, L i Indicating the difference between the predicted future state and the target state, L i The larger the predicted future state is, the larger the difference between the predicted future state and the target state is, and conversely, the closer the predicted future state is to the target state. )
Line 11: end for
Line 12: compute the optimal action sequence U* = argmin_{U_i} L_i (find, among all sampled action sequences, the one whose corresponding predicted future state sequence is closest to the target state sequence).
Line 13: select the first two actions from U* as u_t and send u_t to the real environment of the target system for execution (the first two actions of the optimal action sequence are sent to the real environment for execution while the new optimal actions are continuously predicted, realizing real-time control of the target system).
Line 14: end while
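To summarise the pseudocode above, a condensed numpy sketch of the control loop is given below; env (with current_state, reached_final_target and execute), sample_actions (a CEM-style sampler) and next_targets (which picks the target states of the next prediction time domain) are assumed interfaces for illustration only.

```python
import numpy as np

def mpc_control_loop(generator, env, sample_actions, next_targets,
                     horizon, Q, R, K=200):
    """Condensed sketch of the single-model synchronous prediction MPC loop (Table 2)."""
    s_t = env.current_state()
    while not env.reached_final_target():               # line 3
        S_tar = next_targets(s_t)                        # line 5: targets of the next time domain
        best_cost, best_U = np.inf, None
        for _ in range(K):                               # lines 6-11: K sampled action sequences
            U = sample_actions(horizon)                  # line 7: CEM-style sampling
            S_base = np.tile(s_t, (horizon, 1))          # line 8: current state copied N times
            S_pred = generator(s_t, U) + S_base          # line 9: single-pass multi-step prediction
            e = (S_tar - S_pred).reshape(-1)
            cost = e @ Q @ e + U.reshape(-1) @ R @ U.reshape(-1)   # line 10: L_i
            if cost < best_cost:                         # line 12: keep the best sequence
                best_cost, best_U = cost, U
        env.execute(best_U[:2])                          # line 13: send the first two actions
        s_t = env.current_state()                        # line 4: update the current state
```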
It can be seen from the above embodiments that, by means of the single-model synchronous prediction framework and the system model based on the generative adversarial network, the invention solves the problems of high time consumption and low prediction accuracy of the multi-step prediction process in model predictive control in complex environments, and can provide better decision efficiency and decision effect under conditions of complex environments and limited computing power.
To better verify the effect of the invention, the inventors performed multi-step prediction with the single-model synchronous prediction framework of the invention and with other frameworks in the prior art in different task environments (game environment 1, game environment 2, a path tracking environment, and a real-vehicle automatic driving environment), where Iter-Fixed refers to fixed linear iterative prediction under the iterative framework, Iter-CNN refers to convolutional neural network iterative prediction under the iterative framework, and SMS-DNN refers to deep neural network prediction under the multi-model synchronous prediction framework, and compared the accuracy and prediction time of each prediction method. The experimental results are shown in Table 3.
TABLE 3
[Table 3 is provided as an image in the original publication; it compares the prediction accuracy and prediction time of each method in each task environment.]
As can be seen from Table 3, the prediction accuracy of the method is obviously higher than that of other methods, and the required time is obviously shorter than that of other methods, so that the method can better meet the real-time requirement of multi-step prediction.
In conclusion, compared with the prior art, the single-model synchronous prediction framework provided by the invention completes the multi-step prediction process with one model in a single pass, which reduces the time required by multi-step prediction, improves the prediction accuracy, and ultimately enhances the capability of MPC in complex environments and improves the MPC decision effect. The system model based on a generative adversarial network uses the generator as the system model network, so that, with the help of the discriminator, the system model can learn training data with different distributions, fit complex environmental characteristics and constraints that are difficult to define manually, alleviate training errors caused by non-uniformly distributed training data, and improve model accuracy. The system model network training loss designed for the training process fuses the supervised learning loss and the discriminator loss: in the early stage of training, the more pronounced gradient direction of supervised learning accelerates the training of the system model network, while in the later stage the discriminator helps the system model network learn the various environmental constraints that are difficult to define manually, improving model accuracy, accelerating the training process and improving the final training effect. Based on the scheme of the invention, the time complexity of the multi-step prediction process is reduced from O(N) to O(1), and, since there is no iterative process, the problem of accumulated errors in the iterative framework is eliminated.
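The fused training objective described above, i.e. a blend of the supervised L2 loss and the discriminator loss with weights β and α (α + β = 1), could be sketched as follows. Since the exact expression appears only as an image in the original publication, this is a hedged illustration of one common way to write such a fusion; the networks generator and discriminator, the argument layout, and the use of a binary cross-entropy adversarial term are assumptions of this sketch rather than the claimed formula.

```python
import torch
import torch.nn.functional as F

def generator_training_loss(generator, discriminator,
                            s_t, U_batch, S_real, alpha, beta):
    """Fused generator loss: alpha * discriminator term + beta * L2 term.

    s_t     : current states,           shape (m, state_dim)
    U_batch : sampled action sequences, shape (m, H, act_dim)
    S_real  : state sequences observed in the real environment,
              shape (m, H, state_dim)
    alpha, beta : loss weights with alpha + beta == 1
    """
    # The generator outputs a difference sequence; adding the (broadcast)
    # current state yields the predicted future state sequence.
    diff = generator(s_t, U_batch)                 # (m, H, state_dim)
    S_pred = diff + s_t.unsqueeze(1)               # broadcast over the horizon
    # Supervised term: L2 distance to the states measured in the real system.
    l2_term = F.mse_loss(S_pred, S_real)
    # Adversarial term: push D(s_t, U, S_pred) towards "real";
    # the discriminator is assumed to output a probability of being real.
    d_out = discriminator(s_t, U_batch, S_pred)
    adv_term = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    return alpha * adv_term + beta * l2_term
```

Under this weighting, the β-scaled supervised term provides the clear gradient direction that speeds up early training, while the α-scaled adversarial term lets the discriminator impose the environment constraints that are hard to define by hand, matching the training dynamic described in the preceding paragraph.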
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A training method of a generative model for state prediction based on an antagonistic neural network, wherein the antagonistic neural network comprises the generative model and a discriminator, each of which comprises a multilayer fully-connected network; the generative model is used for selecting a control action sequence to be executed by a target system at a plurality of future moments according to the current state of the target system and a target state sequence of the target system, and the method comprises the following steps:
s1, sampling a plurality of control action sequences of predicted time domain length from all possible control actions under the constraint condition of the target system, sending each control action sequence into the real environment of the target system for execution to obtain a target state sequence of the target system corresponding to each control action sequence, and sending each control action sequence into the generative model to obtain a target system future state prediction sequence corresponding to each control action sequence;
s2, forming a sample by using the current state of the target system, each control action sequence and the target state sequence of the corresponding target system after each control action sequence is executed, and generating a training set;
and S3, training the antagonistic neural network multiple times with the training set until convergence.
2. The method of claim 1, wherein the generative model and the discriminator each comprise a 4-layer fully-connected network.
3. The method of claim 2, wherein in step S3, the discriminator loss and the L2 loss are used to train the generative model according to:
[The training loss formula is provided as an image in the original publication; it combines the discriminator loss, weighted by α, and the L2 loss, weighted by β, computed over a batch of m sampled control action sequences.]
where α is the discriminator loss weight, β is the L2 loss weight, α + β = 1, m is the batch size of the sampled control action sequences in each training iteration, s_t is the current state of the target system at the time of the current training, U_i is the i-th control action sequence, S_i′ represents the target system future state prediction sequence after the target system executes the i-th control action sequence in the current state, and D_ω(s_t, U_i, S_i′) indicates the probability, given the current state s_t and the action sequence U_i, that the state S_i′ predicted by the generative model comes from the real environment.
4. An MPC control method applied to an automatically controlled target system for selecting a sequence of control actions to be executed by the target system at a plurality of moments in the future according to a current state of the target system and a sequence of target states of the target system, the method comprising:
t1, sampling a preset number of control action sequences of the predicted time domain length corresponding to the target system, wherein the control action at each moment in each control action sequence is sampled from all feasible control actions under the constraint condition of the target system;
t2, predicting future state sequences of the target system after the target system executes the control action sequences by adopting a generative model trained by the method according to any one of claims 1 to 3 based on each sampled control action sequence and the current state of the target system to obtain a preset number of future state prediction sequences of the target system with prediction time domain length;
and T3, selecting, from all the sampled control action sequences of the target system, the control action sequence corresponding to the target system future state prediction sequence closest to the target state sequence of the target system.
5. The method of claim 4, wherein in the step T1, a CEM strategy is adopted for sampling.
6. The method according to claim 4, wherein in step T3, the control action sequence corresponding to the target system future state prediction sequence closest to the target state sequence of the target system is selected from all the sampled control action sequences of the target system, and the first two control actions in that control action sequence are selected.
7. The method according to claim 4, wherein said step T3 comprises:
t31, calculating loss between a target system future state prediction sequence corresponding to each sampled control action sequence and a target state sequence of the target system;
and T32, selecting a control action sequence with the minimum loss between the corresponding target system future state prediction sequence and the target state sequence of the target system.
8. The method according to claim 7, wherein in the step T31, the loss between the target system future state prediction sequence and the target state sequence corresponding to each control action sequence is calculated by:
L_i = (S_tar - S_i′)^T Q (S_tar - S_i′) + U_i^T R U_i
wherein L_i denotes the loss between the target system future state prediction sequence corresponding to the i-th control action sequence and the target state sequence of the target system, S_tar denotes the target state sequence of the target system, S_i′ denotes the target system future state prediction sequence after the target system executes the i-th control action sequence, U_i denotes the i-th control action sequence, Q denotes the weight of the state L2 loss, R denotes the weight of the control action penalty term, and (·)^T denotes vector transposition.
9. The method of claim 8, wherein, for each control action sequence, the corresponding target system future state prediction sequence of the predicted time domain length is the vector sum of a vector consisting of a preset number of current states of the target system and the difference sequence output by the pre-trained system model.
10. An MPC control apparatus configured in an automatically controlled target system for selecting a sequence of control actions to be performed by the target system at a plurality of moments in the future based on a current state of the target system and a sequence of target states of the target system, the apparatus comprising:
the sampler is used for sampling a plurality of control action sequences of the predicted time domain length corresponding to the target system, wherein the control action at each moment in each control action sequence is sampled from all feasible control actions under the constraint condition of the target system;
the generative model constructed according to any one of claims 1 to 3 is used for predicting a future state sequence of a target system after the target system executes a control action sequence based on each control action sequence sampled by the sampler and a current state of the target system to obtain a preset number of future state prediction sequences of the target system with predicted time domain lengths;
and the optimizer is used for selecting a control action sequence corresponding to a target system future state prediction sequence closest to the target system target state sequence from all the sampled control action sequences.
11. A computer-readable storage medium, having stored thereon a computer program executable by a processor for performing the steps of the method of any one of claims 1-3, 4-9.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to carry out the steps of the method according to any one of claims 1-3, 4-9.
CN202211156355.0A 2022-09-21 2022-09-21 Training method of generative model for state prediction based on antagonistic neural network Pending CN115453880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211156355.0A CN115453880A (en) 2022-09-21 2022-09-21 Training method of generative model for state prediction based on antagonistic neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211156355.0A CN115453880A (en) 2022-09-21 2022-09-21 Training method of generative model for state prediction based on antagonistic neural network

Publications (1)

Publication Number Publication Date
CN115453880A true CN115453880A (en) 2022-12-09

Family

ID=84306001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211156355.0A Pending CN115453880A (en) 2022-09-21 2022-09-21 Training method of generative model for state prediction based on antagonistic neural network

Country Status (1)

Country Link
CN (1) CN115453880A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188648A (en) * 2023-04-27 2023-05-30 北京红棉小冰科技有限公司 Virtual person action generation method and device based on non-driving source and electronic equipment

Similar Documents

Publication Publication Date Title
EP3593288B1 (en) Training action selection neural networks using look-ahead search
Okada et al. Path integral networks: End-to-end differentiable optimal control
CN109241291A (en) Knowledge mapping optimal path inquiry system and method based on deeply study
CN111142522A (en) Intelligent agent control method for layered reinforcement learning
CN113052372B (en) Dynamic AUV tracking path planning method based on deep reinforcement learning
CN112685165B (en) Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
CN111783994A (en) Training method and device for reinforcement learning
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN110716575A (en) UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning
CN116848532A (en) Attention neural network with short term memory cells
CN111159489A (en) Searching method
CN111768028A (en) GWLF model parameter adjusting method based on deep reinforcement learning
CN115453880A (en) Training method of generative model for state prediction based on antagonistic neural network
Hafez et al. Efficient intrinsically motivated robotic grasping with learning-adaptive imagination in latent space
CN114219066A (en) Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance
CN114330119A (en) Deep learning-based pumped storage unit adjusting system identification method
CN113419424A (en) Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN113341696A (en) Intelligent setting method for attitude control parameters of carrier rocket
Sumiea et al. Enhanced deep deterministic policy gradient algorithm using grey wolf optimizer for continuous control tasks
Byeon Advances in Value-based, Policy-based, and Deep Learning-based Reinforcement Learning
Masuda et al. Control of nonholonomic vehicle system using hierarchical deep reinforcement learning
Morales Deep Reinforcement Learning
Zhang et al. Reinforcement Learning from Demonstrations by Novel Interactive Expert and Application to Automatic Berthing Control Systems for Unmanned Surface Vessel
Li et al. Policy gradient methods with gaussian process modelling acceleration

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination