CN115167136A - Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck
- Publication number: CN115167136A (application CN202210865762.2A)
- Authority: CN (China)
- Prior art keywords: module, network, data, encoder, control
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G05B13/042—Adaptive control systems involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
- G06N3/084—Neural network learning methods: backpropagation, e.g. using gradient descent
- G06N5/04—Inference or reasoning models
- G06V10/40—Extraction of image or video features
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- Y02P90/02—Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Abstract
The invention discloses an intelligent agent control method based on deep reinforcement learning and a conditional entropy bottleneck, aiming to solve the low control-strategy accuracy of existing deep-reinforcement-learning agent control methods in image continuous control tasks. The technical scheme is as follows: construct an agent control system based on deep reinforcement learning and the conditional entropy bottleneck, consisting of a perception module, an action module, a storage module, a data expansion module, a feature extraction module and a control module; construct a feature extraction module objective function based on the conditional entropy bottleneck, and derive the corresponding optimization loss function via variational inference; construct an image continuous control task simulation environment; train the control system in the simulation environment to obtain optimized network parameters; and load the network parameters into the feature extraction module and the control module to obtain the trained control system. The trained control system is deployed on an agent in a real environment to complete image continuous control tasks. The method improves the accuracy of the agent's control strategy.
Description
Technical Field
The invention relates to the field of control, and in particular to an intelligent agent control method based on deep reinforcement learning and a conditional entropy bottleneck for image continuous control tasks.
Background
An intelligent agent is an unmanned node with sensing, communication, movement, storage, computation and other capabilities, and its control problem is closely related to robot control. Traditional control methods usually rely on establishing a detailed, accurate mathematical model of the controlled object. When the model is complex or uncertain factors such as external disturbances exist, their applicability is severely limited, and they cannot achieve good performance in continuous control tasks (i.e., tasks where the control command is a continuous vector, such as the rotation speed and torque of a joint of the agent). Moreover, achieving intelligent perception and autonomous control of a controlled object in a complex, unknown environment (e.g., one observed through images) is a research difficulty in the field of agent control. Deep Reinforcement Learning (DRL) is an end-to-end perception and control method that combines Deep Learning (DL) with Reinforcement Learning (RL). Leveraging the representational capacity of deep learning and the decision-making capability of reinforcement learning, it has been widely applied in fields such as robot control and board games, and offers a new approach to the agent control problem in the above scenarios.
In image continuous control tasks, the agent must learn continuous control commands directly from image observations. Because DRL unifies the perception of image observations and the output of continuous control commands into a single end-to-end training paradigm, it struggles to solve the agent control problem effectively (an efficient control strategy relies on robust features). To address this, researchers decouple perception and control in DRL into two sub-processes: representation learning and policy learning. Representation learning trains an encoder network to extract robust features from the agent's image observations; policy learning solves for the optimal control strategy based on those features. Auxiliary objective functions are introduced to optimize the representation learning process, thereby improving the performance of the control strategy.
In 2019, minne Li et al introduced a multiview Reinforcement Learning framework in the article Multi-View Reinforcement Learning published in the NIPS corpus of the neural information processing System conference, page 1420-1431, that is, the multiview Reinforcement Learning framework extended the partially observable Markov decision process to a multiview Markov decision process with multiple observation models, and proposed two methods for Learning control strategies from image observations based on data enhancement of image observations and policy migration between multiview. In 2020, michael Laskin et al, in the neural information processing System conference NIPS argument, 19884-19895, the article "Reinforcement Learning with Augmented Data", namely "Data Augmented Reinforcement Learning", proposed RAD method, and carried out a broad study on general Data augmentation method in pixel space and state space, proving that Data augmentation methods such as random translation, random clipping, random convolution and the like can improve the performance of control strategy obtained by Reinforcement Learning algorithm. In the article "Learning invariant for Learning reinforcement Learning without reconstruction", a Deep interactive simulation for Control (DBC) method is proposed, and the robust representation of image observation of an intelligent body is studied based on the interactive simulation measure of behavior similarity among states in the process of quantifying continuous Markov decision, so that the performance of a Control strategy for solving a downstream Control task is improved. 
Although agent control methods based on deep reinforcement learning obtain relatively accurate control strategies in image continuous control tasks, the agent's motion control model is complex, friction exists between the agent's joints, and the agent's image observations of the environment contain visual information irrelevant to the control task. Consequently, when existing methods solve the control strategy with a deep neural network, changes in task-irrelevant visual information within the image observations lower the accuracy of the solved control strategy.
Therefore, existing agent control methods based on deep reinforcement learning still suffer from low control-strategy accuracy in image continuous control tasks.
Disclosure of Invention
The technical problem to be solved by the invention is the low control-strategy accuracy of existing deep-reinforcement-learning agent control methods in image continuous control tasks. The invention provides an agent control method based on deep reinforcement learning and a conditional entropy bottleneck that improves the accuracy of the agent's control strategy.
The method decouples perception and control in deep reinforcement learning. A feature extraction module corresponding to perception is trained on the basis of the conditional entropy bottleneck to extract, from the agent's image observations, information relevant to the control task as robust state vectors; a control module for the image continuous control task is then trained on those state vectors, improving the accuracy of the agent's control strategy.
The specific technical scheme is as follows:
First, construct an agent control system based on deep reinforcement learning and the conditional entropy bottleneck, install the control system on an agent, and let the agent interact with the image continuous control task environment. The agent is an unmanned node with sensing, communication, movement, storage, computation and other capabilities (such as an unmanned aerial vehicle or a robot), including but not limited to simulated robotic arms and simulated robots constructed in a simulation environment. The image continuous control task environment is the entity the agent interacts with: the agent observes the state of the environment in the form of images and acts in the environment according to continuous control commands derived from those image observations. The control system consists of a perception module, an action module, a storage module, a data expansion module, a feature extraction module and a control module.
The perception module is an image sensor (such as a depth camera) connected with the feature extraction module and the storage module. It acquires from the image continuous control task environment an image observation (an RGB image) containing the agent's state (information about the agent) and the environment's state (information other than the agent), and sends the image observation to the feature extraction module and the storage module.
The action module is the actuator of the agent's control commands (such as an engine or steering gear). It is connected with the control module, receives control commands from it, and acts in the image continuous control task environment accordingly.
The storage module is a memory with more than 1 GB of available space, connected with the perception module, the control module and the data expansion module. It receives image observations from the perception module, control commands from the control module and rewards from the image continuous control task environment, and combines them into trajectory data of the agent's interaction with the environment. Trajectory data is stored as quadruples (s_t, a_t, r_t, s_{t+1}), where: s_t is the image observation received from the perception module at the agent's t-th interaction with the environment; a_t is the control command from the control module executed at the t-th interaction; r_t is the reward the environment feeds back for control command a_t at the t-th interaction; and s_{t+1} is the image observation received from the perception module after the t-th interaction has changed the environment's state (also called the image observation at the (t+1)-th interaction).
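The storage module described above behaves like a standard replay buffer. A minimal sketch in Python (illustrative only; the class and method names are not from the patent):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores interaction quadruples (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity):
        # A bounded deque evicts the oldest quadruple when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def sample(self, batch_size):
        # Random selection, as the data expansion module requires.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(5):
    buf.add(f"img_{t}", [0.1 * t], float(t), f"img_{t+1}")
batch = buf.sample(2)  # two randomly chosen quadruples
```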
The data expansion module is connected with the storage module, the feature extraction module and the control module. It randomly selects from the storage module the trajectory data τ = (s_t, a_t, r_t, s_{t+1}) required for training the control system, performs N data expansions on s_t and s_{t+1} in τ to obtain the expanded trajectory data τ^N containing the augmented observations s_t^j and s_{t+1}^j (j ∈ [1, N], where j indexes the observations after N data expansions), and sends τ^N to the feature extraction module and the control module.
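Random cropping, one of the augmentations discussed in the background (RAD), is one way such an N-fold expansion could be realized. A sketch in NumPy (the function names and the 84 to 64 pixel sizes are assumptions, not from the patent):

```python
import numpy as np

def random_crop(img, out_size, rng):
    """Randomly crop an H x W x C image observation to out_size x out_size."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - out_size + 1)
    left = rng.integers(0, w - out_size + 1)
    return img[top:top + out_size, left:left + out_size]

def expand(s_t, s_next, n, out_size=64, seed=0):
    """Produce N augmented copies (index j = 1..N) of both observations."""
    rng = np.random.default_rng(seed)
    s_aug = [random_crop(s_t, out_size, rng) for _ in range(n)]
    s_next_aug = [random_crop(s_next, out_size, rng) for _ in range(n)]
    return s_aug, s_next_aug

s_t = np.zeros((84, 84, 3))
s_aug, s_next_aug = expand(s_t, s_t.copy(), n=4)
# each of the 4 augmented observations is a 64 x 64 x 3 array
```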
The feature extraction module is connected with the perception module, the data expansion module and the control module. It consists of an encoder network, a target encoder network, a feature fusion network, a single-view predictor network and a multi-view predictor network.
The encoder network consists of a first encoder network Encoder_1 and a second encoder network Encoder_2, and is connected with the perception module, the data expansion module, the control module, the feature fusion network and the single-view predictor network. Encoder_1 consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer, and is connected with the perception module, the data expansion module, the control module and Encoder_2; Encoder_2 consists of 3 fully connected layers and is connected with Encoder_1, the feature fusion network and the single-view predictor network. When the agent interacts with the image continuous control task environment, Encoder_1 receives s_t from the perception module. The first, second, third and fourth convolutional layers of Encoder_1 convolve s_t in sequence with 3 x 3 convolution kernels, and the result of the four convolution operations is passed to the fully connected layer of Encoder_1. The fully connected layer maps it to a state vector and sends this vector to the regularization layer of Encoder_1. The regularization layer regularizes the fully connected state vector, and the regularized state vector serves as the first state vector z_t, which is sent to the control module.
When training the control system, Encoder_1 receives the expanded trajectory data τ^N from the data expansion module. The four convolutional layers of Encoder_1 convolve the augmented observations s_t^j in τ^N in sequence with 3 x 3 convolution kernels, and the result of the four convolution operations is passed to the fully connected layer of Encoder_1, which maps it to N state vectors and sends them to the regularization layer of Encoder_1. The regularization layer regularizes the N fully connected state vectors to obtain N regularized state vectors, which serve as the second state vectors z_t^j. The first of the second state vectors, z_t^1, is sent to the control module, and all z_t^j are sent to Encoder_2. The first, second and third fully connected layers of Encoder_2 in turn map the z_t^j received from Encoder_1 to the means and variances of corresponding Gaussian distributions; a reparameterization operation on the means and variances (see Diederik P. Kingma and Max Welling, "Auto-Encoding Variational Bayes", ICLR 2014) yields N reparameterized state vectors, which are sent to the feature fusion network and the single-view predictor network.
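The reparameterization operation cited from Kingma and Welling can be sketched in a few lines of NumPy (a generic illustration of the trick, not the patent's network):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so that gradients
    can flow through mu and log_var instead of through the sampling step."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros((4, 8))       # e.g. N=4 state vectors, 8-dimensional
log_var = np.zeros((4, 8))  # unit variance
z = reparameterize(mu, log_var, rng)  # 4 reparameterized state vectors
```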
The target encoder network is connected with the data expansion module, the control module and the feature fusion network, and consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer. During training of the control system, updating the encoder network's parameters too fast makes the training process oscillate and become unstable; the target encoder network is therefore introduced into the control system to improve the stability of training. The target encoder network receives τ^N from the data expansion module. Its first, second, third and fourth convolutional layers convolve the augmented next observations s_{t+1}^j in τ^N in sequence with 3 x 3 convolution kernels and send the result of the four convolution operations to the fully connected layer; the fully connected layer maps it to N target state vectors and sends them to the regularization layer; the regularization layer regularizes the N fully connected target state vectors to obtain N regularized target state vectors, which serve as the target state vectors. The first target state vector is sent to the control module, and all target state vectors are sent to the feature fusion network.
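Target networks of this kind are commonly kept stable by moving their parameters slowly toward the online network's parameters. A sketch of such an exponential-moving-average update (an assumption for illustration; the patent text above does not specify the update rule):

```python
def soft_update(online_params, target_params, tau=0.05):
    """Move the target parameters a small step toward the online ones:
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * o + (1 - tau) * t for o, t in zip(online_params, target_params)]

online = [1.0, 2.0]   # stand-ins for network weights
target = [0.0, 0.0]
for _ in range(3):
    target = soft_update(online, target, tau=0.5)
# after 3 steps: target == [0.875, 1.75], lagging behind the online weights
```

A small tau makes the target encoder change slowly, which damps the oscillation described above.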
The feature fusion network is connected with the encoder network, the target encoder network and the multi-view predictor network, and consists of a first fusion network Feature_1 and a second fusion network Feature_2, each composed of 3 fully connected layers. Feature_1 is connected with the encoder network, the target encoder network and Feature_2. It receives the N reparameterized state vectors from the encoder network and the N target state vectors from the target encoder network. The first, second and third fully connected layers of Feature_1 in turn perform full connection operations on the reparameterized state vectors and splice the results into a state fusion vector, which is sent to Feature_2; the same layers likewise process the target state vectors and splice the results into a target state fusion vector. Feature_2 is connected with Feature_1 and the multi-view predictor network. It receives the state fusion vector from Feature_1, and its first, second and third fully connected layers in turn apply a reparameterization operation to obtain a reparameterized state fusion vector, which is sent to the multi-view predictor network.
The single-view predictor network is connected with the data expansion module and the encoder network, and consists of 3 fully connected layers. It receives the expanded trajectory data τ^N from the data expansion module and the N reparameterized state vectors from the encoder network. Its first, second and third fully connected layers in turn perform full connection operations on the first spliced vector formed from each reparameterized state vector and the control command a_t in τ^N, mapping it to a predicted target state vector and a first predicted reward value, where the j-th terms are the predictions for the j-th augmented observation. This realizes prediction of the transition dynamics equation and the reward function equation (both basic concepts in reinforcement learning; see Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction").
The multi-view predictor network is connected with the data expansion module and the feature fusion network, and consists of 3 fully connected layers. It receives τ^N from the data expansion module and the reparameterized state fusion vector from the feature fusion network, forms a second spliced vector from the fusion vector and the control command a_t in τ^N, and its first, second and third fully connected layers in turn perform full connection operations on this vector, mapping it to a predicted target state fusion vector and a second predicted reward value, likewise realizing prediction of the transition dynamics equation and the reward function equation.
The control module is connected with the feature extraction module, the data expansion module and the action module, and consists of two evaluation networks (a first evaluation network Critic_1 and a second evaluation network Critic_2), two target evaluation networks (a first target evaluation network TarCritic_1 and a second target evaluation network TarCritic_2) and a policy network. Two evaluation networks and two target evaluation networks are used to prevent the overestimation problem that arises when a single evaluation network or a single target evaluation network assesses the quality of the agent's control commands.
Critic_1 and Critic_2 are connected with the feature extraction module, the data expansion module and the policy network, and each consists of three fully connected layers. Both receive the first of the second state vectors, z_t^1, from the feature extraction module, the expanded trajectory data τ^N from the data expansion module and the control command a from the policy network, and evaluate the quality of the control command a_t in τ^N and of a. The first, second and third fully connected layers of Critic_1 in turn perform full connection operations on the third spliced vector formed from z_t^1 and a_t, mapping it after three full connection operations to a first state-action value (the state-action value, a basic concept in reinforcement learning, is the expected reward obtainable after the agent executes the selected control command in the current state); the same layers of Critic_1 process the fourth spliced vector formed from z_t^1 and a, mapping it to a second state-action value. Likewise, the first, second and third fully connected layers of Critic_2 map the third spliced vector to a third state-action value and the fourth spliced vector to a fourth state-action value.
Both TarCritic_1 and TarCritic_2 are connected with the feature extraction module and the policy network, each consists of three fully-connected layers, and both receive the target state vector from the feature extraction module and the control instruction a' from the policy network, and evaluate the quality of a'. The first, second and third fully-connected layers of TarCritic_1 in turn perform fully-connected operations on the target splicing vector formed from the target state vector and a', and after three fully-connected operations map it into the first target state-action value; the first, second and third fully-connected layers of TarCritic_2 in turn perform fully-connected operations on the same target splicing vector and map it into the second target state-action value.
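The over-estimation-avoidance idea behind the paired evaluation networks can be sketched in PyTorch as follows; the three fully-connected layers match the description above, while the layer width and helper names are illustrative assumptions, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Three fully-connected layers mapping a spliced (state, action) vector to a scalar value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        # Splice the state vector and the control instruction, then apply the three layers.
        return self.net(torch.cat([state, action], dim=-1))

def clipped_double_q(critic1, critic2, state, action):
    # Taking the element-wise minimum of two independent estimates curbs over-estimation.
    return torch.min(critic1(state, action), critic2(state, action))
```

The same `Critic` shape serves for the two target evaluation networks; only their parameters are updated differently (by exponential moving average rather than gradient descent).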
The policy network is connected with the feature extraction module, the action module, the storage module, Critic_1, Critic_2, TarCritic_1 and TarCritic_2, and consists of three fully-connected layers. When the agent interacts with the image continuous control task environment, the policy network receives the first state vector z_t from the feature extraction module; its first, second and third fully-connected layers in turn perform fully-connected operations on z_t, map z_t into the control instruction a_t, and send a_t to the action module and the storage module. When training the agent control system based on deep reinforcement learning and conditional entropy bottleneck, the policy network receives the second state vector and the target state vector from the feature extraction module; its three fully-connected layers in turn perform fully-connected operations on the second state vector, map it into the control instruction a, and send a to Critic_1 and Critic_2; they likewise perform fully-connected operations on the target state vector, map it into the control instruction a', and send a' to TarCritic_1 and TarCritic_2.
Secondly, construct the objective function of the feature extraction module based on the conditional entropy bottleneck, and obtain the optimization loss function of the feature extraction module through the variational inference technique (see "Deep Variational Information Bottleneck", Alexander A. Alemi et al., Proceedings of the International Conference on Learning Representations (ICLR), 2017). The method comprises the following steps:
2.1 In order to learn the state vector corresponding to the image observation, design the feature extraction module objective function shown in formula (1) based on the conditional entropy bottleneck. The conditional entropy bottleneck (see "The Conditional Entropy Bottleneck", Ian Fischer, Entropy, 2020) is an information-theoretic method for extracting a feature Z of given data X in order to predict a label Y; it is defined so that the information retained in Z is maximally correlated with Y.
Wherein: Object represents the objective of the feature extraction module; the j-th expanded image observation is obtained by the data expansion module performing the j-th data expansion on the image observation of the t-th interaction, and likewise for the image observation of the (t+1)-th interaction; the reparameterized state vector is obtained by inputting the expanded image observation into the encoder network; the target state vector is obtained by inputting the expanded next image observation into the target encoder network; the reparameterized state fusion vector is obtained by inputting the N reparameterized state vectors into the feature fusion network; the target state fusion vector is obtained by inputting the N target state vectors into the feature fusion network; β_j is a regularization factor with a suggested value of 1e-4 to 1e-2; the remaining terms of formula (1) are conditional mutual information items.
2.2 Apply the variational inference technique to formula (1) to obtain the optimization loss function of the feature extraction module shown in formula (2):
Wherein: M is the number of pieces of trajectory data randomly selected from the storage module by the data expansion module; the Gaussian distribution in formula (2) is parameterized by the mean and variance calculated by Encoder_2 in the encoder network; a variational distribution is used to approximate the intractable posterior; the Gaussian noises ξ_1, ξ_2, …, ξ_N and ξ_{1:N} are used in the reparameterization process, and the expectations in formula (2) are taken over them. The single-view term is the cross-entropy loss between the j-th predicted target state vector (obtained by inputting the j-th reparameterized state vector and a_t into the single-view predictor network) and the j-th target state vector, plus the cross-entropy loss between the j-th first predicted reward value and r_t; the multi-view term is the cross-entropy loss between the predicted target state fusion vector (obtained by inputting the reparameterized state fusion vector and a_t into the multi-view predictor network) and the target state fusion vector, plus the cross-entropy loss between the second predicted reward value and r_t.
Thirdly, construct the image continuous control task simulation environment in the DeepMind Control Suite (DMControl) simulation environment (DMControl is driven by the MuJoCo (Multi-Joint dynamics with Contact) physics engine and is widely used in robotics and related research fields), and prepare for training the agent control system based on deep reinforcement learning and conditional entropy bottleneck. The method comprises the following steps:
3.1 Install the DMControl simulation environment (requiring the MuJoCo physics engine version mujoco200) on any computer equipped with Ubuntu (version 16.04 or above) and the PyTorch deep learning framework, construct the agent simulation model and the image continuous control task simulation environment, and set the task target of the agent in the image continuous control task.
3.2 In the constructed image continuous control task simulation environment, set the scale of the image observation through which the agent perceives its own state and the environment state to 100×100, set the control instruction executed by the agent to be a continuous vector (such as joint rotation speed, torque and the like), and set the reward value fed back by the image continuous control task simulation environment after the agent executes a control instruction according to the task target.
Fourthly, the intelligent agent trains an intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck in the image continuous control task simulation environment constructed in the third step, and the method comprises the following steps:
4.1 initializing network parameters of a feature extraction module and a control module in the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, wherein the parameters comprise a weight matrix and a bias vector of a full connection layer, a convolution kernel of a convolution layer and a weight matrix and a bias vector of a regularization layer, and the parameters are generated by using an orthogonal initialization (a parameter initialization method of a neural network), wherein non-zero parameters are from normal distribution with a mean value of 0 and a standard deviation of 1.
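The orthogonal initialization described in step 4.1 can be sketched as follows; this is a minimal PyTorch example, and the `orthogonal_init` helper and the small network are illustrative, not the patent's exact networks:

```python
import torch
import torch.nn as nn

def orthogonal_init(module):
    """Orthogonally initialize weight matrices and convolution kernels; zero the bias vectors."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight)  # non-zero parameters drawn from N(0, 1), then orthogonalized
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Apply recursively to every layer of a (toy) network.
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
net.apply(orthogonal_init)
```

For a square weight matrix the result satisfies W Wᵀ = I, which keeps activation scales stable at the start of training.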
4.2 Initialize the storage module in the agent control system based on deep reinforcement learning and conditional entropy bottleneck: set its size to a buffer queue capable of storing A (A ≥ 10^5) pieces of trajectory data generated when the agent interacts with the image continuous control task simulation environment, and empty the buffer queue.
4.3 Initialize the number of interactions t = 0 between the agent and the image continuous control task simulation environment constructed in the third step, set the maximum number of interactions T (T is a positive integer and T ≥ 5A), set the maximum number of interactions per round E (E is a positive integer, generally 1000), and set the update frequency F (F is a positive integer, generally 2) of the target encoder network and the target evaluation networks in the agent control system based on deep reinforcement learning and conditional entropy bottleneck.
And 4.4, randomly setting the initial state of the image continuous control task simulation environment constructed in the third step and the initial state of the intelligent agent simulation model.
4.5, the perception module acquires image observation when the intelligent agent interacts with the image continuous control task simulation environment, and sends the image observation to the feature extraction module and the storage module; the characteristic extraction module receives image observation, an Encoder _1 in an Encoder network encodes the image observation to obtain a first state vector corresponding to the image observation, and the first state vector is sent to the control module; the control module receives the first state vector, the strategy network maps the first state vector into a control instruction, and the control instruction is sent to the action module and the storage module, and the method comprises the following steps:
4.5.1 The perception module obtains the image observation s_t of the t-th interaction between the agent and the image continuous control task simulation environment, and sends s_t to the feature extraction module and the storage module.
4.5.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t and sends z_t to the control module.
4.5.3 The control module receives the first state vector z_t; the policy network maps z_t into the control instruction a_t to be executed in the t-th interaction between the agent and the image continuous control task simulation environment, and sends a_t to the action module and the storage module.
4.6 The action module receives the control instruction a_t and executes a_t in the image continuous control task simulation environment.
4.7 The image continuous control task simulation environment returns, according to the reward designed in step 3.2, the reward value r_t obtained in the t-th interaction between the agent and the environment, and sends r_t to the storage module.
4.8 After the agent executes the control instruction a_t, the state of the image continuous control task simulation environment changes; the perception module obtains the image observation s_{t+1} corresponding to the changed environment state and sends s_{t+1} to the storage module.
4.9 The storage module receives s_t and s_{t+1} from the perception module, a_t from the control module, and r_t from the image continuous control task simulation environment, and combines them into the trajectory data quadruple (s_t, a_t, r_t, s_{t+1}), which is stored in the buffer queue by the following method:
4.9.1 The storage module judges whether the buffer queue already contains A pieces of trajectory data; if so, go to step 4.9.2, otherwise go to step 4.9.3.
4.9.2 The storage module removes the piece of trajectory data at the head of the buffer queue according to the first-in-first-out principle.
4.9.3 The storage module combines s_t, s_{t+1}, a_t and r_t into the trajectory data quadruple (s_t, a_t, r_t, s_{t+1}) and stores it at the tail of the buffer queue.
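Steps 4.9.1 to 4.9.3 describe a first-in-first-out buffer of quadruples; a minimal sketch (the class and method names are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of (s_t, a_t, r_t, s_{t+1}) quadruples; the oldest entry is evicted when full."""
    def __init__(self, capacity):
        # deque with maxlen drops the item at the head automatically when the tail is appended.
        self.queue = deque(maxlen=capacity)

    def store(self, s_t, a_t, r_t, s_t1):
        self.queue.append((s_t, a_t, r_t, s_t1))

    def sample(self, m):
        """Randomly select m pieces of trajectory data, as the data expansion module does."""
        return random.sample(list(self.queue), m)

    def __len__(self):
        return len(self.queue)
```

Using `deque(maxlen=...)` makes the explicit head-eviction check of step 4.9.1/4.9.2 implicit: appending to a full deque discards the oldest quadruple.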
4.10 Let t = t + 1. If t % E = 0, the number of interactions between the agent and the image continuous control task simulation environment in this round has reached E; the current round of the control task ends, and a new round is restarted from step 4.4. Otherwise, the round is not finished; go to step 4.11 to continue it.
4.11 The data expansion module judges whether the buffer queue of the storage module contains L (L is generally set to 1000) pieces of trajectory data. If so, it randomly selects M (M is generally set to 512) pieces of trajectory data from the buffer queue and forms them into the trajectory data set τ_M:
Let τ_M_m = (s_t(m), a_t(m), r_t(m), s_{t+1}(m)) represent the m-th (m ∈ [1, M]) piece of trajectory data in τ_M, and go to step 4.12 to optimize the agent control system based on deep reinforcement learning and conditional entropy bottleneck according to τ_M. If the buffer queue does not yet contain L pieces of trajectory data, go to step 4.5.1.
4.12 The data expansion module uses the random cropping method in data augmentation (from the RAD method, i.e., Reinforcement Learning with Augmented Data) to perform N data expansions in turn on the image observations of each piece of trajectory data in τ_M, obtaining M pieces of data-expanded trajectory data, and sends them to the feature extraction module and the control module; the feature extraction module and the control module receive the M pieces of data-expanded trajectory data and optimize the agent control system based on deep reinforcement learning and conditional entropy bottleneck. The method comprises the following steps:
4.12.1 Initialize the number of optimizations k = 0 and set the total number of optimizations K (K is generally set to 10).
4.12.2 The data expansion module uses the random cropping method in data augmentation to perform N data expansions in turn on the image observations of each piece of trajectory data in τ_M, obtaining M pieces of data-expanded trajectory data, and sends them to the feature extraction module and the control module. The method comprises the following steps:
4.12.2.1 initialize the track data index m =1.
4.12.2.2 Initialize the number of data expansions j = 0 and set the total number of data expansions N (N is generally set to 2).
4.12.2.3 Referring to the settings in RAD, apply the random cropping method in data augmentation to the m-th trajectory data τ_M_m = (s_t(m), a_t(m), r_t(m), s_{t+1}(m)) in τ_M: crop the image observation s_t(m) with scale 100×100 to an image observation with scale 84×84, and crop the image observation s_{t+1}(m) with scale 100×100 to an image observation with scale 84×84.
4.12.2.4 let j = j +1. If j is equal to the total data expansion times N, go to step 4.12.2.5, otherwise go to step 4.12.2.3.
4.12.2.5 The data expansion module replaces s_t(m) in τ_M_m with its N data-expanded image observations and replaces s_{t+1}(m) with its N data-expanded image observations, obtaining the trajectory data of τ_M_m after N data expansions.
4.12.2.6 If m < M, let m = m + 1 and go to step 4.12.2.2; if m = M, the expansion of the M pieces of trajectory data is finished, and the M pieces of data-expanded trajectory data τ_M_N are obtained:
4.12.2.7 The data expansion module sends τ_M_N to the feature extraction module and the control module.
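The random-cropping expansion of steps 4.12.2.1 to 4.12.2.7 can be sketched in NumPy, in the spirit of RAD; the function names are illustrative:

```python
import numpy as np

def random_crop(obs, out_size=84):
    """Randomly crop an H×W×C image observation to out_size×out_size, as in RAD."""
    h, w = obs.shape[:2]
    top = np.random.randint(0, h - out_size + 1)
    left = np.random.randint(0, w - out_size + 1)
    return obs[top:top + out_size, left:left + out_size]

def augment(obs, n=2):
    """Produce the N independently cropped views of one observation (N is generally 2)."""
    return [random_crop(obs) for _ in range(n)]
```

Each of the N crops is drawn independently, so the same 100×100 observation yields N slightly different 84×84 views for the feature extraction module.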
4.12.3 The feature extraction module receives τ_M_N from the data expansion module; for the M pieces of trajectory data in τ_M_N, it in turn uses the gradient descent method (a common optimization method in machine learning) to minimize the feature extraction module loss function shown in formula (2), optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network. The method comprises the following steps:
4.12.3.1 The encoder network, the target encoder network, the single-view predictor network and the multi-view predictor network receive τ_M_N from the data expansion module.
4.12.3.2 initialize the track data index m =1.
4.12.3.3 Encoder_1 in the encoder network encodes the N data-expanded image observations in the m-th trajectory data of τ_M_N into N second state vectors and sends them to Encoder_2. Encoder_2 obtains, through its fully-connected layer, the mean and variance of the Gaussian distribution corresponding to each second state vector. Encoder_2 then performs the reparameterization operation on each mean and variance to obtain N reparameterized state vectors, and sends them to the feature fusion network and the single-view predictor network.
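The reparameterization operation performed by Encoder_2 can be sketched as follows; this is the standard reparameterization trick, and the function name is an assumption:

```python
import torch

def reparameterize(mu, sigma):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Writing the sample this way keeps it differentiable with respect to the
    mean and standard deviation, so the loss in formula (2) can be minimized
    by gradient descent through the sampling step."""
    eps = torch.randn_like(mu)  # the Gaussian noise ξ of step 2.2
    return mu + sigma * eps
```

With sigma = 0 the sample collapses to the mean, which is a convenient sanity check on the implementation.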
4.12.3.4 The target encoder network encodes the N data-expanded next image observations in the m-th trajectory data of τ_M_N into N target state vectors and sends them to the feature fusion network.
4.12.3.5 The feature fusion network receives the N reparameterized state vectors from the encoder network; Feature_1 performs feature fusion on them to obtain the state fusion vector and sends it to Feature_2; Feature_2 performs the reparameterization operation on the state fusion vector to obtain the reparameterized state fusion vector and sends it to the multi-view predictor network.
4.12.3.6 The feature fusion network receives the N target state vectors from the target encoder network; Feature_1 performs feature fusion on them to obtain the target state fusion vector.
4.12.3.7 The single-view predictor network receives the N reparameterized state vectors from the encoder network and the control instruction a_t(m) from the m-th trajectory data of τ_M_N. For each first splicing vector formed from a reparameterized state vector and a_t(m), it performs branch dynamics equation prediction and reward function equation prediction, obtaining the predicted target state vector and the first predicted reward value.
4.12.3.8 The multi-view predictor network receives the reparameterized state fusion vector from the feature fusion network and the control instruction a_t(m) from the m-th trajectory data of τ_M_N. For the second splicing vector formed from the reparameterized state fusion vector and a_t(m), it performs branch dynamics equation prediction and reward function equation prediction, obtaining the predicted target state fusion vector and the second predicted reward value.
4.12.3.9 The feature extraction module uses the gradient descent method: based on the means and variances obtained in step 4.12.3.3, the target state vectors obtained in step 4.12.3.4, the target state fusion vector obtained in step 4.12.3.6, the predicted target state vectors and first predicted reward values obtained in step 4.12.3.7, the predicted target state fusion vector and second predicted reward value obtained in step 4.12.3.8, and r_t(m), it minimizes the optimization loss function in formula (2) through gradient back-propagation, optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network.
4.12.3.10 If m < M, let m = m + 1 and go to step 4.12.3.3; otherwise, go to step 4.12.4.
4.12.4 Encoder_1 of the encoder network in the feature extraction module encodes the image observations, after the first data expansion, of the M pieces of trajectory data in τ_M_N into M second state vectors and sends them to the control module.
4.12.5 The target encoder network in the feature extraction module encodes the image observations, after the first data expansion, of the M pieces of trajectory data in τ_M_N into M target state vectors and sends them to the control module.
4.12.6 The control module receives τ_M_N from the data expansion module, and the M second state vectors and M target state vectors from the feature extraction module. For the M pieces of trajectory data in τ_M_N and the corresponding second and target state vectors, it in turn uses the gradient descent method to minimize the loss functions shown in formula (3), formula (4) and formula (5), optimizing the evaluation networks and the policy network. The method comprises the following steps:
4.12.6.1 The policy network receives the second state vectors and target state vectors from the feature extraction module; Critic_1 and Critic_2 receive the second state vectors from the feature extraction module and τ_M_N from the data expansion module; TarCritic_1 and TarCritic_2 receive the target state vectors from the feature extraction module.
4.12.6.2 initialize the track data index m =1.
4.12.6.3 The policy network takes the m-th second state vector and the m-th target state vector and performs the following operations: it performs action mapping on the m-th second state vector to obtain the control instruction a(m) and sends a(m) to Critic_1 and Critic_2; it performs action mapping on the m-th target state vector to obtain the control instruction a'(m) and sends a'(m) to TarCritic_1 and TarCritic_2.
4.12.6.4 Critic_1 receives the control instruction a(m) from the policy network, takes the m-th second state vector, and takes the m-th trajectory data from τ_M_N. It performs the following operations: it performs state-action value estimation on the third splicing vector formed from the m-th second state vector and a_t(m) to obtain the first state-action value, and performs state-action value estimation on the fourth splicing vector formed from the m-th second state vector and a(m) to obtain the second state-action value.
4.12.6.5 Critic_2 receives the control instruction a(m) from the policy network, takes the m-th second state vector, and takes the m-th trajectory data from τ_M_N. It performs the following operations: it performs state-action value estimation on the third splicing vector formed from the m-th second state vector and a_t(m) to obtain the third state-action value, and performs state-action value estimation on the fourth splicing vector formed from the m-th second state vector and a(m) to obtain the fourth state-action value.
4.12.6.6 TarCritic_1 receives the control instruction a'(m) from the policy network and takes the m-th target state vector. It performs state-action value estimation on the target splicing vector formed from the m-th target state vector and a'(m) to obtain the first target state-action value.
4.12.6.7 TarCritic_2 receives the control instruction a'(m) from the policy network and takes the m-th target state vector. It performs state-action value estimation on the target splicing vector formed from the m-th target state vector and a'(m) to obtain the second target state-action value.
4.12.6.8 The control module uses the gradient descent method to minimize the loss function in formula (3), optimizing Critic_1 and Critic_2 through gradient back-propagation.
Wherein: θ_i denotes the parameters of the i-th evaluation network, θ'_i denotes the parameters of the i-th target evaluation network, φ denotes the parameters of the policy network, and i = 1 or 2 indexes the two evaluation networks and two target evaluation networks in the control module. The state value of the target state vector (the state value is a basic concept in reinforcement learning: the expected reward the agent can obtain in the current state) is computed from the smaller of the first target state-action value and the second target state-action value. α is a temperature coefficient factor (its initial value is set to 0.1 and it is dynamically adjusted along with the policy network during optimization; see "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", Tuomas Haarnoja et al., Proceedings of the International Conference on Machine Learning (ICML), 2018), and γ is the reward discount factor (γ is generally set to 0.99).
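The Bellman target described in the paragraph above — reward plus the discounted soft state value, where the soft value uses the smaller of the two target state-action values minus the temperature-weighted log-probability — can be sketched as follows. Since the patent's formula (3) is not reproduced here, this is a hedged reconstruction in the standard soft actor-critic form, with default values matching the text:

```python
import torch

def critic_target(r, q1_targ, q2_targ, logp_a_next, alpha=0.1, gamma=0.99):
    """Bellman target for the critic loss of formula (3).

    soft state value = min(Q'_1, Q'_2) - alpha * log pi(a' | s'),
    target          = r + gamma * soft state value."""
    soft_value = torch.min(q1_targ, q2_targ) - alpha * logp_a_next
    return r + gamma * soft_value
```

Each critic is then regressed toward this target (e.g. with a mean-squared error), while gradients are stopped through the target quantities.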
4.12.6.9 The control module uses the gradient descent method to minimize the loss function shown in formula (4), optimizing the policy network through gradient back-propagation.
Wherein: the policy term is the distribution of the control instruction a(m) output by the policy network given the m-th second state vector, and the value term is the smaller of the second state-action value and the fourth state-action value.
4.12.6.10 The control module uses the gradient descent method to minimize the loss function shown in formula (5), optimizing the temperature coefficient factor through gradient back-propagation.
Wherein: the target entropy of the policy network is set to the negative of the dimension of the agent control instruction a(m).
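The policy loss of formula (4) and the temperature loss of formula (5), as described by the surrounding text, can be sketched in the standard soft actor-critic form. This is a hedged reconstruction (the patent's formula images are not reproduced here), and the function names are assumptions:

```python
import torch

def policy_loss(q1, q2, logp_a, alpha=0.1):
    """Formula (4): minimize alpha * log pi(a|s) - min(Q1, Q2),
    i.e. maximize the smaller state-action value plus the policy entropy."""
    return (alpha * logp_a - torch.min(q1, q2)).mean()

def temperature_loss(log_alpha, logp_a, target_entropy):
    """Formula (5): adjust alpha so the policy entropy tracks the target entropy
    (set to the negative of the control-instruction dimension)."""
    return (-log_alpha.exp() * (logp_a + target_entropy).detach()).mean()
```

Optimizing `log_alpha` rather than `alpha` directly keeps the temperature positive throughout training.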
4.12.6.11 If m < M, let m = m + 1 and go to step 4.12.6.3; otherwise, go to step 4.12.7.
4.12.7 Judge whether t % F equals zero. If so, the parameters of the target encoder network and the target evaluation networks in the agent control system based on deep reinforcement learning and conditional entropy bottleneck need to be updated; go to step 4.12.8. Otherwise, go to step 4.12.9.
4.12.8 The agent control system based on deep reinforcement learning and conditional entropy bottleneck updates the parameters of the target encoder network and of the two target evaluation networks according to formula (6) and formula (7), using the exponential moving average (a common method for updating neural network parameters).
Wherein: τ_p and τ_Q are the hyper-parameters for updating the target encoder network and the target evaluation networks (τ_p is generally set to 0.05 and τ_Q to 0.01), ζ denotes the parameters of the encoder network, and ζ' denotes the parameters of the target encoder network.
4.12.9 Let k = k + 1. If k equals the total number of optimizations K, the optimization is completed; go to step 4.13. Otherwise, go to step 4.12.2.1.
4.13 The agent control system based on deep reinforcement learning and conditional entropy bottleneck judges whether t equals the maximum number of interactions T. If so, the training is finished; go to step 4.14. Otherwise, go to step 4.5.1.
4.14 The agent control system based on deep reinforcement learning and conditional entropy bottleneck saves the network parameters of the feature extraction module and the control module optimized in step 4.12 into a .pt file (a .pt file can be generated directly by the PyTorch deep learning framework).
Fifthly, the feature extraction module and the control module load the .pt file obtained in step 4.14 and initialize their parameters with the parameters in the file, obtaining the trained agent control system based on deep reinforcement learning and conditional entropy bottleneck.
And sixthly, deploying the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck on the intelligent agent in the real environment. In an intelligent agent control system based on deep reinforcement learning and conditional entropy bottlenecks, other networks except an encoder network in a feature extraction module are all related to optimization of the encoder network, and other networks except a strategy network in a control module are all related to optimization of the strategy network. Thus, after the control system is deployed on the agent, only the encoder network in the feature extraction module and the policy network in the control module are relevant to the actions of the agent.
Seventhly, the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck assists the intelligent agent to complete the image continuous control task, and the method comprises the following steps:
7.1 Initialize the number of actions t = 0 of the agent and set the maximum number of actions T_0 of the agent in the real environment (T_0 is a positive integer, generally 1000).
7.2 the perception module obtains the image observation of the real environment and sends the image observation to the feature extraction module; the characteristic extraction module receives image observation, an Encoder _1 in an Encoder network encodes the image observation to obtain a first state vector corresponding to the image observation, and the first state vector is sent to the control module; the control module receives the first state vector, the strategy network maps the first state vector into a control instruction, and the control instruction is sent to the action module; the action module receives the control instruction and executes the control instruction in the real environment, and the method comprises the following steps:
7.2.1 The perception module obtains the image observation s_t of the real environment at the agent's t-th action and sends s_t to the feature extraction module.
7.2.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t according to the method described in step 4.5.2 and sends z_t to the control module.
7.2.3 The control module receives z_t; the policy network maps z_t into the control instruction a_t of the agent's t-th action according to the method described in step 4.5.3 and sends a_t to the action module.
7.2.4 The action module receives the control instruction a_t and executes a_t in the real environment.
7.3 Let t = t + 1. If t equals the maximum number of actions T_0, go to the eighth step; otherwise, go to step 7.2.1.
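The deployment-time decision loop of steps 7.2 to 7.3 — in which only Encoder_1 and the policy network are involved — can be sketched as follows; `act`, `control_loop` and `env_step` are illustrative names:

```python
import torch

@torch.no_grad()
def act(encoder, policy, obs):
    """One decision: image observation -> first state vector z_t -> control instruction a_t."""
    z = encoder(obs)
    return policy(z)

def control_loop(encoder, policy, env_step, obs, max_steps):
    """Run the agent for at most max_steps actions.

    env_step stands in for the real environment: it executes a control
    instruction and returns the next image observation."""
    actions = []
    for _ in range(max_steps):
        a = act(encoder, policy, obs)
        actions.append(a)
        obs = env_step(a)
    return actions
```

Because no critic, predictor, or target network is evaluated at deployment time, each decision costs only two forward passes, which is what makes the real-time claim in the effects section plausible.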
And eighthly, finishing.
The invention can achieve the following technical effects:
1. the invention provides an intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck, and an intelligent agent can realize better control in an image continuous control task by using the intelligent agent control method. The method provides reference for the design of the control scheme when the intelligent agent is deployed in the real world, and has high practical application value.
2. The method decouples perception and control in deep reinforcement learning. In the second step, the objective function of the feature extraction module is designed on the basis of the conditional entropy bottleneck, and the corresponding loss function is obtained through the variational inference technique. By optimizing this loss function, the agent can ignore task-irrelevant visual information when encoding image observations in the image continuous control task, obtaining a robust state vector. In the fourth step, the control module is trained on the robust state vectors produced by the feature extraction module, which reduces the complexity of the control task and improves the accuracy of the agent's control policy.
3. After the trained agent control system based on deep reinforcement learning and conditional entropy bottleneck is deployed on the agent, the agent acts in the seventh step only according to the encoder network in the feature extraction module and the policy network in the control module; the decision of each control instruction is therefore simple and meets the real-time requirement of agent control.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a schematic diagram of the logic structure of the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck;
FIG. 3 is a schematic diagram of the image continuous control task simulation environment constructed in the third step of the invention based on a real environment: the cheetah-robot running scene. The cheetah robot perceives the environment in the form of images through the perception module, and the task target is for the cheetah robot to run fast in the scene;
FIG. 4 is a schematic comparison between the training results of the invention and of the deep reinforcement learning control method DBC of the background art, proposed in "Learning Invariant Representations for Reinforcement Learning without Reconstruction" (Amy Zhang et al., Proceedings of the International Conference on Learning Representations (ICLR), 2021);
fig. 5 is a schematic diagram of the cheetah robot acting according to the control instructions produced by the control module in the cheetah-robot running scene, constructed from the real environment shown in fig. 3;
Detailed Description
FIG. 1 is a general flow diagram of the present invention; as shown in fig. 1, the present invention comprises the steps of:
Firstly, an agent control system based on deep reinforcement learning and conditional entropy bottleneck is constructed. The system is shown in fig. 2 and comprises a perception module, an action module, a storage module, a data expansion module, a feature extraction module and a control module.
The perception module is an image sensor (such as a depth camera) connected with the feature extraction module and the storage module. The perception module acquires an image observation (an RGB image) containing the agent state (information about the agent) and the environment state (information other than the agent) from the image continuous control task environment, and sends the image observation to the feature extraction module and the storage module.
The action module is an actuator (such as an engine or a steering gear) for the agent's control instructions. It is connected with the control module, receives control instructions from the control module, and acts in the image continuous control task environment according to those instructions.
The storage module is a memory containing more than 1 GB of available space. It is connected with the perception module, the control module and the data expansion module, receives image observations from the perception module, control instructions from the control module, and rewards from the image continuous control task environment, and combines the image observations, control instructions and rewards into trajectory data (track data for short) of the agent's interactions with the image continuous control task environment. Trajectory data is stored as quadruples (s_t, a_t, r_t, s_t+1), wherein: s_t is the image observation received from the perception module at the agent's t-th interaction with the image continuous control task environment; a_t is the control instruction from the control module executed at the t-th interaction; r_t is the reward value fed back by the environment for control instruction a_t at the t-th interaction; s_t+1 is the image observation received from the perception module after the environment state changes due to the t-th interaction (also called the image observation at the (t+1)-th interaction).
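As a minimal sketch of the quadruple trajectory data described above, the record can be modeled as a named tuple; the class and function names below are illustrative, not from the patent:

```python
from typing import NamedTuple, Any

class Transition(NamedTuple):
    """One interaction with the image continuous control task environment.

    Fields follow the quadruple (s_t, a_t, r_t, s_t+1) stored by the
    storage module; `s_t1` stands in for s_t+1.
    """
    s_t: Any      # image observation before the action
    a_t: Any      # control instruction executed at the t-th interaction
    r_t: float    # reward value fed back by the environment
    s_t1: Any     # image observation after the environment state changes

def make_transition(s_t, a_t, r_t, s_t1) -> Transition:
    # Normalize the reward to float so downstream arithmetic is uniform.
    return Transition(s_t, a_t, float(r_t), s_t1)
```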
The data expansion module is connected with the storage module, the feature extraction module and the control module. It randomly selects, from the storage module, trajectory data τ = (s_t, a_t, r_t, s_t+1) required for training the agent control system based on deep reinforcement learning and conditional entropy bottleneck, performs N rounds of data expansion on s_t and s_t+1 in τ to obtain the data-expanded trajectory data τ^N (j ∈ [1, N], where j indexes the image observations after the N rounds of data expansion), and sends τ^N to the feature extraction module and the control module.
The feature extraction module is connected with the sensing module, the data expansion module and the control module. The characteristic extraction module consists of a coder network, a target coder network, a characteristic fusion network, a single-view predictor network and a multi-view predictor network.
The encoder network consists of a first encoder network Encoder_1 and a second encoder network Encoder_2, and is connected with the perception module, the data expansion module, the control module, the feature fusion network and the single-view predictor network. Encoder_1 consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer, and is connected with the perception module, the data expansion module, the control module and Encoder_2; Encoder_2 consists of 3 fully connected layers and is connected with Encoder_1, the feature fusion network and the single-view predictor network. When the agent interacts with the image continuous control task environment, Encoder_1 receives s_t from the perception module. The first, second, third and fourth convolutional layers of Encoder_1 apply 3×3 convolution kernels to s_t in sequence, and the four-times-convolved s_t is sent to the fully connected layer of Encoder_1; the fully connected layer performs a full-connection operation on it to obtain the corresponding state vector, which is sent to the regularization layer of Encoder_1; the regularization layer regularizes the fully connected state vector, takes the result as the first state vector z_t, and sends z_t to the control module.
When training the agent control system based on deep reinforcement learning and conditional entropy bottleneck, Encoder_1 receives the data-expanded trajectory data τ^N from the data expansion module. The first, second, third and fourth convolutional layers of Encoder_1 apply 3×3 convolution kernels in sequence to the expanded image observations in τ^N, and the four-times-convolved observations are sent to the fully connected layer of Encoder_1; the fully connected layer performs a full-connection operation on them to obtain the N corresponding state vectors, which are sent to the regularization layer of Encoder_1; the regularization layer regularizes the N fully connected state vectors to obtain N regularized state vectors, which are taken as the second state vectors. The first of the second state vectors is sent to the control module, and the N second state vectors are sent to Encoder_2. The first, second and third fully connected layers of Encoder_2 in turn perform full-connection operations on the received second state vectors to obtain the means and variances of the corresponding Gaussian distributions; a reparameterization operation (from "Auto-Encoding Variational Bayes", published by Diederik P. Kingma and Max Welling in the ICLR proceedings, 2014) is applied to the means and variances to obtain N reparameterized state vectors, which are sent to the feature fusion network and the single-view predictor network.
The target encoder network is connected with the data expansion module, the control module and the feature fusion network, and consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer. During training of the agent control system based on deep reinforcement learning and conditional entropy bottleneck, overly fast updates of the encoder network parameters make the training process oscillate and become unstable; the target encoder network is therefore introduced to improve training stability. The target encoder network receives τ^N from the data expansion module; its first, second, third and fourth convolutional layers apply 3×3 convolution kernels in sequence to the expanded image observations in τ^N, and the four-times-convolved observations are sent to the fully connected layer; the fully connected layer performs a full-connection operation on them to obtain the N corresponding target state vectors, which are sent to the regularization layer; the regularization layer regularizes the N fully connected target state vectors to obtain N regularized target state vectors, which are taken as the target state vectors. The first target state vector is sent to the control module, and the N target state vectors are sent to the feature fusion network.
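The patent specifies only an update frequency F for the target networks (step 4.3); one common way to realize such a slow-moving target is a soft (Polyak-averaged) update, sketched below on flat parameter lists. The coefficient `tau` is an assumption, not a value given in the patent:

```python
def soft_update(target_params, online_params, tau=0.05):
    """Polyak-averaged target update: target <- (1 - tau) * target + tau * online.

    `target_params` / `online_params`: flat lists of floats standing in for
    network parameters. A small `tau` keeps the target encoder changing
    slowly, which is the stabilizing effect the patent attributes to it.
    """
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

In practice this would be applied every F interactions to each parameter tensor of the target encoder and target evaluation networks.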
The feature fusion network is connected with the encoder network, the target encoder network and the multi-view predictor network, and consists of a first fusion network Feature_1 and a second fusion network Feature_2, each composed of 3 fully connected layers. Feature_1 is connected with the encoder network, the target encoder network and Feature_2; it receives the N reparameterized state vectors from the encoder network and the N target state vectors from the target encoder network. The first, second and third fully connected layers of Feature_1 in turn perform full-connection operations on the reparameterized state vectors and splice the results into a state fusion vector, which is sent to Feature_2; likewise, the three fully connected layers of Feature_1 perform full-connection operations on the target state vectors and splice the results into a target state fusion vector. Feature_2 is connected with Feature_1 and the multi-view predictor network; it receives the state fusion vector from Feature_1, and its first, second and third fully connected layers in turn apply a reparameterization operation to obtain the reparameterized state fusion vector, which is sent to the multi-view predictor network.
The single-view predictor network is connected with the data expansion module and the encoder network, and consists of 3 fully connected layers. It receives the data-expanded trajectory data τ^N from the data expansion module and the reparameterized state vectors from the encoder network. The first, second and third fully connected layers of the single-view predictor network in turn perform full-connection operations on the first spliced vector, formed from a reparameterized state vector and the control instruction a_t in τ^N, mapping it to a predicted target state vector and a first predicted reward value, thereby implementing prediction of the transition dynamics equation and the reward function equation (basic concepts in reinforcement learning; see the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto); the j-th predicted target state vector and the j-th first predicted reward value correspond to the j-th expanded view.
The multi-view predictor network is connected with the data expansion module and the feature fusion network, and consists of 3 fully connected layers. It receives the data-expanded trajectory data τ^N from the data expansion module and the reparameterized state fusion vector from the feature fusion network, forms a second spliced vector from the reparameterized state fusion vector and the control instruction a_t in τ^N, and its first, second and third fully connected layers in turn perform full-connection operations on the second spliced vector, mapping it to a predicted target state fusion vector and a second predicted reward value, thereby realizing prediction of the transition dynamics equation and the reward function equation.
The control module is connected with the feature extraction module, the data expansion module and the action module, and consists of two evaluation networks (a first evaluation network Critic_1 and a second evaluation network Critic_2), two target evaluation networks (a first target evaluation network TarCritic_1 and a second target evaluation network TarCritic_2) and a policy network. Two evaluation networks and two target evaluation networks are used to prevent the over-estimation problem that arises when a single evaluation network or a single target evaluation network judges the quality of the agent's control instructions.
Critic_1 and Critic_2 are connected with the feature extraction module, the data expansion module and the policy network, and each consists of three fully connected layers. Both receive the first of the second state vectors from the feature extraction module, the data-expanded trajectory data τ^N from the data expansion module, and the control instruction a from the policy network, and evaluate the quality of the control instruction a_t in τ^N and of a from the policy network. The first, second and third fully connected layers of Critic_1 in turn perform full-connection operations on the third spliced vector, formed from the state vector and a_t, mapping it to a first state-action value (the state-action value is a basic concept in reinforcement learning: the expected reward obtainable after the agent executes the selected control instruction in the current state); likewise, the three fully connected layers of Critic_1 map the fourth spliced vector, formed from the state vector and a, to a second state-action value. The three fully connected layers of Critic_2 map the third spliced vector to a third state-action value and the fourth spliced vector to a fourth state-action value.
TarCritic_1 and TarCritic_2 are both connected with the feature extraction module and the policy network, each consists of three fully connected layers, and both receive the first target state vector from the feature extraction module and the control instruction a' from the policy network, and evaluate the quality of a'. The first, second and third fully connected layers of TarCritic_1 in turn perform full-connection operations on the target spliced vector, formed from the first target state vector and a', mapping it to a first target state-action value; the three fully connected layers of TarCritic_2 map the same target spliced vector to a second target state-action value.
The policy network is connected with the feature extraction module, the action module, the storage module, Critic_1, Critic_2, TarCritic_1 and TarCritic_2, and consists of three fully connected layers. When the agent interacts with the image continuous control task environment, it receives the first state vector z_t from the feature extraction module; its first, second and third fully connected layers in turn perform full-connection operations on z_t, mapping it to the control instruction a_t, which is sent to the action module and the storage module. When training the agent control system based on deep reinforcement learning and conditional entropy bottleneck, it receives the first of the second state vectors and the first target state vector from the feature extraction module; the three fully connected layers map the former to the control instruction a, which is sent to Critic_1 and Critic_2, and map the latter to the control instruction a', which is sent to TarCritic_1 and TarCritic_2.
Secondly, the objective function of the feature extraction module is constructed based on the conditional entropy bottleneck, and the optimization loss function of the feature extraction module is obtained through variational inference (from "Deep Variational Information Bottleneck", an article published by Alexander A. Alemi et al. in the ICLR proceedings, 2017). The method comprises the following steps:
2.1 To learn the state vector corresponding to an image observation, the feature extraction module objective function shown in formula (1) is designed on the basis of the conditional entropy bottleneck. The conditional entropy bottleneck (from "The Conditional Entropy Bottleneck", published by Ian Fischer in the journal Entropy, 2020) is an information-theoretic method that extracts features Z of given data X to predict a label Y; it is desirable that the information in Z be maximally correlated with Y.
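For reference, the general conditional entropy bottleneck objective, as given in Fischer's paper (in that paper's notation, not the patent's formula (1)), can be written as:

```latex
\mathrm{CEB} \;\equiv\; \min_{Z}\; \beta\, I(X;\, Z \mid Y)\;-\;I(Y;\, Z)
```

Minimizing the conditional mutual information $I(X; Z \mid Y)$ discards the information in $X$ that is not predictive of $Y$, while maximizing $I(Y; Z)$ preserves the task-relevant information; this is the mechanism by which the feature extraction module learns to ignore task-irrelevant visual content.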
Wherein: Object denotes the objective of the feature extraction module; the expanded image observations are those obtained by the data expansion module after the j-th data expansion of the image observations at the t-th and (t+1)-th interactions; the reparameterized state vector is obtained by inputting the expanded observation into the encoder network; the target state vector is obtained by inputting the expanded next observation into the target encoder network; the reparameterized state fusion vector is obtained by inputting the N reparameterized state vectors into the feature fusion network; the target state fusion vector is obtained by inputting the N target state vectors into the feature fusion network; β_j is a regularization factor with a suggested value of 1e-4 to 1e-2; and the remaining terms are conditional mutual information items.
2.2 Applying variational inference to formula (1) yields the feature extraction module optimization loss function shown in formula (2):
wherein: m is the number of the track data randomly selected from the storage module by the data expansion module,is a Gaussian distribution Is the average value calculated by Encoder _2 in the Encoder network,is the variance calculated by Encoder _2 in the Encoder network),is a distribution of the variation components,andis Gaussian noise inAndand the method is used in a re-parameterization process.Is representative of xi j In the expectation that the position of the target is not changed,is representative of xi 1 ,ξ 2 ,…,ξ N And xi 1N The desired product of the two or more of the two,show thatAnd a t Item j of predicted target state vector obtained by inputting to single-view predictor networkAndcross entropy loss between and item j of the first predicted reward valueAnd r t The cross-entropy loss between the two,show thatAnd a t Predicted target state fusion vector obtained by inputting into multi-view predictor networkAndcross entropy loss between and secondary predicted reward valueAnd r t Cross entropy loss between.
Thirdly, an image continuous control task simulation environment is constructed in the DeepMind Control Suite (DMControl) simulation environment (DMControl is backed by the MuJoCo (Multi-Joint dynamics with Contact) physics engine and is widely used in robotics research), in preparation for training the agent control system based on deep reinforcement learning and conditional entropy bottleneck. The method comprises the following steps:
3.1 Install the DMControl simulation environment (requiring MuJoCo physics engine version mujoco200) on any computer equipped with Ubuntu (version 16.04 or above) and the PyTorch deep learning framework, construct the agent simulation model and the image continuous control task simulation environment, and set the agent's task goal in the image continuous control task.
3.2 In the constructed image continuous control task simulation environment, the scale of the image observations through which the agent perceives its own state and the environment state is set to 100×100; the control instruction executed by the agent is set to a continuous vector (such as joint rotation speed or torque); and the reward value fed back by the simulation environment after the agent executes a control instruction is set according to the task goal.
Fig. 3 shows an embodiment of the image continuous control task simulation environment constructed in the third step: a cheetah-robot running scene built from a real environment, with its environment rendering shown. In this embodiment, the cheetah robot uses the perception module to obtain image observations of scale 100×100, and its control instruction is a 6-dimensional continuous vector. The cheetah robot acts in the scene according to the control instructions, and the task goal is for the robot to run fast. The reward value r for the cheetah robot acting in the scene is linearly related to its moving speed v (maximum 10 m/s), namely: r = max(0, min(v/10, 1)).
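The clipped-linear reward r = max(0, min(v/10, 1)) from the embodiment above can be sketched directly; the function name is illustrative:

```python
def cheetah_reward(v, v_max=10.0):
    """Reward for the cheetah-robot running scene: linear in forward
    speed v and clipped to [0, 1], i.e. r = max(0, min(v / v_max, 1)).

    v_max = 10 m/s is the maximum moving speed stated in the embodiment.
    """
    return max(0.0, min(v / v_max, 1.0))
```

Speeds at or above 10 m/s saturate the reward at 1, and backward motion (negative v) earns nothing, so the agent is rewarded only for running forward quickly.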
Fourthly, the agent trains the agent control system based on deep reinforcement learning and conditional entropy bottleneck in the image continuous control task simulation environment built in the third step, as follows:
4.1 Initialize the network parameters of the feature extraction module and the control module in the agent control system based on deep reinforcement learning and conditional entropy bottleneck, including the weight matrices and bias vectors of the fully connected layers, the convolution kernels of the convolutional layers, and the weight matrices and bias vectors of the regularization layers. The parameters are generated with orthogonal initialization (a neural-network parameter initialization method), where the non-zero parameters are drawn from a normal distribution with mean 0 and standard deviation 1.
4.2 Initialize the storage module in the agent control system based on deep reinforcement learning and conditional entropy bottleneck: set the storage module to be a buffer queue capable of storing A (A ≥ 10^5) pieces of trajectory data from the agent's interactions with the image continuous control task simulation environment, and empty the buffer queue.
4.3 Initialize the number of interactions t = 0 between the agent and the image continuous control task simulation environment built in the third step; set the agent's maximum number of interactions T (T a positive integer, T ≥ 5A), the maximum number of interactions per episode E between the agent and the simulation environment (E a positive integer, generally 1000), and the update frequency F of the target encoder network and the target evaluation networks in the agent control system based on deep reinforcement learning and conditional entropy bottleneck (F a positive integer, generally 2).
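The counters from step 4.3 imply a simple schedule: an episode ends every E interactions and the target networks are updated every F interactions. The sketch below assumes interactions are counted from 1 and only tallies how often each event fires; names are illustrative:

```python
def schedule(T, E, F):
    """Count episode resets and target-network updates over T interactions.

    An episode ends when t % E == 0 (step 4.10); the target encoder and
    target evaluation networks are refreshed every F interactions.
    Returns (num_episode_resets, num_target_updates).
    """
    resets, target_updates = 0, 0
    for t in range(1, T + 1):
        if t % E == 0:
            resets += 1          # episode boundary: restart from step 4.4
        if t % F == 0:
            target_updates += 1  # refresh target networks
    return resets, target_updates
```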
And 4.4, randomly setting the initial state of the image continuous control task simulation environment constructed in the third step and the initial state of the intelligent agent simulation model.
4.5, the perception module acquires image observation when the intelligent agent interacts with the image continuous control task simulation environment, and sends the image observation to the feature extraction module and the storage module; the feature extraction module receives the image observation, an Encoder _1 in an Encoder network encodes the image observation to obtain a first state vector corresponding to the image observation, and the first state vector is sent to the control module; the control module receives the first state vector, the strategy network maps the first state vector into a control instruction, and the control instruction is sent to the action module and the storage module, and the method comprises the following steps:
4.5.1 The perception module obtains the image observation s_t at the agent's t-th interaction with the image continuous control task simulation environment and sends s_t to the feature extraction module and the storage module.
4.5.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t and sends z_t to the control module.
4.5.3 The control module receives the first state vector z_t; the policy network maps z_t to the control instruction a_t to be executed at the agent's t-th interaction with the image continuous control task simulation environment, and sends a_t to the action module and the storage module.
4.6 The action module receives the control instruction a_t and executes a_t in the image continuous control task simulation environment.
4.7 The image continuous control task simulation environment returns the reward value r_t obtained at the agent's t-th interaction, according to the reward designed in step 3.2, and sends r_t to the storage module.
4.8 Because the agent executed the control instruction a_t, the state of the image continuous control task simulation environment changes; the perception module obtains the image observation s_t+1 corresponding to the changed environment state and sends s_t+1 to the storage module.
4.9 The storage module receives s_t and s_t+1 from the perception module, a_t from the control module, and r_t from the image continuous control task simulation environment, and combines them into the quadruple trajectory data (s_t, a_t, r_t, s_t+1), which is stored in the buffer queue as follows:
4.9.1 The storage module judges whether the buffer queue already contains A pieces of trajectory data; if so, go to step 4.9.2; otherwise, go to step 4.9.3.
4.9.2 The storage module removes the piece of trajectory data at the head of the buffer queue according to the first-in first-out principle.
4.9.3 The storage module combines s_t, s_t+1, a_t and r_t into the quadruple trajectory data (s_t, a_t, r_t, s_t+1) and stores it at the tail of the buffer queue.
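The bounded first-in first-out buffer queue of steps 4.9.1 to 4.9.3 can be sketched with a `deque` whose `maxlen` drops the head automatically when the queue is full; the class name is illustrative:

```python
from collections import deque

class ReplayBuffer:
    """Bounded FIFO buffer queue for quadruple trajectory data.

    `capacity` plays the role of A in step 4.2. With deque(maxlen=capacity),
    appending to a full queue silently evicts the head element, which is
    exactly the first-in first-out rule of step 4.9.2.
    """
    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)

    def store(self, s_t, a_t, r_t, s_t1):
        # Step 4.9.3: append the quadruple at the tail of the queue.
        self.queue.append((s_t, a_t, r_t, s_t1))

    def __len__(self):
        return len(self.queue)
```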
4.10 Let t = t + 1. If t % E = 0, the number of interactions between the agent and the image continuous control task simulation environment in this episode has reached E; the episode's control task ends, and a new episode is restarted from step 4.4. Otherwise, the episode is not finished; go to step 4.11 to continue it.
4.11 The data expansion module judges whether the buffer queue of the storage module contains L pieces of trajectory data (L is generally set to 1000). If so, it randomly selects M pieces of trajectory data (M is generally set to 512) from the buffer queue and forms them into a trajectory data set τ_M; let τ_M_m = (s_t(m), a_t(m), r_t(m), s_t+1(m)) denote the m-th (m ∈ [1, M]) piece of trajectory data in τ_M; go to step 4.12 and optimize the agent control system based on deep reinforcement learning and conditional entropy bottleneck according to τ_M. If the buffer queue does not yet contain L pieces of trajectory data, go to step 4.5.1.
4.12 The data expansion module uses random cropping (a data augmentation method from RAD) to perform N rounds of data expansion in turn on the image observations of each piece of trajectory data in τ_M, obtaining M pieces of data-expanded trajectory data, which are sent to the feature extraction module and the control module; the feature extraction module and the control module receive the M pieces of data-expanded trajectory data and optimize the agent control system based on deep reinforcement learning and conditional entropy bottleneck, as follows:
4.12.1 Initialize the optimization count k = 0 and set the total number of optimizations K (typically K is set to 10).
4.12.2 The data expansion module uses random cropping in data augmentation to perform N rounds of data expansion in turn on the image observations of each piece of trajectory data in τ_M, obtaining M pieces of data-expanded trajectory data, and sends them to the feature extraction module and the control module, as follows:
4.12.2.1 initialize the track data index m =1.
4.12.2.2 Initialize the number of data expansions j = 0 and set the total number of data expansions N (N is generally set to 2).
4.12.2.3 Following the settings in RAD, random cropping in data augmentation is used to crop the 100×100-scale image observations s_t(m) and s_t+1(m) in the m-th trajectory data τ_M_m = (s_t(m), a_t(m), r_t(m), s_t+1(m)) of τ_M to scale 84×84.
4.12.2.4 let j = j +1. If j is equal to the total data expansion times N, go to step 4.12.2.5, otherwise go to step 4.12.2.3.
4.12.2.5 The data expansion module replaces s_t(m) in τ_M_m with its N data-expanded image observations, and s_t+1(m) with its N data-expanded image observations, obtaining the trajectory data of τ_M_m after N rounds of data expansion.
4.12.2.6 If m < M, let m = m + 1 and go to step 4.12.2.2; if m = M, the expansion of the M pieces of trajectory data is finished and the M pieces of data-expanded trajectory data τ_M^N are obtained:
4.12.2.7 The data expansion module sends τ_M^N to the feature extraction module and the control module.
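The RAD-style random crop of step 4.12.2.3 (100×100 down to 84×84) can be sketched on a plain nested list; channel handling is omitted for brevity, and the function name is illustrative:

```python
import random

def random_crop(img, out_h=84, out_w=84, rng=random):
    """Randomly crop a 2-D image (list of rows) to out_h x out_w.

    Picks a uniformly random top-left corner so the crop fits inside the
    image, mirroring the random-crop data expansion used in RAD.
    """
    h, w = len(img), len(img[0])
    top = rng.randrange(h - out_h + 1)    # random vertical offset
    left = rng.randrange(w - out_w + 1)   # random horizontal offset
    return [row[left:left + out_w] for row in img[top:top + out_h]]
```

Calling it N times on the same observation yields the N expanded views that step 4.12.2.5 substitutes back into the trajectory data.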
4.12.3 The feature extraction module receives τ_M^N from the data expansion module. For the M pieces of trajectory data in τ_M^N, gradient descent (a common optimization method in machine learning) is used in turn to minimize the feature extraction module loss function shown in formula (2), optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network in the feature extraction module, as follows:
4.12.3.1 The encoder network, target encoder network, single-view predictor network and multi-view predictor network receive τ_M^N from the data expansion module.
4.12.3.2 initialize the track data index m =1.
4.12.3.3 Encoder_1 in the encoder network encodes the N data-expanded image observations s_t^1(m), …, s_t^N(m) of the m-th trajectory data in τ_M^N into N second state vectors (the j-th being the second state vector corresponding to s_t^j(m)) and sends them to Encoder_2. Through its fully connected layers, Encoder_2 obtains the means and variances of the Gaussian distributions corresponding to the N second state vectors, where the j-th mean and the j-th variance belong to the Gaussian distribution corresponding to the j-th second state vector. Encoder_2 performs the reparameterization operation on the means and variances to obtain N reparameterized state vectors (the j-th being the reparameterized state vector corresponding to the j-th second state vector) and sends them to the feature fusion network and the single-view predictor network.
4.12.3.4 The target encoder network encodes the N data-expanded image observations s_{t+1}^1(m), …, s_{t+1}^N(m) of the m-th trajectory data in τ_M^N into N target state vectors (the j-th being the target state vector corresponding to s_{t+1}^j(m)) and sends them to the feature fusion network.
4.12.3.5 The feature fusion network receives the N reparameterized state vectors from the encoder network. Feature_1 performs feature fusion on them to obtain a state fusion vector and sends it to Feature_2; Feature_2 performs the reparameterization operation on the state fusion vector to obtain the reparameterized state fusion vector and sends it to the multi-view predictor network.
4.12.3.6 The feature fusion network receives the N target state vectors from the target encoder network, and Feature_1 performs feature fusion on them to obtain a target state fusion vector.
4.12.3.7 The single-view predictor network receives the N reparameterized state vectors from the encoder network and the control instruction a_t(m) of the m-th trajectory data from τ_M^N. It performs transfer dynamics equation prediction and reward function equation prediction on the first spliced vectors formed from the reparameterized state vectors and a_t(m), obtaining the predicted target state vectors and the first predicted reward values.
4.12.3.8 The multi-view predictor network receives the reparameterized state fusion vector from the feature fusion network and the control instruction a_t(m) of the m-th trajectory data from τ_M^N. It performs transfer dynamics equation prediction and reward function equation prediction on the second spliced vector formed from the reparameterized state fusion vector and a_t(m), obtaining the predicted target state fusion vector and the second predicted reward value.
4.12.3.9 Using the gradient descent method, the feature extraction module minimizes the optimization loss function in formula (2) on the basis of the means and variances obtained in step 4.12.3.3, the target state vectors obtained in step 4.12.3.4, the target state fusion vector obtained in step 4.12.3.6, the predicted target state vectors and first predicted reward values obtained in step 4.12.3.7, the predicted target state fusion vector and second predicted reward value obtained in step 4.12.3.8, and the reward r_t(m) in τ_m^N, updating the gradient in reverse to optimize the encoder network, feature fusion network, single-view predictor network and multi-view predictor network.
4.12.3.10 If m is less than M, let m = m + 1 and go to step 4.12.3.3; otherwise go to step 4.12.4.
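The encoding, reparameterization and fusion performed in steps 4.12.3.3 to 4.12.3.6 can be sketched as follows. This is a toy numpy sketch: `reparameterize` shows the standard mean-plus-noise trick the patent's Encoder_2 and Feature_2 apply, while `fuse` is an illustrative stand-in (the patent's Feature_1 uses fully connected layers, not a mean):

```python
# Toy sketch of reparameterization (steps 4.12.3.3/4.12.3.5) and view fusion.
import numpy as np

def reparameterize(mu, log_var, rng):
    """z~ = mu + sigma * eps with eps ~ N(0, I), so gradients can flow
    through the sampled state vectors."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def fuse(views):
    """Stand-in fusion of N per-view vectors into one fusion vector
    (mean pooling here; the patent's Feature_1 learns this mapping)."""
    return np.mean(views, axis=0)

rng = np.random.default_rng(0)
mus = np.ones((2, 50))            # N = 2 views, 50-dim state vectors (toy sizes)
log_vars = np.zeros((2, 50))      # unit variance
z_tilde = reparameterize(mus, log_vars, rng)   # N reparameterized state vectors
g = fuse(z_tilde)                              # single (state) fusion vector
print(z_tilde.shape, g.shape)                  # (2, 50) (50,)
```

The same two operations are then reused on the target side (target state vectors fused into the target state fusion vector), with the target branch kept out of the gradient update.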
4.12.4 Encoder_1 of the encoder network in the feature extraction module encodes the first data-expanded image observations s_t^1(1), …, s_t^1(M) of the M pieces of trajectory data in τ_M^N into M second state vectors and sends them to the control module.
4.12.5 The target encoder network in the feature extraction module encodes the first data-expanded image observations s_{t+1}^1(1), …, s_{t+1}^1(M) of the M pieces of trajectory data in τ_M^N into M target state vectors and sends them to the control module.
4.12.6 The control module receives τ_M^N from the data expansion module and the M second state vectors and M target state vectors from the feature extraction module. For the M pieces of trajectory data in τ_M^N and the corresponding second and target state vectors, it sequentially uses the gradient descent method to minimize the loss functions shown in formula (3), formula (4) and formula (5), optimizing the evaluation networks, the policy network and the temperature coefficient factor, as follows:
4.12.6.1 The policy network receives the M second state vectors and the M target state vectors from the feature extraction module; Critic_1 and Critic_2 receive the M second state vectors from the feature extraction module and τ_M^N from the data expansion module; TarCritic_1 and TarCritic_2 receive the M target state vectors from the feature extraction module.
4.12.6.2 initialize the track data index m =1.
4.12.6.3 The policy network obtains the m-th second state vector and the m-th target state vector and performs the following operations: it performs action mapping on the m-th second state vector to obtain the control instruction a(m) and sends a(m) to Critic_1 and Critic_2; it performs action mapping on the m-th target state vector to obtain the control instruction a′(m) and sends a′(m) to TarCritic_1 and TarCritic_2.
4.12.6.4 Critic_1 receives the control instruction a(m) from the policy network, obtains the m-th second state vector, and obtains the m-th trajectory data from τ_M^N. It performs the following operations: it performs state-action value estimation on the third spliced vector formed from the m-th second state vector and a_t(m) to obtain the first state-action value, and performs state-action value estimation on the fourth spliced vector formed from the m-th second state vector and a(m) to obtain the second state-action value.
4.12.6.5 Critic_2 receives the control instruction a(m) from the policy network, obtains the m-th second state vector, and obtains the m-th trajectory data from τ_M^N. It performs the following operations: it performs state-action value estimation on the third spliced vector formed from the m-th second state vector and a_t(m) to obtain the third state-action value, and performs state-action value estimation on the fourth spliced vector formed from the m-th second state vector and a(m) to obtain the fourth state-action value.
4.12.6.6 TarCritic_1 receives the control instruction a′(m) from the policy network and obtains the m-th target state vector. It performs state-action value estimation on the target spliced vector formed from the m-th target state vector and a′(m) to obtain the first target state-action value.
4.12.6.7 TarCritic_2 receives the control instruction a′(m) from the policy network and obtains the m-th target state vector. It performs state-action value estimation on the target spliced vector formed from the m-th target state vector and a′(m) to obtain the second target state-action value.
4.12.6.8 The control module uses the gradient descent method to minimize the loss function in formula (3), updating the gradient in reverse to optimize Critic_1 and Critic_2.
Wherein: theta_i is the parameter of the i-th evaluation network, theta-bar_i is the parameter of the i-th target evaluation network, phi is the parameter of the policy network, and i = 1 or 2 is the index of the two evaluation networks and the two target evaluation networks in the control module. The state value corresponding to the target state vector (the state value is a basic concept in reinforcement learning, referring to the expected return the agent can obtain in the current state) is computed from the smaller of the first target state-action value and the second target state-action value. alpha is the temperature coefficient factor (its initial value is set to 0.1, and it is dynamically adjusted along with the policy network during optimization; see the paper "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor" published by Tuomas Haarnoja et al. in the proceedings of the International Conference on Machine Learning (ICML) in 2018), and gamma is the reward discount factor (generally set to 0.99).
4.12.6.9 The control module uses the gradient descent method to minimize the loss function shown in formula (4), updating the gradient in reverse to optimize the policy network.
Wherein: pi_phi(a(m) | ·) denotes the distribution of the control instruction a(m) output by the policy network given the second state vector, and the value term is the smaller of the second state-action value and the fourth state-action value.
4.12.6.10 The control module uses the gradient descent method to minimize the loss function shown in formula (5), updating the gradient in reverse to optimize the temperature coefficient factor.
Wherein: the target entropy of the policy network is set to the negative of the dimension of the agent control instruction a(m).
4.12.6.11 If m is less than M, let m = m + 1 and go to step 4.12.6.3; otherwise go to step 4.12.7.
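The quantities minimized in steps 4.12.6.8 to 4.12.6.10 follow the soft actor-critic pattern the text describes (formulas (3), (4) and (5) themselves are not reproduced in this excerpt). Below is a minimal numeric sketch assuming the standard SAC forms: a Bellman target with the clipped-double-Q minimum and entropy term, the actor loss, and the temperature loss; the function names are illustrative:

```python
# Toy sketch of the three SAC-style losses described for formulas (3)-(5).
import numpy as np

def critic_target(r, q1_targ, q2_targ, logp_next, alpha=0.1, gamma=0.99):
    """Bellman target r + gamma * (min(Q1', Q2') - alpha * log pi),
    i.e. the regression target used when optimizing Critic_1/Critic_2."""
    return r + gamma * (np.minimum(q1_targ, q2_targ) - alpha * logp_next)

def actor_loss(q1, q2, logp, alpha=0.1):
    """Policy loss: alpha * log pi - min(Q1, Q2), averaged over the batch."""
    return np.mean(alpha * logp - np.minimum(q1, q2))

def alpha_loss(log_alpha, logp, target_entropy):
    """Temperature loss: -log_alpha * (log pi + target entropy), averaged;
    target_entropy is the negative of the action dimension."""
    return np.mean(-log_alpha * (logp + target_entropy))

y = critic_target(r=1.0, q1_targ=5.0, q2_targ=4.0, logp_next=-1.0)
print(round(float(y), 4))   # 1 + 0.99 * (4.0 + 0.1) = 5.059
```

Note how a larger alpha pushes the target (and the actor loss) toward higher-entropy behavior, which is why alpha is tuned by its own loss rather than fixed.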
4.12.7 Judge whether t % F equals zero. If so, the parameters of the target encoder network and the target evaluation networks in the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck need to be updated; go to step 4.12.8. Otherwise go to step 4.12.9.
4.12.8 The intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck uses the exponential moving average (a common method of neural network parameter updating) to update the parameters of the target encoder network and the parameters of the two target evaluation networks according to formulas (6) and (7).
Wherein: tau_p and tau_Q are the hyper-parameters for updating the target encoder network and the target evaluation networks (tau_p is generally set to 0.05 and tau_Q to 0.01), zeta is the parameter of the encoder network, and zeta-bar is the parameter of the target encoder network.
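The exponential-moving-average update of formulas (6) and (7) can be sketched as follows, with a toy list of scalars standing in for network parameters:

```python
# Toy sketch of the EMA target-network update (formulas (6)-(7)).
def ema_update(target, online, tau):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target,
    applied element-wise over the parameter list."""
    return [tau * o + (1 - tau) * t for t, o in zip(target, online)]

target = [0.0, 0.0]           # toy target-network parameters
online = [1.0, 2.0]           # toy online-network parameters
target = ema_update(target, online, tau=0.05)   # encoder rate tau_p = 0.05
print(target)                 # [0.05, 0.1]
```

With tau well below 1, the target networks trail the online networks slowly, which stabilizes the Bellman targets used by the evaluation networks.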
4.12.9 Let k = k + 1. If k equals the total number of optimizations K, the optimization is completed; go to step 4.13. Otherwise go to step 4.12.2.1.
4.13 the agent control system based on deep reinforcement learning and conditional entropy bottleneck judges whether T is equal to the maximum number of interactions T, if so, it indicates that the training is finished, go to step 4.14, otherwise, go to step 4.5.1.
4.14 The intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck saves the network parameters of the feature extraction module and the control module optimized in step 4.12 into a pt-format file (a pt-format file can be generated directly by the deep learning framework PyTorch).
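Step 4.14's checkpointing and the fifth step's loading form a save/restore round trip. A stand-in sketch follows: the patent states the pt-format file is produced by PyTorch (i.e. torch.save of the parameters), but pickle is used here only to keep the example self-contained and runnable:

```python
# Stand-in sketch of saving and reloading the optimized parameters (step 4.14
# and the fifth step); PyTorch's torch.save/torch.load play this role in the
# patent, pickle is a dependency-free substitute for illustration.
import os
import pickle
import tempfile

params = {"encoder": [0.1, 0.2], "policy": [0.3]}   # toy parameter dict
path = os.path.join(tempfile.mkdtemp(), "model.pt")

with open(path, "wb") as f:                          # step 4.14: save
    pickle.dump(params, f)
with open(path, "rb") as f:                          # fifth step: load
    restored = pickle.load(f)

print(restored == params)   # True
```

The loaded parameters then initialize the feature extraction module and control module, so deployment needs no further training.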
Fig. 4 is a schematic diagram of the training results of the cheetah robot running scenario embodiment shown in Fig. 3. The figure shows how the sum of reward values fed back by the image continuous control task simulation environment to the cheetah robot's control instructions in one round of the control task (represented by the average return, obtained by repeatedly running 30 rounds of the task and averaging the sum of reward values in each round) changes as the number of interactions increases. The abscissa represents the number of interactions between the cheetah robot and the environment, and the ordinate represents the average return; the larger the average return, the better the cheetah robot's control strategy. The comparison is with DBC, the deep reinforcement learning control method from the background art, which uses the bisimulation metric to learn the state vectors corresponding to image observations when performing agent control. In Fig. 4, the average return obtained by the cheetah robot with the present method is higher than with DBC, indicating that robust state vectors are obtained when encoding image observations in the image continuous control task, which further reduces the complexity of the control task and improves the accuracy of the cheetah robot's control strategy.
Fifthly, the feature extraction module and the control module load the pt-format file obtained in step 4.14 and initialize their parameters using the parameters in that file, obtaining the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck.
Sixthly, the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck is deployed on an agent built for the real environment, namely the cheetah robot.
Seventhly, the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck assists the agent in completing the image continuous control task: the cheetah robot running scenario is constructed and the cheetah robot completes the task of running fast.
7.1 Initialize the robot's action count t = 0 and set the maximum number of robot actions T_0 (T_0 is a positive integer; this embodiment uses the value 1000).
7.2 The perception module obtains an image observation of the cheetah robot running scenario and sends it to the feature extraction module; the feature extraction module receives the image observation, Encoder_1 in the encoder network encodes the image observation to obtain the corresponding first state vector and sends the first state vector to the control module; the control module receives the first state vector, the policy network maps the first state vector into a control instruction and sends the control instruction to the action module; the action module receives the control instruction and executes it in the cheetah robot running scenario, as follows:
7.2.1 The perception module obtains the image observation s_t of the cheetah robot running scenario at the robot's t-th action and sends s_t to the feature extraction module.
7.2.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t according to the method described in step 4.5.2 and sends z_t to the control module.
7.2.3 The control module receives z_t; the policy network maps z_t into the control instruction a_t of the cheetah robot's t-th action according to the method described in step 4.5.3 and sends a_t to the action module.
7.2.4 The action module receives the control instruction a_t and executes a_t in the cheetah robot running scenario.
7.3 Let t = t + 1. If t equals the maximum number of actions T_0, go to the eighth step; otherwise go to step 7.2.1.
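The deployment loop of steps 7.2 to 7.3 reduces to perceive, encode, map to an action, and act. The following toy sketch uses stand-in `encode` and `policy` functions (the real system uses the trained Encoder_1 and policy network loaded from the pt-format file):

```python
# Toy sketch of the perceive -> encode -> act deployment loop (steps 7.2-7.3).
import numpy as np

def encode(obs):
    """Stand-in for Encoder_1: flatten the observation and keep the
    first 8 values as a 'state vector' (illustrative only)."""
    return obs.ravel()[:8]

def policy(z):
    """Stand-in for the policy network: a bounded deterministic action map."""
    return np.tanh(z)

def run_episode(get_obs, act, t_max=1000):
    """Repeat perception, encoding, action mapping and execution t_max times."""
    for _ in range(t_max):
        z = encode(get_obs())   # step 7.2.1-7.2.2
        act(policy(z))          # step 7.2.3-7.2.4

actions = []
run_episode(lambda: np.zeros((3, 84, 84)), actions.append, t_max=5)
print(len(actions), actions[0].shape)   # 5 (8,)
```

No gradients are computed at deployment time; the loop only runs forward passes, which is why the trained system can execute on the robot's onboard hardware.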
Eighthly, the procedure ends.
Fig. 5 is a schematic diagram of the cheetah robot's actions in the running scenario shown in Fig. 3; the 9 sub-graphs show the action sequence of 2 runs among the cheetah robot's T_0 actions. From this action sequence it can be seen that the cheetah robot's running process can be divided into body curling, leg kicking, going airborne and landing, which shows that the cheetah robot completes the task of running fast in the scenario and verifies the effectiveness and accuracy of the intelligent agent control method provided by the present invention.
Claims (10)
1. An intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck is characterized by comprising the following steps:
firstly, constructing an intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, installing the control system on an intelligent agent, and enabling the intelligent agent to interact with the image continuous control task environment; the intelligent agent refers to an unmanned node with perception, communication, movement, storage and computing capabilities; the image continuous control task environment refers to the entity interacting with the intelligent agent: the intelligent agent observes the state of the environment in the form of images and acts in the environment according to continuous control instructions based on the image observations; the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck consists of a perception module, an action module, a storage module, a data expansion module, a feature extraction module and a control module;
the sensing module is an image sensor and is connected with the feature extraction module and the storage module; the sensing module acquires image observation containing an intelligent agent state and an environment state from an image continuous control task environment, and sends the image observation to the feature extraction module and the storage module;
the action module is an actuator of an intelligent agent control instruction, is connected with the control module, receives the control instruction from the control module, and acts in the image continuous control task environment according to the control instruction;
the storage module is connected with the perception module, the control module and the data expansion module; it receives image observations from the perception module, control instructions from the control module, and rewards from the image continuous control task environment, and combines the image observations, control instructions and rewards into trajectory data of the intelligent agent's interaction with the image continuous control task environment; the trajectory data is stored in the form of a quadruple (s_t, a_t, r_t, s_{t+1}), wherein: s_t is the image observation received from the perception module at the intelligent agent's t-th interaction with the image continuous control task environment, a_t is the control instruction from the control module executed at the intelligent agent's t-th interaction with the image continuous control task environment, r_t is the reward value fed back by the environment for the control instruction a_t at the intelligent agent's t-th interaction with the image continuous control task environment, and s_{t+1} is the image observation received from the perception module after the environment state changes due to the intelligent agent's t-th interaction with the image continuous control task environment, called the image observation at the intelligent agent's (t+1)-th interaction with the image continuous control task environment;
the data expansion module is connected with the storage module, the feature extraction module and the control module; it randomly selects from the storage module the trajectory data τ = (s_t, a_t, r_t, s_{t+1}) required for training the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, performs N rounds of data expansion on s_t and s_{t+1} in τ to obtain the trajectory data τ^N after N rounds of data expansion, where j ∈ [1, N] is the index of the image observations after N rounds of data expansion, and sends τ^N to the feature extraction module and the control module;
the characteristic extraction module is connected with the sensing module, the data expansion module and the control module; the characteristic extraction module consists of a coder network, a target coder network, a characteristic fusion network, a single-view predictor network and a multi-view predictor network;
the encoder network consists of a first encoder network Encoder_1 and a second encoder network Encoder_2 and is connected with the perception module, the data expansion module, the control module, the feature fusion network and the single-view predictor network; Encoder_1 consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer and is connected with the perception module, the data expansion module, the control module and Encoder_2; Encoder_2 consists of 3 fully connected layers and is connected with Encoder_1, the feature fusion network and the single-view predictor network; when the intelligent agent interacts with the image continuous control task environment, Encoder_1 receives s_t from the perception module, and the first, second, third and fourth convolutional layers of Encoder_1 sequentially perform convolution operations on s_t with 3×3 convolution kernels, sending s_t after four convolution operations to the fully connected layer of Encoder_1; the fully connected layer of Encoder_1 performs a full connection operation on the four-times-convolved s_t received from the fourth convolutional layer to obtain the fully connected state vector corresponding to s_t and sends it to the regularization layer of Encoder_1; the regularization layer of Encoder_1 performs a regularization operation on the fully connected state vector received from the fully connected layer of Encoder_1 to obtain the regularized state vector, takes it as the first state vector z_t, and sends z_t to the control module; when training the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, Encoder_1 receives the data-expanded trajectory data τ^N from the data expansion module, and the first, second, third and fourth convolutional layers of Encoder_1 sequentially perform convolution operations on the data-expanded image observations in τ^N with 3×3 convolution kernels, sending the four-times-convolved observations to the fully connected layer of Encoder_1; the fully connected layer of Encoder_1 performs a full connection operation on them to obtain the corresponding N fully connected state vectors and sends the N fully connected state vectors to the regularization layer of Encoder_1; the regularization layer of Encoder_1 performs a regularization operation on the N fully connected state vectors received from the fully connected layer of Encoder_1 to obtain N regularized state vectors, which are taken as the second state vectors; the first of the second state vectors is sent to the control module and the second state vectors are sent to Encoder_2; the first, second and third fully connected layers of Encoder_2 sequentially perform full connection operations on the second state vectors received from Encoder_1 to obtain, after three full connection operations, the means and variances of the corresponding Gaussian distributions, perform the reparameterization operation on the means and variances to obtain N reparameterized state vectors, and send the N reparameterized state vectors to the feature fusion network and the single-view predictor network;
the target encoder network is connected with the data expansion module, the control module and the feature fusion network and consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer; the target encoder network receives τ^N from the data expansion module, and its first, second, third and fourth convolutional layers sequentially perform convolution operations on the data-expanded image observations in τ^N with 3×3 convolution kernels, sending the four-times-convolved observations to the fully connected layer; the fully connected layer performs a full connection operation on the four-times-convolved observations received from the fourth convolutional layer to obtain the corresponding N fully connected target state vectors and sends the N fully connected target state vectors to the regularization layer; the regularization layer performs a regularization operation on the N fully connected target state vectors received from the fully connected layer to obtain N regularized target state vectors, which are taken as the target state vectors; the first target state vector is sent to the control module and the target state vectors are sent to the feature fusion network;
the feature fusion network is connected with the encoder network, the target encoder network and the multi-view predictor network and consists of a first fusion network Feature_1 and a second fusion network Feature_2, each composed of 3 fully connected layers; Feature_1 is connected with the encoder network, the target encoder network and Feature_2; Feature_1 receives the N reparameterized state vectors from the encoder network and the N target state vectors from the target encoder network; the first, second and third fully connected layers of Feature_1 sequentially perform full connection operations on the N reparameterized state vectors, splicing them into a state fusion vector, which is sent to Feature_2; the first, second and third fully connected layers of Feature_1 sequentially perform full connection operations on the N target state vectors, splicing them into a target state fusion vector; Feature_2 is connected with Feature_1 and the multi-view predictor network and receives the state fusion vector from Feature_1; the first, second and third fully connected layers of Feature_2 sequentially perform the reparameterization operation on the state fusion vector to obtain the reparameterized state fusion vector, which is sent to the multi-view predictor network;
the single-view predictor network is connected with the data expansion module and the encoder network and consists of 3 fully connected layers; the single-view predictor network receives the data-expanded trajectory data τ^N from the data expansion module and the N reparameterized state vectors from the encoder network; its first, second and third fully connected layers sequentially perform full connection operations on the first spliced vectors formed from the reparameterized state vectors and the control instruction a_t in τ^N, mapping them into the predicted target state vectors and the first predicted reward values, thereby realizing the prediction of the transfer dynamics equation and the reward function equation, where the j-th predicted target state vector and the j-th first predicted reward value correspond to the j-th reparameterized state vector;
the multi-view predictor network is connected with the data expansion module and the feature fusion network and consists of 3 fully connected layers; the multi-view predictor network receives the data-expanded trajectory data τ^N from the data expansion module and the reparameterized state fusion vector from the feature fusion network, and forms the second spliced vector from the reparameterized state fusion vector and the control instruction a_t in τ^N; its first, second and third fully connected layers sequentially perform full connection operations on the second spliced vector, mapping it into the predicted target state fusion vector and the second predicted reward value, thereby realizing the prediction of the transfer dynamics equation and the reward function equation;
the control module is connected with the feature extraction module, the data expansion module and the action module and consists of a first evaluation network criticic _1, a second evaluation network criticic _2, a first target evaluation network Tarcriticic _1, a second target evaluation network Tarcriticic _2 and a strategy network;
Critic_1 and Critic_2 are connected with the feature extraction module, the data expansion module and the policy network; each consists of three fully connected layers; they receive the first second state vector from the feature extraction module, the data-expanded trajectory data τ^N from the data expansion module, and the control instruction a from the policy network, and evaluate the quality of the control instruction a_t in τ^N and of a from the policy network; the first, second and third fully connected layers of Critic_1 sequentially perform full connection operations on the third spliced vector formed from the first second state vector and a_t, mapping it after three full connection operations into the first state-action value; the first, second and third fully connected layers of Critic_1 sequentially perform full connection operations on the fourth spliced vector formed from the first second state vector and a, mapping it after three full connection operations into the second state-action value; the first, second and third fully connected layers of Critic_2 sequentially perform full connection operations on the third spliced vector formed from the first second state vector and a_t, mapping it after three full connection operations into the third state-action value; the first, second and third fully connected layers of Critic_2 sequentially perform full connection operations on the fourth spliced vector formed from the first second state vector and a, mapping it after three full connection operations into the fourth state-action value;
TarCritic_1 and TarCritic_2 are both connected with the feature extraction module and the policy network; each consists of three fully connected layers; they receive the first target state vector from the feature extraction module and the control instruction a′ from the policy network, and evaluate the quality of a′; the first, second and third fully connected layers of TarCritic_1 sequentially perform full connection operations on the target spliced vector formed from the first target state vector and a′, mapping it after three full connection operations into the first target state-action value; the first, second and third fully connected layers of TarCritic_2 sequentially perform full connection operations on the target spliced vector formed from the first target state vector and a′, mapping it after three full connection operations into the second target state-action value;
the policy network is connected with the feature extraction module, the action module, the storage module, Critic_1, Critic_2, TarCritic_1 and TarCritic_2 and consists of three fully connected layers; when the intelligent agent interacts with the image continuous control task environment, it receives the first state vector z_t from the feature extraction module, and its first, second and third fully connected layers sequentially perform full connection operations on z_t, mapping z_t into the control instruction a_t, which is sent to the action module and the storage module; when training the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, it receives the first second state vector and the first target state vector from the feature extraction module; its first, second and third fully connected layers sequentially perform full connection operations on the first second state vector, mapping it into the control instruction a, and a is sent to Critic_1 and Critic_2; its first, second and third fully connected layers sequentially perform full connection operations on the first target state vector, mapping it into the control instruction a′, and a′ is sent to TarCritic_1 and TarCritic_2;
secondly, constructing a target function of the feature extraction module based on the conditional entropy bottleneck, and obtaining an optimized loss function of the feature extraction module through a variational reasoning technology; the method comprises the following steps:
2.1 designing a feature extraction module objective function shown in formula (1) based on the conditional entropy bottleneck;
wherein: Object represents the objective of the feature extraction module, s_t^j is the image observation obtained by the data expansion module after the j-th data expansion of the image observation at the t-th interaction, s_{t+1}^j is the image observation obtained by the data expansion module after the j-th data expansion of the image observation at the (t+1)-th interaction, the j-th reparameterized state vector is obtained by inputting s_t^j into the encoder network, the j-th target state vector is obtained by inputting s_{t+1}^j into the target encoder network, the reparameterized state fusion vector is obtained by inputting the N reparameterized state vectors into the feature fusion network, the target state fusion vector is obtained by inputting the N target state vectors into the feature fusion network, β_j is the regularization factor, and the remaining terms are conditional mutual information items;
2.2 applying variational reasoning technique to the formula (1) to obtain the optimized loss function of the feature extraction module shown in the formula (2):
wherein: m is the number of the track data randomly selected from the storage module by the data expansion module,is a Gaussian distribution Is the average value calculated by Encoder _2 in the Encoder network,is the variance calculated by Encoder 2 in the Encoder network,is a distribution of the variation components,andis gaussian noise;is representative of xi j In the expectation that the position of the target is not changed,is representative of xi 1 ,ξ 2 ,…,ξ N And xi 1N The desired product of the two or more of the two,show thatAnd a t Item j of predicted target state vector obtained by input into single-view predictor networkAndcross entropy loss between and item j of the first predicted reward valueAnd r t The cross-entropy loss between the two,show thatAnd a t Predicted target state fusion vector obtained by inputting into multi-view predictor networkAndcross entropy loss between and secondary predicted reward valueAnd r t Cross entropy loss between;
thirdly, constructing an image continuous control task simulation environment in the open-source DeepMind Control Suite (DMControl) simulation environment, to prepare for training the agent control system based on deep reinforcement learning and conditional entropy bottleneck; the method comprises the following steps:
3.1 installing the DMControl simulation environment on any computer provided with Ubuntu and the PyTorch deep learning framework, constructing the agent simulation model and the image continuous control task simulation environment, and setting the task target of the agent in the image continuous control task;
3.2 setting, in the constructed image continuous control task simulation environment, the scale of the image observations through which the agent perceives its own state and the environment state, setting the control instruction executed by the agent to be a continuous vector, and setting, according to the task target, the reward value fed back by the image continuous control task simulation environment after the agent executes a control instruction;
fourthly, the intelligent agent trains an intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck in the image continuous control task simulation environment established in the third step, and the method comprises the following steps:
4.1 initializing the network parameters of the feature extraction module and the control module in the agent control system based on deep reinforcement learning and conditional entropy bottleneck, including the weight matrices and bias vectors of the fully connected layers, the convolution kernels of the convolutional layers, and the weight matrices and bias vectors of the regularization layers; the parameters are generated by orthogonal initialization, in which the nonzero parameters are drawn from a normal distribution with mean 0 and standard deviation 1;
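One common way to realise the orthogonal initialization of step 4.1 is to orthogonalise a Gaussian random matrix with a QR decomposition. The NumPy sketch below shows the square case; the function name `orthogonal_init` and the zero-initialized bias are assumptions for illustration, not the patent's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(n):
    # draw an n x n matrix from N(0, 1) and orthogonalise it with QR;
    # the resulting Q has orthonormal rows and columns
    a = rng.normal(0.0, 1.0, size=(n, n))
    q, _ = np.linalg.qr(a)
    return q

W = orthogonal_init(4)   # weight matrix of one layer
b = np.zeros(4)          # biases start at zero; the nonzero parameters
                         # come from the normal distribution above
```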
4.2, setting the storage module in the agent control system based on deep reinforcement learning and conditional entropy bottleneck to be a buffer queue that stores A pieces of trajectory data formed when the agent interacts with the image continuous control task simulation environment, and emptying the buffer queue;
4.3, initializing the number of interactions t = 0 between the agent and the image continuous control task simulation environment constructed in the third step, and setting the maximum number of interactions T of the agent, the maximum number of interactions E per round between the agent and the image continuous control task simulation environment, and the update frequency F of the target encoder network and the target evaluation networks in the agent control system based on deep reinforcement learning and conditional entropy bottleneck;
4.4, randomly setting the initial state of the image continuous control task simulation environment constructed in the third step and the initial state of the intelligent agent simulation model;
4.5, the perception module acquires the image observation when the agent interacts with the image continuous control task simulation environment, and sends the image observation to the feature extraction module and the storage module; the feature extraction module receives the image observation, Encoder_1 in the encoder network encodes the image observation into the corresponding first state vector, and the first state vector is sent to the control module; the control module receives the first state vector, the policy network maps the first state vector into a control instruction, and the control instruction is sent to the action module and the storage module; the method comprises the following steps:
4.5.1 the perception module obtains the image observation s_t at the t-th interaction between the agent and the image continuous control task simulation environment, and sends s_t to the feature extraction module and the storage module;
4.5.2 the feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t, and sends z_t to the control module;
4.5.3 the control module receives the first state vector z_t; the policy network maps z_t into the control instruction a_t to be executed at the t-th interaction between the agent and the image continuous control task simulation environment, and sends a_t to the action module and the storage module;
4.6 the action module receives the control instruction a_t and executes a_t in the image continuous control task simulation environment;
4.7 the image continuous control task simulation environment returns, according to the reward value designed in step 3.2, the reward value r_t obtained at the t-th interaction between the agent and the environment, and sends r_t to the storage module;
4.8 the state of the image continuous control task simulation environment changes because the agent executed the control instruction a_t; the perception module acquires the image observation s_{t+1} corresponding to the changed environment state, and sends s_{t+1} to the storage module;
4.9 the storage module receives s_t and s_{t+1} from the perception module, a_t from the control module, and r_t from the image continuous control task simulation environment, combines them into the trajectory data quadruple (s_t, a_t, r_t, s_{t+1}), and stores the quadruple in the buffer queue;
4.10 let t = t +1; if t% E =0, turning to step 4.4; otherwise, turning to step 4.11;
4.11 the data expansion module judges whether there are L pieces of trajectory data in the buffer queue of the storage module; if so, it randomly selects M pieces of trajectory data from the buffer queue and lets the M pieces form the trajectory data set τ_M; let τ_M_m = (s_t(m), a_t(m), r_t(m), s_{t+1}(m)) denote the m-th (m ∈ [1, M]) piece of trajectory data of τ_M; go to step 4.12 and optimize the agent control system based on deep reinforcement learning and conditional entropy bottleneck according to τ_M; if there are fewer than L pieces of trajectory data in the buffer queue, go to step 4.5.1;
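Step 4.11 amounts to sampling a random minibatch from a bounded buffer once enough experience has accumulated; it can be sketched as below, where the capacity, `L_MIN` and `M_BATCH` are toy stand-ins for the A, L and M of the claims.

```python
import random
from collections import deque

random.seed(0)

L_MIN, M_BATCH = 5, 3          # toy stand-ins for L and M
buffer = deque(maxlen=10)      # bounded buffer queue (toy stand-in for A)

# fill the buffer with dummy (s_t, a_t, r_t, s_{t+1}) quadruples
for t in range(7):
    buffer.append((f"s{t}", f"a{t}", 0.1 * t, f"s{t+1}"))

def sample_batch():
    # only sample once at least L_MIN quadruples are stored
    if len(buffer) < L_MIN:
        return None
    return random.sample(list(buffer), M_BATCH)

tau_M = sample_batch()
```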
4.12 the data expansion module uses the random cropping method in data enhancement to perform N data expansions in turn on the image observations of each piece of trajectory data in τ_M, obtaining M pieces of data-expanded trajectory data, and sends the M pieces of data-expanded trajectory data to the feature extraction module and the control module; the feature extraction module and the control module receive the M pieces of data-expanded trajectory data and optimize the agent control system based on deep reinforcement learning and conditional entropy bottleneck; the method comprises the following steps:
4.12.1 initializing the number of optimizations k = 0, and setting the total number of optimizations K;
4.12.2 the data expansion module uses the random cropping method in data enhancement to perform N data expansions in turn on the image observations of each piece of trajectory data in τ_M, obtaining the M pieces of data-expanded trajectory data τ_M^N, and sends τ_M^N to the feature extraction module and the control module, wherein τ_M_m^N is the trajectory data obtained from τ_M_m after N data expansions:
4.12.3 the feature extraction module receives τ_M^N from the data expansion module; for the M pieces of trajectory data in τ_M^N, it minimizes in turn the optimized loss function of the feature extraction module shown in formula (2) using the gradient descent method, optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network in the feature extraction module; the method comprises the following steps:
4.12.3.1 the encoder network, the target encoder network, the single-view predictor network and the multi-view predictor network receive τ_M^N from the data expansion module;
4.12.3.2 initializing trajectory data index m =1;
encoder _1 in 4.12.3.3 Encoder network will τ _ M N Middle m track dataIn (1) Encoding into N second state vectors Is thatCorresponding second state vector, willSending the data to an Encoder _2; encoder _2 is obtained through full connection layerMean of corresponding Gaussian distributionsSum varianceWherein:is thatThe mean of the corresponding gaussian distribution is,is thatThe variance of the corresponding gaussian distribution; encoder _2 pairAndcarrying out reparameterization operation to obtain N reparameterized state vectors Is thatCorresponding reparameterized state vector of Sending the information to a feature fusion network and a single-view predictor network;
4.12.3.4 the target encoder network encodes the N data-expanded next image observations of the m-th trajectory data in τ_M^N into N target state vectors, and sends the N target state vectors to the feature fusion network;
4.12.3.5 the feature fusion network receives the N reparameterized state vectors from the encoder network; Feature_1 performs feature fusion on them to obtain the state fusion vector and sends it to Feature_2; Feature_2 performs the reparameterization operation on the state fusion vector to obtain the reparameterized state fusion vector, and sends it to the multi-view predictor network;
4.12.3.6 the feature fusion network receives the N target state vectors from the target encoder network; Feature_1 performs feature fusion on them to obtain the target state fusion vector;
4.12.3.7 the single-view predictor network receives the N reparameterized state vectors from the encoder network and the control instruction a_t(m) from the m-th trajectory data in τ_M^N; it performs transition dynamics equation prediction and reward function equation prediction on the first spliced vector formed from each reparameterized state vector and a_t(m), obtaining the predicted target state vectors and the first predicted reward values;
4.12.3.8 the multi-view predictor network receives the reparameterized state fusion vector from the feature fusion network and the control instruction a_t(m) from the m-th trajectory data in τ_M^N; it performs transition dynamics equation prediction and reward function equation prediction on the second spliced vector formed from the reparameterized state fusion vector and a_t(m), obtaining the predicted target state fusion vector and the second predicted reward value;
4.12.3.9 the feature extraction module uses the gradient descent method with the means and variances obtained in step 4.12.3.3, the target state vectors obtained in step 4.12.3.4, the target state fusion vector obtained in step 4.12.3.6, the predicted target state vectors and first predicted reward values obtained in step 4.12.3.7, the predicted target state fusion vector and second predicted reward value obtained in step 4.12.3.8, and r_t(m) to minimize the optimized loss function in formula (2), updating and optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network backwards through the gradient;
4.12.3.10 if m < M, let m = m + 1 and go to step 4.12.3.3; otherwise, go to step 4.12.4;
4.12.4 Encoder_1 of the encoder network in the feature extraction module encodes the first-expansion image observations of the M pieces of trajectory data in τ_M^N into M second state vectors, and sends them to the control module;
4.12.5 the target encoder network in the feature extraction module encodes the first-expansion next image observations of the M pieces of trajectory data in τ_M^N into M target state vectors, and sends them to the control module;
4.12.6 the control module receives τ_M^N from the data expansion module and receives the M second state vectors and the M target state vectors from the feature extraction module; for the M pieces of trajectory data in τ_M^N and the corresponding state vectors, it minimizes in turn the loss functions shown in formula (3), formula (4) and formula (5) using the gradient descent method, optimizing the evaluation networks and the policy network; the method comprises the following steps:
4.12.6.1 the policy network receives the M second state vectors and the M target state vectors from the feature extraction module; Critic_1 and Critic_2 receive the M second state vectors from the feature extraction module and receive τ_M^N from the data expansion module; TarCritic_1 and TarCritic_2 receive the M target state vectors from the feature extraction module;
4.12.6.2 initialize the track data index m =1;
4.12.6.3 the policy network obtains the m-th second state vector and the m-th target state vector, and carries out the following operations: it performs action mapping on the m-th second state vector to obtain the control instruction a(m), and sends a(m) to Critic_1 and Critic_2; it performs action mapping on the m-th target state vector to obtain the control instruction a'(m), and sends a'(m) to TarCritic_1 and TarCritic_2;
4.12.6.4 Critic_1 receives the control instruction a(m) from the policy network, obtains the m-th second state vector, and obtains a_t(m) from the m-th trajectory data in τ_M^N; it carries out the following operations: it performs state-action value estimation on the third spliced vector formed from the m-th second state vector and a_t(m) to obtain the first state-action value, and performs state-action value estimation on the spliced vector formed from the m-th second state vector and a(m) to obtain the second state-action value;
4.12.6.5 Critic_2 receives the control instruction a(m) from the policy network, obtains the m-th second state vector, and obtains a_t(m) from the m-th trajectory data in τ_M^N; it carries out the following operations: it performs state-action value estimation on the third spliced vector formed from the m-th second state vector and a_t(m) to obtain the third state-action value, and performs state-action value estimation on the fourth spliced vector formed from the m-th second state vector and a(m) to obtain the fourth state-action value;
4.12.6.6 TarCritic_1 receives the control instruction a'(m) from the policy network and obtains the m-th target state vector; it performs state-action value estimation on the spliced vector formed from the m-th target state vector and a'(m) to obtain the first target state-action value;
4.12.6.7 TarCritic_2 receives the control instruction a'(m) from the policy network and obtains the m-th target state vector; it performs state-action value estimation on the spliced vector formed from the m-th target state vector and a'(m) to obtain the second target state-action value;
4.12.6.8 the control module uses the gradient descent method to minimize the loss function shown in formula (3), optimizing Critic_1 and Critic_2 by updating backwards through the gradient;
wherein: formula (3) involves the parameters of the i-th evaluation network and the i-th target evaluation network; φ is the parameter of the policy network; i = 1 or 2 is the index of the two evaluation networks and the two target evaluation networks in the control module; the state value corresponding to the target state vector is computed from the smaller of the first target state-action value and the second target state-action value; α is the temperature coefficient factor, and γ is the reward discount factor;
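Although formula (3) itself is not reproduced in the text, its description — reward plus the discounted minimum of the two target state-action values, with an entropy term weighted by the temperature α — matches the standard soft actor-critic backup, which can be sketched as below (an assumption about the exact form; names and toy values are illustrative):

```python
def critic_target(r_t, q1_target, q2_target, log_pi, alpha=0.1, gamma=0.99, done=False):
    # soft Bellman target for training the two critics: the smaller of the
    # two target Q-values curbs overestimation, and -alpha * log_pi rewards
    # higher-entropy (more stochastic) policies
    soft_value = min(q1_target, q2_target) - alpha * log_pi
    return r_t + (0.0 if done else gamma * soft_value)

# toy example: r_t = 1.0, target Q-values 2.0 and 1.5, log-probability -1.0
y = critic_target(r_t=1.0, q1_target=2.0, q2_target=1.5, log_pi=-1.0)
```

Both Critic_1 and Critic_2 would then be regressed toward the same target y by gradient descent.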
4.12.6.9 the control module uses the gradient descent method to minimize the loss function shown in formula (4), updating and optimizing the policy network backwards through the gradient;
wherein: the first factor is the distribution of the control instruction a(m) output by the policy network under the second state vector, and the value term is the smaller of the second state-action value and the fourth state-action value;
4.12.6.10 the control module uses the gradient descent method to minimize the loss function shown in formula (5), updating and optimizing the temperature coefficient factor backwards through the gradient;
wherein: the target entropy of the policy network is set to the negative of the dimension of the agent control instruction a(m);
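With the target entropy fixed to the negative action dimension, a temperature loss of the usual soft actor-critic form can be sketched as below. This is an assumption about formula (5), which is not reproduced in the text; the function name and toy values are illustrative.

```python
def alpha_loss(alpha, log_pis, action_dim):
    # target entropy is minus the dimensionality of the control instruction
    target_entropy = -float(action_dim)
    # the gradient of this loss with respect to alpha raises alpha when the
    # policy's entropy falls below the target and lowers it otherwise
    return sum(-alpha * (lp + target_entropy) for lp in log_pis) / len(log_pis)

# toy batch of two log-probabilities for a 2-dimensional action
loss = alpha_loss(alpha=0.1, log_pis=[-1.0, -3.0], action_dim=2)
```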
4.12.6.11 if m < M, let m = m + 1 and go to step 4.12.6.3; otherwise, go to step 4.12.7;
4.12.7, judging whether t% F is equal to zero, if so, updating parameters of a target encoder network and a target evaluation network in the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, and turning to step 4.12.8, otherwise, turning to step 4.12.9;
4.12.8 the agent control system based on deep reinforcement learning and conditional entropy bottleneck uses exponential moving average to update parameters of the target encoder network and parameters of the two target evaluation networks according to formula (6) and formula (7);
wherein: τ_p and τ_Q are the hyper-parameters for updating the target encoder network and the target evaluation networks, and ζ is the parameter of the encoder network, with a corresponding parameter for the target encoder network;
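The exponential moving average updates of formula (6) and formula (7) can be sketched elementwise as follows, using the τ_p and τ_Q values suggested in claim 7 (the parameter lists here are toy stand-ins for real network weights):

```python
def ema_update(online, target, tau):
    # target <- tau * online + (1 - tau) * target, applied elementwise
    return [tau * o + (1.0 - tau) * t for o, t in zip(online, target)]

TAU_P, TAU_Q = 0.05, 0.01   # values suggested in claim 7

# toy parameter vectors: online parameters [1.0, 2.0], targets start at zero
encoder_target_params = ema_update([1.0, 2.0], [0.0, 0.0], TAU_P)
critic_target_params = ema_update([1.0, 2.0], [0.0, 0.0], TAU_Q)
```

Because τ is small, the target networks trail the online networks slowly, which stabilizes the bootstrapped targets used in step 4.12.6.8.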
4.12.9 let k = k + 1; if k is equal to the total number of optimizations K, the optimization is completed and go to step 4.13; otherwise, go to step 4.12.2.1;
4.13 the agent control system based on deep reinforcement learning and conditional entropy bottleneck judges whether t is equal to the maximum number of interactions T; if so, the training is finished and go to step 4.14; otherwise, go to step 4.5.1;
4.14 the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck saves the network parameters of the feature extraction module and the control module optimized in the step 4.12 into a pt-format file;
fifthly, loading the file in the pt format obtained in the step 4.14 by the feature extraction module and the control module, and initializing parameters of the feature extraction module and the control module by using parameters in the file in the pt format to obtain the trained intelligent agent control system based on the deep reinforcement learning and the conditional entropy bottleneck;
sixthly, deploying the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck on an intelligent agent in a real environment;
seventhly, the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck assists the intelligent agent to complete the image continuous control task, and the method comprises the following steps:
7.1 initializing the number of actions t = 0 of the agent, and setting the maximum number of actions T_0 of the agent in the real environment;
7.2, the perception module obtains image observation of a real environment and sends the image observation to the feature extraction module; the feature extraction module receives the image observation, an Encoder _1 in an Encoder network encodes the image observation to obtain a first state vector corresponding to the image observation, and the first state vector is sent to the control module; the control module receives the first state vector, the strategy network maps the first state vector into a control instruction, and the control instruction is sent to the action module; the action module receives the control instruction and executes the control instruction in the real environment, and the method comprises the following steps:
7.2.1 the perception module obtains the image observation s_t of the real environment at the agent's t-th action, and sends s_t to the feature extraction module;
7.2.2 the feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t according to the method described in step 4.5.2, and sends z_t to the control module;
7.2.3 the control module receives z_t; the policy network maps z_t into the control instruction a_t of the agent's t-th action according to the method described in step 4.5.3, and sends a_t to the action module;
7.2.4 the action module receives the control instruction a_t and executes a_t in the real environment;
7.3 let t = t + 1; if t is equal to the maximum number of actions T_0, go to the eighth step; otherwise, go to step 7.2.1;
and eighthly, the method ends.
2. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck of claim 1, wherein the intelligent agent refers to an unmanned aerial vehicle or a robot or a simulated mechanical arm or a simulated robot constructed in a simulated environment; the image sensor refers to a depth camera; the action module refers to an engine and a steering gear of the intelligent agent; the memory module is a memory containing more than 1GB of available space.
3. The method as claimed in claim 1, wherein the agent state refers to information of the agent itself, the environment state refers to information other than the agent, and the image observation is an RGB image.
4. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck of claim 1, wherein the regularization factor β_j of the second step takes a value of 1e-4 to 1e-2.
5. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, wherein the Ubuntu in step 3.1 requires version 16.04 or above, and the DMControl simulation environment requires the physics engine MuJoCo, version MuJoCo200.
6. The method as claimed in claim 1, wherein the image observations with which the agent perceives its own state and the environment state in step 3.2 are set to a scale of 100 × 100, and the continuous vector set as the control instruction executed by the agent includes joint rotation speeds and torques.
7. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, wherein the number A of pieces of trajectory data in the buffer queue in step 4.2 satisfies A ≥ 10^5; the maximum number of interactions T in step 4.3 is an integer with T ≥ 5A; the maximum number of interactions per round E in step 4.3 is a positive integer with value 1000; the update frequency F in step 4.3 is a positive integer with value 2; L in step 4.11 is set to 1000; M in step 4.11 is set to 512; the total number of optimizations K in step 4.12.1 is set to 10; the initial value of the temperature coefficient factor α in step 4.12.6.8 is set to 0.1, and the reward discount factor γ is set to 0.99; τ_p in step 4.12.8 is set to 0.05, and τ_Q is set to 0.01; the pt-format file in step 4.14 is generated directly by the deep learning framework PyTorch.
8. An agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, characterized in that the method by which the storage module in step 4.9 stores the trajectory data (s_t, a_t, r_t, s_{t+1}) in the buffer queue is:
4.9.1 the storage module judges whether there are already A pieces of trajectory data in the buffer queue; if so, go to step 4.9.2; otherwise, go to step 4.9.3;
4.9.2 the storage module removes the piece of trajectory data at the head of the buffer queue according to the first-in-first-out principle;
4.9.3 the storage module combines s_t, s_{t+1}, a_t and r_t into the trajectory data quadruple (s_t, a_t, r_t, s_{t+1}) and stores it at the tail of the buffer queue.
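Steps 4.9.1–4.9.3 describe first-in-first-out eviction at capacity A, which a `collections.deque` with `maxlen` implements directly; the capacity below is a toy stand-in for the A ≥ 10^5 of claim 7.

```python
from collections import deque

A = 3                      # toy capacity; claim 7 suggests A >= 1e5
buffer = deque(maxlen=A)   # maxlen gives first-in-first-out eviction

# appending a quadruple at the tail evicts the head once A items are held,
# matching steps 4.9.1-4.9.3
for t in range(5):
    buffer.append((f"s{t}", f"a{t}", float(t), f"s{t+1}"))

oldest = buffer[0]         # the two earliest quadruples have been evicted
```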
9. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, wherein the method by which the data expansion module in step 4.12.2 uses the random cropping method in data enhancement to perform N data expansions in turn on the image observations of each piece of trajectory data in τ_M, obtains the M pieces of data-expanded trajectory data, and sends the M pieces of data-expanded trajectory data to the feature extraction module and the control module is:
4.12.2.1 initializing trajectory data index m =1;
4.12.2.2, initializing data expansion times j =0, and setting total data expansion times N to be 2;
4.12.2.3 referring to the settings in RAD, the random cropping method in data enhancement crops the image observation s_t(m) of scale 100 × 100 in the m-th trajectory data τ_M_m = (s_t(m), a_t(m), r_t(m), s_{t+1}(m)) of τ_M to a scale of 84 × 84, and crops the image observation s_{t+1}(m) of scale 100 × 100 to a scale of 84 × 84;
4.12.2.4 let j = j +1; if j is equal to the total data expansion times N, turning to step 4.12.2.5, otherwise, turning to step 4.12.2.3;
4.12.2.5 the data expansion module replaces s_t(m) in τ_M_m with the N data-expanded image observations of s_t(m), and replaces s_{t+1}(m) with the N data-expanded image observations of s_{t+1}(m), obtaining the trajectory data τ_M_m^N of τ_M_m after N data expansions;
4.12.2.6 if m < M, let m = m + 1 and go to step 4.12.2.2; if m = M, the expansion of the M pieces of trajectory data is finished, and the M pieces of data-expanded trajectory data τ_M^N are obtained;
4.12.2.7 the data expansion module sends τ_M^N to the feature extraction module and the control module.
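The RAD-style random crop of steps 4.12.2.3–4.12.2.7 (100 × 100 observations cropped to 84 × 84, with N = 2 expansions per observation) can be sketched with NumPy; the function name `random_crop` is illustrative, not the patent's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, out_size=84):
    # crop a random out_size x out_size window from an H x W x C image,
    # as in the RAD-style random-crop augmentation
    h, w = image.shape[:2]
    top = rng.integers(0, h - out_size + 1)
    left = rng.integers(0, w - out_size + 1)
    return image[top:top + out_size, left:left + out_size]

obs = np.zeros((100, 100, 3), dtype=np.uint8)       # one 100 x 100 observation
cropped = [random_crop(obs) for _ in range(2)]      # N = 2 data expansions
```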
10. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, wherein the maximum number of actions T_0 in step 7.1 is a positive integer with value 1000.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210865762.2A CN115167136B (en) | 2022-07-21 | 2022-07-21 | Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115167136A true CN115167136A (en) | 2022-10-11 |
CN115167136B CN115167136B (en) | 2023-04-07 |