CN115167136B - Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck


Info

Publication number
CN115167136B
CN115167136B (application CN202210865762.2A)
Authority
CN
China
Prior art keywords
module
network
data
vector
control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210865762.2A
Other languages
Chinese (zh)
Other versions
CN115167136A (en)
Inventor
史殿习
杨焕焕
杨绍武
彭滢璇
孙亦璇
史燕燕
赵琛然
胡浩萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210865762.2A
Publication of CN115167136A
Application granted
Publication of CN115167136B
Legal status: Active

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Abstract

The invention discloses an intelligent agent control method based on deep reinforcement learning and a conditional entropy bottleneck, and aims to solve the problem that the control strategies obtained by existing deep-reinforcement-learning-based intelligent agent control methods have low accuracy in image continuous control tasks. The technical scheme is as follows: construct an intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck, consisting of a perception module, an action module, a storage module, a data expansion module, a feature extraction module and a control module; construct a feature extraction module objective function based on the conditional entropy bottleneck, and obtain the corresponding optimization loss function through variational inference; construct an image continuous control task simulation environment; the intelligent agent trains the control system in the simulation environment to obtain optimized network parameters; the feature extraction module and the control module load the network parameters to obtain the trained control system. The trained control system is deployed on an intelligent agent in a real environment to complete image continuous control tasks. By adopting the method and the device, the accuracy of the intelligent agent control strategy can be improved.

Description

Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck
Technical Field
The invention relates to the field of control, in particular to an intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck in an image continuous control task.
Background
The intelligent agent is an unmanned node with sensing, communication, movement, storage, computation and other capabilities, and its control problem is closely related to robot control. In the control field, traditional control methods usually rely on establishing a detailed and accurate mathematical model of the controlled object. When the controlled object's model is complex or uncertain factors such as external disturbances exist, the application of traditional control methods is severely limited, and a good control effect cannot be achieved in continuous control tasks (i.e., tasks in which the control command of the controlled object is a continuous vector, such as the rotating speed and torque of an intelligent agent's joints). In addition, how to realize intelligent perception and autonomous control of a controlled object in a complex (such as an image-observation) and unknown environment is a research difficulty in the field of intelligent agent control. The Deep Reinforcement Learning (DRL) method, an end-to-end perception and control method, combines Deep Learning (DL) and Reinforcement Learning (RL); by virtue of the representation capability of deep learning and the decision-making capability of reinforcement learning, it is widely applied in fields such as robot control and board games, and provides a new idea for solving the intelligent agent control problem in the above scenarios.
In the image continuous control task, the intelligent agent needs to learn continuous control instructions directly from image observations. DRL unifies perception (processing image observations) and control (outputting continuous control instructions) into an end-to-end training paradigm, which makes the control problem of the intelligent agent in the image continuous control task difficult to solve effectively, because an efficient control strategy depends on robust features. For this problem, researchers decouple perception and control in DRL into corresponding representation learning and policy learning sub-processes (representation learning trains an encoder network to extract robust features corresponding to the intelligent agent's image observations, and policy learning solves the optimal control strategy based on those features), introduce self-designed auxiliary objective functions to optimize the representation learning process, and thereby improve the performance of the control strategy.
In 2019, minne Li et al, in the Neuro information processing System conference NIPS (Neuro-View Renformation Learning) on pages 1420-1431, a Multi-View Reinforcement Learning framework was introduced, a partially observable Markov decision process was expanded to a Multi-View Markov decision process with multiple observation models, and two methods for Learning control strategies from image observations were proposed based on data enhancement of image observations and policy migration between multiple views. In 2020, michael Laskin et al, in the neural information processing System conference NIPS argument, 19884-19895, the article "Reinforcement Learning with Augmented Data", namely "Data Augmented Reinforcement Learning", proposed RAD method, and carried out a broad study on general Data augmentation method in pixel space and state space, proving that Data augmentation methods such as random translation, random clipping, random convolution and the like can improve the performance of control strategy obtained by Reinforcement Learning algorithm. In an article, "Learning invariant representations for reinforcement Learning without reconstruction", published in the ICLR corpus at the international representation Learning conference in 2021 by Amy Zhang et al, a depth mutual simulation for Control (DBC) method is proposed, and a robust representation of image observation of an intelligent body is learned based on a mutual simulation metric of behavior similarity between states in a quantitative continuous markov decision process, thereby improving the performance of a Control strategy for solving a downstream Control task. Although the intelligent agent control method based on the deep reinforcement learning obtains a relatively accurate control strategy in the image continuous control task, the intelligent agent has a complex motion control model, friction exists between joints of the intelligent agent, and image observation of the intelligent agent on the environment contains visual information irrelevant to the control task, so that the accuracy of the solved control strategy is relatively low due to the change of the visual information irrelevant to the task in the image observation when the existing intelligent agent control method based on the deep reinforcement learning solves the control strategy by using a deep neural network.
Therefore, the existing intelligent agent control method based on deep reinforcement learning still has the problem of low control strategy precision in the image continuous control task.
Disclosure of Invention
The invention aims to solve the technical problem that the control strategy precision of the existing intelligent agent control method based on deep reinforcement learning in an image continuous control task is low, provides an intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck, and improves the precision of an intelligent agent control strategy.
According to the method, perception and control in deep reinforcement learning are decoupled, a feature extraction module corresponding to perception is trained on the basis of conditional entropy bottleneck, information related to a control task is extracted from image observation of an intelligent agent to serve as a robust state vector corresponding to the image observation, a control module for image continuous control tasks is trained on the basis of the state vector, and the accuracy of an intelligent agent control strategy is improved.
The specific technical scheme is as follows:
the method comprises the steps of firstly, constructing an intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, installing the control system on an intelligent agent, and enabling the intelligent agent to interact with an image continuous control task environment. The intelligent agent refers to an unmanned node (such as an unmanned aerial vehicle, a robot and the like) with sensing, communication, movement, storage, calculation and other capabilities, and includes but is not limited to a simulated mechanical arm, a simulated robot and the like constructed in a simulation environment. The image continuous control task environment refers to an entity interacting with an intelligent agent, and the intelligent agent observes the state of the environment in the form of an image and acts in the environment according to a continuous control instruction based on the image observation. The intelligent agent control system based on the deep reinforcement learning and the conditional entropy bottleneck consists of a perception module, an action module, a storage module, a data expansion module, a feature extraction module and a control module.
The sensing module is an image sensor (such as a depth camera) and is connected with the feature extraction module and the storage module. The sensing module acquires an image observation (RGB image) containing the state of the intelligent agent (information of the intelligent agent) and the state of the environment (information except the intelligent agent) from the image continuous control task environment, and sends the image observation to the feature extraction module and the storage module.
The action module is an actuator (such as an engine, a steering gear and the like) of an intelligent agent control instruction, is connected with the control module, receives the control instruction from the control module, and acts in the image continuous control task environment according to the control instruction.
The storage module is a memory with more than 1GB of available space, is connected with the sensing module, the control module and the data expansion module, receives image observations from the sensing module, receives control instructions from the control module, receives rewards from the image continuous control task environment, and combines the image observations, control instructions and rewards into trajectory data (trajectory data for short) of the interaction between the intelligent agent and the image continuous control task environment. Trajectory data are stored in the form of quadruples (s_t, a_t, r_t, s_{t+1}), wherein: s_t is the image observation received from the perception module at the t-th interaction between the intelligent agent and the image continuous control task environment; a_t is the control instruction from the control module executed at the t-th interaction; r_t is the reward value fed back by the environment for the control instruction a_t at the t-th interaction; and s_{t+1} is the image observation received from the perception module after the environment state has changed due to the t-th interaction (also called the image observation at the (t+1)-th interaction).
The data expansion module is connected with the storage module, the feature extraction module and the control module. It randomly selects from the storage module the trajectory data τ = (s_t, a_t, r_t, s_{t+1}) required for training the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck, performs N data augmentations on s_t and s_{t+1} in τ to obtain the trajectory data after N data augmentations, τ^N = (s_t^{1:N}, a_t, r_t, s_{t+1}^{1:N}), where s_t^j and s_{t+1}^j (j ∈ [1, N], j being the index of the image observation after data augmentation) denote the j-th augmented image observations, and sends τ^N to the feature extraction module and the control module.
The feature extraction module is connected with the sensing module, the data expansion module and the control module. The characteristic extraction module consists of a coder network, a target coder network, a characteristic fusion network, a single-view predictor network and a multi-view predictor network.
The Encoder network consists of a first encoder network Encoder_1 and a second encoder network Encoder_2, and is connected with the sensing module, the data expansion module, the control module, the feature fusion network and the single-view predictor network. Encoder_1 consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer, and is connected with the sensing module, the data expansion module, the control module and Encoder_2; Encoder_2 consists of 3 fully connected layers and is connected with Encoder_1, the feature fusion network and the single-view predictor network.

When the intelligent agent interacts with the image continuous control task environment, Encoder_1 receives s_t from the perception module. The first, second, third and fourth convolutional layers of Encoder_1 convolve s_t in turn with 3×3 convolution kernels to obtain s_t after four convolution operations, which is sent to the fully connected layer of Encoder_1; the fully connected layer performs a full connection operation on it to obtain the state vector corresponding to s_t and sends the fully connected state vector to the regularization layer of Encoder_1; the regularization layer regularizes the fully connected state vector to obtain the regularized state vector, which is taken as the first state vector z_t, and z_t is sent to the control module.

When the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck is trained, Encoder_1 receives the data-augmented trajectory data τ^N from the data expansion module. The first, second, third and fourth convolutional layers of Encoder_1 convolve s_t^{1:N} in τ^N in turn with 3×3 convolution kernels, and the results after four convolution operations are sent to the fully connected layer of Encoder_1; the fully connected layer performs a full connection operation on them to obtain the N state vectors corresponding to s_t^{1:N} and sends the N fully connected state vectors to the regularization layer of Encoder_1; the regularization layer regularizes the N fully connected state vectors to obtain N regularized state vectors, which are taken as the second state vectors, denoted z_t^{1:N}. The first second state vector z_t^1 is sent to the control module, and z_t^{1:N} is sent to Encoder_2. The first, second and third fully connected layers of Encoder_2 perform full connection operations in turn on z_t^{1:N} received from Encoder_1 to obtain the means and variances of the Gaussian distributions corresponding to z_t^{1:N}; a reparameterization operation is then performed on the means and variances (see "Auto-Encoding Variational Bayes", Diederik P. Kingma and Max Welling, published in the ICLR proceedings, 2014) to obtain N reparameterized state vectors, denoted ẑ_t^{1:N}, which are sent to the feature fusion network and the single-view predictor network.
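For illustration only, the following PyTorch sketch shows a minimal Encoder_1/Encoder_2 pair of the shape described above (four 3×3 convolutions, one fully connected layer, one regularization layer, then a 3-layer MLP producing a Gaussian mean/variance followed by reparameterization). The layer widths, strides, latent dimension and 84×84 input size are illustrative assumptions not fixed by this description:

```python
import torch
import torch.nn as nn

class Encoder1(nn.Module):
    """4 convolutional layers (3x3 kernels) + 1 fully connected layer + 1 regularization layer."""
    def __init__(self, latent_dim=50):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * 35 * 35, latent_dim)  # 84x84 input -> 35x35 feature map
        self.norm = nn.LayerNorm(latent_dim)           # regularization layer

    def forward(self, obs):
        h = self.convs(obs / 255.0).flatten(1)          # four 3x3 convolution operations
        return torch.tanh(self.norm(self.fc(h)))        # state vector z_t

class Encoder2(nn.Module):
    """3 fully connected layers producing a Gaussian mean/variance, then reparameterization."""
    def __init__(self, latent_dim=50):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),
        )

    def forward(self, z):
        mu, log_var = self.mlp(z).chunk(2, dim=-1)
        eps = torch.randn_like(mu)                      # Gaussian noise used for reparameterization
        z_hat = mu + eps * torch.exp(0.5 * log_var)     # reparameterized state vector
        return z_hat, mu, log_var
```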
The target encoder network is connected with the data expansion module, the control module and the feature fusion network, and consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer. During training of the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck, updating the encoder network parameters too fast would make the training process oscillate and become unstable; the target encoder network is therefore introduced into the system to improve the stability of training. The target encoder network receives τ^N from the data expansion module. Its first, second, third and fourth convolutional layers convolve s_{t+1}^{1:N} in τ^N in turn with 3×3 convolution kernels, and the results after four convolution operations are sent to the fully connected layer; the fully connected layer performs a full connection operation on them to obtain the N target state vectors corresponding to s_{t+1}^{1:N} and sends the N fully connected target state vectors to the regularization layer; the regularization layer regularizes them to obtain N regularized target state vectors, which are taken as the target state vectors, denoted z̄_{t+1}^{1:N}. The first target state vector z̄_{t+1}^1 is sent to the control module, and z̄_{t+1}^{1:N} is sent to the feature fusion network.
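The text above fixes an update frequency F for the target networks (step 4.3) but this excerpt does not spell out the update rule itself. A common choice, assumed here purely for illustration, is an exponential moving average (Polyak averaging) of the online network's parameters:

```python
import torch

@torch.no_grad()
def soft_update(online_net, target_net, tau=0.05):
    """Polyak-average the online parameters into the target network.

    tau is an assumed smoothing coefficient; the text only fixes the
    update frequency F, not the exact update rule.
    """
    for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```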
The feature fusion network is connected with the encoder network, the target encoder network and the multi-view predictor network, and consists of a first fusion network Feature_1 and a second fusion network Feature_2, each composed of 3 fully connected layers. Feature_1 is connected with the encoder network, the target encoder network and Feature_2. Feature_1 receives ẑ_t^{1:N} from the encoder network and z̄_{t+1}^{1:N} from the target encoder network. The first, second and third fully connected layers of Feature_1 perform full connection operations in turn on ẑ_t^{1:N}, splicing ẑ_t^{1:N} into the state fusion vector z_t^{fus}, which is sent to Feature_2; the first, second and third fully connected layers of Feature_1 likewise perform full connection operations in turn on z̄_{t+1}^{1:N}, splicing them into the target state fusion vector z̄_{t+1}^{fus}. Feature_2 is connected with Feature_1 and the multi-view predictor network; it receives z_t^{fus} from Feature_1, and its first, second and third fully connected layers perform a reparameterization operation on z_t^{fus} to obtain the reparameterized state fusion vector ẑ_t^{fus}, which is sent to the multi-view predictor network.
The single-view predictor network is connected with the data expansion module and the encoder network, and consists of 3 fully connected layers. The single-view predictor network receives the data-augmented trajectory data τ^N from the data expansion module and ẑ_t^{1:N} from the encoder network. The first, second and third fully connected layers of the single-view predictor network perform full connection operations on the first spliced vector formed by ẑ_t^j and the control instruction a_t in τ^N, mapping the first spliced vector to a predicted target state vector z̃_{t+1}^j and a first predicted reward value r̃_t^j, thereby implementing prediction of the transition dynamics equation and the reward function equation (both are basic concepts in reinforcement learning; see the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto), wherein: z̃_{t+1}^j denotes the j-th predicted target state vector and r̃_t^j denotes the j-th first predicted reward value.
The multi-view predictor network is connected with the data expansion module and the feature fusion network, and consists of 3 fully connected layers. The multi-view predictor network receives the data-augmented trajectory data τ^N from the data expansion module and ẑ_t^{fus} from the feature fusion network, and forms a second spliced vector from ẑ_t^{fus} and the control instruction a_t in τ^N; its first, second and third fully connected layers perform full connection operations on the second spliced vector in turn, mapping it to a predicted target state fusion vector z̃_{t+1}^{fus} and a second predicted reward value r̃_t^{fus}, thereby implementing prediction of the transition dynamics equation and the reward function equation.
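The single-view and multi-view predictors described above share the same shape: a 3-layer MLP that takes the concatenation of a (fused) state vector and a control instruction and outputs a predicted next state vector together with a predicted reward. A minimal sketch is given below; the hidden width, action dimension and output packing are assumptions:

```python
import torch
import torch.nn as nn

class TransitionRewardPredictor(nn.Module):
    """3 fully connected layers mapping [state vector, action] to
    (predicted target state vector, predicted reward value)."""
    def __init__(self, latent_dim=50, action_dim=6, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim + 1),   # next latent vector + scalar reward
        )

    def forward(self, z, a):
        out = self.mlp(torch.cat([z, a], dim=-1))   # spliced vector -> prediction
        return out[..., :-1], out[..., -1]          # predicted z_{t+1}, predicted r_t
```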
The control module is connected with the feature extraction module, the data expansion module and the action module, and consists of two evaluation networks (a first evaluation network Critic_1 and a second evaluation network Critic_2), two target evaluation networks (a first target evaluation network TarCritic_1 and a second target evaluation network TarCritic_2) and a policy network. Two evaluation networks and two target evaluation networks are designed in the control module to prevent the over-estimation problem that arises when a single evaluation network or a single target evaluation network evaluates the quality of the intelligent agent's control commands.
Critic_1 and Critic_2 are connected with the feature extraction module, the data expansion module and the policy network, and each consists of three fully connected layers. Both receive the first second state vector z_t^1 from the feature extraction module, the data-augmented trajectory data τ^N from the data expansion module, and the control instruction a from the policy network, and evaluate the quality of the control instruction a_t in τ^N and of a. The first, second and third fully connected layers of Critic_1 perform full connection operations in turn on the third spliced vector composed of z_t^1 and a_t, and after three full connection operations map the third spliced vector to the first state-action value Q_1(z_t^1, a_t) (the state-action value is a basic concept in reinforcement learning, referring to the expected value of the reward that can be obtained after the agent executes the selected control command in the current state); the first, second and third fully connected layers of Critic_1 also perform full connection operations on the fourth spliced vector composed of z_t^1 and a, mapping it to the second state-action value Q_1(z_t^1, a). Likewise, the first, second and third fully connected layers of Critic_2 map the third spliced vector composed of z_t^1 and a_t to the third state-action value Q_2(z_t^1, a_t), and map the fourth spliced vector composed of z_t^1 and a to the fourth state-action value Q_2(z_t^1, a).
TarCritic_1 and TarCritic_2 are both connected with the feature extraction module and the policy network, and each consists of three fully connected layers. Both receive the first target state vector z̄_{t+1}^1 from the feature extraction module and the control instruction a′ from the policy network, and evaluate the quality of a′. The first, second and third fully connected layers of TarCritic_1 perform full connection operations in turn on the target spliced vector composed of z̄_{t+1}^1 and a′, mapping it to the first target state-action value Q̄_1(z̄_{t+1}^1, a′); the first, second and third fully connected layers of TarCritic_2 perform full connection operations on the same target spliced vector, mapping it to the second target state-action value Q̄_2(z̄_{t+1}^1, a′).
The policy network is connected with the feature extraction module, the action module, the storage module, Critic_1, Critic_2, TarCritic_1 and TarCritic_2, and consists of three fully connected layers. When the intelligent agent interacts with the image continuous control task environment, the policy network receives the first state vector z_t from the feature extraction module; its first, second and third fully connected layers perform full connection operations on z_t in turn, mapping z_t to the control instruction a_t, and a_t is sent to the action module and the storage module. When the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck is trained, the policy network receives the first second state vector z_t^1 and the first target state vector z̄_{t+1}^1 from the feature extraction module; its first, second and third fully connected layers perform full connection operations on z_t^1 in turn, mapping z_t^1 to the control instruction a, and a is sent to Critic_1 and Critic_2; its first, second and third fully connected layers likewise perform full connection operations on z̄_{t+1}^1, mapping it to the control instruction a′, and a′ is sent to TarCritic_1 and TarCritic_2.
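The control module thus follows a twin-critic actor-critic layout: two Q networks plus two target Q networks to curb over-estimation, and a 3-layer policy MLP mapping a state vector to a continuous control instruction. A minimal PyTorch sketch is given below; the hidden widths, the action dimension and the squashed-Gaussian policy head are assumptions (the text only fixes the layer counts and the network roles):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Three fully connected layers: (state vector, action) -> state-action value Q."""
    def __init__(self, latent_dim=50, action_dim=6, hidden=256):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, a):
        return self.q(torch.cat([z, a], dim=-1)).squeeze(-1)

class Policy(nn.Module):
    """Three fully connected layers mapping a state vector to a continuous control instruction."""
    def __init__(self, latent_dim=50, action_dim=6, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),
        )

    def forward(self, z):
        mu, log_std = self.mlp(z).chunk(2, dim=-1)
        std = log_std.clamp(-10, 2).exp()
        dist = torch.distributions.Normal(mu, std)
        raw = dist.rsample()
        action = torch.tanh(raw)                   # continuous control instruction in [-1, 1]
        log_prob = (dist.log_prob(raw) - torch.log(1 - action.pow(2) + 1e-6)).sum(-1)
        return action, log_prob

# Two evaluation networks and two target evaluation networks, as in the control module above.
critic_1, critic_2 = Critic(), Critic()
tar_critic_1, tar_critic_2 = Critic(), Critic()
tar_critic_1.load_state_dict(critic_1.state_dict())
tar_critic_2.load_state_dict(critic_2.state_dict())
policy = Policy()
```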
In the second step, the objective function of the feature extraction module is constructed based on the conditional entropy bottleneck, and the optimization loss function of the feature extraction module is obtained through variational inference (see "Deep Variational Information Bottleneck", Alexander A. Alemi et al., published in the proceedings of the International Conference on Learning Representations, ICLR, 2017). The method comprises the following steps:
2.1 In order to learn the state vector corresponding to the image observations, a feature extraction module objective function, shown as formula (1), is designed based on the conditional entropy bottleneck. The conditional entropy bottleneck (see "The Conditional Entropy Bottleneck", Ian Fischer, published in the journal Entropy in 2020) is an information-theoretic method for extracting a feature Z of given data X in order to predict a label Y; it is defined as min_Z β·I(X;Z|Y) − I(Y;Z), so that the information retained in Z is maximally correlated with Y.

In formula (1): Object denotes the objective of the feature extraction module; s_t^j is the image observation obtained by the data expansion module performing the j-th data augmentation on the image observation of the t-th interaction, and s_{t+1}^j is the image observation obtained by performing the j-th data augmentation on the image observation of the (t+1)-th interaction; ẑ_t^j is the reparameterized state vector obtained after s_t^j is input into the encoder network; z̄_{t+1}^j is the target state vector obtained after s_{t+1}^j is input into the target encoder network; ẑ_t^{fus} is the reparameterized state fusion vector obtained after the N reparameterized state vectors ẑ_t^{1:N} are input into the feature fusion network; z̄_{t+1}^{fus} is the target state fusion vector obtained after the N target state vectors z̄_{t+1}^{1:N} are input into the feature fusion network; β_j is a regularization factor, with a suggested value of 1e-4 to 1e-2; and the I(·;·|·) terms in formula (1) are conditional mutual information terms.
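For reference, the generic conditional entropy bottleneck objective from the cited Fischer (2020) paper, of which formula (1) is an instantiation per augmented view, can be written as follows. This is the cited paper's general form and its standard variational upper bound (with a forward encoder e, a variational backward encoder b and a variational decoder q), not a reproduction of the patent's formula (1):

```latex
% Generic conditional entropy bottleneck: keep in Z only the information
% about X that is useful for predicting the label Y.
\mathrm{CEB} \;=\; \min_{Z}\; \beta\, I(X;Z\mid Y)\;-\;I(Y;Z)

% Standard variational upper bound used to optimize it in practice:
\mathrm{CEB} \;\le\; \mathbb{E}_{x,y,\,z\sim e(z\mid x)}
\Big[\beta\big(\log e(z\mid x)-\log b(z\mid y)\big)\;-\;\log q(y\mid z)\Big]
```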
2.2 Applying variational inference to formula (1) yields the optimization loss function of the feature extraction module, shown as formula (2).

In formula (2): M is the number of trajectory data randomly selected from the storage module by the data expansion module; the encoder distribution appearing in formula (2) is a Gaussian distribution whose mean μ_t^j and variance (σ_t^j)² are computed by Encoder_2 in the encoder network; the remaining distribution in formula (2) is the variational distribution; ξ_j and ξ_{1:N} are Gaussian noises used in the reparameterization of ẑ_t^j and of the state fusion vector, respectively; E_{ξ_j} denotes the expectation with respect to ξ_j, and E_{ξ_1, ξ_2, …, ξ_N} denotes the expectation with respect to ξ_1, …, ξ_N. The prediction terms in formula (2) are: the cross-entropy loss between the j-th predicted target state vector z̃_{t+1}^j, obtained by inputting ẑ_t^j and a_t into the single-view predictor network, and z̄_{t+1}^j; the cross-entropy loss between the j-th first predicted reward value r̃_t^j and r_t; the cross-entropy loss between the predicted target state fusion vector z̃_{t+1}^{fus}, obtained by inputting ẑ_t^{fus} and a_t into the multi-view predictor network, and z̄_{t+1}^{fus}; and the cross-entropy loss between the second predicted reward value r̃_t^{fus} and r_t.
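As a rough illustration of how a loss of the shape just described can be computed for one augmented view, the sketch below combines a KL term between the encoder's Gaussian and a variational marginal with squared-error surrogates for the cross-entropy prediction terms. The unit-Gaussian marginal, the squared-error surrogates and the single weight beta are simplifying assumptions and are not the patent's exact formula (2):

```python
import torch
import torch.nn.functional as F

def ceb_style_loss(mu, log_var, pred_next_z, target_next_z,
                   pred_reward, reward, beta=1e-3):
    """One augmented view of a simplified CEB-style variational loss (sketch only).

    mu, log_var   : Gaussian parameters produced by Encoder_2
    pred_next_z   : predictor output for the next state vector
    target_next_z : target encoder output for the next observation
    pred_reward   : predicted reward value; reward: environment reward r_t
    """
    # KL(N(mu, sigma^2) || N(0, I)) -- a unit-Gaussian marginal is assumed for simplicity.
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(-1).mean()
    # Squared-error surrogates for the cross-entropy prediction terms.
    dyn_loss = F.mse_loss(pred_next_z, target_next_z.detach())
    rew_loss = F.mse_loss(pred_reward, reward)
    return beta * kl + dyn_loss + rew_loss
```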
In the third step, an image continuous control task simulation environment is constructed in the open-source DeepMind Control Suite (DMControl) simulation environment (DMControl is backed by the MuJoCo (Multi-Joint dynamics with Contact) physics engine used in robotics research and related fields), in preparation for training the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck. The method comprises the following steps:
3.1 Install the DMControl simulation environment (the physics engine MuJoCo is required to be version mujoco200) on any computer equipped with Ubuntu (version 16.04 or above) and the PyTorch deep learning framework, construct the intelligent agent simulation model and the image continuous control task simulation environment, and set the task target of the intelligent agent in the image continuous control task.
3.2 In the constructed image continuous control task simulation environment, the scale of the image observation with which the intelligent agent perceives its own state and the environment state is set to 100×100, the control instruction executed by the intelligent agent is set to be a continuous vector (such as joint rotation speed, torque and the like), and the reward value fed back by the image continuous control task simulation environment after the intelligent agent executes a control instruction is set according to the task target.
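For concreteness, pixel observations of the kind described in steps 3.1-3.2 can be obtained from a DMControl task roughly as follows. The cheetah-run domain, the camera id and the random actions are arbitrary examples, not the patent's specific task:

```python
import numpy as np
from dm_control import suite

# Load an image continuous control task (domain/task names are examples).
env = suite.load(domain_name="cheetah", task_name="run")
action_spec = env.action_spec()          # continuous control instructions (e.g. joint torques)

time_step = env.reset()
for _ in range(5):
    # 100x100 RGB image observation of the agent state and environment state.
    obs = env.physics.render(height=100, width=100, camera_id=0)
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)         # reward value is available as time_step.reward
```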
Fourthly, the intelligent agent trains an intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck in the image continuous control task simulation environment established in the third step, and the method comprises the following steps:
4.1 Initialize the network parameters of the feature extraction module and the control module in the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck, including the weight matrices and bias vectors of the fully connected layers, the convolution kernels of the convolutional layers, and the weight matrices and bias vectors of the regularization layers. The parameters are generated using orthogonal initialization (a parameter initialization method for neural networks), in which the non-zero parameters are drawn from a normal distribution with mean 0 and standard deviation 1.
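In PyTorch, the orthogonal initialization of step 4.1 can be applied to the linear and convolutional layers along the following lines (zeroing the biases is an assumption; the text only says the parameters are orthogonally initialized):

```python
import torch.nn as nn

def orthogonal_init(module):
    """Orthogonally initialize weight matrices / convolution kernels, zero the biases."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: model.apply(orthogonal_init) for the feature extraction and control modules.
```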
4.2 Initialize the storage module in the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck: set the size of the storage module to a buffer queue capable of storing A (A ≥ 10^5) pieces of trajectory data generated when the intelligent agent interacts with the image continuous control task simulation environment, and empty the buffer queue.
4.3 Initialize the number of interactions t = 0 between the intelligent agent and the image continuous control task simulation environment constructed in the third step, set the maximum number of interactions T (T is a positive integer and T ≥ 5A), set the maximum number of interactions E per round between the intelligent agent and the image continuous control task simulation environment (E is a positive integer, generally 1000), and set the update frequency F (F is a positive integer, generally 2) of the target encoder network and the target evaluation networks in the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck.
And 4.4, randomly setting the initial state of the image continuous control task simulation environment constructed in the third step and the initial state of the intelligent agent simulation model.
4.5, the perception module acquires image observation when the intelligent agent interacts with the image continuous control task simulation environment, and sends the image observation to the feature extraction module and the storage module; the characteristic extraction module receives image observation, an Encoder _1 in an Encoder network encodes the image observation to obtain a first state vector corresponding to the image observation, and the first state vector is sent to the control module; the control module receives the first state vector, the strategy network maps the first state vector into a control instruction, and the control instruction is sent to the action module and the storage module, and the method comprises the following steps:
4.5.1 The perception module obtains the image observation s_t at the t-th interaction between the intelligent agent and the image continuous control task simulation environment, and sends s_t to the feature extraction module and the storage module.
4.5.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t, and z_t is sent to the control module.
4.5.3 The control module receives the first state vector z_t; the policy network maps z_t to the control instruction a_t to be executed at the t-th interaction between the intelligent agent and the image continuous control task simulation environment, and a_t is sent to the action module and the storage module.
4.6 The action module receives the control instruction a_t and executes a_t in the image continuous control task simulation environment.
4.7 The image continuous control task simulation environment returns, according to the reward designed in step 3.2, the reward value r_t obtained at the t-th interaction between the intelligent agent and the environment, and sends r_t to the storage module.
4.8 The state of the image continuous control task simulation environment changes because the intelligent agent has executed the control instruction a_t; the perception module obtains the image observation s_{t+1} corresponding to the changed environment state and sends s_{t+1} to the storage module.
4.9 The storage module receives s_t and s_{t+1} from the perception module, a_t from the control module, and r_t from the image continuous control task simulation environment, and combines them into the trajectory data quadruple (s_t, a_t, r_t, s_{t+1}), which is stored in the buffer queue (a minimal buffer sketch is given after step 4.9.3). The method comprises the following steps:
4.9.1 The storage module judges whether there are already A pieces of trajectory data in the buffer queue; if there are A pieces of trajectory data, go to step 4.9.2, otherwise go to step 4.9.3.
4.9.2 The storage module removes one piece of trajectory data at the head of the buffer queue according to the first-in-first-out principle.
4.9.3 The storage module combines s_t, s_{t+1}, a_t and r_t into the trajectory data quadruple (s_t, a_t, r_t, s_{t+1}) and stores it at the tail of the buffer queue.
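Steps 4.9.1-4.9.3 describe a fixed-capacity first-in-first-out buffer. A minimal sketch (the capacity corresponds to A above; the sampling method corresponds to step 4.11):

```python
from collections import deque
import random

class ReplayBuffer:
    """FIFO buffer of (s_t, a_t, r_t, s_{t+1}) trajectory quadruples with capacity A."""
    def __init__(self, capacity=100_000):
        self.queue = deque(maxlen=capacity)   # the oldest quadruple is dropped automatically

    def store(self, s_t, a_t, r_t, s_t1):
        self.queue.append((s_t, a_t, r_t, s_t1))

    def sample(self, m):
        """Randomly select M quadruples of trajectory data (step 4.11)."""
        return random.sample(self.queue, m)

    def __len__(self):
        return len(self.queue)
```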
4.10 Let t = t + 1. If t % E = 0, the number of interactions between the intelligent agent and the image continuous control task simulation environment in this round has reached E; this round of the control task ends, and a new round of the control task is restarted from step 4.4. Otherwise, this round of the control task is not finished; go to step 4.11 to continue this round of the control task.
4.11 The data expansion module judges whether there are L (L is generally set to 1000) pieces of trajectory data in the buffer queue of the storage module. If there are L pieces of trajectory data, it randomly selects M (M is generally set to 512) pieces of trajectory data from the buffer queue and forms them into the trajectory data set τ_M, where τ_M(m) = (s_t(m), a_t(m), r_t(m), s_{t+1}(m)) denotes the m-th (m ∈ [1, M]) piece of trajectory data in τ_M; then go to step 4.12 to optimize the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck according to τ_M. If there are not yet L pieces of trajectory data in the buffer queue, go to step 4.5.1.
4.12 The data expansion module uses the random cropping method from data augmentation (from the RAD method) to perform N data augmentations in turn on the image observations of each piece of trajectory data in τ_M, obtaining M pieces of data-augmented trajectory data, which are sent to the feature extraction module and the control module; the feature extraction module and the control module receive the M pieces of data-augmented trajectory data and optimize the intelligent agent control system based on deep reinforcement learning and the conditional entropy bottleneck. The method comprises the following steps:
4.12.1 Initialize the optimization counter k = 0 and set the total number of optimizations K (K is generally set to 10).
4.12.2 The data expansion module uses the random cropping method from data augmentation to perform N data augmentations in turn on the image observations of each piece of trajectory data in τ_M, obtaining M pieces of data-augmented trajectory data, and sends the M pieces of data-augmented trajectory data to the feature extraction module and the control module (a cropping sketch is given after step 4.12.2.7). The method comprises the following steps:
4.12.2.1 initializes track data index m =1.
4.12.2.2 initializes the number of data expansion j =0, and sets the total number of data expansion N (N is generally set to 2).
4.12.2.3 Using the random cropping method from data augmentation and referring to the settings in RAD, crop the m-th piece of trajectory data τ_M(m) = (s_t(m), a_t(m), r_t(m), s_{t+1}(m)) in τ_M: crop the 100×100 image observation s_t(m) to an 84×84 image observation s_t^j(m), and crop the 100×100 image observation s_{t+1}(m) to an 84×84 image observation s_{t+1}^j(m).
4.12.2.4 let j = j +1. If j is equal to the total data expansion number N, go to step 4.12.2.5, otherwise go to step 4.12.2.3.
4.12.2.5 The data expansion module replaces s_t(m) in τ_M(m) with the N data-augmented image observations s_t^{1:N}(m), and replaces s_{t+1}(m) with the N data-augmented image observations s_{t+1}^{1:N}(m), obtaining the trajectory data of τ_M(m) after N data augmentations: τ_M^N(m) = (s_t^{1:N}(m), a_t(m), r_t(m), s_{t+1}^{1:N}(m)).
4.12.2.6 If m < M, let m = m + 1 and go to step 4.12.2.2; if m = M, the augmentation of the M pieces of trajectory data is finished, and the M pieces of data-augmented trajectory data τ_M^N = {τ_M^N(1), …, τ_M^N(M)} are obtained; go to step 4.12.2.7.
4.12.2.7 The data expansion module sends τ_M^N to the feature extraction module and the control module.
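The random cropping of steps 4.12.2.1-4.12.2.7 (100×100 observations cropped to 84×84, repeated N times per observation, following RAD) can be implemented along these lines; the channel-last (H, W, C) layout is an assumption:

```python
import numpy as np

def random_crop(image, out_size=84):
    """Randomly crop a single H x W x C image observation to out_size x out_size."""
    h, w = image.shape[:2]
    top = np.random.randint(0, h - out_size + 1)
    left = np.random.randint(0, w - out_size + 1)
    return image[top:top + out_size, left:left + out_size, :]

def augment_n_times(image, n=2, out_size=84):
    """Produce the N data-augmented views s_t^1, ..., s_t^N of one observation."""
    return [random_crop(image, out_size) for _ in range(n)]
```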
4.12.3 The feature extraction module receives τ_M^N from the data expansion module; for the M pieces of trajectory data in τ_M^N, the gradient descent method (a common optimization method in machine learning) is used in turn to minimize the feature extraction module loss function shown in formula (2), optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network in the feature extraction module. The method comprises the following steps:
4.12.3.1 The encoder network, the target encoder network, the single-view predictor network and the multi-view predictor network receive τ_M^N from the data expansion module.
4.12.3.2 initializes track data index m =1.
4.12.3.3 Encoder_1 in the encoder network encodes s_t^{1:N}(m) in the m-th piece of trajectory data τ_M^N(m) of τ_M^N into N second state vectors z_t^{1:N}(m) (z_t^j(m) is the second state vector corresponding to s_t^j(m)), and sends z_t^{1:N}(m) to Encoder_2. Encoder_2 obtains, through its fully connected layers, the means μ_t^{1:N}(m) and variances (σ_t^{1:N}(m))² of the Gaussian distributions corresponding to z_t^{1:N}(m), where μ_t^j(m) is the mean of the Gaussian distribution corresponding to z_t^j(m) and (σ_t^j(m))² is its variance. Encoder_2 performs the reparameterization operation on μ_t^{1:N}(m) and (σ_t^{1:N}(m))² to obtain the N reparameterized state vectors ẑ_t^{1:N}(m) (ẑ_t^j(m) is the reparameterized state vector corresponding to z_t^j(m)), and sends ẑ_t^{1:N}(m) to the feature fusion network and the single-view predictor network.
4.12.3.4 The target encoder network encodes s_{t+1}^{1:N}(m) in the m-th piece of trajectory data τ_M^N(m) of τ_M^N into N target state vectors z̄_{t+1}^{1:N}(m) (z̄_{t+1}^j(m) is the target state vector corresponding to s_{t+1}^j(m)), and sends z̄_{t+1}^{1:N}(m) to the feature fusion network.
4.12.3.5 The feature fusion network receives ẑ_t^{1:N}(m) from the encoder network; Feature_1 performs feature fusion on ẑ_t^{1:N}(m) to obtain the state fusion vector z_t^{fus}(m) and sends z_t^{fus}(m) to Feature_2; Feature_2 performs the reparameterization operation on z_t^{fus}(m) to obtain the reparameterized state fusion vector ẑ_t^{fus}(m) and sends ẑ_t^{fus}(m) to the multi-view predictor network.
4.12.3.6 The feature fusion network receives z̄_{t+1}^{1:N}(m) from the target encoder network; Feature_1 performs feature fusion on z̄_{t+1}^{1:N}(m) to obtain the target state fusion vector z̄_{t+1}^{fus}(m).
4.12.3.7 The single-view predictor network receives ẑ_t^{1:N}(m) from the encoder network and the m-th piece of trajectory data τ_M^N(m) from τ_M^N. For the first spliced vector formed by ẑ_t^j(m) and the control instruction a_t(m) in τ_M^N(m), it performs the transition dynamics equation prediction and the reward function equation prediction to obtain the predicted target state vector z̃_{t+1}^j(m) and the first predicted reward value r̃_t^j(m).
4.12.3.8 The multi-view predictor network receives ẑ_t^{fus}(m) from the feature fusion network and the m-th piece of trajectory data τ_M^N(m) from τ_M^N. For the second spliced vector formed by ẑ_t^{fus}(m) and the control instruction a_t(m) in τ_M^N(m), it performs the transition dynamics equation prediction and the reward function equation prediction to obtain the predicted target state fusion vector z̃_{t+1}^{fus}(m) and the second predicted reward value r̃_t^{fus}(m).
4.12.3.9 Based on the means μ_t^{1:N}(m) and variances (σ_t^{1:N}(m))² obtained in step 4.12.3.3, the target state vectors z̄_{t+1}^{1:N}(m) obtained in step 4.12.3.4, the target state fusion vector z̄_{t+1}^{fus}(m) obtained in step 4.12.3.6, the predicted target state vectors z̃_{t+1}^{1:N}(m) and first predicted reward values r̃_t^{1:N}(m) obtained in step 4.12.3.7, the predicted target state fusion vector z̃_{t+1}^{fus}(m) and second predicted reward value r̃_t^{fus}(m) obtained in step 4.12.3.8, and r_t(m) in τ_M^N(m), the feature extraction module uses the gradient descent method to minimize the optimization loss function in formula (2) by updating in the reverse direction of the gradient, optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network.
4.12.3.10 If m < M, let m = m + 1 and go to step 4.12.3.3; otherwise, go to step 4.12.4.
4.12.4 Encoder_1 of the encoder network in the feature extraction module encodes the first data-augmented image observations s_t^1(1), …, s_t^1(M) of the M pieces of trajectory data in τ_M^N into M second state vectors z_t^1(1), …, z_t^1(M), and sends them to the control module.
4.12.5 The target encoder network in the feature extraction module encodes the first data-augmented image observations s_{t+1}^1(1), …, s_{t+1}^1(M) of the M pieces of trajectory data in τ_M^N into M target state vectors z̄_{t+1}^1(1), …, z̄_{t+1}^1(M), and sends them to the control module.
4.12.6 The control module receives τ_M^N from the data expansion module and the second state vectors z_t^1(1), …, z_t^1(M) and target state vectors z̄_{t+1}^1(1), …, z̄_{t+1}^1(M) from the feature extraction module. For the M pieces of trajectory data in τ_M^N and the corresponding z_t^1(m) and z̄_{t+1}^1(m), the gradient descent method is used in turn to minimize the loss functions shown in formula (3), formula (4) and formula (5), optimizing the evaluation networks and the policy network. The method comprises the following steps:
4.12.6.1 The policy network receives z_t^1(1), …, z_t^1(M) and z̄_{t+1}^1(1), …, z̄_{t+1}^1(M) from the feature extraction module; Critic_1 and Critic_2 receive z_t^1(1), …, z_t^1(M) from the feature extraction module and τ_M^N from the data expansion module; TarCritic_1 and TarCritic_2 receive z̄_{t+1}^1(1), …, z̄_{t+1}^1(M) from the feature extraction module.
4.12.6.2 initializes track data index m =1.
4.12.6.3 The policy network obtains the m-th second state vector z_t^1(m) from z_t^1(1), …, z_t^1(M) and the m-th target state vector z̄_{t+1}^1(m) from z̄_{t+1}^1(1), …, z̄_{t+1}^1(M), and performs the following operations: it performs action mapping on z_t^1(m) to obtain the control instruction a(m) and sends a(m) to Critic_1 and Critic_2; it performs action mapping on z̄_{t+1}^1(m) to obtain the control instruction a′(m) and sends a′(m) to TarCritic_1 and TarCritic_2.
4.12.6.4 Critic_1 receives the control instruction a(m) from the policy network, obtains the m-th second state vector z_t^1(m) from z_t^1(1), …, z_t^1(M) and the m-th piece of trajectory data τ_M^N(m) from τ_M^N, and performs the following operations: it performs state-action value estimation on the third spliced vector formed by z_t^1(m) and a_t(m) in τ_M^N(m) to obtain the first state-action value Q_1(z_t^1(m), a_t(m)); it performs state-action value estimation on z_t^1(m) and a(m) to obtain the second state-action value Q_1(z_t^1(m), a(m)).
4.12.6.5 Critic_2 receives the control instruction a(m) from the policy network, obtains the m-th second state vector and the m-th trajectory data in τ_M^N, and performs the following operations: it performs state-action value estimation on the third spliced vector formed by the m-th second state vector and a_t(m), obtaining the third state-action value; it performs state-action value estimation on the m-th second state vector and a(m), obtaining the fourth state-action value.
4.12.6.6 TarCritic_1 receives the control instruction a'(m) from the policy network, obtains the m-th target state vector, and performs state-action value estimation on the m-th target state vector and a'(m), obtaining the first target state-action value.
4.12.6.7 TarCritic_2 receives the control instruction a'(m) from the policy network, obtains the m-th target state vector, and performs state-action value estimation on the m-th target state vector and a'(m), obtaining the second target state-action value.
4.12.6.8 The control module uses the gradient descent method to minimize the loss function shown in equation (3), updating in the reverse direction of the gradient to optimize Critic_1 and Critic_2:

L(θ_i) = E[ (Q_{θ_i}(z_t(m), a_t(m)) − (r_t(m) + γ·V(z'_{t+1}(m))))² ],  i = 1, 2    (3)

with V(z'_{t+1}(m)) = min_{i=1,2} Q_{θ'_i}(z'_{t+1}(m), a'(m)) − α·log π_φ(a'(m) | z'_{t+1}(m)), the expectation being taken over the M pieces of trajectory data.

Wherein: θ_i is the parameter of the i-th evaluation network, θ'_i is the parameter of the i-th target evaluation network, φ is the parameter of the policy network, and i = 1 or 2 is the index of the two evaluation networks and the two target evaluation networks in the control module; z_t(m) denotes the m-th second state vector and z'_{t+1}(m) the m-th target state vector; V(z'_{t+1}(m)) is the state value corresponding to the target state vector (the state value is a basic concept in reinforcement learning and refers to the expected reward the agent can obtain in the current state), and min_{i=1,2} Q_{θ'_i}(z'_{t+1}(m), a'(m)) is the smaller of the first target state-action value and the second target state-action value; α is a temperature coefficient factor (its initial value is set to 0.1 and it is adjusted dynamically along with the policy network during optimization; see "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", Tuomas Haarnoja et al., published in the proceedings of the International Conference on Machine Learning (ICML) 2018); γ is a reward discount factor (γ is typically set to 0.99).
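As an illustration of the update in step 4.12.6.8, the following PyTorch sketch computes the bootstrapped target and the evaluation-network loss in the style of equation (3). It is a minimal sketch: critic_1, critic_2, tar_critic_1, tar_critic_2 and policy are assumed to be callable network objects with the interfaces shown here, and these names are introduced purely for illustration, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic_1, critic_2, tar_critic_1, tar_critic_2, policy,
                z, z_target, action, reward, alpha, gamma=0.99):
    """Hedged sketch of an equation-(3)-style update for the two evaluation networks."""
    with torch.no_grad():
        # The policy proposes a control instruction for the target state vector
        # and reports its log-probability.
        next_action, next_log_prob = policy.sample(z_target)
        # Keep the smaller of the two target state-action values (over-estimation guard).
        target_q = torch.min(tar_critic_1(z_target, next_action),
                             tar_critic_2(z_target, next_action))
        # Maximum-entropy state value of the target state vector.
        target_v = target_q - alpha * next_log_prob
        # Bootstrapped target: reward plus discounted state value.
        y = reward + gamma * target_v

    # Mean-squared error between each evaluation network's estimate and the common target.
    loss_1 = F.mse_loss(critic_1(z, action), y)
    loss_2 = F.mse_loss(critic_2(z, action), y)
    return loss_1 + loss_2
```

In practice the resulting loss would be back-propagated only into the two evaluation networks, with the target networks held fixed, as the no-grad block above suggests.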
4.12.6.9 The control module uses the gradient descent method to minimize the loss function shown in equation (4), updating in the reverse direction of the gradient to optimize the policy network:

L(φ) = E[ α·log π_φ(a(m) | z_t(m)) − min_{i=1,2} Q_{θ_i}(z_t(m), a(m)) ]    (4)

Wherein: π_φ(a(m) | z_t(m)) is the distribution with which the policy network outputs the control instruction a(m) given the m-th second state vector z_t(m), and min_{i=1,2} Q_{θ_i}(z_t(m), a(m)) is the smaller of the second state-action value and the fourth state-action value.
4.12.6.10 The control module uses the gradient descent method to minimize the loss function shown in equation (5), updating in the reverse direction of the gradient to optimize the temperature coefficient factor:

L(α) = E[ −α·log π_φ(a(m) | z_t(m)) − α·H_target ]    (5)

Wherein: H_target is the target entropy of the policy network, set to the negative of the dimension of the agent control instruction a(m).
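The following sketch illustrates the policy and temperature updates of steps 4.12.6.9 and 4.12.6.10 in the style of equations (4) and (5). The network handles and the log_alpha parameterization of the temperature are assumptions introduced for illustration; target_entropy would be set to the negative of the control-instruction dimension.

```python
import torch

def policy_and_temperature_loss(critic_1, critic_2, policy, log_alpha, z, target_entropy):
    """Hedged sketch of equation-(4)/(5)-style updates for the policy network and alpha."""
    action, log_prob = policy.sample(z)                       # reparameterized re-sample
    q = torch.min(critic_1(z, action), critic_2(z, action))   # smaller of the two state-action values
    alpha = log_alpha.exp()

    # Equation (4): raise the state-action value while keeping the policy stochastic.
    policy_loss = (alpha.detach() * log_prob - q).mean()

    # Equation (5): tune the temperature so the policy entropy tracks the target entropy.
    alpha_loss = (-alpha * (log_prob.detach() + target_entropy)).mean()
    return policy_loss, alpha_loss

# e.g. target_entropy = -action_dim  (negative of the control-instruction dimension)
```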
4.12.6.11 If m < M, let m = m + 1 and go to step 4.12.6.3; otherwise, go to step 4.12.7.
4.12.7 Judge whether t % F equals zero. If so, the parameters of the target encoder network and the target evaluation networks in the agent control system based on deep reinforcement learning and conditional entropy bottleneck need to be updated; go to step 4.12.8. Otherwise, go to step 4.12.9.
4.12.8 an agent control system based on deep reinforcement learning and conditional entropy bottlenecks uses exponential moving average (a common method for neural network parameter updating) to update parameters of a target encoder network and parameters of two target evaluation networks according to equations (6) and (7).
ζ' ← τ_p·ζ + (1 − τ_p)·ζ'    (6)

θ'_i ← τ_Q·θ_i + (1 − τ_Q)·θ'_i,  i = 1, 2    (7)

Wherein: τ_p and τ_Q are the hyper-parameters for updating the target encoder network and the target evaluation networks (τ_p is generally set to 0.05 and τ_Q to 0.01), ζ is the parameter of the encoder network, ζ' is the parameter of the target encoder network, θ_i is the parameter of the i-th evaluation network, and θ'_i is the parameter of the i-th target evaluation network.
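A minimal sketch of the exponential-moving-average update of step 4.12.8, assuming the online and target networks are PyTorch modules with matching parameter lists; the function name and example calls are illustrative only.

```python
import torch

@torch.no_grad()
def soft_update(online_net, target_net, tau):
    """Exponential-moving-average update in the spirit of equations (6) and (7)."""
    for p, p_target in zip(online_net.parameters(), target_net.parameters()):
        # p_target <- tau * p + (1 - tau) * p_target
        p_target.mul_(1.0 - tau).add_(tau * p)

# e.g. soft_update(encoder, target_encoder, tau=0.05)   # tau_p
#      soft_update(critic_1, tar_critic_1, tau=0.01)    # tau_Q
#      soft_update(critic_2, tar_critic_2, tau=0.01)
```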
4.12.9 Let k = k + 1. If k is equal to the total number of optimizations K, the optimization is complete; go to step 4.13. Otherwise, go to step 4.12.2.1.
4.13 The agent control system based on deep reinforcement learning and conditional entropy bottleneck judges whether t is equal to the maximum number of interactions T. If so, training is finished and step 4.14 is performed; otherwise, go to step 4.5.1.
4.14 The agent control system based on deep reinforcement learning and conditional entropy bottleneck saves the network parameters of the feature extraction module and the control module optimized in step 4.12 into a pt-format file (a pt-format file can be generated directly by the deep learning framework PyTorch).
Fifthly, the feature extraction module and the control module load the pt-format file obtained in step 4.14 and initialize their parameters with the parameters in the pt-format file, obtaining the trained agent control system based on deep reinforcement learning and conditional entropy bottleneck.
Sixthly, the trained agent control system based on deep reinforcement learning and conditional entropy bottleneck is deployed on the agent in the real environment. In the agent control system based on deep reinforcement learning and conditional entropy bottleneck, all networks other than the encoder network in the feature extraction module are involved only in optimizing the encoder network, and all networks other than the policy network in the control module are involved only in optimizing the policy network. Therefore, after the control system is deployed on the agent, only the encoder network in the feature extraction module and the policy network in the control module are relevant to the agent's actions.
Seventhly, the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck assists the intelligent agent to complete the image continuous control task, and the method comprises the following steps:
7.1 Initialize the number of actions t = 0 of the agent and set the maximum number of actions T_0 of the agent in the real environment (T_0 is a positive integer, typically 1000).
7.2 the perception module obtains the image observation of the real environment and sends the image observation to the feature extraction module; the characteristic extraction module receives image observation, an Encoder _1 in an Encoder network encodes the image observation to obtain a first state vector corresponding to the image observation, and the first state vector is sent to the control module; the control module receives the first state vector, the strategy network maps the first state vector into a control instruction, and the control instruction is sent to the action module; the action module receives the control instruction and executes the control instruction in the real environment, and the method comprises the following steps:
7.2.1 The perception module obtains the image observation s_t of the real environment at the agent's t-th action and sends s_t to the feature extraction module.
7.2.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t according to the method described in step 4.5.2 and sends z_t to the control module.
7.2.3 The control module receives z_t; the policy network maps z_t to the control instruction a_t of the agent's t-th action according to the method described in step 4.5.3 and sends a_t to the action module.
7.2.4 The action module receives the control instruction a_t and executes a_t in the real environment.
7.3 Let t = t + 1. If t is equal to the maximum number of actions T_0, go to the eighth step; otherwise, go to step 7.2.1.
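A minimal sketch of the seventh-step deployment loop follows, assuming a gym-style environment interface (reset/step) and assuming the policy object exposes a sample method returning a control instruction; only Encoder_1 and the policy network are used, as described in the sixth step.

```python
import torch

@torch.no_grad()
def run_episode(env, encoder_1, policy, max_steps=1000):
    """Hedged sketch of the deployment control loop: observe, encode with Encoder_1, act."""
    obs = env.reset()                                   # image observation from the perception module
    for _ in range(max_steps):                          # at most T_0 actions
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        z = encoder_1(obs_t)                            # first state vector z_t
        action, _ = policy.sample(z)                    # control instruction a_t
        obs, reward, done, info = env.step(action.squeeze(0).cpu().numpy())
        if done:
            break
```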
And eighthly, ending.
The invention can achieve the following technical effects:
1. the invention provides an intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck, and an intelligent agent can realize better control in an image continuous control task by using the intelligent agent control method. The invention provides reference for the design of the control scheme when the intelligent agent is deployed in the real world, and has strong practical application value.
2. According to the method, perception and control in deep reinforcement learning are decoupled, a target function of a feature extraction module is designed on the basis of conditional entropy bottleneck in the second step, a loss function corresponding to the target function is obtained through a variational reasoning technology, and by optimizing the loss function, an intelligent agent can ignore visual information irrelevant to a task when observing and coding images in an image continuous control task, and a robust state vector is obtained. In the fourth step of the invention, the control module is trained on the basis of the robust state vector obtained by the feature extraction module, so that the complexity of the control task is reduced, and the accuracy of the intelligent agent control strategy is improved.
3. In the invention, after the trained agent control system based on deep reinforcement learning and conditional entropy bottleneck is deployed on the agent, the agent acts in the seventh step based only on the encoder network in the feature extraction module and the policy network in the control module; the decision of the control instruction is therefore simple and meets the real-time requirement of agent control.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a schematic diagram of the logic structure of the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck;
FIG. 3 is a schematic diagram of the image continuous control task simulation environment constructed based on a real environment in the third step of the invention: the cheetah robot running scene. The cheetah robot observes the environment in the form of images through the perception module, and the task goal is that the cheetah robot must run quickly in the scene;
FIG. 4 is a schematic comparison of the training results of the present invention and of DBC, a deep reinforcement learning control method in the background art proposed by Amy Zhang et al. in "Learning Invariant Representations for Reinforcement Learning without Reconstruction", published in the proceedings of the International Conference on Learning Representations (ICLR) 2021;
FIG. 5 is a schematic diagram of the cheetah robot acting according to the control instructions produced by the control module of the present invention, in the cheetah robot running scene constructed based on the real environment shown in FIG. 3;
Detailed Description
FIG. 1 is a general flow diagram of the present invention; as shown in fig. 1, the present invention comprises the steps of:
firstly, constructing an intelligent agent control system based on deep reinforcement learning and conditional entropy bottlenecks. The system is shown in fig. 2 and comprises a sensing module, an action module, a storage module, a data expansion module, a feature extraction module and a control module.
The sensing module is an image sensor (such as a depth camera) and is connected with the feature extraction module and the storage module. The sensing module acquires an image observation (RGB image) containing the state of the intelligent agent (information of the intelligent agent) and the state of the environment (information except the intelligent agent) from the image continuous control task environment, and sends the image observation to the feature extraction module and the storage module.
The action module is an actuator (such as an engine, a steering gear and the like) of an intelligent control instruction, is connected with the control module, receives the control instruction from the control module, and acts in the image continuous control task environment according to the control instruction.
The storage module is a memory with more than 1 GB of available space. It is connected with the perception module, the control module and the data expansion module; it receives image observations from the perception module, receives control instructions from the control module, receives rewards from the image continuous control task environment, and combines them into trajectory data (trajectory data for short) of the interaction between the agent and the image continuous control task environment. Trajectory data is stored in the form of quadruples (s_t, a_t, r_t, s_{t+1}), wherein: s_t is the image observation received from the perception module at the agent's t-th interaction with the image continuous control task environment, a_t is the control instruction from the control module executed at the t-th interaction, r_t is the reward value fed back by the environment for the control instruction a_t at the t-th interaction, and s_{t+1} is the image observation received from the perception module after the environment state has changed due to the t-th interaction (also called the image observation at the agent's (t+1)-th interaction with the image continuous control task environment).
The data expansion module is connected with the storage module, the feature extraction module and the control module. It randomly selects from the storage module the trajectory data τ = (s_t, a_t, r_t, s_{t+1}) required for training the agent control system based on deep reinforcement learning and conditional entropy bottleneck, performs N rounds of data expansion on s_t and s_{t+1} in τ to obtain the data-expanded trajectory data τ^N = (s_t^{1:N}, a_t, r_t, s_{t+1}^{1:N}), where s_t^j and s_{t+1}^j (j ∈ [1, N], j being the index of the image observations after the N rounds of data expansion) are the j-th expanded image observations, and sends τ^N to the feature extraction module and the control module.
The feature extraction module is connected with the sensing module, the data expansion module and the control module. The characteristic extraction module consists of a coder network, a target coder network, a characteristic fusion network, a single-view predictor network and a multi-view predictor network.
The encoder network consists of a first encoder network Encoder_1 and a second encoder network Encoder_2, and is connected with the perception module, the data expansion module, the control module, the feature fusion network and the single-view predictor network. Encoder_1 consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer, and is connected with the perception module, the data expansion module, the control module and Encoder_2; Encoder_2 consists of 3 fully connected layers and is connected with Encoder_1, the feature fusion network and the single-view predictor network. When the agent interacts with the image continuous control task environment, Encoder_1 receives s_t from the perception module; the first, second, third and fourth convolutional layers of Encoder_1 sequentially perform convolution operations on s_t with 3 × 3 convolution kernels, and the result of the four convolution operations is sent to the fully connected layer of Encoder_1; the fully connected layer of Encoder_1 performs a full-connection operation on the result received from the fourth convolutional layer to obtain the state vector corresponding to s_t, and sends the fully connected state vector to the regularization layer of Encoder_1; the regularization layer of Encoder_1 performs a regularization operation on the fully connected state vector received from the fully connected layer of Encoder_1 to obtain a regularized state vector, which is taken as the first state vector z_t, and z_t is sent to the control module. When the agent control system based on deep reinforcement learning and conditional entropy bottleneck is being trained, Encoder_1 receives the data-expanded trajectory data τ^N from the data expansion module; the first, second, third and fourth convolutional layers of Encoder_1 sequentially perform convolution operations with 3 × 3 convolution kernels on the N expanded image observations s_t^{1:N} in τ^N, and the results are sent to the fully connected layer of Encoder_1; the fully connected layer of Encoder_1 performs a full-connection operation on the results received from the fourth convolutional layer to obtain the N state vectors corresponding to s_t^{1:N}, and sends the N fully connected state vectors to the regularization layer of Encoder_1; the regularization layer of Encoder_1 performs a regularization operation on the N fully connected state vectors to obtain N regularized state vectors, which are taken as the second state vectors z_t^{1:N}; the first second state vector z_t^1 is sent to the control module, and z_t^{1:N} is sent to Encoder_2. The first, second and third fully connected layers of Encoder_2 sequentially perform full-connection operations on z_t^{1:N} received from Encoder_1 to obtain the means and variances of the Gaussian distributions corresponding to z_t^{1:N}, and N reparameterized state vectors are obtained by reparameterizing the means and variances (the reparameterization operation is from "Auto-Encoding Variational Bayes", Diederik P. Kingma and Max Welling, published in the proceedings of the International Conference on Learning Representations (ICLR) 2014); the N reparameterized state vectors are sent to the feature fusion network and the single-view predictor network.
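A hedged PyTorch sketch of the encoder network described above: Encoder_1 with four 3 × 3 convolutional layers, one fully connected layer and one regularization layer (LayerNorm is assumed here), and Encoder_2 with three fully connected layers producing a Gaussian mean and variance followed by the reparameterization operation. The channel counts, strides, hidden width and feature dimension are illustrative assumptions, not values specified in the patent.

```python
import torch
import torch.nn as nn

class Encoder1(nn.Module):
    """Sketch of Encoder_1: four 3x3 convolutions, one fully connected layer, one regularization layer."""
    def __init__(self, in_channels=3, feature_dim=50):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        self.fc = nn.LazyLinear(feature_dim)   # fully connected layer
        self.ln = nn.LayerNorm(feature_dim)    # regularization layer (assumed to be LayerNorm)

    def forward(self, obs):
        h = self.convs(obs / 255.0).flatten(start_dim=1)  # scale pixels, assuming 8-bit images
        return self.ln(self.fc(h))                        # state vector

class Encoder2(nn.Module):
    """Sketch of Encoder_2: three fully connected layers producing a Gaussian mean and log-variance,
    followed by the reparameterization operation."""
    def __init__(self, feature_dim=50, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * feature_dim))

    def forward(self, z):
        mu, log_var = self.net(z).chunk(2, dim=-1)
        std = (0.5 * log_var).exp()
        return mu + std * torch.randn_like(std)   # reparameterized state vector
```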
The target encoder network is connected with the data expansion module, the control module and the feature fusion network, and consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer. When training the agent control system based on deep reinforcement learning and conditional entropy bottleneck, overly fast updates of the encoder network parameters make the training process oscillate and become unstable, so a target encoder network is introduced into the agent control system based on deep reinforcement learning and conditional entropy bottleneck to improve the stability of the training process. The target encoder network receives τ^N from the data expansion module; its first, second, third and fourth convolutional layers sequentially perform convolution operations with 3 × 3 convolution kernels on the N expanded image observations s_{t+1}^{1:N} in τ^N, and the results are sent to the fully connected layer; the fully connected layer performs a full-connection operation on the results received from the fourth convolutional layer to obtain the N target state vectors corresponding to s_{t+1}^{1:N}, and sends the N fully connected target state vectors to the regularization layer; the regularization layer performs a regularization operation on the N fully connected target state vectors to obtain N regularized target state vectors, which are taken as the target state vectors; the first target state vector is sent to the control module, and the N target state vectors are sent to the feature fusion network.
The feature fusion network is connected with the encoder network, the target encoder network and the multi-view predictor network, and consists of a first fusion network Feature_1 and a second fusion network Feature_2, each composed of 3 fully connected layers. Feature_1 is connected with the encoder network, the target encoder network and Feature_2; it receives the N reparameterized state vectors from the encoder network and the N target state vectors from the target encoder network. The first, second and third fully connected layers of Feature_1 sequentially perform full-connection operations on the N reparameterized state vectors, splicing them into the state fusion vector, which is sent to Feature_2; the first, second and third fully connected layers of Feature_1 likewise sequentially perform full-connection operations on the N target state vectors, splicing them into the target state fusion vector. Feature_2 is connected with Feature_1 and the multi-view predictor network; it receives the state fusion vector from Feature_1, and its first, second and third fully connected layers sequentially perform a reparameterization operation on the state fusion vector to obtain the reparameterized state fusion vector, which is sent to the multi-view predictor network.
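A minimal sketch of the fusion step performed by Feature_1: the N view-specific vectors are concatenated and passed through three fully connected layers. The hidden width, feature dimension and number of views are illustrative assumptions; Feature_2, which reparameterizes the fused vector, could reuse a Gaussian head of the kind shown for Encoder_2 above.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of Feature_1: three fully connected layers that fuse N view-specific vectors."""
    def __init__(self, feature_dim=50, n_views=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_views * feature_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feature_dim))

    def forward(self, view_vectors):
        # view_vectors: list of N tensors, one per augmented view; concatenate, then fuse.
        return self.net(torch.cat(view_vectors, dim=-1))
```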
The single-view predictor network is connected with the data expansion module and the encoder network, and consists of 3 fully connected layers. The single-view predictor network receives the data-expanded trajectory data τ^N from the data expansion module and the N reparameterized state vectors from the encoder network. Its first, second and third fully connected layers sequentially perform full-connection operations on the first spliced vector formed by each reparameterized state vector and the control instruction a_t in τ^N, mapping it to a predicted target state vector and a first predicted reward value; the j-th predicted target state vector and the j-th first predicted reward value correspond to the j-th augmented view. This realizes prediction of the transition dynamics equation and the reward function equation (both are basic concepts in reinforcement learning; see the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto).
The multi-view predictor network is connected with the data expansion module and the feature fusion network, and consists of 3 fully connected layers. The multi-view predictor network receives the data-expanded trajectory data τ^N from the data expansion module and the reparameterized state fusion vector from the feature fusion network. The reparameterized state fusion vector and the control instruction a_t in τ^N form the second spliced vector; the first, second and third fully connected layers of the multi-view predictor network sequentially perform full-connection operations on the second spliced vector, mapping it to a predicted target state fusion vector and a second predicted reward value, realizing prediction of the transition dynamics equation and the reward function equation.
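A hedged sketch of a predictor network of the kind described above: three fully connected layers map the splice of a state vector and a control instruction to a predicted next state vector and a predicted reward, i.e. a learned transition dynamics and reward model. The dimensions are illustrative assumptions (the 6-dimensional action matches the cheetah embodiment but is not fixed by the design).

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Sketch of a single- or multi-view predictor: three fully connected layers mapping a
    [state vector, control instruction] splice to a predicted next state vector and reward."""
    def __init__(self, feature_dim=50, action_dim=6, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feature_dim + 1))

    def forward(self, z, action):
        out = self.net(torch.cat([z, action], dim=-1))
        pred_next_z, pred_reward = out[..., :-1], out[..., -1:]
        return pred_next_z, pred_reward
```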
The control module is connected with the feature extraction module, the data expansion module and the action module and consists of two evaluation networks (a first evaluation network Critic _1 and a second evaluation network Critic _ 2), two target evaluation networks (a first target evaluation network TarCritic _1 and a second target evaluation network TarCritic _ 2) and a strategy network. The purpose of designing two evaluation networks and two target evaluation networks in the control module is to prevent the problem of over-estimation when a single evaluation network or a single target evaluation network evaluates the advantages and disadvantages of the intelligent agent control command.
Critic_1 and Critic_2 are connected with the feature extraction module, the data expansion module and the policy network, and each consists of three fully connected layers. They receive the first second state vector from the feature extraction module, receive the data-expanded trajectory data τ^N from the data expansion module, and receive the control instruction a from the policy network, evaluating the quality of the control instruction a_t in τ^N and of the control instruction a from the policy network. The first, second and third fully connected layers of Critic_1 perform full-connection operations on the third spliced vector formed by the first second state vector and a_t, mapping it to the first state-action value (the state-action value is a basic concept in reinforcement learning and refers to the expected reward the agent can obtain after executing the selected control instruction in the current state); the first, second and third fully connected layers of Critic_1 perform full-connection operations on the fourth spliced vector formed by the first second state vector and a, mapping it to the second state-action value. Likewise, the first, second and third fully connected layers of Critic_2 perform full-connection operations on the third spliced vector formed by the first second state vector and a_t, mapping it to the third state-action value, and on the fourth spliced vector formed by the first second state vector and a, mapping it to the fourth state-action value.

TarCritic_1 and TarCritic_2 are both connected with the feature extraction module and the policy network, and each consists of three fully connected layers. They receive the first target state vector from the feature extraction module and the control instruction a' from the policy network, and evaluate the quality of a'. The first, second and third fully connected layers of TarCritic_1 perform full-connection operations on the target spliced vector formed by the first target state vector and a', mapping it to the first target state-action value; the first, second and third fully connected layers of TarCritic_2 perform full-connection operations on the target spliced vector formed by the first target state vector and a', mapping it to the second target state-action value.
The policy network is connected with the feature extraction module, the action module, the storage module, Critic_1, Critic_2, TarCritic_1 and TarCritic_2, and consists of three fully connected layers. When the agent interacts with the image continuous control task environment, it receives the first state vector z_t from the feature extraction module; the first, second and third fully connected layers of the policy network sequentially perform full-connection operations on z_t, mapping z_t to the control instruction a_t, and a_t is sent to the action module and the storage module. When training the agent control system based on deep reinforcement learning and conditional entropy bottleneck, the policy network receives the first second state vector and the first target state vector from the feature extraction module; its first, second and third fully connected layers sequentially perform full-connection operations on the first second state vector, mapping it to the control instruction a, and a is sent to Critic_1 and Critic_2; its first, second and third fully connected layers likewise perform full-connection operations on the first target state vector, mapping it to the control instruction a', and a' is sent to TarCritic_1 and TarCritic_2.
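A hedged sketch of a policy network of the kind described above: three fully connected layers map a state vector to the mean and log standard deviation of a Gaussian, from which a control instruction in [-1, 1] is drawn with the reparameterization trick; the Gaussian parameterization and tanh squashing follow the soft actor-critic formulation referenced later in step 4.12.6.8 and are assumptions here, as are the layer widths and clamping range.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Sketch of the policy network: three fully connected layers producing a
    tanh-squashed Gaussian control instruction and its log-probability."""
    def __init__(self, feature_dim=50, action_dim=6, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * action_dim))

    def sample(self, z):
        mu, log_std = self.net(z).chunk(2, dim=-1)
        std = log_std.clamp(-10, 2).exp()
        dist = torch.distributions.Normal(mu, std)
        u = dist.rsample()                    # reparameterized sample
        a = torch.tanh(u)                     # keep the control instruction in [-1, 1]
        # Change-of-variables correction for the tanh squashing.
        log_prob = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1, keepdim=True)
        return a, log_prob
```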
Secondly, construct the objective function of the feature extraction module based on the conditional entropy bottleneck, and obtain the optimization loss function of the feature extraction module through variational inference (see "Deep Variational Information Bottleneck", Alexander A. Alemi et al., published in the proceedings of the International Conference on Learning Representations (ICLR) 2017). The method comprises the following steps:
2.1 In order to learn the state vector corresponding to the image observation, the objective function of the feature extraction module shown in equation (1) is designed based on the conditional entropy bottleneck. The conditional entropy bottleneck (see "The Conditional Entropy Bottleneck", Ian Fischer, published in the journal Entropy, 2020) is an information-theoretic method for extracting a feature Z of given data X in order to predict a label Y; it requires that the information retained in Z be maximally correlated with Y while discarding the information in X that is irrelevant to Y.

The objective Object of the feature extraction module in equation (1) is built from conditional mutual information terms: for each augmented view j it rewards the mutual information between the reparameterized state vector and the pair formed by the target state vector and the reward r_t, penalized by β_j times the conditional mutual information between the reparameterized state vector and the j-th augmented image observation given that pair; an analogous pair of terms couples the reparameterized state fusion vector with the target state fusion vector and r_t.

Wherein: Object denotes the objective of the feature extraction module; the j-th augmented image observation at the t-th interaction is obtained by the data expansion module performing the j-th data expansion on the image observation at the t-th interaction, and similarly for the (t+1)-th interaction; the reparameterized state vector is obtained by inputting the j-th augmented observation of s_t into the encoder network, and the target state vector is obtained by inputting the j-th augmented observation of s_{t+1} into the target encoder network; the reparameterized state fusion vector is obtained by inputting the N reparameterized state vectors into the feature fusion network, and the target state fusion vector is obtained by inputting the N target state vectors into the feature fusion network; β_j is a regularization factor, with a suggested value of 1e-4 to 1e-2; the remaining terms in equation (1) are conditional mutual information terms.
2.2 Applying variational inference to equation (1) yields the optimization loss function of the feature extraction module shown in equation (2).

For each of the M pieces of trajectory data randomly selected from the storage module by the data expansion module, the loss in equation (2) accumulates, over the N augmented views: (i) a divergence term between the Gaussian distribution produced by Encoder_2 in the encoder network (whose mean and variance are computed by Encoder_2) and a variational distribution, weighted by the regularization factors; (ii) the cross-entropy loss between the j-th predicted target state vector output by the single-view predictor network (given the j-th reparameterized state vector and a_t) and the j-th target state vector, together with the cross-entropy loss between the j-th first predicted reward value and r_t; and (iii) the cross-entropy loss between the predicted target state fusion vector output by the multi-view predictor network (given the reparameterized state fusion vector and a_t) and the target state fusion vector, together with the cross-entropy loss between the second predicted reward value and r_t. The expectations are taken over the Gaussian noise variables ξ_1, ξ_2, …, ξ_N and ξ_{1:N} used in the reparameterization of the N reparameterized state vectors and of the reparameterized state fusion vector, where E_{ξ_j} denotes the expectation with respect to ξ_j and E_{ξ_1, …, ξ_N, ξ_{1:N}} denotes the expectation with respect to ξ_1, ξ_2, …, ξ_N and ξ_{1:N}. M is the number of pieces of trajectory data randomly selected from the storage module by the data expansion module.
Thirdly, construct the image continuous control task simulation environment in the DeepMind Control Suite (DMControl) simulation environment (DMControl is backed by the MuJoCo, i.e. Multi-Joint dynamics with Contact, physics engine and is widely used in robotics research), preparing for training of the agent control system based on deep reinforcement learning and conditional entropy bottleneck. The method comprises the following steps:
3.1 Install the DMControl simulation environment (the physics engine MuJoCo is required to be version mujoco200) on any computer equipped with Ubuntu (version 16.04 or above) and the PyTorch deep learning framework, construct the agent simulation model and the image continuous control task simulation environment, and set the task goal of the agent in the image continuous control task.
3.2 In the constructed image continuous control task simulation environment, set the scale of the image observations with which the agent perceives its own state and the environment state to 100 × 100, set the control instructions executed by the agent to continuous vectors (such as joint rotation speeds, torques, etc.), and set the reward value fed back by the image continuous control task simulation environment after the agent executes a control instruction according to the task goal.
FIG. 3 shows an embodiment of the image continuous control task simulation environment constructed in the third step: a cheetah robot running scene constructed based on a real environment, shown as an environment rendering. In this embodiment, the cheetah robot uses the perception module to obtain image observations with a scale of 100 × 100, and its control instruction is a 6-dimensional continuous vector. The cheetah robot acts in the scene according to the control instructions, and the task goal is that the cheetah robot must run quickly in the scene. The reward value r for the cheetah robot acting in the scene is linearly related to its moving speed v (the maximum moving speed is 10 m/s), namely: r = max(0, min(v/10, 1)).
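The reward of this embodiment can be written directly as a small function; a sketch follows, with the speed argument assumed to be given in metres per second.

```python
def cheetah_reward(speed_m_per_s: float) -> float:
    """Reward used in the running scene: linear in forward speed, saturating at 10 m/s."""
    return max(0.0, min(speed_m_per_s / 10.0, 1.0))

# e.g. cheetah_reward(4.0) == 0.4 and cheetah_reward(12.0) == 1.0
```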
Fourthly, the intelligent agent trains an intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck in the image continuous control task simulation environment established in the third step, and the method comprises the following steps:
4.1 initializing network parameters of a feature extraction module and a control module in the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, wherein the parameters comprise a weight matrix and a bias vector of a full connection layer, a convolution kernel of a convolution layer and a weight matrix and a bias vector of a regularization layer, and the parameters are generated by using an orthogonal initialization (a parameter initialization method of a neural network), wherein non-zero parameters are from normal distribution with the average value of 0 and the standard deviation of 1.
4.2 Initialize the storage module in the agent control system based on deep reinforcement learning and conditional entropy bottleneck: set its size so that it can store A (A ≥ 10^5) pieces of trajectory data generated when the agent interacts with the image continuous control task simulation environment, forming a buffer queue, and empty the buffer queue.
4.3 Initialize the number of interactions t = 0 between the agent and the image continuous control task simulation environment constructed in the third step; set the maximum number of interactions T between the agent and the image continuous control task simulation environment (T is a positive integer and T ≥ 5A); set the maximum number of interactions E per round between the agent and the image continuous control task simulation environment (E is a positive integer, generally 1000); and set the update frequency F of the target encoder network and the target evaluation networks in the agent control system based on deep reinforcement learning and conditional entropy bottleneck (F is a positive integer, generally 2).
And 4.4, randomly setting the initial state of the image continuous control task simulation environment constructed in the third step and the initial state of the intelligent agent simulation model.
4.5, the perception module acquires image observation when the intelligent agent interacts with the image continuous control task simulation environment, and sends the image observation to the feature extraction module and the storage module; the characteristic extraction module receives image observation, an Encoder _1 in an Encoder network encodes the image observation to obtain a first state vector corresponding to the image observation, and the first state vector is sent to the control module; the control module receives the first state vector, the policy network maps the first state vector into a control instruction, and the control instruction is sent to the action module and the storage module, and the method comprises the following steps:
4.5.1 The perception module obtains the image observation s_t at the agent's t-th interaction with the image continuous control task simulation environment, and sends s_t to the feature extraction module and the storage module.
4.5.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t and sends z_t to the control module.
4.5.3 The control module receives the first state vector z_t; the policy network maps z_t to the control instruction a_t to be executed at the agent's t-th interaction with the image continuous control task simulation environment, and sends a_t to the action module and the storage module.
4.6 The action module receives the control instruction a_t and executes a_t in the image continuous control task simulation environment.
4.7 The image continuous control task simulation environment returns, according to the reward designed in step 3.2, the reward value r_t obtained at the agent's t-th interaction with the environment, and sends r_t to the storage module.
4.8 The state of the image continuous control task simulation environment changes because the agent executes the control instruction a_t; the perception module obtains the image observation s_{t+1} corresponding to the changed environment state and sends s_{t+1} to the storage module.
4.9 The storage module receives s_t and s_{t+1} from the perception module, receives a_t from the control module, receives r_t from the image continuous control task simulation environment, combines them into the quadruple trajectory data (s_t, a_t, r_t, s_{t+1}), and stores it in the buffer queue. The method comprises the following steps:
4.9.1 The storage module judges whether there are already A pieces of trajectory data in the buffer queue; if so, go to step 4.9.2, otherwise go to step 4.9.3.
4.9.2 The storage module removes the piece of trajectory data at the head of the buffer queue according to the first-in-first-out principle.
4.9.3 The storage module combines s_t, s_{t+1}, a_t and r_t into the quadruple trajectory data (s_t, a_t, r_t, s_{t+1}) and stores it at the tail of the buffer queue.
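A minimal sketch of the storage module as a FIFO buffer of quadruples, assuming the capacity A and the sampling size M are passed in by the caller; the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Sketch of the storage module: a FIFO queue of (s_t, a_t, r_t, s_t+1) quadruples with capacity A."""
    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)   # the oldest trajectory data is dropped first

    def store(self, s_t, a_t, r_t, s_next):
        self.queue.append((s_t, a_t, r_t, s_next))

    def sample(self, m):
        # Randomly select M pieces of trajectory data, as in step 4.11.
        return random.sample(self.queue, m)

    def __len__(self):
        return len(self.queue)
```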
4.10 let t = t +1. If t% E =0, it is indicated that the number of times of interaction between the intelligent agent and the image continuous control task simulation environment in the round reaches E, the round of control task is ended, and a new round of control task is restarted in step 4.4; otherwise, the round control task is not finished, and the step 4.11 is executed to continue the round control task.
4.11 The data expansion module determines whether there are at least L (L is generally set to 1000) pieces of trajectory data in the buffer queue of the storage module. If there are, it randomly selects M (M is generally set to 512) pieces of trajectory data from the buffer queue of the storage module and forms them into the trajectory data set τ_M; let τ_M_m = (s_t(m), a_t(m), r_t(m), s_{t+1}(m)) denote the m-th (m ∈ [1, M]) piece of trajectory data in τ_M, and go to step 4.12 to optimize the agent control system based on deep reinforcement learning and conditional entropy bottleneck according to τ_M. If there are fewer than L pieces of trajectory data in the buffer queue, go to step 4.5.1.
4.12 The data expansion module uses the random cropping method in data augmentation (from the RAD method) to sequentially perform N rounds of data expansion on the image observations of each piece of trajectory data in τ_M, obtaining M pieces of data-expanded trajectory data, and sends the M pieces of data-expanded trajectory data to the feature extraction module and the control module; the feature extraction module and the control module receive the M pieces of data-expanded trajectory data and optimize the agent control system based on deep reinforcement learning and conditional entropy bottleneck. The method comprises the following steps:
4.12.1 Initialize the number of optimizations k = 0 and set the total number of optimizations K (typically K is set to 10).
4.12.2 The data expansion module sequentially performs N rounds of data expansion on the image observations of each piece of trajectory data in τ_M using the random cropping method in data augmentation, obtains the M pieces of data-expanded trajectory data, and sends them to the feature extraction module and the control module. The method comprises the following steps:
4.12.2.1 initializes track data index m =1.
4.12.2.2 initializes the number of data expansion j =0, and sets the total number of data expansion N (N is generally set to 2).
4.12.2.3 Using random cropping in data augmentation and referring to the settings in RAD, crop the m-th trajectory data τ_M_m = (s_t(m), a_t(m), r_t(m), s_{t+1}(m)) in τ_M: the image observation s_t(m) with scale 100 × 100 is cropped to an image observation s_t^j(m) with scale 84 × 84, and the image observation s_{t+1}(m) with scale 100 × 100 is cropped to an image observation s_{t+1}^j(m) with scale 84 × 84.
4.12.2.4 let j = j +1. If j is equal to the total number of data expansion times N, go to 4.12.2.5, otherwise go to 4.12.2.3.
4.12.2.5 The data expansion module replaces s_t(m) in τ_M_m with the N data-expanded image observations s_t^{1:N}(m) and replaces s_{t+1}(m) with the N data-expanded image observations s_{t+1}^{1:N}(m), obtaining the N-fold data-expanded trajectory data τ_M_m^N = (s_t^{1:N}(m), a_t(m), r_t(m), s_{t+1}^{1:N}(m)).
4.12.2.6 If m < M, let m = m + 1 and go to step 4.12.2.2; if m = M, the expansion of the M pieces of trajectory data is finished and the M pieces of data-expanded trajectory data τ_M^N = {τ_M_1^N, …, τ_M_M^N} are obtained; go to step 4.12.2.7.
4.12.2.7 The data expansion module sends τ_M^N to the feature extraction module and the control module.
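A minimal sketch of the RAD-style random cropping used by the data expansion module in step 4.12.2: each 100 × 100 image observation is cropped N times to 84 × 84 at random locations. NumPy is assumed for the array handling, and the channel-first layout is an assumption.

```python
import numpy as np

def random_crop(image: np.ndarray, out_size: int = 84) -> np.ndarray:
    """Crop a (C, 100, 100) image observation to (C, 84, 84) at a random location."""
    _, h, w = image.shape
    top = np.random.randint(0, h - out_size + 1)
    left = np.random.randint(0, w - out_size + 1)
    return image[:, top:top + out_size, left:left + out_size]

def expand(image: np.ndarray, n: int = 2):
    """Produce the N cropped views of one image observation (N = 2 in step 4.12.2.2)."""
    return [random_crop(image) for _ in range(n)]
```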
4.12.3 The feature extraction module receives τ_M^N from the data expansion module. For the M pieces of trajectory data in τ_M^N, it sequentially uses the gradient descent method (a common optimization method in machine learning) to minimize the feature extraction module loss function shown in equation (2), optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network in the feature extraction module. The method comprises the following steps:
4.12.3.1 The encoder network, the target encoder network, the single-view predictor network and the multi-view predictor network receive τ_M^N from the data expansion module.
4.12.3.2 initializes track data index m =1.
4.12.3.3 Encoder_1 in the encoder network encodes the N data-expanded image observations s_t^{1:N}(m) of the m-th trajectory data in τ_M^N into N second state vectors z_t^{1:N}(m), where z_t^j(m) is the second state vector corresponding to s_t^j(m), and sends z_t^{1:N}(m) to Encoder_2. Through its fully connected layers, Encoder_2 obtains the means and variances of the Gaussian distributions corresponding to z_t^{1:N}(m), one mean and one variance for each z_t^j(m). Encoder_2 performs the reparameterization operation on these means and variances to obtain N reparameterized state vectors, one for each z_t^j(m), and sends the N reparameterized state vectors to the feature fusion network and the single-view predictor network.
4.12.3.4 The target encoder network encodes the N data-expanded image observations s_{t+1}^{1:N}(m) of the m-th trajectory data in τ_M^N into N target state vectors, one for each s_{t+1}^j(m), and sends the N target state vectors to the feature fusion network.
4.12.3.5 The feature fusion network receives the N reparameterized state vectors from the encoder network. Feature_1 performs feature fusion on them to obtain the state fusion vector and sends it to Feature_2; Feature_2 performs the reparameterization operation on the state fusion vector to obtain the reparameterized state fusion vector and sends it to the multi-view predictor network.
4.12.3.6 The feature fusion network receives the N target state vectors from the target encoder network; Feature_1 performs feature fusion on them to obtain the target state fusion vector.
4.12.3.7 The single-view predictor network receives the N reparameterized state vectors from the encoder network and the m-th trajectory data from τ_M^N. For each reparameterized state vector, it performs transition dynamics equation prediction and reward function equation prediction on the first spliced vector formed by that vector and the control instruction a_t(m), obtaining the predicted target state vectors and the first predicted reward values.
4.12.3.8 The multi-view predictor network receives the reparameterized state fusion vector from the feature fusion network and the m-th trajectory data from τ_M^N. It performs transition dynamics equation prediction and reward function equation prediction on the second spliced vector formed by the reparameterized state fusion vector and the control instruction a_t(m), obtaining the predicted target state fusion vector and the second predicted reward value.
4.12.3.9 Using the gradient descent method, the feature extraction module takes the means and variances obtained in step 4.12.3.3, the target state vectors obtained in step 4.12.3.4, the target state fusion vector obtained in step 4.12.3.6, the predicted target state vectors and first predicted reward values obtained in step 4.12.3.7, the predicted target state fusion vector and second predicted reward value obtained in step 4.12.3.8, and the reward r_t(m), and minimizes the optimization loss function in equation (2) by updating in the reverse direction of the gradient, thereby optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network.
4.12.3.10 If m < M, let m = m + 1 and go to step 4.12.3.3; otherwise, go to step 4.12.4.
4.12.4 Encoder_1 of the encoder network in the feature extraction module encodes the first data-expanded image observations of the M pieces of trajectory data in τ_M^N into M second state vectors and sends the M second state vectors to the control module.
4.12.5 The target encoder network in the feature extraction module encodes the first data-expanded image observations of the M pieces of trajectory data in τ_M^N into M target state vectors and sends the M target state vectors to the control module.
4.12.6 The control module receives τ_M^N from the data expansion module and receives the M second state vectors and the M target state vectors from the feature extraction module. For the M pieces of trajectory data in τ_M^N, it uses the gradient descent method to minimize, in sequence, the loss functions shown in equation (3), equation (4) and equation (5), optimizing the evaluation networks and the policy network. The method is as follows:
4.12.6.1 The policy network receives the M second state vectors and the M target state vectors from the feature extraction module; Critic_1 and Critic_2 receive the M second state vectors from the feature extraction module and receive τ_M^N from the data expansion module; TarCritic_1 and TarCritic_2 receive the M target state vectors from the feature extraction module.
4.12.6.2 initializes track data index m =1.
4.12.6.3 The policy network obtains the m-th second state vector and the m-th target state vector and performs the following operations: it performs action mapping on the m-th second state vector to obtain the control instruction a(m) and sends a(m) to Critic_1 and Critic_2; it performs action mapping on the m-th target state vector to obtain the control instruction a'(m) and sends a'(m) to TarCritic_1 and TarCritic_2.
4.12.6.4 Critic_1 receives the control instruction a(m) from the policy network, obtains the m-th second state vector and the m-th trajectory data in τ_M^N, and performs the following operations: it performs state-action value estimation on the third spliced vector formed by the m-th second state vector and a_t(m), obtaining the first state-action value; it performs state-action value estimation on the m-th second state vector and a(m), obtaining the second state-action value.
4.12.6.5 Critic_2 receives the control instruction a(m) from the policy network, obtains the m-th second state vector and the m-th trajectory data in τ_M^N, and performs the following operations: it performs state-action value estimation on the third spliced vector formed by the m-th second state vector and a_t(m), obtaining the third state-action value; it performs state-action value estimation on the m-th second state vector and a(m), obtaining the fourth state-action value.
4.12.6.6 TarCritic_1 receives the control instruction a'(m) from the policy network, obtains the m-th target state vector, and performs state-action value estimation on the m-th target state vector and a'(m), obtaining the first target state-action value.
4.12.6.7TarCritic _2from policy networkReceiving control command a' (m), from
Figure BDA0003758485410000231
In order to acquire the mth target status vector->
Figure BDA0003758485410000232
Is paired and/or matched>
Figure BDA0003758485410000233
And a' (m) to obtain a second target state-action value
Figure BDA0003758485410000234
4.12.6.8 The control module minimizes the loss function shown in formula (3) by gradient descent, updating and optimizing Critic_1 and Critic_2 along the reverse direction of the gradient.

J_Q(θ_i) = ( Q_{θ_i}(z_t^1(m), a_t(m)) − ( r_t(m) + γ·V̂(z_{t+1}^1(m)) ) )²,  with  V̂(z_{t+1}^1(m)) = min_{i=1,2} Q̄_{θ̄_i}(z_{t+1}^1(m), a'(m)) − α·log π_φ(a'(m) | z_{t+1}^1(m))   (3)

Wherein: θ_i is the parameter of the ith evaluation network, θ̄_i is the parameter of the ith target evaluation network, φ is the parameter of the policy network, i = 1 or 2 indexes the two evaluation networks and the two target evaluation networks in the control module, V̂(z_{t+1}^1(m)) is the state value corresponding to the target state vector z_{t+1}^1(m) (the state value is a basic concept in reinforcement learning and refers to the expected return the agent can obtain from the current state), and min_{i=1,2} Q̄_{θ̄_i}(z_{t+1}^1(m), a'(m)) is the smaller of the first target state-action value Q̄_1(z_{t+1}^1(m), a'(m)) and the second target state-action value Q̄_2(z_{t+1}^1(m), a'(m)). α is the temperature coefficient factor (its initial value is set to 0.1 and it is adjusted dynamically together with the policy network during optimization; see the paper "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" published by Tuomas Haarnoja et al. in the proceedings of the International Conference on Machine Learning (ICML) 2018), and γ is the reward discount factor (γ is generally set to 0.99).
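A minimal sketch of one evaluation-network (critic) update implementing the reconstructed SAC-style form of formula (3) above; the function signature, the assumption that each critic takes a (state vector, action) pair, and the single shared optimizer are illustrative, not the patent's exact implementation:

```python
import torch
import torch.nn.functional as F

def update_critics(critic1, critic2, tar_critic1, tar_critic2, policy,
                   z_t, a_t, r_t, z_next, alpha, gamma, critic_optimizer):
    """One gradient step on the loss of formula (3) for both evaluation networks."""
    with torch.no_grad():
        a_next, log_prob_next = policy(z_next)               # a'(m) from the target state vector
        q_next = torch.min(tar_critic1(z_next, a_next),
                           tar_critic2(z_next, a_next))      # smaller target state-action value
        v_next = q_next - alpha * log_prob_next              # state value V̂(z_{t+1}(m))
        target = r_t + gamma * v_next                        # TD target
    loss = F.mse_loss(critic1(z_t, a_t), target) + F.mse_loss(critic2(z_t, a_t), target)
    critic_optimizer.zero_grad()
    loss.backward()                                          # reverse update of the gradient
    critic_optimizer.step()
    return loss.item()
```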
4.12.6.9 The control module minimizes the loss function shown in formula (4) by gradient descent, optimizing the policy network through the reverse update of the gradient.

J_π(φ) = α·log π_φ(a(m) | z_t^1(m)) − min_{i=1,2} Q_{θ_i}(z_t^1(m), a(m))   (4)

Wherein: π_φ(a(m) | z_t^1(m)) is the distribution from which the policy network outputs the control instruction a(m) given the second state vector z_t^1(m), and min_{i=1,2} Q_{θ_i}(z_t^1(m), a(m)) is the smaller of the second state-action value Q_1(z_t^1(m), a(m)) and the fourth state-action value Q_2(z_t^1(m), a(m)).
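A corresponding sketch of the policy update of formula (4), under the same illustrative interfaces as the critic sketch above:

```python
import torch

def update_policy(critic1, critic2, policy, z_t, alpha, policy_optimizer):
    """One gradient step on the loss of formula (4): entropy-regularized min-Q objective."""
    a_new, log_prob = policy(z_t)                    # a(m) and log π_φ(a(m) | z_t(m))
    q_min = torch.min(critic1(z_t, a_new), critic2(z_t, a_new))
    loss = (alpha * log_prob - q_min).mean()
    policy_optimizer.zero_grad()
    loss.backward()
    policy_optimizer.step()
    return loss.item()
```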
4.12.6.10 The control module minimizes the loss function shown in formula (5) by gradient descent, optimizing the temperature coefficient factor through the reverse update of the gradient.

J(α) = −α·log π_φ(a(m) | z_t^1(m)) − α·H̄   (5)

Wherein: H̄ is the target entropy of the policy network, set to the negative of the dimension of the agent control instruction a(m).
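A sketch of the temperature update of formula (5); optimizing log α instead of α directly is a common implementation choice and an assumption here, as is the log_alpha tensor name:

```python
import torch

def update_alpha(log_alpha, policy, z_t, target_entropy, alpha_optimizer):
    """One gradient step on the loss of formula (5); target_entropy = -action_dim."""
    with torch.no_grad():
        _, log_prob = policy(z_t)
    loss = (-log_alpha.exp() * (log_prob + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()   # current value of the temperature coefficient factor α
```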
4.12.6.11 If m < M, let m = m + 1 and go to step 4.12.6.3; otherwise, go to step 4.12.7.
4.12.7 Judge whether t % F equals zero. If so, update the parameters of the target encoder network and the target evaluation networks in the agent control system based on deep reinforcement learning and conditional entropy bottleneck and go to step 4.12.8; otherwise, go to step 4.12.9.
4.12.8 The agent control system based on deep reinforcement learning and conditional entropy bottleneck updates the parameters of the target encoder network and the parameters of the two target evaluation networks by exponential moving average (a common method for updating neural network parameters) according to formula (6) and formula (7).

ζ̄ ← τ_p·ζ + (1 − τ_p)·ζ̄   (6)
θ̄_i ← τ_Q·θ_i + (1 − τ_Q)·θ̄_i,  i = 1, 2   (7)

Wherein: τ_p and τ_Q are the hyper-parameters for updating the target encoder network and the target evaluation networks (τ_p is generally set to 0.05 and τ_Q to 0.01), ζ is the parameter of the encoder network, and ζ̄ is the parameter of the target encoder network.
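A sketch of the exponential-moving-average updates of formulas (6) and (7); the parameter-wise in-place update and the commented call sites are illustrative:

```python
import torch

@torch.no_grad()
def ema_update(online_net, target_net, tau):
    """Exponential moving average: target ← tau * online + (1 - tau) * target."""
    for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)

# ema_update(encoder, target_encoder, tau=0.05)   # formula (6), τ_p
# ema_update(critic1, tar_critic1, tau=0.01)      # formula (7), τ_Q
# ema_update(critic2, tar_critic2, tau=0.01)
```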
4.12.9 Let k = k + 1. If k equals the total number of optimizations K, the optimization is complete and the process goes to step 4.13; otherwise, go to step 4.12.2.1.
4.13 The agent control system based on deep reinforcement learning and conditional entropy bottleneck judges whether t equals the maximum number of interactions T. If so, the training is finished and the process goes to step 4.14; otherwise, go to step 4.5.1.
4.14 The agent control system based on deep reinforcement learning and conditional entropy bottleneck saves the network parameters of the feature extraction module and the control module optimized in step 4.12 into a pt-format file (a pt-format file can be generated directly by the deep learning framework PyTorch).
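A sketch of saving and later reloading the optimized parameters as a pt-format file with PyTorch; the module names and the file name are illustrative:

```python
import torch
import torch.nn as nn

def save_checkpoint(modules: dict[str, nn.Module], path: str = "agent_control_system.pt"):
    """Save the network parameters of the feature extraction and control modules (step 4.14)."""
    torch.save({name: m.state_dict() for name, m in modules.items()}, path)

def load_checkpoint(modules: dict[str, nn.Module], path: str = "agent_control_system.pt"):
    """Initialize the modules from the pt-format file (fifth step)."""
    state = torch.load(path, map_location="cpu")
    for name, m in modules.items():
        m.load_state_dict(state[name])
```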
Fig. 4 is a schematic diagram of the training results for the cheetah-robot running scenario embodiment shown in Fig. 3. The figure shows how the sum of the reward values fed back by the image continuous control task simulation environment for the cheetah robot's control instructions within one round of the control task (represented by the average return, obtained by repeatedly running 30 rounds of the task and averaging the per-round reward sums) changes as the number of interactions increases. The abscissa is the number of interactions between the cheetah robot and the environment, and the ordinate is the average return; the larger the average return, the better the cheetah robot's control strategy. The invention is compared with the deep reinforcement learning control method DBC described in the background art, which uses bisimulation metrics to learn the state vectors corresponding to image observations when performing agent control. As Fig. 4 shows, the average return obtained by the cheetah robot under the present invention is higher than under DBC, indicating that the invention obtains more robust state vectors when encoding image observations in the image continuous control task, further reduces the complexity of the control task, and improves the accuracy of the cheetah robot's control strategy.
And fifthly, the feature extraction module and the control module load the pt-format file obtained in step 4.14 and initialize their parameters with the parameters in the pt-format file, obtaining the trained agent control system based on deep reinforcement learning and conditional entropy bottleneck.
And sixthly, the trained agent control system based on deep reinforcement learning and conditional entropy bottleneck is deployed on an agent built in the real environment, namely the cheetah robot.
And seventhly, the trained agent control system based on deep reinforcement learning and conditional entropy bottleneck assists the agent in completing the image continuous control task: a cheetah-robot running scenario is constructed and the task of making the cheetah robot run fast is completed.
7.1 Initialize the number of actions t = 0 of the robot, and set the maximum number of actions T_0 of the robot (T_0 is a positive integer; its value in this embodiment is 1000).
7.2 The perception module acquires an image observation of the cheetah-robot running scenario and sends it to the feature extraction module; the feature extraction module receives the image observation, Encoder_1 in the encoder network encodes it to obtain the corresponding first state vector, and the first state vector is sent to the control module; the control module receives the first state vector, the policy network maps it to a control instruction, and the control instruction is sent to the action module; the action module receives the control instruction and executes it in the cheetah-robot running scenario, as follows:
7.2.1 The perception module acquires the image observation s_t of the cheetah-robot running scenario at the robot's tth action and sends s_t to the feature extraction module.
7.2.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t according to the method described in step 4.5.2, and sends z_t to the control module.
7.2.3 The control module receives z_t; the policy network maps z_t to the control instruction a_t of the cheetah robot's tth action according to the method described in step 4.5.3, and sends a_t to the action module.
7.2.4 The action module receives the control instruction a_t and executes a_t in the cheetah-robot running scenario.
7.3 Let t = t + 1. If t equals the maximum number of actions T_0, go to the eighth step; otherwise, go to step 7.2.1.
And eighthly, finishing.
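A minimal sketch of the deployment loop of the seventh step (observe, encode with Encoder_1, map to a control instruction with the policy network, execute); the perception and actuator objects and their methods are placeholder interfaces, not part of the patent:

```python
import torch

@torch.no_grad()
def run_episode(perception, encoder_1, policy, actuator, max_steps=1000):
    """Run at most T0 = max_steps actions in the real environment."""
    for t in range(max_steps):
        s_t = perception.get_image_observation()   # perception module: image observation s_t
        z_t = encoder_1(s_t.unsqueeze(0))          # feature extraction: first state vector z_t
        a_t, _ = policy(z_t)                       # control module: control instruction a_t
        actuator.execute(a_t.squeeze(0))           # action module: act in the environment
```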
Fig. 5 is a schematic diagram of the cheetah robot's actions in the cheetah-robot running scenario embodiment shown in Fig. 3; the 9 sub-figures show the action sequence of two runs taken from the cheetah robot's T_0 actions. From this action sequence it can be seen that the cheetah robot's running process can be divided into body curling, leg kicking, going airborne and landing, which shows that the cheetah robot completes the task of running fast in the scenario and verifies the effectiveness and accuracy of the agent control method provided by the invention.

Claims (10)

1. An intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck is characterized by comprising the following steps:
firstly, constructing an intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, installing the control system on an intelligent agent, and enabling the intelligent agent to interact with an image continuous control task environment; the intelligent agent refers to an unmanned node with sensing, communication, movement, storage and calculation capabilities; the image continuous control task environment refers to an entity interacting with the intelligent agent, the intelligent agent observes the state of the environment in the form of images and acts in the environment according to continuous control instructions based on the image observations; the intelligent agent control system based on the deep reinforcement learning and the conditional entropy bottleneck consists of a perception module, an action module, a storage module, a data expansion module, a feature extraction module and a control module;
the sensing module is an image sensor and is connected with the feature extraction module and the storage module; the sensing module acquires image observation containing an intelligent agent state and an environment state from an image continuous control task environment, and sends the image observation to the feature extraction module and the storage module;
the action module is an actuator of an intelligent agent control instruction, is connected with the control module, receives the control instruction from the control module, and acts in the image continuous control task environment according to the control instruction;
the storage module is connected with the sensing module, the control module and the data expansion module, receives image observations from the sensing module, receives control instructions from the control module, receives rewards from the image continuous control task environment, and combines the image observations, control instructions and rewards into trajectory data of the interaction between the intelligent agent and the image continuous control task environment; the trajectory data is stored in the form of quadruples (s_t, a_t, r_t, s_{t+1}), wherein: s_t is the image observation received from the perception module at the tth interaction between the agent and the image continuous control task environment, a_t is the control instruction from the control module executed at the tth interaction, r_t is the reward value fed back by the environment for the control instruction a_t at the tth interaction, and s_{t+1} is the image observation received from the perception module after the environment state changes due to the tth interaction, which is called the image observation at the (t+1)th interaction between the agent and the image continuous control task environment;
the data expansion module is connected with the storage module, the feature extraction module and the control module; it randomly selects from the storage module the trajectory data τ = (s_t, a_t, r_t, s_{t+1}) required for training the agent control system based on deep reinforcement learning and conditional entropy bottleneck, performs N data expansions on s_t and s_{t+1} in τ to obtain the trajectory data after N data expansions τ^N = (s_t^1, …, s_t^N, a_t, r_t, s_{t+1}^1, …, s_{t+1}^N), where j is the index of the image observation after the jth data expansion, and sends τ^N to the feature extraction module and the control module;
the feature extraction module is connected with the sensing module, the data expansion module and the control module; the feature extraction module consists of an encoder network, a target encoder network, a feature fusion network, a single-view predictor network and a multi-view predictor network;
the encoder network consists of a first encoder network Encoder_1 and a second encoder network Encoder_2, and is connected with the sensing module, the data expansion module, the control module, the feature fusion network and the single-view predictor network; Encoder_1 consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer and is connected with the sensing module, the data expansion module, the control module and Encoder_2; Encoder_2 consists of 3 fully connected layers and is connected with Encoder_1, the feature fusion network and the single-view predictor network; when the agent interacts with the image continuous control task environment, Encoder_1 receives s_t from the perception module, the first, second, third and fourth convolutional layers of Encoder_1 sequentially perform convolution operations on s_t with 3×3 convolution kernels to obtain the four-times-convolved s_t, which is sent to the fully connected layer of Encoder_1; the fully connected layer of Encoder_1 performs a full connection operation on the four-times-convolved s_t received from the fourth convolutional layer to obtain the state vector corresponding to the fully connected s_t, and sends the fully connected state vector to the regularization layer of Encoder_1; the regularization layer of Encoder_1 performs a regularization operation on the fully connected state vector received from the fully connected layer of Encoder_1 to obtain a regularized state vector, takes the regularized state vector as the first state vector z_t, and sends z_t to the control module; when training the agent control system based on deep reinforcement learning and conditional entropy bottleneck, Encoder_1 receives the data-expanded trajectory data τ^N from the data expansion module, the first, second, third and fourth convolutional layers of Encoder_1 sequentially perform convolution operations with 3×3 convolution kernels on s_t^1, …, s_t^N in τ^N to obtain the four-times-convolved s_t^1, …, s_t^N, which are sent to the fully connected layer of Encoder_1; the fully connected layer of Encoder_1 performs full connection operations on the four-times-convolved s_t^1, …, s_t^N received from the fourth convolutional layer to obtain the N state vectors corresponding to the fully connected s_t^1, …, s_t^N, and sends the N fully connected state vectors to the regularization layer of Encoder_1; the regularization layer of Encoder_1 performs regularization operations on the N fully connected state vectors received from the fully connected layer of Encoder_1 to obtain N regularized state vectors, takes them as the second state vectors, denoted z_t^1, …, z_t^N, sends the first second state vector z_t^1 to the control module, and sends z_t^1, …, z_t^N to Encoder_2; the first, second and third fully connected layers of Encoder_2 sequentially perform full connection operations on z_t^1, …, z_t^N received from Encoder_1 to obtain the means and variances of the Gaussian distributions corresponding to the three-times-fully-connected z_t^1, …, z_t^N, and perform a reparameterization operation on the means and variances to obtain N reparameterized state vectors, denoted z̃_t^1, …, z̃_t^N, which are sent to the feature fusion network and the single-view predictor network;
the target encoder network is connected with the data expansion module, the control module and the feature fusion network and consists of 4 convolutional layers, 1 fully connected layer and 1 regularization layer; the target encoder network receives τ^N from the data expansion module, its first, second, third and fourth convolutional layers sequentially perform convolution operations with 3×3 convolution kernels on s_{t+1}^1, …, s_{t+1}^N in τ^N to obtain the four-times-convolved s_{t+1}^1, …, s_{t+1}^N, which are sent to the fully connected layer; the fully connected layer performs full connection operations on the four-times-convolved s_{t+1}^1, …, s_{t+1}^N received from the fourth convolutional layer to obtain the N target state vectors corresponding to the fully connected s_{t+1}^1, …, s_{t+1}^N, and sends the N fully connected target state vectors to the regularization layer; the regularization layer performs regularization operations on the N fully connected target state vectors received from the fully connected layer to obtain N regularized target state vectors, takes them as the target state vectors, denoted z_{t+1}^1, …, z_{t+1}^N, sends the first target state vector z_{t+1}^1 to the control module, and sends z_{t+1}^1, …, z_{t+1}^N to the feature fusion network;
the feature fusion network is connected with the encoder network, the target encoder network and the multi-view predictor network and consists of a first fusion network Feature_1 and a second fusion network Feature_2; Feature_1 and Feature_2 each consist of 3 fully connected layers; Feature_1 is connected to the encoder network, the target encoder network and Feature_2; Feature_1 receives z̃_t^1, …, z̃_t^N from the encoder network and z_{t+1}^1, …, z_{t+1}^N from the target encoder network; the first, second and third fully connected layers of Feature_1 sequentially perform full connection operations on z̃_t^1, …, z̃_t^N, splicing them into the state fusion vector z_t^{fuse}, and send z_t^{fuse} to Feature_2; the first, second and third fully connected layers of Feature_1 also sequentially perform full connection operations on z_{t+1}^1, …, z_{t+1}^N, splicing them into the target state fusion vector z_{t+1}^{fuse}; Feature_2 is connected to Feature_1 and the multi-view predictor network and receives z_t^{fuse} from Feature_1; the first, second and third fully connected layers of Feature_2 sequentially perform a reparameterization operation on z_t^{fuse} to obtain the reparameterized state fusion vector z̃_t^{fuse}, which is sent to the multi-view predictor network;
the single-view predictor network is connected with the data expansion module and the encoder network and consists of 3 fully connected layers; the single-view predictor network receives the data-expanded trajectory data τ^N from the data expansion module and z̃_t^1, …, z̃_t^N from the encoder network; its first, second and third fully connected layers sequentially perform full connection operations on the first splicing vectors formed by z̃_t^j and the control instruction a_t in τ^N, mapping each first splicing vector to a predicted target state vector ẑ_{t+1}^j and a first predicted reward value r̂_t^j, thereby realizing prediction of the transition dynamics equation and the reward function equation, wherein: ẑ_{t+1}^j denotes the jth term of the predicted target state vector and r̂_t^j denotes the jth term of the first predicted reward value;
the multi-view predictor network is connected with the data expansion module and the feature fusion network and consists of 3 fully connected layers; the multi-view predictor network receives the data-expanded trajectory data τ^N from the data expansion module and z̃_t^{fuse} from the feature fusion network, forms a second splicing vector from z̃_t^{fuse} and the control instruction a_t in τ^N, and its first, second and third fully connected layers sequentially perform full connection operations on the second splicing vector, mapping it to a predicted target state fusion vector ẑ_{t+1}^{fuse} and a second predicted reward value r̂_t^{fuse}, thereby realizing prediction of the transition dynamics equation and the reward function equation;
the control module is connected with the feature extraction module, the data expansion module and the action module and consists of a first evaluation network Critic_1, a second evaluation network Critic_2, a first target evaluation network TarCritic_1, a second target evaluation network TarCritic_2 and a policy network;
Critic_1 and Critic_2 are connected with the feature extraction module, the data expansion module and the policy network, each consisting of three fully connected layers; both receive the first second state vector z_t^1 from the feature extraction module, receive the data-expanded trajectory data τ^N from the data expansion module, and receive the control instruction a from the policy network, and evaluate the quality of the control instruction a_t in τ^N and of a from the policy network; the first, second and third fully connected layers of Critic_1 perform full connection operations on the third splicing vector formed by z_t^1 and a_t, mapping the three-times-fully-connected third splicing vector to the first state-action value Q_1(z_t^1, a_t); the first, second and third fully connected layers of Critic_1 perform full connection operations on the fourth splicing vector formed by z_t^1 and a, mapping the three-times-fully-connected fourth splicing vector to the second state-action value Q_1(z_t^1, a); the first, second and third fully connected layers of Critic_2 perform full connection operations on the third splicing vector formed by z_t^1 and a_t, mapping it to the third state-action value Q_2(z_t^1, a_t); the first, second and third fully connected layers of Critic_2 perform full connection operations on the fourth splicing vector formed by z_t^1 and a, mapping it to the fourth state-action value Q_2(z_t^1, a);
TarCritic_1 and TarCritic_2 are both connected with the feature extraction module and the policy network, each consisting of three fully connected layers; both receive the first target state vector z_{t+1}^1 from the feature extraction module and the control instruction a' from the policy network, and evaluate the quality of a'; the first, second and third fully connected layers of TarCritic_1 perform full connection operations on the target splicing vector formed by z_{t+1}^1 and a', mapping the three-times-fully-connected target splicing vector to the first target state-action value Q̄_1(z_{t+1}^1, a'); the first, second and third fully connected layers of TarCritic_2 perform full connection operations on the target splicing vector formed by z_{t+1}^1 and a', mapping it to the second target state-action value Q̄_2(z_{t+1}^1, a');
The strategy network is connected with the feature extraction module, the action module, the storage module, the Critic _1, the Critic _2, the TarCritic _1 and the TarCritic _2 and consists of three fully-connected layers; receiving a first state vector z from a feature extraction module while an agent interacts with an image continuation control task environment The first, second and third full connection layers of the policy network are aligned to z in sequence t Carrying out a full ligation operation of t Mapping as control instruction a t A is mixing t Sending the information to an action module and a storage module; in training an agent control system based on deep reinforcement learning and conditional entropy bottlenecks, a first second state vector is received from a feature extraction module
Figure FDA0004066255300000041
And a first target state vector->
Figure FDA0004066255300000042
The first, second and third fully-connected layers of the policy network are in turn paired->
Figure FDA0004066255300000043
Performing a full connection operation to put>
Figure FDA0004066255300000044
Mapping the control command a into a control command a, and sending a to critical _1 and critical _2; the first, second and third fully-connected layers of the policy network are in turn paired->
Figure FDA0004066255300000045
Performing a full connection operation to combine>
Figure FDA0004066255300000046
Mapped as control instruction a A is to Sending to TarCritic _1 and TarCritic _2;
secondly, constructing an objective function of the feature extraction module based on the conditional entropy bottleneck, and obtaining the optimized loss function of the feature extraction module through a variational inference technique; the method comprises the following steps:
2.1 designing a feature extraction module objective function shown in formula (1) based on the conditional entropy bottleneck;
(Formula (1): the conditional-entropy-bottleneck objective Object of the feature extraction module.)

wherein: Object represents the objective of the feature extraction module, s_t^j is the image observation obtained by the data expansion module through the jth data expansion of the image observation at the tth interaction, s_{t+1}^j is the image observation obtained through the jth data expansion of the image observation at the (t+1)th interaction, z̃_t^j is the reparameterized state vector obtained after s_t^j is input into the encoder network, z_{t+1}^j is the target state vector obtained after s_{t+1}^j is input into the target encoder network, z̃_t^{fuse} is the reparameterized state fusion vector obtained after the N reparameterized state vectors z̃_t^1, …, z̃_t^N are input into the feature fusion network, z_{t+1}^{fuse} is the target state fusion vector obtained after the N target state vectors z_{t+1}^1, …, z_{t+1}^N are input into the feature fusion network, β_j is the regularization factor, and the terms combined in formula (1) are conditional mutual information terms between these quantities;
2.2 applying the variational inference technique to formula (1) to obtain the optimized loss function of the feature extraction module shown in formula (2):

(Formula (2): the optimized loss function of the feature extraction module, derived from formula (1) by variational inference.)

wherein: M is the number of pieces of trajectory data randomly selected from the storage module by the data expansion module; the distribution of z̃_t^j produced by the encoder network is a Gaussian distribution whose mean and variance are computed by Encoder_2 in the encoder network, and a variational distribution is used to approximate it; ξ^1, ξ^2, …, ξ^N are Gaussian noises, and the expectations in formula (2) are taken over ξ^j and over ξ^1, ξ^2, …, ξ^N respectively; the remaining terms of formula (2) are the cross-entropy loss between the jth term of the predicted target state vector, obtained by inputting z̃_t^j and a_t into the single-view predictor network, and z_{t+1}^j, the cross-entropy loss between the jth term of the first predicted reward value and r_t, the cross-entropy loss between the predicted target state fusion vector, obtained by inputting z̃_t^{fuse} and a_t into the multi-view predictor network, and z_{t+1}^{fuse}, and the cross-entropy loss between the second predicted reward value and r_t;
thirdly, constructing an image continuous control task simulation environment in the open-source DeepMind Control Suite (DMControl) simulation environment, in preparation for training the agent control system based on deep reinforcement learning and conditional entropy bottleneck; the method comprises the following steps:
3.1, installing the DMControl simulation environment on any computer equipped with Ubuntu and the PyTorch deep learning framework, constructing an agent simulation model and an image continuous control task simulation environment, and setting the task objective of the agent in the image continuous control task;
3.2 setting the scale of image observation of the intelligent agent for sensing the self state and the environmental state in the constructed image continuous control task simulation environment, setting the control instruction executed by the intelligent agent as a continuous vector, and setting an incentive value fed back by the image continuous control task simulation environment after the intelligent agent executes the control instruction according to a task target;
fourthly, the intelligent agent trains an intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck in the image continuous control task simulation environment established in the third step, and the method comprises the following steps:
4.1 initializing network parameters of a feature extraction module and a control module in the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, wherein the parameters comprise a weight matrix and a bias vector of a full connection layer, a convolution kernel of a convolution layer and a weight matrix and a bias vector of a regularization layer, and the parameters are generated by using an orthogonal initialization mode, wherein nonzero parameters are from normal distribution with the mean value of 0 and the standard deviation of 1;
4.2 setting the size of a storage module in the intelligent agent control system based on the deep reinforcement learning and the conditional entropy bottleneck as a buffer area queue for storing track data formed when A intelligent agents interact with the image continuous control task simulation environment, and emptying the buffer area queue;
4.3, initializing the interaction times T =0 of the intelligent agent and the image continuous control task simulation environment constructed in the third step, and setting the maximum interaction times T of the intelligent agent, the maximum interaction times E of each round of the intelligent agent and the image continuous control task simulation environment, and the updating frequency F of a target encoder network and a target evaluation network in the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck;
4.4, randomly setting the initial state of the image continuous control task simulation environment constructed in the third step and the initial state of the intelligent agent simulation model;
4.5, the perception module acquires image observation when the intelligent agent interacts with the image continuous control task simulation environment, and sends the image observation to the feature extraction module and the storage module; the characteristic extraction module receives image observation, an Encoder _1 in an Encoder network encodes the image observation to obtain a first state vector corresponding to the image observation, and the first state vector is sent to the control module; the control module receives the first state vector, the strategy network maps the first state vector into a control instruction, and the control instruction is sent to the action module and the storage module, and the method comprises the following steps:
4.5.1 The perception module acquires the image observation s_t at the tth interaction between the agent and the image continuous control task simulation environment, and sends s_t to the feature extraction module and the storage module;
4.5.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t and sends z_t to the control module;
4.5.3 The control module receives the first state vector z_t; the policy network maps z_t to the control instruction a_t to be executed at the tth interaction between the agent and the image continuous control task simulation environment, and sends a_t to the action module and the storage module;
4.6 The action module receives the control instruction a_t and executes a_t in the image continuous control task simulation environment;
4.7 The image continuous control task simulation environment returns, according to the reward value designed in step 3.2, the reward value r_t obtained at the tth interaction between the agent and the environment, and sends r_t to the storage module;
4.8 The state of the image continuous control task simulation environment changes because the agent executes the control instruction a_t; the perception module acquires the image observation s_{t+1} corresponding to the changed environment state and sends s_{t+1} to the storage module;
4.9 The storage module receives s_t and s_{t+1} from the perception module, a_t from the control module and r_t from the image continuous control task simulation environment, combines them into the trajectory data quadruple (s_t, a_t, r_t, s_{t+1}), and stores it in the buffer queue;
4.10 let t = t +1; if t% E =0, turning to step 4.4; otherwise, turning to step 4.11;
4.11 The data expansion module judges whether there are L pieces of trajectory data in the buffer queue of the storage module; if so, it randomly selects M pieces of trajectory data from the buffer queue to form a trajectory data set τ_M = {τ_M_1, …, τ_M_M}, where τ_M_m = (s_t(m), a_t(m), r_t(m), s_{t+1}(m)) denotes the mth (m ∈ [1, M]) piece of trajectory data in τ_M, goes to step 4.12, and optimizes the agent control system based on deep reinforcement learning and conditional entropy bottleneck according to τ_M; if there are not yet L pieces of trajectory data in the buffer queue, go to step 4.5.1;
4.12 the data expansion module uses a random cutting method in data enhancement to sequentially perform N times of data expansion on the image observation of each track data in the tau _ M to obtain M track data after data expansion, and sends the M track data after data expansion to the feature extraction module and the control module; the feature extraction module and the control module receive M pieces of track data after data expansion, and an intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck is optimized, wherein the method comprises the following steps:
4.12.1 initializing the number of optimizations k = 0 and setting the total number of optimizations K;
4.12.2 The data expansion module uses the random cropping method in data enhancement to perform N data expansions in turn on the image observations of each piece of trajectory data in τ_M, obtaining M pieces of data-expanded trajectory data τ_M^N = {τ_M_1^N, …, τ_M_M^N}, where τ_M_m^N is the trajectory data after N data expansions of τ_M_m: τ_M_m^N = (s_t^1(m), …, s_t^N(m), a_t(m), r_t(m), s_{t+1}^1(m), …, s_{t+1}^N(m)); τ_M^N is sent to the feature extraction module and the control module;
4.12.3 The feature extraction module receives τ_M^N from the data expansion module; for the M pieces of trajectory data in τ_M^N it minimizes, in turn, the feature extraction module optimized loss function shown in formula (2) by gradient descent, optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network in the feature extraction module, the method being:
4.12.3.1 The encoder network, the target encoder network, the single-view predictor network and the multi-view predictor network receive τ_M^N from the data expansion module;
4.12.3.2 initializes track data index m =1;
4.12.3.3 Encoder_1 in the encoder network encodes the image observations s_t^1(m), …, s_t^N(m) in the mth trajectory data τ_M_m^N of τ_M^N into N second state vectors z_t^1(m), …, z_t^N(m), where z_t^j(m) is the second state vector corresponding to s_t^j(m), and sends z_t^1(m), …, z_t^N(m) to Encoder_2; Encoder_2 obtains, through its fully connected layers, the means and variances of the Gaussian distributions corresponding to z_t^1(m), …, z_t^N(m), where each mean and variance belongs to the Gaussian distribution of the corresponding z_t^j(m); Encoder_2 performs a reparameterization operation on the means and variances to obtain N reparameterized state vectors z̃_t^1(m), …, z̃_t^N(m), where z̃_t^j(m) corresponds to z_t^j(m), and sends z̃_t^1(m), …, z̃_t^N(m) to the feature fusion network and the single-view predictor network;
4.12.3.4 The target encoder network encodes the image observations s_{t+1}^1(m), …, s_{t+1}^N(m) in the mth trajectory data τ_M_m^N of τ_M^N into N target state vectors z_{t+1}^1(m), …, z_{t+1}^N(m), where z_{t+1}^j(m) is the target state vector corresponding to s_{t+1}^j(m), and sends z_{t+1}^1(m), …, z_{t+1}^N(m) to the feature fusion network;
4.12.3.5 The feature fusion network receives z̃_t^1(m), …, z̃_t^N(m) from the encoder network; Feature_1 performs feature fusion on them to obtain the state fusion vector z_t^{fuse}(m) and sends z_t^{fuse}(m) to Feature_2; Feature_2 performs a reparameterization operation on z_t^{fuse}(m) to obtain the reparameterized state fusion vector z̃_t^{fuse}(m) and sends z̃_t^{fuse}(m) to the multi-view predictor network;
4.12.3.6 The feature fusion network receives z_{t+1}^1(m), …, z_{t+1}^N(m) from the target encoder network; Feature_1 performs feature fusion on them to obtain the target state fusion vector z_{t+1}^{fuse}(m);
4.12.3.7 The single-view predictor network receives z̃_t^1(m), …, z̃_t^N(m) from the encoder network and the mth trajectory data τ_M_m^N from τ_M^N; it performs transition dynamics equation prediction and reward function equation prediction on the first splicing vectors formed by z̃_t^j(m) and the control instruction a_t(m) in τ_M_m^N, obtaining the predicted target state vectors ẑ_{t+1}^1(m), …, ẑ_{t+1}^N(m) and the first predicted reward values r̂_t^1(m), …, r̂_t^N(m);
4.12.3.8 The multi-view predictor network receives z̃_t^{fuse}(m) from the feature fusion network and the mth trajectory data τ_M_m^N from τ_M^N; it performs transition dynamics equation prediction and reward function equation prediction on the second splicing vector formed by z̃_t^{fuse}(m) and the control instruction a_t(m) in τ_M_m^N, obtaining the predicted target state fusion vector ẑ_{t+1}^{fuse}(m) and the second predicted reward value r̂_t^{fuse}(m);
4.12.3.9 The feature extraction module uses gradient descent to minimize the optimized loss function in formula (2) based on the means and variances obtained in step 4.12.3.3, the target state vectors z_{t+1}^1(m), …, z_{t+1}^N(m) obtained in step 4.12.3.4, the z_{t+1}^{fuse}(m) obtained in step 4.12.3.6, the ẑ_{t+1}^j(m) and r̂_t^j(m) obtained in step 4.12.3.7, the ẑ_{t+1}^{fuse}(m) and r̂_t^{fuse}(m) obtained in step 4.12.3.8, and r_t(m) in τ_M_m^N, optimizing the encoder network, the feature fusion network, the single-view predictor network and the multi-view predictor network through the reverse update of the gradient;
4.12.3.10 if m < M, let m = m + 1 and go to step 4.12.3.3; otherwise, go to step 4.12.4;
4.12.4 Encoder_1 of the encoder network in the feature extraction module encodes the first-data-expansion image observations s_t^1(m) of the M pieces of trajectory data in τ_M^N into M second state vectors z_t^1(1), …, z_t^1(M) and sends them to the control module;
4.12.5 The target encoder network in the feature extraction module encodes the first-data-expansion image observations s_{t+1}^1(m) of the M pieces of trajectory data in τ_M^N into M target state vectors z_{t+1}^1(1), …, z_{t+1}^1(M) and sends them to the control module;
4.12.6 The control module receives τ_M^N from the data expansion module and receives the second state vectors z_t^1(1), …, z_t^1(M) and the target state vectors z_{t+1}^1(1), …, z_{t+1}^1(M) from the feature extraction module; for the M pieces of trajectory data in τ_M^N it minimizes, in turn, the loss functions shown in formula (3), formula (4) and formula (5) by gradient descent, optimizing the evaluation networks and the policy network, the method being:
4.12.6.1 The policy network receives z_t^1(1), …, z_t^1(M) and z_{t+1}^1(1), …, z_{t+1}^1(M) from the feature extraction module; Critic_1 and Critic_2 receive z_t^1(1), …, z_t^1(M) from the feature extraction module and τ_M^N from the data expansion module; TarCritic_1 and TarCritic_2 receive z_{t+1}^1(1), …, z_{t+1}^1(M) from the feature extraction module;
4.12.6.2 initializing trajectory data index m =1;
4.12.6.3 The policy network takes the mth second state vector z_t^1(m) from z_t^1(1), …, z_t^1(M) and the mth target state vector z_{t+1}^1(m) from z_{t+1}^1(1), …, z_{t+1}^1(M), and performs the following operations: it performs action mapping on z_t^1(m) to obtain a control instruction a(m) and sends a(m) to Critic_1 and Critic_2; it performs action mapping on z_{t+1}^1(m) to obtain a control instruction a'(m) and sends a'(m) to TarCritic_1 and TarCritic_2;
4.12.6.4 Critic_1 receives the control instruction a(m) from the policy network, takes the mth second state vector z_t^1(m) from z_t^1(1), …, z_t^1(M) and the mth trajectory data from τ_M^N, and performs the following operations: it performs state-action value estimation on the third splicing vector formed by z_t^1(m) and a_t(m), obtaining the first state-action value Q_1(z_t^1(m), a_t(m)); it performs state-action value estimation on the splicing vector formed by z_t^1(m) and a(m), obtaining the second state-action value Q_1(z_t^1(m), a(m));
4.12.6.5 Critic_2 receives the control instruction a(m) from the policy network, takes the mth second state vector z_t^1(m) from z_t^1(1), …, z_t^1(M) and the mth trajectory data from τ_M^N, and performs the following operations: it performs state-action value estimation on the third splicing vector formed by z_t^1(m) and a_t(m), obtaining the third state-action value Q_2(z_t^1(m), a_t(m)); it performs state-action value estimation on the splicing vector formed by z_t^1(m) and a(m), obtaining the fourth state-action value Q_2(z_t^1(m), a(m));
4.12.6.6 TarCritic_1 receives the control instruction a'(m) from the policy network, takes the mth target state vector z_{t+1}^1(m) from z_{t+1}^1(1), …, z_{t+1}^1(M), and performs target state-action value estimation on the target splicing vector formed by z_{t+1}^1(m) and a'(m), obtaining the first target state-action value Q̄_1(z_{t+1}^1(m), a'(m));
4.12.6.7 TarCritic_2 receives the control instruction a'(m) from the policy network, takes the mth target state vector z_{t+1}^1(m) from z_{t+1}^1(1), …, z_{t+1}^1(M), and performs target state-action value estimation on the target splicing vector formed by z_{t+1}^1(m) and a'(m), obtaining the second target state-action value Q̄_2(z_{t+1}^1(m), a'(m));
4.12.6.8 The control module minimizes the loss function shown in formula (3) by gradient descent, updating and optimizing Critic_1 and Critic_2 along the reverse direction of the gradient;

J_Q(θ_i) = ( Q_{θ_i}(z_t^1(m), a_t(m)) − ( r_t(m) + γ·V̂(z_{t+1}^1(m)) ) )²,  with  V̂(z_{t+1}^1(m)) = min_{i=1,2} Q̄_{θ̄_i}(z_{t+1}^1(m), a'(m)) − α·log π_φ(a'(m) | z_{t+1}^1(m))   (3)

wherein: θ_i is the parameter of the ith evaluation network, θ̄_i is the parameter of the ith target evaluation network, φ is the parameter of the policy network, i = 1 or 2 is the index of the two evaluation networks and the two target evaluation networks in the control module, V̂(z_{t+1}^1(m)) is the state value corresponding to the target state vector z_{t+1}^1(m), min_{i=1,2} Q̄_{θ̄_i}(z_{t+1}^1(m), a'(m)) is the smaller of the first target state-action value and the second target state-action value, α is the temperature coefficient factor, and γ is the reward discount factor;
4.12.6.9 The control module minimizes the loss function shown in formula (4) by gradient descent, optimizing the policy network through the reverse update of the gradient;

J_π(φ) = α·log π_φ(a(m) | z_t^1(m)) − min_{i=1,2} Q_{θ_i}(z_t^1(m), a(m))   (4)

wherein: π_φ(a(m) | z_t^1(m)) is the distribution from which the policy network outputs the control instruction a(m) given the second state vector z_t^1(m), and min_{i=1,2} Q_{θ_i}(z_t^1(m), a(m)) is the smaller of the second state-action value and the fourth state-action value;
4.12.6.10 The control module minimizes the loss function shown in formula (5) by gradient descent, optimizing the temperature coefficient factor through the reverse update of the gradient;

J(α) = −α·log π_φ(a(m) | z_t^1(m)) − α·H̄   (5)

wherein: H̄ is the target entropy of the policy network, set to the negative of the dimension of the agent control instruction a(m);
4.12.6.11 if m < M, let m = m + 1 and go to step 4.12.6.3; otherwise, go to step 4.12.7;
4.12.7 determining whether t% F is equal to zero, if so, updating parameters of a target encoder network and a target evaluation network in the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck, and turning to 4.12.8, otherwise, turning to 4.12.9;
4.12.8 The agent control system based on deep reinforcement learning and conditional entropy bottleneck updates the parameters of the target encoder network and the parameters of the two target evaluation networks by exponential moving average according to formula (6) and formula (7);

ζ̄ ← τ_p·ζ + (1 − τ_p)·ζ̄   (6)
θ̄_i ← τ_Q·θ_i + (1 − τ_Q)·θ̄_i,  i = 1, 2   (7)

wherein: τ_p and τ_Q are the hyper-parameters for updating the target encoder network and the target evaluation networks, ζ is the parameter of the encoder network, and ζ̄ is the parameter of the target encoder network;
4.12.9 let k = k + 1; if k equals the total number of optimizations K, the optimization is complete and the process goes to step 4.13; otherwise, go to step 4.12.2.1;
4.13 The agent control system based on deep reinforcement learning and conditional entropy bottleneck judges whether t equals the maximum number of interactions T; if so, the training is finished and the process goes to step 4.14; otherwise, go to step 4.5.1;
4.14 the intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck saves the network parameters of the feature extraction module and the control module optimized in the step 4.12 into a pt format file;
fifthly, loading the file in the pt format obtained in the step 4.14 by the feature extraction module and the control module, and initializing parameters of the feature extraction module and the control module by using parameters in the file in the pt format to obtain the trained intelligent agent control system based on the deep reinforcement learning and the conditional entropy bottleneck;
sixthly, deploying the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck on an intelligent agent in a real environment;
seventhly, the trained intelligent agent control system based on deep reinforcement learning and conditional entropy bottleneck assists the intelligent agent to complete the image continuous control task, and the method comprises the following steps:
7.1 Initialize the number of actions t = 0 of the agent and set the maximum number of actions T_0 of the agent in the real environment;
7.2, the perception module obtains image observation of a real environment and sends the image observation to the feature extraction module; the characteristic extraction module receives image observation, an Encoder _1 in an Encoder network encodes the image observation to obtain a first state vector corresponding to the image observation, and the first state vector is sent to the control module; the control module receives the first state vector, the strategy network maps the first state vector into a control instruction, and the control instruction is sent to the action module; the action module receives the control instruction and executes the control instruction in the real environment, and the method comprises the following steps:
7.2.1 The perception module acquires the image observation s_t of the agent's real environment at the tth action and sends s_t to the feature extraction module;
7.2.2 The feature extraction module receives the image observation s_t; Encoder_1 in the encoder network encodes s_t into the first state vector z_t according to the method described in step 4.5.2 and sends z_t to the control module;
7.2.3 The control module receives z_t; the policy network maps z_t to the control instruction a_t of the agent's tth action according to the method described in step 4.5.3 and sends a_t to the action module;
7.2.4 The action module receives the control instruction a_t and executes a_t in the real environment;
7.3 let t = t + 1; if t equals the maximum number of actions T_0, go to the eighth step; otherwise, go to step 7.2.1;
and eighthly, finishing.
2. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck of claim 1, wherein the intelligent agent refers to an unmanned aerial vehicle or a robot or a simulated mechanical arm or a simulated robot constructed in a simulated environment; the image sensor refers to a depth camera; the action module refers to an engine and a steering gear of the intelligent agent; the memory module is a memory that includes more than 1GB of available space.
3. The method as claimed in claim 1, wherein the agent status refers to information of the agent itself, the environment status refers to information other than the agent, and the image observation is an RGB image.
4. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, characterized in that the regularization factor β_j in the second step takes a value of 1e-4 to 1e-2.
5. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, characterized in that in step 3.1 Ubuntu is required to be version 16.04 or above, and the DMControl simulation environment requires the physics engine MuJoCo, version MuJoCo200.
6. The method as claimed in claim 1, characterized in that in step 3.2 the scale of the image observation used by the agent to perceive its own state and the environment state is set to 100 × 100, and the continuous vector set as the control instruction executed by the agent comprises joint rotation speeds and torques.
7. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, characterized in that the number A of pieces of trajectory data in the buffer queue in step 4.2 satisfies A ≥ 10^5; the maximum number of interactions T in step 4.3 is an integer with T ≥ 5A; the maximum number of interactions per round E in step 4.3 is a positive integer with a value of 1000; the update frequency F in step 4.3 is a positive integer with a value of 2; L in step 4.11 is set to 1000; M in step 4.11 is set to 512; the total number of optimizations K in step 4.12.1 is set to 10; the initial value of the temperature coefficient factor α in step 4.12.6.8 is set to 0.1 and the reward discount factor γ is set to 0.99; τ_p in step 4.12.8 is set to 0.05 and τ_Q to 0.01; the pt-format file in step 4.14 is generated directly by the deep learning framework PyTorch.
8. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, wherein the method by which the storage module stores the trajectory data (s_t, a_t, r_t, s_t+1) in the buffer queue in step 4.9 is as follows (see the illustrative sketch after this claim):
4.9.1 the storage module judges whether the buffer queue already holds A pieces of trajectory data; if so, go to step 4.9.2; otherwise, go to step 4.9.3;
4.9.2 the storage module removes the trajectory data at the head of the buffer queue according to the first-in-first-out principle;
4.9.3 the storage module combines s_t, s_t+1, a_t and r_t into the trajectory data quadruple (s_t, a_t, r_t, s_t+1) and stores it at the tail of the buffer queue.
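A minimal sketch of the first-in-first-out buffer-queue behaviour of steps 4.9.1 to 4.9.3, assuming the queue capacity equals A; the class name TrajectoryBuffer and its methods are illustrative and not taken from the patent.

# Minimal FIFO buffer-queue sketch for steps 4.9.1-4.9.3 (illustrative names only).
from collections import deque

class TrajectoryBuffer:
    def __init__(self, capacity_A: int):
        self.queue = deque()
        self.capacity = capacity_A

    def store(self, s_t, a_t, r_t, s_t1):
        # 4.9.1 / 4.9.2: if the queue is already full, drop the oldest item at the head (first in, first out)
        if len(self.queue) >= self.capacity:
            self.queue.popleft()
        # 4.9.3: combine the transition into a quadruple and append it at the tail
        self.queue.append((s_t, a_t, r_t, s_t1))

Using collections.deque(maxlen=capacity_A) would realise the same first-in-first-out eviction implicitly.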
9. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, wherein in step 4.12.2 the data expansion module uses the random-cropping method of data augmentation to perform N data expansions in turn on the image observation of each piece of trajectory data in τ_M, obtaining M pieces of data-expanded trajectory data, and the method for sending the M pieces of data-expanded trajectory data to the feature extraction module and the control module is as follows (see the illustrative sketch after this claim):
4.12.2.1 initialize the trajectory data index m = 1;
4.12.2.2 initialize the data expansion counter j = 0 and set the total number of data expansions N to 2;
4.12.2.3 using the random-cropping method of data augmentation and referring to the settings in RAD, crop the 100 × 100 image observation s_t(m) in the m-th trajectory data τ_M_m = (s_t(m), a_t(m), r_t(m), s_t+1(m)) of τ_M to an 84 × 84 image observation, and crop the 100 × 100 image observation s_t+1(m) to an 84 × 84 image observation;
4.12.2.4 let j = j + 1; if j equals the total number of data expansions N, go to step 4.12.2.5; otherwise, go to step 4.12.2.3;
4.12.2.5 the data expansion module replaces s_t(m) in τ_M_m with the image observations obtained from the N data expansions, and replaces s_t+1(m) with the image observations obtained from the N data expansions, thereby obtaining the trajectory data of τ_M_m after the N data expansions;
4.12.2.6 if m < M, let m = m + 1 and go to step 4.12.2.2; if m = M, the expansion of the M pieces of trajectory data is finished and the M pieces of data-expanded trajectory data τ_M^N are obtained; go to step 4.12.2.7;
4.12.2.7 the data expansion module sends τ_M^N to the feature extraction module and the control module.
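A minimal RAD-style random-crop sketch of steps 4.12.2.2 to 4.12.2.5, assuming image observations stored as NumPy arrays of shape (channels, 100, 100) and N = 2 crops per observation; the function names are illustrative and not the patented implementation.

# Illustrative RAD-style random crop: 100x100 observations cropped to 84x84, N = 2 times each.
import numpy as np

def random_crop(obs, out_size=84):
    # obs has shape (channels, height, width)
    _, h, w = obs.shape
    top = np.random.randint(0, h - out_size + 1)
    left = np.random.randint(0, w - out_size + 1)
    return obs[:, top:top + out_size, left:left + out_size]

def expand_observation(obs, n_expansions=2):
    # N independent random crops of the same image observation (steps 4.12.2.3-4.12.2.4)
    return [random_crop(obs) for _ in range(n_expansions)]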
10. The intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck as claimed in claim 1, wherein the maximum action number T_0 in step 7.1 is a positive integer and takes the value 1000.
CN202210865762.2A 2022-07-21 2022-07-21 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck Active CN115167136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210865762.2A CN115167136B (en) 2022-07-21 2022-07-21 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210865762.2A CN115167136B (en) 2022-07-21 2022-07-21 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck

Publications (2)

Publication Number Publication Date
CN115167136A CN115167136A (en) 2022-10-11
CN115167136B true CN115167136B (en) 2023-04-07

Family

ID=83497263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210865762.2A Active CN115167136B (en) 2022-07-21 2022-07-21 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck

Country Status (1)

Country Link
CN (1) CN115167136B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117406706B (en) * 2023-08-11 2024-04-09 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11295174B2 (en) * 2018-11-05 2022-04-05 Royal Bank Of Canada Opponent modeling with asynchronous methods in deep RL
CN113096153A (en) * 2021-03-09 2021-07-09 山西三友和智慧信息技术股份有限公司 Real-time active vision method based on deep reinforcement learning humanoid football robot
CN112861442B (en) * 2021-03-10 2021-12-03 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113095488A (en) * 2021-04-29 2021-07-09 电子科技大学 Cooperative game method based on multi-agent maximum entropy reinforcement learning
CN113283597A (en) * 2021-06-11 2021-08-20 浙江工业大学 Deep reinforcement learning model robustness enhancing method based on information bottleneck
CN113392935B (en) * 2021-07-09 2023-05-30 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN113478486B (en) * 2021-07-12 2022-05-17 上海微电机研究所(中国电子科技集团公司第二十一研究所) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
CN114154821A (en) * 2021-11-22 2022-03-08 厦门深度赋智科技有限公司 Intelligent scheduling dynamic scheduling method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN115167136A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
Sanchez-Gonzalez et al. Graph networks as learnable physics engines for inference and control
Grimes et al. Dynamic imitation in a humanoid robot through nonparametric probabilistic inference.
Byravan et al. Imagined value gradients: Model-based policy optimization with transferable latent dynamics models
CN112819253A (en) Unmanned aerial vehicle obstacle avoidance and path planning device and method
WO2021058588A1 (en) Training action selection neural networks using hindsight modelling
Zhou et al. Robot navigation in a crowd by integrating deep reinforcement learning and online planning
Roghair et al. A vision based deep reinforcement learning algorithm for uav obstacle avoidance
Shi et al. Skill-based model-based reinforcement learning
CN115167136B (en) Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck
Bernstein et al. Reinforcement learning for computer vision and robot navigation
CN112633463A (en) Dual recurrent neural network architecture for modeling long term dependencies in sequence data
Zhao et al. Applications of asynchronous deep reinforcement learning based on dynamic updating weights
JP7478757B2 (en) Mixture distribution estimation for future prediction
Ramamurthy et al. Leveraging domain knowledge for reinforcement learning using MMC architectures
Stadie et al. Learning intrinsic rewards as a bi-level optimization problem
CN114219066A (en) Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance
Siriwardhana et al. Vusfa: Variational universal successor features approximator to improve transfer drl for target driven visual navigation
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
Nahavandi et al. Machine learning meets advanced robotic manipulation
Paudel Learning for robot decision making under distribution shift: A survey
Namikawa et al. A model for learning to segment temporal sequences, utilizing a mixture of RNN experts together with adaptive variance
Shi et al. Efficient hierarchical policy network with fuzzy rules
Karatzas et al. On autonomous drone navigation using deep learning and an intelligent rainbow DQN agent
Butz et al. REPRISE: A Retrospective and Prospective Inference Scheme.
Hong et al. Dynamics-aware metric embedding: Metric learning in a latent space for visual planning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant