CN113052257B - Deep reinforcement learning method and device based on Vision Transformer - Google Patents

Deep reinforcement learning method and device based on Vision Transformer

Info

Publication number
CN113052257B
CN113052257B (Application CN202110393996.7A)
Authority
CN
China
Prior art keywords
reinforcement learning
training sample
experience
sample images
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110393996.7A
Other languages
Chinese (zh)
Other versions
CN113052257A (en)
Inventor
金丹
王昭
龙玉婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Information Science Research Institute filed Critical CETC Information Science Research Institute
Priority to CN202110393996.7A priority Critical patent/CN113052257B/en
Publication of CN113052257A publication Critical patent/CN113052257A/en
Application granted granted Critical
Publication of CN113052257B publication Critical patent/CN113052257B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence and provides a deep reinforcement learning method and device based on a Vision Transformer. The method comprises the following steps: constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network; initializing the deep reinforcement learning network weights, and constructing an experience playback pool according to the memory capacity; interacting with the running environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool; when the number of samples in the experience playback pool meets a preset value, randomly extracting a batch of training sample images from the experience playback pool, preprocessing the training sample images, and inputting them into the deep reinforcement learning network for training; and when the deep reinforcement learning network meets the convergence condition, obtaining the reinforcement learning model. The invention fills the gap in applying Vision Transformers to the reinforcement learning field, improves the interpretability of the reinforcement learning method, and enables learning and training to be carried out more effectively.

Description

Deep reinforcement learning method and device based on Vision Transformer
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a deep reinforcement learning method and device based on a Vision Transformer.
Background
In recent years, reinforcement learning has become a research hotspot in the field of machine learning. An agent learns a strategy during interaction with the environment so as to maximize the return or achieve a specific goal. By combining with deep learning methods, deep reinforcement learning has made breakthroughs in many artificial intelligence tasks, such as game playing, robot control, group decision-making, and autonomous driving.
Currently, deep reinforcement learning methods mainly comprise value-function-based methods, policy-gradient-based methods, and methods based on the Actor-Critic framework. In existing reinforcement learning network frameworks, the adopted network structures are mainly convolutional neural networks and long short-term memory networks. Convolutional neural networks focus on extracting local observation information and are weak at capturing global observation information. Long short-term memory networks have the advantage of processing sequence data and can learn and store information over long horizons, but as recurrent network structures they cannot be trained in parallel.
Transformers are widely used in natural language processing tasks; their architecture avoids recursion, enables parallel computation, and models the global dependencies between input and output through a self-attention mechanism. However, the Transformer has not yet been studied in the reinforcement learning field. Accordingly, there is a need for an improved Vision Transformer-based deep reinforcement learning method.
Disclosure of Invention
The invention aims to solve at least one of the technical problems in the prior art and provides a deep reinforcement learning method and device based on a Vision Transformer.
In one aspect of the present invention, there is provided a deep reinforcement learning method based on a Vision Transformer, the method comprising:
constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network;
initializing the weights of the deep reinforcement learning network, and constructing an experience playback pool according to the memory capacity;
interacting with the running environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool;
when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images;
inputting the preprocessed training sample images into the deep reinforcement learning network for training;
and when the deep reinforcement learning network meets the convergence condition, obtaining a reinforcement learning model.
In some embodiments, the interacting with the running environment through a greedy strategy to generate experience data and place it into the experience playback pool includes:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time.
In some embodiments, when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool and preprocessing the training sample images includes:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
In some embodiments, inputting the preprocessed training sample images into the deep reinforcement learning network for training comprises:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
In another aspect of the invention, a deep reinforcement learning device based on a Vision Transformer is provided, the device comprising a construction module, a data acquisition module, an input module, a training module and an acquisition module:
the construction module is used for constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network; the construction module also initializes the weights of the deep reinforcement learning network and constructs an experience playback pool according to the memory capacity;
the data acquisition module is used for interacting with the operation environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool;
the input module is used for randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset training sample number, preprocessing the training sample images and inputting the preprocessed training sample images into the training module;
the training module is used for training the deep reinforcement learning network by utilizing the preprocessed training sample image;
the obtaining module is used for obtaining the reinforcement learning model when the deep reinforcement learning network meets the convergence condition.
In some embodiments, the data acquisition module is specifically configured to:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time.
In some embodiments, the input module is specifically configured to:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
In some embodiments, the training module is specifically configured to:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
In another aspect of the present invention, there is provided an electronic apparatus including:
one or more processors;
and a storage unit configured to store one or more programs that, when executed by the one or more processors, enable the one or more processors to implement the method described above.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, enables the implementation of a method according to the preceding description.
According to the Vision Transformer-based deep reinforcement learning method and device, introducing the Vision Transformer into the deep reinforcement learning network fills the gap in applying Vision Transformers to the reinforcement learning field, improves the interpretability of the reinforcement learning method, and enables learning and training to be carried out more effectively; the method and device can be applied to scenarios that use reinforcement learning algorithms, such as games and robot control.
Drawings
FIG. 1 is a block diagram schematically illustrating the composition of an electronic device according to an embodiment of the present invention;
FIG. 2 is a flow chart of a deep reinforcement learning method based on a Vision Transformer according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a deep reinforcement learning network based on a Vision Transformer according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of a Transformer encoder according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a deep reinforcement learning device based on a Vision Transformer according to another embodiment of the invention.
Detailed Description
In order that those skilled in the art may better understand the technical solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
First, an example electronic device for implementing the apparatus and method of embodiments of the present invention is described with reference to fig. 1.
As shown in fig. 1, electronic device 200 includes one or more processors 210, one or more storage devices 220, one or more input devices 230, one or more output devices 240, etc., interconnected by a bus system 250 and/or other forms of connection mechanisms. It should be noted that the components and structures of the electronic device shown in fig. 1 are exemplary only and not limiting, as the electronic device may have other components and structures as desired.
Processor 210 may be a neural network processor comprised of chips of a multi (many) core architecture, may be a separate Central Processing Unit (CPU), or may be a central processing unit + multi-core neural network processor array or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in electronic device 200 to perform desired functions.
The storage 220 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer readable storage medium that can be executed by a processor to perform client functions and/or other desired functions in embodiments of the present invention described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer readable storage medium.
The input device 230 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 240 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
Next, a Vision Transformer-based deep reinforcement learning method according to an embodiment of the present invention will be described with reference to fig. 2.
As illustrated in fig. 2, the present embodiment provides a deep reinforcement learning method S100 based on a Vision Transformer, where the method S100 includes:
S110, constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network.
Specifically, a state space, an action space and a reward function can be defined based on the reinforcement learning running environment, and a deep reinforcement learning network structure based on the Vision Transformer can be constructed. As shown in fig. 3, the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder. As shown in fig. 4, the Transformer encoder includes a multi-head attention layer and a feed-forward network.
S120, initializing the weight of the deep reinforcement learning network, and constructing an experience playback pool according to the capacity of the memory.
Specifically, each weight of the deep reinforcement learning network can be initialized, and an experience playback pool is built according to the capacity of the memory.
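For illustration only (not part of the patent text), the experience playback pool described here can be sketched in Python roughly as follows; the class name ReplayPool, its method names and the deque-backed storage are hypothetical choices, with the capacity corresponding to the memory capacity mentioned above.

```python
import random
from collections import deque

class ReplayPool:
    """Minimal experience playback pool with a fixed memory capacity (illustrative sketch)."""

    def __init__(self, capacity: int):
        # Once the capacity is reached, the oldest transitions are discarded automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        # Store one transition (s, a, r, s') produced by interacting with the environment.
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size: int):
        # Randomly extract a batch of training samples once enough data has been collected.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```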
S130, interacting with the running environment through a greedy strategy, generating experience data and putting the experience data into the experience playback pool.
Specifically, the reinforcement learning operating environment can be interacted through a greedy strategy, experience data are generated in the interaction process, and the experience data are put into an experience playback pool.
And S140, randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset training sample number, and preprocessing the training sample images.
Specifically, when the number of samples in the experience playback pool meets the preset number of training samples, a batch of training sample images can be randomly extracted from the experience playback pool, and then the training sample images are preprocessed according to actual needs.
It should be noted that the preset number of training samples may be the minimum number of training samples required for training the deep reinforcement learning network once, or may be any number of training samples set according to actual needs, and those skilled in the art may select the training samples as required, which is not limited in this embodiment.
S150, inputting the preprocessed training sample image into the deep reinforcement learning network for training.
Specifically, the pre-processed training sample image may be used as an input to train the deep reinforcement learning network.
S160, acquiring a reinforcement learning model when the deep reinforcement learning network meets convergence conditions.
Specifically, in the process of training the deep reinforcement learning network, when the deep reinforcement learning network meets the convergence condition, the current reinforcement learning model is obtained as the final reinforcement learning model.
The Vision Transformer-based deep reinforcement learning method of this embodiment fills the gap in applying Vision Transformers to the reinforcement learning field by introducing the Vision Transformer into the deep reinforcement learning network, improves the interpretability of the reinforcement learning method, enables learning and training to be carried out more effectively, and can be applied to scenarios that use reinforcement learning algorithms, such as games and robot control.
Illustratively, interacting with the running environment through a greedy strategy to generate experience data and place it into the experience playback pool includes:
during interaction, the output action is drawn at random from all actions with probability ε, and the action with the highest value is selected with probability 1−ε, so that experience data (s, a, r, s′) are obtained and put into the experience playback pool, where s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time. A minimal sketch of this interaction step is given below.
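The following sketch assumes a Gym-style environment interface and a Q network that maps an observation to one value per action; all names are illustrative assumptions rather than the patent's implementation.

```python
import random
import torch

def select_action(q_network, state, epsilon: float, num_actions: int) -> int:
    """epsilon-greedy: a random action with probability epsilon,
    otherwise the action with the highest estimated Q value."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))  # shape: (1, num_actions)
    return int(q_values.argmax(dim=1).item())

def interact_once(env, q_network, state, epsilon, replay_pool, num_actions):
    """One interaction step: act, observe (s, a, r, s') and store it in the playback pool."""
    action = select_action(q_network, state, epsilon, num_actions)
    next_state, reward, done, _ = env.step(action)  # Gym-style step() assumed
    replay_pool.push(state, action, reward, next_state)
    return next_state, done
```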
Illustratively, when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool and preprocessing the training sample images includes:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W. Each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P².
Each patch X in the input images at times t-2, t-1 and t is flattened and mapped by a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X); a specific operation is, for example, the fully connected layer torch.nn.Linear in the PyTorch deep learning framework. Position encoding and sequence (timing) encoding are added to the D-dimensional vector X_1 to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding; a specific operation is, for example, implemented with learnable module parameters in the PyTorch deep learning framework.
A learnable state-action-value placeholder QvalueToken is concatenated with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken); a specific operation is, for example, the torch.nn.Identity function in the PyTorch deep learning framework. The processed data are then input into the Vision Transformer, which outputs the action-state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer (i.e. the multi-head attention layer), and W_Q, W_K, W_V are the network weights of the linear mappings.
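To make the formulas above concrete, a minimal PyTorch sketch of such a Vision Transformer Q network is given below. It follows the computation X_1 → X_2 → X_3 → X_attention → X_hidden → X_output with a single post-norm encoder layer; the module names, dimensions, GELU activation and the way the position, sequence and QvalueToken parameters are instantiated are assumptions for illustration, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class VitQNetwork(nn.Module):
    """Illustrative Vision-Transformer Q network.

    Input: a stack of the frames at times t-2, t-1 and t, each split into N flattened P x P patches.
    Output: one Q value per action, read from the QvalueToken position."""

    def __init__(self, num_patches, patch_dim, d_model, num_heads, num_actions, num_frames=3):
        super().__init__()
        self.embedding = nn.Linear(patch_dim, d_model)                 # X_1 = Embedding(X)
        self.position_encoding = nn.Parameter(torch.zeros(1, num_patches, d_model))
        self.sequence_encoding = nn.Parameter(torch.zeros(1, num_frames, 1, d_model))
        self.qvalue_token = nn.Parameter(torch.zeros(1, 1, d_model))   # state-action-value placeholder
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                          nn.Linear(4 * d_model, d_model))
        self.mlp_head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, num_actions))

    def forward(self, patches):
        # patches: (batch, num_frames, num_patches, patch_dim)
        b, f, n, _ = patches.shape
        x1 = self.embedding(patches)                                    # D-dimensional patch vectors
        x2 = x1 + self.position_encoding.unsqueeze(1) + self.sequence_encoding  # X_2
        x2 = x2.reshape(b, f * n, -1)
        x3 = torch.cat([self.qvalue_token.expand(b, -1, -1), x2], dim=1)        # X_3 = Concat
        attn_out, _ = self.attention(x3, x3, x3)
        x_attention = self.norm1(x3 + attn_out)                         # LayerNorm(X_3 + SelfAttention(...))
        x_hidden = self.norm2(x_attention + self.feed_forward(x_attention))
        return self.mlp_head(x_hidden[:, 0])                            # X_output = MLP(X_hidden) at QvalueToken
```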
According to the Vision Transformer-based deep reinforcement learning method of this embodiment, the attention mechanism of the Vision Transformer can further improve the interpretability of the reinforcement learning method, and useful global observation information can be learned while local observation information is extracted, so that global information is better captured. In addition, by using the sequence (timing) encoding of the Vision Transformer, the deep reinforcement learning network can make use of observation information from past time steps, so that learning and training can be carried out more effectively.
Illustratively, inputting the preprocessed training sample images into the deep reinforcement learning network for training includes:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
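A minimal sketch of this training step in PyTorch is shown below, assuming a separate target value network whose parameters θ⁻ are periodically copied from the current value network, and an externally constructed optimizer in which the learning rate α lives; names and tensor layouts are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, optimizer, batch, gamma: float):
    """One update against L = E[(r + gamma * max_a' Q(s', a'; theta-) - Q(s, a; theta))^2]."""
    states, actions, rewards, next_states = batch  # tensors sampled from the experience playback pool

    # Q(s, a; theta) of the current value network for the actions actually taken
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Target value r + gamma * max_a' Q(s', a'; theta-), computed with the frozen target network
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values

    loss = F.mse_loss(q_values, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # theta' <- theta - alpha * grad(L), alpha being the learning rate
    return loss.item()
```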
According to the Vision Transformer-based deep reinforcement learning method of this embodiment, the deep reinforcement learning network can be trained in a parallel manner, which accelerates the convergence of the deep reinforcement learning network.
In another aspect of the invention, a deep reinforcement learning device based on a Vision Transformer is provided.
As illustrated in fig. 5, the present embodiment provides a deep reinforcement learning device 100 based on a Vision Transformer, where the device 100 includes a construction module 110, a data acquisition module 120, an input module 130, a training module 140, and an acquisition module 150. The device 100 may be applied to the method described above; for details not mentioned below, reference may be made to the related description above, which is not repeated here.
The construction module 110 is configured to construct a deep reinforcement learning network structure based on a Vision Transformer and to define a state space, an action space and a reward function, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network; it is further configured to initialize the weights of the deep reinforcement learning network and to construct an experience playback pool according to the memory capacity;
the data collection module 120 is configured to interact with the operation environment through a greedy strategy, generate experience data, and place the experience data in the experience playback pool;
the input module 130 is configured to randomly extract a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets a preset training sample number, perform preprocessing on the training sample images, and input the preprocessed training sample images to the training module 140;
the training module 140 is configured to train the deep reinforcement learning network by using the preprocessed training sample image;
the obtaining module 150 is configured to obtain a reinforcement learning model when the deep reinforcement learning network meets a convergence condition.
The Vision Transformer-based deep reinforcement learning device of this embodiment fills the gap in applying Vision Transformers to the reinforcement learning field by introducing the Vision Transformer into the deep reinforcement learning network, improves the interpretability of the reinforcement learning method, enables learning and training to be carried out more effectively, and can be applied to scenarios that use reinforcement learning algorithms, such as games and robot control.
Illustratively, the data acquisition module 120 is specifically configured to:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time.
Illustratively, the input module 130 is specifically configured to:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer (i.e. the multi-head attention layer), and W_Q, W_K, W_V are the network weights of the linear mappings.
The Vision Transformer-based deep reinforcement learning device of this embodiment can further improve the interpretability of the reinforcement learning method through the attention mechanism of the Vision Transformer, and can learn useful global observation information while extracting local observation information, so that global information is better captured. In addition, by using the sequence (timing) encoding of the Vision Transformer, the deep reinforcement learning network can make use of observation information from past time steps, so that learning and training can be carried out more effectively.
Illustratively, the training module 140 is specifically configured to:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
The Vision Transformer-based deep reinforcement learning device of this embodiment can train the deep reinforcement learning network in a parallel manner, which accelerates the convergence of the deep reinforcement learning network.
In another aspect of the present invention, there is provided an electronic apparatus including:
one or more processors;
and a storage unit configured to store one or more programs that, when executed by the one or more processors, enable the one or more processors to implement the method according to the foregoing description.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, enables the implementation of a method according to the preceding description.
The computer readable storage medium may be included in the apparatus or device of the present invention or may exist alone.
A computer-readable storage medium may be any tangible medium that can contain or store a program, and may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device; more specific examples include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, an optical fiber, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
The computer-readable storage medium may also include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein, specific examples of which include, but are not limited to, electromagnetic signals, optical signals, or any suitable combination thereof.
It is to be understood that the above embodiments are merely illustrative of the application of the principles of the present invention, but not in limitation thereof. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and are also considered to be within the scope of the invention.

Claims (6)

1. A deep reinforcement learning method based on a Vision Transformer, the method comprising:
constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network;
initializing the weights of the deep reinforcement learning network, and constructing an experience playback pool according to the memory capacity;
interacting with the running environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool;
when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images;
inputting the preprocessed training sample image into the deep reinforcement learning network for training;
when the deep reinforcement learning network meets convergence conditions, acquiring a reinforcement learning model;
the interacting with the running environment through the greedy strategy to generate experience data and place the experience data into the experience playback pool comprises the following steps:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time;
when the number of samples in the experience playback pool meets the preset training sample number, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images, wherein the method comprises the following steps:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_output = MLP(X_hidden),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
2. The method of claim 1, wherein inputting the preprocessed training sample image into the deep reinforcement learning network for training comprises:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
3. A deep reinforcement learning device based on a Vision Transformer, characterized in that the device comprises a construction module, a data acquisition module, an input module, a training module and an acquisition module:
the construction module is used for constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network; the construction module also initializes the weights of the deep reinforcement learning network and constructs an experience playback pool according to the memory capacity;
the data acquisition module is used for interacting with the operation environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool;
the input module is used for randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset training sample number, preprocessing the training sample images and inputting the preprocessed training sample images into the training module;
the training module is used for training the deep reinforcement learning network by utilizing the preprocessed training sample image;
the acquisition module is used for acquiring a reinforcement learning model when the deep reinforcement learning network meets convergence conditions;
the data acquisition module is specifically used for:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time;
the input module is specifically used for:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_output = MLP(X_hidden),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
4. The apparatus of claim 3, wherein the training module is specifically configured to:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
5. An electronic device, the electronic device comprising:
one or more processors;
a storage unit for storing one or more programs, which when executed by the one or more processors, enable the one or more processors to implement the method of claim 1 or 2.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, is capable of realizing the method according to claim 1 or 2.
CN202110393996.7A 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on Vision Transformer Active CN113052257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110393996.7A CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on Vision Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110393996.7A CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on Vision Transformer

Publications (2)

Publication Number Publication Date
CN113052257A CN113052257A (en) 2021-06-29
CN113052257B true CN113052257B (en) 2024-04-16

Family

ID=76519168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110393996.7A Active CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on Vision Transformer

Country Status (1)

Country Link
CN (1) CN113052257B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469119B (en) * 2021-07-20 2022-10-04 合肥工业大学 Cervical cell image classification method based on Vision Transformer and image convolution network
CN115147669B (en) * 2022-06-24 2023-04-18 北京百度网讯科技有限公司 Image processing method, training method and equipment based on Vision Transformer model
CN118003329B (en) * 2024-03-18 2024-09-06 复旦大学 Visual reinforcement learning test time adaptation method applied to mechanical arm control
CN118233312B (en) * 2024-03-20 2024-09-17 同济大学 Adaptive broadband resource allocation method combining deep reinforcement learning and Transformer

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction
CN109241552A (en) * 2018-07-12 2019-01-18 哈尔滨工程大学 A kind of underwater robot motion planning method based on multiple constraint target
CN110286161A (en) * 2019-03-28 2019-09-27 清华大学 Main transformer method for diagnosing faults based on adaptive enhancing study
CN110945495A (en) * 2017-05-18 2020-03-31 易享信息技术有限公司 Conversion of natural language queries to database queries based on neural networks
CN111126282A (en) * 2019-12-25 2020-05-08 中国矿业大学 Remote sensing image content description method based on variation self-attention reinforcement learning
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN111597830A (en) * 2020-05-20 2020-08-28 腾讯科技(深圳)有限公司 Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111666500A (en) * 2020-06-08 2020-09-15 腾讯科技(深圳)有限公司 Training method of text classification model and related equipment
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
CN112084314A (en) * 2020-08-20 2020-12-15 电子科技大学 Knowledge-introducing generating type session system
CN112261725A (en) * 2020-10-23 2021-01-22 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning
CN112488306A (en) * 2020-12-22 2021-03-12 中国电子科技集团公司信息科学研究院 Neural network compression method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2890300B1 (en) * 2012-08-31 2019-01-02 Kenji Suzuki Supervised machine learning technique for reduction of radiation dose in computed tomography imaging
US11256741B2 (en) * 2016-10-28 2022-02-22 Vertex Capital Llc Video tagging system and method
KR102535361B1 (en) * 2017-10-19 2023-05-24 삼성전자주식회사 Image encoder using machine learning and data processing method thereof
US11131993B2 (en) * 2019-05-29 2021-09-28 Argo AI, LLC Methods and systems for trajectory forecasting with recurrent neural networks using inertial behavioral rollout
US11100643B2 (en) * 2019-09-11 2021-08-24 Nvidia Corporation Training strategy search using reinforcement learning

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110945495A (en) * 2017-05-18 2020-03-31 易享信息技术有限公司 Conversion of natural language queries to database queries based on neural networks
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction
CN109241552A (en) * 2018-07-12 2019-01-18 哈尔滨工程大学 A kind of underwater robot motion planning method based on multiple constraint target
CN110286161A (en) * 2019-03-28 2019-09-27 清华大学 Main transformer method for diagnosing faults based on adaptive enhancing study
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
CN111126282A (en) * 2019-12-25 2020-05-08 中国矿业大学 Remote sensing image content description method based on variation self-attention reinforcement learning
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN111597830A (en) * 2020-05-20 2020-08-28 腾讯科技(深圳)有限公司 Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111666500A (en) * 2020-06-08 2020-09-15 腾讯科技(深圳)有限公司 Training method of text classification model and related equipment
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model
CN112084314A (en) * 2020-08-20 2020-12-15 电子科技大学 Knowledge-introducing generating type session system
CN112261725A (en) * 2020-10-23 2021-01-22 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning
CN112488306A (en) * 2020-12-22 2021-03-12 中国电子科技集团公司信息科学研究院 Neural network compression method and device, electronic equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Visual Navigation in Real-World Indoor Environments Using End-to-End Deep Reinforcement Learning";J. Kulhánek等;《IEEE Robotics and Automation Letters》;20210323;第6卷(第3期);第4345-4352页 *
"基于密集卷积神经网络特征提取的图像描述模型研究";郝燕龙;《中国优秀硕士学位论文全文数据库 信息科技辑》;20190915(第9期);I138-1158 *
"基于强化学习和机器翻译质量评估的中朝机器翻译研究";李飞雨;《计算机应用研究》;摘要 *
Dosovitskiy, Alexey, et al."An image is worth 16x16 words: Transformers for image recognition at scale".《arXiv》.2020,第1-4节. *
Haoyi Zhou,等."Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting".《arxiv》.2021,第1-15页. *

Also Published As

Publication number Publication date
CN113052257A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN113052257B (en) Deep reinforcement learning method and device based on Vision Transformer
US20210390653A1 (en) Learning robotic tasks using one or more neural networks
CN113039555B (en) Method, system and storage medium for classifying actions in video clips
CN110796111B (en) Image processing method, device, equipment and storage medium
CN110476173B (en) Hierarchical device placement with reinforcement learning
CN112529146B (en) Neural network model training method and device
CN117218498B (en) Multi-modal large language model training method and system based on multi-modal encoder
CN113762461B (en) Training neural networks using reversible boost operators with limited data
CN112840359B (en) Controlling agents on a long time scale by using time value transfer
CN110795549B (en) Short text conversation method, device, equipment and storage medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN111340190A (en) Method and device for constructing network structure, and image generation method and device
WO2022242127A1 (en) Image feature extraction method and apparatus, and electronic device and storage medium
CN111282272B (en) Information processing method, computer readable medium and electronic device
CN109858046A (en) Using auxiliary loss come the long-rang dependence in learning neural network
WO2021169366A1 (en) Data enhancement method and apparatus
CN117456587A (en) Multi-mode information control-based speaker face video generation method and device
CN113554040B (en) Image description method and device based on condition generation countermeasure network
CN117708698A (en) Class determination method, device, equipment and storage medium
CN116266376A (en) Rendering method and device
CN116665114A (en) Multi-mode-based remote sensing scene identification method, system and medium
CN108376283B (en) Pooling device and pooling method for neural network
CN116095183A (en) Data compression method and related equipment
Zhong et al. Disentangling controllable object through video prediction improves visual reinforcement learning
CN117011403A (en) Method and device for generating image data, training method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant