CN113052257A - Deep reinforcement learning method and device based on visual converter - Google Patents

Deep reinforcement learning method and device based on visual converter

Info

Publication number
CN113052257A
Authority
CN
China
Prior art keywords
reinforcement learning
training sample
training
experience
sample images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110393996.7A
Other languages
Chinese (zh)
Other versions
CN113052257B (en)
Inventor
金丹
王昭
龙玉婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Information Science Research Institute filed Critical CETC Information Science Research Institute
Priority to CN202110393996.7A priority Critical patent/CN113052257B/en
Publication of CN113052257A publication Critical patent/CN113052257A/en
Application granted granted Critical
Publication of CN113052257B publication Critical patent/CN113052257B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention belongs to the technical field of artificial intelligence and provides a deep reinforcement learning method and device based on a visual converter. The method comprises the following steps: constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder, and the conversion encoder comprises a multi-head attention layer and a feedforward network; initializing the deep reinforcement learning network weights, and constructing an experience playback pool according to memory capacity; interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool; when the number of samples in the experience playback pool meets a preset value, randomly extracting a batch of training sample images from the experience playback pool, preprocessing the training sample images, and inputting them into the deep reinforcement learning network for training; and obtaining a reinforcement learning model when the deep reinforcement learning network meets the convergence condition. The invention fills the gap in applying visual converters to the field of reinforcement learning, improves the interpretability of the reinforcement learning method, and enables more effective learning and training.

Description

Deep reinforcement learning method and device based on visual converter
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a deep reinforcement learning method and device based on a visual converter.
Background
In recent years, reinforcement learning has become a research focus in the field of machine learning. An agent learns strategies during interaction with the environment so as to maximize reward or achieve a given goal. Combined with deep learning, deep reinforcement learning methods have made breakthroughs in many artificial intelligence tasks, such as game playing, robot control, group decision-making, and autonomous driving.
At present, deep reinforcement learning methods mainly include value-function-based methods, policy-gradient-based methods, and methods based on the Actor-Critic framework. In existing reinforcement learning network frameworks, the adopted network structures are mainly convolutional neural networks and long short-term memory (LSTM) networks. Convolutional neural networks focus on extracting local observation information and are weak at capturing global observation information. LSTM networks have advantages in processing sequence data and can learn and store information over long periods, but as recurrent network structures they cannot be trained in parallel.
The converter (Transformer) is widely applied to natural language processing tasks; the converter architecture avoids recursion, enables parallel computation, and models the global dependency between input and output through a self-attention mechanism. However, the converter has not yet been studied in the field of reinforcement learning. Therefore, there is a need for an improved deep reinforcement learning method based on a visual converter.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art, and provides a deep reinforcement learning method and apparatus based on a visual converter.
In one aspect of the present invention, a deep reinforcement learning method based on a visual converter is provided, the method comprising:
constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder, and the conversion encoder comprises a multi-head attention layer and a feedforward network;
initializing the weight of the deep reinforcement learning network, and constructing an experience playback pool according to the capacity of a memory;
interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool;
when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images;
inputting the preprocessed training sample image into the deep reinforcement learning network for training;
and when the deep reinforcement learning network meets the convergence condition, obtaining a reinforcement learning model.
In some embodiments, interacting with the operating environment through the greedy strategy to generate experience data and put the experience data into the experience playback pool includes:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
In some embodiments, when the number of samples in the experience playback pool satisfies a preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images includes:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
In some embodiments, the inputting the preprocessed training sample image into the deep reinforcement learning network for training includes:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
In another aspect of the present invention, a deep reinforcement learning apparatus based on a visual converter is provided. The apparatus includes a construction module, a data acquisition module, an input module, a training module, and an obtaining module:
the construction module is used for constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder and the conversion encoder comprises a multi-head attention layer and a feedforward network; the construction module also initializes the weights of the deep reinforcement learning network and builds an experience playback pool according to the capacity of the memory;
the data acquisition module is used for interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool;
the input module is used for randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset number of training samples, preprocessing the training sample images, and inputting the preprocessed training sample images into the training module;
the training module is used for training the deep reinforcement learning network by utilizing the preprocessed training sample image;
the obtaining module is used for obtaining the reinforcement learning model when the deep reinforcement learning network meets the convergence condition.
In some embodiments, the data acquisition module is specifically configured to:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
In some embodiments, the input module is specifically configured to:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
In some embodiments, the training module is specifically configured to:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
In another aspect of the present invention, an electronic device is provided, including:
one or more processors;
a storage unit for storing one or more programs which, when executed by the one or more processors, enable the one or more processors to implement the method as described above.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the method according to the above description.
According to the deep reinforcement learning method and device based on the visual converter, introducing the visual converter into the deep reinforcement learning network fills the gap in applying visual converters to the reinforcement learning field, improves the interpretability of the reinforcement learning method, enables more effective learning and training, and makes the method applicable to scenarios using reinforcement learning algorithms, such as games and robot control.
Drawings
FIG. 1 is a block diagram of an electronic device according to an embodiment of the invention;
FIG. 2 is a flowchart illustrating a depth reinforcement learning method based on a visual transformer according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a deep reinforcement learning network based on a visual transformer according to another embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a conversion encoder according to another embodiment of the present invention;
fig. 5 is a schematic structural diagram of a deep reinforcement learning apparatus based on a visual converter according to another embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
First, an example electronic device for implementing the apparatus and method of embodiments of the present invention is described with reference to fig. 1.
As shown in FIG. 1, the electronic device 200 includes one or more processors 210, one or more storage devices 220, one or more input devices 230, one or more output devices 240, and the like, interconnected by a bus system 250 and/or another form of connection mechanism. It should be noted that the components and structures of the electronic device shown in FIG. 1 are exemplary only, not limiting, and the electronic device may have other components and structures as desired.
The processor 210 may be a neural network processor composed of many-core chips, a single central processing unit (CPU), a CPU combined with a multi-core neural network processor array, or another form of processing unit having data processing and/or instruction execution capabilities, and it may control other components in the electronic device 200 to perform desired functions.
The storage device 220 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disks, and flash memory. One or more computer program instructions may be stored on the computer-readable storage media and executed by the processor to implement the client functionality (implemented by the processor) and/or other desired functionality in the embodiments of the invention described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage media.
The input device 230 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 240 may output various information (e.g., images or sounds) to an outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
Next, a depth reinforcement learning method based on a visual converter according to an embodiment of the present invention will be described with reference to fig. 2.
Illustratively, as shown in fig. 2, the present embodiment provides a deep reinforcement learning method S100 based on a visual converter, where the method S100 includes:
s110, constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multilayer perceptron and a conversion encoder, and the conversion encoder comprises a multi-head attention layer and a feedforward network.
Specifically, a state space, an action space, and a reward function can be defined based on the reinforcement learning operating environment, and a deep reinforcement learning network structure based on the visual converter is constructed. The visual converter comprises a multi-layer perceptron and a conversion encoder, as shown in fig. 3. As shown in fig. 4, the conversion encoder includes a multi-head attention layer and a feedforward network.
S120, initializing the weights of the deep reinforcement learning network, and constructing an experience playback pool according to the capacity of the memory.
Specifically, the weights of the deep reinforcement learning network may be initialized, and an experience playback pool may be established according to the capacity of the memory.
S130, interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool.
Specifically, experience data can be generated by interacting with the reinforcement learning operating environment through a greedy strategy during the interaction process, and the experience data are put into the experience playback pool.
S140, when the number of the samples in the experience playback pool meets the preset number of the training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images.
Specifically, when the number of samples in the experience playback pool meets the preset number of training samples, a batch of training sample images may be randomly extracted from the experience playback pool, and then the training sample images may be preprocessed according to actual needs.
It should be noted that the preset number of training samples may be the minimum number of training samples required for performing one training on the deep reinforcement learning network, or may be any number of training samples set according to actual needs, and a person skilled in the art may select the training samples as needed, which is not limited in this embodiment.
S150, inputting the preprocessed training sample image into the deep reinforcement learning network for training.
Specifically, the deep reinforcement learning network may be trained by using the preprocessed training sample image as an input.
S160, obtaining a reinforcement learning model when the deep reinforcement learning network meets the convergence condition.
Specifically, in the process of training the deep reinforcement learning network, when the deep reinforcement learning network meets the convergence condition, the current reinforcement learning model is obtained to be used as the final reinforcement learning model.
By introducing the visual converter into the deep reinforcement learning network, the deep reinforcement learning method based on the visual converter fills the gap in applying visual converters to the reinforcement learning field, improves the interpretability of the reinforcement learning method, enables more effective learning and training, and can be applied to scenarios using reinforcement learning algorithms, such as games and robot control.
Illustratively, interacting with the operating environment through a greedy strategy to generate experience data and put the experience data into the experience playback pool includes:
interacting with the operating environment through an ε-greedy strategy, wherein during interaction the output action is drawn at random from all actions with probability ε and the action with the maximum value is selected with probability 1 − ε; the resulting experience data (s, a, r, s′) are put into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
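The collection loop described above can be sketched as follows. This is a minimal illustration assuming a Gym-style environment interface (env.reset(), env.step()) and a Q-network q_net mapping observations to action values; the names, the pool capacity, and the loop structure are assumptions for exposition, not details taken from the patent.

```python
import random
from collections import deque

import torch

# Experience playback pool built according to memory capacity (the capacity is illustrative).
pool = deque(maxlen=100_000)

def collect_experience(env, q_net, epsilon: float, steps: int) -> None:
    """Interact with the environment via an epsilon-greedy strategy, putting
    (s, a, r, s') experience tuples into the experience playback pool."""
    s = env.reset()
    for _ in range(steps):
        if random.random() < epsilon:
            a = env.action_space.sample()      # random action, with probability epsilon
        else:
            with torch.no_grad():              # highest-valued action, with probability 1 - epsilon
                q = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
            a = int(q.argmax(dim=-1))
        s_next, r, done, _ = env.step(a)
        pool.append((s, a, r, s_next))         # experience data (s, a, r, s')
        s = env.reset() if done else s_next
```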
Illustratively, when the number of samples in the experience playback pool satisfies the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool and preprocessing the training sample images includes:
When the number of samples in the experience playback pool meets the preset training sample number m, batch-size training sample images are randomly extracted from the experience playback pool and preprocessed. Each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P².
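As a worked illustration of this preprocessing (the frame size is an assumption for exposition, not a value from the patent): an 84 × 84 input frame with patch size P = 7 is divided into N = 84 × 84 / 7² = 144 patches. A minimal PyTorch sketch of the splitting, assuming a (C, H, W) tensor with H and W divisible by P (the function name and memory layout are choices made here):

```python
import torch

def to_patches(img: torch.Tensor, p: int) -> torch.Tensor:
    """Split a (C, H, W) image into N = H*W/p^2 flattened patches of length C*p*p."""
    c, h, w = img.shape
    patches = img.unfold(1, p, p).unfold(2, p, p)   # (C, H//p, W//p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4)        # (H//p, W//p, C, p, p)
    return patches.reshape(-1, c * p * p)           # (N, C*p*p)
```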
Using a linear projection matrix, each patch X in the input images at moments t−2, t−1, and t is flattened to obtain a mapped D-dimensional vector X_1 = Embedding(X); the specific operation is a fully connected layer (torch.nn.Linear) in the PyTorch deep learning framework. Position embedding and sequence embedding are added to the D-dimensional vector X_1 to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding, implemented as module parameters (torch.nn.Parameter).
The state-action value placeholder QvalueToken, a learnable parameter, is spliced with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken); the specific operation uses the torch.nn.Identity function in the PyTorch deep learning framework. The processed data are then input into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer (i.e., the multi-head attention layer), and W_Q, W_K, W_V are the network weights of the linear mappings.
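Putting these formulas together, the forward pass could be sketched in PyTorch as below. The module mirrors the equations above (patch embedding, position and sequence encodings, a QvalueToken, one conversion-encoder block of multi-head attention plus feedforward, and an MLP head); all dimensions, names, and the exact attention internals are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class ViTQNetwork(nn.Module):
    """Sketch of the visual-converter Q-network described above. num_patches is
    assumed to cover the patches of the frames at t-2, t-1 and t (3 * N)."""

    def __init__(self, patch_dim: int, num_patches: int, d_model: int,
                 n_heads: int, n_actions: int):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)   # X_1 = Embedding(X) via torch.nn.Linear
        self.pos_enc = nn.Parameter(torch.zeros(num_patches, d_model))  # PositionEncoding
        self.seq_enc = nn.Parameter(torch.zeros(num_patches, d_model))  # SequenceEncoding
        self.q_token = nn.Parameter(torch.zeros(1, 1, d_model))         # QvalueToken
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                    # two linear mappings + activation
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.mlp = nn.Linear(d_model, n_actions)     # MLP head producing action values

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim)
        x1 = self.embed(patches)
        x2 = x1 + self.pos_enc + self.seq_enc        # X_2 = X_1 + PositionEncoding + SequenceEncoding
        token = self.q_token.expand(x2.size(0), -1, -1)
        x3 = torch.cat([x2, token], dim=1)           # X_3 = Concat(X_2, QvalueToken)
        attn_out, _ = self.attn(x3, x3, x3)          # SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)
        x_attention = self.norm1(x3 + attn_out)      # X_attention
        x_hidden = self.norm2(x_attention + self.ffn(x_attention))  # X_hidden
        return self.mlp(x_hidden[:, -1])             # X_output read at the QvalueToken position
```

Here the Q values are read from the QvalueToken position, matching the role of the state-action value placeholder in the formulas above.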
Through the attention mechanism of the visual converter, the deep reinforcement learning method of this embodiment can further improve the interpretability of the reinforcement learning method and learn useful global observation information while extracting local observation information, thereby capturing global information better. In addition, by using the sequence encoding of the visual converter, this embodiment enables the deep reinforcement learning network to utilize observation information from past moments, so that learning and training can be performed more effectively.
Illustratively, the inputting the preprocessed training sample image into the deep reinforcement learning network for training includes:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
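A training step consistent with this loss could be sketched as follows; the optimizer, the batch layout, and the target-network synchronization schedule are assumptions filled in for illustration.

```python
import random

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, pool, batch_size: int, gamma: float) -> float:
    """One gradient step on L = E[(r + gamma * max_a' Q(s',a';theta-) - Q(s,a;theta))^2]."""
    batch = random.sample(list(pool), batch_size)
    s, a, r, s_next = zip(*batch)
    s = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in s])
    s_next = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in s_next])
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Q(s, a; theta)
    with torch.no_grad():                                          # target uses theta-
        target = r + gamma * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)                                # mean square error loss L

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                               # theta updated with learning rate alpha
    return loss.item()

# Periodically synchronize the target-value network with the current-value network:
# target_net.load_state_dict(q_net.state_dict())
```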
The deep reinforcement learning method based on the visual converter of this embodiment can train the deep reinforcement learning network in parallel, thereby increasing its convergence speed.
In another aspect of the invention, a deep reinforcement learning device based on a visual converter is provided.
Illustratively, as shown in fig. 5, the present embodiment provides a deep reinforcement learning apparatus 100 based on a visual converter. The apparatus 100 includes a construction module 110, a data acquisition module 120, an input module 130, a training module 140, and an obtaining module 150. The apparatus 100 can be applied to the methods described above; for details not mentioned below, refer to the related descriptions above, which are not repeated here.
The construction module 110 is configured to construct a deep reinforcement learning network structure based on a visual converter and define a state space, an action space, and a reward function, wherein the visual converter includes a multi-layer perceptron and a conversion encoder and the conversion encoder includes a multi-head attention layer and a feedforward network; the module also initializes the weights of the deep reinforcement learning network and builds an experience playback pool according to the capacity of the memory;
the data acquisition module 120 is configured to interact with the operating environment through a greedy strategy to generate experience data and place the experience data into the experience playback pool;
the input module 130 is configured to randomly extract a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets a preset number of training samples, pre-process the training sample images, and input the pre-processed training sample images into the training module 140;
the training module 140 is configured to train the deep reinforcement learning network by using the preprocessed training sample image;
the obtaining module 150 is configured to obtain a reinforcement learning model when the deep reinforcement learning network satisfies a convergence condition.
By introducing the visual converter into the deep reinforcement learning network, the deep reinforcement learning device based on the visual converter of this embodiment fills the gap in applying visual converters to the reinforcement learning field, improves the interpretability of the reinforcement learning method, enables more effective learning and training, and can be applied to scenarios using reinforcement learning algorithms, such as games and robot control.
Illustratively, the data acquisition module 120 is specifically configured to:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
Illustratively, the input module 130 is specifically configured to:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer (i.e., the multi-head attention layer), and W_Q, W_K, W_V are the network weights of the linear mappings.
Through the attention mechanism of the visual converter, the deep reinforcement learning device of this embodiment can further improve the interpretability of the reinforcement learning method and learn useful global observation information while extracting local observation information, thereby capturing global information better. In addition, by using the sequence encoding of the visual converter, this embodiment enables the deep reinforcement learning network to utilize observation information from past moments, so that learning and training can be performed more effectively.
Illustratively, the training module 140 is specifically configured to:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
The deep reinforcement learning device based on the visual converter of this embodiment can train the deep reinforcement learning network in parallel, thereby increasing its convergence speed.
In another aspect of the present invention, an electronic device is provided, including:
one or more processors;
a storage unit for storing one or more programs which, when executed by the one or more processors, enable the one or more processors to implement the method according to the preceding description.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the method according to the above description.
The computer readable storage medium may be included in the apparatus or device of the present invention, or may exist separately.
The computer-readable storage medium may be any tangible medium that can contain or store a program, and may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples include, but are not limited to: a portable computer diskette, a hard disk, an optical fiber, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer-readable storage medium may also include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein; such a propagated signal may take any suitable form capable of carrying the program code.
It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (10)

1. A deep reinforcement learning method based on a visual converter, characterized by comprising the following steps:
constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder, and the conversion encoder comprises a multi-head attention layer and a feedforward network;
initializing the weight of the deep reinforcement learning network, and constructing an experience playback pool according to the capacity of a memory;
interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool;
when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images;
inputting the preprocessed training sample image into the deep reinforcement learning network for training;
and when the deep reinforcement learning network meets the convergence condition, obtaining a reinforcement learning model.
2. The method of claim 1, wherein interacting with the operating environment through a greedy strategy to generate experience data and place the experience data into the experience playback pool comprises:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
3. The method according to claim 2, wherein, when the number of samples in the experience playback pool meets a preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool and preprocessing the training sample images comprises:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
4. The method according to claim 3, wherein the inputting the preprocessed training sample images into the deep reinforcement learning network for training comprises:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
5. A deep reinforcement learning device based on a visual converter, characterized by comprising a construction module, a data acquisition module, an input module, a training module, and an obtaining module:
the construction module is used for constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder and the conversion encoder comprises a multi-head attention layer and a feedforward network; the construction module also initializes the weights of the deep reinforcement learning network and builds an experience playback pool according to the capacity of the memory;
the data acquisition module is used for interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool;
the input module is used for randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset number of training samples, preprocessing the training sample images, and inputting the preprocessed training sample images into the training module;
the training module is used for training the deep reinforcement learning network by utilizing the preprocessed training sample image;
the obtaining module is used for obtaining the reinforcement learning model when the deep reinforcement learning network meets the convergence condition.
6. The apparatus of claim 5, wherein the data acquisition module is specifically configured to:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
7. The apparatus of claim 6, wherein the input module is specifically configured to:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
8. The apparatus of claim 7, wherein the training module is specifically configured to:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage unit to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-4.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is able to carry out a method according to any one of claims 1 to 4.
CN202110393996.7A 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on visual transducer Active CN113052257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110393996.7A CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on visual transducer


Publications (2)

Publication Number Publication Date
CN113052257A true CN113052257A (en) 2021-06-29
CN113052257B CN113052257B (en) 2024-04-16

Family

ID=76519168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110393996.7A Active CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on visual transducer

Country Status (1)

Country Link
CN (1) CN113052257B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150201895A1 (en) * 2012-08-31 2015-07-23 The University Of Chicago Supervised machine learning technique for reduction of radiation dose in computed tomography imaging
US20190258671A1 (en) * 2016-10-28 2019-08-22 Vilynx, Inc. Video Tagging System and Method
CN110945495A (en) * 2017-05-18 2020-03-31 易享信息技术有限公司 Conversion of natural language queries to database queries based on neural networks
US20190124348A1 (en) * 2017-10-19 2019-04-25 Samsung Electronics Co., Ltd. Image encoder using machine learning and data processing method of the image encoder
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction
CN109241552A (en) * 2018-07-12 2019-01-18 哈尔滨工程大学 A kind of underwater robot motion planning method based on multiple constraint target
CN110286161A (en) * 2019-03-28 2019-09-27 清华大学 Main transformer method for diagnosing faults based on adaptive enhancing study
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
US20200379461A1 (en) * 2019-05-29 2020-12-03 Argo AI, LLC Methods and systems for trajectory forecasting with recurrent neural networks using inertial behavioral rollout
US20210073995A1 (en) * 2019-09-11 2021-03-11 Nvidia Corporation Training strategy search using reinforcement learning
CN111126282A (en) * 2019-12-25 2020-05-08 中国矿业大学 Remote sensing image content description method based on variation self-attention reinforcement learning
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN111597830A (en) * 2020-05-20 2020-08-28 腾讯科技(深圳)有限公司 Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111666500A (en) * 2020-06-08 2020-09-15 腾讯科技(深圳)有限公司 Training method of text classification model and related equipment
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model
CN112084314A (en) * 2020-08-20 2020-12-15 电子科技大学 Knowledge-introducing generating type session system
CN112261725A (en) * 2020-10-23 2021-01-22 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning
CN112488306A (en) * 2020-12-22 2021-03-12 中国电子科技集团公司信息科学研究院 Neural network compression method and device, electronic equipment and storage medium

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
DOSOVITSKIY, ALEXEY, et al.: "An image is worth 16x16 words: Transformers for image recognition at scale", arXiv, pages 1-4 *
HAOYI ZHOU et al.: "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting", arXiv, 28 March 2021, pages 1-15 *
IKER PENG: "Applications of reinforcement learning in vision (RL for Computer Vision)", pages 1-9, retrieved from the Internet <URL: https://zhuanlan.zhihu.com/p/51202503> *
J. KULHÁNEK et al.: "Visual Navigation in Real-World Indoor Environments Using End-to-End Deep Reinforcement Learning", IEEE Robotics and Automation Letters, vol. 6, no. 3, 23 March 2021, pages 4345-4352 *
人工智能学术前沿 (AI Academic Frontiers): "Deep Transformer models for time-series forecasting: the influenza prevalence case", pages 1-6, retrieved from the Internet <URL: https://zhuanlan.zhihu.com/p/151423371> *
李峰 (Li Feng): "Must-read classic papers on deep reinforcement learning: DQN, DDQN, Prioritized, Dueling, Rainbow", pages 1-2, retrieved from the Internet <URL: https://zhuanlan.zhihu.com/p/337553995> *
李飞雨 (Li Feiyu): "Research on Chinese-Korean machine translation based on reinforcement learning and machine translation quality estimation", Application Research of Computers *
郝燕龙 (Hao Yanlong): "Research on image captioning models based on dense convolutional neural network feature extraction", China Master's Theses Full-text Database, Information Science & Technology, no. 9, 15 September 2019, pages 138-1158 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469119A (en) * 2021-07-20 2021-10-01 合肥工业大学 Cervical cell image classification method based on visual converter and graph convolution network
CN113469119B (en) * 2021-07-20 2022-10-04 合肥工业大学 Cervical cell image classification method based on visual converter and image convolution network
CN115147669A (en) * 2022-06-24 2022-10-04 北京百度网讯科技有限公司 Image processing method, training method and equipment based on visual converter model
CN115147669B (en) * 2022-06-24 2023-04-18 北京百度网讯科技有限公司 Image processing method, training method and equipment based on visual converter model

Also Published As

Publication number Publication date
CN113052257B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
US20210390653A1 (en) Learning robotic tasks using one or more neural networks
US11373087B2 (en) Method and apparatus for generating fixed-point type neural network
CN109464803B (en) Virtual object control method, virtual object control device, model training device, storage medium and equipment
KR102387570B1 (en) Method and apparatus of generating facial expression and learning method for generating facial expression
WO2019155064A1 (en) Data compression using jointly trained encoder, decoder, and prior neural networks
CN110476173B (en) Hierarchical device placement with reinforcement learning
CN110795549B (en) Short text conversation method, device, equipment and storage medium
CN111292262B (en) Image processing method, device, electronic equipment and storage medium
CN113052257A (en) Deep reinforcement learning method and device based on visual converter
US11776269B2 (en) Action classification in video clips using attention-based neural networks
US11501168B2 (en) Learning longer-term dependencies in neural network using auxiliary losses
CN111340190A (en) Method and device for constructing network structure, and image generation method and device
KR20200076461A (en) Method and apparatus for processing neural network based on nested bit representation
CN109740012B (en) Method for understanding and asking and answering image semantics based on deep neural network
JP2020123345A (en) Learning method and learning device for generating training data acquired from virtual data on virtual world by using generative adversarial network (gan), to thereby reduce annotation cost required in learning processes of neural network for autonomous driving, and testing method and testing device using the same
CN112216307A (en) Speech emotion recognition method and device
CN116912629B (en) General image text description generation method and related device based on multi-task learning
CN116188621A (en) Text supervision-based bidirectional data stream generation countermeasure network image generation method
KR102597184B1 (en) Knowledge distillation method and system specialized for lightweight pruning-based deep neural networks
WO2022127603A1 (en) Model processing method and related device
CN116266376A (en) Rendering method and device
CN113039555B (en) Method, system and storage medium for classifying actions in video clips
US20220189171A1 (en) Apparatus and method for prediction of video frame based on deep learning
CN113011555B (en) Data processing method, device, equipment and storage medium
US20240096071A1 (en) Video processing method using transfer learning and pre-training server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant