CN113052257A - Deep reinforcement learning method and device based on visual converter - Google Patents

Deep reinforcement learning method and device based on visual converter

Info

Publication number
CN113052257A
Authority
CN
China
Prior art keywords
reinforcement learning
training sample
training
experience
sample images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110393996.7A
Other languages
Chinese (zh)
Other versions
CN113052257B (en)
Inventor
金丹
王昭
龙玉婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Information Science Research Institute filed Critical CETC Information Science Research Institute
Priority to CN202110393996.7A priority Critical patent/CN113052257B/en
Publication of CN113052257A publication Critical patent/CN113052257A/en
Application granted granted Critical
Publication of CN113052257B publication Critical patent/CN113052257B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention belongs to the technical field of artificial intelligence and provides a deep reinforcement learning method and device based on a visual converter. The method comprises the following steps: constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder, and the conversion encoder comprises a multi-head attention layer and a feedforward network; initializing the deep reinforcement learning network weights, and constructing an experience playback pool according to memory capacity; interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool; when the number of samples in the experience playback pool meets a preset value, randomly extracting a batch of training sample images from the experience playback pool, preprocessing the training sample images, and inputting them into the deep reinforcement learning network for training; and obtaining a reinforcement learning model when the deep reinforcement learning network meets the convergence condition. The invention fills the gap in applying visual converters to the field of reinforcement learning, improves the interpretability of the reinforcement learning method, and enables more effective learning and training.

Description

Deep reinforcement learning method and device based on visual converter
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a deep reinforcement learning method and device based on a visual converter.
Background
In recent years, reinforcement learning has become a research focus in the field of machine learning. An agent learns strategies during interaction with the environment so as to maximize reward or achieve a given goal. Combined with deep learning, deep reinforcement learning methods have made breakthroughs in many artificial intelligence tasks, such as game playing, robot control, group decision-making, and autonomous driving.
At present, deep reinforcement learning methods mainly include value-function-based methods, policy-gradient-based methods, and methods based on the Actor-Critic framework. In existing reinforcement learning network frameworks, the adopted network structures are mainly convolutional neural networks and long short-term memory (LSTM) networks. Convolutional neural networks focus on extracting local observation information and are weak at capturing global observation information. LSTM networks have advantages in processing sequence data and can learn and store information over long periods, but as recurrent network structures they cannot be trained in parallel.
The converter (Transformer) is widely applied to natural language processing tasks; the converter architecture avoids recursion, enables parallel computation, and models the global dependency between input and output through a self-attention mechanism. However, the converter has not yet been studied in the field of reinforcement learning. Therefore, there is a need for an improved deep reinforcement learning method based on a visual converter.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art, and provides a deep reinforcement learning method and apparatus based on a visual converter.
In one aspect of the present invention, a deep reinforcement learning method based on a visual converter is provided, the method comprising:
constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder, and the conversion encoder comprises a multi-head attention layer and a feedforward network;
initializing the weight of the deep reinforcement learning network, and constructing an experience playback pool according to the capacity of a memory;
interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool;
when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images;
inputting the preprocessed training sample image into the deep reinforcement learning network for training;
and when the deep reinforcement learning network meets the convergence condition, obtaining a reinforcement learning model.
In some embodiments, interacting with the operating environment through the greedy strategy to generate experience data and put the experience data into the experience playback pool includes:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
In some embodiments, when the number of samples in the experience playback pool satisfies a preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images includes:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
In some embodiments, the inputting the preprocessed training sample image into the deep reinforcement learning network for training includes:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
In another aspect of the present invention, a deep reinforcement learning apparatus based on a visual converter is provided. The apparatus includes a construction module, a data acquisition module, an input module, a training module, and an obtaining module:
the construction module is used for constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder and the conversion encoder comprises a multi-head attention layer and a feedforward network; the construction module also initializes the weights of the deep reinforcement learning network and builds an experience playback pool according to the capacity of the memory;
the data acquisition module is used for interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool;
the input module is used for randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset number of training samples, preprocessing the training sample images, and inputting the preprocessed training sample images into the training module;
the training module is used for training the deep reinforcement learning network by utilizing the preprocessed training sample image;
the obtaining module is used for obtaining the reinforcement learning model when the deep reinforcement learning network meets the convergence condition.
In some embodiments, the data acquisition module is specifically configured to:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
In some embodiments, the input module is specifically configured to:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
In some embodiments, the training module is specifically configured to:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
In another aspect of the present invention, an electronic device is provided, including:
one or more processors;
a storage unit for storing one or more programs which, when executed by the one or more processors, enable the one or more processors to implement the method as described above.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the method according to the above description.
According to the deep reinforcement learning method and device based on the visual converter, introducing the visual converter into the deep reinforcement learning network fills the gap in applying visual converters to the reinforcement learning field, improves the interpretability of the reinforcement learning method, enables more effective learning and training, and makes the method applicable to scenarios using reinforcement learning algorithms, such as games and robot control.
Drawings
FIG. 1 is a block diagram of an electronic device according to an embodiment of the invention;
FIG. 2 is a flowchart illustrating a depth reinforcement learning method based on a visual transformer according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a deep reinforcement learning network based on a visual transformer according to another embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a conversion encoder according to another embodiment of the present invention;
fig. 5 is a schematic structural diagram of a deep reinforcement learning apparatus based on a visual converter according to another embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
First, an example electronic device for implementing the apparatus and method of embodiments of the present invention is described with reference to fig. 1.
As shown in FIG. 1, the electronic device 200 includes one or more processors 210, one or more storage devices 220, one or more input devices 230, one or more output devices 240, and the like, interconnected by a bus system 250 and/or another form of connection mechanism. It should be noted that the components and structures of the electronic device shown in FIG. 1 are exemplary only, not limiting, and the electronic device may have other components and structures as desired.
The processor 210 may be a neural network processor composed of many-core chips, a single central processing unit (CPU), a CPU combined with a multi-core neural network processor array, or another form of processing unit having data processing and/or instruction execution capabilities, and it may control other components in the electronic device 200 to perform desired functions.
The storage device 220 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disks, and flash memory. One or more computer program instructions may be stored on the computer-readable storage media and executed by the processor to implement the client functionality (implemented by the processor) and/or other desired functionality in the embodiments of the invention described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage media.
The input device 230 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 240 may output various information (e.g., images or sounds) to an outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
Next, a depth reinforcement learning method based on a visual converter according to an embodiment of the present invention will be described with reference to fig. 2.
Illustratively, as shown in fig. 2, the present embodiment provides a deep reinforcement learning method S100 based on a visual converter, where the method S100 includes:
s110, constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multilayer perceptron and a conversion encoder, and the conversion encoder comprises a multi-head attention layer and a feedforward network.
Specifically, a state space, an action space, and a reward function can be defined based on the reinforcement learning operating environment, and a deep reinforcement learning network structure based on the visual converter is constructed. The visual converter comprises a multi-layer perceptron and a conversion encoder, as shown in fig. 3. As shown in fig. 4, the conversion encoder includes a multi-head attention layer and a feedforward network.
S120, initializing the weights of the deep reinforcement learning network, and constructing an experience playback pool according to the capacity of the memory.
Specifically, the weights of the deep reinforcement learning network may be initialized, and an experience playback pool may be established according to the capacity of the memory.
S130, interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool.
Specifically, experience data can be generated by interacting with the reinforcement learning operating environment through a greedy strategy during the interaction process, and the experience data are put into the experience playback pool.
S140, when the number of the samples in the experience playback pool meets the preset number of the training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images.
Specifically, when the number of samples in the experience playback pool meets the preset number of training samples, a batch of training sample images may be randomly extracted from the experience playback pool, and then the training sample images may be preprocessed according to actual needs.
It should be noted that the preset number of training samples may be the minimum number of training samples required for performing one training on the deep reinforcement learning network, or may be any number of training samples set according to actual needs, and a person skilled in the art may select the training samples as needed, which is not limited in this embodiment.
S150, inputting the preprocessed training sample image into the deep reinforcement learning network for training.
Specifically, the deep reinforcement learning network may be trained by using the preprocessed training sample image as an input.
S160, obtaining a reinforcement learning model when the deep reinforcement learning network meets the convergence condition.
Specifically, in the process of training the deep reinforcement learning network, when the deep reinforcement learning network meets the convergence condition, the current reinforcement learning model is obtained to be used as the final reinforcement learning model.
By introducing the visual converter into the deep reinforcement learning network, the deep reinforcement learning method based on the visual converter fills the gap in applying visual converters to the reinforcement learning field, improves the interpretability of the reinforcement learning method, enables more effective learning and training, and can be applied to scenarios using reinforcement learning algorithms, such as games and robot control.
Illustratively, interacting with the operating environment through a greedy strategy to generate experience data and put the experience data into the experience playback pool includes:
interacting with the operating environment through an ε-greedy strategy, wherein during interaction the output action is drawn at random from all actions with probability ε and the action with the maximum value is selected with probability 1 − ε; the resulting experience data (s, a, r, s′) are put into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
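The collection loop described above can be sketched as follows. This is a minimal illustration assuming a Gym-style environment interface (env.reset(), env.step()) and a Q-network q_net mapping observations to action values; the names, the pool capacity, and the loop structure are assumptions for exposition, not details taken from the patent.

```python
import random
from collections import deque

import torch

# Experience playback pool built according to memory capacity (the capacity is illustrative).
pool = deque(maxlen=100_000)

def collect_experience(env, q_net, epsilon: float, steps: int) -> None:
    """Interact with the environment via an epsilon-greedy strategy, putting
    (s, a, r, s') experience tuples into the experience playback pool."""
    s = env.reset()
    for _ in range(steps):
        if random.random() < epsilon:
            a = env.action_space.sample()      # random action, with probability epsilon
        else:
            with torch.no_grad():              # highest-valued action, with probability 1 - epsilon
                q = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
            a = int(q.argmax(dim=-1))
        s_next, r, done, _ = env.step(a)
        pool.append((s, a, r, s_next))         # experience data (s, a, r, s')
        s = env.reset() if done else s_next
```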
Illustratively, when the number of samples in the experience playback pool satisfies the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool and preprocessing the training sample images includes:
When the number of samples in the experience playback pool meets the preset training sample number m, batch-size training sample images are randomly extracted from the experience playback pool and preprocessed. Each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P².
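As a worked illustration of this preprocessing (the frame size is an assumption for exposition, not a value from the patent): an 84 × 84 input frame with patch size P = 7 is divided into N = 84 × 84 / 7² = 144 patches. A minimal PyTorch sketch of the splitting, assuming a (C, H, W) tensor with H and W divisible by P (the function name and memory layout are choices made here):

```python
import torch

def to_patches(img: torch.Tensor, p: int) -> torch.Tensor:
    """Split a (C, H, W) image into N = H*W/p^2 flattened patches of length C*p*p."""
    c, h, w = img.shape
    patches = img.unfold(1, p, p).unfold(2, p, p)   # (C, H//p, W//p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4)        # (H//p, W//p, C, p, p)
    return patches.reshape(-1, c * p * p)           # (N, C*p*p)
```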
Using a linear projection matrix, each patch X in the input images at moments t−2, t−1, and t is flattened to obtain a mapped D-dimensional vector X_1 = Embedding(X); the specific operation is a fully connected layer (torch.nn.Linear) in the PyTorch deep learning framework. Position embedding and sequence embedding are added to the D-dimensional vector X_1 to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding, implemented as module parameters (torch.nn.Parameter).
The state-action value placeholder QvalueToken, a learnable parameter, is spliced with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken); the specific operation uses the torch.nn.Identity function in the PyTorch deep learning framework. The processed data are then input into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer (i.e., the multi-head attention layer), and W_Q, W_K, W_V are the network weights of the linear mappings.
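Putting these formulas together, the forward pass could be sketched in PyTorch as below. The module mirrors the equations above (patch embedding, position and sequence encodings, a QvalueToken, one conversion-encoder block of multi-head attention plus feedforward, and an MLP head); all dimensions, names, and the exact attention internals are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class ViTQNetwork(nn.Module):
    """Sketch of the visual-converter Q-network described above. num_patches is
    assumed to cover the patches of the frames at t-2, t-1 and t (3 * N)."""

    def __init__(self, patch_dim: int, num_patches: int, d_model: int,
                 n_heads: int, n_actions: int):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)   # X_1 = Embedding(X) via torch.nn.Linear
        self.pos_enc = nn.Parameter(torch.zeros(num_patches, d_model))  # PositionEncoding
        self.seq_enc = nn.Parameter(torch.zeros(num_patches, d_model))  # SequenceEncoding
        self.q_token = nn.Parameter(torch.zeros(1, 1, d_model))         # QvalueToken
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                    # two linear mappings + activation
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.mlp = nn.Linear(d_model, n_actions)     # MLP head producing action values

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim)
        x1 = self.embed(patches)
        x2 = x1 + self.pos_enc + self.seq_enc        # X_2 = X_1 + PositionEncoding + SequenceEncoding
        token = self.q_token.expand(x2.size(0), -1, -1)
        x3 = torch.cat([x2, token], dim=1)           # X_3 = Concat(X_2, QvalueToken)
        attn_out, _ = self.attn(x3, x3, x3)          # SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)
        x_attention = self.norm1(x3 + attn_out)      # X_attention
        x_hidden = self.norm2(x_attention + self.ffn(x_attention))  # X_hidden
        return self.mlp(x_hidden[:, -1])             # X_output read at the QvalueToken position
```

Here the Q values are read from the QvalueToken position, matching the role of the state-action value placeholder in the formulas above.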
Through the attention mechanism of the visual converter, the deep reinforcement learning method of this embodiment can further improve the interpretability of the reinforcement learning method and learn useful global observation information while extracting local observation information, thereby capturing global information better. In addition, by using the sequence encoding of the visual converter, this embodiment enables the deep reinforcement learning network to utilize observation information from past moments, so that learning and training can be performed more effectively.
Illustratively, the inputting the preprocessed training sample image into the deep reinforcement learning network for training includes:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
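A training step consistent with this loss could be sketched as follows; the optimizer, the batch layout, and the target-network synchronization schedule are assumptions filled in for illustration.

```python
import random

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, pool, batch_size: int, gamma: float) -> float:
    """One gradient step on L = E[(r + gamma * max_a' Q(s',a';theta-) - Q(s,a;theta))^2]."""
    batch = random.sample(list(pool), batch_size)
    s, a, r, s_next = zip(*batch)
    s = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in s])
    s_next = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in s_next])
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Q(s, a; theta)
    with torch.no_grad():                                          # target uses theta-
        target = r + gamma * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)                                # mean square error loss L

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                               # theta updated with learning rate alpha
    return loss.item()

# Periodically synchronize the target-value network with the current-value network:
# target_net.load_state_dict(q_net.state_dict())
```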
The deep reinforcement learning method based on the visual converter of this embodiment can train the deep reinforcement learning network in parallel, thereby increasing its convergence speed.
In another aspect of the invention, a deep reinforcement learning device based on a visual converter is provided.
Illustratively, as shown in fig. 5, the present embodiment provides a deep reinforcement learning apparatus 100 based on a visual converter. The apparatus 100 includes a construction module 110, a data acquisition module 120, an input module 130, a training module 140, and an obtaining module 150. The apparatus 100 can be applied to the methods described above; for details not mentioned below, refer to the related descriptions above, which are not repeated here.
The construction module 110 is configured to construct a deep reinforcement learning network structure based on a visual converter and define a state space, an action space, and a reward function, wherein the visual converter includes a multi-layer perceptron and a conversion encoder and the conversion encoder includes a multi-head attention layer and a feedforward network; the module also initializes the weights of the deep reinforcement learning network and builds an experience playback pool according to the capacity of the memory;
the data acquisition module 120 is configured to interact with the operating environment through a greedy strategy to generate experience data and place the experience data into the experience playback pool;
the input module 130 is configured to randomly extract a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets a preset number of training samples, pre-process the training sample images, and input the pre-processed training sample images into the training module 140;
the training module 140 is configured to train the deep reinforcement learning network by using the preprocessed training sample image;
the obtaining module 150 is configured to obtain a reinforcement learning model when the deep reinforcement learning network satisfies a convergence condition.
By introducing the visual converter into the deep reinforcement learning network, the deep reinforcement learning device based on the visual converter of this embodiment fills the gap in applying visual converters to the reinforcement learning field, improves the interpretability of the reinforcement learning method, enables more effective learning and training, and can be applied to scenarios using reinforcement learning algorithms, such as games and robot control.
Illustratively, the data acquisition module 120 is specifically configured to:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
Illustratively, the input module 130 is specifically configured to:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer (i.e., the multi-head attention layer), and W_Q, W_K, W_V are the network weights of the linear mappings.
Through the attention mechanism of the visual converter, the deep reinforcement learning device of this embodiment can further improve the interpretability of the reinforcement learning method and learn useful global observation information while extracting local observation information, thereby capturing global information better. In addition, by using the sequence encoding of the visual converter, this embodiment enables the deep reinforcement learning network to utilize observation information from past moments, so that learning and training can be performed more effectively.
Illustratively, the training module 140 is specifically configured to:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
The deep reinforcement learning device based on the visual converter of this embodiment can train the deep reinforcement learning network in parallel, thereby increasing its convergence speed.
In another aspect of the present invention, an electronic device is provided, including:
one or more processors;
a storage unit for storing one or more programs which, when executed by the one or more processors, enable the one or more processors to implement the method according to the preceding description.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the method according to the above description.
The computer readable storage medium may be included in the apparatus or device of the present invention, or may exist separately.
The computer-readable storage medium may be any tangible medium that can contain or store a program, and may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples include, but are not limited to: a portable computer diskette, a hard disk, an optical fiber, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer-readable storage medium may also include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein; such a propagated signal may take any suitable form capable of carrying the program code.
It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (10)

1. A deep reinforcement learning method based on a visual converter, characterized by comprising the following steps:
constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder, and the conversion encoder comprises a multi-head attention layer and a feedforward network;
initializing the weight of the deep reinforcement learning network, and constructing an experience playback pool according to the capacity of a memory;
interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool;
when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images;
inputting the preprocessed training sample image into the deep reinforcement learning network for training;
and when the deep reinforcement learning network meets the convergence condition, obtaining a reinforcement learning model.
2. The method of claim 1, wherein interacting with the operating environment through a greedy strategy to generate experience data and place the experience data into the experience playback pool comprises:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
3. The method according to claim 2, wherein, when the number of samples in the experience playback pool meets a preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool and preprocessing the training sample images comprises:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
4. The method according to claim 3, wherein the inputting the preprocessed training sample images into the deep reinforcement learning network for training comprises:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
5. A deep reinforcement learning device based on a visual converter, characterized by comprising a construction module, a data acquisition module, an input module, a training module, and an obtaining module:
the construction module is used for constructing a deep reinforcement learning network structure based on a visual converter, wherein the visual converter comprises a multi-layer perceptron and a conversion encoder and the conversion encoder comprises a multi-head attention layer and a feedforward network; the construction module also initializes the weights of the deep reinforcement learning network and builds an experience playback pool according to the capacity of the memory;
the data acquisition module is used for interacting with the operating environment through a greedy strategy to generate experience data and putting the experience data into the experience playback pool;
the input module is used for randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset number of training samples, preprocessing the training sample images, and inputting the preprocessed training sample images into the training module;
the training module is used for training the deep reinforcement learning network by utilizing the preprocessed training sample image;
the obtaining module is used for obtaining the reinforcement learning model when the deep reinforcement learning network meets the convergence condition.
6. The apparatus of claim 5, wherein the data acquisition module is specifically configured to:
interacting with the operating environment through an ε-greedy strategy to obtain experience data (s, a, r, s′) and put it into the experience playback pool, wherein s is the observation at the current moment, a is the action at the current moment, r is the reward returned by the environment, and s′ is the observation at the next moment.
7. The apparatus of claim 6, wherein the input module is specifically configured to:
when the number of samples in the experience playback pool meets a preset training sample number m, randomly extracting batch-size training sample images from the experience playback pool and preprocessing them: each training sample image of size H × W is divided into N patches according to its size, each patch being of size P × P, wherein H is the height of the training sample image, W is the width of the training sample image, and N = H × W / P²;
Using a linear projection matrix to flatten each patch X in the input images at moments t−2, t−1, and t to obtain a mapped D-dimensional vector X_1 = Embedding(X), and adding position embedding and sequence embedding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
Splicing the state-action value placeholder QvalueToken, a learnable parameter, with the patch vector X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the visual converter, which outputs the action state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
wherein MLP is the multi-layer perceptron, X_hidden is the output of the conversion encoder, FeedForward is a feedforward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
8. The apparatus of claim 7, wherein the training module is specifically configured to:
training the deep reinforcement learning network according to a mean square error loss function L, wherein
L = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],
and updating the weights of the deep reinforcement learning network by
θ′ = θ + α (r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ),
wherein E is the mathematical expectation, a is the action at the current moment, a′ is the action at the next moment, α is the learning rate, γ is the discount coefficient, Q(s, a; θ) is the Q value of the current-value neural network, Q(s′, a′; θ⁻) is the Q value of the target-value neural network, θ and θ⁻ are respectively the parameters of the current-value neural network and the target-value neural network, and θ′ is the updated network parameter.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage unit to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-4.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is able to carry out a method according to any one of claims 1 to 4.
CN202110393996.7A 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on visual transducer Active CN113052257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110393996.7A CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on visual transducer


Publications (2)

Publication Number Publication Date
CN113052257A true CN113052257A (en) 2021-06-29
CN113052257B CN113052257B (en) 2024-04-16

Family

ID=76519168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110393996.7A Active CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on visual transducer

Country Status (1)

Country Link
CN (1) CN113052257B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150201895A1 (en) * 2012-08-31 2015-07-23 The University Of Chicago Supervised machine learning technique for reduction of radiation dose in computed tomography imaging
US20190258671A1 (en) * 2016-10-28 2019-08-22 Vilynx, Inc. Video Tagging System and Method
CN110945495A (en) * 2017-05-18 2020-03-31 易享信息技术有限公司 Conversion of natural language queries to database queries based on neural networks
US20190124348A1 (en) * 2017-10-19 2019-04-25 Samsung Electronics Co., Ltd. Image encoder using machine learning and data processing method of the image encoder
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction
CN109241552A (en) * 2018-07-12 2019-01-18 哈尔滨工程大学 A kind of underwater robot motion planning method based on multiple constraint target
CN110286161A (en) * 2019-03-28 2019-09-27 清华大学 Main transformer method for diagnosing faults based on adaptive enhancing study
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
US20200379461A1 (en) * 2019-05-29 2020-12-03 Argo AI, LLC Methods and systems for trajectory forecasting with recurrent neural networks using inertial behavioral rollout
US20210073995A1 (en) * 2019-09-11 2021-03-11 Nvidia Corporation Training strategy search using reinforcement learning
CN111126282A (en) * 2019-12-25 2020-05-08 中国矿业大学 Remote sensing image content description method based on variation self-attention reinforcement learning
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN111597830A (en) * 2020-05-20 2020-08-28 腾讯科技(深圳)有限公司 Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111666500A (en) * 2020-06-08 2020-09-15 腾讯科技(深圳)有限公司 Training method of text classification model and related equipment
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model
CN112084314A (en) * 2020-08-20 2020-12-15 电子科技大学 Knowledge-introducing generating type session system
CN112261725A (en) * 2020-10-23 2021-01-22 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning
CN112488306A (en) * 2020-12-22 2021-03-12 中国电子科技集团公司信息科学研究院 Neural network compression method and device, electronic equipment and storage medium

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
DOSOVITSKIY, ALEXEY, et al.: "An image is worth 16x16 words: Transformers for image recognition at scale", arXiv, pages 1-4 *
HAOYI ZHOU et al.: "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting", arXiv, 28 March 2021, pages 1-15 *
IKER PENG: "Applications of reinforcement learning in vision (RL for Computer Vision)", pages 1-9, retrieved from the Internet <URL: https://zhuanlan.zhihu.com/p/51202503> *
J. KULHÁNEK et al.: "Visual Navigation in Real-World Indoor Environments Using End-to-End Deep Reinforcement Learning", IEEE Robotics and Automation Letters, vol. 6, no. 3, 23 March 2021, pages 4345-4352 *
人工智能学术前沿 (AI Academic Frontiers): "Deep Transformer models for time-series forecasting: the influenza prevalence case", pages 1-6, retrieved from the Internet <URL: https://zhuanlan.zhihu.com/p/151423371> *
李峰 (Li Feng): "Must-read classic papers on deep reinforcement learning: DQN, DDQN, Prioritized, Dueling, Rainbow", pages 1-2, retrieved from the Internet <URL: https://zhuanlan.zhihu.com/p/337553995> *
李飞雨 (Li Feiyu): "Research on Chinese-Korean machine translation based on reinforcement learning and machine translation quality estimation", Application Research of Computers *
郝燕龙 (Hao Yanlong): "Research on image captioning models based on dense convolutional neural network feature extraction", China Master's Theses Full-text Database, Information Science & Technology, no. 9, 15 September 2019, pages 138-1158 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469119A (en) * 2021-07-20 2021-10-01 合肥工业大学 Cervical cell image classification method based on visual converter and graph convolution network
CN113469119B (en) * 2021-07-20 2022-10-04 合肥工业大学 Cervical cell image classification method based on visual converter and image convolution network
CN115147669A (en) * 2022-06-24 2022-10-04 北京百度网讯科技有限公司 Image processing method, training method and equipment based on visual converter model
CN115147669B (en) * 2022-06-24 2023-04-18 北京百度网讯科技有限公司 Image processing method, training method and equipment based on visual converter model

Also Published As

Publication number Publication date
CN113052257B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
US20210390653A1 (en) Learning robotic tasks using one or more neural networks
US11373087B2 (en) Method and apparatus for generating fixed-point type neural network
CN109464803B (en) Virtual object control method, virtual object control device, model training device, storage medium and equipment
KR102387570B1 (en) Method and apparatus of generating facial expression and learning method for generating facial expression
WO2019155064A1 (en) Data compression using jointly trained encoder, decoder, and prior neural networks
CN110476173B (en) Hierarchical device placement with reinforcement learning
CN110795549B (en) Short text conversation method, device, equipment and storage medium
CN111292262B (en) Image processing method, device, electronic equipment and storage medium
CN113052257A (en) Deep reinforcement learning method and device based on visual converter
US11776269B2 (en) Action classification in video clips using attention-based neural networks
US11501168B2 (en) Learning longer-term dependencies in neural network using auxiliary losses
CN111340190A (en) Method and device for constructing network structure, and image generation method and device
KR20200076461A (en) Method and apparatus for processing neural network based on nested bit representation
CN109740012B (en) Method for understanding and asking and answering image semantics based on deep neural network
JP2020123345A (en) Learning method and learning device for generating training data acquired from virtual data on virtual world by using generative adversarial network (gan), to thereby reduce annotation cost required in learning processes of neural network for autonomous driving, and testing method and testing device using the same
CN112216307A (en) Speech emotion recognition method and device
CN116912629B (en) General image text description generation method and related device based on multi-task learning
CN116188621A (en) Text supervision-based bidirectional data stream generation countermeasure network image generation method
KR102597184B1 (en) Knowledge distillation method and system specialized for lightweight pruning-based deep neural networks
WO2022127603A1 (en) Model processing method and related device
CN116266376A (en) Rendering method and device
CN113039555B (en) Method, system and storage medium for classifying actions in video clips
US20220189171A1 (en) Apparatus and method for prediction of video frame based on deep learning
CN113011555B (en) Data processing method, device, equipment and storage medium
US20240096071A1 (en) Video processing method using transfer learning and pre-training server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant