CN113052257B - Deep reinforcement learning method and device based on Vision Transformer - Google Patents

Deep reinforcement learning method and device based on Vision Transformer

Info

Publication number
CN113052257B
CN113052257B (Application CN202110393996.7A)
Authority
CN
China
Prior art keywords
reinforcement learning
training sample
experience
sample images
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110393996.7A
Other languages
Chinese (zh)
Other versions
CN113052257A (en)
Inventor
金丹
王昭
龙玉婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Information Science Research Institute filed Critical CETC Information Science Research Institute
Priority to CN202110393996.7A priority Critical patent/CN113052257B/en
Publication of CN113052257A publication Critical patent/CN113052257A/en
Application granted granted Critical
Publication of CN113052257B publication Critical patent/CN113052257B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence and provides a deep reinforcement learning method and device based on a Vision Transformer. The method comprises the following steps: constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network; initializing the deep reinforcement learning network weights, and constructing an experience playback pool according to the memory capacity; interacting with the running environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool; when the number of samples in the experience playback pool meets a preset value, randomly extracting a batch of training sample images from the experience playback pool, preprocessing the training sample images, and inputting them into the deep reinforcement learning network for training; and when the deep reinforcement learning network meets the convergence condition, obtaining the reinforcement learning model. The invention fills the gap in applying Vision Transformers to the reinforcement learning field, improves the interpretability of the reinforcement learning method, and enables learning and training to be carried out more effectively.

Description

Deep reinforcement learning method and device based on Vision Transformer
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a deep reinforcement learning method and device based on a Vision Transformer.
Background
In recent years, reinforcement learning has become a research hotspot in the field of machine learning. An agent learns a strategy during interaction with the environment so as to maximize the return or achieve a specific goal. By combining with deep learning methods, deep reinforcement learning has made breakthroughs in many artificial intelligence tasks, such as game playing, robot control, group decision-making, and autonomous driving.
Currently, deep reinforcement learning methods mainly comprise value-function-based methods, policy-gradient-based methods, and methods based on the Actor-Critic framework. In existing reinforcement learning network frameworks, the adopted network structures are mainly convolutional neural networks and long short-term memory networks. Convolutional neural networks focus on extracting local observation information and are weak at capturing global observation information. Long short-term memory networks have the advantage of processing sequence data and can learn and store information over long horizons, but as recurrent network structures they cannot be trained in parallel.
Transformers are widely used in natural language processing tasks; their architecture avoids recursion, enables parallel computation, and models the global dependencies between input and output through a self-attention mechanism. However, the Transformer has not yet been studied in the reinforcement learning field. Accordingly, there is a need for an improved Vision Transformer-based deep reinforcement learning method.
Disclosure of Invention
The invention aims to solve at least one of the technical problems in the prior art and provides a deep reinforcement learning method and device based on a Vision Transformer.
In one aspect of the present invention, there is provided a deep reinforcement learning method based on a Vision Transformer, the method comprising:
constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network;
initializing the weights of the deep reinforcement learning network, and constructing an experience playback pool according to the memory capacity;
interacting with the running environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool;
when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images;
inputting the preprocessed training sample images into the deep reinforcement learning network for training;
and when the deep reinforcement learning network meets the convergence condition, obtaining a reinforcement learning model.
In some embodiments, the interacting with the running environment through a greedy strategy to generate experience data and place it into the experience playback pool includes:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time.
In some embodiments, when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool and preprocessing the training sample images includes:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
In some embodiments, inputting the preprocessed training sample images into the deep reinforcement learning network for training comprises:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
In another aspect of the invention, a deep reinforcement learning device based on a Vision Transformer is provided, the device comprising a construction module, a data acquisition module, an input module, a training module and an acquisition module:
the construction module is used for constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network; the construction module also initializes the weights of the deep reinforcement learning network and constructs an experience playback pool according to the memory capacity;
the data acquisition module is used for interacting with the operation environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool;
the input module is used for randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset training sample number, preprocessing the training sample images and inputting the preprocessed training sample images into the training module;
the training module is used for training the deep reinforcement learning network by utilizing the preprocessed training sample image;
the obtaining module is used for obtaining the reinforcement learning model when the deep reinforcement learning network meets the convergence condition.
In some embodiments, the data acquisition module is specifically configured to:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time.
In some embodiments, the input module is specifically configured to:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
In some embodiments, the training module is specifically configured to:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
In another aspect of the present invention, there is provided an electronic apparatus including:
one or more processors;
and a storage unit configured to store one or more programs that, when executed by the one or more processors, enable the one or more processors to implement the method described above.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, enables the implementation of a method according to the preceding description.
According to the Vision Transformer-based deep reinforcement learning method and device, introducing the Vision Transformer into the deep reinforcement learning network fills the gap in applying Vision Transformers to the reinforcement learning field, improves the interpretability of the reinforcement learning method, and enables learning and training to be carried out more effectively; the method and device can be applied to scenarios that use reinforcement learning algorithms, such as games and robot control.
Drawings
FIG. 1 is a block diagram schematically illustrating the composition of an electronic device according to an embodiment of the present invention;
FIG. 2 is a flow chart of a deep reinforcement learning method based on a Vision Transformer according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a deep reinforcement learning network based on a Vision Transformer according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of a Transformer encoder according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a deep reinforcement learning device based on a Vision Transformer according to another embodiment of the invention.
Detailed Description
In order that those skilled in the art may better understand the technical solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
First, an example electronic device for implementing the apparatus and method of embodiments of the present invention is described with reference to fig. 1.
As shown in fig. 1, electronic device 200 includes one or more processors 210, one or more storage devices 220, one or more input devices 230, one or more output devices 240, etc., interconnected by a bus system 250 and/or other forms of connection mechanisms. It should be noted that the components and structures of the electronic device shown in fig. 1 are exemplary only and not limiting, as the electronic device may have other components and structures as desired.
Processor 210 may be a neural network processor comprised of chips of a multi (many) core architecture, may be a separate Central Processing Unit (CPU), or may be a central processing unit + multi-core neural network processor array or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in electronic device 200 to perform desired functions.
The storage 220 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer readable storage medium that can be executed by a processor to perform client functions and/or other desired functions in embodiments of the present invention described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer readable storage medium.
The input device 230 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 240 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
Next, a Vision Transformer-based deep reinforcement learning method according to an embodiment of the present invention will be described with reference to fig. 2.
As illustrated in fig. 2, the present embodiment provides a deep reinforcement learning method S100 based on a Vision Transformer, where the method S100 includes:
S110, constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network.
Specifically, a state space, an action space and a reward function can be defined based on the reinforcement learning running environment, and a deep reinforcement learning network structure based on the Vision Transformer can be constructed. As shown in fig. 3, the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder. As shown in fig. 4, the Transformer encoder includes a multi-head attention layer and a feed-forward network.
S120, initializing the weight of the deep reinforcement learning network, and constructing an experience playback pool according to the capacity of the memory.
Specifically, each weight of the deep reinforcement learning network can be initialized, and an experience playback pool is built according to the capacity of the memory.
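For illustration only (not part of the patent text), the experience playback pool described here can be sketched in Python roughly as follows; the class name ReplayPool, its method names and the deque-backed storage are hypothetical choices, with the capacity corresponding to the memory capacity mentioned above.

```python
import random
from collections import deque

class ReplayPool:
    """Minimal experience playback pool with a fixed memory capacity (illustrative sketch)."""

    def __init__(self, capacity: int):
        # Once the capacity is reached, the oldest transitions are discarded automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        # Store one transition (s, a, r, s') produced by interacting with the environment.
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size: int):
        # Randomly extract a batch of training samples once enough data has been collected.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```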
S130, interacting with the running environment through a greedy strategy, generating experience data and putting the experience data into the experience playback pool.
Specifically, the reinforcement learning operating environment can be interacted through a greedy strategy, experience data are generated in the interaction process, and the experience data are put into an experience playback pool.
And S140, randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset training sample number, and preprocessing the training sample images.
Specifically, when the number of samples in the experience playback pool meets the preset number of training samples, a batch of training sample images can be randomly extracted from the experience playback pool, and then the training sample images are preprocessed according to actual needs.
It should be noted that the preset number of training samples may be the minimum number of training samples required for training the deep reinforcement learning network once, or may be any number of training samples set according to actual needs, and those skilled in the art may select the training samples as required, which is not limited in this embodiment.
S150, inputting the preprocessed training sample image into the deep reinforcement learning network for training.
Specifically, the pre-processed training sample image may be used as an input to train the deep reinforcement learning network.
S160, acquiring a reinforcement learning model when the deep reinforcement learning network meets convergence conditions.
Specifically, in the process of training the deep reinforcement learning network, when the deep reinforcement learning network meets the convergence condition, the current reinforcement learning model is obtained as the final reinforcement learning model.
The Vision Transformer-based deep reinforcement learning method of this embodiment fills the gap in applying Vision Transformers to the reinforcement learning field by introducing the Vision Transformer into the deep reinforcement learning network, improves the interpretability of the reinforcement learning method, enables learning and training to be carried out more effectively, and can be applied to scenarios that use reinforcement learning algorithms, such as games and robot control.
Illustratively, interacting with the running environment through a greedy strategy to generate experience data and place it into the experience playback pool includes:
during interaction, the output action is drawn at random from all actions with probability ε, and the action with the highest value is selected with probability 1−ε, so that experience data (s, a, r, s′) are obtained and put into the experience playback pool, where s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time. A minimal sketch of this interaction step is given below.
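The following sketch assumes a Gym-style environment interface and a Q network that maps an observation to one value per action; all names are illustrative assumptions rather than the patent's implementation.

```python
import random
import torch

def select_action(q_network, state, epsilon: float, num_actions: int) -> int:
    """epsilon-greedy: a random action with probability epsilon,
    otherwise the action with the highest estimated Q value."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))  # shape: (1, num_actions)
    return int(q_values.argmax(dim=1).item())

def interact_once(env, q_network, state, epsilon, replay_pool, num_actions):
    """One interaction step: act, observe (s, a, r, s') and store it in the playback pool."""
    action = select_action(q_network, state, epsilon, num_actions)
    next_state, reward, done, _ = env.step(action)  # Gym-style step() assumed
    replay_pool.push(state, action, reward, next_state)
    return next_state, done
```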
Illustratively, when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool and preprocessing the training sample images includes:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W. Each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P².
Each patch X in the input images at times t-2, t-1 and t is flattened and mapped by a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X); a specific operation is, for example, the fully connected layer torch.nn.Linear in the PyTorch deep learning framework. Position encoding and sequence (timing) encoding are added to the D-dimensional vector X_1 to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding; a specific operation is, for example, implemented with learnable module parameters in the PyTorch deep learning framework.
A learnable state-action-value placeholder QvalueToken is concatenated with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken); a specific operation is, for example, the torch.nn.Identity function in the PyTorch deep learning framework. The processed data are then input into the Vision Transformer, which outputs the action-state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer (i.e. the multi-head attention layer), and W_Q, W_K, W_V are the network weights of the linear mappings.
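To make the formulas above concrete, a minimal PyTorch sketch of such a Vision Transformer Q network is given below. It follows the computation X_1 → X_2 → X_3 → X_attention → X_hidden → X_output with a single post-norm encoder layer; the module names, dimensions, GELU activation and the way the position, sequence and QvalueToken parameters are instantiated are assumptions for illustration, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class VitQNetwork(nn.Module):
    """Illustrative Vision-Transformer Q network.

    Input: a stack of the frames at times t-2, t-1 and t, each split into N flattened P x P patches.
    Output: one Q value per action, read from the QvalueToken position."""

    def __init__(self, num_patches, patch_dim, d_model, num_heads, num_actions, num_frames=3):
        super().__init__()
        self.embedding = nn.Linear(patch_dim, d_model)                 # X_1 = Embedding(X)
        self.position_encoding = nn.Parameter(torch.zeros(1, num_patches, d_model))
        self.sequence_encoding = nn.Parameter(torch.zeros(1, num_frames, 1, d_model))
        self.qvalue_token = nn.Parameter(torch.zeros(1, 1, d_model))   # state-action-value placeholder
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                          nn.Linear(4 * d_model, d_model))
        self.mlp_head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, num_actions))

    def forward(self, patches):
        # patches: (batch, num_frames, num_patches, patch_dim)
        b, f, n, _ = patches.shape
        x1 = self.embedding(patches)                                    # D-dimensional patch vectors
        x2 = x1 + self.position_encoding.unsqueeze(1) + self.sequence_encoding  # X_2
        x2 = x2.reshape(b, f * n, -1)
        x3 = torch.cat([self.qvalue_token.expand(b, -1, -1), x2], dim=1)        # X_3 = Concat
        attn_out, _ = self.attention(x3, x3, x3)
        x_attention = self.norm1(x3 + attn_out)                         # LayerNorm(X_3 + SelfAttention(...))
        x_hidden = self.norm2(x_attention + self.feed_forward(x_attention))
        return self.mlp_head(x_hidden[:, 0])                            # X_output = MLP(X_hidden) at QvalueToken
```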
According to the Vision Transformer-based deep reinforcement learning method of this embodiment, the attention mechanism of the Vision Transformer can further improve the interpretability of the reinforcement learning method, and useful global observation information can be learned while local observation information is extracted, so that global information is better captured. In addition, by using the sequence (timing) encoding of the Vision Transformer, the deep reinforcement learning network can make use of observation information from past time steps, so that learning and training can be carried out more effectively.
Illustratively, inputting the preprocessed training sample images into the deep reinforcement learning network for training includes:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
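A minimal sketch of this training step in PyTorch is shown below, assuming a separate target value network whose parameters θ⁻ are periodically copied from the current value network, and an externally constructed optimizer in which the learning rate α lives; names and tensor layouts are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, optimizer, batch, gamma: float):
    """One update against L = E[(r + gamma * max_a' Q(s', a'; theta-) - Q(s, a; theta))^2]."""
    states, actions, rewards, next_states = batch  # tensors sampled from the experience playback pool

    # Q(s, a; theta) of the current value network for the actions actually taken
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Target value r + gamma * max_a' Q(s', a'; theta-), computed with the frozen target network
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values

    loss = F.mse_loss(q_values, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # theta' <- theta - alpha * grad(L), alpha being the learning rate
    return loss.item()
```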
According to the Vision Transformer-based deep reinforcement learning method of this embodiment, the deep reinforcement learning network can be trained in a parallel manner, which accelerates the convergence of the deep reinforcement learning network.
In another aspect of the invention, a deep reinforcement learning device based on a Vision Transformer is provided.
As illustrated in fig. 5, the present embodiment provides a deep reinforcement learning device 100 based on a Vision Transformer, where the device 100 includes a construction module 110, a data acquisition module 120, an input module 130, a training module 140, and an acquisition module 150. The device 100 may be applied to the method described above; for details not mentioned below, reference may be made to the related description above, which is not repeated here.
The construction module 110 is configured to construct a deep reinforcement learning network structure based on a Vision Transformer and to define a state space, an action space and a reward function, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network; it is further configured to initialize the weights of the deep reinforcement learning network and to construct an experience playback pool according to the memory capacity;
the data collection module 120 is configured to interact with the operation environment through a greedy strategy, generate experience data, and place the experience data in the experience playback pool;
the input module 130 is configured to randomly extract a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets a preset training sample number, perform preprocessing on the training sample images, and input the preprocessed training sample images to the training module 140;
the training module 140 is configured to train the deep reinforcement learning network by using the preprocessed training sample image;
the obtaining module 150 is configured to obtain a reinforcement learning model when the deep reinforcement learning network meets a convergence condition.
The Vision Transformer-based deep reinforcement learning device of this embodiment fills the gap in applying Vision Transformers to the reinforcement learning field by introducing the Vision Transformer into the deep reinforcement learning network, improves the interpretability of the reinforcement learning method, enables learning and training to be carried out more effectively, and can be applied to scenarios that use reinforcement learning algorithms, such as games and robot control.
Illustratively, the data acquisition module 120 is specifically configured to:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time.
Illustratively, the input module 130 is specifically configured to:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_output = MLP(X_hidden),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer (i.e. the multi-head attention layer), and W_Q, W_K, W_V are the network weights of the linear mappings.
The Vision Transformer-based deep reinforcement learning device of this embodiment can further improve the interpretability of the reinforcement learning method through the attention mechanism of the Vision Transformer, and can learn useful global observation information while extracting local observation information, so that global information is better captured. In addition, by using the sequence (timing) encoding of the Vision Transformer, the deep reinforcement learning network can make use of observation information from past time steps, so that learning and training can be carried out more effectively.
Illustratively, the training module 140 is specifically configured to:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
The Vision Transformer-based deep reinforcement learning device of this embodiment can train the deep reinforcement learning network in a parallel manner, which accelerates the convergence of the deep reinforcement learning network.
In another aspect of the present invention, there is provided an electronic apparatus including:
one or more processors;
and a storage unit configured to store one or more programs that, when executed by the one or more processors, enable the one or more processors to implement the method according to the foregoing description.
In another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, enables the implementation of a method according to the preceding description.
The computer readable storage medium may be included in the apparatus or device of the present invention or may exist alone.
A computer-readable storage medium may be any tangible medium that can contain or store a program, and may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device; more specific examples include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, an optical fiber, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
The computer-readable storage medium may also include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein, specific examples of which include, but are not limited to, electromagnetic signals, optical signals, or any suitable combination thereof.
It is to be understood that the above embodiments are merely illustrative of the application of the principles of the present invention, but not in limitation thereof. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and are also considered to be within the scope of the invention.

Claims (6)

1. A deep reinforcement learning method based on a Vision Transformer, the method comprising:
constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network;
initializing the weights of the deep reinforcement learning network, and constructing an experience playback pool according to the memory capacity;
interacting with the running environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool;
when the number of samples in the experience playback pool meets the preset number of training samples, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images;
inputting the preprocessed training sample image into the deep reinforcement learning network for training;
when the deep reinforcement learning network meets convergence conditions, acquiring a reinforcement learning model;
the interacting with the running environment through the greedy strategy to generate experience data and place the experience data into the experience playback pool comprises the following steps:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time;
when the number of samples in the experience playback pool meets the preset training sample number, randomly extracting a batch of training sample images from the experience playback pool, and preprocessing the training sample images, wherein the method comprises the following steps:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_output = MLP(X_hidden),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
2. The method of claim 1, wherein inputting the preprocessed training sample image into the deep reinforcement learning network for training comprises:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
3. A deep reinforcement learning device based on a Vision Transformer, characterized in that the device comprises a construction module, a data acquisition module, an input module, a training module and an acquisition module:
the construction module is used for constructing a deep reinforcement learning network structure based on a Vision Transformer, wherein the Vision Transformer comprises a multi-layer perceptron and a Transformer encoder, and the Transformer encoder comprises a multi-head attention layer and a feed-forward network; the construction module also initializes the weights of the deep reinforcement learning network and constructs an experience playback pool according to the memory capacity;
the data acquisition module is used for interacting with the operation environment through a greedy strategy, generating experience data and placing the experience data into the experience playback pool;
the input module is used for randomly extracting a batch of training sample images from the experience playback pool when the number of samples in the experience playback pool meets the preset training sample number, preprocessing the training sample images and inputting the preprocessed training sample images into the training module;
the training module is used for training the deep reinforcement learning network by utilizing the preprocessed training sample image;
the acquisition module is used for acquiring a reinforcement learning model when the deep reinforcement learning network meets convergence conditions;
the data acquisition module is specifically used for:
interacting with the running environment through an ε-greedy strategy, obtaining experience data (s, a, r, s′) and putting the experience data into the experience playback pool, wherein s is the observation at the current time, a is the action at the current time, r is the reward returned by the environment, and s′ is the observation at the next time;
the input module is specifically used for:
when the number of samples in the experience playback pool meets the preset number m of training samples, randomly extracting batch_size training sample images from the experience playback pool and preprocessing the training sample images of size H×W: each training sample image is divided into N image patches according to its size, each patch being of size P×P, where H is the height of the training sample image, W is its width, and N = H×W/P²;
flattening each patch X in the input images at times t-2, t-1 and t and mapping it with a linear projection matrix to obtain a D-dimensional vector X_1 = Embedding(X), and adding position encoding and sequence (timing) encoding to it to obtain the patch vector X_2 = X_1 + PositionEncoding + SequenceEncoding;
concatenating a learnable state-action-value placeholder QvalueToken with the patch vectors X_2 to obtain X_3 = Concat(X_2, QvalueToken), then inputting the processed data into the Vision Transformer, which outputs the action-state value X_output, wherein
X_attention = LayerNorm(X_3 + SelfAttention(X_3 W_Q, X_3 W_K, X_3 W_V)),
X_hidden = LayerNorm(X_attention + FeedForward(X_attention)),
X_output = MLP(X_hidden),
where MLP is a multi-layer perceptron, X_hidden is the output of the Transformer encoder, FeedForward is a feed-forward network consisting of two linear mappings and an activation function, X_attention is the output of the multi-head attention layer, SelfAttention is the self-attention layer, and W_Q, W_K, W_V are the network weights of the linear mappings.
4. The apparatus of claim 3, wherein the training module is specifically configured to:
training the deep reinforcement learning network according to a mean-square-error loss function L, where L = E[(r + γ·max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²], and updating the weights of the deep reinforcement learning network,
where E is the mathematical expectation, a is the action at the current time, a′ is the action at the next time, α is the learning rate, γ is the discount factor, Q(s, a; θ) is the Q value of the current value network, Q(s′, a′; θ⁻) is the Q value of the target value network, θ and θ⁻ are respectively the parameters of the current value network and of the target value network, and θ′ is the updated value-network parameter.
5. An electronic device, the electronic device comprising:
one or more processors;
a storage unit for storing one or more programs, which when executed by the one or more processors, enable the one or more processors to implement the method of claim 1 or 2.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, is capable of realizing the method according to claim 1 or 2.
CN202110393996.7A 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on Vision Transformer Active CN113052257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110393996.7A CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on Vision Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110393996.7A CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on Vision Transformer

Publications (2)

Publication Number Publication Date
CN113052257A CN113052257A (en) 2021-06-29
CN113052257B true CN113052257B (en) 2024-04-16

Family

ID=76519168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110393996.7A Active CN113052257B (en) 2021-04-13 2021-04-13 Deep reinforcement learning method and device based on Vision Transformer

Country Status (1)

Country Link
CN (1) CN113052257B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469119B (en) * 2021-07-20 2022-10-04 合肥工业大学 Cervical cell image classification method based on Vision Transformer and image convolution network
CN115147669B (en) * 2022-06-24 2023-04-18 北京百度网讯科技有限公司 Image processing method, training method and equipment based on Vision Transformer model
CN118003329B (en) * 2024-03-18 2024-09-06 复旦大学 Visual reinforcement learning test time adaptation method applied to mechanical arm control
CN118233312B (en) * 2024-03-20 2024-09-17 同济大学 Adaptive broadband resource allocation method combining deep reinforcement learning and Transformer

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction
CN109241552A (en) * 2018-07-12 2019-01-18 哈尔滨工程大学 A kind of underwater robot motion planning method based on multiple constraint target
CN110286161A (en) * 2019-03-28 2019-09-27 清华大学 Main transformer method for diagnosing faults based on adaptive enhancing study
CN110945495A (en) * 2017-05-18 2020-03-31 易享信息技术有限公司 Conversion of natural language queries to database queries based on neural networks
CN111126282A (en) * 2019-12-25 2020-05-08 中国矿业大学 Remote sensing image content description method based on variation self-attention reinforcement learning
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN111597830A (en) * 2020-05-20 2020-08-28 腾讯科技(深圳)有限公司 Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111666500A (en) * 2020-06-08 2020-09-15 腾讯科技(深圳)有限公司 Training method of text classification model and related equipment
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
CN112084314A (en) * 2020-08-20 2020-12-15 电子科技大学 Knowledge-introducing generating type session system
CN112261725A (en) * 2020-10-23 2021-01-22 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning
CN112488306A (en) * 2020-12-22 2021-03-12 中国电子科技集团公司信息科学研究院 Neural network compression method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2890300B1 (en) * 2012-08-31 2019-01-02 Kenji Suzuki Supervised machine learning technique for reduction of radiation dose in computed tomography imaging
US11256741B2 (en) * 2016-10-28 2022-02-22 Vertex Capital Llc Video tagging system and method
KR102535361B1 (en) * 2017-10-19 2023-05-24 삼성전자주식회사 Image encoder using machine learning and data processing method thereof
US11131993B2 (en) * 2019-05-29 2021-09-28 Argo AI, LLC Methods and systems for trajectory forecasting with recurrent neural networks using inertial behavioral rollout
US11100643B2 (en) * 2019-09-11 2021-08-24 Nvidia Corporation Training strategy search using reinforcement learning

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110945495A (en) * 2017-05-18 2020-03-31 易享信息技术有限公司 Conversion of natural language queries to database queries based on neural networks
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction
CN109241552A (en) * 2018-07-12 2019-01-18 哈尔滨工程大学 A kind of underwater robot motion planning method based on multiple constraint target
CN110286161A (en) * 2019-03-28 2019-09-27 清华大学 Main transformer method for diagnosing faults based on adaptive enhancing study
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
CN111126282A (en) * 2019-12-25 2020-05-08 中国矿业大学 Remote sensing image content description method based on variation self-attention reinforcement learning
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN111597830A (en) * 2020-05-20 2020-08-28 腾讯科技(深圳)有限公司 Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111666500A (en) * 2020-06-08 2020-09-15 腾讯科技(深圳)有限公司 Training method of text classification model and related equipment
CN111709398A (en) * 2020-07-13 2020-09-25 腾讯科技(深圳)有限公司 Image recognition method, and training method and device of image recognition model
CN112084314A (en) * 2020-08-20 2020-12-15 电子科技大学 Knowledge-introducing generating type session system
CN112261725A (en) * 2020-10-23 2021-01-22 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning
CN112488306A (en) * 2020-12-22 2021-03-12 中国电子科技集团公司信息科学研究院 Neural network compression method and device, electronic equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Visual Navigation in Real-World Indoor Environments Using End-to-End Deep Reinforcement Learning";J. Kulhánek等;《IEEE Robotics and Automation Letters》;20210323;第6卷(第3期);第4345-4352页 *
"基于密集卷积神经网络特征提取的图像描述模型研究";郝燕龙;《中国优秀硕士学位论文全文数据库 信息科技辑》;20190915(第9期);I138-1158 *
"基于强化学习和机器翻译质量评估的中朝机器翻译研究";李飞雨;《计算机应用研究》;摘要 *
Dosovitskiy, Alexey, et al."An image is worth 16x16 words: Transformers for image recognition at scale".《arXiv》.2020,第1-4节. *
Haoyi Zhou,等."Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting".《arxiv》.2021,第1-15页. *

Also Published As

Publication number Publication date
CN113052257A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN113052257B (en) Deep reinforcement learning method and device based on Vision Transformer
US20210390653A1 (en) Learning robotic tasks using one or more neural networks
CN113039555B (en) Method, system and storage medium for classifying actions in video clips
CN110796111B (en) Image processing method, device, equipment and storage medium
CN110476173B (en) Hierarchical device placement with reinforcement learning
CN112529146B (en) Neural network model training method and device
CN117218498B (en) Multi-modal large language model training method and system based on multi-modal encoder
CN113762461B (en) Training neural networks using reversible boost operators with limited data
CN112840359B (en) Controlling agents on a long time scale by using time value transfer
CN110795549B (en) Short text conversation method, device, equipment and storage medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN111340190A (en) Method and device for constructing network structure, and image generation method and device
WO2022242127A1 (en) Image feature extraction method and apparatus, and electronic device and storage medium
CN111282272B (en) Information processing method, computer readable medium and electronic device
CN109858046A (en) Using auxiliary loss come the long-rang dependence in learning neural network
WO2021169366A1 (en) Data enhancement method and apparatus
CN117456587A (en) Multi-mode information control-based speaker face video generation method and device
CN113554040B (en) Image description method and device based on condition generation countermeasure network
CN117708698A (en) Class determination method, device, equipment and storage medium
CN116266376A (en) Rendering method and device
CN116665114A (en) Multi-mode-based remote sensing scene identification method, system and medium
CN108376283B (en) Pooling device and pooling method for neural network
CN116095183A (en) Data compression method and related equipment
Zhong et al. Disentangling controllable object through video prediction improves visual reinforcement learning
CN117011403A (en) Method and device for generating image data, training method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant