WO2020024791A1 - Agent reinforcement learning method, apparatus, device and medium - Google Patents
Agent reinforcement learning method, apparatus, device and medium
- Publication number
- WO2020024791A1 (PCT/CN2019/096233)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- attention
- current environment
- agent
- environment image
- change
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/008—Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/255—Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
Definitions
- the present disclosure relates to computer vision technology, and in particular, to a method for agent reinforcement learning, an agent reinforcement learning device, an electronic device, a computer-readable storage medium, and a computer program.
- Agents are commonly used; examples include a moving board or a robotic arm that catches a falling ball in a game.
- agents usually use the reward information obtained from trial and error in the environment to guide learning.
- the embodiments of the present disclosure provide a technical solution for agent reinforcement learning.
- A method for agent reinforcement learning includes: acquiring key visual information on which an agent bases its decision about a current environment image; acquiring actual key visual information of the current environment image; determining attention change reward information according to the relied-on key visual information and the actual key visual information; and adjusting the reward feedback of the agent's reinforcement learning according to the attention change reward information.
- An agent reinforcement learning device includes: a key vision acquisition module for acquiring key visual information on which the agent bases its decision about a current environment image; an actual vision acquisition module for acquiring actual key visual information of the current environment image; a change reward determination module for determining attention change reward information according to the relied-on key visual information and the actual key visual information; and a reward feedback adjustment module configured to adjust the reward feedback of the agent's reinforcement learning according to the attention change reward information.
- an electronic device including: a memory for storing a computer program; a processor for executing a computer program stored in the memory, and when the computer program is executed, Implement any method embodiment of the present disclosure.
- a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, any method embodiment of the present disclosure is implemented.
- a computer program including computer instructions, and when the computer instructions are run in a processor of a device, any method embodiment of the present disclosure is implemented.
- Based on the agent reinforcement learning method, the agent reinforcement learning device, the electronic device, the computer-readable storage medium, and the computer program provided by the embodiments of the present disclosure, obtaining the key visual information on which the agent bases its decision about the current environment image makes it possible to use the actual key visual information of the current environment image to measure the attention change (such as an attention shift) of the agent when it makes a decision about the current environment image, so that the attention change can be used to determine the attention change reward information.
- The embodiments of the present disclosure adjust the reward feedback of the agent's reinforcement learning using the attention change reward information, so that the reward feedback reflects that information; performing reinforcement learning on the agent with such reward feedback can reduce the probability that inaccurate attention (such as an attention shift) causes the agent to perform dangerous actions. The technical solution provided by the embodiments of the present disclosure is therefore beneficial to improving the behavioral safety of the agent.
- FIG. 1 is a flowchart of an agent reinforcement learning method according to an embodiment of the present disclosure.
- FIG. 2 is a schematic diagram of a network structure of an agent.
- FIG. 3 is another schematic diagram of a network structure of an agent.
- FIG. 4 is a flowchart of obtaining an agent's value attention map for a current environment image according to an embodiment of the present disclosure.
- FIG. 5 is a schematic diagram of acquiring an agent's value attention map for a current environment image according to an embodiment of the present disclosure.
- FIG. 6 is a schematic structural diagram of an agent reinforcement learning device according to an embodiment of the present disclosure.
- FIG. 7 is a block diagram of an exemplary device implementing an embodiment of the present disclosure.
- In the embodiments of the present disclosure, "a plurality" may refer to two or more, and "at least one" may refer to one, two, or more.
- The term "and/or" in the disclosed embodiments merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" can mean: A exists alone, both A and B exist, or B exists alone.
- The character "/" generally indicates an "or" relationship between the objects before and after it.
- Embodiments of the present disclosure can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with many other general or special-purpose computing system environments or configurations.
- Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of these systems.
- Electronic devices such as a terminal device, a computer system, and a server can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system.
- program modules may include routines, programs, target programs, components, logic, and data structures, etc., which perform specific tasks or implement specific abstract data types.
- The computer system/server can be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network, and program modules may be located on local or remote computing system storage media including storage devices.
- FIG. 1 is a flowchart of an agent reinforcement learning method according to an embodiment of the present disclosure. As shown in FIG. 1, the method in this embodiment includes: S100, acquiring key visual information on which the agent bases its decision about the current environment image; S110, acquiring actual key visual information of the current environment image; S120, determining attention change reward information according to the relied-on key visual information and the actual key visual information; and S130, adjusting the reward feedback of the agent's reinforcement learning according to the attention change reward information.
- The agent in the embodiments of the present disclosure may be an object with artificial intelligence formed on the basis of reinforcement learning, such as a moving board or a robotic arm that catches a falling ball in a game, a vehicle, a robot, or a smart home device.
- The embodiments of the present disclosure limit neither the specific form of the agent nor whether the agent is embodied as hardware, software, or a combination of software and hardware.
- The operation S100 may be performed by a processor calling a corresponding instruction stored in a memory, or may be performed by a key vision acquisition module 600 run by the processor.
- The operation S110 may be performed by a processor calling a corresponding instruction stored in a memory, or may be performed by an actual vision acquisition module 610 run by the processor.
- The operation S120 may be performed by a processor calling a corresponding instruction stored in a memory, or may be performed by a change reward determination module 620 run by the processor.
- Adjusting the reward feedback of the agent's reinforcement learning according to the attention change reward information may include: making the reward feedback of the agent's reinforcement learning include the attention change reward information, for example, adding the attention change reward information to the reward feedback.
- The operation S130 may be performed by a processor calling a corresponding instruction stored in a memory, or may be performed by a reward feedback adjustment module 630 run by the processor.
- The key visual information in the embodiments of the present disclosure may include an area of the image that needs attention, and may also include an attention area in the image.
- The relied-on key visual information may include: the attention area considered by the agent, that is, the agent's attention area for the current environment image when making a decision.
- The actual key visual information of the current environment image may include: the real key visual information of the current environment image, that is, the real attention area of the current environment image, namely the area where the target object in the current environment image is located.
- The attention change reward information may be determined according to the ratio of the intersection of the attention area for the current environment image with the area where the target object is located, to the area where the target object is located.
- The attention change reward information in the embodiments of the present disclosure is used to make the agent's attention area for the current environment image closer to the actual key visual information of the current environment image.
- The reward feedback in the embodiments of the present disclosure may include: the attention change reward information and the reward information formed by the agent making a decision about the current environment image.
- The reward information formed by the agent making a decision about the current environment image is usually the reward information that existing agents use for reinforcement learning.
- The embodiments of the present disclosure obtain the key visual information on which the agent bases its decision about the current environment image, so that the actual key visual information of the current environment image can be used to measure the agent's attention change when making decisions about the current environment image (such as an attention shift), after which the attention change can be used to determine the attention change reward information.
- The embodiments of the present disclosure adjust the reward feedback of the agent's reinforcement learning using the attention change reward information, so that the reward feedback reflects that information; using such reward feedback for the agent's reinforcement learning reduces the probability that inaccurate attention leads the agent to perform dangerous actions, which is conducive to improving the behavioral safety of the agent.
- An example of the above-mentioned dangerous action: when the agent should move, the agent's decision result is an empty action, so that the agent maintains its original state; the empty action determined at this time is a dangerous action.
- The embodiments of the present disclosure do not limit the specific forms of dangerous actions.
- An example of the network structure included in the agent during the reinforcement learning process is shown in FIG. 2.
- The agent in FIG. 2 includes a convolutional neural network (in the middle of FIG. 2), a decision network (policy network), and a value network.
- the agent can obtain the current environment image by interacting with the environment.
- The image shown at the bottom of FIG. 2 is an example of the current environment image.
- the current environment image is input to the convolutional neural network.
- The feature map of the current environment image formed by each preceding convolutional layer is provided to the subsequent convolutional layer, and the feature map of the current environment image formed by the last convolutional layer is provided to the decision network and the value network, respectively.
- the decision network performs decision processing on the feature maps it receives.
- the value network performs state value prediction processing on the received feature map to determine the state value of the current environment image.
- The agent in FIG. 3 includes a convolutional neural network (in the middle of FIG. 3), an RNN (recurrent neural network), a decision network, and a value network.
- the agent can obtain the current environment image by interacting with the environment.
- The image shown at the bottom of FIG. 3 is an example of the current environment image.
- the current environment image is input to the convolutional neural network.
- The feature map of the current environment image formed by each preceding convolutional layer is provided to the subsequent convolutional layer, and the feature map of the current environment image formed by the last convolutional layer is provided to the RNN, which can convert the time-series information of the feature map into a one-dimensional feature vector.
- the feature map and time-series feature vectors output by the RNN are provided to the decision network and the value network, respectively.
- the decision network performs decision processing on the feature maps and time-series feature vectors it receives.
- the value network performs state value prediction processing on the received feature map and time-series feature vectors to determine the state value of the current environment image.
- FIG. 2 and FIG. 3 are only optional examples of the network structure of the agent during the reinforcement learning process; the network structure of the agent may also take other forms, and the embodiments of the present disclosure do not limit its specific form.
- The relied-on key visual information in the embodiments of the present disclosure can reflect the attention information of the agent (for example, the decision network in the agent) on the current environment image when making a decision.
- the timing of making a decision may depend on a preset setting.
- the agent may be preset to make a decision every 0.2 seconds.
- the decision result in the embodiment of the present disclosure may be selecting an action from the action space.
- The embodiments of the present disclosure can obtain, through the value network of the agent, the heat map corresponding to the agent's attention to the current environment image when it makes a decision; the agent can then use the heat map to determine the key visual information on which it bases its decision about the current environment image.
- The pixels in the heat map can be filtered according to a preset threshold to select the pixels whose values exceed the preset threshold; then, based on the area formed by the selected pixels, the agent's attention area for the current environment image when making a decision can be determined.
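- As a minimal illustration of this thresholding step, the following Python sketch (the threshold value, array shape, and function name are illustrative assumptions, not taken from the disclosure) selects the heat map pixels that exceed a preset threshold and derives a binary attention mask:

```python
import numpy as np

def attention_area_from_heatmap(heatmap: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a binary mask of the attention area: pixels whose heat value
    exceeds `threshold` are kept, all other pixels are filtered out."""
    return heatmap > threshold

# Example: an 84x84 heat map with values normalized to [0, 1].
heatmap = np.random.rand(84, 84)
attention_mask = attention_area_from_heatmap(heatmap, threshold=0.5)
print("attention pixels:", int(attention_mask.sum()))
```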
- When the agent in the embodiments of the present disclosure makes a decision, its attention to the current environment image may be embodied using a value attention map (Value Attention Map).
- The value attention map may include the key visual information on which the value network of the agent relies when making state value judgments.
- Obtaining the key visual information on which the agent bases its decision about the current environment image may include: obtaining the agent's value attention map for the current environment image; synthesizing the value attention map and the current environment image to obtain a heat map; and determining the agent's attention area for the current environment image according to the heat map.
- the embodiment of the present disclosure can obtain the value attention map of the current environment image in various ways.
- the embodiment of the present disclosure can obtain the value attention map by using the process shown in FIG. 4.
- S400: acquiring a feature map of the current environment image.
- The feature map in the embodiments of the present disclosure is generally a high-level feature map formed by the agent's convolutional neural network for the current environment image.
- For example, the current environment image is input into the agent's convolutional neural network, and the feature map output by the last layer of the convolutional neural network is used as the feature map of the current environment image in S400.
- Alternatively, the feature map output by the penultimate layer of the convolutional neural network may be used as the feature map of the current environment image in S400, as long as it is a high-level feature map of the convolutional neural network.
- The high-level feature map in the embodiments of the present disclosure can be considered as: in a case where the structure of the agent's convolutional neural network is divided into two, three, or more stages, a feature map formed for the current environment image by any layer in a middle stage or the last stage.
- The high-level feature map in the embodiments of the present disclosure may also be considered as a feature map formed by a layer that is close or closest to the output of the agent's convolutional neural network.
- A changed feature map in the embodiments of the present disclosure is a feature map that differs from the feature map in S400 because a corresponding channel of that feature map has been masked.
- An example of obtaining the changed feature maps in the embodiments of the present disclosure: first, by masking the first channel of the feature map, the first changed feature map is obtained; second, by masking the second channel, the second changed feature map is obtained; and so on, until, by masking the last channel of the feature map, the last changed feature map is obtained.
- Masking the corresponding channels of the feature map in the embodiments of the present disclosure may also be considered as masking the corresponding activation information of the hidden layer.
- For a feature map with n channels, the embodiments of the present disclosure can thus obtain n changed feature maps.
- The embodiments of the present disclosure may use existing methods to mask the activation information of the corresponding hidden layer to obtain the changed feature maps; the specific implementation is not described in detail here.
- Each obtained changed feature map may first be input into the value network of the agent to obtain the state value of each changed feature map.
- The value network performs state value prediction processing on each changed feature map separately, so that the state value of each changed feature map can be obtained; for example, n state values V_1, V_2, ..., V_n can be obtained for n changed feature maps.
- The embodiments of the present disclosure can then calculate the difference between the state value V output by the value network for the feature map in S400 and the state value of each changed feature map, so as to obtain the state value change amount of each changed feature map relative to the feature map of the current environment image.
- For example, by calculating the difference between V and V_1, between V and V_2, ..., and between V and V_n, the embodiments of the present disclosure can obtain n differences, that is, ΔV_1, ΔV_2, ..., ΔV_n (as shown at the upper right of FIG. 5).
- ΔV_1, ΔV_2, ..., ΔV_n are the state value change amounts of the n changed feature maps relative to the feature map of the current environment image, respectively.
- The embodiments of the present disclosure may use the following formula (1) to calculate the state value change amount of a changed feature map relative to the feature map of the current environment image:
- ΔV_i = V − f_V(B_i ⊙ H)    (1)
- where ΔV_i represents the state value change amount, V represents the state value formed by the value network for the feature map of the current environment image, H represents the feature map of the current environment image, B_i ⊙ H represents the changed feature map obtained after the i-th channel of the feature map is masked, and f_V(B_i ⊙ H) represents the state value formed by the value network for that changed feature map; i is an integer greater than 0 and not greater than n, and n is an integer greater than 1.
- The embodiments of the present disclosure sequentially mask different activation information of the hidden layer and obtain the state value change amount of each changed feature map relative to the feature map, so that the different state value change amounts can reflect the degree of attention the agent pays to different areas.
- The above operations S400-S430 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the key vision acquisition module 600 run by the processor.
- The embodiments of the present disclosure may perform normalization processing on the state value change amounts to form the weight of each changed feature map.
- An example of normalizing the state value change amounts is given by formula (2), in which ω_i represents the weight of the i-th changed feature map.
- The embodiments of the present disclosure may form the value attention map by the following formula (3):
- A = Σ_{i=1}^{K} ω_i · H_i    (3)
- where A represents the value attention map, H_i represents the feature map of the i-th channel output by the last convolutional layer of the convolutional neural network, and K is the number of channels.
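- The masking-and-weighting procedure of formulas (1) to (3) can be sketched in PyTorch as follows; this is a minimal sketch rather than the disclosure's implementation, and the softmax used here to normalize the ΔV_i values into the weights ω_i of formula (2) is an assumption (any normalization producing such weights would match the description):

```python
import torch

def value_attention_map(feature_map: torch.Tensor, value_net) -> torch.Tensor:
    """Build a value attention map A from a (K, h, w) feature map H.

    value_net maps a (1, K, h, w) feature map to a scalar state value f_V(.).
    """
    K = feature_map.shape[0]
    with torch.no_grad():
        v = value_net(feature_map.unsqueeze(0)).squeeze()       # V = f_V(H)
        # Formula (1): mask each channel i in turn, ΔV_i = V - f_V(B_i ⊙ H).
        delta_v = torch.empty(K)
        for i in range(K):
            masked = feature_map.clone()
            masked[i] = 0.0                                     # shield the i-th channel
            delta_v[i] = v - value_net(masked.unsqueeze(0)).squeeze()
        # Formula (2): normalize ΔV_i into weights ω_i (softmax assumed).
        weights = torch.softmax(delta_v, dim=0)
        # Formula (3): A = Σ_i ω_i · H_i, a weighted sum over the K channels.
        return (weights.view(K, 1, 1) * feature_map).sum(dim=0)

# A toy value network is enough to exercise the sketch:
K, h, w = 32, 7, 7
value_net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(K * h * w, 1))
A = value_attention_map(torch.randn(K, h, w), value_net)  # (h, w) attention map
```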
- the embodiments of the present disclosure may also use existing methods to obtain the value attention map of the current environment image when the agent makes a decision.
- the embodiment of the present disclosure does not limit the specific implementation process of acquiring the value attention map for the current environment image when the agent makes a decision.
- The embodiments of the present disclosure may first adjust the size of the value attention map A obtained above, for example, by performing upsampling processing on it so that its size matches the size of the current environment image; after that, the resized value attention map A′ is fused with the current environment image (such as the image in the lower left corner of FIG. 5) to obtain the heat map corresponding to the value attention map of the current environment image.
- An optional example of such a heat map is shown in the image in the lower right corner of FIG. 5.
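- A minimal sketch of this resize-and-fuse step follows; the bilinear upsampling and the equal-weight blend are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def heatmap_from_attention(attention_map: torch.Tensor,
                           environment_image: torch.Tensor,
                           alpha: float = 0.5) -> torch.Tensor:
    """Upsample the (h, w) value attention map A to the size of the (H, W)
    grayscale environment image and fuse the two into a heat map."""
    H, W = environment_image.shape
    resized = F.interpolate(attention_map[None, None], size=(H, W),
                            mode="bilinear", align_corners=False)[0, 0]
    # Rescale A' to [0, 1] so the blend is well balanced.
    resized = (resized - resized.min()) / (resized.max() - resized.min() + 1e-8)
    return alpha * resized + (1.0 - alpha) * environment_image
```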
- the actual key visual information of the current environment image in the embodiment of the present disclosure may include: a region where the target object is located in the current environment image.
- the embodiment of the present disclosure may use a target object detection algorithm to obtain a region where the target object is located in the current environment image.
- The embodiments of the present disclosure limit neither the specific target object detection algorithm nor the specific way in which the area where the target object is located in the current environment image is obtained.
- The attention change reward information in the embodiments of the present disclosure may reflect the gap between the area on which the agent focuses for the current environment image and the area on which it should actually focus; that is, the embodiments of the present disclosure can determine the attention change reward information according to the difference between the agent's attention area for the current environment image when making a decision and the area where the target object is located.
- The embodiments of the present disclosure may first determine the agent's attention area for the current environment image according to the relied-on key visual information, such as the heat map.
- For example, the pixels in the heat map may be filtered according to a preset threshold, and the pixels whose values exceed the preset threshold determine the agent's attention area a for the current environment image.
- The embodiments of the present disclosure may then calculate the ratio (a ∩ b) / b of the intersection of the attention area a with the area b where the target object is located in the current environment image, to the area b where the target object is located, and determine the attention change reward information according to this ratio.
- The ratio in the embodiments of the present disclosure, or the attention change reward information obtained from it, can be considered a safety evaluation index for the agent's behavior: the larger the ratio, the higher the safety of the agent's behavior; conversely, the smaller the ratio, the lower the safety of the agent's behavior.
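- A minimal sketch of this safety index, and of folding it into the reward feedback, is given below; the binary-mask representation of the two areas and the additive weighting are assumptions:

```python
import numpy as np

def attention_change_reward(attention_mask: np.ndarray,
                            target_mask: np.ndarray) -> float:
    """(a ∩ b) / b: the fraction of the target object area b that is
    covered by the agent's attention area a."""
    b = target_mask.sum()
    if b == 0:
        return 0.0
    return float(np.logical_and(attention_mask, target_mask).sum() / b)

def adjusted_reward(env_reward: float, attention_mask: np.ndarray,
                    target_mask: np.ndarray, weight: float = 1.0) -> float:
    # Reward feedback = environment reward + attention change reward.
    return env_reward + weight * attention_change_reward(attention_mask, target_mask)
```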
- The embodiments of the present disclosure adjust the reward feedback of the agent's reinforcement learning using the attention change reward information obtained above (for example, by adding it to the reward feedback of the agent's reinforcement learning), and use such reward feedback to update the network parameters of the agent (such as the network parameters of the convolutional neural network, the value network, and the policy network), so that the agent can reduce the probability that attention changes (such as attention shifts) during reinforcement learning cause it to perform dangerous actions.
- The network parameters of the agent may be updated based on the actor-critic algorithm in reinforcement learning.
- The specific goals of updating the network parameters of the agent include: making the state value predicted by the value network in the agent as close as possible to the accumulated reward information over an environment exploration cycle, and updating the network parameters of the decision network in the agent so as to increase the state value predicted by the value network.
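- A minimal actor-critic update sketch consistent with these two goals is shown below; the advantage-weighted policy loss and squared-error value loss are standard choices, and the interfaces are assumptions rather than the disclosure's implementation:

```python
import torch

def actor_critic_losses(log_prob: torch.Tensor,
                        value_pred: torch.Tensor,
                        returns: torch.Tensor):
    """log_prob: log-probabilities of the actions chosen by the decision network;
    value_pred: state values predicted by the value network;
    returns: accumulated reward over the exploration cycle, already
    including the attention change reward."""
    advantage = returns - value_pred.detach()
    policy_loss = -(log_prob * advantage).mean()       # push the policy toward higher state value
    value_loss = (returns - value_pred).pow(2).mean()  # fit V(s) to the accumulated reward
    return policy_loss, value_loss
```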
- For example, in a brick-breaking game, the ball drops rapidly due to gravity during its fall.
- For the moving board that catches the falling ball, dangerous actions (such as the moving board performing an empty action) are often performed because its attention lags.
- The embodiments of the present disclosure use reward feedback (such as reward information) that reflects the attention change reward information for the moving board's reinforcement learning, which helps avoid the attention lag of the moving board and thus reduces the probability of the moving board performing dangerous actions.
- The agent in the embodiments of the present disclosure may be an agent that has already undergone a certain degree of reinforcement learning.
- For example, an existing reinforcement learning method may first be used to let the agent perform reinforcement learning based on reward feedback that does not include the attention change reward information; when it is determined that the agent's degree of reinforcement learning reaches a certain requirement (for example, the entropy of the decision network drops to a certain value, such as 0.6), the technical solution provided by the embodiments of the present disclosure is adopted to continue the agent's reinforcement learning, which is beneficial to improving the efficiency and success rate of the agent's reinforcement learning.
- The embodiments of the present disclosure may select important reinforcement learning training data from the sampled reinforcement learning training data and store it as historical training data, so as to facilitate the experience replay process.
- By selecting important reinforcement learning training data for storage as historical training data, the embodiments of the present disclosure can effectively reduce the cache space required for historical training data; using important reinforcement learning training data as the historical training data for experience replay is beneficial to improving the agent's reinforcement learning efficiency.
- Optionally, the method may further include: determining the degree of exploration within an environment exploration cycle according to the relied-on key visual information; and, when it is determined that the degree of exploration does not meet a predetermined degree of exploration, performing experience replay using stored historical training data.
- The historical training data may include: training data obtained by filtering the sampled reinforcement learning training data according to preset requirements.
- Determining the degree of exploration within an environment exploration cycle according to the relied-on key visual information may include: determining the attention change amount within the environment exploration cycle according to the change information between the agent's value attention maps for the current environment images at multiple adjacent moments within the cycle; the attention change amount is used to measure the degree of exploration within the environment exploration cycle.
- The embodiments of the present disclosure may use the positive reward within an environment exploration cycle and the degree of exploration within that cycle to determine the importance of the reinforcement learning training data in the cycle; when it is judged that the importance meets a predetermined requirement, the reinforcement learning training data of the cycle can be cached as historical training data.
- The degree of exploration of an environment exploration cycle in the embodiments of the present disclosure may be reflected by the attention change amount within the cycle.
- The embodiments of the present disclosure may determine the attention change amount within the environment exploration cycle from the change information between the value attention maps of the current environment images at multiple adjacent moments, and use the attention change amount as the degree of exploration within the cycle.
- Formula (4) can be used to calculate the attention change amount within an environment exploration cycle.
- The embodiments of the present disclosure may use the following formula (5) to calculate the importance of the reinforcement learning training data within an environment exploration cycle:
- S = r+ + λ · E    (5)
- where S represents the importance of the reinforcement learning training data within the environment exploration cycle, λ represents a hyperparameter that is usually a constant between 0 and 1, r+ represents the positive reward within the environment exploration cycle, and E represents the average attention change amount within the environment exploration cycle.
- If the importance meets the predetermined requirement, all reinforcement learning training data within the environment exploration cycle (such as the reward information and the current environment images) may be cached as historical training data; otherwise, none of the reinforcement learning training data within the cycle is retained.
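- A minimal sketch of this importance-based caching follows; the mean absolute difference used for the attention change amount of formula (4), the linear combination of formula (5), and the threshold value are assumptions consistent with the variable descriptions above:

```python
import numpy as np

def cycle_importance(attention_maps, positive_reward: float, lam: float = 0.5) -> float:
    """Importance S of one exploration cycle: S = r+ + λ · E, where E is taken
    here as the mean change between value attention maps at adjacent moments."""
    changes = [np.abs(a2 - a1).mean()
               for a1, a2 in zip(attention_maps[:-1], attention_maps[1:])]
    E = float(np.mean(changes)) if changes else 0.0
    return positive_reward + lam * E

def maybe_cache(cycle_data, attention_maps, positive_reward: float,
                buffer: list, threshold: float = 1.0) -> None:
    # Cache all of the cycle's training data only if its importance meets the requirement.
    if cycle_importance(attention_maps, positive_reward) >= threshold:
        buffer.append(cycle_data)
```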
- The embodiments of the present disclosure may use the cached historical training data to adjust the network parameters of the agent by way of experience replay, for example, adjusting the network parameters of the policy network, the value network, and the convolutional neural network, or the network parameters of the policy network, the value network, the RNN, and the convolutional neural network.
- The embodiments of the present disclosure determine the degree of exploration within an environment exploration cycle; when it is determined that the degree of exploration does not meet the predetermined degree of exploration, a random number may be generated.
- If the random number exceeds a predetermined value (such as 0.3), it is determined that experience replay is required, so that the embodiments of the present disclosure can perform the experience replay operation using pre-stored historical training data; if the random number does not exceed the predetermined value, it can be determined that experience replay is not required.
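- This replay trigger can be sketched as follows; the 0.3 cutoff follows the example value in the text, while the function interface is illustrative:

```python
import random

def should_replay(exploration_degree: float, predetermined_degree: float,
                  cutoff: float = 0.3) -> bool:
    """Trigger experience replay only when exploration is insufficient and a
    freshly generated random number exceeds the predetermined cutoff."""
    if exploration_degree >= predetermined_degree:
        return False
    return random.random() > cutoff
```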
- The specific implementation of experience replay can use existing implementations and is not described in detail here.
- The agent reinforcement learning method may be executed by any appropriate device having data processing capability, including but not limited to a terminal device and a server.
- Any agent reinforcement learning method provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor executes any agent reinforcement learning method mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. This will not be repeated below.
- The foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiment are performed.
- The foregoing storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
- FIG. 6 is a schematic structural diagram of an embodiment of an agent reinforcement learning device according to an embodiment of the present disclosure.
- The device in this embodiment mainly includes: a key vision acquisition module 600, an actual vision acquisition module 610, a change reward determination module 620, and a reward feedback adjustment module 630.
- Optionally, the device may further include: an experience replay module 640 and a training data acquisition module 650.
- The key vision acquisition module 600 is used for acquiring the key visual information on which the agent bases its decision about the current environment image.
- The relied-on key visual information may include: the agent's attention area for the current environment image when making a decision.
- The key vision acquisition module 600 may further be used to: first, obtain the agent's value attention map for the current environment image; then synthesize the value attention map and the current environment image to obtain a heat map; and finally determine the agent's attention area for the current environment image according to the heat map.
- Optionally, the key vision acquisition module 600 may obtain the value attention map as follows: first, it acquires a feature map of the current environment image; after that, it obtains the changed feature maps formed by masking each channel of the feature map in turn; then, it acquires the state value change amount of each changed feature map relative to the feature map; finally, it forms the value attention map from the state value change amounts and the changed feature maps.
- Optionally, the key vision acquisition module 600 may acquire the feature map of the current environment image as follows: it inputs the current environment image into a convolutional neural network and acquires the feature map output by the last convolutional layer of the convolutional neural network; the feature map output by the last convolutional layer is the feature map of the current environment image acquired by the key vision acquisition module.
- Optionally, the key vision acquisition module 600 may acquire the state value change amount of each changed feature map relative to the feature map as follows: first, it inputs each changed feature map into the value network of the agent to obtain the state value of each changed feature map; then, it calculates the differences between the state value output by the value network for the feature map and the state values of the changed feature maps, to obtain the state value change amount of each changed feature map relative to the feature map.
- The actual vision acquisition module 610 is configured to acquire the actual key visual information of the current environment image.
- the actual key visual information of the current environment image in the embodiment of the present disclosure may include: a region where the target object is located in the current environment image.
- The change reward determination module 620 is configured to determine the attention change reward information according to the above relied-on key visual information and the actual key visual information.
- Optionally, the change reward determination module 620 may determine the attention change reward information according to the ratio of the intersection of the agent's attention area for the current environment image when making a decision with the area where the target object is located, to the area where the target object is located.
- The reward feedback adjustment module 630 is configured to adjust the reward feedback of the agent's reinforcement learning according to the attention change reward information.
- the reward feedback of the agent's reinforcement learning in the embodiment of the present disclosure may include: attention change reward information and reward information formed by the agent's decision on the current environment image.
- The experience replay module 640 is used to determine the degree of exploration within an environment exploration cycle according to the relied-on key visual information and, if it is determined that the degree of exploration does not meet the predetermined degree of exploration, to perform experience replay using the stored historical training data.
- The historical training data in the embodiments of the present disclosure includes: training data obtained by filtering the sampled reinforcement learning training data according to preset requirements.
- Optionally, the experience replay module 640 may determine the attention change amount within an environment exploration cycle according to the change information between the agent's value attention maps for the current environment images at multiple adjacent moments within the cycle; the attention change amount is used to measure the degree of exploration within the environment exploration cycle.
- The training data acquisition module 650 is used to determine the importance of the reinforcement learning training data sampled during an environment exploration cycle according to the positive reward and the degree of exploration during the cycle, and to store, as historical training data, the reinforcement learning training data sampled during the cycle whose importance meets the predetermined requirement.
- For the specific operations performed by the key vision acquisition module 600, the actual vision acquisition module 610, the change reward determination module 620, the reward feedback adjustment module 630, the experience replay module 640, and the training data acquisition module 650, reference may be made to the descriptions of FIG. 1 to FIG. 5 in the above method embodiments; the description is not repeated here.
- FIG. 7 shows an exemplary device 700 suitable for implementing embodiments of the present disclosure.
- The device 700 may be a control system/electronic system configured in a car, a mobile terminal (for example, a smart phone), a personal computer (for example, a desktop or laptop PC), a tablet computer, or a server.
- The device 700 includes one or more processors, a communication unit, and the like; the one or more processors may be, for example, one or more central processing units (CPUs) 701 and/or one or more graphics processing units (GPUs) 713.
- The processor can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 702 or executable instructions loaded from a storage section 708 into a random access memory (RAM) 703.
- The communication unit 712 may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (Infiniband) network card.
- The processor may communicate with the read-only memory 702 and/or the random access memory 703 to execute the executable instructions, connect to the communication unit 712 through a bus 704, and communicate with other target devices via the communication unit 712, thereby completing the corresponding steps of the agent reinforcement learning method of any embodiment of the present disclosure.
- the RAM 703 can also store various programs and data required for device operation.
- the CPU 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704.
- ROM 702 is an optional module.
- the RAM 703 stores executable instructions, or writes executable instructions to the ROM 702 at runtime, and the executable instructions cause the central processing unit 701 to execute the steps included in the method for reinforcement learning of an agent of any of the above embodiments.
- An input / output (I / O) interface 705 is also connected to the bus 704.
- The communication unit 712 may be provided in an integrated manner, or may be provided as a plurality of sub-modules (for example, a plurality of IB network cards) that are respectively connected to the bus.
- The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem.
- the communication section 709 performs communication processing via a network such as the Internet.
- A drive 710 is also connected to the I/O interface 705 as needed.
- A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed on the drive 710 as needed, so that a computer program read from it is installed into the storage section 708 as needed.
- FIG. 7 is only an optional implementation; in practice, the number and types of the components in FIG. 7 may be selected, deleted, added, or replaced according to actual needs.
- Different functional components may be provided separately or in an integrated manner; for example, the GPU 713 and the CPU 701 may be provided separately, or the GPU 713 may be integrated on the CPU 701, and the communication unit may be provided separately or integrated on the CPU 701 or the GPU 713.
- In particular, the process described with reference to the flowchart may be implemented as a computer software program; for example, an embodiment of the present disclosure includes a computer program product that comprises a computer program tangibly contained in a machine-readable medium.
- The computer program includes program code for executing the steps shown in the flowchart, and the program code may include instructions corresponding to the steps of the agent reinforcement learning method provided by any embodiment of the present disclosure.
- The computer program may be downloaded and installed from a network through the communication section 709 and/or installed from the removable medium 711.
- When the computer program is executed by the central processing unit (CPU) 701, the instructions implementing the corresponding operations described in the agent reinforcement learning method of any embodiment of the present disclosure are executed.
- An embodiment of the present disclosure further provides a computer program product for storing computer-readable instructions; when the instructions are executed, the computer performs the agent reinforcement learning method of any of the embodiments described above.
- The computer program product may be implemented by hardware, software, or a combination thereof.
- Optionally, the computer program product is embodied as a computer storage medium; in another optional example, it is embodied as a software product, such as a Software Development Kit (SDK).
- The embodiments of the present disclosure further provide another agent reinforcement learning method and a corresponding device, electronic device, computer storage medium, computer program, and computer program product; the method includes: a first device sends an agent reinforcement learning instruction to a second device, the instruction causing the second device to execute the agent reinforcement learning method in any of the possible embodiments described above; and the first device receives the agent reinforcement learning result sent by the second device.
- Optionally, the agent reinforcement learning instruction may specifically be a call instruction; the first device may instruct the second device, by means of a call, to perform the agent reinforcement learning operation, and accordingly, in response to receiving the call instruction, the second device may perform the steps and/or processes of any of the above embodiments of the agent reinforcement learning method.
- The methods and devices of the embodiments of the present disclosure may be implemented in many ways.
- the methods and devices, electronic devices, and computer-readable storage media of the embodiments of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware.
- the above order of the steps of the method is for illustration only, and the steps of the method of the embodiments of the present disclosure are not limited to the order specifically described above, unless otherwise specifically stated.
- embodiments of the present disclosure may also be implemented as programs recorded in a recording medium, and the programs include machine-readable instructions for implementing a method according to an embodiment of the present disclosure.
- the embodiments of the present disclosure also cover a recording medium storing a program for executing a method according to an embodiment of the present disclosure.
Abstract
Description
Claims (27)
- 一种智能体强化学习方法,其特征在于,包括:获取智能体针对当前环境图像进行决策所依据的关键视觉信息;获取所述当前环境图像的实际关键视觉信息;根据所述所依据的关键视觉信息以及所述实际关键视觉信息,确定注意力变化回报信息;根据所述注意力变化回报信息调整智能体强化学习的回报反馈。
- 根据权利要求1所述的方法,其特征在于,所述所依据的关键视觉信息包括:所述智能体在做出决策时,针对所述当前环境图像的注意力区域。
- 根据权利要求2所述的方法,其特征在于,所述获取智能体针对当前环境图像进行决策所依据的关键视觉信息,包括:获取所述智能体针对所述当前环境图像的价值注意力图;对所述价值注意力图和所述当前环境图像进行合成处理,获得热力图;根据所述热力图确定所述智能体针对当前环境图像的注意力区域。
- 根据权利要求3所述的方法,其特征在于,所述获取智能体针对当前环境图像的价值注意力图,包括:获取所述当前环境图像的特征图;根据所述特征图获得依次屏蔽所述特征图各通道而形成的各改变特征图;获取所述各改变特征图分别相对于所述特征图的状态价值改变量;根据各状态价值改变量以及各改变特征图形成所述价值注意力图。
- 根据权利要求4所述的方法,其特征在于,所述获取当前环境图像的特征图,包括:将所述当前环境图像输入卷积神经网络,并获取所述卷积神经网络的最后一层卷积层输出的特征图。
- 根据权利要求4至5中任一项所述的方法,其特征在于,所述获取所述各改变特征图分别相对于所述特征图的状态价值改变量,包括:将所述各改变特征图输入智能体的价值网络,以获得所述各改变特征图的状态价值;计算所述价值网络针对所述特征图输出的状态价值,分别与所述各改变特征图的状态价值的差值,以获得所述各改变特征图分别相对于所述特征图的状态价值改变量。
- 根据权利要求1至6中任一项所述的方法,其特征在于,所述当前环境图像的实际关键视觉信息包括:当前环境图像中的目标对象所在区域。
- 根据权利要求7所述的方法,其特征在于,所述根据所述所依据的关键视觉信息以及所述实际关键视觉信息,确定注意力变化回报信息,包括:根据所述智能体在做出决策时,针对所述当前环境图像的注意力区域,与所述目标对象所在区域的交集与目标对象所在区域的比值,确定注意力变化回报信息。
- 根据权利要求1至8中任一项所述的方法,其特征在于,所述智能体强化学习的回报反馈包括:注意力变化回报信息以及智能体针对当前环境图像进行决策所形成的回报信息。
- 根据权利要求1至9中任一项所述的方法,其特征在于,所述方法还包括:根据所述所依据的关键视觉信息,确定环境探索周期内的探索程度;在判断出所述探索程度不符合预定探索程度的情况下,利用存储的历史训练数据进行经验回放;所述历史训练数据包括:利用预设要求对采样的强化学习训练数据进行筛选,而获得的训练数据。
- 根据权利要求10所述的方法,其特征在于,所述根据所述所依据的关键视觉信息,确定环境探索周期内的探索程度,包括:根据智能体针对环境探索周期内的多个相邻时刻的当前环境图像的价值注意力图之间的变化信息,确定该环境探索周期内的注意力改变量;其中,所述注意力改变量用于衡量所述环境探索周期内的探索程度。
- 根据权利要求11所述的方法,其特征在于,所述方法还包括:根据所述环境探索周期内的正向回报和所述探索程度,确定所述环境探索周期内采样的强化学习训练数据的重要程度;将该环境探索周期内采样的重要程度符合预定要求的强化学习训练数据作为历史训练数据存储。
- An agent reinforcement learning apparatus, characterized by comprising: a key-vision acquisition module configured to acquire key visual information on which an agent bases a decision for a current environment image; an actual-vision acquisition module configured to acquire actual key visual information of the current environment image; a change-reward determination module configured to determine attention-change reward information according to the key visual information relied upon and the actual key visual information; and a reward-feedback adjustment module configured to adjust reward feedback of the agent's reinforcement learning according to the attention-change reward information.
- The apparatus according to claim 13, wherein the key visual information relied upon comprises: an attention region of the agent with respect to the current environment image when the agent makes the decision.
- The apparatus according to claim 14, wherein the key-vision acquisition module is configured to: acquire a value attention map of the agent for the current environment image; synthesize the value attention map and the current environment image to obtain a heat map; and determine the attention region of the agent for the current environment image according to the heat map.
- The apparatus according to claim 15, wherein the key-vision acquisition module is configured to: acquire a feature map of the current environment image; obtain, from the feature map, changed feature maps formed by masking the channels of the feature map one by one; acquire a state-value change of each changed feature map relative to the feature map; and form the value attention map according to the state-value changes and the changed feature maps.
- The apparatus according to claim 16, wherein the key-vision acquisition module is configured to: input the current environment image into a convolutional neural network, and acquire the feature map output by the last convolutional layer of the convolutional neural network; wherein the feature map output by the last convolutional layer is the feature map of the current environment image acquired by the key-vision acquisition module.
- The apparatus according to any one of claims 16 to 17, wherein the key-vision acquisition module is configured to: input each changed feature map into a value network of the agent to obtain a state value of each changed feature map; and compute differences between the state value output by the value network for the feature map and the state values of the changed feature maps, to obtain the state-value change of each changed feature map relative to the feature map.
- The apparatus according to any one of claims 13 to 18, wherein the actual key visual information of the current environment image comprises: a region in which a target object is located in the current environment image.
- The apparatus according to claim 19, wherein the change-reward determination module is configured to: determine the attention-change reward information according to the ratio of the intersection of the agent's attention region for the current environment image when making the decision with the region in which the target object is located, to the region in which the target object is located.
- The apparatus according to any one of claims 13 to 20, wherein the reward feedback of the agent's reinforcement learning comprises: the attention-change reward information, and reward information formed by the agent's decision for the current environment image.
- The apparatus according to any one of claims 13 to 21, wherein the apparatus further comprises an experience replay module configured to: determine a degree of exploration within an environment exploration period according to the key visual information relied upon; and perform experience replay with stored historical training data when it is determined that the degree of exploration does not meet a predetermined degree of exploration; wherein the historical training data comprises: training data obtained by filtering sampled reinforcement learning training data according to preset requirements.
- The apparatus according to claim 22, wherein the experience replay module is configured to: determine an attention-change amount within the environment exploration period according to change information between the agent's value attention maps for current environment images at multiple adjacent moments within the environment exploration period; wherein the attention-change amount is used to measure the degree of exploration within the environment exploration period.
- The apparatus according to claim 23, wherein the apparatus further comprises a training-data acquisition module configured to: determine the importance of reinforcement learning training data sampled within the environment exploration period according to the positive reward within the environment exploration period and the degree of exploration; and store, as historical training data, the reinforcement learning training data sampled within the environment exploration period whose importance meets a predetermined requirement.
- An electronic device, comprising: a memory configured to store a computer program; and a processor configured to execute the computer program stored in the memory, wherein when the computer program is executed, the method according to any one of claims 1 to 12 is implemented.
- A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the method according to any one of claims 1 to 12 is implemented.
- A computer program, comprising computer instructions, wherein when the computer instructions run in a processor of a device, the method according to any one of claims 1 to 12 is implemented.
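To make claims 3 to 6 concrete, the following is a minimal PyTorch sketch, not the disclosure's own implementation: it masks the channels of the last convolutional layer's feature map one by one, weights each channel by the resulting state-value change under an assumed scalar-valued `value_net`, blends the normalized map with the environment image into a heat map, and thresholds it into an attention region. The blending weights and the threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def value_attention_map(feature_map, value_net):
    # feature_map: (C, H, W) output of the last convolutional layer.
    # value_net: assumed to map a (1, C, H, W) feature map to a scalar V(s).
    base_value = value_net(feature_map.unsqueeze(0)).reshape(())
    changes = []
    for c in range(feature_map.shape[0]):
        changed = feature_map.clone()
        changed[c] = 0.0                            # mask channel c in turn
        v = value_net(changed.unsqueeze(0)).reshape(())
        changes.append(base_value - v)              # state-value change
    weights = torch.stack(changes).view(-1, 1, 1)
    # Channels whose removal lowers the state value most weigh the most.
    attn = torch.relu((weights * feature_map).sum(dim=0))
    return attn / (attn.max() + 1e-8)               # normalize to [0, 1]

def heat_map_and_region(attn, image, threshold=0.5):
    # Upsample the value attention map to image size, blend it with the
    # current environment image into a heat map, then threshold it to get
    # the attention region of claim 3.
    h, w = image.shape[-2:]
    attn_up = F.interpolate(attn[None, None], size=(h, w),
                            mode="bilinear", align_corners=False)[0, 0]
    heat_map = 0.5 * image.mean(dim=0) + 0.5 * attn_up
    return heat_map, attn_up > threshold
```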
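Claims 7 to 9 reduce to a coverage ratio: the attention-change reward grows with the share of the target object's region that falls inside the agent's attention region, and the overall reward feedback adds this to the ordinary decision reward. A sketch under the assumption of boolean mask tensors and an additive, weighted combination (neither is fixed by the claims):

```python
def attention_change_reward(attn_region, target_region, weight=1.0):
    # |A ∩ T| / |T|: the fraction of the target-object region covered by
    # the agent's attention region; attn_region and target_region are
    # boolean tensors of identical shape.
    intersection = (attn_region & target_region).sum().float()
    target_area = target_region.sum().float() + 1e-8
    return weight * intersection / target_area

def adjusted_reward_feedback(env_reward, attn_region, target_region):
    # Claim 9: reward feedback combines the attention-change reward with
    # the reward formed by the agent's decision itself.
    return env_reward + attention_change_reward(attn_region, target_region)
```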
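Claims 10 to 12 describe attention-guided experience replay: measure the degree of exploration from how much the value attention maps change across adjacent moments, replay stored history when exploration is too low, and store only training data whose importance, derived from the period's positive reward and exploration degree, passes a filter. In this sketch the thresholds and the product-form importance score are assumptions, not claim language:

```python
import collections

class AttentionGuidedReplay:
    def __init__(self, exploration_threshold=0.1,
                 importance_threshold=0.5, capacity=10000):
        self.exploration_threshold = exploration_threshold
        self.importance_threshold = importance_threshold
        self.history = collections.deque(maxlen=capacity)

    @staticmethod
    def exploration_degree(attn_maps):
        # Claim 11: accumulate the change between value attention maps at
        # adjacent moments within one environment exploration period.
        return sum(float((b - a).abs().mean())
                   for a, b in zip(attn_maps, attn_maps[1:]))

    def maybe_store(self, transitions, positive_reward, degree):
        # Claim 12: keep only training data whose importance, computed from
        # the positive reward and the exploration degree, is high enough.
        if positive_reward * degree >= self.importance_threshold:
            self.history.extend(transitions)

    def maybe_replay(self, attn_maps, learner):
        # Claim 10: when exploration falls short of the predetermined
        # degree, replay the stored historical training data.
        if self.exploration_degree(attn_maps) < self.exploration_threshold:
            for transition in self.history:
                learner.update(transition)   # assumed learner interface
```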
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SG11202013079WA SG11202013079WA (en) | 2018-07-28 | 2019-07-16 | Intelligent agent reinforcement learning method and apparatus, device and medium |
JP2021500797A JP7163477B2 (ja) | 2018-07-28 | 2019-07-16 | 知能客体強化学習方法、装置、デバイス、及び媒体 |
US17/137,063 US20210117738A1 (en) | 2018-07-28 | 2020-12-29 | Intelligent agent reinforcement learning method and apparatus, device and medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810849877.6A CN109190720B (zh) | 2018-07-28 | 2018-07-28 | Intelligent agent reinforcement learning method and apparatus, device and medium |
CN201810849877.6 | 2018-07-28 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/137,063 Continuation US20210117738A1 (en) | 2018-07-28 | 2020-12-29 | Intelligent agent reinforcement learning method and apparatus, device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020024791A1 (zh) | 2020-02-06 |
Family
ID=64937811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/096233 WO2020024791A1 (zh) | 2019-07-16 | Intelligent agent reinforcement learning method and apparatus, device and medium |
Country Status (5)
Country | Link |
---|---|
US (1) | US20210117738A1 (zh) |
JP (1) | JP7163477B2 (zh) |
CN (1) | CN109190720B (zh) |
SG (1) | SG11202013079WA (zh) |
WO (1) | WO2020024791A1 (zh) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109190720B (zh) * | 2018-07-28 | 2021-08-06 | Shenzhen SenseTime Technology Co., Ltd. | Intelligent agent reinforcement learning method and apparatus, device and medium |
JP2022525423A (ja) * | 2019-03-20 | 2022-05-13 | Sony Group Corporation | Reinforcement learning through a double actor-critic algorithm |
CN111898727A (zh) * | 2019-05-06 | 2020-11-06 | Tsinghua University | Reinforcement learning method and apparatus based on a short-term access mechanism, and storage medium |
CN110147891B (zh) * | 2019-05-23 | 2021-06-01 | Beijing Horizon Robotics Technology R&D Co., Ltd. | Method, apparatus and electronic device applied to a reinforcement learning training process |
CN110225019B (zh) * | 2019-06-04 | 2021-08-31 | Tencent Technology (Shenzhen) Co., Ltd. | Network security processing method and apparatus |
CN113872924B (zh) * | 2020-06-30 | 2023-05-02 | Electronic Science Research Institute of China Electronics Technology Group Corporation | Multi-agent action decision method, apparatus, device and storage medium |
CN111791103B (zh) * | 2020-06-30 | 2022-04-29 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Filter debugging method and apparatus, electronic device and readable storage medium |
CN112216124B (zh) * | 2020-09-17 | 2021-07-27 | Zhejiang University of Technology | Traffic signal control method based on deep reinforcement learning |
CN113255893B (zh) * | 2021-06-01 | 2022-07-05 | Beijing Institute of Technology | Self-evolving generation method for multi-agent action policies |
CN113671834B (zh) * | 2021-08-24 | 2023-09-01 | Zhengzhou University | Robot flexible behavior decision method and device |
CN113867147A (zh) * | 2021-09-29 | 2021-12-31 | SenseTime Group Ltd. | Training and control method and apparatus, computing device and medium |
CN116805353B (zh) * | 2023-08-21 | 2023-10-31 | Chengdu Zhonggui Track Equipment Co., Ltd. | Cross-industry general-purpose intelligent machine vision perception method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106970615A (zh) * | 2017-03-21 | 2017-07-21 | Northwestern Polytechnical University | Real-time online path planning method based on deep reinforcement learning |
US20180174001A1 (en) * | 2016-12-15 | 2018-06-21 | Samsung Electronics Co., Ltd. | Method of training neural network, and recognition method and apparatus using neural network |
CN109190720A (zh) * | 2018-07-28 | 2019-01-11 | Shenzhen SenseTime Technology Co., Ltd. | Intelligent agent reinforcement learning method and apparatus, device and medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5346701B2 (ja) * | 2009-06-12 | 2013-11-20 | Honda Motor Co., Ltd. | Learning control system and learning control method |
CN117371492A (zh) * | 2016-11-04 | 2024-01-09 | DeepMind Technologies Limited | Computer-implemented method and system therefor |
CN107179077B (zh) * | 2017-05-15 | 2020-06-09 | Beihang University | Adaptive visual navigation method based on ELM-LRF |
- 2018
- 2018-07-28: CN application CN201810849877.6A granted as CN109190720B (active)
- 2019
- 2019-07-16: SG application SG11202013079WA (status unknown)
- 2019-07-16: JP application JP2021500797A granted as JP7163477B2 (active)
- 2019-07-16: WO application PCT/CN2019/096233 published as WO2020024791A1 (application filing)
- 2020
- 2020-12-29: US application US17/137,063 published as US20210117738A1 (abandoned)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180174001A1 (en) * | 2016-12-15 | 2018-06-21 | Samsung Electronics Co., Ltd. | Method of training neural network, and recognition method and apparatus using neural network |
CN106970615A (zh) * | 2017-03-21 | 2017-07-21 | Northwestern Polytechnical University | Real-time online path planning method based on deep reinforcement learning |
CN109190720A (zh) * | 2018-07-28 | 2019-01-11 | Shenzhen SenseTime Technology Co., Ltd. | Intelligent agent reinforcement learning method and apparatus, device and medium |
Non-Patent Citations (3)
Title |
---|
LIU, QUAN ET AL.: "A Deep Recurrent Q-Network Based on Visual Attention Mechanism", CHINESE JOURNAL OF COMPUTERS, vol. 40, no. 6, 30 June 2017 (2017-06-30) * |
PRICE, B. ET AL.: "Accelerating Reinforcement Learning through Implicit Imitation", JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, vol. 19, 12 March 2003 (2003-03-12), XP080507310 * |
YUN, S. ET AL.: "Action-Driven Visual Object Tracking with Deep Reinforcement Learning", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 29, no. 6, 30 June 2018 (2018-06-30), XP055681249 * |
Also Published As
Publication number | Publication date |
---|---|
US20210117738A1 (en) | 2021-04-22 |
CN109190720B (zh) | 2021-08-06 |
CN109190720A (zh) | 2019-01-11 |
JP7163477B2 (ja) | 2022-10-31 |
SG11202013079WA (en) | 2021-02-25 |
JP2021532457A (ja) | 2021-11-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020024791A1 (zh) | Intelligent agent reinforcement learning method and apparatus, device and medium | |
JP7009614B2 (ja) | Deep neural network normalization method and apparatus, device, and storage medium | |
WO2022083536A1 (zh) | Neural network construction method and apparatus | |
US20220012533A1 (en) | Object Recognition Method and Apparatus | |
TWI721510B (zh) | Binocular image depth estimation method, device and storage medium | |
WO2020256704A1 (en) | Real-time video ultra resolution | |
US11270124B1 (en) | Temporal bottleneck attention architecture for video action recognition | |
WO2019214344A1 (zh) | System reinforcement learning method and apparatus, electronic device, and computer storage medium | |
CN112651511A (zh) | Model training method, data processing method and apparatus | |
US11688077B2 (en) | Adaptive object tracking policy | |
CN110447041B (zh) | 噪声神经网络层 | |
WO2022179581A1 (zh) | Image processing method and related device | |
CN109934247A (zh) | 电子装置及其控制方法 | |
JP7227385B2 (ja) | Neural network training and eye open/closed state detection method, apparatus and device | |
WO2021103675A1 (zh) | Neural network training and face detection method and apparatus, device, and storage medium | |
Kumagai et al. | Mixture of counting CNNs | |
CN111783996B (zh) | Data processing method, apparatus and device | |
US20240078428A1 (en) | Neural network model training method, data processing method, and apparatus | |
US10757369B1 (en) | Computer implemented system and method for high performance visual tracking | |
CN113407820B (zh) | Method for data processing using a model, and related system and storage medium | |
US11388223B2 (en) | Management device, management method, and management program | |
Xu et al. | A deep deterministic policy gradient algorithm based on averaged state-action estimation | |
CN113868187A (zh) | Method and electronic apparatus for processing a neural network | |
US11816185B1 (en) | Multi-view image analysis using neural networks | |
KR102190584B1 (ko) | System and method for estimating human behavior patterns and behavioral strategies using meta reinforcement learning | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19844987; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2021500797; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/05/2021) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19844987; Country of ref document: EP; Kind code of ref document: A1 |