WO2020024791A1 - Agent reinforcement learning method, apparatus, device and medium - Google Patents

Agent reinforcement learning method, apparatus, device and medium

Info

Publication number
WO2020024791A1
WO2020024791A1 (PCT/CN2019/096233, CN2019096233W)
Authority
WO
WIPO (PCT)
Prior art keywords
attention
current environment
agent
environment image
change
Prior art date
Application number
PCT/CN2019/096233
Other languages
English (en)
French (fr)
Inventor
刘春晓
薛洋
张伟
林倞
Original Assignee
深圳市商汤科技有限公司
Application filed by 深圳市商汤科技有限公司 filed Critical 深圳市商汤科技有限公司
Priority to SG11202013079WA priority Critical patent/SG11202013079WA/en
Priority to JP2021500797A priority patent/JP7163477B2/ja
Publication of WO2020024791A1 publication Critical patent/WO2020024791A1/zh
Priority to US17/137,063 priority patent/US20210117738A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/192Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194References adjustable by an adaptive method, e.g. learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/008Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/255Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes

Definitions

  • the present disclosure relates to computer vision technology, and in particular, to a method for agent reinforcement learning, an agent reinforcement learning device, an electronic device, a computer-readable storage medium, and a computer program.
  • agents are commonly used in many technical fields such as games and robotics, for example, a moving board or a robot arm that catches a falling ball in a game.
  • during reinforcement learning, agents usually use the reward information obtained from trial and error in the environment to guide learning.
  • the embodiments of the present disclosure provide a technical solution for agent reinforcement learning.
  • a method for agent reinforcement learning includes: acquiring key visual information on which an agent bases its decision on a current environment image; acquiring actual key visual information of the current environment image; determining attention change reward information according to the key visual information on which the decision is based and the actual key visual information; and adjusting the reward feedback of the agent's reinforcement learning according to the attention change reward information.
  • an agent reinforcement learning device includes: an acquiring key vision module, configured to acquire key visual information on which the agent bases its decision on a current environment image; an acquiring actual vision module, configured to acquire actual key visual information of the current environment image; a determining change return module, configured to determine attention change reward information according to the key visual information on which the decision is based and the actual key visual information; and an adjusting return feedback module, configured to adjust the reward feedback of the agent's reinforcement learning according to the attention change reward information.
  • an electronic device including: a memory for storing a computer program; a processor for executing a computer program stored in the memory, and when the computer program is executed, Implement any method embodiment of the present disclosure.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, any method embodiment of the present disclosure is implemented.
  • a computer program including computer instructions, and when the computer instructions are run in a processor of a device, any method embodiment of the present disclosure is implemented.
  • based on the agent reinforcement learning method, agent reinforcement learning device, electronic device, computer-readable storage medium, and computer program provided by the embodiments of the present disclosure, by obtaining the key visual information on which the agent bases its decision on the current environment image, the actual key visual information of the current environment image can be used to measure the attention change (such as an attention shift) of the agent with respect to the current environment image when making the decision, and the attention change can then be used to determine the attention change reward information.
  • the embodiments of the present disclosure adjust the reward feedback of the agent's reinforcement learning by using the attention change reward information, so that the reward feedback reflects the attention change reward information; using such reward feedback for the reinforcement learning of the agent can reduce the probability that inaccurate attention of the agent (such as an attention shift) causes it to perform dangerous actions. Therefore, the technical solution provided by the embodiments of the present disclosure is beneficial to improving the behavioral safety of the agent.
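As a rough illustration of operations S100 to S130 (this sketch is not part of the patent text), the attention change reward can be computed from binary masks of the agent's attention area and of the region where the target object is located, and added to the reward formed by the agent's decision; the mask representation and the direct addition are assumptions made here for illustration.

```python
import numpy as np

def attention_change_reward(attention_mask: np.ndarray, target_mask: np.ndarray) -> float:
    """Ratio (a ∩ b) / b of the agent's attention area a overlapping the region b
    where the target object actually is; used as the attention change reward."""
    target_area = target_mask.sum()
    if target_area == 0:
        return 0.0
    return float(np.logical_and(attention_mask, target_mask).sum() / target_area)

def adjusted_reward_feedback(decision_reward: float,
                             attention_mask: np.ndarray,
                             target_mask: np.ndarray) -> float:
    """Reward feedback = reward formed by the agent's decision on the current
    environment image + attention change reward information (added directly,
    one of the adjustment options mentioned in the disclosure)."""
    return decision_reward + attention_change_reward(attention_mask, target_mask)
```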
  • FIG. 1 is a flowchart of an agent reinforcement learning method according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a network structure of an agent
  • FIG. 3 is another schematic diagram of the network structure of the agent.
  • FIG. 4 is a flowchart of obtaining a value attention map of an agent for a current environment image according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of acquiring a value attention map of an agent for a current environment image according to an embodiment of the present disclosure
  • FIG. 6 is a schematic structural diagram of an agent reinforcement learning device according to an embodiment of the present disclosure.
  • FIG. 7 is a block diagram of an exemplary device implementing an embodiment of the present disclosure.
  • a plurality may refer to two or more, and “at least one” may refer to one, two, or more.
  • the term "and/or" in the embodiments of the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean three cases: A exists alone, both A and B exist, and B exists alone.
  • the character “/” generally indicates that the related objects before and after are an “or” relationship.
  • Embodiments of the present disclosure can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with many other general or special-purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments including any of the above systems, and the like.
  • Electronic devices such as a terminal device, a computer system, and a server can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, and data structures, etc., which perform specific tasks or implement specific abstract data types.
  • the computer system / server can be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on a local or remote computing system storage medium including a storage device.
  • FIG. 1 is a flowchart of an agent reinforcement learning method according to an embodiment of the present disclosure. As shown in FIG. 1, the method in this embodiment includes:
  • the agent in the embodiments of the present disclosure may be a moving board or a robotic arm that catches a falling ball in a game, or an object with artificial intelligence characteristics formed based on reinforcement learning, such as a vehicle, a robot, or a smart home device.
  • the embodiments of the present disclosure neither limit the specific expression form of the agent nor limit the possibility that the object is implemented as hardware, software, or a combination of software and hardware.
  • the operation S100 may be performed by a processor calling a corresponding instruction stored in a memory, or may be performed by an acquisition key vision module 600 executed by the processor.
  • the operation S110 may be performed by a processor calling a corresponding instruction stored in a memory, or may be performed by an actual vision module 610 that is run by the processor.
  • the operation S120 may be performed by a processor calling a corresponding instruction stored in the memory, or may be performed by a determining change return module 620 run by the processor.
  • adjusting the reward feedback of the agent's reinforcement learning according to the attention change reward information may include: making the reward feedback of the agent's reinforcement learning include the attention change reward information, for example, adding the attention change reward information to the reward feedback.
  • the operation S130 may be executed by the processor calling a corresponding instruction stored in the memory, or may be executed by the adjustment return feedback module 630 executed by the processor.
  • the key visual information in the embodiment of the present disclosure may include: an area needing attention in the image; and may also include: an attention area in the image.
  • the key visual information based on may include: the attention area considered by the agent, that is, the attention area for the current environment image when the agent makes a decision.
  • the actual key visual information of the current environment image may include: the real key visual information of the current environment image, that is, the real attention area of the current environment image, that is, the area where the target object in the current environment image is located.
  • the attention change reward information may be determined according to the ratio of the intersection of the agent's attention area for the current environment image with the region where the target object is located to the region where the target object is located.
  • the attention change reward information in the embodiments of the present disclosure is used to make the attention area of the current environment image, as considered by the agent, closer to the actual key visual information of the current environment image.
  • the reward feedback in the embodiments of the present disclosure may include: the attention change reward information and the reward information formed by the agent making a decision on the current environment image.
  • the reward information formed by the agent making decisions on the current environment image is usually the reward information used by existing agents for reinforcement learning.
  • the embodiments of the present disclosure obtain the key visual information on which the agent bases its decision on the current environment image, so that the actual key visual information of the current environment image can be used to measure the attention change of the agent when making decisions on the current environment image (such as an attention shift), and the attention change can then be used to determine the attention change reward information.
  • the embodiments of the present disclosure adjust the reward feedback of the agent's learning by using the attention change reward information, so that the reward feedback can reflect the attention change reward information; using such reward feedback for the reinforcement learning of the agent can reduce the probability that inaccurate attention causes the agent to perform dangerous actions, which is conducive to improving the behavioral safety of the agent.
  • An example of the above-mentioned dangerous action is: when the agent should move, the decision result of the agent is an empty action, so that the agent maintains the original state, and the empty action determined at this time is a dangerous action.
  • the embodiments of the present disclosure do not limit the specific expressions of the dangerous actions.
  • an example of the network structure included in the agent during the reinforcement learning process is shown in FIG. 2.
  • the agent in FIG. 2 includes a convolutional neural network (at the middle position in FIG. 2), a decision network (Policy Network), and a value network (Value network).
  • the agent can obtain the current environment image by interacting with the environment.
  • the image shown at the bottom of Figure 2 is an example of the current environment image.
  • the current environment image is input to the convolutional neural network.
  • the feature map of the current environment image formed by a previous convolution layer is provided to the subsequent convolution layer, and the feature map of the current environment image formed by the last convolution layer is provided to the decision network and the value network, respectively.
  • the decision network performs decision processing on the feature maps it receives.
  • the value network performs state value prediction processing on the received feature map to determine the state value of the current environment image.
  • the agent in FIG. 3 includes a convolutional neural network (at the middle position in FIG. 3), RNN (Recurrent Neuron Network, recurrent neural network), a decision network, and a value network.
  • the agent can obtain the current environment image by interacting with the environment.
  • the image shown at the bottom of Figure 3 is an example of the current environment image.
  • the current environment image is input to the convolutional neural network.
  • the feature map of the current environment image formed by a previous convolution layer is provided to the subsequent convolution layer, and the feature map of the current environment image formed by the last convolution layer is provided to the RNN, which can convert the time series information of the feature map into a one-dimensional feature vector.
  • the feature map and time-series feature vectors output by the RNN are provided to the decision network and the value network, respectively.
  • the decision network performs decision processing on the feature maps and time-series feature vectors it receives.
  • the value network performs state value prediction processing on the received feature map and time-series feature vectors to determine the state value of the current environment image.
  • FIG. 2 and FIG. 3 are only optional examples of the network structure of the agent during the reinforcement learning process.
  • the network structure of the agent may also take other forms, and the embodiments of the present disclosure do not limit the specific form of the network structure of the agent.
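A schematic PyTorch-style sketch of the kind of actor-critic structure shown in FIG. 2 and FIG. 3 (CNN backbone, optional recurrent layer, decision network head, value network head); the layer sizes, input resolution, and activation choices below are assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    """CNN backbone feeding a policy (decision) head and a value head,
    optionally through a recurrent layer as in FIG. 3. Sizes are illustrative."""
    def __init__(self, n_actions: int, use_rnn: bool = True):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # last conv layer -> high-level feature map
        )
        self.use_rnn = use_rnn
        self.flat = nn.Flatten()
        feat_dim = 64 * 7 * 7                     # assumes an 84x84 input image
        self.rnn = nn.GRUCell(feat_dim, 256) if use_rnn else None
        hidden = 256 if use_rnn else feat_dim
        self.policy_head = nn.Linear(hidden, n_actions)  # decision network
        self.value_head = nn.Linear(hidden, 1)           # value network

    def forward(self, obs, h=None):
        feat_map = self.backbone(obs)             # feature map of the current environment image
        x = self.flat(feat_map)
        if self.use_rnn:
            h = self.rnn(x, h)                    # fold time-series information into a feature vector
            x = h
        return self.policy_head(x), self.value_head(x), feat_map, h
```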
  • the key visual information on which the decision is based in the embodiments of the present disclosure is information that can reflect the attention of the agent (for example, the decision network in the agent) to the current environment image when making a decision.
  • the timing of making a decision may depend on a preset setting.
  • the agent may be preset to make a decision every 0.2 seconds.
  • the decision result in the embodiment of the present disclosure may be selecting an action from the action space.
  • the embodiments of the present disclosure can first obtain, through the value network of the agent, the heat map corresponding to the agent's attention to the current environment image when making a decision, and then use the heat map to obtain the key visual information on which the agent bases its decision on the current environment image.
  • for example, the pixels in the heat map can be filtered according to a preset threshold to select the pixels whose values exceed the preset threshold; then, based on the area formed by the selected pixels, the agent's attention area for the current environment image when making a decision can be determined.
  • when the agent in the embodiments of the present disclosure makes a decision, its attention to the current environment image may be embodied using a value attention map (Value Attention Map).
  • the value attention map may include the key visual information on which the value network of the agent is based when making state value judgments.
  • obtaining the key visual information on which the agent bases its decision on the current environment image may include: obtaining the agent's value attention map for the current environment image; synthesizing the value attention map and the current environment image to obtain a heat map; and determining the agent's attention area for the current environment image according to the heat map.
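As a hedged illustration of the last step (not part of the patent text), the attention area can be obtained by thresholding the heat map; the normalization and the threshold value are assumptions.

```python
import numpy as np

def attention_area_from_heatmap(heatmap: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep the pixels of the heat map whose (normalized) value exceeds a preset
    threshold; the resulting binary mask is taken as the agent's attention area."""
    h = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)  # normalize to [0, 1]
    return h > threshold
```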
  • the embodiment of the present disclosure can obtain the value attention map of the current environment image in various ways.
  • the embodiment of the present disclosure can obtain the value attention map by using the process shown in FIG. 4.
  • S400 acquiring a feature map of a current environment image.
  • the feature map in the embodiments of the present disclosure is generally a high-level feature map formed by the convolutional neural network of the agent for the current environment image.
  • the current environment image is input into the agent's convolutional neural network, and the feature map output by the last layer of the convolutional neural network is used as the feature map of the current environment image in S400.
  • the feature map output by the penultimate layer of the convolutional neural network is used as the feature map of the current environment image in S400.
  • it only needs to be a high-level feature map in the convolutional neural network.
  • the high-level feature map in the embodiments of the present disclosure can be considered as: in a case where the structure of the convolutional neural network of the agent is divided into two, three, or more stages, a feature map formed for the current environment image by any layer in the middle stage, a middle-to-late stage, or the last stage.
  • the high-level feature map in the embodiments of the present disclosure may also be considered as a feature map formed by a layer that is close to or closest to the output of the convolutional neural network of the agent.
  • a changed feature map in the embodiments of the present disclosure is a feature map that differs from the feature map in S400 because a corresponding channel of that feature map has been shielded.
  • in a case where the feature map of the current environment image has multiple channels, an example of obtaining the changed feature maps in the embodiments of the present disclosure is: first, by shielding the first channel in the feature map, the first changed feature map can be obtained; secondly, by shielding the second channel in the feature map, the second changed feature map can be obtained; again, by shielding the third channel in the feature map, the third changed feature map can be obtained; and so on, until, by shielding the last channel in the feature map, the last changed feature map can be obtained.
  • the corresponding channels of the shielding feature map in the embodiment of the present disclosure may also be considered as the corresponding activation information of the shielding hidden layer.
  • the embodiment of the present disclosure can obtain n change feature maps.
  • the embodiments of the present disclosure may use existing methods to shield the activation information of the corresponding hidden layer, so as to obtain the changed feature map. The specific implementation manner is not described in detail here.
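A minimal sketch (not from the patent) of producing the changed feature maps by shielding one channel at a time, i.e., zeroing the corresponding hidden-layer activations.

```python
import torch

def changed_feature_maps(feature_map: torch.Tensor) -> torch.Tensor:
    """Given the high-level feature map H of shape (C, H, W), produce C "changed"
    feature maps, where the i-th copy has its i-th channel zeroed out."""
    c = feature_map.shape[0]
    changed = feature_map.unsqueeze(0).repeat(c, 1, 1, 1)   # (C, C, H, W): one copy per channel
    idx = torch.arange(c)
    changed[idx, idx] = 0.0                                  # shield channel i in copy i
    return changed
```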
  • each of the obtained changed feature maps may first be input into the value network of the agent to obtain the state value of each changed feature map.
  • the value network performs state value prediction processing on each changed feature map separately, so that the state value of each changed feature map can be obtained; for example, n state values can be obtained for n changed feature maps.
  • the embodiments of the present disclosure can then calculate the difference between the state value output by the value network for the feature map in S400 and the state value of each changed feature map, so as to obtain the state value change amount of each changed feature map relative to the feature map of the current environment image.
  • assuming the value network outputs a state value V for the feature map of the current environment image and state values V_1, V_2, ..., V_i, ..., V_n for the n changed feature maps, the embodiments of the present disclosure can obtain n differences by calculating V - V_1, V - V_2, ..., V - V_i, ..., V - V_n, that is, ΔV_1, ΔV_2, ..., ΔV_i, ..., ΔV_n (as shown at the upper right position in FIG. 5).
  • ΔV_1, ΔV_2, ..., ΔV_n are the state value change amounts of the n changed feature maps relative to the feature map of the current environment image, respectively.
  • the embodiments of the present disclosure may use the following formula (1) to calculate the state value change amount of a changed feature map relative to the feature map of the current environment image:
  • ΔV_i = V - f_V(B_i ⊙ H)   (1)
  • where ΔV_i represents the state value change amount, V represents the state value formed by the value network for the feature map of the current environment image, H represents the feature map of the current environment image, B_i ⊙ H represents the changed feature map obtained after the i-th channel of the feature map is masked, and f_V(B_i ⊙ H) represents the state value formed by the value network for that changed feature map; i is an integer greater than 0 and not greater than n, and n is an integer greater than 1.
  • the embodiments of the present disclosure sequentially shield different activation information of the hidden layer and obtain the state value change amount of each changed feature map relative to the feature map, so that the different state value change amounts can reflect the degree of attention the agent pays to different areas.
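A sketch of the per-channel shielding and the ΔV_i computation described above; the value network interface is an assumption made for illustration.

```python
import torch

@torch.no_grad()
def state_value_changes(value_net, feature_map: torch.Tensor) -> torch.Tensor:
    """Compute ΔV_i = V - f_V(B_i ⊙ H) for every channel i of the high-level
    feature map H (shape (C, H, W)). `value_net` is an assumed callable that
    maps an (N, C, H, W) feature-map batch to an (N, 1) state value."""
    c = feature_map.shape[0]
    v = value_net(feature_map.unsqueeze(0)).squeeze()        # V for the unmasked feature map
    masked = feature_map.unsqueeze(0).repeat(c, 1, 1, 1)     # one copy per channel
    idx = torch.arange(c)
    masked[idx, idx] = 0.0                                   # B_i ⊙ H: shield channel i in copy i
    v_i = value_net(masked).squeeze(-1)                      # f_V(B_i ⊙ H) for each copy
    return v - v_i                                           # ΔV_1 ... ΔV_n
```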
  • the above operations S400-S430 may be executed by the processor calling corresponding instructions stored in the memory, or may be executed by the acquisition key vision module 600 run by the processor.
  • the embodiment of the present disclosure may perform normalization processing on the amount of state value change to form the weight of each changed feature map.
  • an example of normalizing the state value changes is shown in formula (2), where α_i represents the weight of the i-th changed feature map.
  • the embodiments of the present disclosure may form the value attention map by the following formula (3):
  • A = Σ_{i=1}^{K} α_i · H_i   (3)
  • where A represents the value attention map, H_i represents the feature map of the i-th channel output from the last convolutional layer of the convolutional neural network, α_i represents the weight of the i-th channel, and K is the number of channels.
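A minimal sketch of formulas (2) and (3): the normalization in formula (2) is not reproduced in the extracted text, so a softmax over the state value changes is used here purely as an assumption.

```python
import torch

def value_attention_map(feature_map: torch.Tensor, delta_v: torch.Tensor) -> torch.Tensor:
    """Form A = Σ_i α_i · H_i, where H_i is the i-th channel of the high-level
    feature map (shape (K, H, W)) and α_i is a weight obtained by normalizing
    the state value changes ΔV_i (softmax used here as an assumed normalization)."""
    alpha = torch.softmax(delta_v, dim=0)                    # α_i: normalized ΔV_i
    return (alpha.view(-1, 1, 1) * feature_map).sum(dim=0)   # weighted sum over the K channels
```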
  • the embodiments of the present disclosure may also use existing methods to obtain the value attention map of the current environment image when the agent makes a decision.
  • the embodiment of the present disclosure does not limit the specific implementation process of acquiring the value attention map for the current environment image when the agent makes a decision.
  • the embodiments of the present disclosure may first adjust the size of the value attention map A obtained above, for example, by performing upsampling on the value attention map A so that its size is the same as the size of the current environment image; after that, the resized value attention map A' is fused with the current environment image (such as the image in the lower left corner of FIG. 5) to obtain the heat map corresponding to the value attention map of the current environment image.
  • An optional example of such a heat map is shown in the image in the lower right corner of FIG. 5.
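A hedged OpenCV sketch of the resize-and-fuse step; the colormap and blending weight are illustrative choices, not taken from the patent.

```python
import cv2
import numpy as np

def heatmap_overlay(attention_map: np.ndarray, frame_bgr: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Upsample the value attention map A to the size of the current environment
    image and fuse the two, giving the heat map used to locate the attention area."""
    a = (attention_map - attention_map.min()) / (attention_map.max() - attention_map.min() + 1e-8)
    a = cv2.resize(a.astype(np.float32), (frame_bgr.shape[1], frame_bgr.shape[0]))  # A': same size as the image
    colored = cv2.applyColorMap((a * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(frame_bgr, 1.0 - alpha, colored, alpha, 0.0)
```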
  • the actual key visual information of the current environment image in the embodiment of the present disclosure may include: a region where the target object is located in the current environment image.
  • the embodiment of the present disclosure may use a target object detection algorithm to obtain a region where the target object is located in the current environment image.
  • the embodiments of the present disclosure do not limit the specific implementation of the target object detection algorithm, nor the specific implementation of obtaining the area where the target object in the current environment image is obtained.
  • the attention change reward information in the embodiments of the present disclosure may reflect the gap between the area the agent is focusing on in the current environment image and the area that should actually be focused on; that is, the embodiments of the present disclosure can determine the attention change reward information according to the difference between the agent's attention area for the current environment image when making a decision and the region where the target object is located.
  • the embodiments of the present disclosure may first determine the attention area of the agent for the current environment image according to the key visual information on which the decision is based, such as the heat map.
  • for example, the pixels in the heat map may be filtered based on a preset threshold, and the pixels whose values exceed the preset threshold are retained; the attention area a of the agent for the current environment image is determined from the area formed by those pixels.
  • the embodiments of the present disclosure may then calculate the ratio (a ∩ b) / b of the intersection of the attention area a with the region b where the target object is located in the current environment image to the region b where the target object is located, and determine the attention change reward information according to this ratio.
  • the ratio in the embodiments of the present disclosure, or the attention change reward information obtained based on the ratio, can be considered as a safety evaluation index for the behavior of the agent: the larger the ratio, the higher the safety of the agent's behavior; conversely, the smaller the ratio, the lower the safety of the agent's behavior.
  • the embodiments of the present disclosure adjust the reward feedback of the agent's reinforcement learning by using the obtained attention change reward information (for example, adding it to the reward feedback of the agent's reinforcement learning), and use such reward feedback to update the network parameters of the agent (such as updating the network parameters of the convolutional neural network, the value network, and the policy network), so that during reinforcement learning the agent can reduce the probability of performing dangerous actions caused by attention changes (such as attention shifts).
  • the method of updating the network parameters of the agent can be based on the actor-critic algorithm in reinforcement learning.
  • the specific goals of updating the network parameters of the agent include: making the state value predicted by the value network in the agent as close as possible to the accumulated value of the reward information in an environment exploration cycle, and updating the network parameters of the decision network in the agent so as to increase the state value predicted by the value network.
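The disclosure states that the update can follow the actor-critic algorithm, with the value network regressed toward the accumulated reward of an exploration cycle and the decision network updated to increase the predicted state value. A minimal advantage-actor-critic style loss consistent with that description; the shapes and the coefficient are assumptions.

```python
import torch

def actor_critic_loss(log_prob_a, value_pred, reward_return, value_coef=0.5):
    """Critic term: regress the predicted state value toward the accumulated
    (attention-adjusted) return. Actor term: increase the advantage-weighted
    log-probability of the chosen action. All tensors have shape (T,)."""
    advantage = reward_return - value_pred.detach()
    policy_loss = -(log_prob_a * advantage).mean()            # decision (policy) network term
    value_loss = (reward_return - value_pred).pow(2).mean()   # value network term
    return policy_loss + value_coef * value_loss
```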
  • in a brick game, the small ball will drop rapidly due to gravity during its fall.
  • for the moving board that catches the falling ball, dangerous actions (such as the moving board performing an empty action) are often performed due to attention lag.
  • the embodiments of the present disclosure use reward feedback that can reflect the attention change reward information for the reinforcement learning of the moving board, which helps to avoid the attention lag of the moving board and thus reduce the probability of the moving board performing dangerous actions.
  • in the embodiments of the present disclosure, the agent may be an agent that has already undergone a certain degree of reinforcement learning.
  • for example, the embodiments of the present disclosure may first use an existing reinforcement learning method to enable the agent to perform reinforcement learning based on reward feedback that does not include the attention change reward information, and when it is determined that the degree of the agent's reinforcement learning reaches a certain requirement (for example, the entropy of the decision network drops to a certain value, such as 0.6), the technical solution provided by the embodiments of the present disclosure is adopted to continue the reinforcement learning of the agent, which is beneficial to improving the efficiency and success rate of the agent's reinforcement learning.
  • the embodiments of the present disclosure may select important reinforcement learning training data from the sampled reinforcement learning training data and store it as historical training data, so as to facilitate the experience replay process.
  • by selecting important reinforcement learning training data to store as historical training data, the embodiments of the present disclosure can effectively reduce the cache space required for historical training data; by using the important reinforcement learning training data as historical training data for experience replay, it is beneficial to improving the efficiency of the agent's reinforcement learning.
  • the method may further include: determining a degree of exploration in an environment exploration cycle according to the key visual information on which the decision is based; and, in a case where it is determined that the degree of exploration does not meet a predetermined degree of exploration, using stored historical training data for experience replay.
  • the historical training data may include: training data obtained by filtering the sampled reinforcement learning training data using preset requirements.
  • determining the degree of exploration in the environment exploration cycle according to the key visual information on which the decision is based may include: determining the amount of attention change in the environment exploration cycle according to the change information between the agent's value attention maps of the current environment images at multiple adjacent moments in the environment exploration cycle, where the amount of attention change is used to measure the degree of exploration in the environment exploration cycle.
  • the embodiments of the present disclosure may use the positive reward in an environment exploration cycle and the degree of exploration in the environment exploration cycle to determine the importance of the reinforcement learning training data in the environment exploration cycle, so that, when it is judged that the importance meets a predetermined requirement, the reinforcement learning training data in the environment exploration cycle can be cached as historical training data.
  • the degree of exploration of the environment exploration cycle in the embodiment of the present disclosure may be reflected by the amount of attention change in the environment exploration cycle.
  • for example, the embodiments of the present disclosure may determine the amount of attention change in the environment exploration cycle according to the change information between the value attention maps of the current environment images at multiple adjacent moments, and use this amount of attention change as the degree of exploration in the environment exploration cycle.
  • the following formula (4) can be used to calculate the amount of attention change in an environment exploration cycle:
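The extracted text does not reproduce formula (4) itself; as a hedged stand-in, the attention change amount of an exploration cycle can be taken as the average difference between value attention maps at adjacent moments.

```python
from typing import List
import numpy as np

def attention_change_amount(attention_maps: List[np.ndarray]) -> float:
    """Assumed stand-in for formula (4): measure the exploration degree of an
    environment exploration cycle by the mean absolute difference between value
    attention maps at adjacent moments, averaged over the cycle."""
    if len(attention_maps) < 2:
        return 0.0
    diffs = [np.abs(a2 - a1).mean() for a1, a2 in zip(attention_maps[:-1], attention_maps[1:])]
    return float(np.mean(diffs))
```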
  • the embodiment of the present disclosure may use the following formula (5) to calculate the importance of the reinforcement learning training data in an environment exploration cycle:
  • where S represents the importance of the reinforcement learning training data in an environment exploration cycle, the hyperparameter in the formula is usually a constant between 0 and 1, r+ represents the positive reward in the environment exploration cycle, and E represents the average amount of attention change during the environment exploration cycle.
  • in a case where the importance meets the predetermined requirement, all the reinforcement learning training data (such as reward information and current environment images) in the environment exploration cycle may be cached as historical training data; otherwise, none of the reinforcement learning training data in the environment exploration cycle is retained.
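Formula (5) is likewise not reproduced in the extract; it combines the positive reward r+ and the average attention change E through a hyperparameter between 0 and 1. A sketch using an assumed convex combination, together with the caching rule described above.

```python
def episode_importance(positive_reward: float, avg_attention_change: float, lam: float = 0.5) -> float:
    """Assumed stand-in for formula (5): combine the positive reward r+ of an
    environment exploration cycle with the average attention change amount E
    through a hyperparameter in (0, 1); the exact combination is an assumption."""
    return lam * positive_reward + (1.0 - lam) * avg_attention_change

def maybe_cache(replay_buffer: list, episode_data, importance: float, threshold: float) -> None:
    """Cache the whole cycle's training data as historical training data only
    when its importance meets the predetermined requirement."""
    if importance >= threshold:
        replay_buffer.append(episode_data)
```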
  • the embodiments of the present disclosure may use the cached historical training data to adjust the network parameters of the agent in an experience replay manner, for example, adjusting the network parameters of the policy network, the value network, and the convolutional neural network, or adjusting the network parameters of the policy network, the value network, the RNN, and the convolutional neural network.
  • after determining the degree of exploration in an environment exploration cycle, when it is determined that the degree of exploration does not meet the predetermined degree of exploration, the embodiments of the present disclosure may generate a random number.
  • if the random number exceeds a predetermined value (such as 0.3), it is determined that experience replay is required, so that the embodiments of the present disclosure can perform the experience replay operation using pre-stored historical training data; if the random number does not exceed the predetermined value, it can be determined that experience replay is not required.
  • the specific implementation process of experience replay may use an existing implementation and will not be described in detail here.
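A minimal sketch of the random-number gate described above; the 0.3 threshold is the example given in the disclosure, and everything else is an assumption.

```python
import random

def should_replay(exploration_degree: float, required_degree: float, p_threshold: float = 0.3) -> bool:
    """Trigger experience replay only when the cycle's exploration degree falls
    short of the predetermined degree and a drawn random number exceeds the
    predetermined value (0.3 in the example given in the disclosure)."""
    if exploration_degree >= required_degree:
        return False
    return random.random() > p_threshold
```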
  • the agent reinforcement learning method may be executed by any appropriate device having data processing capabilities, including but not limited to a terminal device and a server.
  • any agent reinforcement learning method provided by the embodiment of the present disclosure may be executed by a processor.
  • or, the processor executes any agent reinforcement learning method mentioned in the embodiments of the present disclosure by calling a corresponding instruction stored in a memory, which will not be repeated below.
  • the foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed.
  • the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
  • FIG. 6 is a schematic structural diagram of an embodiment of an agent reinforcement learning device according to an embodiment of the present disclosure.
  • the device in this embodiment mainly includes: an acquiring key vision module 600, an acquiring actual vision module 610, a determining change return module 620, and an adjusting return feedback module 630.
  • the apparatus may further include: an experience playback module 640 and a training data acquisition module 650.
  • the acquiring key vision module 600 is used for acquiring key visual information on which the agent makes a decision on the current environment image.
  • the key visual information on which the above is based may include: when the agent makes a decision, the attention area of the current environment image.
  • the acquiring key vision module 600 may further be configured to: first obtain a value attention map of the agent for the current environment image; then synthesize the value attention map and the current environment image to obtain a heat map; and then determine the agent's attention area for the current environment image according to the heat map.
  • the manner in which the acquiring key vision module 600 obtains the value attention map may optionally be: first, the acquiring key vision module 600 obtains a feature map of the current environment image; after that, it obtains, based on the feature map, the changed feature maps formed by sequentially shielding each channel of the feature map; then, it obtains the state value change amount of each changed feature map relative to the feature map; finally, it forms the value attention map according to the state value change amounts and the changed feature maps.
  • the manner in which the acquiring key vision module 600 obtains the feature map of the current environment image may optionally be: first, the acquiring key vision module 600 inputs the current environment image into a convolutional neural network, and then obtains the feature map output by the last convolutional layer of the convolutional neural network; the feature map output by the last convolutional layer is taken as the feature map of the current environment image obtained by the acquiring key vision module.
  • the manner in which the acquiring key vision module 600 obtains the state value change amount of each changed feature map relative to the feature map may optionally be: first, the acquiring key vision module 600 inputs the changed feature maps into the value network of the agent to obtain the state value of each changed feature map; then, it calculates the difference between the state value output by the value network for the feature map and the state value of each changed feature map, so as to obtain the state value change amount of each changed feature map relative to the feature map.
  • the acquiring actual vision module 610 is configured to acquire actual key visual information of the current environment image.
  • the actual key visual information of the current environment image in the embodiment of the present disclosure may include: a region where the target object is located in the current environment image.
  • the determining change return module 620 is configured to determine the attention change reward information according to the above key visual information on which the decision is based and the actual key visual information.
  • for example, the determining change return module 620 may determine the attention change reward information according to the ratio of the intersection of the agent's attention area for the current environment image when making a decision with the region where the target object is located to the region where the target object is located.
  • the adjusting return feedback module 630 is configured to adjust the reward feedback of the agent's reinforcement learning according to the attention change reward information.
  • the reward feedback of the agent's reinforcement learning in the embodiments of the present disclosure may include: the attention change reward information and the reward information formed by the agent's decision on the current environment image.
  • the experience playback module 640 is used to determine the degree of exploration in the environment exploration period according to the key visual information on which it is based; if it is determined that the degree of exploration does not meet the predetermined degree of exploration, the stored historical training data is used for experience playback.
  • the historical training data in the embodiment of the present disclosure includes: training data obtained by filtering the sampled reinforcement learning training data by using a preset requirement.
  • the manner in which the experience playback module 640 determines the degree of exploration in the environment exploration cycle may optionally be: determining the amount of attention change in the environment exploration cycle according to the change information between the agent's value attention maps of the current environment images at multiple adjacent moments in the environment exploration cycle, where the amount of attention change is used to measure the degree of exploration in the environment exploration cycle.
  • the acquiring training data module 650 is configured to determine the importance of the reinforcement learning training data sampled during the environment exploration cycle according to the positive reward and the degree of exploration during the environment exploration cycle, and to store, as historical training data, the reinforcement learning training data sampled during the environment exploration cycle whose importance meets the predetermined requirement.
  • for the specific operations performed by the acquiring key vision module 600, the acquiring actual vision module 610, the determining change return module 620, the adjusting return feedback module 630, the experience playback module 640, and the acquiring training data module 650, reference may be made to the descriptions of FIG. 1 to FIG. 5 in the above method embodiments; the descriptions are not repeated here.
  • FIG. 7 shows an exemplary device 700 suitable for implementing embodiments of the present disclosure.
  • the device 700 may be a control system / electronic system configured in a car, a mobile terminal (for example, a smart mobile phone), a personal computer (PC, for example, a desktop computer or a laptop), a tablet computer, a server, or the like.
  • the device 700 includes one or more processors, a communication unit, and the like.
  • the one or more processors may be: one or more central processing units (CPUs) 701, and/or one or more graphics processing units (GPUs) 713, and the like.
  • the processor can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 702 or executable instructions loaded from a storage section 708 into a random access memory (RAM) 703.
  • the communication unit 712 may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (Infiniband) network card.
  • the processor may communicate with the read-only memory 702 and/or the random access memory 703 to execute executable instructions, connect to the communication unit 712 through the bus 704, and communicate with other target devices via the communication unit 712, thereby completing the corresponding steps of the agent reinforcement learning method of any embodiment of the present disclosure.
  • the RAM 703 can also store various programs and data required for device operation.
  • the CPU 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704.
  • ROM 702 is an optional module.
  • the RAM 703 stores executable instructions, or writes executable instructions to the ROM 702 at runtime, and the executable instructions cause the central processing unit 701 to execute the steps included in the method for reinforcement learning of an agent of any of the above embodiments.
  • An input / output (I / O) interface 705 is also connected to the bus 704.
  • the communication unit 712 may be provided in an integrated manner, or may be provided with a plurality of sub-modules (for example, a plurality of IB network cards), and are respectively connected to the bus.
  • the following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output portion 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem.
  • the communication section 709 performs communication processing via a network such as the Internet.
  • the driver 710 is also connected to the I / O interface 705 as needed.
  • a removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 710 as needed, so that a computer program read out therefrom is installed in the storage section 708 as needed.
  • FIG. 7 is only an optional implementation.
  • the number and type of the components in FIG. 7 can be selected, deleted, added, or replaced according to actual needs.
  • different functional components can also be implemented in separate settings or integrated settings; for example, the GPU 713 and the CPU 701 can be set separately, or the GPU 713 can be integrated on the CPU 701, and the communication unit can be set separately, or integrated on the CPU 701 or the GPU 713, and so on.
  • in particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program.
  • for example, the embodiments of the present disclosure include a computer program product, which includes a computer program tangibly contained in a machine-readable medium; the computer program includes program code for executing the steps shown in the flowchart, and the program code may include instructions corresponding to the steps of the agent reinforcement learning method provided by any embodiment of the present disclosure.
  • the computer program may be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711.
  • when the computer program is executed by the central processing unit (CPU) 701, the instructions implementing the corresponding operations described in the agent reinforcement learning method of any embodiment of the present disclosure are executed.
  • in addition, an embodiment of the present disclosure further provides a computer program product for storing computer-readable instructions, which, when executed, cause a computer to execute the agent reinforcement learning method in any of the above embodiments.
  • the computer program product may be specifically implemented by hardware, software, or a combination thereof.
  • the computer program product is embodied as a computer storage medium.
  • in another optional example, the computer program product is embodied as a software product, such as a Software Development Kit (SDK), and the like.
  • the embodiment of the present disclosure further provides another agent reinforcement learning method and corresponding device and electronic device, computer storage medium, computer program, and computer program product.
  • the method includes: a first device sends an agent reinforcement learning instruction to a second device, where the instruction causes the second device to execute the agent reinforcement learning method in any of the above possible embodiments; and the first device receives the agent reinforcement learning result sent by the second device.
  • in some embodiments, the agent reinforcement learning instruction may specifically be a call instruction, and the first device may instruct, by means of a call, the second device to perform the agent reinforcement learning operation; accordingly, in response to receiving the call instruction, the second device may perform the operations and/or processes in any of the above embodiments of the agent reinforcement learning method.
  • the methods and apparatuses of the embodiments of the present disclosure may be implemented in many ways.
  • the methods and devices, electronic devices, and computer-readable storage media of the embodiments of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above order of the steps of the method is for illustration only, and the steps of the method of the embodiments of the present disclosure are not limited to the order specifically described above, unless otherwise specifically stated.
  • embodiments of the present disclosure may also be implemented as programs recorded in a recording medium, and the programs include machine-readable instructions for implementing a method according to an embodiment of the present disclosure.
  • the embodiments of the present disclosure also cover a recording medium storing a program for executing a method according to an embodiment of the present disclosure.

Abstract

The embodiments of the present disclosure disclose an agent reinforcement learning method, apparatus, device, and medium. The method includes: acquiring key visual information on which an agent bases its decision on a current environment image; acquiring actual key visual information of the current environment image; determining attention change reward information according to the key visual information on which the decision is based and the actual key visual information; and adjusting the reward feedback of the agent's reinforcement learning according to the attention change reward information.

Description

Agent reinforcement learning method, apparatus, device, and medium

The present disclosure claims priority to the Chinese patent application with application No. CN201810849877.6, filed with the China Patent Office on July 28, 2018 and entitled "Agent reinforcement learning method, apparatus, device, and medium", the entire contents of which are incorporated herein by reference.

Technical Field

The present disclosure relates to computer vision technology, and in particular, to an agent reinforcement learning method, an agent reinforcement learning apparatus, an electronic device, a computer-readable storage medium, and a computer program.

Background

Agents are commonly used in many technical fields such as games and robotics, for example, a moving board or a robot arm that catches a falling ball in a game. During reinforcement learning, an agent usually uses the reward information obtained from trial and error in the environment to guide learning.

How to improve the behavioral safety of an agent after reinforcement learning is an important technical problem in reinforcement learning.
Summary

The embodiments of the present disclosure provide a technical solution for agent reinforcement learning.

According to one aspect of the embodiments of the present disclosure, an agent reinforcement learning method is provided. The method includes: acquiring key visual information on which an agent bases its decision on a current environment image; acquiring actual key visual information of the current environment image; determining attention change reward information according to the key visual information on which the decision is based and the actual key visual information; and adjusting the reward feedback of the agent's reinforcement learning according to the attention change reward information.

According to another aspect of the embodiments of the present disclosure, an agent reinforcement learning apparatus is provided. The apparatus includes: an acquiring key vision module, configured to acquire key visual information on which an agent bases its decision on a current environment image; an acquiring actual vision module, configured to acquire actual key visual information of the current environment image; a determining change return module, configured to determine attention change reward information according to the key visual information on which the decision is based and the actual key visual information; and an adjusting return feedback module, configured to adjust the reward feedback of the agent's reinforcement learning according to the attention change reward information.

According to yet another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a memory configured to store a computer program; and a processor configured to execute the computer program stored in the memory, where when the computer program is executed, any method embodiment of the present disclosure is implemented.

According to still another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, where when the computer program is executed by a processor, any method embodiment of the present disclosure is implemented.

According to still another aspect of the embodiments of the present disclosure, a computer program is provided, including computer instructions, where when the computer instructions are run in a processor of a device, any method embodiment of the present disclosure is implemented.

Based on the agent reinforcement learning method, agent reinforcement learning apparatus, electronic device, computer-readable storage medium, and computer program provided by the embodiments of the present disclosure, by obtaining the key visual information on which the agent bases its decision on the current environment image, the actual key visual information of the current environment image can be used to measure the attention change (such as an attention shift) of the agent with respect to the current environment image when making the decision, and the attention change can then be used to determine the attention change reward information. The embodiments of the present disclosure adjust the reward feedback of the agent's reinforcement learning by using the attention change reward information, so that the reward feedback reflects the attention change reward information; using such reward feedback for the reinforcement learning of the agent can reduce the probability that inaccurate attention of the agent (such as an attention shift) causes it to perform dangerous actions. Therefore, the technical solution provided by the embodiments of the present disclosure is beneficial to improving the behavioral safety of the agent.

The technical solutions of the embodiments of the present disclosure are further described in detail below with reference to the accompanying drawings and embodiments.
Brief Description of the Drawings

The accompanying drawings, which constitute a part of the specification, describe the embodiments of the present disclosure and, together with the description, serve to explain the principles of the embodiments of the present disclosure.

With reference to the accompanying drawings, the embodiments of the present disclosure can be understood more clearly from the following detailed description, in which:

FIG. 1 is a flowchart of an agent reinforcement learning method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a network structure of an agent;

FIG. 3 is another schematic diagram of a network structure of an agent;

FIG. 4 is a flowchart of obtaining a value attention map of an agent for a current environment image according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of obtaining a value attention map of an agent for a current environment image according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of an agent reinforcement learning apparatus according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of an exemplary device implementing an embodiment of the present disclosure.
Detailed Description

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the embodiments of the present disclosure.

It should also be understood that, in the embodiments of the present disclosure, "a plurality of" may refer to two or more, and "at least one" may refer to one, two, or more.

Those skilled in the art can understand that terms such as "first" and "second" in the embodiments of the present disclosure are only used to distinguish different steps, devices, or modules; they neither carry any specific technical meaning nor indicate a necessary logical order between them, and should not be understood as limiting the embodiments of the present disclosure. It should also be understood that, in the embodiments of the present disclosure, "a plurality of" may refer to two or more, and "at least one" may refer to one, two, or more.

It should also be understood that any component, data, or structure mentioned in the embodiments of the present disclosure can generally be understood as one or more, unless explicitly defined otherwise or the context indicates the contrary.

It should also be understood that the description of the embodiments of the present disclosure emphasizes the differences between the embodiments; for the same or similar parts, reference may be made to one another, and for brevity, they are not repeated.

At the same time, it should be understood that, for ease of description, the dimensions of the parts shown in the accompanying drawings are not drawn according to actual proportional relationships.

The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the embodiments of the present disclosure or their application or use.

Technologies, methods, and devices known to those of ordinary skill in the relevant fields may not be discussed in detail, but where appropriate, such technologies, methods, and devices should be regarded as part of the specification.

It should be noted that similar reference numerals and letters denote similar items in the following accompanying drawings; therefore, once an item is defined in one drawing, it does not need to be further discussed in subsequent drawings.

In addition, the term "and/or" in the embodiments of the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean three cases: A exists alone, both A and B exist, and B exists alone. In addition, the character "/" in the embodiments of the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.

The embodiments of the present disclosure can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments including any of the above systems, and the like.

Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like, which perform specific tasks or implement specific abstract data types. The computer system/server can be implemented in a distributed cloud computing environment; in a distributed cloud computing environment, tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
FIG. 1 is a flowchart of one embodiment of the intelligent agent reinforcement learning method of the present disclosure. As shown in FIG. 1, the method of this embodiment includes:
S100: acquiring key visual information relied upon by an agent when making a decision for a current environment image.
In an optional example, the agent in the embodiments of the present disclosure may be an object with artificial-intelligence characteristics formed on the basis of reinforcement learning, such as the paddle or robot arm that catches a falling ball in a game, a vehicle, a robot, or a smart home device. The embodiments of the present disclosure limit neither the specific form of the agent nor the possibility that the object is implemented in hardware, software, or a combination of hardware and software.
In an optional example, operation S100 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a key vision acquisition module 600 run by the processor.
S110: acquiring actual key visual information of the current environment image.
In an optional example, operation S110 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by an actual vision acquisition module 610 run by the processor.
S120: determining attention change reward information according to the relied-upon key visual information and the actual key visual information.
In an optional example, operation S120 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a change reward determination module 620 run by the processor.
S130: adjusting the reward feedback (reward) of the agent's reinforcement learning according to the attention change reward information, so that the agent's reinforcement learning can be realized on the basis of the adjusted reward feedback.
Adjusting the reward feedback of the agent's reinforcement learning according to the attention change reward information may include: making the reward feedback of the agent's reinforcement learning contain the attention change reward information, for example, adding the attention change reward information to the reward feedback.
In an optional example, operation S130 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a reward feedback adjustment module 630 run by the processor.
In some optional examples, the key visual information in the embodiments of the present disclosure may include the region in an image that needs to be attended to, and may also include the attention region in the image. The relied-upon key visual information may include the attention region as perceived by the agent, that is, the agent's attention region for the current environment image when it makes a decision. The actual key visual information of the current environment image may include the true key visual information of the current environment image, that is, the true attention region of the current environment image, namely the region where the target object in the current environment image is located.
In some optional examples, the attention change reward information may be determined according to the ratio of the intersection of the agent's attention region for the current environment image at decision time and the region where the target object is located to the region where the target object is located.
The attention change reward information in the embodiments of the present disclosure is used to make the attention region of the current environment image as perceived by the agent closer to the actual key visual information of the current environment image. In some optional examples, the reward feedback of the embodiments of the present disclosure may include: the attention change reward information and the reward information formed by the agent making a decision for the current environment image. The latter is usually the reward information used by existing agents for reinforcement learning.
By obtaining the key visual information relied upon by the agent for the current environment image, the embodiments of the present disclosure can use the actual key visual information of the current environment image to measure the agent's attention change (such as attention drift) when making a decision for the current environment image, and this attention change can then be used to determine the attention change reward information. By using the attention change reward information to adjust the reward feedback of the agent's learning so that the reward feedback reflects the attention change reward information, and by realizing the agent's reinforcement learning with such reward feedback, the probability that the agent performs dangerous actions due to inaccurate attention can be reduced, which is conducive to improving the behavioral safety of the agent. One example of such a dangerous action is: when the agent should move, the agent's decision result is a null action, so that the agent keeps its original state; the null action decided in this case is a dangerous action. The embodiments of the present disclosure do not limit the specific form of dangerous actions.
In an optional example, one example of the network structure contained in the agent during reinforcement learning is shown in FIG. 2. The agent in FIG. 2 contains a convolutional neural network (in the middle of FIG. 2), a policy network and a value network. By interacting with the environment, the agent obtains the current environment image; the image at the bottom of FIG. 2 is one example of the current environment image. The current environment image is input into the convolutional neural network; within the convolutional neural network, the feature map of the current environment image formed by each convolutional layer is provided to the next convolutional layer, and the feature map formed by the last convolutional layer is provided to the policy network and the value network respectively. The policy network performs decision processing on the received feature map, and the value network performs state-value prediction processing on the received feature map to determine the state value of the current environment image.
Another example of the network structure contained in the agent during reinforcement learning is shown in FIG. 3. The agent in FIG. 3 contains a convolutional neural network (in the middle of FIG. 3), an RNN (Recurrent Neural Network), a policy network and a value network. By interacting with the environment, the agent obtains the current environment image; the image at the bottom of FIG. 3 is one example of the current environment image. The current environment image is input into the convolutional neural network; within the convolutional neural network, the feature map formed by each convolutional layer is provided to the next convolutional layer, and the feature map formed by the last convolutional layer is provided to the RNN, which converts the temporal information of the feature maps into a one-dimensional feature vector. The feature map and the temporal feature vector output by the RNN are provided to the policy network and the value network respectively. The policy network performs decision processing on the received feature map and temporal feature vector, and the value network performs state-value prediction processing on them to determine the state value of the current environment image.
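The two figures describe the data flow only at a high level. The sketch below is a hypothetical PyTorch layout of the agent in FIG. 3 (convolutional trunk, recurrent layer, separate policy and value heads); the layer sizes, the 84×84 input resolution and the number of actions are illustrative assumptions and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    """Hypothetical actor-critic agent: convolutional trunk -> RNN -> policy / value heads."""

    def __init__(self, num_actions: int = 4, hidden: int = 256):
        super().__init__()
        # Convolutional trunk; the output of the last convolutional layer is the
        # high-level feature map H referred to later in the description.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.rnn = nn.GRUCell(64 * 7 * 7, hidden)          # turns feature maps into a temporal vector
        self.policy_head = nn.Linear(hidden, num_actions)  # policy (decision) network
        self.value_head = nn.Linear(hidden, 1)             # value network: state-value prediction

    def forward(self, obs: torch.Tensor, h_prev: torch.Tensor):
        feature_map = self.conv(obs)                 # (B, 64, 7, 7) for an 84x84 RGB input
        h = self.rnn(feature_map.flatten(1), h_prev)
        return self.policy_head(h), self.value_head(h), feature_map, h
```

The output of `self.conv` plays the role of the high-level feature map used later when the value attention map is built.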
It should be noted that FIG. 2 and FIG. 3 are merely optional examples of the network structure of the agent during reinforcement learning; the network structure may also take other forms, and the embodiments of the present disclosure do not limit its specific form.
In an optional example, the relied-upon key visual information in the embodiments of the present disclosure is information that can reflect the attention of the agent (for example, the policy network in the agent) to the current environment image when making a decision. In the embodiments of the present disclosure, the timing of decision-making may be preset; for example, the agent may be preset to make a decision every 0.2 seconds. A decision result in the embodiments of the present disclosure may be the selection of one action from the action space. The embodiments of the present disclosure may first obtain, through the value network of the agent, a heat map corresponding to the agent's attention to the current environment image at decision time, and then obtain from this heat map the key visual information relied upon by the agent for the current environment image when making the decision. For example, the pixels in the heat map may be filtered against a preset threshold to select the pixels whose values exceed the threshold, and the region formed by the selected pixels then determines the agent's attention region for the current environment image at decision time. Obtaining the key visual information by using the agent's value network makes it convenient and fast to obtain the key visual information.
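As a concrete illustration of the thresholding step just described, the following minimal sketch selects the heat-map pixels that exceed a preset threshold; the 0.5 threshold and the assumption that the heat map is already normalized to [0, 1] are both illustrative.

```python
import numpy as np

def attention_region(heatmap: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask of the heat-map pixels whose value exceeds the preset threshold.

    The True pixels form the attention region the agent is considered to rely on
    when making its decision for the current environment image."""
    return heatmap > threshold
```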
In an optional example, the attention of the agent to the current environment image at decision time can be represented by a value attention map. In other words, the value attention map may include the key visual information relied upon by the agent's value network when making a state-value judgment. In an optional example, acquiring the key visual information relied upon by the agent when making a decision for the current environment image may include: acquiring the agent's value attention map for the current environment image; synthesizing the value attention map and the current environment image to obtain a heat map; and determining the agent's attention region for the current environment image according to the heat map.
The embodiments of the present disclosure may obtain the value attention map of the current environment image in various ways; for example, the flow shown in FIG. 4 may be used. In FIG. 4, S400: acquiring the feature map of the current environment image.
Optionally, the feature map in the embodiments of the present disclosure generally belongs to a high-level feature map formed by the agent's convolutional neural network for the current environment image. For example, the current environment image is input into the agent's convolutional neural network, and the feature map output by the last convolutional layer of the convolutional neural network is taken as the feature map of the current environment image in S400. Of course, it is also entirely feasible to take the feature map output by the penultimate convolutional layer as the feature map of the current environment image in S400, as long as it is a high-level feature map of the convolutional neural network. A high-level feature map in the embodiments of the present disclosure can be regarded as: when the structure of the agent's convolutional neural network is divided into two, three or more stages, a feature map formed for the current environment image by any layer in the middle stage, a middle-to-late stage, or the last stage. It can also be regarded as a feature map formed by a layer relatively close to, or closest to, the output of the agent's convolutional neural network. Using a high-level feature map helps improve the accuracy of the obtained value attention map.
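A possible way to obtain such a high-level feature map, reusing the hypothetical `AgentNetwork` sketched above, is shown below; the forward-hook variant is a generic alternative for an arbitrary convolutional network and is not prescribed by the disclosure.

```python
import torch

# Reusing the hypothetical AgentNetwork sketched earlier.
net = AgentNetwork()
obs = torch.zeros(1, 3, 84, 84)        # placeholder current environment image
h0 = torch.zeros(1, 256)
_, _, feature_map, _ = net(obs, h0)    # feature_map: (1, 64, 7, 7), the high-level map H

# Generic alternative for an arbitrary CNN: capture the last convolutional layer with a hook.
captured = {}
last_conv = [m for m in net.conv.modules() if isinstance(m, torch.nn.Conv2d)][-1]
hook = last_conv.register_forward_hook(lambda module, inputs, output: captured.update(H=output))
net(obs, h0)
hook.remove()
```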
S410: obtaining, from the feature map acquired above, the changed feature maps formed by masking the channels of the feature map one by one.
Optionally, a changed feature map in the embodiments of the present disclosure is a feature map that differs from the feature map in S400 because the corresponding channel of the feature map has been masked. When the feature map of the current environment image has multiple channels, one example of obtaining the changed feature maps is as follows: first, the first changed feature map is obtained by masking the first channel of the feature map; second, the second changed feature map is obtained by masking the second channel; third, the third changed feature map is obtained by masking the third channel; and so on, until the last channel of the feature map is masked to obtain the last changed feature map. The middle of the right side of FIG. 5 shows three changed feature maps obtained by masking different channels of the high-level feature map. Masking a channel of the feature map in the embodiments of the present disclosure can also be regarded as masking the corresponding activation information of the hidden layer. When the feature map has n channels (n being an integer greater than 1), n changed feature maps can be obtained. Masking the activation information of the corresponding hidden layer to obtain the changed feature maps can be implemented in an existing manner, and the specific implementation is not described in detail here.
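The channel-by-channel masking can be sketched as follows; the helper assumes the feature map is a single (C, h, w) tensor and simply zeroes one channel at a time, which is one straightforward way to realize the masking B_i ⊙ H described above.

```python
import torch

def channel_masked_maps(H: torch.Tensor) -> list:
    """Return the n changed feature maps obtained from a (C, h, w) feature map H
    by masking (zeroing) one channel at a time."""
    masked = []
    for i in range(H.shape[0]):
        B_i = torch.ones_like(H)
        B_i[i] = 0.0              # suppress the activation information of the i-th channel
        masked.append(B_i * H)    # element-wise product, i.e. B_i ⊙ H
    return masked
```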
S420: acquiring the state-value change of each changed feature map relative to the feature map.
Optionally, the embodiments of the present disclosure may first input each of the changed feature maps obtained above into the agent's value network to obtain the state value of each changed feature map; for example, the value network performs state-value prediction processing on each changed feature map, so that the state value of each changed feature map is obtained (for n changed feature maps, n state values can be obtained). Then, the difference between the state value output by the value network for the feature map in S400 and the state value of each changed feature map is calculated, thereby obtaining the state-value change of each changed feature map relative to the feature map of the current environment image.
Optionally, assuming that the state value formed by the value network for the feature map of the current environment image is V, and the state values formed by the value network for the n changed feature maps are V_1, V_2, ..., V_i, ..., V_n respectively, the embodiments of the present disclosure may calculate the differences between V and V_1, V and V_2, ..., V and V_i, ..., V and V_n, thereby obtaining n differences ΔV_1, ΔV_2, ..., ΔV_i, ..., ΔV_n (as shown at the upper right of FIG. 5). ΔV_1, ΔV_2, ..., ΔV_i, ..., ΔV_n are the state-value changes of the n changed feature maps relative to the feature map of the current environment image.
For any changed feature map, the embodiments of the present disclosure may calculate its state-value change relative to the feature map of the current environment image by the following formula (1):
ΔV = V − f_V(B_i ⊙ H)          Formula (1)
In formula (1), ΔV denotes the state-value change; V denotes the state value formed by the value network for the feature map of the current environment image; H denotes the feature map of the current environment image; B_i ⊙ H denotes the changed feature map obtained after masking the i-th channel of the feature map; and f_V(B_i ⊙ H) denotes the state value formed by the value network for that changed feature map, where i is an integer greater than 0 and not greater than n, and n is an integer greater than 1.
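A minimal sketch of formula (1) is given below. `value_fn` is a hypothetical callable mapping a feature map to a scalar state-value tensor (in the hypothetical AgentNetwork it would wrap the RNN and the value head); it is an assumption introduced for illustration only.

```python
import torch

@torch.no_grad()
def state_value_changes(H: torch.Tensor, value_fn) -> torch.Tensor:
    """Compute ΔV_i = V - f_V(B_i ⊙ H) for every channel i of the feature map H.

    `value_fn` is assumed to map a feature map to a scalar state-value tensor."""
    V = value_fn(H)                                       # state value of the unmasked feature map
    deltas = [V - value_fn(masked)                        # formula (1), one ΔV_i per channel
              for masked in channel_masked_maps(H)]
    return torch.stack(deltas)
```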
Since different activation information of a hidden layer in the convolutional neural network is activated by corresponding specific patterns, so that the hidden layer attends to different regions, the embodiments of the present disclosure mask the different activation information of the hidden layer one by one and acquire the state-value change of each changed feature map relative to the feature map, so that the different state-value changes can reflect the degree of the agent's attention to different regions.
S430: forming the value attention map according to the state-value changes and the changed feature maps.
In an optional example, the above operations S400-S430 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the key vision acquisition module 600 run by the processor.
Optionally, the embodiments of the present disclosure may normalize the state-value changes to form the weight of each changed feature map. One example of normalizing the state-value changes is shown in formula (2) below:
ω_i = ΔV_i / Σ_{j=1}^{n} ΔV_j          Formula (2)
In formula (2), ω_i denotes the weight of the i-th changed feature map.
Optionally, the embodiments of the present disclosure may form the value attention map by the following formula (3):
A = Σ_{i=1}^{K} ω_i · H_i          Formula (3)
In formula (3), A denotes the value attention map, H_i denotes the feature map of the i-th channel output by the last convolutional layer of the convolutional neural network, and K is the number of channels.
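The following sketch combines formulas (2) and (3). Because the original image of formula (2) is not reproduced in the text, a simple non-negative sum-normalization is assumed here; only the weighted channel sum of formula (3) should be read as following directly from the surrounding description.

```python
import torch

def value_attention_map(H: torch.Tensor, deltas: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Form the value attention map A as a weighted sum of the channels of H.

    The weights are a normalization of the state-value changes; a non-negative
    sum-normalization is assumed here in place of formula (2)."""
    deltas = torch.clamp(deltas, min=0.0)       # assumption: keep only non-negative changes
    w = deltas / (deltas.sum() + eps)           # normalized weights ω_i
    return (w.view(-1, 1, 1) * H).sum(dim=0)    # weighted channel sum, as in formula (3)
```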
It should be particularly noted that the embodiments of the present disclosure may also obtain, in an existing manner, the agent's value attention map for the current environment image at decision time. The embodiments of the present disclosure do not limit the specific implementation process of acquiring this value attention map.
In an optional example, the embodiments of the present disclosure may first resize the value attention map A obtained above, for example by upsampling it, so that its size is the same as that of the current environment image; the resized value attention map A' and the current environment image (such as the image at the lower left of FIG. 5) are then fused to obtain the heat map corresponding to the value attention map of the current environment image. One optional example of the heat map is the image shown at the lower right of FIG. 5.
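A minimal NumPy sketch of this resizing-and-fusion step is shown below; nearest-neighbour upsampling and the 0.5 blending weight are illustrative assumptions, since the disclosure does not fix a particular upsampling or fusion method.

```python
import numpy as np

def heatmap_overlay(A: np.ndarray, image: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Resize the value attention map A (h, w) to the size of the current environment
    image (H, W, 3) by nearest-neighbour upsampling and blend the two."""
    H_img, W_img = image.shape[:2]
    rows = np.arange(H_img) * A.shape[0] // H_img      # nearest-neighbour row indices
    cols = np.arange(W_img) * A.shape[1] // W_img      # nearest-neighbour column indices
    A_up = A[rows][:, cols]
    A_up = (A_up - A_up.min()) / (A_up.max() - A_up.min() + 1e-8)   # normalize to [0, 1]
    return (1.0 - alpha) * image + alpha * A_up[..., None] * 255.0  # simple additive blend
```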
In an optional example, the actual key visual information of the current environment image in the embodiments of the present disclosure may include: the region where the target object in the current environment image is located. For example, a target-object detection algorithm may be used to obtain the region where the target object in the current environment image is located. The embodiments of the present disclosure limit neither the specific implementation of the target-object detection algorithm nor the specific implementation of obtaining the region where the target object is located.
In an optional example, the attention change reward information in the embodiments of the present disclosure can reflect the gap between the region the agent attends to in the current environment image and the region it should actually attend to. That is, the attention change reward information may be determined according to the size of the difference between the attention region the agent attends to at decision time and the region where the target object in the current environment image is located.
Optionally, the embodiments of the present disclosure may first determine the agent's attention region for the current environment image according to the relied-upon key visual information; for example, the pixels in the relied-upon key visual information (such as the heat map) may be filtered against a preset threshold to select the pixels whose values exceed the threshold, and the agent's attention region a for the current environment image is determined from the region formed by the selected pixels. Then, the ratio (a∩b)/b of the intersection of the attention region a and the region b where the target object in the current environment image is located to the target-object region b is calculated, and the attention change reward information is determined from this ratio, for example by converting the ratio. The ratio, or the attention change reward information obtained from it, can be regarded as a safety evaluation index of the agent's behavior: the larger the ratio, the higher the safety of the agent's behavior; conversely, the smaller the ratio, the lower the safety of the agent's behavior.
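The ratio (a∩b)/b can be computed directly from two binary masks, as in the sketch below; treating an empty target region as yielding a zero reward is an assumption made for illustration.

```python
import numpy as np

def attention_change_reward(attention_mask: np.ndarray, target_mask: np.ndarray) -> float:
    """Compute (a ∩ b) / b: how much of the target-object region b is covered by the
    agent's attention region a. Both inputs are boolean masks of the same shape."""
    b = target_mask.sum()
    if b == 0:
        return 0.0                 # assumption: no target object means no reward contribution
    return float((attention_mask & target_mask).sum()) / float(b)
```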
In an optional example, the embodiments of the present disclosure adjust the reward feedback of the agent's reinforcement learning with the attention change reward information obtained above (for example, by adding it to the reward feedback), and use such reward feedback to update the network parameters of the agent (such as the network parameters of the convolutional neural network, the value network and the policy network), so that during reinforcement learning the agent can reduce the probability of performing dangerous actions caused by attention changes (such as attention deviation). The network parameters of the agent may be updated in a manner based on the actor-critic algorithm in reinforcement learning. The specific goals of updating the network parameters include: making the state value predicted by the agent's value network as close as possible to the accumulated reward information within one environment exploration episode, and updating the network parameters of the agent's policy network in the direction that increases the state value predicted by the value network.
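One simple way to make the reward feedback contain both terms is plain additive reward shaping, as sketched below; the weighting factor `lam` is an assumption, since the disclosure only requires that the feedback reflect the attention change reward information.

```python
def shaped_reward(decision_reward: float, attention_reward: float, lam: float = 0.1) -> float:
    """Reward feedback containing both the ordinary decision reward and the
    attention change reward; the weighting factor `lam` is an illustrative assumption."""
    return decision_reward + lam * attention_reward
```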
In an optional example, in a brick-breaking game, the falling ball accelerates under gravity; the paddle that catches the falling ball often performs dangerous actions (such as a null action) because its attention lags. By making the paddle perform reinforcement learning with reward feedback (such as reward information) that reflects the attention change reward information, the embodiments of the present disclosure help avoid the paddle's attention lag and thus help reduce the probability of the paddle performing dangerous actions.
It should be particularly noted that when the attention change reward information is used to adjust the reward feedback and the reward feedback is used to realize the agent's reinforcement learning, the agent may already have undergone a certain degree of reinforcement learning. For example, after initializing the agent, the embodiments of the present disclosure may use an existing reinforcement learning method to train the agent on the basis of reward feedback that does not contain attention change reward information; when it is judged that the degree of the agent's reinforcement learning meets a certain requirement (for example, the entropy of the policy network has dropped to a certain value, such as 0.6), the technical solution provided by the embodiments of the present disclosure is then adopted for the agent to continue reinforcement learning, which helps improve the efficiency and success rate of the agent's reinforcement learning.
In an optional example, during the above reinforcement learning process, the embodiments of the present disclosure may select important reinforcement-learning training data from the sampled reinforcement-learning training data and store it as historical training data, so that during experience replay the important training data can be used to adjust the network parameters of the agent, for example the network parameters of the policy network, the value network and the convolutional neural network, or the network parameters of the policy network, the value network, the RNN and the convolutional neural network. By selecting important reinforcement-learning training data for storage as historical training data, the cache space required for historical training data can be effectively reduced; by using the important training data for experience replay, the efficiency of the agent's reinforcement learning can be improved.
The intelligent agent reinforcement learning method of the above embodiments of the present disclosure may further include: determining the exploration degree within an environment exploration episode according to the relied-upon key visual information; and performing experience replay with the stored historical training data when it is judged that the exploration degree does not meet a predetermined exploration degree. The historical training data may include training data obtained by filtering the sampled reinforcement-learning training data against a preset requirement.
In an optional example, determining the exploration degree within an environment exploration episode according to the relied-upon key visual information may include: determining the attention change within the episode according to the change information between the agent's value attention maps for the current environment images at multiple adjacent moments within the episode, where the attention change is used to measure the exploration degree within the episode.
In an optional example, the embodiments of the present disclosure may use the positive reward within an environment exploration episode and the exploration degree of the episode to determine the importance of the reinforcement-learning training data within the episode, so that when the importance is judged to meet a predetermined requirement, the reinforcement-learning training data within the episode can be cached as historical training data.
In an optional example, the exploration degree of an environment exploration episode in the embodiments of the present disclosure can be represented by the attention change within that episode. For example, the attention change within the episode may be determined according to the change information between the agent's value attention maps for the current environment images at multiple adjacent moments within the episode, and this attention change may be taken as the exploration degree of the episode. Optionally, the embodiments of the present disclosure may use the following formula (4) to calculate the attention change within one environment exploration episode:
E = (1/T) · Σ_{t=1}^{T} Σ_{p∈Ω} |A_t(p) − A_{t−1}(p)|          Formula (4)
In formula (4), E denotes the average attention change within one environment exploration episode; Ω denotes the set of all pixels of the current environment image; T denotes the number of interactions between the agent and the environment within one exploration episode; A_t denotes the value attention map corresponding to the current environment image at the agent's t-th interaction with the environment, and A_{t−1} denotes the value attention map corresponding to the current environment image at the (t−1)-th interaction.
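A sketch corresponding to formula (4) is given below; since the original formula image is not reproduced in the text, the exact normalization (here, averaging over the T interactions only) is an assumption.

```python
import numpy as np

def average_attention_change(attention_maps: list) -> float:
    """Average change of the value attention map over one environment exploration episode.

    `attention_maps` holds A_1 ... A_T, one map per agent-environment interaction,
    each defined over all pixels of the current environment image."""
    T = len(attention_maps)
    if T < 2:
        return 0.0
    total = sum(np.abs(attention_maps[t] - attention_maps[t - 1]).sum()   # sum over all pixels
                for t in range(1, T))
    return float(total) / T          # averaged over the episode, in the spirit of formula (4)
```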
In an optional example, the embodiments of the present disclosure may use the following formula (5) to calculate the importance of the reinforcement-learning training data within one environment exploration episode:
S = β · Σr⁺ + (1 − β) · E          Formula (5)
In formula (5), S denotes the importance of the reinforcement-learning training data within one environment exploration episode, β denotes a hyper-parameter, usually a constant between 0 and 1, r⁺ denotes the positive rewards within the episode, and E denotes the average attention change within the episode.
In an optional example, if the importance of the reinforcement-learning training data within an environment exploration episode is higher than a predetermined value, all the reinforcement-learning training data within the episode (such as the reward information and the current environment images) may be cached as historical training data; otherwise, none of the reinforcement-learning training data within the episode is retained.
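Formula (5) and the caching rule can be sketched together as follows; `beta` and the importance threshold are hyper-parameters whose concrete values are assumptions.

```python
def episode_importance(positive_rewards: list, E: float, beta: float = 0.5) -> float:
    """Formula (5): S = beta * sum(r+) + (1 - beta) * E."""
    return beta * sum(positive_rewards) + (1.0 - beta) * E

def maybe_store(replay_buffer: list, episode_data, S: float, threshold: float) -> None:
    """Cache an episode as historical training data only if its importance S exceeds
    the predetermined value; otherwise its training data is discarded."""
    if S > threshold:
        replay_buffer.append(episode_data)
```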
In an optional example, during reinforcement learning the embodiments of the present disclosure may use the cached historical training data to adjust the network parameters of the agent by way of experience replay, for example the network parameters of the policy network, the value network and the convolutional neural network, or of the policy network, the value network, the RNN and the convolutional neural network. Optionally, the embodiments of the present disclosure judge the exploration degree within an environment exploration episode; when it is determined that the exploration degree does not meet the predetermined exploration degree, a random number may be generated, and if the random number exceeds a predetermined value (such as 0.3), it is determined that experience replay is required, so that the pre-stored historical training data can be used to perform an experience replay operation. If the random number does not exceed the predetermined value, it can be determined that experience replay is not required. The specific implementation of experience replay may adopt an existing implementation and is not described in detail here.
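The replay-trigger logic described above can be sketched as follows; the 0.3 trigger value comes from the example in the text, while expressing the comparison of the exploration degree against the predetermined degree as a simple numeric check is an assumption about its form.

```python
import random

def should_replay(exploration_degree: float, required_degree: float, trigger: float = 0.3) -> bool:
    """Decide whether to perform experience replay: when the exploration degree falls
    short of the predetermined degree, draw a random number and replay only if it
    exceeds the trigger value (0.3 in the example given above)."""
    if exploration_degree >= required_degree:
        return False
    return random.random() > trigger
```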
Any of the intelligent agent reinforcement learning methods provided by the embodiments of the present disclosure may be executed by any suitable device having data processing capability, including but not limited to a terminal device and a server. Alternatively, any of these methods may be executed by a processor; for example, the processor executes any of the intelligent agent reinforcement learning methods mentioned in the embodiments of the present disclosure by invoking corresponding instructions stored in a memory. This is not repeated below.
Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments; the storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk or an optical disc.
FIG. 6 is a schematic structural diagram of one embodiment of the intelligent agent reinforcement learning apparatus of the present disclosure. As shown in FIG. 6, the apparatus of this embodiment mainly includes: a key vision acquisition module 600, an actual vision acquisition module 610, a change reward determination module 620, and a reward feedback adjustment module 630. Optionally, the apparatus may further include: an experience replay module 640 and a training data acquisition module 650.
The key vision acquisition module 600 is configured to acquire the key visual information relied upon by the agent when making a decision for the current environment image.
In an optional example, the relied-upon key visual information may include: the agent's attention region for the current environment image when making the decision. The key vision acquisition module 600 may further be configured to: first acquire the agent's value attention map for the current environment image; then synthesize the value attention map and the current environment image to obtain a heat map; and then determine the agent's attention region for the current environment image according to the heat map.
In an optional example, the key vision acquisition module 600 may acquire the value attention map as follows: first, acquire the feature map of the current environment image; then obtain, from the feature map, the changed feature maps formed by masking the channels of the feature map one by one; then acquire the state-value change of each changed feature map relative to the feature map; and finally form the value attention map from the state-value changes and the changed feature maps.
In an optional example, the key vision acquisition module 600 may acquire the feature map of the current environment image as follows: first, input the current environment image into the convolutional neural network, and then acquire the feature map output by the last convolutional layer of the convolutional neural network, the feature map output by the last convolutional layer being the feature map of the current environment image acquired by the key vision acquisition module.
In an optional example, the key vision acquisition module 600 may acquire the state-value change of each changed feature map relative to the feature map as follows: first, input each changed feature map into the agent's value network to obtain the state value of each changed feature map; then calculate the differences between the state value output by the value network for the feature map and the state values of the changed feature maps, so as to obtain the state-value change of each changed feature map relative to the feature map.
The actual vision acquisition module 610 is configured to acquire the actual key visual information of the current environment image.
In an optional example, the actual key visual information of the current environment image may include: the region where the target object in the current environment image is located.
The change reward determination module 620 is configured to determine the attention change reward information according to the relied-upon key visual information and the actual key visual information.
In an optional example, the change reward determination module 620 may determine the attention change reward information according to the ratio of the intersection of the agent's attention region for the current environment image at decision time and the region where the target object is located to the region where the target object is located.
The reward feedback adjustment module 630 is configured to adjust the reward feedback of the agent's reinforcement learning according to the attention change reward information.
In an optional example, the reward feedback of the agent's reinforcement learning may include: the attention change reward information and the reward information formed by the agent making a decision for the current environment image.
The experience replay module 640 is configured to determine the exploration degree within an environment exploration episode according to the relied-upon key visual information, and to perform experience replay with the stored historical training data when it is judged that the exploration degree does not meet the predetermined exploration degree. The historical training data in the embodiments of the present disclosure includes: training data obtained by filtering the sampled reinforcement-learning training data against a preset requirement.
In an optional example, the experience replay module 640 may determine the exploration degree within an environment exploration episode as follows: determine the attention change within the episode according to the change information between the agent's value attention maps for the current environment images at multiple adjacent moments within the episode, where the attention change is used to measure the exploration degree within the episode.
The training data acquisition module 650 is configured to determine, according to the positive reward and the exploration degree within the environment exploration episode, the importance of the reinforcement-learning training data sampled within the episode, and to store as historical training data the reinforcement-learning training data sampled within the episode whose importance meets the predetermined requirement.
For the specific operations performed by the key vision acquisition module 600, the actual vision acquisition module 610, the change reward determination module 620, the reward feedback adjustment module 630, the experience replay module 640 and the training data acquisition module 650, reference may be made to the descriptions of FIG. 1 to FIG. 5 in the above method embodiments, which are not repeated here.
FIG. 7 shows an exemplary device 700 suitable for implementing the embodiments of the present disclosure. The device 700 may be a control system/electronic system configured in an automobile, a mobile terminal (for example, a smart mobile phone), a personal computer (PC, for example, a desktop or notebook computer), a tablet computer, a server, or the like. In FIG. 7, the device 700 includes one or more processors, a communication part and the like. The one or more processors may be one or more central processing units (CPU) 701 and/or one or more graphics processors (GPU) 713 that perform the intelligent agent reinforcement learning method using a neural network. The processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 702 or executable instructions loaded from a storage section 708 into a random access memory (RAM) 703. The communication part 712 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory 702 and/or the random access memory 703 to execute the executable instructions, is connected to the communication part 712 through a bus 704, and communicates with other target devices via the communication part 712, thereby completing the corresponding steps of the intelligent agent reinforcement learning method of any embodiment of the present disclosure.
For the operations performed by the above instructions, reference may be made to the relevant descriptions in the above method embodiments, which are not detailed here. In addition, the RAM 703 may also store various programs and data required for the operation of the apparatus. The CPU 701, the ROM 702 and the RAM 703 are connected to one another through the bus 704.
When the RAM 703 is present, the ROM 702 is an optional module. The RAM 703 stores executable instructions, or writes executable instructions into the ROM 702 at runtime, and the executable instructions cause the central processing unit 701 to perform the steps included in the intelligent agent reinforcement learning method of any of the above embodiments. An input/output (I/O) interface 705 is also connected to the bus 704. The communication part 712 may be provided in an integrated manner, or may be provided with multiple sub-modules (for example, multiple IB network cards) that are respectively connected to the bus.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read therefrom is installed into the storage section 708 as needed.
It should be particularly noted that the architecture shown in FIG. 7 is only an optional implementation. In specific practice, the number and types of the components in FIG. 7 may be selected, reduced, increased or replaced according to actual needs; different functional components may also be arranged separately or in an integrated manner. For example, the GPU 713 and the CPU 701 may be arranged separately, or the GPU 713 may be integrated on the CPU 701; the communication part may be arranged separately, or may be integrated on the CPU 701 or the GPU 713, and so on. These alternative embodiments all fall within the protection scope of the embodiments of the present disclosure.
In particular, according to the embodiments of the present disclosure, the process described with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program tangibly embodied in a machine-readable medium; the computer program contains program code for performing the steps shown in the flowchart, and the program code may include instructions corresponding to the steps of the intelligent agent reinforcement learning method provided by any embodiment of the present disclosure.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the instructions for implementing the above corresponding operations described in the intelligent agent reinforcement learning method of any embodiment of the present disclosure are executed.
In one or more optional embodiments, the embodiments of the present disclosure further provide a computer program product for storing computer-readable instructions, which, when executed, cause a computer to perform the intelligent agent reinforcement learning method described in any of the above embodiments.
The computer program product may be implemented by hardware, software or a combination thereof. In an optional example, the computer program product is embodied as a computer storage medium; in another optional example, it is embodied as a software product, such as a Software Development Kit (SDK).
In one or more optional embodiments, the embodiments of the present disclosure further provide another intelligent agent reinforcement learning method and a corresponding apparatus, electronic device, computer storage medium, computer program and computer program product, where the method includes: a first apparatus sends an intelligent agent reinforcement learning instruction to a second apparatus, the instruction causing the second apparatus to perform the intelligent agent reinforcement learning method of any of the above possible embodiments; and the first apparatus receives the result of the intelligent agent reinforcement learning sent by the second apparatus.
In some embodiments, the intelligent agent reinforcement learning instruction may specifically be an invocation instruction, and the first apparatus may instruct the second apparatus, by means of invocation, to perform the intelligent agent reinforcement learning operation; accordingly, in response to receiving the invocation instruction, the second apparatus may perform the operations and/or flows in any embodiment of the above intelligent agent reinforcement learning method.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts among the embodiments, reference may be made to one another. Since the system embodiments substantially correspond to the method embodiments, their description is relatively brief, and for relevant parts reference may be made to the description of the method embodiments.
The methods and apparatuses, electronic devices and computer-readable storage media of the embodiments of the present disclosure may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware and firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the embodiments of the present disclosure are not limited to the order specifically described above, unless otherwise specifically stated. In addition, in some embodiments, the embodiments of the present disclosure may also be implemented as programs recorded in a recording medium, and these programs include machine-readable instructions for implementing the method according to the embodiments of the present disclosure. Accordingly, the embodiments of the present disclosure also cover a recording medium storing a program for executing the method according to the embodiments of the present disclosure.
The description of the embodiments of the present disclosure is given for the purpose of illustration and description, and is not exhaustive or intended to limit the embodiments of the present disclosure to the forms disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical applications of the embodiments of the present disclosure, and to enable those of ordinary skill in the art to design various embodiments with various modifications suited to particular uses.

Claims (27)

  1. An intelligent agent reinforcement learning method, characterized by comprising:
    acquiring key visual information relied upon by an agent when making a decision for a current environment image;
    acquiring actual key visual information of the current environment image;
    determining attention change reward information according to the relied-upon key visual information and the actual key visual information; and
    adjusting reward feedback of the agent's reinforcement learning according to the attention change reward information.
  2. The method according to claim 1, characterized in that the relied-upon key visual information comprises: the agent's attention region for the current environment image when making the decision.
  3. The method according to claim 2, characterized in that acquiring the key visual information relied upon by the agent when making a decision for the current environment image comprises:
    acquiring the agent's value attention map for the current environment image;
    synthesizing the value attention map and the current environment image to obtain a heat map; and
    determining the agent's attention region for the current environment image according to the heat map.
  4. The method according to claim 3, characterized in that acquiring the agent's value attention map for the current environment image comprises:
    acquiring a feature map of the current environment image;
    obtaining, from the feature map, changed feature maps formed by masking the channels of the feature map one by one;
    acquiring a state-value change of each changed feature map relative to the feature map; and
    forming the value attention map according to the state-value changes and the changed feature maps.
  5. The method according to claim 4, characterized in that acquiring the feature map of the current environment image comprises:
    inputting the current environment image into a convolutional neural network, and acquiring the feature map output by the last convolutional layer of the convolutional neural network.
  6. The method according to any one of claims 4 to 5, characterized in that acquiring the state-value change of each changed feature map relative to the feature map comprises:
    inputting each changed feature map into a value network of the agent to obtain a state value of each changed feature map; and
    calculating differences between a state value output by the value network for the feature map and the state values of the changed feature maps, to obtain the state-value change of each changed feature map relative to the feature map.
  7. The method according to any one of claims 1 to 6, characterized in that the actual key visual information of the current environment image comprises: a region where a target object in the current environment image is located.
  8. The method according to claim 7, characterized in that determining the attention change reward information according to the relied-upon key visual information and the actual key visual information comprises:
    determining the attention change reward information according to a ratio of an intersection of the agent's attention region for the current environment image when making the decision and the region where the target object is located to the region where the target object is located.
  9. The method according to any one of claims 1 to 8, characterized in that the reward feedback of the agent's reinforcement learning comprises: the attention change reward information and reward information formed by the agent making a decision for the current environment image.
  10. The method according to any one of claims 1 to 9, characterized in that the method further comprises:
    determining an exploration degree within an environment exploration episode according to the relied-upon key visual information; and
    performing experience replay with stored historical training data when it is judged that the exploration degree does not meet a predetermined exploration degree;
    wherein the historical training data comprises: training data obtained by filtering sampled reinforcement-learning training data against a preset requirement.
  11. The method according to claim 10, characterized in that determining the exploration degree within the environment exploration episode according to the relied-upon key visual information comprises:
    determining an attention change within the environment exploration episode according to change information between the agent's value attention maps for current environment images at multiple adjacent moments within the episode;
    wherein the attention change is used to measure the exploration degree within the environment exploration episode.
  12. The method according to claim 11, characterized in that the method further comprises:
    determining, according to a positive reward and the exploration degree within the environment exploration episode, an importance of the reinforcement-learning training data sampled within the episode; and
    storing, as historical training data, the reinforcement-learning training data sampled within the episode whose importance meets a predetermined requirement.
  13. An intelligent agent reinforcement learning apparatus, characterized by comprising:
    a key vision acquisition module configured to acquire key visual information relied upon by an agent when making a decision for a current environment image;
    an actual vision acquisition module configured to acquire actual key visual information of the current environment image;
    a change reward determination module configured to determine attention change reward information according to the relied-upon key visual information and the actual key visual information; and
    a reward feedback adjustment module configured to adjust reward feedback of the agent's reinforcement learning according to the attention change reward information.
  14. The apparatus according to claim 13, characterized in that the relied-upon key visual information comprises: the agent's attention region for the current environment image when making the decision.
  15. The apparatus according to claim 14, characterized in that the key vision acquisition module is configured to:
    acquire the agent's value attention map for the current environment image;
    synthesize the value attention map and the current environment image to obtain a heat map; and
    determine the agent's attention region for the current environment image according to the heat map.
  16. The apparatus according to claim 15, characterized in that the key vision acquisition module is configured to:
    acquire a feature map of the current environment image;
    obtain, from the feature map, changed feature maps formed by masking the channels of the feature map one by one;
    acquire a state-value change of each changed feature map relative to the feature map; and
    form the value attention map according to the state-value changes and the changed feature maps.
  17. The apparatus according to claim 16, characterized in that the key vision acquisition module is configured to:
    input the current environment image into a convolutional neural network and acquire the feature map output by the last convolutional layer of the convolutional neural network;
    wherein the feature map output by the last convolutional layer is the feature map of the current environment image acquired by the key vision acquisition module.
  18. The apparatus according to any one of claims 16 to 17, characterized in that the key vision acquisition module is configured to:
    input each changed feature map into a value network of the agent to obtain a state value of each changed feature map; and
    calculate differences between a state value output by the value network for the feature map and the state values of the changed feature maps, to obtain the state-value change of each changed feature map relative to the feature map.
  19. The apparatus according to any one of claims 13 to 18, characterized in that the actual key visual information of the current environment image comprises: a region where a target object in the current environment image is located.
  20. The apparatus according to claim 19, characterized in that the change reward determination module is configured to:
    determine the attention change reward information according to a ratio of an intersection of the agent's attention region for the current environment image when making the decision and the region where the target object is located to the region where the target object is located.
  21. The apparatus according to any one of claims 13 to 20, characterized in that the reward feedback of the agent's reinforcement learning comprises: the attention change reward information and reward information formed by the agent making a decision for the current environment image.
  22. The apparatus according to any one of claims 13 to 21, characterized in that the apparatus further comprises an experience replay module configured to:
    determine an exploration degree within an environment exploration episode according to the relied-upon key visual information; and
    perform experience replay with stored historical training data when it is judged that the exploration degree does not meet a predetermined exploration degree;
    wherein the historical training data comprises: training data obtained by filtering sampled reinforcement-learning training data against a preset requirement.
  23. The apparatus according to claim 22, characterized in that the experience replay module is configured to:
    determine an attention change within the environment exploration episode according to change information between the agent's value attention maps for current environment images at multiple adjacent moments within the episode;
    wherein the attention change is used to measure the exploration degree within the environment exploration episode.
  24. The apparatus according to claim 23, characterized in that the apparatus further comprises a training data acquisition module configured to:
    determine, according to the positive reward and the exploration degree within the environment exploration episode, an importance of the reinforcement-learning training data sampled within the episode; and
    store, as historical training data, the reinforcement-learning training data sampled within the episode whose importance meets a predetermined requirement.
  25. An electronic device, comprising:
    a memory configured to store a computer program; and
    a processor configured to execute the computer program stored in the memory, wherein when the computer program is executed, the method according to any one of claims 1-12 is implemented.
  26. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the method according to any one of claims 1-12 is implemented.
  27. A computer program, comprising computer instructions, wherein when the computer instructions are run in a processor of a device, the method according to any one of claims 1-12 is implemented.
PCT/CN2019/096233 2018-07-28 2019-07-16 Intelligent agent reinforcement learning method and apparatus, device and medium WO2020024791A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
SG11202013079WA SG11202013079WA (en) 2018-07-28 2019-07-16 Intelligent agent reinforcement learning method and apparatus, device and medium
JP2021500797A JP7163477B2 (ja) 2018-07-28 2019-07-16 知能客体強化学習方法、装置、デバイス、及び媒体
US17/137,063 US20210117738A1 (en) 2018-07-28 2020-12-29 Intelligent agent reinforcement learning method and apparatus, device and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810849877.6A CN109190720B (zh) 2018-07-28 2018-07-28 Intelligent agent reinforcement learning method and apparatus, device and medium
CN201810849877.6 2018-07-28

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/137,063 Continuation US20210117738A1 (en) 2018-07-28 2020-12-29 Intelligent agent reinforcement learning method and apparatus, device and medium

Publications (1)

Publication Number Publication Date
WO2020024791A1 true WO2020024791A1 (zh) 2020-02-06

Family

ID=64937811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096233 WO2020024791A1 (zh) 2018-07-28 2019-07-16 智能体强化学习方法、装置、设备及介质

Country Status (5)

Country Link
US (1) US20210117738A1 (zh)
JP (1) JP7163477B2 (zh)
CN (1) CN109190720B (zh)
SG (1) SG11202013079WA (zh)
WO (1) WO2020024791A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190720B (zh) * 2018-07-28 2021-08-06 深圳市商汤科技有限公司 智能体强化学习方法、装置、设备及介质
JP2022525423A (ja) * 2019-03-20 2022-05-13 ソニーグループ株式会社 ダブルアクタークリティックアルゴリズムを通じた強化学習
CN111898727A (zh) * 2019-05-06 2020-11-06 清华大学 基于短时访问机制的强化学习方法、装置及存储介质
CN110147891B (zh) * 2019-05-23 2021-06-01 北京地平线机器人技术研发有限公司 应用于强化学习训练过程的方法、装置及电子设备
CN110225019B (zh) * 2019-06-04 2021-08-31 腾讯科技(深圳)有限公司 一种网络安全处理方法和装置
CN113872924B (zh) * 2020-06-30 2023-05-02 中国电子科技集团公司电子科学研究院 一种多智能体的动作决策方法、装置、设备及存储介质
CN111791103B (zh) * 2020-06-30 2022-04-29 北京百度网讯科技有限公司 滤波器调试方法、装置、电子设备和可读存储介质
CN112216124B (zh) * 2020-09-17 2021-07-27 浙江工业大学 一种基于深度强化学习的交通信号控制方法
CN113255893B (zh) * 2021-06-01 2022-07-05 北京理工大学 一种多智能体行动策略自演进生成方法
CN113671834B (zh) * 2021-08-24 2023-09-01 郑州大学 一种机器人柔性行为决策方法及设备
CN113867147A (zh) * 2021-09-29 2021-12-31 商汤集团有限公司 训练及控制方法、装置、计算设备和介质
CN116805353B (zh) * 2023-08-21 2023-10-31 成都中轨轨道设备有限公司 跨行业通用的智能机器视觉感知方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106970615A (zh) * 2017-03-21 2017-07-21 西北工业大学 一种深度强化学习的实时在线路径规划方法
US20180174001A1 (en) * 2016-12-15 2018-06-21 Samsung Electronics Co., Ltd. Method of training neural network, and recognition method and apparatus using neural network
CN109190720A (zh) * 2018-07-28 2019-01-11 深圳市商汤科技有限公司 智能体强化学习方法、装置、设备及介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5346701B2 (ja) * 2009-06-12 2013-11-20 本田技研工業株式会社 学習制御システム及び学習制御方法
CN117371492A (zh) * 2016-11-04 2024-01-09 渊慧科技有限公司 一种计算机实现的方法及其系统
CN107179077B (zh) * 2017-05-15 2020-06-09 北京航空航天大学 一种基于elm-lrf的自适应视觉导航方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180174001A1 (en) * 2016-12-15 2018-06-21 Samsung Electronics Co., Ltd. Method of training neural network, and recognition method and apparatus using neural network
CN106970615A (zh) * 2017-03-21 2017-07-21 西北工业大学 一种深度强化学习的实时在线路径规划方法
CN109190720A (zh) * 2018-07-28 2019-01-11 深圳市商汤科技有限公司 智能体强化学习方法、装置、设备及介质

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU, . QUAN ET AL.: "A Deep Recurrent Q-Network Based on Visual Attention Mechanism", CHINESE JOURNAL OF COMPUTERS, vol. 40, no. 6, 30 June 2017 (2017-06-30) *
PRICE, B. ET AL.: "Accelerating Reinforcement Learning through Implicit Imitation", JOURNAL OF ARRIFICIAD INTELLIGENCE RESEARCH, vol. 19, no. 2003, 12 March 2003 (2003-03-12), XP080507310 *
YUN, S. ET AL.: "Action-Driven Visual Object Tracking with Deep Reinforcement Learning", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, vol. 29, no. 6, 30 June 2018 (2018-06-30), XP055681249 *

Also Published As

Publication number Publication date
US20210117738A1 (en) 2021-04-22
CN109190720B (zh) 2021-08-06
CN109190720A (zh) 2019-01-11
JP7163477B2 (ja) 2022-10-31
SG11202013079WA (en) 2021-02-25
JP2021532457A (ja) 2021-11-25

Similar Documents

Publication Publication Date Title
WO2020024791A1 (zh) 智能体强化学习方法、装置、设备及介质
JP7009614B2 (ja) ディープニューラルネットワークの正規化方法および装置、機器、ならびに記憶媒体
WO2022083536A1 (zh) 一种神经网络构建方法以及装置
US20220012533A1 (en) Object Recognition Method and Apparatus
TWI721510B (zh) 雙目圖像的深度估計方法、設備及儲存介質
WO2020256704A1 (en) Real-time video ultra resolution
US11270124B1 (en) Temporal bottleneck attention architecture for video action recognition
WO2019214344A1 (zh) 系统增强学习方法和装置、电子设备、计算机存储介质
CN112651511A (zh) 一种训练模型的方法、数据处理的方法以及装置
US11688077B2 (en) Adaptive object tracking policy
CN110447041B (zh) 噪声神经网络层
WO2022179581A1 (zh) 一种图像处理方法及相关设备
CN109934247A (zh) 电子装置及其控制方法
JP7227385B2 (ja) ニューラルネットワークのトレーニング及び目開閉状態の検出方法、装置並び機器
WO2021103675A1 (zh) 神经网络的训练及人脸检测方法、装置、设备和存储介质
Kumagai et al. Mixture of counting CNNs
CN111783996B (zh) 一种数据处理方法、装置及设备
US20240078428A1 (en) Neural network model training method, data processing method, and apparatus
US10757369B1 (en) Computer implemented system and method for high performance visual tracking
CN113407820B (zh) 利用模型进行数据处理的方法及相关系统、存储介质
US11388223B2 (en) Management device, management method, and management program
Xu et al. A deep deterministic policy gradient algorithm based on averaged state-action estimation
CN113868187A (zh) 处理神经网络的方法和电子装置
US11816185B1 (en) Multi-view image analysis using neural networks
KR102190584B1 (ko) 메타 강화 학습을 이용한 인간 행동패턴 및 행동전략 추정 시스템 및 방법

Legal Events

Code  Title / Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19844987; Country of ref document: EP; Kind code of ref document: A1)
ENP   Entry into the national phase (Ref document number: 2021500797; Country of ref document: JP; Kind code of ref document: A)
NENP  Non-entry into the national phase (Ref country code: DE)
32PN  Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/05/2021))
122   Ep: pct application non-entry in european phase (Ref document number: 19844987; Country of ref document: EP; Kind code of ref document: A1)