CN117236459A - Multi-agent reinforcement learning method and related device - Google Patents

Multi-agent reinforcement learning method and related device

Info

Publication number
CN117236459A
CN117236459A (application CN202210623513.2A)
Authority
CN
China
Prior art keywords
agent
value
function
sparse
network
Prior art date
Legal status
Pending
Application number
CN202210623513.2A
Other languages
Chinese (zh)
Inventor
李银川
邵云峰
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202210623513.2A priority Critical patent/CN117236459A/en
Priority to PCT/CN2023/096819 priority patent/WO2023231961A1/en
Publication of CN117236459A publication Critical patent/CN117236459A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/08: Learning methods
    • G06N 3/092: Reinforcement learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A multi-agent reinforcement learning method is applied to the technical field of artificial intelligence. In the method, a dense structure and a sparse structure are introduced in parallel into an agent network. Based on the sparse structure, a first agent attends only to part of the agents, so that information irrelevant to the first agent is ignored and the convergence efficiency of the agent network is improved; based on the dense structure, the first agent can attend to all agents other than itself, so that the agent network can converge effectively during training. With the parallel dense and sparse structures, the agent network converges effectively and its convergence efficiency is improved, thereby improving the efficiency of multi-agent reinforcement learning.

Description

Multi-agent reinforcement learning method and related device
Technical Field
The application relates to the technical field of artificial intelligence (Artificial Intelligence, AI), and in particular to a multi-agent reinforcement learning method and a related device.
Background
Artificial intelligence is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Reinforcement learning is one of the directions of artificial intelligence that has attracted much attention in recent years. With its rapid development, reinforcement learning has been widely applied in fields such as robot control, game playing and autonomous driving. At present, most reinforcement learning algorithms are applied to single-agent scenarios, in which the environment of the agent is stable and unchanging. In a multi-agent scenario, however, the environment is dynamic and complex, and the action of each agent influences the action selection of the other agents, so multi-agent reinforcement learning suffers from problems such as an unstable environment and dimensional explosion. Because each agent needs to consider the states of the other agents before performing an action during multi-agent reinforcement learning, the agent network is difficult to converge during training and the training efficiency is low.
Therefore, a method for improving the efficiency of multi-agent reinforcement learning is needed.
Disclosure of Invention
The application provides a multi-agent reinforcement learning method which can improve the convergence efficiency of an agent network, thereby improving the multi-agent reinforcement learning efficiency.
The first aspect of the present application provides a multi-agent reinforcement learning method applied to a learning environment including a plurality of agents. Specifically, the multi-agent reinforcement learning method includes: firstly, an observed value of a first agent is obtained, wherein the first agent is an agent in the plurality of agents, and the observed value comprises a state value of the learning environment observed by the first agent. For example, in the case where the first agent is an intelligent vehicle, the observed value of the first agent may be the motion state of other intelligent vehicles in the road environment observed by the first agent.
The observed value is then input into an agent network to obtain a first value function and a second value function. The agent network comprises a dense structure and a sparse structure arranged in parallel: the dense structure is used for extracting a first state feature related to all agents in the learning environment based on the observed value and deriving the first value function based on the first state feature, and the sparse structure is used for extracting a second state feature related to a portion of the agents in the learning environment based on the observed value and deriving the second value function based on the second state feature.
Finally, a first loss function corresponding to the first value function and a second loss function corresponding to the second value function are determined by a temporal difference method, and the agent network is trained based on the first loss function and the second loss function.
In this scheme, a dense structure and a sparse structure are introduced in parallel into the agent network. Based on the sparse structure, the first agent attends only to part of the agents, so that information irrelevant to the first agent is ignored and the convergence efficiency of the agent network is improved; based on the dense structure, the first agent can attend to all agents other than itself, so that the agent network can converge effectively during training. With the parallel dense and sparse structures, the agent network converges effectively and its convergence efficiency is improved, thereby improving the efficiency of multi-agent reinforcement learning.
In a possible implementation, the dense structure is used for extracting the first state feature based on a dense attention mechanism and the observed value, and the sparse structure is used for extracting the second state feature based on a sparse attention mechanism and the observed value.
In one possible implementation, the first value function is used to indicate a value of the first agent in selecting a different action under the first state characteristic, and the second value function is used to indicate a value of the first agent in selecting a different action under the second state characteristic.
According to this scheme, based on the first value function and the second value function, the value obtained when each action is executed under the first state feature and the second state feature, respectively, can be predicted, which guides the agent network in subsequently selecting the actions to execute.
In one possible implementation, determining, by a temporal difference method, a first loss function corresponding to the first value function and a second loss function corresponding to the second value function includes: selecting a first action from a plurality of actions indicated by the first value function according to the first value function and a preset policy, and determining a first value corresponding to the first action; acquiring a second value corresponding to the state after the first action is executed, and determining the first loss function according to the first value and the second value; selecting a second action from a plurality of actions indicated by the second value function according to the second value function and the preset policy, and determining a third value corresponding to the second action; and acquiring a fourth value corresponding to the state after the second action is executed, and determining the second loss function according to the third value and the fourth value.
In this embodiment, the loss functions for training the agent network are constructed based on the temporal difference method, so that the convergence of the agent network can be evaluated well and the training of the agent network can be carried out effectively.
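As a rough sketch of how such a temporal difference loss could be computed (the discount factor, the use of a target network and the names `q_net` / `target_q_net` are illustrative assumptions, not details taken from the application):

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, target_q_net, obs, action, reward, next_obs, gamma=0.99):
    """One-step temporal difference loss for a single agent (illustrative sketch)."""
    # Value of the action actually selected under the current state feature
    q_selected = q_net(obs).gather(-1, action.unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():
        # Value corresponding to the state reached after executing the action (TD target)
        td_target = reward + gamma * target_q_net(next_obs).max(dim=-1).values
    # Squared TD error as the loss for this value function
    return F.mse_loss(q_selected, td_target)
```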
In one possible implementation, the preset policy includes a greedy policy or an ε-greedy policy, where the greedy policy is used to select the action with the highest value, and the ε-greedy policy is used to select the action with the highest value with a first probability and to select an action other than the action with the highest value with a second probability.
In this scheme, the action corresponding to the value function is selected based on the ε-greedy policy, which balances exploitation and exploration during reinforcement learning, so that the agent network can converge quickly.
In one possible implementation, the first probability is positively correlated with the number of training iterations of the agent network, and the second probability is negatively correlated with the number of training iterations of the agent network.
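A minimal sketch of such an ε-greedy selection, in which the exploitation probability grows and the exploration probability shrinks with the number of training iterations (the decay schedule and parameter names are assumptions made for illustration):

```python
import random

def epsilon_greedy(q_values, train_step, eps_start=1.0, eps_end=0.05, decay_steps=50_000):
    """Select an action index from a list of per-action values."""
    # Exploration probability (the "second probability") decays with training;
    # the probability of picking the highest-value action (the "first probability") grows.
    eps = max(eps_end, eps_start - (eps_start - eps_end) * train_step / decay_steps)
    if random.random() < eps:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```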
In one possible implementation, the agent network further includes a recurrent neural network for encoding the observed value and the historical actions performed by the agent to obtain a coding feature; the first value function is derived based on the first state feature and the coding feature, and the second value function is derived based on the second state feature and the coding feature.
In this scheme, a recurrent neural network is introduced into the agent network, so that the agent network can memorize historical state and action information, which guides the agent network to output the value functions more accurately.
In one possible implementation, training the agent network based on the first and second loss functions includes: acquiring dense loss functions corresponding to the other agents among the plurality of agents, and mixing the first loss function and the dense loss functions through a first mixing network to obtain a dense mixed loss function, where the dense loss functions are obtained based on a dense attention mechanism; acquiring sparse loss functions corresponding to the other agents among the plurality of agents, and mixing the second loss function and the sparse loss functions through a second mixing network to obtain a sparse mixed loss function, where the sparse loss functions are obtained based on a sparse attention mechanism; and training the agent network based on the dense mixed loss function and the sparse mixed loss function.
In this scheme, the agent networks are trained with loss functions obtained by mixing the loss functions of multiple agent networks, which ensures that the training objective of each agent network is to make the overall decision of the plurality of agents optimal.
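The application describes the mixing only at this level of detail; a rough sketch, under the assumption that each mixing network is a small learned module taking the stacked per-agent losses as input, might look like this (all names are hypothetical):

```python
import torch

def mixed_training_step(dense_losses, sparse_losses, dense_mixer, sparse_mixer, optimizer):
    """dense_losses / sparse_losses: lists of per-agent loss tensors;
    dense_mixer / sparse_mixer: hypothetical mixing networks."""
    dense_mixed = dense_mixer(torch.stack(dense_losses))     # dense mixed loss function
    sparse_mixed = sparse_mixer(torch.stack(sparse_losses))  # sparse mixed loss function
    total = dense_mixed + sparse_mixed                        # joint training objective
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.detach()
```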
In one possible implementation, after the agent network is trained, the dense structure in the agent network is used to perform the task of inferring agent actions, and the sparse structure in the agent network is not used to perform that task.
In one possible implementation, the learning environment includes an automatic driving environment, a robot collaborative operation environment, or a multi-role interactive game environment.
A second aspect of the present application provides a multi-agent reinforcement learning device for use in a learning environment comprising a plurality of agents, the device comprising: an acquisition unit, configured to acquire an observed value of a first agent, where the first agent is one of the plurality of agents and the observed value includes a state value of the learning environment observed by the first agent; and a processing unit, configured to input the observed value into an agent network to obtain a first value function and a second value function, where the agent network includes a dense structure and a sparse structure arranged in parallel, the dense structure is configured to extract a first state feature related to all agents in the learning environment based on the observed value and obtain the first value function based on the first state feature, and the sparse structure is configured to extract a second state feature related to some of the agents in the learning environment based on the observed value and obtain the second value function based on the second state feature. The processing unit is further configured to determine, by a temporal difference method, a first loss function corresponding to the first value function and a second loss function corresponding to the second value function, and to train the agent network based on the first loss function and the second loss function.
In a possible implementation, the dense structure is used for extracting the first state feature based on a dense attention mechanism and the observed value, and the sparse structure is used for extracting the second state feature based on a sparse attention mechanism and the observed value.
In one possible implementation, the first value function is used to indicate a value of the first agent in selecting a different action under the first state characteristic, and the second value function is used to indicate a value of the first agent in selecting a different action under the second state characteristic.
In a possible implementation, the processing unit is specifically configured to: select a first action from a plurality of actions indicated by the first value function according to the first value function and a preset policy, and determine a first value corresponding to the first action; acquire a second value corresponding to the state after the first action is executed, and determine the first loss function according to the first value and the second value; select a second action from a plurality of actions indicated by the second value function according to the second value function and the preset policy, and determine a third value corresponding to the second action; and acquire a fourth value corresponding to the state after the second action is executed, and determine the second loss function according to the third value and the fourth value.
In one possible implementation, the preset policy includes a greedy policy or an ε-greedy policy, where the greedy policy is used to select the action with the highest value, and the ε-greedy policy is used to select the action with the highest value with a first probability and to select an action other than the action with the highest value with a second probability.
In one possible implementation, the first probability is positively correlated with the number of training iterations of the agent network, and the second probability is negatively correlated with the number of training iterations of the agent network.
In one possible implementation, the agent network further includes a recurrent neural network for encoding the observed value and the historical actions performed by the agent to obtain a coding feature; the first value function is derived based on the first state feature and the coding feature, and the second value function is derived based on the second state feature and the coding feature.
In a possible implementation, the processing unit is specifically configured to: acquire dense loss functions corresponding to the other agents among the plurality of agents, and mix the first loss function and the dense loss functions through a first mixing network to obtain a dense mixed loss function, where the dense loss functions are obtained based on a dense attention mechanism; acquire sparse loss functions corresponding to the other agents among the plurality of agents, and mix the second loss function and the sparse loss functions through a second mixing network to obtain a sparse mixed loss function, where the sparse loss functions are obtained based on a sparse attention mechanism; and train the agent network based on the dense mixed loss function and the sparse mixed loss function.
In one possible implementation, after the agent network is trained, the dense structure in the agent network is used to perform the task of inferring agent actions, and the sparse structure in the agent network is not used to perform that task.
In one possible implementation, the learning environment includes an automatic driving environment, a robot collaborative operation environment, or a multi-role interactive game environment.
A third aspect of the present application provides an electronic device, which may comprise a processor coupled to a memory, the memory storing program instructions which, when executed by the processor, implement the method of the first aspect or any implementation of the first aspect. For the steps in each possible implementation manner of the first aspect executed by the processor, reference may be specifically made to the first aspect, which is not described herein.
A fourth aspect of the application provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the method of the first aspect or any implementation of the first aspect.
A fifth aspect of the application provides circuitry comprising processing circuitry configured to perform the method of the first aspect or any implementation of the first aspect.
A sixth aspect of the application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect or any implementation of the first aspect.
A seventh aspect of the present application provides a chip system comprising a processor for supporting a server or a threshold value acquisition device in implementing the functions referred to in the first aspect or any implementation of the first aspect, for example, sending or processing the data and/or information involved in the method. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the server or the communication device. The chip system may consist of a chip, or may include a chip and other discrete devices.
The advantages of the second to seventh aspects may be referred to the description of the first aspect, and are not described here again.
Drawings
FIG. 1 is a schematic diagram of an autopilot scenario according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a multi-agent reinforcement learning method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a dense structure related to a dense attention mechanism according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a sparse structure related to a sparse attention mechanism according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an association between a sparse matrix and an input sequence according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another association between a sparse matrix and an input sequence according to an embodiment of the present application;
FIG. 7 is a schematic diagram of another association between a sparse matrix and an input sequence according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another association between a sparse matrix and an input sequence according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a multi-agent reinforcement learning architecture based on sparse attention assistance according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an agent network according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a multi-agent reinforcement learning device according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a chip according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings. As those of ordinary skill in the art will appreciate, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the application are likewise applicable to similar technical problems.
The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It is to be understood that terms used in this way are interchangeable under appropriate circumstances, and are merely a way of distinguishing objects with the same attributes when describing the embodiments of the application. Furthermore, the terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article or apparatus.
For ease of understanding, technical terms related to embodiments of the present application are described below.
(1) Reinforcement learning
Reinforcement learning is one of the fields of machine learning; it is mainly concerned with how agents take different actions in an environment so as to maximize the cumulative reward. In general, reinforcement learning consists mainly of agents (agent), an environment (environment), states (state), actions (action) and rewards (reward). The agent is the subject of reinforcement learning and acts as the learner or decision maker. The environment is everything other than the agent and mainly consists of the set of states. A state is data representing the environment, and the state set is all possible states in the environment. Actions are the behaviors the agent can perform, and the action set is all actions the agent can perform. A reward is the positive or negative feedback signal the agent obtains after executing an action, and the reward set is all feedback information available to the agent. Reinforcement learning is essentially learning a mapping from environment states to actions.
After the agent performs a certain action, the environment switches to a new state, and for this new state the environment gives a reward signal (a positive or negative reward). The agent then executes a new action according to a certain policy, based on the new state and the reward fed back by the environment. This process is the way the agent interacts with the environment through states, actions and rewards, and its final objective is to maximize the cumulative reward obtained by the agent.
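This interaction loop can be sketched roughly as follows, where `env` and `agent` are hypothetical objects standing in for the environment and the agent's policy:

```python
def run_episode(env, agent):
    """Generic agent-environment interaction loop (illustrative only)."""
    state = env.reset()                                   # initial environment state
    total_reward, done = 0.0, False
    while not done:
        action = agent.select_action(state)               # act according to the current policy
        next_state, reward, done = env.step(action)       # environment returns new state and reward
        agent.update(state, action, reward, next_state)   # learn from the feedback
        state = next_state
        total_reward += reward
    return total_reward                                   # objective: maximize the cumulative reward
```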
Through reinforcement learning, the agent can learn what action to take in what state so as to obtain the maximum reward. Because the way the agent interacts with the environment is similar to the way humans interact with the environment, reinforcement learning can be regarded as a general learning framework for solving the problem of general artificial intelligence; it is therefore also called a machine learning method for general artificial intelligence.
Furthermore, there are two very important concepts in reinforcement learning tasks: exploitation (exploit) and exploration (explore), sometimes also called utilization and trial, respectively. In reinforcement learning, exploitation refers to the agent selecting the optimal action, based on the principle of maximizing action value, from the known distribution of all (state, action) pairs. In other words, when the agent selects among known actions, it is said to exploit (or utilize). Exploration refers to the agent selecting other, unknown actions outside the known (state, action) distribution.
(2) Multi-agent reinforcement learning
Multi-agent reinforcement learning means that multiple agents, in a cooperative, competitive or mixed setting, learn strategies through interaction with the environment so as to maximize the average return of the agents or reach a Nash equilibrium.
(3) Attention mechanism (Attention Mechanism)
The attention mechanism is a special structure embedded in a machine learning model for automatically learning and calculating the contribution of the input data to the output data. A model applying an attention mechanism can give different weights to the parts of an input sequence, thereby extracting the more important feature information in the input sequence and making the final output of the model more accurate.
In deep learning, the attention mechanism may be implemented by a weight vector describing importance: when an element is predicted or inferred, the association between the element and other elements is determined by a weight vector. For example, for a certain pixel in an image or a certain word in a sentence, the correlation between the target element and other elements may be quantitatively estimated using the attention vector, and the weighted sum of the attention vectors is taken as an approximation of the target value.
The attention mechanism in deep learning imitates the attention mechanism of the human brain. For example, when a human looks at a picture, the eye can see the whole picture, but when the human observes deeply and carefully, the eye focuses only on a part of the picture and the brain mainly attends to that small region. That is, when a human carefully observes an image, the brain's attention to the different parts of the image is not balanced but is distinguished by certain weights, which is the core idea of the attention mechanism.
In brief, the human visual processing system tends to focus selectively on certain parts of an image while ignoring other irrelevant information, which facilitates perception. Similarly, in deep learning attention mechanisms, for some problems involving language, speech or vision, certain parts of the input may be more relevant than others. Through the attention mechanism, the model can therefore process different parts of the input data differently, so that it dynamically attends only to the data relevant to the task.
(4) Dense attention mechanism
Dense attention mechanisms are one type of attention mechanism. A model applying dense attention mechanisms weights each part of the content of the input sequence with a value greater than 0, so that the output sequence has a correlation with virtually all the content of the input sequence.
(5) Sparse attention mechanism
The sparse attention mechanism is also one of the attention mechanisms. A model that applies a sparse attention mechanism weights a portion of the content in the input sequence with a value of 0 and another portion of the content in the input sequence with a value greater than 0 so that the output sequence has a substantially associative relationship with only a portion of the content of the input sequence.
(6) Matrix multiplication operation (MatMul)
Matrix multiplication is a binary operation that yields a third matrix, the product of the first two, also commonly referred to as the matrix product, from two matrices. The matrix may be used to represent a linear mapping and the matrix product may be used to represent a composite of the linear mapping.
Specifically, for a matrix A and a matrix B, matrix multiplication of A and B yields a matrix C. The element in the mth row and nth column of matrix C is equal to the sum of the products of the elements of the mth row of matrix A and the corresponding elements of the nth column of matrix B. Let A be an m×p matrix and B a p×n matrix; then the m×n matrix C is the product of A and B, denoted C = A×B. Matrix A may be denoted [m, p], matrix B [p, n] and matrix C [m, n]. The element in the ith row and jth column of matrix C may be represented as Cij = Ai1·B1j + Ai2·B2j + … + Aip·Bpj.
Specifically, one possible example of matrix A, matrix B and matrix C is as follows:
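The example matrices appear as figures in the original application and are not reproduced here; the following instance, chosen only for illustration, shows the rule for a 2×3 matrix A and a 3×2 matrix B:

```latex
A=\begin{pmatrix}1 & 2 & 3\\ 4 & 5 & 6\end{pmatrix},\quad
B=\begin{pmatrix}7 & 8\\ 9 & 10\\ 11 & 12\end{pmatrix},\quad
C=A\times B=\begin{pmatrix}58 & 64\\ 139 & 154\end{pmatrix}
```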
in general, matrix multiplication is only meaningful if the number of columns (column) of the first matrix and the number of rows (row) of the second matrix are the same. An mxn matrix is an array of mxn rows and columns.
(7) Softmax function
The Softmax function, also called the normalized activation function, is a generalization of the logistic function. The Softmax function can transform a K-dimensional vector Z containing arbitrary real numbers into another K-dimensional vector σ(Z), such that each element of the transformed vector σ(Z) lies in the range (0, 1) and the sum of all elements is 1.
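In standard notation, this transform can be written as:

```latex
\sigma(Z)_i = \frac{e^{Z_i}}{\sum_{j=1}^{K} e^{Z_j}}, \qquad i = 1, \dots, K.
```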
(8) Sparsemax function
The Sparsemax function is also commonly referred to as a sparse probability activation function. In contrast to the Softmax function, the Sparsemax function also transforms a K-dimensional vector Z containing arbitrary real numbers into another K-dimensional vector σ (Z), but the values of some of the elements in the transformed vector σ (Z) are 0.
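Following the commonly published definition of Sparsemax (an assumption here, since the application does not spell out the formula), it can be written as the Euclidean projection of Z onto the probability simplex:

```latex
\operatorname{sparsemax}(Z) = \underset{p \in \Delta^{K-1}}{\arg\min}\; \lVert p - Z \rVert_2^2,
\qquad \Delta^{K-1} = \Big\{ p \in \mathbb{R}^K \;\Big|\; p \ge 0,\ \sum_{i=1}^{K} p_i = 1 \Big\},
```

which has the closed form sparsemax(Z)_i = max(Z_i − τ(Z), 0) for a data-dependent threshold τ(Z), so elements sufficiently far below the threshold receive a weight of exactly 0.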
(9) Recurrent neural network
The recurrent neural network originates from the need to characterize the relationship between the current output of a sequence and earlier information. In terms of network structure, the recurrent neural network memorizes earlier information and uses it to influence the output of later nodes. That is, the nodes between the hidden layers of the recurrent neural network are connected, and the input of a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step.
(10) Compressed sensing (Compressed sensing)
Compressed sensing, also known as compressive sampling (compressive sampling) or sparse sampling (sparse sampling), is a technique for finding sparse solutions of underdetermined linear systems.
The multi-agent reinforcement learning method provided by the embodiments of the application can be applied to electronic devices, in particular to electronic devices that need to execute multi-agent tasks. The electronic device may be, for example, a server, a robot, a smart phone (mobile phone), a personal computer (personal computer, PC), a notebook computer, a wireless electronic device in industrial control (industrial control), a wireless electronic device in self driving (self driving), a wireless electronic device in a smart grid (smart grid), a wireless electronic device in a logistics warehouse, a wireless electronic device in transportation safety (transportation safety), a wireless electronic device in a smart city (smart city), or the like.
The apparatus to which the multi-agent reinforcement learning method provided by the embodiment of the present application is applied is described above, and a scenario to which the multi-agent reinforcement learning method provided by the embodiment of the present application is applied will be described below.
The multi-agent reinforcement learning method provided by the embodiment of the application can be applied to training an agent network for executing multi-agent tasks. The multi-agent task includes, for example, an autopilot task, a robot collaborative task, or a multi-role interactive game task.
In the automatic driving task, each intelligent vehicle is regarded as an intelligent agent, and each intelligent vehicle can acquire information of other intelligent vehicles, such as distance between other intelligent vehicles and the vehicle, speed and acceleration of other intelligent vehicles and the like, through a sensor and the like. The information of other intelligent vehicles acquired by each intelligent vehicle can be regarded as the observation value of the intelligent vehicle on the environment state. For the intelligent vehicle, the intelligent vehicle needs to determine the action performed by the intelligent vehicle, such as acceleration, deceleration or lane change, based on the observed value of the environment state, so that the forward feedback obtained by the intelligent vehicle from the environment is as high as possible, that is, the safety factor of the intelligent vehicle is as high as possible or the running time of the intelligent vehicle reaching the end point is as short as possible.
Referring to fig. 1, fig. 1 is a schematic diagram of an autopilot scenario according to an embodiment of the present application. As shown in fig. 1, the automatic driving scene includes an intelligent vehicle 1, an intelligent vehicle 2, an intelligent vehicle 3 and an intelligent vehicle 4, i.e. the above 4 intelligent vehicles form a multi-agent learning environment. Taking the intelligent vehicle 1 as an example, during the process of executing an automatic driving task by the intelligent vehicle 1, the intelligent vehicle 1 acquires information of the front intelligent vehicle 2, the rear right intelligent vehicle 3 and the front left intelligent vehicle 4 as an observation value of the environment state by the intelligent vehicle 1. Then, the intelligent vehicle 1 decides an action to be performed by itself, such as a deceleration running or a right lane change, based on the observed value of the environmental state.
In a robot collaborative task, each robot is considered an agent, and multiple robots may collaborate to accomplish a particular task. For example, in the field of logistics storage, multiple robots cooperate to transport a given item to a given location. Each robot can also acquire information of other robots by means of sensors or the like, such as the distance between the other robots and themselves, the actions currently performed by the other robots, and the running speeds of the other robots. The information of other robots acquired by each robot can be regarded as the observation value of the environment state by the robot. For the robot, the robot needs to determine the actions executed by the robot, such as the movement direction, the rotation angle, and the pause motion, based on the observed value of the environmental state, so that the forward feedback obtained by the robot from the environment is as high as possible, that is, the robot completes the carrying task in as little time as possible on the premise of avoiding collision.
In a multi-role interactive game task, each character unit is regarded as an agent, and multiple character units may cooperate to accomplish a specific combat task. For example, in a large real-time game, a plurality of character units of the same race jointly perform the task of fighting against character units of other races. Each character unit can acquire information about the other character units of the same race, such as their attack targets, attack modes and moving directions. The information acquired by each character unit about the other character units may be regarded as that character unit's observation of the environment state. Each character unit in the game needs to determine the action it executes, such as switching attack targets, changing attack modes or changing its moving route, based on the observed value of the environment state, so that the positive feedback obtained by the character unit from the environment is as high as possible, that is, so that the character unit completes the combat task with as little loss as possible.
The scenario of the multi-agent reinforcement learning method according to the embodiment of the present application is described above, and a specific flow of the multi-agent reinforcement learning method according to the embodiment of the present application will be described below. The multi-agent reinforcement learning method provided by the embodiment of the application is applied to a learning environment including a plurality of agents, wherein the learning environment includes an automatic driving environment, a robot collaborative operation environment or a multi-role interaction game environment, and the description of the learning environment can be referred to specifically and is not repeated here.
Referring to fig. 2, fig. 2 is a schematic flow chart of a multi-agent reinforcement learning method according to an embodiment of the application. As shown in fig. 2, the multi-agent reinforcement learning method specifically includes the following steps 201-204.
Step 201, obtaining an observed value of a first agent, where the first agent is an agent in the plurality of agents, and the observed value includes a state value of the learning environment observed by the first agent.
In the multi-agent reinforcement learning process, the operating device first acquires an observed value of the first agent for the learning environment. The observed value of the first agent for the learning environment may be a state value obtained by the first agent observing other agents in the learning environment. For example, in the case where the first agent is an intelligent vehicle, the observed value of the first agent may be the motion state of other intelligent vehicles in the road environment observed by the first agent.
It will be appreciated that, in different scenarios, the content of the observed value of the first agent may differ; this embodiment does not specifically limit the content of the observed value.
Step 202, inputting the observed value into an agent network to obtain a first value function and a second value function, wherein the agent network comprises a dense structure and a sparse structure, the dense structure is used for extracting first state characteristics related to all agents in the learning environment based on the observed value and obtaining the first value function based on the first state characteristics, and the sparse structure is used for extracting second state characteristics related to part of agents in the learning environment based on the observed value and obtaining the second value function based on the second state characteristics.
In this embodiment, a value function represents the future return that the agent can expect to obtain starting from a certain state. The value function predicts future returns through an expectation, so that the quality of the current state can be assessed without waiting for the future returns to actually occur, and the various possible future returns are summarized by the expectation. The value function therefore makes it convenient to evaluate how good the different actions executed by the agent are.
Specifically, the first value function is used to indicate the value of the first agent selecting different actions under the first state feature, and the second value function is used to indicate the value of the first agent selecting different actions under the second state feature. Therefore, based on the first value function and the second value function, the value obtained when each action is executed under the first state feature and the second state feature, respectively, can be predicted, which guides the agent network in subsequently selecting the actions to execute.
Furthermore, two parallel networks are included in the agent network, namely a dense structure and a sparse structure. The dense structure is used for extracting features of the observed value, and the extracted features of the dense structure do not ignore the existing information in the observed value. That is, in the case where the observed value of the first agent includes the state information of all the agents in the learning environment, the state features of all the agents in the learning environment are also included in the first state features extracted from the dense structure.
In addition, the sparse structure is also used to extract features from the observed value, but the features extracted by the sparse structure may ignore part of the information present in the observed value. That is, in the case where the observed value of the first agent includes the state information of all agents in the learning environment, the second state feature extracted by the sparse structure includes only the state features of a part of the agents in the learning environment.
Illustratively, in some embodiments, the dense structure is configured to perform feature extraction on the observations of the first agent based on a dense attention mechanism, resulting in the first state feature. The first status feature is associated with all agents in the learning environment, and the weights assigned by portions of the first status feature associated with different agents during the feature extraction phase may be different. In short, during feature extraction of the observed value by the dense structure, the dense structure may process portions of the observed value corresponding to different agents with different weights, so that the resulting first state feature is actually information focusing on portions of the observed value.
Illustratively, the sparse structure may be configured to perform feature extraction on the observed value of the first agent based on a sparse attention mechanism to obtain the second state feature, where the second state feature is associated with only a portion of the agents in the learning environment. Specifically, in the process of extracting features from the observed value, the sparse structure processes the content of the observed value corresponding to a certain part of the agents with a weight of 0, so that the resulting second state feature actually ignores the information of that part of the agents and focuses only on the information of the other part of the agents.
In other embodiments, the sparse structure may be implemented by other algorithms or mechanisms. For example, the sparse structure may be implemented based on similarity calculations, compressed sensing, TOP-K algorithms, or regularization mechanisms, thereby enabling acquisition of second state features that include only part of the agent information. In general, the sparse structure is used to extract the features of a part of the agents from the information of all the agents, and the implementation manner of the sparse structure is not particularly limited in this embodiment.
It will be appreciated that in most scenarios, not all agents in the entire learning environment will have an impact on the decision of the first agent. Thus, in most cases, the first agent need only be concerned with other agents that have an impact on its own decisions, and not with all agents in the learning environment. For example, in an autopilot scenario, an intelligent vehicle typically observes state information for all other vehicles within a certain range (e.g., 200 meters) from the host vehicle. However, in practice, two vehicles located in front of and behind the intelligent vehicle and some of the vehicles located in the adjacent lanes of the intelligent vehicle will generally have an impact on the decision of the vehicle, and not all other vehicles observed by the intelligent vehicle will have an impact on the decision of the vehicle. For example, in fig. 1 described above, the vehicles that affect the decision of the intelligent vehicle 1 are the intelligent vehicle 2 and the intelligent vehicle 3, while the intelligent vehicle 4 located on the opposite lane does not actually affect the decision of the intelligent vehicle 1 at all.
That is, the observation value of the first agent is processed based on the sparse attention mechanism, and a part of agents related to the first agent can be screened from all agents, so that training of the agent network is guided based on the state characteristics of the part of agents, and the convergence speed of the agent network is effectively improved.
Step 203, determining, by a temporal difference method, a first loss function corresponding to the first value function and a second loss function corresponding to the second value function, and training the agent network based on the first loss function and the second loss function.
In this embodiment, after the first value function and the second value function are obtained based on the dense structure and the sparse structure, the loss functions corresponding to the two value functions are determined by a temporal difference method, yielding the first loss function and the second loss function respectively. The temporal difference method uses the immediate reward and the estimated value of the next state in place of the return that the current state might obtain at the end of the state sequence, and is therefore a biased estimate of the value of the current state.
In particular, the first loss function is based on the dense attention mechanism, and the second loss function is based on the sparse attention mechanism. Finally, the agent network is trained with the first loss function and the second loss function together. For example, the agent network is trained based on the sum of the first loss function and the second loss function; alternatively, the first loss function and the second loss function are weighted and summed, and the agent network is trained based on the resulting loss function. When weighting and summing the first loss function and the second loss function, the weight of the second loss function can be adjusted according to actual needs: the larger the weight of the second loss function, the more the agent network is required to focus on the states of the more important agents; the smaller the weight, the more evenly the agent network is required to attend to the states of all agents.
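A minimal sketch of the weighted combination described above (the variable names and the default weight are illustrative assumptions):

```python
def combined_loss(dense_loss, sparse_loss, sparse_weight=0.5):
    """Combine the dense-branch and sparse-branch losses for training.

    A larger sparse_weight asks the agent network to focus more on the states of
    the more important agents; a smaller one makes it attend to all agents more evenly.
    """
    return dense_loss + sparse_weight * sparse_loss
```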
For ease of understanding, the following will describe in detail the beneficial effects of the present embodiment method in terms of introducing both dense and sparse structures.
The applicant has found that if a dense attention mechanism is used alone in the agent network, the agent network can automatically select, from all agents in a large-scale multi-agent scenario, the set of agents that have a greater influence on a specific agent. However, since the correlation coefficient vector in the dense attention mechanism is dense, the information of all agents is still considered when the agent network determines the value function, so the time and space cost of training the algorithm is not reduced much, and usually only the final performance of the algorithm is positively affected.
In addition, if a sparse attention mechanism is used alone in the agent network, a set of agents that are more relevant to a specific agent can be automatically screened out from the information of all agents in a large-scale multi-agent scenario, and the information of the agents in this set can be introduced as additional information for training the model of the specific agent, thereby improving the convergence speed and final performance of the model.
However, in some cases, merely introducing a sparse attention mechanism into the agent network is likely to disrupt its training process. Because an agent network that only introduces a sparse attention mechanism cannot actually distinguish which agents are more important in the early stages of training, it is likely to erroneously discard the information of some important agents. The agent network may thus temporarily discard part of the agents' information during training, which amounts to an exploration strategy. In addition, during reinforcement learning, the agent network may choose to perform unknown actions based on the obtained value function so as to explore actions; that is, the agent network itself also implements an exploration strategy associated with actions. When these two exploration strategies are executed simultaneously during training, it is difficult for the agent network to converge effectively, or it tends to converge to a locally optimal solution.
Therefore, in this embodiment, a dense structure and a sparse structure are introduced in parallel into the agent network. Based on the sparse structure, the first agent attends only to part of the agents, so that information irrelevant to the first agent is ignored and the convergence efficiency of the agent network is improved; based on the dense structure, the first agent can attend to all agents other than itself, so that the agent network can converge effectively during training. With the parallel dense and sparse structures, the agent network converges effectively and its convergence efficiency is improved, thereby improving the efficiency of multi-agent reinforcement learning.
Optionally, since the sparse structure in this embodiment is mainly used to accelerate the training of the agent network, after the training of the agent network is completed, the dense structure in the agent network is used to perform the task of inferring agent actions, while the sparse structure is not used to perform that task. That is, the sparse structure in the agent network only works during the training phase. After the training of the agent network is completed, the sparse structure no longer works, or the sparse structure is removed from the agent network and the dense structure is retained.
Optionally, in some embodiments, the agent network further comprises a recurrent neural network (Recurrent Neural Network, RNN) for encoding the observations and historical actions performed by the agent to derive encoded features. Wherein the first value function is derived based on the first state characteristic and the encoding characteristic, and the second value function is derived based on the second state characteristic and the encoding characteristic.
Specifically, the recurrent neural network is a type of neural network that takes sequence data as input, performs recursion along the evolution direction of the sequence, and connects all nodes in a chain. In this embodiment, introducing the recurrent neural network into the agent network helps the agent network memorize historical state and action information, thereby guiding the agent network to output the value functions more accurately.
By way of example, the recurrent neural network may include a long short-term memory network (Long Short-Term Memory, LSTM) or a gated recurrent unit (Gated Recurrent Unit, GRU). This embodiment does not limit the specific structure of the recurrent neural network.
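A rough sketch of such an encoding step, assuming a GRU cell and one-hot encoded previous actions (both of which are illustrative assumptions rather than details from the application):

```python
import torch
import torch.nn as nn

class ObsActionEncoder(nn.Module):
    """Encode the current observation together with the last action into a hidden
    feature that carries memory of past states and actions."""
    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.fc = nn.Linear(obs_dim + num_actions, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, obs, last_action_onehot, hidden):
        x = torch.relu(self.fc(torch.cat([obs, last_action_onehot], dim=-1)))
        return self.gru(x, hidden)   # new hidden state, i.e. the coding feature
```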
The above describes the process of training an agent network by introducing a dense attention mechanism and a sparse attention mechanism in the embodiments of the present application. For ease of understanding, specific implementations of dense and sparse attention mechanisms will be described in detail below.
Referring to fig. 3, fig. 3 is a schematic diagram of a dense structure related to a dense attention mechanism according to an embodiment of the present application.
As shown in fig. 3, the input of the dense structure is the observed value of the first agent, and the observed value can be regarded as an input sequence. In the dense structure, after the dimension of the input sequence is raised, matrix multiplication operations are performed on the matrix of the input sequence with different weights, so as to obtain three different weight matrices, namely matrix Q, matrix K and matrix V. In the dense structure, the dimension-raising and matrix multiplication operations on the input sequence may be performed by a multilayer perceptron (Multilayer Perceptron, MLP). Then, a matrix multiplication of matrix Q and matrix K gives matrix A; after a normalization operation is performed on matrix A, matrix A1 is obtained. Finally, a matrix multiplication of matrix A1 and matrix V gives the final output matrix, which is the first state feature extracted by the dense structure. For the output matrix obtained by the dense structure, the values of all elements are non-zero, so the first state feature contains the state information of all agents.
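A minimal sketch of this dense branch (the scaling of matrix A by the square root of the feature dimension is a common convention assumed here, not something stated in the description):

```python
import torch
import torch.nn as nn

class DenseAttention(nn.Module):
    """Dense attention over the per-agent parts of the observed value."""
    def __init__(self, in_dim, d_model=64):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)   # dimension raising of the input sequence
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

    def forward(self, seq):                        # seq: [num_agents, in_dim]
        x = self.embed(seq)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        a = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5   # matrix A
        a1 = torch.softmax(a, dim=-1)                      # normalization -> matrix A1
        return a1 @ v    # output matrix = first state feature (all weights > 0)
```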
Referring to fig. 4, fig. 4 is a schematic diagram of a sparse structure related to a sparse attention mechanism according to an embodiment of the present application.
As shown in fig. 4, the input of the sparse structure is likewise the observed value of the first agent, which can also be regarded as an input sequence. In the sparse structure, after the dimension of the input sequence is increased, matrix multiplication operations are performed on the matrix of the input sequence with different weights, so as to obtain three different weight matrices, namely matrix Q, matrix K and matrix V. The dimension-increasing operation on the input sequence may likewise be performed by an MLP. Then, a matrix multiplication operation is performed on matrix Q and matrix K to obtain matrix A; after a sparse probability activation (e.g. sparsemax) operation is performed on matrix A, matrix A2 is obtained. Finally, a matrix multiplication operation is performed on matrix A2 and matrix V to obtain the final output matrix, which is the second state feature extracted by the sparse structure. In the output matrix obtained by the sparse structure, the values of some elements are 0, so the second state feature actually contains the state information of only some of the agents.
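A corresponding sketch of the sparse structure is given below, assuming the sparse probability activation is sparsemax (one common choice; the patent does not fix the specific activation). The rows of A2 produced by sparsemax contain exact zeros, which is what makes the second state feature cover only some of the agents.

```python
import numpy as np

def sparsemax(z):
    # project one row of scores onto the probability simplex, yielding exact zeros
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

def sparse_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    A = Q @ K.T / np.sqrt(K.shape[-1])
    A2 = np.apply_along_axis(sparsemax, 1, A)   # matrix A2 with zero entries
    return A2 @ V                               # output matrix = second state feature
```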
Furthermore, in some possible embodiments, the sparse structure may also be configured to perform sparse computation on matrix Q and matrix K, so as to directly obtain matrix A as a sparse matrix, where the elements at target positions in matrix A are not 0 and the elements at positions other than the target positions are 0. Specifically, the sparse computation refers to the position information of the target positions in matrix A (i.e., the positions of the valid elements in the sparse matrix A), acquires the corresponding data in matrix Q and matrix K, and computes only the element values at the target positions in matrix A. Similarly, in the sparse structure, sparse computation may be performed on matrix A2 and matrix V to obtain the output matrix.
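As an illustration of the sparse computation described above, the following hypothetical helper evaluates only the entries of matrix A whose positions are marked as valid and leaves the other entries at zero; the mask argument plays the role of the target-position information.

```python
import numpy as np

def sparse_scores(Q, K, mask):
    # mask: boolean matrix marking the target (valid) positions of A
    A = np.zeros((Q.shape[0], K.shape[0]))
    rows, cols = np.nonzero(mask)
    for n, m in zip(rows, cols):
        A[n, m] = Q[n] @ K[m]        # only compute the valid entries
    return A
```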
In the process in which the sparse structure extracts the sparse matrix (the second state feature) based on the sparse attention mechanism, the sparse matrix output by the sparse structure during training may have a plurality of different structures. Several structures of the sparse matrix are described below.
Referring to fig. 5, fig. 5 is a schematic diagram of a sparse matrix and an input sequence association method according to an embodiment of the present application. As shown in (a) of fig. 5, only the values of four blocks of elements in the sparse matrix are valid, and each of the four blocks has the same number and arrangement of elements. The values of the other elements in the sparse matrix are invalid.
As shown in fig. 5 (b), a cell in the input sequence has an association only with the cells in the local range where the cell is located. For example, the input sequence shown in fig. 5 (b) includes 16 cells, the 16 cells are divided into 4 local ranges, and each of the 4 local ranges consists of 4 cells at different positions in the input sequence. For any local range, any cell in that local range has an association with all cells in the local range.
Assuming that the sparse matrix shown in fig. 5 (a) is computed from input matrix 1 and input matrix 2, the value of the element in the nth row and mth column of the sparse matrix is computed from the nth row of input matrix 1 and the mth column of input matrix 2. Therefore, when the input sequence is represented by an input matrix, the association between the cells in the input sequence shown in fig. 5 (b) can be represented by the sparse matrix shown in fig. 5 (a). In the sparse matrix shown in fig. 5 (a), the elements of each row represent the association between a certain cell and all cells within the local range of that cell.
Referring to fig. 6, fig. 6 is a schematic diagram of another sparse matrix and sequence association method according to an embodiment of the present application. As shown in fig. 6 (b), a cell in the input sequence has an association only with the cells that are within the local range of the cell and located before the cell. In this case, the association between the cells in the input sequence shown in fig. 6 (b) can be represented by the sparse matrix shown in fig. 6 (a). In the sparse matrix shown in fig. 6 (a), the elements of each row represent the association between a certain cell and the cells located before it within its local range.
Referring to fig. 7, fig. 7 is a schematic diagram of another sparse matrix and sequence association method according to an embodiment of the present application. As shown in (b) of fig. 7, a cell in the input sequence has relevance only to the k cells before and after the cell and itself. In this case, the correlation between the cells in the input sequence shown in fig. 7 (b) can be expressed by the sparse matrix shown in fig. 7 (a). In the sparse matrix shown in fig. 7 (a), the elements of each row are used to represent the association between a certain cell and the k cells before and after the cell and itself.
Referring to fig. 8, fig. 8 is a schematic diagram of another sparse matrix and sequence association method according to an embodiment of the present application. As shown in fig. 8 (b), the input sequence is divided into 4 local ranges, and a cell at a specific position in each local range (for example, the last cell in each local range) has an association with all cells in the entire input sequence. In this case, the association between the cells in the input sequence shown in fig. 8 (b) can be represented by the sparse matrix shown in fig. 8 (a). In the sparse matrix shown in fig. 8 (a), the elements of the corresponding columns represent the association between such a cell and all cells in the entire input sequence. The four association patterns of figs. 5-8 are illustrated by the sketch below.
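The four association patterns of figs. 5-8 can be expressed as boolean masks over the positions of the valid elements. The sketch below is illustrative only; the block size, the window size k, and the choice of the last cell of each local range as the globally associated cell are assumptions taken from the examples in the text.

```python
import numpy as np

def block_local_mask(n, block):        # fig. 5: cells attend within their local range
    m = np.zeros((n, n), dtype=bool)
    for s in range(0, n, block):
        m[s:s + block, s:s + block] = True
    return m

def block_causal_mask(n, block):       # fig. 6: only preceding cells within the local range
    return block_local_mask(n, block) & np.tril(np.ones((n, n), dtype=bool))

def window_mask(n, k):                 # fig. 7: the k cells before and after, plus itself
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= k

def global_unit_mask(n, block):        # fig. 8: the last cell of each local range sees everything
    m = np.zeros((n, n), dtype=bool)
    m[:, block - 1::block] = True
    return m
```

For the 16-cell example with 4 local ranges, block_local_mask(16, 4) reproduces the pattern of fig. 5 (a).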
The process in which the agent network extracts state features from the observed value based on the dense attention mechanism and the sparse attention mechanism so as to output the value functions has been described in detail above. The process of constructing the loss functions from the value functions during training of the agent network is described in detail below.
Optionally, in step 203 of the above method, the running device determines the first loss function corresponding to the first value function by a temporal-difference method, which specifically includes: the running device selects a first action from the plurality of actions indicated by the first value function according to the first value function and a preset strategy, and determines a first value corresponding to the first action; then, the running device acquires a second value corresponding to the state after the first action is executed, and determines the first loss function according to the first value and the second value.
In brief, since the first value function indicates the values of performing different actions under the first state feature, the running device may select, based on the preset strategy, a first action from the plurality of actions corresponding to different values, the value corresponding to the first action being the first value. Then, the running device determines the second value corresponding to the state of the learning environment after the first agent performs the first action. Illustratively, the process in which the running device determines the second value corresponding to the state after the first action is performed may be similar to steps 201-202 described above; that is, the running device may process the state after the first action is performed based on a target network corresponding to the dense structure, thereby obtaining the second value. The target network corresponding to the dense structure may have the same network structure as the dense structure, and the target network may be trained together with the dense structure.
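The patent does not specify how the target network is kept in step with the dense structure; one common choice, shown here purely as an assumption, is a soft (Polyak) update of the target parameters.

```python
def soft_update(online_params, target_params, tau=0.01):
    # blend each target parameter toward its online counterpart;
    # a periodic hard copy (tau = 1 every N steps) is another common option
    return [tau * p + (1 - tau) * tp for p, tp in zip(online_params, target_params)]
```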
The temporal-difference method uses the immediate reward plus the estimated value of the next state in place of the return that the current state may obtain at the end of the state sequence. Therefore, in this embodiment, after the first value corresponding to the first state feature (i.e. the value in the current state) is obtained, the estimated value of the next state (i.e. the second value) is obtained by executing the first action selected in the current state, so that the return the current state may obtain at the end of the state sequence is determined based on the first value and the second value. Thus, when the first loss function is smaller than a certain threshold, the difference between the current estimate and the target return is considered to be small, and the network is close to convergence.
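In code, the temporal-difference construction of the first loss can be sketched as below; the discount factor gamma and the squared-error form are assumptions, since the text only states that the loss is built from the first value and the second value.

```python
def td_loss(first_value, reward, second_value, gamma=0.99):
    # target = immediate reward + discounted estimate of the next state's value
    target = reward + gamma * second_value
    return (first_value - target) ** 2   # small loss -> estimate close to target
```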
Similarly, the running device determines the second loss function corresponding to the second value function by the temporal-difference method, which specifically includes: the running device selects a second action from the plurality of actions indicated by the second value function according to the second value function and the preset strategy, and determines a third value corresponding to the second action; then, the running device acquires a fourth value corresponding to the state after the second action is executed, and determines the second loss function according to the third value and the fourth value.
In this embodiment, the loss functions for training the agent network are constructed based on the temporal-difference method, so that the convergence of the agent network can be evaluated well, and training of the agent network can thus be carried out effectively.
In some possible embodiments, the preset strategy for selecting an action indicated by a value function includes a greedy strategy or an ε-greedy strategy. The greedy strategy is used for selecting the action with the highest value, i.e. selecting the first action with the highest value from the first value function, or selecting the second action with the highest value from the second value function. In short, the greedy strategy is a deterministic strategy, i.e. the action that maximizes the value function is always chosen.
In addition, the ε-greedy strategy is used for selecting the action with the highest value based on a first probability and selecting actions other than the action with the highest value based on a second probability. Compared with the greedy strategy, the ε-greedy strategy is a strategy with uncertainty: the action that maximizes the value function is chosen with a certain probability, and other actions are also chosen with a certain probability.
Illustratively, the ε-greedy strategy can be represented by equation 1 below:

$$\pi(a\mid s)=\begin{cases}1-\varepsilon+\dfrac{\varepsilon}{|A(s)|}, & a=\arg\max_{a'}Q(s,a')\\[2mm]\dfrac{\varepsilon}{|A(s)|}, & \text{otherwise}\end{cases}\tag{1}$$

Wherein π(a|s) represents the probability of choosing action a in state s; ε represents the greedy rate; |A(s)| represents the number of actions that can be selected; and $\arg\max_{a'}Q(s,a')$ represents the action with the highest value in the action set.
In this embodiment, the selection of actions corresponding to the value functions is implemented based on the ε-greedy strategy, so that a balance between exploitation and exploration can be achieved in the reinforcement learning process, and rapid convergence of the agent network can thus be achieved.
Optionally, during training of the agent network, the first probability for selecting the action with the highest value has a positive correlation with the number of training iterations of the agent network, and the second probability for selecting other actions has a negative correlation with the number of training iterations. That is, the more the agent network has been trained, the larger the first probability and the more likely the action with the highest value is selected; the more the agent network has been trained, the smaller the second probability and the less likely actions other than the action with the highest value are selected. In short, in the early stage of training the agent network needs to spend more effort on exploration; as training progresses, the agent network gradually converges, so it focuses on exploitation and gradually reduces the effort spent on exploration.
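A sketch of ε-greedy action selection with the exploration probability decaying over training iterations is given below; the linear decay schedule and its constants are illustrative assumptions, not values specified by the patent.

```python
import numpy as np

def epsilon_greedy(q_values, train_step, eps_start=1.0, eps_end=0.05, decay=1e-4):
    eps = max(eps_end, eps_start - decay * train_step)   # exploration shrinks with training
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))          # explore
    return int(np.argmax(q_values))                      # exploit: action with the highest value
```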
It will be appreciated that the above describes the process of training an agent network based on the first loss function and the second loss function obtained by the agent network of the first agent; that is, the training goal of the agent network of the first agent is to optimize the decisions of the first agent, i.e. the decisions of an individual agent. However, in some scenarios in which multiple agent networks are trained simultaneously, the training goal of the agent networks may be to make the overall decision of the multiple agents optimal. For example, in a multi-robot collaborative logistics warehouse scenario, the goal of the multi-robot collaboration is for the multiple robots to finish moving the goods in the shortest time, rather than for a single robot to finish moving the goods for which it is responsible. Thus, in this case, each of the plurality of agent networks may be trained with a loss function in which the losses of the plurality of agent networks are mixed, thereby ensuring that the training goal of each agent network is to make the overall decision of the multiple agents optimal.
Illustratively, in step 203 described above, the agent network is trained based on the first and second loss functions, specifically comprising the following steps.
First, the running device acquires the dense loss functions corresponding to the other agents among the plurality of agents, and performs mixing processing on the first loss function and these dense loss functions through a first mixing network to obtain a dense mixed loss function. The dense loss functions corresponding to the other agents are also obtained based on a dense attention mechanism; for example, a dense structure based on the dense attention mechanism is also provided in the networks of the other agents. Therefore, with reference to the process in which the agent network of the first agent processes the observed value to obtain the first loss function, the other agents can also process their observed values based on their corresponding agent networks to obtain the dense loss functions.
Optionally, the process of mixing the first loss function and the dense loss functions corresponding to the other agents through the first mixing network may be a weighted summation of the first loss function and each dense loss function, so as to obtain the dense mixed loss function. It can be understood that in this embodiment the mixing processing of the mixing network is illustrated by taking weighted summation as an example; in practical applications, the mixing network may also adopt other mixing processing manners, which are not described here.
Then, the running device acquires the sparse loss functions corresponding to the other agents among the plurality of agents, and performs mixing processing on the second loss function and these sparse loss functions through a second mixing network to obtain a sparse mixed loss function, where the sparse loss functions are obtained based on a sparse attention mechanism. Similarly, the networks of the other agents may also have sparse structures based on the sparse attention mechanism. Therefore, with reference to the process in which the agent network of the first agent processes the observed value to obtain the second loss function, the other agents can also process their observed values based on their corresponding agent networks to obtain the sparse loss functions.
Finally, the running device trains the agent network based on the dense mixed loss function and the sparse mixed loss function. For example, the running device performs a weighted summation of the dense mixed loss function and the sparse mixed loss function to obtain a total loss function, and trains the agent network of the first agent based on the total loss function.
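Taking weighted summation as the example mixing operation mentioned above, the per-agent losses could be combined as in the sketch below; the uniform weights and the value of lam are placeholders, not values fixed by the patent.

```python
import numpy as np

def mix_losses(per_agent_losses, weights=None):
    losses = np.asarray(per_agent_losses, dtype=float)
    w = np.full(losses.shape, 1.0 / losses.size) if weights is None else np.asarray(weights)
    return float(w @ losses)                     # mixed loss over all agents

def total_loss(dense_losses, sparse_losses, lam=0.5):
    # dense mixed loss + lambda * sparse mixed loss (see the formula later in the text)
    return mix_losses(dense_losses) + lam * mix_losses(sparse_losses)
```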
The multi-agent reinforcement learning method provided by the embodiment of the application is introduced. For ease of understanding, the following detailed description of the specific implementation of the multi-agent reinforcement learning method will be provided in connection with specific examples.
Referring to fig. 9, fig. 9 is a schematic diagram of a multi-agent reinforcement learning architecture based on sparse attention assistance according to an embodiment of the present application. As shown in fig. 9, the multi-agent reinforcement learning architecture includes a plurality of agent networks, namely agent network 1, agent network 2, …, agent network N, as well as a first mixing network and a second mixing network. Each agent network corresponds to one agent, and the input of each agent network is the observed value of that agent. Each agent network includes a dense structure based on the dense attention mechanism and a sparse structure based on the sparse attention mechanism. Moreover, the dense structures in the agent networks may have the same structure with different parameters, or may have different structures. Similarly, the sparse structures in the agent networks may have the same structure with different parameters, or may have different structures.
In the reinforcement learning process, each agent network processes its input observed value based on the dense structure and the sparse structure respectively, to obtain a dense loss function and a sparse loss function. Then, the dense loss function output by each agent network is input into the first mixing network, and the first mixing network performs mixing processing to obtain the dense mixed loss function; the sparse loss function output by each agent network is input into the second mixing network, and the second mixing network performs mixing processing to obtain the sparse mixed loss function. Finally, each agent network is trained based on the resulting dense mixed loss function and sparse mixed loss function until convergence.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an agent network according to an embodiment of the present application. As shown in fig. 10, the structure of the agent network is described by taking agent network i (i.e., Agent i Network) in the architecture shown in fig. 9 as an example. Agent network i is the network corresponding to agent i, and the input of agent network i is the observed value O_i of agent i.
First, the input O_i of agent network i is preprocessed by an MLP to obtain preprocessed features. Then, the preprocessed features are respectively input into the dense structure based on the dense attention mechanism and the sparse structure based on the sparse attention mechanism, to obtain the feature Y_dense output by the dense structure (i.e. the first state feature described above) and the feature Y_sparse output by the sparse structure (i.e. the second state feature described above). In addition, the preprocessed features are also input into the GRU and further encoded by the GRU to obtain the encoding feature output by the GRU. Specifically, the processing of the GRU can be expressed as $h_i^t = \mathrm{GRU}(z_i^t, h_i^{t-1})$, where $h_i^t$ represents the output of the GRU, $h_i^{t-1}$ represents the hidden-layer information, and $z_i^t$ is the observation and historical-action feature information of the agent.
Then, an MLP processes the feature Y_dense output by the dense structure together with the encoding feature output by the GRU to obtain the value function corresponding to the dense structure (i.e. the first value function described above); and an MLP processes the feature Y_sparse output by the sparse structure together with the encoding feature output by the GRU to obtain the value function corresponding to the sparse structure (i.e. the second value function described above). Next, an action is selected from the value function corresponding to the dense structure based on the ε-greedy strategy, obtaining the action corresponding to the dense structure (i.e. the first action described above) and the value of that action (hereinafter referred to as the value corresponding to the dense structure); and an action is selected from the value function corresponding to the sparse structure based on the ε-greedy strategy, obtaining the action corresponding to the sparse structure (i.e. the second action described above) and the value of that action (hereinafter referred to as the value corresponding to the sparse structure).
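Putting the pieces of fig. 10 together, a simplified PyTorch sketch of agent network i might look as follows. nn.MultiheadAttention (softmax-based) is used here as a stand-in for both branches purely to keep the sketch short; the real sparse branch would replace softmax with the sparse probability activation, and all dimensions, pooling choices and head structures are assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn

class AgentNetSketch(nn.Module):
    def __init__(self, obs_dim, hidden, n_actions):
        super().__init__()
        self.pre = nn.Linear(obs_dim, hidden)                                   # MLP preprocessing
        self.dense_attn = nn.MultiheadAttention(hidden, 1, batch_first=True)    # dense structure stand-in
        self.sparse_attn = nn.MultiheadAttention(hidden, 1, batch_first=True)   # sparse structure stand-in
        self.gru = nn.GRUCell(hidden, hidden)                                   # encodes history
        self.q_dense = nn.Linear(2 * hidden, n_actions)                         # first value function head
        self.q_sparse = nn.Linear(2 * hidden, n_actions)                        # second value function head

    def forward(self, obs_seq, h_prev):
        z = torch.relu(self.pre(obs_seq))                 # (batch, seq_len, hidden)
        y_dense, _ = self.dense_attn(z, z, z)             # feature Y_dense
        y_sparse, _ = self.sparse_attn(z, z, z)           # feature Y_sparse (would use sparsemax)
        h = self.gru(z.mean(dim=1), h_prev)               # GRU encoding feature
        q1 = self.q_dense(torch.cat([y_dense.mean(dim=1), h], dim=-1))
        q2 = self.q_sparse(torch.cat([y_sparse.mean(dim=1), h], dim=-1))
        return q1, q2, h                                  # two value functions + new hidden state
```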
Finally, the loss function for the value corresponding to the dense structure (i.e. the first loss function described above) and the loss function for the value corresponding to the sparse structure (i.e. the second loss function described above) are determined based on the temporal-difference method. The two loss functions obtained by agent i are respectively input into the first mixing network and the second mixing network, finally yielding the dense mixed loss function output by the first mixing network and the sparse mixed loss function output by the second mixing network. The total loss function for training agent network i is obtained by a weighted summation of the dense mixed loss function and the sparse mixed loss function.
Illustratively, the process of obtaining the total loss function can be represented by the following formula:

$$L_{total} = L_{dense} + \lambda\, L_{sparse}$$

where $L_{total}$ represents the total loss function; $L_{dense}$ represents the dense mixed loss function; λ represents a regularization parameter; and $L_{sparse}$ represents the sparse mixed loss function. A larger λ means the agent network pays more attention to some important states; a smaller λ means the agent network allows a more uniform distribution of attention.
Having described the multi-agent reinforcement learning method provided by the embodiment of the present application, an apparatus for performing multi-agent reinforcement learning will be described below.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a multi-agent reinforcement learning device according to an embodiment of the present application. As shown in fig. 11, an embodiment of the present application provides a multi-agent reinforcement learning device, which is applied to a learning environment including a plurality of agents, and includes: an obtaining unit 1101, configured to obtain an observed value of a first agent, where the first agent is an agent among the plurality of agents, and the observed value includes a state value of the learning environment observed by the first agent; a processing unit 1102, configured to input the observed value into an agent network to obtain a first value function and a second value function, where the agent network includes a dense structure and a sparse structure, the dense structure is configured to extract a first state feature related to all agents in the learning environment based on the observed value and obtain the first value function based on the first state feature, and the sparse structure is configured to extract a second state feature related to some agents in the learning environment based on the observed value and obtain the second value function based on the second state feature; the processing unit 1102 is further configured to determine a first loss function corresponding to the first value function and a second loss function corresponding to the second value function by a temporal-difference method, and train the agent network based on the first loss function and the second loss function.
In a possible implementation, the dense structure is used for extracting the first state feature based on a dense attention mechanism and the observed value, and the sparse structure is used for extracting the second state feature based on a sparse attention mechanism and the observed value.
In one possible implementation, the first value function is used to indicate a value of the first agent in selecting a different action under the first state characteristic, and the second value function is used to indicate a value of the first agent in selecting a different action under the second state characteristic.
In a possible implementation manner, the processing unit 1102 is specifically configured to: select a first action from a plurality of actions indicated by the first value function according to the first value function and a preset strategy, and determine a first value corresponding to the first action; acquire a second value corresponding to the state after the first action is executed, and determine the first loss function according to the first value and the second value; select a second action from a plurality of actions indicated by the second value function according to the second value function and the preset strategy, and determine a third value corresponding to the second action;
And acquiring a fourth value corresponding to the state after the second action is executed, and determining the second loss function according to the third value and the fourth value.
In one possible implementation, the preset policy includes a greedy policy or an e-greedy policy, where the greedy policy is used to select an action with highest value, and the e-greedy policy is used to select an action with highest value based on a first probability and select other actions than the action with highest value based on a second probability.
In one possible implementation, the first probability has a positive correlation with the number of training of the agent network, and the second probability has a negative correlation with the number of training of the agent network.
In one possible implementation, the agent network further includes a recurrent neural network for encoding the observations and historical actions performed by the agent to obtain encoded features; the first value function is derived based on the first state characteristic and the encoding characteristic, and the second value function is derived based on the second state characteristic and the encoding characteristic.
In a possible implementation manner, the processing unit 1102 is specifically configured to: acquiring dense loss functions corresponding to other intelligent agents in the plurality of intelligent agents, and performing mixed processing on the first loss function and the dense loss functions through a first mixed network to obtain dense mixed loss functions, wherein the dense loss functions are obtained based on a dense attention mechanism; acquiring sparse loss functions corresponding to other intelligent agents in the plurality of intelligent agents, and performing mixed processing on the second loss function and the sparse loss function through a second mixed network to obtain a sparse mixed loss function, wherein the sparse loss function is obtained based on a sparse attention mechanism; training the agent network based on the dense mixed loss function and the sparse mixed loss function.
In one possible implementation, after the agent network is trained, the dense structure in the agent network is used to perform tasks that infer an agent action, and the sparse structure in the agent network is not used to perform the tasks.
In one possible implementation, the learning environment includes an autopilot environment, a robotic collaborative work environment, or a multi-persona interactive gaming environment.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an execution device provided by an embodiment of the present application, and the execution device 1200 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, etc., which is not limited herein. Specifically, the execution apparatus 1200 includes: a receiver 1201, a transmitter 1202, a processor 1203 and a memory 1204 (where the number of processors 1203 in the execution apparatus 1200 may be one or more, one processor is exemplified in fig. 12), wherein the processor 1203 may include an application processor 12031 and a communication processor 12032. In some embodiments of the application, the receiver 1201, the transmitter 1202, the processor 1203, and the memory 1204 may be connected by a bus or other means.
The memory 1204 may include read only memory and random access memory, and provides instructions and data to the processor 1203. A portion of the memory 1204 may also include non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 1204 stores a processor and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.
The processor 1203 controls the operation of the execution apparatus. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The method disclosed in the above embodiment of the present application may be applied to the processor 1203 or implemented by the processor 1203. The processor 1203 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be performed by integrated logic circuits in hardware or by instructions in software in the processor 1203. The processor 1203 may be a general-purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The processor 1203 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1204, and the processor 1203 reads the information in the memory 1204 and performs the steps of the above method in combination with its hardware.
The receiver 1201 may be used to receive input numeric or character information and to generate signal inputs related to performing relevant settings and function control of the device. The transmitter 1202 may be configured to output numeric or character information via a first interface; the transmitter 1202 may also be configured to send instructions to the disk stack via the first interface to modify data in the disk stack; transmitter 1202 may also include a display device such as a display screen.
In an embodiment of the present application, in one case, the processor 1203 is configured to execute the multi-agent reinforcement learning method in the corresponding embodiment of fig. 2.
The electronic device provided by the embodiment of the application may be a chip, and the chip includes: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, a pin or a circuit. The processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the execution device performs the multi-agent reinforcement learning method described in the above embodiments, or so that the chip in the training device performs the multi-agent reinforcement learning method described in the above embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip on the wireless access device side, such as a read-only memory (read-only memory, ROM) or another type of static storage device that can store static information and instructions, or a random access memory (random access memory, RAM).
Specifically, referring to fig. 13, fig. 13 is a schematic structural diagram of a chip provided in an embodiment of the present application, where the chip may be represented as a neural network processor NPU 1300, and the NPU 1300 is mounted as a coprocessor on a main CPU (Host CPU), and the Host CPU distributes tasks. The core part of the NPU is an arithmetic circuit 1303, and the controller 1304 controls the arithmetic circuit 1303 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 1303 includes a plurality of processing units (PEs) inside. In some implementations, the operation circuit 1303 is a two-dimensional systolic array. The arithmetic circuit 1303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1303 is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1302 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes matrix a data from the input memory 1301 and performs matrix operation with matrix B, and the partial result or the final result of the matrix obtained is stored in an accumulator (accumulator) 1308.
Unified memory 1306 is used to store input data and output data. The weight data is directly transferred to the weight memory 1302 through the memory unit access controller (Direct Memory Access Controller, DMAC) 1305. The input data is also carried into the unified memory 1306 through the DMAC.
The bus interface unit (Bus Interface Unit, BIU) 1313 is used for the AXI bus to interact with the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 1309.
The bus interface unit 1313 is used by the instruction fetch memory 1309 to obtain instructions from the external memory, and is also used by the memory unit access controller 1305 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1306 or to transfer weight data to the weight memory 1302 or to transfer input data to the input memory 1301.
The vector calculation unit 1307 includes a plurality of operation processing units, and performs further processing on the output of the operation circuit 1303 when necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation and size comparison. It is mainly used for non-convolution/fully connected layer network calculation in the neural network, such as batch normalization, pixel-level summation and up-sampling of feature planes.
In some implementations, the vector computation unit 1307 can store the vector of processed outputs to the unified memory 1306. For example, the vector calculation unit 1307 may perform a linear function; alternatively, a nonlinear function is applied to the output of the arithmetic circuit 1303, for example, to linearly interpolate the feature plane extracted by the convolution layer, and then, for example, to accumulate a vector of values to generate an activation value. In some implementations, vector computation unit 1307 generates a normalized value, a pixel-level summed value, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 1303, for example for use in subsequent layers in a neural network.
An instruction fetch memory (instruction fetch buffer) 1309 connected to the controller 1304 for storing instructions used by the controller 1304;
the unified memory 1306, the input memory 1301, the weight memory 1302, and the instruction fetch memory 1309 are all on-chip memories. The external memory is proprietary to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
Referring to fig. 14, fig. 14 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application. The present application also provides a computer-readable storage medium. In some embodiments, the method disclosed in fig. 2 above may be implemented as computer program instructions encoded in a machine-readable format on the computer-readable storage medium, or on another non-transitory medium or article of manufacture.
Fig. 14 schematically illustrates a conceptual partial view of an example computer-readable storage medium comprising a computer program for executing a computer process on a computing device, arranged in accordance with at least some embodiments presented herein.
In one embodiment, computer-readable storage medium 1400 is provided using signal bearing medium 1401. The signal bearing medium 1401 may include one or more program instructions 1402 which, when executed by one or more processors, may provide the functionality or portions of the functionality described above with respect to fig. 2. Thus, for example, referring to the embodiment shown in fig. 2, one or more features of steps 201-203 may be carried by one or more instructions associated with signal bearing medium 1401. Further, program instructions 1402 in fig. 14 also describe example instructions.
In some examples, signal bearing medium 1401 may include computer readable medium 1403 such as, but not limited to, a hard disk drive, compact Disk (CD), digital Video Disk (DVD), digital tape, memory, ROM or RAM, and the like.
In some implementations, the signal bearing medium 1401 may include a computer recordable medium 1404 such as, but not limited to, memory, read/write (R/W) CD, R/W DVD, and the like. In some implementations, the signal bearing medium 1401 may include a communication medium 1405 such as, but not limited to, a digital and/or analog communication medium (e.g., fiber optic cable, waveguide, wired communications link, wireless communications link, etc.). Thus, for example, the signal bearing medium 1401 may be conveyed by a communication medium 1405 in wireless form (e.g. a wireless communication medium complying with the IEEE 802.14 standard or other transmission protocol).
The one or more program instructions 1402 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, a computing device of the computing device may be configured to provide various operations, functions, or actions in response to program instructions 1402 communicated to the computing device through one or more of computer-readable medium 1403, computer-recordable medium 1404, and/or communication medium 1405.
It should be further noted that the above-described apparatus embodiments are merely illustrative, and that the units described as separate units may or may not be physically separate, and that units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the application, the connection relation between the modules represents that the modules have communication connection, and can be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components and the like. In general, functions performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits or dedicated circuits. However, for the present application, a software program implementation is the preferred embodiment in most cases. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, and including several instructions for causing a computer device (which may be a personal computer, a training device, a network device, etc.) to perform the methods described in the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via a wired (e.g., coaxial cable, optical fiber, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a training device, a data center, or the like that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy Disk, a hard Disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.

Claims (24)

1. A multi-agent reinforcement learning method, wherein the method is applied to a learning environment including a plurality of agents, the method comprising:
obtaining an observation value of a first intelligent agent, wherein the first intelligent agent is an intelligent agent in the plurality of intelligent agents, and the observation value comprises a state value of the learning environment observed by the first intelligent agent;
inputting the observed values into an agent network to obtain a first value function and a second value function, wherein the agent network comprises a parallel dense structure and a sparse structure, the dense structure is used for extracting first state features related to all agents in the learning environment based on the observed values and obtaining the first value function based on the first state features, and the sparse structure is used for extracting second state features related to part of the agents in the learning environment based on the observed values and obtaining the second value function based on the second state features;
and determining a first loss function corresponding to the first value function and a second loss function corresponding to the second value function through a temporal-difference method, and training the agent network based on the first loss function and the second loss function.
2. The method of claim 1, wherein the dense structure is configured to extract the first state feature based on a dense attention mechanism and the observations, and wherein the sparse structure is configured to extract the second state feature based on a sparse attention mechanism and the observations.
3. The method of claim 1 or 2, wherein the first value function is used to indicate a value of the first agent for selecting a different action in the first state characteristic, and the second value function is used to indicate a value of the first agent for selecting a different action in the second state characteristic.
4. A method according to claim 3, wherein said determining, through a temporal-difference method, a first loss function corresponding to the first value function and a second loss function corresponding to the second value function comprises:
selecting a first action from a plurality of actions indicated by the first value function according to the first value function and a preset strategy, and determining a first value corresponding to the first action;
acquiring a second value corresponding to the state after the first action is executed, and determining the first loss function according to the first value and the second value;
Selecting a second action from a plurality of actions indicated by the second value function according to the second value function and the preset strategy, and determining a third value corresponding to the second action;
and acquiring a fourth value corresponding to the state after the second action is executed, and determining the second loss function according to the third value and the fourth value.
5. The method of claim 4, wherein the preset strategy comprises a greedy policy or an ε-greedy policy, wherein the greedy policy is used to select the action with the highest value, and the ε-greedy policy is used to select the action with the highest value based on a first probability and to select actions other than the action with the highest value based on a second probability.
6. The method of claim 5, wherein the first probability has a positive correlation with the number of training of the agent network and the second probability has a negative correlation with the number of training of the agent network.
7. The method of any of claims 1-6, wherein the agent network further comprises a recurrent neural network for encoding the observations and historical actions performed by the agent to obtain encoded features;
The first value function is derived based on the first state characteristic and the encoding characteristic, and the second value function is derived based on the second state characteristic and the encoding characteristic.
8. The method of any of claims 1-7, wherein the training the agent network based on the first and second loss functions comprises:
acquiring dense loss functions corresponding to other intelligent agents in the plurality of intelligent agents, and performing mixed processing on the first loss function and the dense loss functions through a first mixed network to obtain dense mixed loss functions, wherein the dense loss functions are obtained based on a dense attention mechanism;
acquiring sparse loss functions corresponding to other intelligent agents in the plurality of intelligent agents, and performing mixed processing on the second loss function and the sparse loss function through a second mixed network to obtain a sparse mixed loss function, wherein the sparse loss function is obtained based on a sparse attention mechanism;
training the agent network based on the dense mixed loss function and the sparse mixed loss function.
9. The method of any of claims 1-8, wherein the dense structure in the agent network is used to perform tasks that infer an agent action after training of the agent network is complete, and the sparse structure in the agent network is not used to perform the tasks.
10. The method of any one of claims 1-9, wherein the learning environment comprises an autopilot environment, a robotic collaborative work environment, or a multi-persona interactive gaming environment.
11. A multi-agent reinforcement learning device, wherein the device is applied to a learning environment comprising a plurality of agents, the device comprising:
the acquisition unit is used for acquiring an observed value of a first intelligent agent, wherein the first intelligent agent is one of the plurality of intelligent agents, and the observed value comprises a state value of the learning environment observed by the first intelligent agent;
a processing unit, configured to input the observed value into an agent network to obtain a first value function and a second value function, where the agent network includes a parallel dense structure and a sparse structure, where the dense structure is configured to extract first state features related to all agents in the learning environment based on the observed value and obtain the first value function based on the first state features, and the sparse structure is configured to extract second state features related to some agents in the learning environment based on the observed value and obtain the second value function based on the second state features;
The processing unit is further configured to determine a first loss function corresponding to the first value function and a second loss function corresponding to the second value function through a temporal-difference method, and train the agent network based on the first loss function and the second loss function.
12. The apparatus of claim 11, wherein the dense structure is configured to extract the first state feature based on a dense attention mechanism and the observations, and wherein the sparse structure is configured to extract the second state feature based on a sparse attention mechanism and the observations.
13. The apparatus of claim 11 or 12, wherein the first value function is used to indicate a value of the first agent for selecting a different action in the first state characteristic, and the second value function is used to indicate a value of the first agent for selecting a different action in the second state characteristic.
14. The apparatus according to claim 13, wherein the processing unit is specifically configured to:
selecting a first action from a plurality of actions indicated by the first value function according to the first value function and a preset strategy, and determining a first value corresponding to the first action;
Acquiring a second value corresponding to the state after the first action is executed, and determining the first loss function according to the first value and the second value;
selecting a second action from a plurality of actions indicated by the second value function according to the second value function and the preset strategy, and determining a third value corresponding to the second action;
and acquiring a fourth value corresponding to the state after the second action is executed, and determining the second loss function according to the third value and the fourth value.
15. The apparatus of claim 14, wherein the preset policy comprises a greedy policy or an e-greedy policy, wherein the greedy policy is used to select the most valuable action, wherein the e-greedy policy is used to select the most valuable action based on a first probability and wherein other actions than the most valuable action are selected based on a second probability.
16. The apparatus of claim 15, wherein the first probability has a positive correlation with the number of training of the agent network and the second probability has a negative correlation with the number of training of the agent network.
17. The apparatus of any one of claims 11-16, wherein the agent network further comprises a recurrent neural network for encoding the observations and historical actions performed by the agent to obtain encoded features;
the first value function is derived based on the first state characteristic and the encoding characteristic, and the second value function is derived based on the second state characteristic and the encoding characteristic.
18. The apparatus according to any one of claims 11-17, wherein the processing unit is specifically configured to:
acquiring dense loss functions corresponding to other intelligent agents in the plurality of intelligent agents, and performing mixed processing on the first loss function and the dense loss functions through a first mixed network to obtain dense mixed loss functions, wherein the dense loss functions are obtained based on a dense attention mechanism;
acquiring sparse loss functions corresponding to other intelligent agents in the plurality of intelligent agents, and performing mixed processing on the second loss function and the sparse loss function through a second mixed network to obtain a sparse mixed loss function, wherein the sparse loss function is obtained based on a sparse attention mechanism;
Training the agent network based on the dense mixed loss function and the sparse mixed loss function.
19. The apparatus of any of claims 11-18, wherein the dense structure in the agent network is used to perform tasks that infer an agent action after training of the agent network is complete, and the sparse structure in the agent network is not used to perform the tasks.
20. The apparatus of any one of claims 11-19, wherein the learning environment comprises an autopilot environment, a robotic collaborative work environment, or a multi-persona interactive gaming environment.
21. A multi-agent reinforcement learning device, characterized by comprising a memory and a processor; the memory stores code, and the processor is configured to execute the code; when the code is executed, the multi-agent reinforcement learning device performs the method of any one of claims 1 to 10.
22. A multi-agent reinforcement learning system, characterized by comprising a plurality of multi-agent reinforcement learning devices according to claim 21, wherein different multi-agent reinforcement learning devices are used for processing different agents.
23. A computer storage medium storing instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 10.
24. A computer program product, characterized in that it stores instructions that, when executed by a computer, cause the computer to implement the method of any one of claims 1 to 10.
CN202210623513.2A 2022-06-02 2022-06-02 Multi-agent reinforcement learning method and related device Pending CN117236459A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210623513.2A CN117236459A (en) 2022-06-02 2022-06-02 Multi-agent reinforcement learning method and related device
PCT/CN2023/096819 WO2023231961A1 (en) 2022-06-02 2023-05-29 Multi-agent reinforcement learning method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210623513.2A CN117236459A (en) 2022-06-02 2022-06-02 Multi-agent reinforcement learning method and related device

Publications (1)

Publication Number Publication Date
CN117236459A true CN117236459A (en) 2023-12-15

Family

ID=89026928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210623513.2A Pending CN117236459A (en) 2022-06-02 2022-06-02 Multi-agent reinforcement learning method and related device

Country Status (2)

Country Link
CN (1) CN117236459A (en)
WO (1) WO2023231961A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117393153B (en) * 2023-12-11 2024-03-08 中国人民解放军总医院 Real-time risk warning monitoring method and system for shock based on medical Internet of Things time series data and deep learning algorithm
CN117578466B (en) * 2024-01-17 2024-04-05 国网山西省电力公司电力科学研究院 Power system transient stability preventive control method based on advantage function decomposition
KR102696218B1 (en) * 2024-02-27 2024-08-20 한화시스템 주식회사 Multi-agent reinforcement learning cooperation framework system in sparse reward battlefield environment and method for thereof
CN118761889B (en) * 2024-09-09 2024-11-29 河海大学 Urban flood emergency response method based on multi-dimensional infrastructure relationship network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3060900A1 (en) * 2018-11-05 2020-05-05 Royal Bank Of Canada System and method for deep reinforcement learning
CN111612162B (en) * 2020-06-02 2021-08-27 中国人民解放军军事科学院国防科技创新研究院 Reinforced learning method and device, electronic equipment and storage medium
CN111898770B (en) * 2020-09-29 2021-01-15 四川大学 Multi-agent reinforcement learning method, electronic equipment and storage medium
CN112966816B (en) * 2021-03-31 2024-12-10 东南大学 A multi-agent reinforcement learning method for formation encirclement
CN113313267B (en) * 2021-06-28 2023-12-08 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113723013B (en) * 2021-09-10 2024-06-18 中国人民解放军国防科技大学 Multi-agent decision-making method for continuous-space wargaming

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117744753A (en) * 2024-02-19 2024-03-22 浙江同花顺智能科技有限公司 Method, device, equipment and medium for determining prompt word of large language model
CN117744753B (en) * 2024-02-19 2024-05-03 浙江同花顺智能科技有限公司 Method, device, equipment and medium for determining prompt word of large language model

Also Published As

Publication number Publication date
WO2023231961A1 (en) 2023-12-07

Similar Documents

Publication Publication Date Title
Tang et al. Deep reinforcement learning with population-coded spiking neural network for continuous control
CN117236459A (en) Multi-agent reinforcement learning method and related device
US20210390653A1 (en) Learning robotic tasks using one or more neural networks
Lei et al. Dynamic path planning of unknown environment based on deep reinforcement learning
US10293483B2 (en) Apparatus and methods for training path navigation by robots
CN112119409B (en) Neural network with relational memory
CN112183718B (en) Deep learning training method and device for computing equipment
WO2022068623A1 (en) Model training method and related device
US8990133B1 (en) Apparatus and methods for state-dependent learning in spiking neuron networks
Song et al. Ensemble reinforcement learning: A survey
Zieliński et al. 3D robotic navigation using a vision-based deep reinforcement learning model
WO2023246819A1 (en) Model training method and related device
CN111340190A (en) Method and device for constructing network structure, and image generation method and device
Tongloy et al. Asynchronous deep reinforcement learning for the mobile robot navigation with supervised auxiliary tasks
Yu et al. Hybrid attention-oriented experience replay for deep reinforcement learning and its application to a multi-robot cooperative hunting problem
CN113627163A (en) Attention model, feature extraction method and related device
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
CN114091554A (en) Training set processing method and device
You et al. Integrating contrastive learning with dynamic models for reinforcement learning from images
CN113635896A (en) Driving behavior determination method and related equipment thereof
Parhi et al. Navigational strategy for underwater mobile robot based on adaptive neuro-fuzzy inference system model embedded with shuffled frog leaping algorithm–based hybrid learning approach
Azimirad et al. Vision-based Learning: a novel machine learning method based on convolutional neural networks and spiking neural networks
Wang et al. Inference-based posteriori parameter distribution optimization
WO2024067115A1 (en) Training method for gflownet, and related apparatus
WO2024067113A1 (en) Action prediction method and related device thereof

Legal Events

Date Code Title Description
PB01 Publication