CN115952736A - Multi-agent target collaborative search method and system - Google Patents

Multi-agent target collaborative search method and system

Info

Publication number
CN115952736A
Authority
CN
China
Prior art keywords
agent
action
data
search
network
Prior art date
Legal status
Pending
Application number
CN202310005618.6A
Other languages
Chinese (zh)
Inventor
张晓平
郑远鹏
王力
孟祥鹏
吴宜通
马新雨
张嘉林
冯辉
Current Assignee
North China University of Technology
Original Assignee
North China University of Technology
Application filed by North China University of Technology
Priority to CN202310005618.6A
Publication of CN115952736A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-agent target collaborative search method and system, relating to the technical field of swarm intelligence and multi-agent target search. The method comprises the following steps: for any agent in a search simulation environment, acquiring the environment detection information perceived by the agent; setting a deterministic strategy action based on an expected revenue gradient; acquiring updated environment detection information after the agent executes the deterministic strategy action; determining intrinsic reward data based on the action emotion change data and the external environment reward data; the intrinsic reward data and the external environment reward data form the search overall reward data; the search overall reward data, the current state data, the updated state data and the deterministic strategy action form an experience quadruple; and each agent randomly extracts an experience quadruple from the experience pool for training, so as to obtain and execute an optimal strategy action and thereby realize collaborative target search. The method solves the problem of sparse rewards and improves the efficiency with which multiple agents search complex unknown environments.

Description

Multi-agent target collaborative search method and system
Technical Field
The invention relates to the technical field of swarm intelligence and multi-agent target search, and in particular to a multi-agent target collaborative search method and system based on an improved deep deterministic policy gradient.
Background
In recent years, with the cross-integration of control, communication and computer technologies, multi-agent cooperative control has attracted the attention of many researchers. Compared with a single-agent system, a multi-agent system can complete more complex tasks and has the advantages of high efficiency, high fault tolerance and inherent parallelism. Target search by multiple agents is one of the major challenges in the multi-agent domain. Since multiple agents are assigned to search for different targets, different agents may interfere with each other, reducing overall task efficiency.
The existing multi-agent target collaborative search technology is based on intelligent bionic algorithms, such as the ant colony algorithm, neural network algorithms, genetic algorithms and particle swarm optimization. Such algorithms usually require many planning iterations to obtain an optimal solution, and in search problems set in complex dynamic environments they easily fall into local optima, making a feasible scheme difficult to obtain. Deep reinforcement learning, a hot topic applied to multi-agent target collaborative search in recent years, combines the perception capability of deep learning with the decision-making capability of reinforcement learning and provides a solution to the perception and decision problem of complex multi-agent systems.
In a multi-agent deep reinforcement learning algorithm, a value function or a policy is fitted by combining the strong high-dimensional data representation capability of deep learning with the optimal decision-making capability provided by reinforcement learning, and the optimal value function or optimal policy is then obtained by training on interaction samples. The method uses a neural network as a function approximator to generalize and approximate the value function, overcoming the shortcomings of traditional reinforcement learning and multi-agent reinforcement learning, in particular the curse of dimensionality, and enabling multiple agents to: 1) observe their own states (or decision factors); 2) exchange information (e.g., instant rewards, Q values, value functions and optimization strategies) with neighboring agents; 3) interact with their operating environment; and 4) learn autonomously and select proper actions without supervision, so as to enhance system performance. In particular, the deep deterministic policy gradient algorithm can handle tasks with high-dimensional or continuous action spaces, such as multi-agent search and path planning, by directly updating and iterating the policy through parameter optimization so as to maximize the accumulated expected return. Compared with other methods, the deep deterministic policy gradient algorithm is simpler and converges better. However, in environments characterized by sparse rewards and random noise, it is difficult for a deep reinforcement learning algorithm to obtain state-action samples containing effective reward information through random exploration, so the training process is inefficient and may even fail to learn an effective strategy.
Disclosure of Invention
The invention aims to provide a multi-agent target collaborative search method and system that solve the sparse-reward problem by introducing emotion data and improve the efficiency with which multiple agents search complex unknown environments.
In order to achieve the purpose, the invention provides the following scheme:
a multi-agent target collaborative search method comprises the following steps:
constructing a search simulation environment; a plurality of agents, a plurality of obstacles and a plurality of search targets are randomly arranged in the search simulation environment; each agent comprises an Actor network and a Critic network; the Actor network is used for selecting an action to be executed by the agent, and the Critic network is used for evaluating the expected income of the agent;
aiming at any agent in the search simulation environment, acquiring environment detection information sensed by the agent; the environment detection information comprises current state data of all agents in the detection range of the agents;
setting a deterministic policy action according to the environment detection information, the Actor network and the Critic network, based on an expected profit gradient;
obtaining updated environment detection information after the agent executes the deterministic policy action; the updated environment detection information comprises action emotion change data, external environment reward data and updated state data;
determining intrinsic reward data based on the action emotion change data and the extrinsic context reward data; the intrinsic reward data and the extrinsic context reward data constitute search global reward data; the search ensemble reward data, the current state data, the updated state data, and the deterministic policy action constitute an empirical quadruplet; the experience quadruplets corresponding to the multiple agents form an experience pool;
randomly extracting an experience quadruple from the experience pool by each intelligent agent, and training the Actor network and the Critic network by using the extracted experience quadruple to obtain the optimal strategy action of each intelligent agent; each agent executes a corresponding optimal strategy action to realize the target collaborative search.
Optionally, the action emotion change data is calculated from: the weight vectors θ_i, η_i and λ_i of the internal changes of the first emotion, the second emotion and the third emotion in the i-th agent, respectively; the numbers of times the i-th agent reaches the first, second and third emotional states within one time step, respectively; and the preset intrinsic emotion rewards, wherein r_t^h denotes the preset intrinsic emotion reward corresponding to the first emotion, r_t^a denotes the preset intrinsic emotion reward corresponding to the second emotion, and r_t^f denotes the preset intrinsic emotion reward corresponding to the third emotion.
Optionally, determining intrinsic reward data based on the action emotion change data and the external environment reward data specifically comprises:
calculating an emotional steady-state value from the action emotion change data of the agent and the initial emotion value preset in the agent before time t, wherein H_t denotes the emotional steady-state value inside the agent at time t;
calculating an emotion function value, wherein E denotes the emotion function value and H_{t-1} denotes the emotional steady-state value inside the agent at time t-1;
calculating an emotion coefficient, wherein C denotes the emotion coefficient, k denotes a constant coefficient and e denotes the constant e; and
calculating the intrinsic reward data, wherein T denotes the maximum time step, r_t^i denotes the intrinsic reward data of the i-th agent and r_t^e denotes the external environment reward data.
Optionally, the determining process of the external environment reward data specifically includes:
if the updated environment detection information comprises a search target corresponding to the intelligent agent, giving a first preset value to the external environment reward data;
if the updated environment detection information comprises any obstacle and the distance between the obstacle and the intelligent agent is smaller than a first preset distance value, giving a second preset value to the external environment reward data;
if the updated environment detection information comprises other agents and the distance between the other agents and the agents is smaller than a second preset distance value, assigning the external environment reward data according to a preset collision punishment formula;
and if the updated state data of the intelligent agent in the updated environment detection information represents the movement of the intelligent agent, giving a third preset value to the external environment reward data.
In order to achieve the purpose, the invention also provides the following technical scheme:
a multi-agent target collaborative search system, comprising:
the simulation environment construction module is used for constructing a search simulation environment; a plurality of agents, a plurality of obstacles and a plurality of search targets are randomly arranged in the search simulation environment; each agent comprises an Actor network and a Critic network; the Actor network is used for selecting an action to be executed by the agent, and the Critic network is used for evaluating the expected income of the agent;
the detection information acquisition module is used for acquiring environment detection information sensed by any agent in the search simulation environment; the environment detection information comprises current state data of all agents in the detection range of the agents;
the strategy action determining module is used for setting a deterministic strategy action according to the environment detection information, the Actor network and the Critic network based on an expected income gradient;
a detection information updating module, configured to obtain updated environment detection information after the agent executes the deterministic policy action; the updated environment detection information comprises action emotion change data, external environment reward data and updated state data;
an experience quadruplet construction module, which is used for determining intrinsic reward data based on the action emotion change data and the extrinsic environment reward data; the intrinsic reward data and the extrinsic context reward data constitute search global reward data; the search global reward data, the current state data, the updated state data, and the deterministic policy action form an empirical quadruple; the experience quadruplets corresponding to the multiple agents form an experience pool;
the multi-agent searching module is used for randomly extracting an experience quadruple from the experience pool by using each agent and training the Actor network and the Critic network by using the extracted experience quadruple to obtain the optimal strategy action of each agent; each agent executes the corresponding optimal strategy action to realize the target collaborative search.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention discloses a multi-agent target collaborative search method and a multi-agent target collaborative search system.A simulation environment is constructed, and any agent in the simulation environment carries out environment perception to obtain environment detection information; based on the expected profit gradient, a deterministic strategy action is set according to environment detection information, an Actor network and a criticic network in an intelligent agent, so that the deep deterministic strategy gradient is applied to the multi-intelligent-agent collaborative search problem, the perception capability of deep learning and the decision capability of reinforcement learning are combined, the multi-intelligent-agent system has higher autonomous learning capability, the search speed and accuracy of a complex dynamic environment can be effectively improved, different scene changes are adapted, and the complex dynamic environment which cannot be solved by a traditional algorithm and an intelligent bionic algorithm is solved.
Action emotion change data, external environment reward data and updated state data are then obtained after the agent executes the deterministic strategy action; intrinsic reward data is determined based on the action emotion change data and the external environment reward data; the intrinsic reward data and the external environment reward data form the search overall reward data; the search overall reward data, the current state data, the updated state data and the deterministic strategy action form an experience quadruple; and the experience quadruples corresponding to the multiple agents form an experience pool. In other words, the invention introduces an emotional intrinsic motivation that can be mapped into intrinsic reward signals in the deep deterministic policy gradient algorithm, and the agent's intrinsic reward and external reward are jointly used as the overall reward of the agent's search process. A strongly heuristic exploration strategy is thus formed, the sparse-reward problem is effectively solved, the efficiency of multi-agent search in complex unknown environments is improved, and the difficulty of the traditional DDPG in learning a search strategy under sparse rewards is overcome.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a multi-agent target collaborative search method according to the present invention;
FIG. 2 is a diagram of a multi-agent algorithm framework for an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-agent cooperative target search scenario of the present invention;
FIG. 4 is a flow chart of multi-agent target collaborative search according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the multi-agent target collaborative search system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a multi-agent target collaborative search method and system in which an emotional intrinsic motivation module is designed and introduced into the deep deterministic policy gradient algorithm, thereby solving the problem that multiple agents cannot learn an effective search strategy in a sparse-reward environment and addressing the multi-agent target collaborative search problem.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
In order to solve the exploration difficulties caused by sparse rewards and random noise interference in large-scale state-action spaces, researchers have proposed deep reinforcement learning exploration methods based on goals, uncertainty measurement, intrinsic motivation and the like. Compared with the two classes of methods based on goals and uncertainty measurement, deep reinforcement learning based on intrinsic motivation starts from the mechanism by which intrinsic motivation in behavioral science and psychology drives higher organisms to explore unknown environments autonomously: heuristic concepts derived from intrinsic motivation, such as novelty, are expressed as intrinsic reward signals that drive agents to explore the environment autonomously and efficiently, embodying a more abstract and anthropomorphic way of thinking. Specifically, intrinsic motivation is the pleasure that higher organisms obtain in the pursuit of increased autonomy, ability or control, and is the driving force for exploring unknown environments in the absence of external stimuli. Intrinsic motivation can be mapped into an intrinsic reward signal in deep reinforcement learning and combined with value-function-based or policy-gradient-based deep reinforcement learning methods to form a strongly heuristic exploration strategy, so as to improve the efficiency with which agents explore complex unknown environments.
Therefore, in order to solve the problems that the multi-agent cooperative search process lacks exploration motivation and that an effective search strategy cannot be learned due to sparse rewards, an intrinsic-motivation deep deterministic policy gradient algorithm is designed for the multi-agent system. Psychological studies indicate that emotions have the function of driving behavioral adaptation, that both positive and negative emotions can generate behavioral motivation, and that changes in emotion cause changes in learning motivation. Therefore, an intrinsic emotional motivation module is added; it generates the current emotion and an intrinsic reward according to environmental stimuli and the cognitive state, which can effectively solve the sparse-reward problem.
Example one
As shown in fig. 1, the present embodiment provides a multi-agent target collaborative search method, including:
step 100, constructing a search simulation environment; a plurality of intelligent agents, a plurality of obstacles and a plurality of search targets are randomly arranged in the search simulation environment; each agent comprises an Actor network and a criticic network; the Actor network is used for selecting an action to be executed of the agent, and the criticic network is used for evaluating expected benefits of the agent.
Step 200, aiming at any agent in the search simulation environment, acquiring the environment detection information perceived by the agent; the environment detection information comprises the current state data of all agents within the detection range of the agent, where all agents include the agent itself and the other agents within its detection range.
In one specific example, as shown in fig. 2, S represents the set of agent state spaces, S = {s_k | k = 1, 2, ..., m}, where s_k represents the k-th state of the agent and m represents the number of states of the agent. O represents the observation set, O = {O_1, O_2, ..., O_N}, where O_N denotes the observation of the N-th agent. In each time step, the agent observes the environment state through perception to obtain environment detection information; the environment detection information also comprises the detection range of the agent, the positions of the obstacles and the positions of the search targets. A represents the action space, A = {A_1, A_2, ..., A_N}, where A_N represents the action of the N-th agent. The action of the N-th agent at time t+1 is determined as follows:
where θ_{t+1} represents the motion angle of the agent at time t+1, μ represents the rate of change of the agent's motion angle, v(t+1) represents the motion velocity of the agent at time t+1, and a represents the acceleration. The actions of the agent are primarily determined by speed and direction.
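As an illustrative sketch of this action model (the exact update appears only as an image in the original publication), the function below assumes a simple first-order form in which the heading changes at rate μ and the speed changes with acceleration a over one time step; the function name and the time-step argument dt are assumptions introduced here.

```python
import math

def step_action(theta_t, v_t, mu, a, dt=1.0):
    """Assumed first-order update of an agent's heading and speed.

    theta_t: current motion angle (rad); mu: angular rate of change
    v_t: current speed; a: acceleration; dt: length of one time step
    """
    theta_next = theta_t + mu * dt          # new heading angle
    v_next = v_t + a * dt                   # new speed
    # Velocity components used to move the particle-like agent.
    vx, vy = v_next * math.cos(theta_next), v_next * math.sin(theta_next)
    return theta_next, v_next, (vx, vy)

# Example: agent turning slowly while accelerating.
print(step_action(theta_t=0.0, v_t=1.0, mu=0.1, a=0.05))
```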
In addition, in order to facilitate subsequent calculation, assignment rules for the agent's external environment reward data need to be set in advance: the agent obtains a reward when it moves to the target position and receives a penalty when it collides with an obstacle. By setting the environment reward and learning a search strategy that maximizes the accumulated reward, the agent can find the target more quickly.
Specifically, the process of determining the external environment reward data comprises:
1) if the environment detection information comprises the search target corresponding to the agent, that is, the agent has found the target, the external environment reward data r_t^e is given a first preset value;
2) if the environment detection information comprises any obstacle and the distance between the obstacle and the agent is smaller than a first preset distance value, that is, the agent has collided with the obstacle, the external environment reward data is given a second preset value;
3) if the environment detection information comprises other agents and the distance between another agent and the agent is smaller than a second preset distance value, that is, the current agent has collided with another agent, the external environment reward data is assigned according to a preset collision penalty formula, in which λ is the collision penalty factor and i and j denote the i-th and j-th agents, respectively;
4) if the updated state data of the agent in the environment detection information represents movement of the agent, the external environment reward data is given a third preset value reflecting the movement cost of the agent. A sketch of these assignment rules is given after this list.
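A minimal sketch of these assignment rules follows. The concrete numeric preset values, the safe-distance threshold and the linear form of the collision penalty are illustrative assumptions; the patent only specifies that preset values and a collision penalty formula with factor λ are used.

```python
def external_reward(found_target, hit_obstacle, dist_to_agents,
                    moved, r_target=10.0, r_obstacle=-10.0,
                    lam=0.5, safe_dist=1.0, r_move=-0.1):
    """Assumed assignment of the external environment reward r_e.

    The numeric preset values (r_target, r_obstacle, r_move), safe_dist and
    the linear collision penalty are illustrative choices, not patent values.
    """
    if found_target:                      # rule 1): target found
        return r_target
    if hit_obstacle:                      # rule 2): obstacle collision
        return r_obstacle
    # rule 3): penalty grows as other agents come closer than safe_dist
    penalty = sum(-lam * (safe_dist - d) for d in dist_to_agents if d < safe_dist)
    if penalty < 0.0:
        return penalty
    # rule 4): small movement cost for every step actually taken
    return r_move if moved else 0.0

print(external_reward(False, False, dist_to_agents=[0.4, 2.0], moved=True))
```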
Step 300, setting a deterministic strategy action according to the environment detection information, the Actor network and the Critic network, based on an expected profit gradient. Each agent corresponds to an Actor network μ(o_i; θ_i) and a Critic network Q(s, a; ω_i). The Actor network is deterministic: for input o_i, the output action a_i = μ(o_i; θ_i) is determined. The inputs of the Critic network are the global state and the actions of all agents, and the output is a real number that represents how good it is to perform action a in state s. The Critic network is used to evaluate the actions, and the Actor policy network makes improvements accordingly.
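A minimal sketch of such Actor and Critic networks, assuming small fully connected PyTorch models; the layer sizes, activations and the choice of PyTorch are illustrative assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(o_i; theta_i): local observation -> action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded continuous action
        )
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Centralized value Q(s, a_1..a_N; omega_i): global state plus all actions -> scalar."""
    def __init__(self, global_state_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_state_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))
```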
Step 300 specifically comprises:
1) randomly selecting an action to be executed based on the environment detection information and the Actor network;
2) calculating an expected profit value by using the expected profit gradient formula, based on the action to be executed and the Critic network;
3) judging whether the expected profit value meets a preset optimal condition; if the expected profit value meets the preset optimal condition, marking the action to be executed as the deterministic strategy action; if the expected profit value does not meet the preset optimal condition, adjusting the network parameters of the Actor network according to the expected profit value and then returning to the step of randomly selecting an action to be executed based on the environment detection information.
The role of the Actor network is to improve the parameter θ_i through training so as to increase the average value given by the Critic network. The expected revenue gradient formula of agent i is:

∇_{θ_i} J(μ_i) = E_{o,a~M} [ ∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i(o, a_1, ..., a_n) ]

where ∇_{θ_i} J(μ_i) represents the expected revenue gradient of the i-th agent, θ_i represents the action strategy parameter, E represents the mathematical expectation and M represents the preset memory module; ∇_{a_i} Q_i(o, a_1, ..., a_n) represents the action gradient term and ∇_{θ_i} μ_i(o_i) represents the action strategy gradient term; Q_i(o, a_1, ..., a_n) represents the centralized evaluation value output by the Critic network of the i-th agent, and (a_1, ..., a_n) represents all actions selectable by the i-th agent; o = (o_1, o_2, ..., o_n), where o_n represents the current state data of the n-th agent; a_i represents the action to be executed that is output when o_i is input to the Actor network of the agent; n represents the number of all actions selectable by the i-th agent; μ_i represents the deterministic strategy of the i-th agent, i = 1, 2, ..., N, and N represents the total number of agents. In addition, the preset memory module is an experience replay array that stores the collected experiences, each experience being a quadruple (s_t, a_t, R_t, s_{t+1}).
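Under the same assumptions, the corresponding Actor update can be sketched as one step of gradient ascent on the centralized evaluation value, substituting agent i's freshly computed action into the stored joint action; the helper names and optimizer handling below are illustrative.

```python
import torch

def actor_update(actor_i, critic_i, actor_opt, obs_batch, joint_act_batch,
                 state_batch, agent_index, act_dim):
    """One gradient ascent step on J(mu_i): raise Q_i(o, a_1..a_n) w.r.t. theta_i.

    obs_batch: local observations o_i of agent i, shape (B, obs_dim)
    joint_act_batch: all agents' stored actions, shape (B, N*act_dim)
    state_batch: global state o = (o_1..o_N), shape (B, state_dim)
    """
    new_action = actor_i(obs_batch)                      # a_i = mu_i(o_i; theta_i)
    joint = joint_act_batch.clone()
    start = agent_index * act_dim
    joint[:, start:start + act_dim] = new_action         # substitute agent i's action
    # Ascend the centralized critic: minimize -Q_i.
    loss = -critic_i(state_batch, joint).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```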
Step 400, obtaining updated environment detection information after the agent executes the deterministic policy action; the updated environment detection information comprises action emotion change data, external environment reward data and updated state data.
In particular, emotions have the function of driving behavioral adaptation; positive and negative emotions generate behavioral motivation, and changes in emotion cause changes in learning motivation. Aiming at the state changes in the agent's target search process, three emotions, namely happiness (the first emotion), anger (the second emotion) and fear (the third emotion), are introduced, and an intrinsic reward function based on emotional motivation is set to solve the sparse-reward problem.
As shown in Table 1, the intrinsic emotion rewards are first set: r_t^h corresponds to the happiness state, r_t^a to the anger state and r_t^f to the fear state.
Based on Table 1, the action emotion change data is calculated from: the weight vectors θ_i, η_i and λ_i of the internal changes of the first emotion (happiness), the second emotion (anger) and the third emotion (fear) in the i-th agent, respectively; the numbers of times the i-th agent reaches the first, second and third emotional states within one time step, respectively; and the preset intrinsic emotion rewards, wherein r_t^h denotes the preset intrinsic emotion reward corresponding to the first emotion, r_t^a denotes the preset intrinsic emotion reward corresponding to the second emotion, and r_t^f denotes the preset intrinsic emotion reward corresponding to the third emotion.

θ_i = [θ_1, θ_2, ..., θ_M]
η_i = [η_1, η_2, ..., η_O]
λ_i = [λ_1, λ_2, ..., λ_P]

where M, O and P respectively represent the numbers of happiness states, anger states and fear states. The weight parameters θ_M, η_O and λ_P take values in [0, 1] and determine the degree to which the external environment information influences the internal emotion change.
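As an illustrative sketch (the exact formula appears only as an image in the original publication), the function below assumes the action emotion change is a weighted combination of how often each emotional state was reached, scaled by the corresponding preset intrinsic emotion reward; the reward values and the linear form are assumptions.

```python
import numpy as np

def emotion_change(theta, eta, lam, n_happy, n_anger, n_fear,
                   r_h=1.0, r_a=-0.5, r_f=-1.0):
    """Assumed action-emotion-change signal for one agent over one time step.

    theta, eta, lam: weight vectors in [0, 1] for the happiness, anger and
    fear states (lengths M, O, P); n_happy, n_anger, n_fear: counts of how
    often each emotional state was reached; r_h, r_a, r_f: illustrative
    preset intrinsic emotion rewards for the three emotions.
    """
    delta = (np.dot(theta, n_happy) * r_h
             + np.dot(eta, n_anger) * r_a
             + np.dot(lam, n_fear) * r_f)
    return float(delta)

# Example with two happiness states, one anger state and one fear state.
print(emotion_change(theta=[0.6, 0.4], eta=[0.8], lam=[0.9],
                     n_happy=[3, 1], n_anger=[1], n_fear=[0]))
```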
Step 500, determining intrinsic reward data based on the action emotion change data and the extrinsic context reward data; the intrinsic reward data and the extrinsic environment reward data constitute search overall reward data; the search ensemble reward data, the current state data, the updated state data, and the deterministic policy action constitute an empirical quadruplet; and the experience quadruplets corresponding to the multiple agents form an experience pool.
Step 500 specifically comprises:
Aiming at the three emotions in the agent search process, an emotional steady-state variable function H_t based on emotion change is designed. The emotional steady-state value is calculated from the action emotion change data of the agent and the initial emotion value preset in the agent before time t, where H_t represents the emotional steady-state value inside the agent at time t.
In two adjacent learning steps, the difference between the agent's emotional steady-state variable values causes the agent's emotion to change, so only one emotion dimension is considered, and the emotion function E is defined on the basis of the difference between the emotional steady-state values of the two adjacent steps, where E represents the emotion function value and H_{t-1} represents the emotional steady-state value inside the agent at time t-1. The emotion coefficient is then calculated from the emotion function value, where C represents the emotion coefficient, k represents a constant coefficient and e represents the constant e.
Finally, the emotion-based intrinsic reward data of the agent is calculated from the emotion coefficient and the external environment reward data, where T represents the maximum time step, r_t^i represents the intrinsic reward data of the i-th agent and r_t^e represents the external environment reward data.
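The sketch below shows one way such an emotion module could map these quantities to an intrinsic reward. The steady-state accumulation, the adjacent-step difference for the emotion function, the exponential emotion coefficient and the (T - t)/T decay are all assumptions made for illustration, since the corresponding formulas appear only as images in the original publication.

```python
import math

class EmotionModule:
    """Illustrative emotion-driven intrinsic reward (assumed functional forms)."""

    def __init__(self, initial_emotion=0.0, k=1.0, max_steps=100):
        self.H_prev = initial_emotion   # emotional steady-state value H_{t-1}
        self.H = initial_emotion        # H_t, starting from the preset initial value
        self.k = k                      # constant coefficient k
        self.T = max_steps              # maximum time step T

    def intrinsic_reward(self, delta_emotion, r_external, t):
        # Assumed accumulation: the steady state drifts with the emotion change.
        self.H_prev, self.H = self.H, self.H + delta_emotion
        E = self.H - self.H_prev                 # emotion function: adjacent-step difference
        C = self.k * math.exp(E)                 # assumed emotion coefficient
        decay = (self.T - t) / self.T            # assumed decay over the episode
        return decay * C * r_external            # intrinsic reward r_i

emo = EmotionModule()
print(emo.intrinsic_reward(delta_emotion=0.4, r_external=-0.1, t=5))
```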
The process of determining the external environment reward data specifically comprises:
1) if the updated environment detection information comprises the search target corresponding to the agent, the external environment reward data is given a first preset value;
2) if the updated environment detection information comprises any obstacle and the distance between the obstacle and the agent is smaller than a first preset distance value, the external environment reward data is given a second preset value;
3) if the updated environment detection information comprises other agents and the distance between another agent and the agent is smaller than a second preset distance value, the external environment reward data is assigned according to the preset collision penalty formula;
4) if the updated state data of the agent in the updated environment detection information represents movement of the agent, the external environment reward data is given a third preset value.
Step 600, each agent randomly extracts an experience quadruple from the experience pool, and trains the Actor network and the Critic network by using the extracted experience quadruple to obtain an optimal strategy action of each agent; each agent executes a corresponding optimal strategy action to realize the target collaborative search.
The Critic network combines observations and actions for centralized training: the centralized evaluation of the deterministic action strategy is optimized, the i-th value network is updated by using the TD error, and the value network thereby fits the value function Q(s, a) better.
To speed up the learning process of the agents, the inputs of the Critic network include the observed states and the actions taken by the other agents, and the Critic network parameters are updated by minimizing a loss; the parameters of the action network are then updated by gradient descent. The update formula of the Critic network is as follows:

L(θ_i) = E_{o,a,R,o'~M} [ ( R_i + γ Q'_i(o', a_1', ..., a_n') - Q_i(o, a_1, ..., a_n) )^2 ]

where L(θ_i) represents the expected profit value output by the Critic network, R represents the search overall reward data and R_i represents the search overall reward data of the i-th agent; Q'_i(o', a_1', ..., a_n') represents the centralized evaluation value output by the Critic network of the i-th agent when the input taken from the experience quadruple is (o', a_1', ..., a_n'); o' represents the updated state data in the experience quadruple, a_n' represents the action to be executed that is output when o' is input to the Actor network of the agent, and μ' represents the deterministic strategy updated through the Actor network.
The TD error update formula is as follows:
δ(t)=R+γQ′(s,a)-Q(s,a)
The Critic target network and the Actor target network update their parameters in a soft-update manner, in which the target network parameters slowly track the corresponding current Critic and Actor network parameters.
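A PyTorch-style sketch of this centralized Critic update followed by a soft target update; the discount factor γ, the soft-update coefficient τ, the batch layout and the optimizer handling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actors, critic_opt, batch,
                  gamma=0.95, tau=0.01):
    """One centralized-critic update plus a soft target update (illustrative).

    batch: dict with global states "o", joint actions "a", rewards "R_i",
    next global states "o_next" and per-agent next observations "obs_next".
    """
    with torch.no_grad():
        # Target actions a' = mu'(o') from every agent's target Actor.
        next_actions = torch.cat([ta(o) for ta, o in
                                  zip(target_actors, batch["obs_next"])], dim=-1)
        y = batch["R_i"] + gamma * target_critic(batch["o_next"], next_actions)
    q = critic(batch["o"], batch["a"])
    loss = F.mse_loss(q, y)               # minimize the squared TD error
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    # Soft update: target <- tau * online + (1 - tau) * target.
    for p, tp in zip(critic.parameters(), target_critic.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)
    return loss.item()
```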
In the training process of the improved deep deterministic policy gradient algorithm, each agent obtains the action to be executed in the current state according to its own strategy, interacts with the environment to obtain experience, and stores the experience in its memory module. After all the agents have interacted with the environment, each agent randomly extracts experiences from the experience pool to train its own neural networks, thereby outputting the optimal action at each moment.
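A high-level skeleton of this interact-store-sample-train cycle, assuming the helper pieces sketched earlier (the action model, the reward assignment, the emotion module and the network updates) wrapped in hypothetical agent and environment objects; every name used below (env, agents, act, update, emotion, emotion_change) is illustrative rather than taken from the patent.

```python
import random

def train(env, agents, episodes=1000, batch_size=64):
    """Illustrative centralized-training / decentralized-execution loop."""
    memory = []                                   # shared experience pool of quadruples
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            # Decentralized execution: each Actor uses only its local observation.
            actions = [ag.act(o) for ag, o in zip(agents, obs)]
            next_obs, r_external, done = env.step(actions)
            for i, ag in enumerate(agents):
                r_intrinsic = ag.emotion.intrinsic_reward(
                    ag.emotion_change(next_obs[i]), r_external[i], env.t)
                R = r_intrinsic + r_external[i]   # search overall reward
                memory.append((obs, actions, R, next_obs))  # experience quadruple
            obs = next_obs
        # Centralized training: every agent samples from the shared pool.
        if len(memory) >= batch_size:
            for ag in agents:
                ag.update(random.sample(memory, batch_size))
```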
The multi-agent target collaborative search scenario is shown in fig. 3: each agent is regarded as a particle, and the multi-agent target collaborative search task is completed by building a multi-agent particle simulation environment. In this task the agents must avoid collisions while searching for targets; based on this task, the multi-agent target collaborative search method further comprises the following steps:
1) in the process of each agent executing its corresponding optimal strategy action, calculating, for any agent, the distance between the agent and the other agents, the distance between the agent and any obstacle, and the distance between the agent and its corresponding search target;
2) evaluating the distance between the agent and the other agents based on a first preset index to obtain a first evaluation result; evaluating the distance between the agent and any obstacle based on a second preset index to obtain a second evaluation result; and evaluating the distance between the agent and the corresponding search target based on a third preset index to obtain a third evaluation result. The evaluation compares each distance with the corresponding radii, where D_i, D_n, D_o and D_t respectively represent the center coordinates of the current agent, the n-th agent, the obstacle and the target, and R_i, R_n, R_o and R_t respectively represent the radii of the current agent, the n-th agent, the obstacle and the target (a sketch of this check follows this list);
3) if the first evaluation result, the second evaluation result and the third evaluation result all indicate that the preset indexes are reached, that is, the distances between agents, between agents and obstacles, and between agents and targets are all larger than the preset indexes, a search completion signal is generated; if any one of the first, second or third evaluation results indicates that the preset index is not reached, the process returns to the step of obtaining updated environment detection information after the agent executes the deterministic strategy action.
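An illustrative sketch of this quality-index check, treating every entity as a circle given by a center D and a radius R. It assumes the preset indexes are the sums of the corresponding radii, so that the agent must stay clear of other agents and obstacles while overlapping its assigned target; the patent does not state the exact inequalities, so this interpretation is an assumption.

```python
import math

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def search_done(agent, others, obstacles, target):
    """Illustrative completion check over circular entities {"D": center, "R": radius}."""
    clear_of_agents = all(_dist(agent["D"], o["D"]) > agent["R"] + o["R"]
                          for o in others)
    clear_of_obstacles = all(_dist(agent["D"], o["D"]) > agent["R"] + o["R"]
                             for o in obstacles)
    # Assumed target criterion: the agent's circle overlaps the target circle.
    target_reached = _dist(agent["D"], target["D"]) <= agent["R"] + target["R"]
    return clear_of_agents and clear_of_obstacles and target_reached

a = {"D": (0.0, 0.0), "R": 0.5}
print(search_done(a, others=[{"D": (3.0, 0.0), "R": 0.5}],
                  obstacles=[{"D": (0.0, 4.0), "R": 1.0}],
                  target={"D": (0.3, 0.2), "R": 0.4}))
```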
In a specific practical application, as shown in fig. 4, the multi-agent target collaborative search process of the invention is as follows: 1) construct the simulation environment, randomly arrange a plurality of agents, obstacles and targets, and perform initial environment modelling; 2) judge whether the serial number i of the agent satisfies 1 ≤ i ≤ N; 3) if it does, update the Actor network of each agent by calculating each agent's deterministic strategy, with the specific steps comprising: agent i randomly selects an action a to execute based on the currently perceived environment information, including the positions, states s, action information and environment rewards of all agents within its detectable range; the expected income of agent i is updated according to the expected income gradient formula, and it is judged whether the currently selected action is optimal for the Critic network; if so, the action is set as the deterministic strategy; if not, the above steps in 3) are repeated until a deterministic strategy is obtained; 4) the agent is given an emotion change based on the action, the emotional motivation and the new state; 5) judge whether the new state s' is the final state; 6) if it is the final state, end and update the next agent's network; if it is not the final state, the intrinsic reward based on emotional motivation is obtained from the agent's emotion change, and the intrinsic reward and the external reward of the agent are jointly used as the overall reward R of the agent's search process; the quadruple (s, a, R, s') is stored as an experience in the memory module; samples are drawn from the memory module, the new expected revenue gradient is updated, and the Critic network parameters are updated according to the minimized loss function to obtain the overall optimal strategy; 7) after the overall optimal strategy is obtained, each agent independently executes the search task according to it; during each agent's search process, the quality index between the agent and the target is constantly calculated, and when every agent meets the preset index requirement, the multi-agent target collaborative search task is completed.
The multi-agent target collaborative search method can be applied to the following fields:
1) The field of frequency-voltage cooperative control of multi-microgrid electricity-gas systems. Different case models are established as simulation environments for the multi-microgrid system, and, owing to the strong interaction between the microgrid and the natural gas network, the multi-microgrid system can be coordinated based on the "centralized training and decentralized execution" cooperative adjustment idea of the MADDPG algorithm. The MADDPG controller can greatly suppress the frequency deviation caused by wind power and load disturbances and the air pressure fluctuation caused by load fluctuations in the natural gas pipe network. In addition, the MADDPG controller can well coordinate the overall stability among the sub-microgrids of the multiple microgrids.
2) The field of automatic driving of automobiles. For a lane-following system in an autonomous vehicle, the driven vehicle is treated as the agent, other vehicles or the road direction in the lane are treated as obstacles, and the vehicle's destination is treated as the search target; the strong nonlinear fitting capability and generalization of the deep deterministic policy gradient algorithm are then used to determine a reward function for the lane-tracking condition. Performing target collaborative search with the multi-agent target collaborative search method can then achieve a good car-following effect under various road conditions.
Example two
As shown in fig. 5, in order to implement the technical solution in the first embodiment, the present embodiment provides a multi-agent target collaborative search system, including:
a simulation environment construction module 101, configured to construct a search simulation environment; a plurality of agents, a plurality of obstacles and a plurality of search targets are randomly arranged in the search simulation environment; each agent comprises an Actor network and a Critic network; the Actor network is used for selecting an action to be executed by the agent, and the Critic network is used for evaluating the expected income of the agent.
A detection information obtaining module 201, configured to obtain, for any agent in the search simulation environment, environment detection information that is perceived by the agent; the environment detection information comprises current state data of all agents within the detection range of the agent.
A policy action determining module 301, configured to set a deterministic policy action according to the environment detection information, the Actor network, and the Critic network, based on an expected revenue gradient.
A probe information updating module 401, configured to obtain updated environment probe information after the agent executes the deterministic policy action; the updated environment detection information comprises action emotion change data, external environment reward data and updated state data.
An experience quadruplet construction module 501, configured to determine intrinsic reward data based on the action emotion change data and the extrinsic context reward data; the intrinsic reward data and the extrinsic environment reward data constitute search overall reward data; the search global reward data, the current state data, the updated state data, and the deterministic policy action form an empirical quadruple; and the experience quadruplets corresponding to the plurality of agents form an experience pool.
The multi-agent searching module 601 is configured to have each agent randomly extract an experience quadruple from the experience pool and train the Actor network and the Critic network by using the extracted experience quadruple, so as to obtain the optimal policy action of each agent; each agent executes the corresponding optimal policy action to realize the collaborative target search.
Compared with the prior art, the invention also has the following advantages:
The basic idea of applying the deep deterministic policy gradient algorithm in a multi-agent environment is centralized training with decentralized execution: the Critic network of each agent collects the state and action information of all agents during training, while at execution time the Actor network of each agent makes decisions only according to local information (namely the agent's own actions and states).
Because intrinsic emotions have the function of driving behavioral adaptation, both positive and negative emotions produce behavioral motivation, and changes in emotion cause changes in learning motivation. Therefore, intrinsic emotional motivation is introduced so that the current emotion and an intrinsic reward can be generated according to environmental stimuli and the cognitive state; the agent's intrinsic reward and the environment reward are jointly used as the overall reward of the agent's search process, which effectively solves the sparse-reward problem and allows collisions to be avoided to the maximum extent.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.

Claims (9)

1. A multi-agent target collaborative search method is characterized by comprising the following steps:
constructing a search simulation environment; a plurality of agents, a plurality of obstacles and a plurality of search targets are randomly arranged in the search simulation environment; each agent comprises an Actor network and a Critic network; the Actor network is used for selecting an action to be executed by the agent, and the Critic network is used for evaluating the expected income of the agent;
aiming at any agent in the search simulation environment, acquiring environment detection information sensed by the agent; the environment detection information comprises current state data of all agents in the detection range of the agents;
setting a deterministic policy action according to the environment detection information, the Actor network and the Critic network, based on an expected profit gradient;
obtaining updated environment detection information after the agent executes the deterministic policy action; the updated environment detection information comprises action emotion change data, external environment reward data and updated state data;
determining intrinsic reward data based on the action emotion change data and the extrinsic context reward data; the intrinsic reward data and the extrinsic context reward data constitute search global reward data; the search ensemble reward data, the current state data, the updated state data, and the deterministic policy action constitute an empirical quadruplet; the experience quadruplets corresponding to the multiple agents form an experience pool;
randomly extracting an experience quadruple from the experience pool by each intelligent agent, and training the Actor network and the Critic network by using the extracted experience quadruple to obtain the optimal strategy action of each intelligent agent; each agent executes a corresponding optimal strategy action to realize the target collaborative search.
2. The multi-agent target collaborative search method according to claim 1, wherein the action emotion change data is calculated from: the weight vectors θ_i, η_i and λ_i of the internal changes of the first emotion, the second emotion and the third emotion in the i-th agent, respectively; the numbers of times the i-th agent reaches the first, second and third emotional states within one time step, respectively; and the preset intrinsic emotion rewards, wherein r_t^h denotes the preset intrinsic emotion reward corresponding to the first emotion, r_t^a denotes the preset intrinsic emotion reward corresponding to the second emotion, and r_t^f denotes the preset intrinsic emotion reward corresponding to the third emotion.
3. The multi-agent target collaborative search method according to claim 1, wherein determining intrinsic reward data based on the action emotion change data and the external environment reward data specifically comprises:
calculating an emotional steady-state value from the action emotion change data of the agent and the initial emotion value preset in the agent before time t, wherein H_t denotes the emotional steady-state value inside the agent at time t;
calculating an emotion function value, wherein E denotes the emotion function value and H_{t-1} denotes the emotional steady-state value inside the agent at time t-1;
calculating an emotion coefficient, wherein C denotes the emotion coefficient, k denotes a constant coefficient and e denotes the constant e; and
calculating the intrinsic reward data, wherein T denotes the maximum time step, r_t^i denotes the intrinsic reward data of the i-th agent and r_t^e denotes the external environment reward data.
4. The multi-agent target collaborative search method according to claim 1, wherein the setting of deterministic policy actions based on expected profit gradients according to the environment probe information, the Actor network and the Critic network specifically includes:
randomly selecting an action to be executed based on the environment detection information and the Actor network;
calculating an expected profit value by adopting an expected profit gradient formula based on the action to be executed and the Critic network;
judging whether the expected profit value meets a preset optimal condition or not;
if the expected profit value meets a preset optimal condition, marking the action to be executed as a deterministic strategy action;
and if the expected profit value does not meet preset optimal conditions, adjusting network parameters of the Actor network according to the expected profit value, and then returning to the step of randomly selecting actions to be executed based on the environment detection information.
5. The multi-agent target collaborative search method according to claim 4, wherein the expected profit gradient formula is specifically:

∇_{θ_i} J(μ_i) = E_{o,a~M} [ ∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i(o, a_1, ..., a_n) ]

wherein ∇_{θ_i} J(μ_i) represents the expected profit gradient of the i-th agent, θ_i represents the action strategy parameter, and E represents a mathematical expectation; M represents a predetermined memory module; ∇_{a_i} Q_i(o, a_1, ..., a_n) represents the action gradient term and ∇_{θ_i} μ_i(o_i) represents the action strategy gradient term; Q_i(o, a_1, ..., a_n) represents the centralized evaluation value output by the Critic network of the i-th agent, and (a_1, ..., a_n) represents all actions selectable by the i-th agent; o = (o_1, o_2, ..., o_n), where o_n represents the current state data of the n-th agent; a_i represents the action to be executed that is output when o_i is input to the Actor network of the agent; n represents the number of all actions selectable by the i-th agent; μ_i represents the deterministic strategy of the i-th agent, i = 1, 2, ..., N, and N represents the total number of agents.
6. The multi-agent target collaborative search method according to claim 1, wherein, in the process of training the Actor network and the Critic network by using the extracted experience quadruples, the update formula of the Critic network is as follows:

L(θ_i) = E_{o,a,R,o'~M} [ ( R_i + γ Q'_i(o', a_1', ..., a_n') - Q_i(o, a_1, ..., a_n) )^2 ]

wherein L(θ_i) represents the expected profit value output by the Critic network, and E represents a mathematical expectation; M represents a predetermined memory module, θ_i represents the action strategy parameter, R represents the search global reward data, R_i represents the search overall reward data of the i-th agent, and Q'_i(o', a_1', ..., a_n') represents the centralized evaluation value output by the Critic network of the i-th agent when the input taken from the experience quadruple is (o', a_1', ..., a_n'); o' represents the updated state data in the experience quadruple, o represents the current state data, a_n' represents the action to be executed that is output when o' is input to the Actor network of the agent, n represents the number of all agents within the detection range of the agent, μ' represents the deterministic strategy updated through the Actor network, i = 1, 2, ..., N, and N represents the total number of agents.
7. The multi-agent target collaborative search method according to claim 1, further comprising:
in the process of each agent executing the corresponding optimal policy action, calculating, for any agent, the distance between the agent and the other agents, the distance between the agent and any obstacle, and the distance between the agent and the corresponding search target;
evaluating the distance between the agent and the other agents based on a first preset index to obtain a first evaluation result;
evaluating the distance between the agent and any obstacle based on a second preset index to obtain a second evaluation result;
evaluating the distance between the agent and the corresponding search target based on a third preset index to obtain a third evaluation result;
if the first evaluation result, the second evaluation result and the third evaluation result all indicate that the corresponding preset indexes are reached, generating a search completion signal;
and if any one of the first evaluation result, the second evaluation result and the third evaluation result indicates that the corresponding preset index is not reached, returning to the step of obtaining the updated environment detection information after the agent executes the deterministic policy action.
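A minimal sketch of this completion check, assuming the three preset indexes are simple distance thresholds; the threshold values and function names are placeholders, since the claim does not recite concrete criteria.

```python
import numpy as np

def search_complete(pos_i, other_agents, obstacles, target,
                    min_agent_gap=0.5, min_obstacle_gap=0.3, capture_radius=0.2):
    d_agents = [np.linalg.norm(pos_i - p) for p in other_agents]
    d_obstacles = [np.linalg.norm(pos_i - p) for p in obstacles]
    d_target = np.linalg.norm(pos_i - target)

    ok_agents = all(d > min_agent_gap for d in d_agents)           # first evaluation result
    ok_obstacles = all(d > min_obstacle_gap for d in d_obstacles)  # second evaluation result
    ok_target = d_target < capture_radius                          # third evaluation result

    # Only when all three results reach their preset indexes is the completion signal generated;
    # otherwise the caller returns to collecting updated environment detection information.
    return ok_agents and ok_obstacles and ok_target

print(search_complete(np.array([0.0, 0.0]),
                      other_agents=[np.array([1.0, 1.0])],
                      obstacles=[np.array([2.0, 0.0])],
                      target=np.array([0.1, 0.05])))
```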
8. The multi-agent target collaborative search method according to claim 1, wherein the process of determining the external environment reward data specifically includes:
if the updated environment detection information includes the search target corresponding to the agent, assigning a first preset value to the external environment reward data;
if the updated environment detection information includes any obstacle and the distance between the obstacle and the agent is smaller than a first preset distance value, assigning a second preset value to the external environment reward data;
if the updated environment detection information includes another agent and the distance between the other agent and the agent is smaller than a second preset distance value, assigning the external environment reward data according to a preset collision penalty formula;
and if the updated state data of the agent in the updated environment detection information indicates that the agent has moved, assigning a third preset value to the external environment reward data.
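A minimal sketch of this external-environment reward assignment; the claim does not recite the preset values, the preset distances or the collision penalty formula, so every number below and the reciprocal-distance penalty are illustrative placeholders.

```python
def external_reward(found_target, d_obstacle, d_other_agent, moved,
                    d1=0.3, d2=0.5,
                    target_bonus=10.0, obstacle_penalty=-5.0, step_penalty=-0.1):
    r = 0.0
    if found_target:                                  # search target detected -> first preset value
        r += target_bonus
    if d_obstacle is not None and d_obstacle < d1:    # too close to an obstacle -> second preset value
        r += obstacle_penalty
    if d_other_agent is not None and d_other_agent < d2:
        r += -1.0 / (d_other_agent + 1e-6)            # placeholder collision-penalty formula
    if moved:                                         # agent moved -> third preset value (small step cost)
        r += step_penalty
    return r

print(external_reward(found_target=False, d_obstacle=0.2, d_other_agent=0.8, moved=True))
```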
9. A multi-agent target collaborative search system, wherein the multi-agent target collaborative search system comprises:
the simulation environment construction module is used for constructing a search simulation environment, wherein a plurality of agents, a plurality of obstacles and a plurality of search targets are randomly arranged in the search simulation environment; each agent comprises an Actor network and a Critic network; the Actor network is used for selecting the action to be executed of the agent, and the Critic network is used for evaluating the expected return of the agent;
the detection information acquisition module is used for acquiring, for any agent in the search simulation environment, the environment detection information sensed by the agent, wherein the environment detection information comprises the current state data of all agents within the detection range of the agent;
the policy action determination module is used for setting a deterministic policy action according to the environment detection information, the Actor network and the Critic network based on an expected return gradient;
the detection information updating module is used for obtaining updated environment detection information after the agent executes the deterministic policy action, wherein the updated environment detection information comprises action emotion change data, external environment reward data and updated state data;
the experience quadruple construction module is used for determining intrinsic reward data based on the action emotion change data and the external environment reward data, wherein the intrinsic reward data and the external environment reward data constitute the search overall reward data; the search overall reward data, the current state data, the updated state data and the deterministic policy action form an experience quadruple; the experience quadruples corresponding to the plurality of agents form an experience pool;
and the multi-agent search module is used for enabling each agent to randomly extract an experience quadruple from the experience pool and to train the Actor network and the Critic network by using the extracted experience quadruple to obtain the optimal policy action of each agent; each agent executes the corresponding optimal policy action to realize the target collaborative search.
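To show how these modules hand data to one another, the sketch below combines a placeholder intrinsic-reward mapping with the external reward into the search overall reward, stores the resulting experience quadruple in a shared pool, and samples it for training; the mapping from action emotion change data to intrinsic reward is an assumption for illustration, not the patent's formula.

```python
import random
from collections import deque

experience_pool = deque(maxlen=100_000)              # shared across all agents

def intrinsic_reward(emotion_change):
    # placeholder mapping from action emotion change data to intrinsic reward data
    return max(emotion_change, 0.0)

def store_transition(state, action, emotion_change, external_r, next_state):
    overall_r = intrinsic_reward(emotion_change) + external_r       # search overall reward data
    experience_pool.append((state, overall_r, next_state, action))  # experience quadruple

def sample_for_training(batch_size=32):
    # each agent randomly extracts quadruples to train its Actor and Critic networks
    return random.sample(list(experience_pool), min(batch_size, len(experience_pool)))

store_transition(state=[0.0, 0.0], action=[0.1, -0.2],
                 emotion_change=0.4, external_r=-0.1, next_state=[0.1, -0.2])
print(sample_for_training(1))
```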
CN202310005618.6A 2023-01-04 2023-01-04 Multi-agent target collaborative search method and system Pending CN115952736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310005618.6A CN115952736A (en) 2023-01-04 2023-01-04 Multi-agent target collaborative search method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310005618.6A CN115952736A (en) 2023-01-04 2023-01-04 Multi-agent target collaborative search method and system

Publications (1)

Publication Number Publication Date
CN115952736A true CN115952736A (en) 2023-04-11

Family

ID=87290453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310005618.6A Pending CN115952736A (en) 2023-01-04 2023-01-04 Multi-agent target collaborative search method and system

Country Status (1)

Country Link
CN (1) CN115952736A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116943226A (en) * 2023-09-20 2023-10-27 小舟科技有限公司 Game difficulty adjusting method, system, equipment and medium based on emotion recognition
CN116943226B (en) * 2023-09-20 2024-01-05 小舟科技有限公司 Game difficulty adjusting method, system, equipment and medium based on emotion recognition
CN117477607A (en) * 2023-12-28 2024-01-30 国网江西综合能源服务有限公司 Three-phase imbalance treatment method and system for power distribution network with intelligent soft switch
CN117477607B (en) * 2023-12-28 2024-04-12 国网江西综合能源服务有限公司 Three-phase imbalance treatment method and system for power distribution network with intelligent soft switch
CN117806573A (en) * 2024-03-01 2024-04-02 山东云海国创云计算装备产业创新中心有限公司 Solid state disk searching method, device, equipment and medium
CN117806573B (en) * 2024-03-01 2024-05-24 山东云海国创云计算装备产业创新中心有限公司 Solid state disk searching method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
Wang et al. Capturing car-following behaviors by deep learning
CN115952736A (en) Multi-agent target collaborative search method and system
CN110745136A (en) Driving self-adaptive control method
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
Zhang et al. A systematic solution of human driving behavior modeling and simulation for automated vehicle studies
CN112631134A (en) Intelligent trolley obstacle avoidance method based on fuzzy neural network
Krishnakumar Intelligent systems for aerospace engineering: An overview
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
Maimaiti Modelling and analysis of innovative path of English teaching mode under the background of big data
Li et al. Simulation of vehicle interaction behavior in merging scenarios: A deep maximum entropy-inverse reinforcement learning method combined with game theory
Hu et al. Crowd-comfort robot navigation among dynamic environment based on social-stressed deep reinforcement learning
Zhang et al. Direction-decision learning based pedestrian flow behavior investigation
Yan et al. Path Planning for Mobile Robot's Continuous Action Space Based on Deep Reinforcement Learning
CN113200086A (en) Intelligent vehicle steering control system and control method thereof
Lodh et al. Autonomous vehicular overtaking maneuver: A survey and taxonomy
CN113743603A (en) Control method, control device, storage medium and electronic equipment
Montana et al. Towards a unified framework for learning from observation
CN116720571A (en) N-version programming-based deep reinforcement learning software fault tolerance method
Papathanasopoulou et al. Flexible car-following models incorporating information from adjacent lanes
Zhao et al. Consciousness neural network for path tracking control of floating objects at sea
Saxena et al. Advancement of industrial automation in integration with robotics
Wang et al. Intelligence assessment of automated driving systems based on driving intelligence quotient
Wang et al. A review of deep reinforcement learning methods and military application research
CN114701517A (en) Multi-target complex traffic scene automatic driving solution based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination