CN114792133A - Deep reinforcement learning method and device based on multi-agent cooperation system - Google Patents


Info

Publication number
CN114792133A
Authority
CN
China
Prior art keywords
current
reinforcement learning
agent
deep reinforcement
target
Prior art date
Legal status
Granted
Application number
CN202210715660.2A
Other languages
Chinese (zh)
Other versions
CN114792133B (en)
Inventor
丘腾海
付清旭
蒲志强
刘振
易建强
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210715660.2A
Publication of CN114792133A
Application granted
Publication of CN114792133B
Legal status: Active

Classifications

    • G06F18/24 — Classification techniques (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing)
    • G06N3/045 — Combinations of networks (Computing arrangements based on biological models; Neural networks; Architecture)
    • G06N3/08 — Learning methods (Computing arrangements based on biological models; Neural networks)

Abstract

The invention provides a deep reinforcement learning method and device based on a multi-agent cooperation system, relating to the technical field of artificial intelligence. The method comprises the following steps: in one round of deep reinforcement learning, acquiring a current allocation adjustment action for the collaborative graph corresponding to the multi-agent cooperative system based on a pre-constructed deep reinforcement learning network and current observation data, acquiring current reward and punishment data based on the current allocation adjustment action, and optimizing the deep reinforcement learning network based on the current reward and punishment data; and repeatedly executing these steps until a preset convergence condition or a preset learning frequency is reached. Multiple rounds of deep reinforcement learning update only the current allocation adjustment action of the collaborative graph rather than the actual actions of the agents, which simplifies the steps of deep reinforcement learning and allows many rounds to be performed rapidly even when rewards are sparse, accumulating more rewards. The training efficiency of the deep reinforcement learning network is thereby improved and its convergence is accelerated.

Description

Deep reinforcement learning method and device based on multi-agent cooperation system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a deep reinforcement learning method and device based on a multi-agent cooperation system.
Background
With the rapid development of artificial intelligence technology, deep reinforcement learning is regarded as a path toward general artificial intelligence, and both the attention it receives and its range of applications keep growing. As deep reinforcement learning is further applied and popularized, research on multi-agent cooperation is gradually increasing.
In the prior art, deep reinforcement learning generally suffers from sparse rewards. When it is applied to a multi-agent cooperative system, the sparse rewards may introduce bias, so that the agents cannot learn the expected cooperative behavior of the system, resulting in low efficiency and a slow convergence rate of multi-agent cooperative training.
Therefore, the low efficiency and slow convergence of multi-agent cooperative training caused by the reward sparsity of deep reinforcement learning networks remain technical problems to be urgently solved by those skilled in the relevant field.
Disclosure of Invention
The invention provides a deep reinforcement learning method and device based on a multi-agent cooperative system, which are used for overcoming the defects of low multi-agent cooperative training efficiency and slow convergence caused by the sparse rewards of a deep reinforcement learning network in the prior art, thereby improving multi-agent cooperative training efficiency and accelerating convergence.
The invention provides a deep reinforcement learning method based on a multi-agent cooperation system, which comprises the following steps: acquiring current observation data of each agent in the multi-agent cooperation system, wherein the current observation data comprises state data of other agents and of a target to be processed within the agent's observation range; in one round of deep reinforcement learning, acquiring a current allocation adjustment action for the collaborative graph corresponding to the multi-agent cooperative system based on a pre-constructed deep reinforcement learning network and the current observation data, acquiring current reward and punishment data based on the current allocation adjustment action, and optimizing the deep reinforcement learning network based on the current reward and punishment data, the current allocation adjustment action being used for adjusting the allocation scheme of agents and clusters in the collaborative graph; repeatedly executing the above steps until a preset convergence condition or a preset learning frequency is reached; and taking the current deep reinforcement learning network as the target deep reinforcement learning network.
According to the deep reinforcement learning method based on the multi-agent cooperation system, the current distribution adjusting action of the cooperative graph corresponding to the multi-agent cooperation system is obtained based on the pre-constructed deep reinforcement learning network and the current observation data, and the method comprises the following steps: inputting the current observation data into a multilayer perceptron module in the deep reinforcement learning network to obtain first state features of other agents and second state features of the target to be processed; inputting the first state feature and the second state feature to a first attention mechanism module in the deep reinforcement learning network to obtain a third state feature; inputting the first state feature, the second state feature and the third state feature to a second attention mechanism module in the deep reinforcement learning network to obtain a target state feature; and inputting the target state characteristics to an action coding module in the deep reinforcement learning network to obtain the current distribution adjustment action.
According to the deep reinforcement learning method based on the multi-agent cooperation system, repeatedly executing the steps until a preset convergence condition or a preset learning frequency is reached comprises: acquiring the current learning count and judging whether it reaches the preset learning frequency, and stopping execution of the steps when it does; when the current learning count has not reached the preset learning frequency, acquiring the reward and punishment data set of the multiple rounds of deep reinforcement learning, and generating a deep reinforcement learning reward and punishment curve based on the reward and punishment data set; judging whether the deep reinforcement learning reward and punishment curve has converged, and stopping execution of the steps when it has; and when the deep reinforcement learning reward and punishment curve has not converged, repeatedly executing the steps until the current learning count reaches the preset learning frequency or the curve converges.
According to the deep reinforcement learning method based on the multi-agent cooperation system, the method further comprises the following steps: acquiring current observation data of each agent in the multi-agent cooperation system and a cooperation diagram corresponding to the multi-agent cooperation system; the collaboration graph comprises a first allocation relationship of an agent to a cluster and a second allocation relationship of the cluster to a target to be processed; inputting the observation data into the target deep reinforcement learning network to obtain a target distribution adjusting action, and adjusting a first distribution relation and a second distribution relation in the collaborative map based on the target distribution adjusting action; and acquiring the current action of each agent based on the adjusted collaboration diagram so that each agent executes a preset collaboration task aiming at the target to be processed based on the current action.
According to the deep reinforcement learning method based on the multi-agent cooperation system, the target distribution adjusting action comprises an agent distribution adjusting action and a cluster distribution adjusting action, wherein: the intelligent agent distribution adjustment action comprises a last cluster number corresponding to each intelligent agent at the last moment and a current cluster number corresponding to each intelligent agent at the current moment; the cluster allocation adjustment action comprises a last target number to be processed corresponding to each cluster at the last moment and a current target number to be processed corresponding to each cluster at the current moment.
According to the deep reinforcement learning method based on the multi-agent cooperation system, the adjusting of the first distribution relation and the second distribution relation in the cooperation diagram based on the target distribution adjusting action comprises the following steps: judging whether the last cluster number of the agent is consistent with the current cluster number or not for each agent; under the condition that the last cluster number is inconsistent with the current cluster number, acquiring the number of agents in the cluster corresponding to the last cluster number; under the condition that the number of the agents is larger than a preset first number threshold, distributing the agents to the clusters corresponding to the current cluster numbers; judging whether the last target number to be processed corresponding to each cluster is consistent with the current target number to be processed or not for each cluster; under the condition that the last target number to be processed is inconsistent with the current target number to be processed, acquiring the number of clusters in the target to be processed corresponding to the last target number to be processed; and under the condition that the number of the clusters is greater than a preset second number threshold, distributing the clusters to the to-be-processed targets corresponding to the current to-be-processed target numbers.
According to the deep reinforcement learning method based on the multi-agent cooperation system, the obtaining of the current action of each agent based on the adjusted cooperation map comprises the following steps: acquiring a preset mapping function, and acquiring a cluster corresponding to each intelligent agent based on the adjusted collaborative map, wherein the preset mapping function represents a mapping function between the cluster and the action; and generating a current action of each agent based on a preset mapping function, the current observation data of each agent and the corresponding cluster, wherein the current action comprises the advancing direction of the agent.
The invention also provides a deep reinforcement learning device based on the multi-agent cooperation system, which comprises: a data acquisition module, used for acquiring current observation data of each agent in the multi-agent cooperation system, wherein the current observation data comprises state data of other agents and of a target to be processed within the agent's observation range; a first training module, used for, in one round of deep reinforcement learning, acquiring a current allocation adjustment action for the collaborative graph corresponding to the multi-agent cooperative system based on a pre-constructed deep reinforcement learning network and the current observation data, acquiring current reward and punishment data based on the current allocation adjustment action, and optimizing the deep reinforcement learning network based on the current reward and punishment data, the current allocation adjustment action being used for adjusting the allocation scheme of agents and clusters in the collaborative graph; and a second training module, used for repeatedly executing the above steps until a preset convergence condition or a preset learning frequency is reached, and taking the current deep reinforcement learning network as the target deep reinforcement learning network.
The present invention also provides an electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method for deep reinforcement learning based on multi-agent cooperative system as described in any one of the above.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a multi-agent collaboration system based deep reinforcement learning method as described in any of the above.
According to the deep reinforcement learning method and device based on the multi-agent cooperative system, the current observation data is input into the deep reinforcement learning network, and multiple rounds of deep reinforcement learning update only the current allocation adjustment action of the collaborative graph, not the actual actions of the agents. This simplifies the steps of deep reinforcement learning and saves training time, so that even under sparse rewards many rounds of deep reinforcement learning can be performed rapidly to accumulate more rewards. The convergence speed of the deep reinforcement learning network and the training efficiency of deep reinforcement learning are thereby improved, solving the prior-art technical problems of low efficiency and slow convergence of multi-agent cooperative training caused by the sparse rewards of the deep reinforcement learning network.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a deep reinforcement learning method based on a multi-agent cooperation system according to the present invention;
FIG. 2 is a second flowchart of the deep reinforcement learning method based on multi-agent cooperative system according to the present invention;
FIG. 3 is a third flowchart of the deep reinforcement learning method based on multi-agent cooperative system according to the present invention;
FIG. 4 is a fourth flowchart of the deep reinforcement learning method based on multi-agent cooperative system provided by the present invention;
FIG. 5 is a fifth flowchart of the deep reinforcement learning method based on multi-agent cooperative system according to the present invention;
FIG. 6 is a sixth schematic flowchart of a deep reinforcement learning method based on a multi-agent cooperation system according to the present invention;
FIG. 7a is a diagram illustrating an application scenario in accordance with a second embodiment of the present invention;
FIG. 7b is a schematic structural diagram of a deep reinforcement learning network according to a second embodiment of the present invention;
FIG. 7c is a schematic diagram of a collaboration diagram in accordance with a second embodiment of the invention;
FIG. 8 is a schematic structural diagram of a deep reinforcement learning device based on a multi-agent cooperation system provided by the invention;
fig. 9 is a schematic structural diagram of an electronic device provided by the present invention.
Reference numerals:
100: a deep reinforcement learning device of the multi-agent cooperation system; 10: a data acquisition module; 20: a first training module; 30: a second training module; 910: a processor; 920: a communication interface; 930: a memory; 940: a communication bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The deep reinforcement learning method of the multi-agent cooperative system of the present invention is described below with reference to fig. 1 to 6. As shown in fig. 1, the present invention provides a deep reinforcement learning method based on a multi-agent cooperation system, which includes:
step S1: the method comprises the steps of obtaining current observation data of each agent in the multi-agent cooperation system, wherein the current observation data comprise state data of other agents and objects to be processed in the observation range of the agents.
The state data comprises first state data corresponding to the other agents and second state data corresponding to the target to be processed. The first state data includes the positions, velocities, and health values of the other agents. The second state data includes the position, velocity, and health value of the target to be processed. The target to be processed represents a target that the agents are required to handle cooperatively through a preset collaborative task, such as a target intruding into a system, e.g., a computer virus.
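For concreteness, the observation layout described above can be sketched as follows; this is a minimal illustration in Python, and everything beyond the position, velocity, and health fields named in the text is an assumption.

```python
# Minimal sketch of the observation layout described above. Only position,
# velocity, and health come from the text; the container names are assumed.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EntityState:
    position: Tuple[float, float]   # planar coordinates (assumed 2D scenario)
    velocity: Tuple[float, float]
    health: float

@dataclass
class Observation:
    other_agents: List[EntityState]  # first state data: agents in observation range
    targets: List[EntityState]       # second state data: targets to be processed
```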
Step S2: in one round of deep reinforcement learning, acquiring a current allocation adjustment action for the collaborative graph corresponding to the multi-agent cooperative system based on a pre-constructed deep reinforcement learning network and the current observation data, acquiring current reward and punishment data based on the current allocation adjustment action, and optimizing the deep reinforcement learning network based on the current reward and punishment data; the current allocation adjustment action is used to adjust the allocation scheme of agents and clusters in the collaborative graph.
In one embodiment, current reward and punishment data are obtained based on a preset reward and punishment function and a current distribution adjustment action, and a network weight coefficient of the deep reinforcement learning network is updated based on the current reward and punishment data, so that the deep reinforcement learning network is optimized.
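As a rough illustration of this optimization step (the patent does not fix a particular reinforcement learning algorithm, so the REINFORCE-style policy-gradient loss below is an assumption):

```python
# Hedged sketch: one weight update from one reward/punishment value, assuming
# a REINFORCE-style policy gradient. log_prob is the log-probability of the
# sampled allocation adjustment action (a PyTorch scalar carrying gradients).
def optimize_step(optimizer, log_prob, reward):
    loss = -log_prob * reward   # larger reward reinforces the sampled action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```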
Step S3: repeatedly executing the steps until a preset convergence condition is reached or a preset learning frequency is reached; and taking the current deep reinforcement learning network as a target deep reinforcement learning network.
The preset convergence condition may be that the deep reinforcement learning is stopped when the current network convergence accuracy of the deep reinforcement learning network reaches the preset network convergence accuracy, or the deep reinforcement learning is stopped when the current reward and punishment data tends to a stable value, or other convergence conditions, which is not limited in the present invention.
In steps S1 to S3, the current observation data is input into the deep reinforcement learning network, and multiple rounds of deep reinforcement learning update only the current allocation adjustment action of the collaborative graph, not the actual actions of the agents. This simplifies the steps of deep reinforcement learning and saves training time; even when rewards are sparse, many rounds of deep reinforcement learning can be performed rapidly to accumulate more rewards. This improves the convergence speed of the deep reinforcement learning network and the training efficiency of deep reinforcement learning, solving the prior-art technical problems of low efficiency and slow convergence of multi-agent cooperative training caused by the sparse rewards of the deep reinforcement learning network.
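The outer loop of steps S1 to S3 could look like the following sketch; `env` and the network's `sample_allocation` method are hypothetical stand-ins for the modules in the text, and `optimize_step` is the illustrative update above.

```python
# Illustrative outer training loop for steps S1-S3 (all names hypothetical).
def train(env, network, optimizer, max_episodes=10_000):
    for episode in range(max_episodes):
        obs = env.get_current_observations()                # step S1
        action, log_prob = network.sample_allocation(obs)   # step S2: graph adjustment
        reward = env.apply_allocation(action)               # reward/punishment data
        optimize_step(optimizer, log_prob, reward)
        if env.reward_curve_converged():                    # step S3: stop condition
            break
    return network   # target deep reinforcement learning network
```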
Preferably, the deep reinforcement learning network is constructed based on an attention mechanism. An attention mechanism is a special structure embedded in a machine learning model that automatically learns and computes the contribution of input data to output data, so that signals are processed selectively. It helps the model select effective features of appropriate scale, allowing the model to complete its task effectively and efficiently.
Preferably, the deep reinforcement learning network is optimized time step by time step. Specifically, before the current round of deep reinforcement learning, the current observation data for the current time interval is first acquired and input into the deep reinforcement learning network optimized by the previous round, so as to obtain the current allocation adjustment action for the current time interval, which replaces the allocation adjustment action obtained in the previous round. That is, each round of deep reinforcement learning completes both the update of the current allocation adjustment action and the optimization of the deep reinforcement learning network.
In one embodiment, the collaboration graph comprises a first sub-collaboration graph and a second sub-collaboration graph. The first sub-collaboration graph is made up of agents and clusters: each agent corresponds to one cluster, and each cluster contains one or more agents. The second sub-collaboration graph is made up of clusters and targets to be processed: each cluster corresponds to one target to be processed, and each target to be processed corresponds to one or more clusters.
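One plausible in-memory representation of the two sub-collaboration graphs (the patent does not prescribe a data structure, so the plain dictionaries below are an assumption):

```python
# Sketch of the collaboration graph: agents -> clusters (first sub-graph) and
# clusters -> targets to be processed (second sub-graph).
from typing import Dict, List

class CollaborationGraph:
    def __init__(self):
        self.agent_to_cluster: Dict[int, str] = {}   # first allocation relationship
        self.cluster_to_target: Dict[str, str] = {}  # second allocation relationship

    def agents_in_cluster(self, cluster: str) -> List[int]:
        return [a for a, c in self.agent_to_cluster.items() if c == cluster]

    def clusters_on_target(self, target: str) -> List[str]:
        return [c for c, t in self.cluster_to_target.items() if t == target]
```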
In one embodiment, as shown in fig. 2, the above step S2 includes steps S21 to S24, wherein:
step S21: and inputting the current observation data into a multi-layer perceptron module in the deep reinforcement learning network to obtain first state characteristics of other intelligent agents and second state characteristics of the target to be processed.
A Multilayer Perceptron (MLP) is a form of artificial neural network, used here to classify the input data and extract features.
Step S22: inputting the first state feature and the second state feature into the first attention mechanism module in the deep reinforcement learning network to obtain a third state feature.
The third state feature may also be referred to as a mixed state feature. By combining the first state features of the other agents and the second state features of the target to be processed, the state features of the agents and targets that exert a larger influence in the surrounding environment can be further extracted, forming a third state feature that deserves focused attention.
Preferably, there are multiple first attention mechanism modules connected in cascade: the third state feature output by one layer's first attention mechanism module is input into the next layer's module for another round of feature extraction, yielding a new third state feature, and so on. In this way, third state features with richer feature information can be extracted, which improves the learning effect of a single round of deep reinforcement learning, reduces the number of rounds required, and further improves the training efficiency of deep reinforcement learning.
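A minimal sketch of such a cascade, assuming standard multi-head self-attention for the first attention mechanism module (the layer sizes are arbitrary choices, not taken from the patent):

```python
# Cascade of first attention mechanism modules: each layer re-extracts the
# third (mixed) state feature produced by the previous layer.
import torch.nn as nn

class CascadedAttention(nn.Module):
    def __init__(self, dim=64, heads=4, depth=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(depth)]
        )

    def forward(self, features):          # features: (batch, entities, dim)
        for attn in self.layers:
            features, _ = attn(features, features, features)
        return features                   # refined third state feature
```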
Step S23: inputting the first state feature, the second state feature, and the third state feature into the second attention mechanism module in the deep reinforcement learning network to obtain the target state feature.
Step S24: inputting the target state feature into the action coding module in the deep reinforcement learning network to obtain the current allocation adjustment action. The action coding module is used to encode the target state feature into the current allocation adjustment action.
In steps S21 to S24, target state features with richer feature information can be extracted from the current observation data by the multiple attention mechanism modules, so that a more accurate current allocation adjustment action can be generated by the action coding module. This improves the learning effect of a single round of deep reinforcement learning, reduces the number of rounds required, and further improves the training efficiency of deep reinforcement learning.
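Putting steps S21 to S24 together, the network could be sketched as below, reusing the CascadedAttention sketch above; the dimensions and the discrete action head are assumptions, not taken from the patent.

```python
# End-to-end sketch of steps S21-S24: perceptron module -> first attention
# (cascaded) -> second attention -> action coding module.
import torch.nn as nn

class AllocationNetwork(nn.Module):
    def __init__(self, obs_dim=8, dim=64, n_actions=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(obs_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))                   # S21
        self.attn1 = CascadedAttention(dim)                             # S22
        self.attn2 = nn.MultiheadAttention(dim, 4, batch_first=True)   # S23
        self.action_head = nn.Linear(dim, n_actions)                    # S24

    def forward(self, obs):                  # obs: (batch, entities, obs_dim)
        state = self.mlp(obs)                # first/second state features
        mixed = self.attn1(state)            # third (mixed) state feature
        target, _ = self.attn2(mixed, state, state)     # target state feature
        return self.action_head(target.mean(dim=1))     # allocation-adjustment logits
```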
In one embodiment, as shown in fig. 3, the above step S3 includes steps S31 to S33, wherein:
step S31: and acquiring the current learning frequency, judging whether the current learning frequency reaches the preset learning frequency, and stopping executing the steps under the condition that the current learning frequency reaches the preset learning frequency.
Step S32: when the current learning count has not reached the preset learning frequency, acquiring the reward and punishment data set of the multiple rounds of deep reinforcement learning, and generating a deep reinforcement learning reward and punishment curve based on the reward and punishment data set; judging whether the deep reinforcement learning reward and punishment curve has converged, and stopping execution of the steps if it has.
It should be noted that under the condition that the reinforcement learning reward and punishment curve is in a convergence state, reward and punishment data of multiple times of deep reinforcement learning tend to a stable value, which indicates that the deep reinforcement learning network tends to a stable state, and the deep reinforcement learning can be ended.
Step S33: when the deep reinforcement learning reward and punishment curve has not converged, repeatedly executing the above steps until the current learning count reaches the preset learning frequency or the curve converges.
Through steps S31 to S33, when the reward and punishment curve is in a converged state, the reward and punishment data of the multiple rounds of deep reinforcement learning tend to a stable value, indicating that the deep reinforcement learning network has stabilized and now has good network performance, so deep reinforcement learning can be stopped. By plotting the reward and punishment curve, the current training progress of the deep reinforcement learning network can be observed very intuitively; once the curve has converged, the network can be confirmed to have excellent performance, deep reinforcement learning can be stopped, and a target deep reinforcement learning network with excellent performance is obtained.
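One simple way to implement the convergence test on the reward and punishment curve (the window size and tolerance below are assumed values, not specified by the patent):

```python
# The curve is treated as converged when two consecutive windows of episode
# rewards have nearly the same mean, i.e. the curve has flattened out.
def reward_curve_converged(rewards, window=100, tol=1e-3):
    if len(rewards) < 2 * window:
        return False
    recent = sum(rewards[-window:]) / window
    previous = sum(rewards[-2 * window:-window]) / window
    return abs(recent - previous) < tol
```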
In one embodiment, as shown in fig. 4, the multi-agent cooperation system-based deep reinforcement learning method provided by the present invention further includes steps S4 to S6, wherein:
step S4: acquiring current observation data of each agent in the multi-agent cooperation system and a cooperation map corresponding to the multi-agent cooperation system; the collaboration graph includes a first allocation of agents to clusters and a second allocation of clusters to targets to be processed.
Step S5: and inputting the observation data into a target deep reinforcement learning network to obtain a target distribution adjustment action, and adjusting the first distribution relation and the second distribution relation in the collaborative map based on the target distribution adjustment action.
Wherein the first allocation relationship represents an allocation relationship of the agent to the cluster, i.e., a currently allocated cluster of each agent is selected from a plurality of clusters. The second allocation relationship represents an allocation relationship between the clusters and the targets to be processed, that is, a currently allocated target to be processed of each cluster is selected from the plurality of targets to be processed.
Further, the allocation of agents and clusters follows a first-in-first-out principle: agents are stored in order, and when an agent needs to be moved, the agent stored first is moved first; likewise, clusters are stored in order, and when a cluster needs to be moved, the cluster stored first is moved first.
Step S6: acquiring the current action of each agent based on the adjusted collaborative graph, so that each agent executes the preset collaborative task for the target to be processed based on its current action.
Steps S4 to S6 place the laborious adjustment of the collaborative graph and the derivation of the agents' actions after reinforcement learning has completed, which saves training time. Based on the trained target deep reinforcement learning network, a more accurate target allocation adjustment action can be generated in a single pass; based on that action, the adjustment of the collaborative graph can be completed in a single pass; and based on the adjusted collaborative graph, more accurate current actions of the agents can be obtained in a single pass, so that the agents can complete the collaborative task based on those current actions. Multi-agent collaborative tasks are thus completed quickly, solving the prior-art technical problems that the sparse rewards of reinforcement learning make multi-agent cooperative training inefficient and slow to converge, and can even prevent the collaborative task from being completed.
In one embodiment, the target allocation adjustment action comprises an agent allocation adjustment action and a cluster allocation adjustment action, wherein: the intelligent agent distribution adjustment action comprises a last cluster number corresponding to each intelligent agent at the last moment and a current cluster number corresponding to each intelligent agent at the current moment; the cluster allocation adjustment action comprises a last target number to be processed corresponding to each cluster at the last moment and a current target number to be processed corresponding to each cluster at the current moment.
The cluster number and the target number may be numbers, letters, or various symbols, or a combination of numbers, letters, and various symbols, which is not limited in this embodiment.
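As a concrete, purely hypothetical example of this encoding (using letter-digit strings as numbers; these structures are reused in the adjustment sketch further below):

```python
# Assumed encoding of a target allocation adjustment action: each entry pairs
# the number at the previous moment with the number at the current moment.
agent_adjust = {           # agent id -> (last cluster number, current cluster number)
    0: ("c1", "c2"),
    1: ("c2", "c2"),       # unchanged: last and current cluster numbers coincide
}
cluster_adjust = {         # cluster number -> (last target number, current target number)
    "c1": ("t1", "t1"),
    "c2": ("t1", "t3"),
}
```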
In one embodiment, as shown in fig. 5, the above step S5 includes steps S51 to S52, wherein:
step S51: judging whether the last cluster number of the agent is consistent with the current cluster number or not for each agent; under the condition that the number of the previous cluster is inconsistent with the number of the current cluster, acquiring the number of agents in the cluster corresponding to the number of the previous cluster; and under the condition that the number of the agents is larger than a preset first number threshold value, distributing the agents to the cluster corresponding to the current cluster number.
Further, the preset first number threshold is set to 0 or 1. Taking a preset first number threshold of 0 as an example: when the number of agents remaining in the cluster corresponding to the agent's last cluster number is greater than 0, the agent is allocated to the cluster corresponding to its current cluster number. When that number equals 0, the agent's allocation is left unchanged, which prevents the cluster corresponding to the last cluster number from being left without any agent to respond to its target to be processed.
Step S52: judging, for each cluster, whether the cluster's last to-be-processed target number is consistent with its current to-be-processed target number; when the last to-be-processed target number is inconsistent with the current to-be-processed target number, acquiring the number of clusters assigned to the target to be processed corresponding to the last to-be-processed target number; and when that number of clusters is greater than a preset second number threshold, allocating the cluster to the target to be processed corresponding to the current to-be-processed target number.
Further, the preset second number threshold is set to 0 or 1. Taking a preset second number threshold of 0 as an example: when the number of clusters remaining on the target to be processed corresponding to a cluster's last to-be-processed target number is greater than 0, the cluster is allocated to the target to be processed corresponding to its current to-be-processed target number. When that number equals 0, the cluster's allocation is left unchanged, which prevents the target to be processed corresponding to the last to-be-processed target number from being left without any cluster to handle it.
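Steps S51 and S52 can be sketched as below, using the CollaborationGraph and the action encodings from the earlier sketches. Reading the text's intent as "never leave a cluster or target empty", the count is taken over the members that would remain after the move; thresholds of 0 follow the examples above. This interpretation is an assumption.

```python
# Sketch of steps S51-S52: move an agent (or cluster) only if its previous
# cluster (or target) would still keep more members than the threshold.
def adjust_graph(graph, agent_adjust, cluster_adjust,
                 first_threshold=0, second_threshold=0):
    for agent, (last_c, cur_c) in agent_adjust.items():
        if last_c != cur_c:
            remaining = [a for a in graph.agents_in_cluster(last_c) if a != agent]
            if len(remaining) > first_threshold:
                graph.agent_to_cluster[agent] = cur_c     # reallocate the agent
    for cluster, (last_t, cur_t) in cluster_adjust.items():
        if last_t != cur_t:
            remaining = [c for c in graph.clusters_on_target(last_t) if c != cluster]
            if len(remaining) > second_threshold:
                graph.cluster_to_target[cluster] = cur_t  # reallocate the cluster
```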
In one embodiment, as shown in fig. 6, the above step S6 includes steps S61 to S62, wherein:
step S61: and acquiring a preset mapping function, and acquiring a cluster corresponding to each agent based on the adjusted collaborative map, wherein the preset mapping function represents a mapping function between the cluster and the action.
Step S62: and generating a current action of each agent based on a preset mapping function, the current observation data of each agent and the corresponding cluster, wherein the current action comprises the advancing direction of the agent. The advancing direction includes the front, back, left, right and the like.
Further, the cluster and the current observation data of each agent in the cluster are input into the preset mapping function to obtain the action set of the agents in the cluster, where the mapping process is shown in the following formula (1):

$$A_j^t = \{a_{j,1}^t, a_{j,2}^t, \ldots, a_{j,n_j}^t\} = f(c_j^t, o_j^t) \tag{1}$$

where $A_j^t$ denotes the action set at time $t$ of the agents in the $j$-th cluster; $a_{j,1}^t$ denotes the current action of the 1st agent in the $j$-th cluster; $a_{j,n_j}^t$ denotes the current action of the $n_j$-th agent in the $j$-th cluster; $f$ denotes the preset mapping function from cluster to action; $c_j^t$ denotes the $j$-th cluster at time $t$; and $o_j^t$ denotes the current observation data at time $t$ of the agents in the $j$-th cluster.
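As an illustration of equation (1), the sketch below maps a cluster (represented here by the position of its assigned target) and the per-agent observations to one advance direction per agent; the nearest-target heading heuristic and the discretization are invented for illustration, not taken from the patent.

```python
# Hypothetical preset mapping function f of equation (1): one advance
# direction per agent in the cluster, pointing toward the cluster's target.
import math

def preset_mapping(target_pos, agent_positions):
    directions = ["right", "front", "left", "back"]   # the four advance directions
    actions = []
    for (ax, ay) in agent_positions:
        heading = math.atan2(target_pos[1] - ay, target_pos[0] - ax)
        idx = int(((heading + math.pi / 4) % (2 * math.pi)) // (math.pi / 2))
        actions.append(directions[idx])               # discretized advance direction
    return actions
```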
Two specific embodiments are provided below to further explain the deep reinforcement learning method based on multi-agent cooperative system provided by the present invention.
In a specific embodiment, the deep reinforcement learning method based on the multi-agent cooperation system provided by the invention comprises the following steps:
step 1: the method comprises the steps of obtaining current observation data of each intelligent agent in the multi-intelligent-agent cooperation system, wherein the current observation data comprise state data of other intelligent agents and to-be-processed targets in the observation range of the intelligent agents.
Step 2: in one-time deep reinforcement learning, acquiring a current distribution and adjustment action of a corresponding collaborative graph of the multi-agent cooperative system based on a pre-constructed deep reinforcement learning network and current observation data, acquiring current reward and punishment data based on the current distribution and adjustment action, and optimizing the deep reinforcement learning network based on the current reward and punishment data; the current allocation adjustment action is used to adjust the allocation scheme of agents and clusters in the collaborative map.
And step 3: repeatedly executing the steps until a preset convergence condition is reached or a preset learning frequency is reached; and taking the current deep reinforcement learning network as a target deep reinforcement learning network.
And 4, step 4: acquiring current observation data of each agent in the multi-agent cooperation system and a cooperation map corresponding to the multi-agent cooperation system; the collaboration graph includes a first allocation relationship of the agents to the clusters and a second allocation relationship of the clusters to the targets to be processed.
And 5: and inputting the observation data into a target deep reinforcement learning network to obtain a target distribution adjustment action, and adjusting the first distribution relation and the second distribution relation in the collaborative map based on the target distribution adjustment action.
And 6: and acquiring the current action of each intelligent agent based on the adjusted collaborative map so that each intelligent agent executes a preset collaboration task aiming at the target to be processed based on the current action.
Fig. 7a is a schematic diagram of an application scenario in a second embodiment of the present invention. As shown in fig. 7a, the targets to be processed are intrusion targets, and the reward and punishment data is a reward value. The intrusion targets are scattered randomly around the bases and each approaches its nearest base; if an intrusion target reaches a base, the agents' capture has failed. The agents capture the intrusion targets and prevent them from approaching the bases, and at least 2 agents are needed to stop 1 intrusion target. If a base is not intruded upon within the specified time, a reward value of +1 is obtained; otherwise the reward value is -1. The second embodiment provided by the invention specifically comprises the following steps:
step (1): the method comprises the steps of obtaining current observation data of each intelligent agent in the multi-intelligent-agent cooperation system, wherein the current observation data comprise first state data of other intelligent agents in the observation range of the intelligent agent and second state data of an intrusion target.
Step (2): as shown in fig. 7b, in one round of deep reinforcement learning, the current observation data is input to the multilayer perceptron module in the deep reinforcement learning network to obtain the first state features of the other agents and the second state features of the intrusion targets. The first and second state features are input to the first attention mechanism module in the deep reinforcement learning network to obtain the third state features. The first, second, and third state features are input to the second attention mechanism module in the deep reinforcement learning network to obtain the target state features. The target state features are input to the action coding module in the deep reinforcement learning network to obtain the current allocation adjustment action. Current reward and punishment data is acquired based on the current allocation adjustment action, and the deep reinforcement learning network is optimized based on the current reward and punishment data; the current allocation adjustment action is used to adjust the allocation scheme of agents and clusters in the collaborative graph.
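The reward and punishment data acquired in this step follows the scenario's stated rule (+1 when the base is protected for the full episode, -1 otherwise, with at least 2 agents needed per intruder); a minimal sketch, with all surrounding bookkeeping assumed:

```python
# Sketch of the scenario's reward/punishment rule as stated in the text.
def intruder_blocked(n_guarding_agents: int) -> bool:
    return n_guarding_agents >= 2    # at least 2 agents to stop 1 intrusion target

def episode_reward(base_intruded: bool) -> int:
    return -1 if base_intruded else 1   # +1 if the base held for the allotted time
```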
Step (3): acquiring the current learning count and judging whether it reaches the preset learning frequency; if it does, the current deep reinforcement learning network is taken as the target deep reinforcement learning network. If the current learning count has not reached the preset learning frequency, the reward and punishment data set of the multiple rounds of deep reinforcement learning is acquired and a deep reinforcement learning reward and punishment curve is generated from it; whether the curve has converged is then judged, and if so, the current deep reinforcement learning network is taken as the target deep reinforcement learning network. If the deep reinforcement learning reward and punishment curve has not converged, steps (1) to (2) are repeated until the current learning count reaches the preset learning frequency or the curve converges, and the current deep reinforcement learning network is then taken as the target deep reinforcement learning network.
Step (4): acquiring the current observation data of each agent in the multi-agent cooperation system and the collaboration graph corresponding to the multi-agent cooperation system. As shown in fig. 7c, the collaboration graph includes a first allocation relationship of agents to clusters and a second allocation relationship of clusters to intrusion targets. The observation data is input into the target deep reinforcement learning network to obtain a target allocation adjustment action. The target allocation adjustment action includes an agent allocation adjustment action and a cluster allocation adjustment action, wherein: the agent allocation adjustment action includes, for each agent, the last cluster number at the previous moment and the current cluster number at the current moment; the cluster allocation adjustment action includes, for each cluster, the last intrusion target number at the previous moment and the current intrusion target number at the current moment.
Step (5): judging, for each agent, whether the agent's last cluster number is consistent with its current cluster number; when they are inconsistent, acquiring the number of agents in the cluster corresponding to the last cluster number; and when that number of agents is greater than the preset first number threshold, allocating the agent to the cluster corresponding to the current cluster number. Likewise, judging, for each cluster, whether the cluster's last intrusion target number is consistent with its current intrusion target number; when they are inconsistent, acquiring the number of clusters on the intrusion target corresponding to the last intrusion target number; and when that number of clusters is greater than the preset second number threshold, allocating the cluster to the intrusion target corresponding to the current intrusion target number.
Step (6): acquiring the preset mapping function, and acquiring the cluster corresponding to each agent based on the adjusted collaborative graph, where the preset mapping function represents the mapping from clusters to actions. The current action of each agent, including its advance direction, is generated based on the preset mapping function, the agent's current observation data, and its corresponding cluster, so that each agent prevents the intrusion targets from approaching the bases based on its current action, completing the cooperative trapping task against the intrusion targets.
In summary, the deep reinforcement learning method based on the multi-agent cooperative system constructs a collaborative graph with a three-layer structure and introduces allocation experience data of cluster cooperative behavior, achieving a reasonable allocation of agents to clusters and of clusters to targets to be processed. The collaborative graph is adjusted dynamically by an attention-based deep reinforcement learning network, and after deep reinforcement learning is completed, the current actions of the agents are obtained from the adjusted collaborative graph. This simplifies the training steps of deep reinforcement learning, improves its training efficiency, and overcomes the low training efficiency and slow convergence caused by sparse cooperative rewards for large-scale clusters in complex application environments, thereby ensuring the smooth completion of multi-agent collaborative tasks.
The following describes the multi-agent cooperation system based deep reinforcement learning apparatus provided by the present invention, and the multi-agent cooperation system based deep reinforcement learning apparatus described below and the multi-agent cooperation system based deep reinforcement learning method described above can be referred to each other.
As shown in fig. 8, the present invention provides a multi-agent cooperative system based deep reinforcement learning apparatus 100, which includes a data acquisition module 10, a first training module 20 and a second training module 30, wherein:
the data acquisition module 10 is configured to acquire current observation data of each agent in the multi-agent collaboration system, where the current observation data includes status data of other agents and a target to be processed in an observation range of the agent.
The first training module 20 is configured to, in one deep reinforcement learning, obtain a current distribution adjustment action of the corresponding collaborative map of the multi-agent collaboration system based on a pre-established deep reinforcement learning network and current observation data, obtain current reward and punishment data based on the current distribution adjustment action, and optimize the deep reinforcement learning network based on the current reward and punishment data; the current allocation adjustment action is used to adjust the allocation scheme of agents and clusters in the collaborative map.
A second training module 30, configured to repeatedly perform the above steps until a preset convergence condition is reached or a preset learning frequency is reached; and taking the current deep reinforcement learning network as a target deep reinforcement learning network.
In one embodiment, the first training module 20 includes a first feature extraction unit, a second feature extraction unit, a third feature extraction unit, and an adjustment action acquisition unit, wherein:
and the first feature extraction unit is used for inputting the current observation data into a multilayer perceptron module in the deep reinforcement learning network to obtain first state features of other intelligent agents and second state features of the target to be processed.
And the second feature extraction unit is used for inputting the first state feature and the second state feature to a first attention mechanism module in the deep reinforcement learning network to obtain a third state feature.
And the third feature extraction unit is used for inputting the first state feature, the second state feature and the third state feature into a second attention mechanism module in the deep reinforcement learning network to obtain the target state feature.
And the adjusting action obtaining unit is used for inputting the target state characteristics to an action coding module in the deep reinforcement learning network to obtain the current distribution adjusting action.
In one embodiment, the second training module 30 includes a first determining unit, a second determining unit and a reinforcement learning unit, wherein:
and the first judging unit is used for acquiring the current learning frequency, judging whether the current learning frequency reaches the preset learning frequency or not, and stopping executing the steps under the condition that the current learning frequency reaches the preset learning frequency.
The second judgment unit is used for acquiring the reward and punishment data set of the multiple rounds of deep reinforcement learning when the current learning count has not reached the preset learning frequency, and generating a deep reinforcement learning reward and punishment curve based on the reward and punishment data set; and for judging whether the deep reinforcement learning reward and punishment curve has converged, and stopping execution of the steps when it has.
The reinforcement learning unit is used for repeatedly executing the above steps when the deep reinforcement learning reward and punishment curve has not converged, until the current learning count reaches the preset learning frequency or the curve converges.
In one embodiment, the multi-agent collaborative system based deep reinforcement learning apparatus 100 further comprises a collaborative map acquisition module, an assignment adjustment module, and an action acquisition module, wherein:
the cooperative drawing acquisition module is used for acquiring the current observation data of each intelligent agent in the multi-intelligent-agent cooperation system and the cooperative drawing corresponding to the multi-intelligent-agent cooperation system; the collaboration graph includes a first allocation relationship of the agents to the clusters and a second allocation relationship of the clusters to the targets to be processed.
And the distribution adjusting module is used for inputting the observation data into the target deep reinforcement learning network to obtain a target distribution adjusting action and adjusting the first distribution relation and the second distribution relation in the collaborative map based on the target distribution adjusting action.
And the action acquisition module is used for acquiring the current action of each intelligent agent based on the adjusted collaborative map so that each intelligent agent executes a preset collaboration task aiming at the target to be processed based on the current action.
In one embodiment, the target assignment adjustment action includes an agent assignment adjustment action and a cluster assignment adjustment action, wherein: the intelligent agent distribution adjustment action comprises a last cluster number corresponding to each intelligent agent at the last moment and a current cluster number corresponding to each intelligent agent at the current moment; the cluster allocation adjustment action comprises a last target number to be processed corresponding to each cluster at the last moment and a current target number to be processed corresponding to each cluster at the current moment.
In one embodiment, the allocation adjustment module comprises a first allocation adjustment unit and a second allocation adjustment unit, wherein:
The first distribution adjusting unit is used for judging whether the last cluster number of the intelligent agent is consistent with the current cluster number or not aiming at each intelligent agent; under the condition that the number of the previous cluster is inconsistent with the number of the current cluster, acquiring the number of agents in the cluster corresponding to the number of the previous cluster; and under the condition that the number of the agents is larger than a preset first number threshold, allocating the agents to the cluster corresponding to the current cluster number.
The second distribution adjusting unit is used for judging whether the last target number to be processed corresponding to each cluster is consistent with the current target number to be processed or not; under the condition that the last target number to be processed is inconsistent with the current target number to be processed, acquiring the number of clusters in the target to be processed corresponding to the last target number to be processed; and under the condition that the number of the clusters is greater than a preset second number threshold, distributing the clusters to the to-be-processed targets corresponding to the current to-be-processed target numbers.
In one embodiment, the action acquisition module comprises a mapping data acquisition unit and a current action acquisition unit, wherein:
The mapping data acquisition unit is used for acquiring a preset mapping function and acquiring the cluster corresponding to each agent based on the adjusted collaborative graph, wherein the preset mapping function maps clusters to actions.
The current action acquisition unit is used for generating the current action of each agent based on the preset mapping function, the agent's current observation data, and its corresponding cluster; the current action includes the agent's advancing direction.
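One simple mapping function consistent with this description steers each agent toward the position its cluster has been assigned. The patent only requires some cluster-to-action mapping, so the position fields and the heading-angle output below are assumptions for illustration.

```python
import math
from typing import Dict, Tuple

def heading_action(obs: Dict[str, float],
                   cluster_goal: Tuple[float, float]) -> float:
    """Map an agent's observation and its cluster's assigned target
    position to an advancing direction (heading angle in radians)."""
    dx = cluster_goal[0] - obs["x"]
    dy = cluster_goal[1] - obs["y"]
    return math.atan2(dy, dx)
```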
Fig. 9 illustrates the physical structure of an electronic device. As shown in Fig. 9, the electronic device may include a processor (processor) 910, a communication interface (Communications Interface) 920, a memory (memory) 930, and a communication bus 940, where the processor 910, the communication interface 920, and the memory 930 communicate with one another over the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to perform the multi-agent cooperation system based deep reinforcement learning method, which comprises: acquiring current observation data of each agent in the multi-agent cooperation system, where the current observation data includes state data of other agents and of targets to be processed within the agent's observation range; in one round of deep reinforcement learning, obtaining a current allocation adjustment action for the collaborative graph corresponding to the multi-agent cooperation system based on a pre-constructed deep reinforcement learning network and the current observation data, obtaining current reward-and-punishment data based on the current allocation adjustment action, and optimizing the deep reinforcement learning network based on the current reward-and-punishment data, where the current allocation adjustment action is used to adjust the allocation scheme of agents and clusters in the collaborative graph; repeating the above steps until a preset convergence condition or a preset number of learning iterations is reached; and taking the current deep reinforcement learning network as the target deep reinforcement learning network.
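Read end to end, the method the processor executes is a single training loop. The sketch below ties the steps together; `env`, `policy`, and `optimizer` are placeholder objects with assumed interfaces, the policy-gradient surrogate loss is one possible choice rather than the patent's prescribed update, and `should_stop` refers to the stopping-rule sketch above.

```python
def train(env, policy, optimizer, max_steps: int = 100_000):
    rewards = []
    for step in range(max_steps):
        obs = env.observe()                    # current observation data
        adjust, log_prob = policy.act(obs)     # allocation adjustment action
        reward = env.apply_adjustment(adjust)  # reward-and-punishment data
        loss = -reward * log_prob              # assumed policy-gradient loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        rewards.append(float(reward))
        if should_stop(rewards, step, max_steps):
            break
    return policy  # the target deep reinforcement learning network
```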
Furthermore, the logic instructions in the memory 930 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part of it that contributes over the prior art, may be embodied as a software product stored on a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In still another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the multi-agent cooperation system based deep reinforcement learning method provided above, the method comprising: acquiring current observation data of each agent in the multi-agent cooperation system, where the current observation data includes state data of other agents and of targets to be processed within the agent's observation range; in one round of deep reinforcement learning, obtaining a current allocation adjustment action for the collaborative graph corresponding to the multi-agent cooperation system based on a pre-constructed deep reinforcement learning network and the current observation data, obtaining current reward-and-punishment data based on the current allocation adjustment action, and optimizing the deep reinforcement learning network based on the current reward-and-punishment data, where the current allocation adjustment action is used to adjust the allocation scheme of agents and clusters in the collaborative graph; repeating the above steps until a preset convergence condition or a preset number of learning iterations is reached; and taking the current deep reinforcement learning network as the target deep reinforcement learning network.
The apparatus embodiments described above are merely illustrative; units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and one of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, or by hardware. With this understanding, the above technical solutions may be embodied as a software product stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A deep reinforcement learning method based on a multi-agent cooperation system, characterized by comprising the following steps:
acquiring current observation data of each agent in the multi-agent cooperation system, wherein the current observation data comprises state data of other agents and of targets to be processed within the agent's observation range;
in one round of deep reinforcement learning, obtaining a current allocation adjustment action for the collaborative graph corresponding to the multi-agent cooperation system based on a pre-constructed deep reinforcement learning network and the current observation data, obtaining current reward-and-punishment data based on the current allocation adjustment action, and optimizing the deep reinforcement learning network based on the current reward-and-punishment data, wherein the current allocation adjustment action is used for adjusting the allocation scheme of agents and clusters in the collaborative graph; and
repeating the above steps until a preset convergence condition is reached or a preset number of learning iterations is reached, and taking the current deep reinforcement learning network as a target deep reinforcement learning network.
2. The multi-agent cooperation system based deep reinforcement learning method according to claim 1, wherein obtaining the current allocation adjustment action for the collaborative graph corresponding to the multi-agent cooperation system based on the pre-constructed deep reinforcement learning network and the current observation data comprises:
inputting the current observation data into a multi-layer perceptron module in the deep reinforcement learning network to obtain first state features of the other agents and second state features of the targets to be processed;
inputting the first state features and the second state features into a first attention mechanism module in the deep reinforcement learning network to obtain third state features;
inputting the first state features, the second state features and the third state features into a second attention mechanism module in the deep reinforcement learning network to obtain target state features; and
inputting the target state features into an action coding module in the deep reinforcement learning network to obtain the current allocation adjustment action.
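To make the module order concrete, here is a hedged PyTorch sketch of the four-stage network of this claim (MLP encoder, two stacked attention modules, action-encoding head). Layer sizes, head counts, the residual mixing of features, and the mean-pooling are illustrative assumptions; the claim fixes only the sequence of modules.

```python
import torch
import torch.nn as nn

class AdjustmentNet(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64, n_actions: int = 16):
        super().__init__()
        # Multi-layer perceptron module: encodes raw observations.
        self.mlp = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        # First and second attention mechanism modules.
        self.attn1 = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.attn2 = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Action coding module: emits allocation-adjustment-action logits.
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, entities, obs_dim) over agents and targets to be processed
        feats = self.mlp(obs)                       # first/second state features
        third, _ = self.attn1(feats, feats, feats)  # third state features
        mixed = feats + third                       # combine earlier features
        target_feat, _ = self.attn2(mixed, mixed, mixed)  # target state features
        return self.head(target_feat.mean(dim=1))  # pool, then encode action
```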
3. The multi-agent cooperation system based deep reinforcement learning method according to claim 1, wherein repeating the above steps until a preset convergence condition is reached or a preset number of learning iterations is reached comprises:
acquiring the current number of learning iterations, judging whether it has reached the preset number of learning iterations, and stopping the above steps if it has;
if the current number of learning iterations has not reached the preset number, acquiring the reward-and-punishment data sets of the multiple rounds of deep reinforcement learning, generating a deep reinforcement learning reward-and-punishment curve based on the reward-and-punishment data sets, judging whether the curve has converged, and stopping the above steps if it has; and
if the deep reinforcement learning reward-and-punishment curve has not converged, repeating the above steps until the current number of learning iterations reaches the preset number or the curve converges.
4. The multi-agent cooperation system based deep reinforcement learning method according to claim 1, further comprising:
acquiring current observation data of each agent in the multi-agent cooperation system and the collaborative graph corresponding to the multi-agent cooperation system, wherein the collaborative graph comprises a first allocation relationship between agents and clusters and a second allocation relationship between clusters and targets to be processed;
inputting the observation data into the target deep reinforcement learning network to obtain a target allocation adjustment action, and adjusting the first allocation relationship and the second allocation relationship in the collaborative graph based on the target allocation adjustment action; and
acquiring the current action of each agent based on the adjusted collaborative graph, so that each agent executes a preset cooperation task on the targets to be processed based on its current action.
5. The multi-agent cooperation system based deep reinforcement learning method according to claim 4, wherein the target allocation adjustment action comprises an agent allocation adjustment action and a cluster allocation adjustment action, wherein:
the agent allocation adjustment action comprises, for each agent, the cluster number assigned at the previous moment and the cluster number assigned at the current moment; and
the cluster allocation adjustment action comprises, for each cluster, the to-be-processed target number assigned at the previous moment and the to-be-processed target number assigned at the current moment.
6. The multi-agent cooperation system based deep reinforcement learning method according to claim 5, wherein adjusting the first allocation relationship and the second allocation relationship in the collaborative graph based on the target allocation adjustment action comprises:
for each agent, judging whether the agent's previous cluster number is consistent with its current cluster number; when the two are inconsistent, acquiring the number of agents in the cluster corresponding to the previous cluster number; and when that number is larger than a preset first number threshold, allocating the agent to the cluster corresponding to the current cluster number; and
for each cluster, judging whether the cluster's previous to-be-processed target number is consistent with its current to-be-processed target number; when the two are inconsistent, acquiring the number of clusters assigned to the target corresponding to the previous to-be-processed target number; and when that number is larger than a preset second number threshold, allocating the cluster to the target corresponding to the current to-be-processed target number.
7. The multi-agent cooperation system based deep reinforcement learning method according to claim 4, wherein acquiring the current action of each agent based on the adjusted collaborative graph comprises:
acquiring a preset mapping function and acquiring the cluster corresponding to each agent based on the adjusted collaborative graph, wherein the preset mapping function maps clusters to actions; and
generating the current action of each agent based on the preset mapping function, the agent's current observation data and its corresponding cluster, wherein the current action comprises the agent's advancing direction.
8. A deep reinforcement learning apparatus based on a multi-agent cooperation system, characterized by comprising:
a data acquisition module, used for acquiring current observation data of each agent in the multi-agent cooperation system, wherein the current observation data comprises state data of other agents and of targets to be processed within the agent's observation range;
a first training module, used for, in one round of deep reinforcement learning, obtaining a current allocation adjustment action for the collaborative graph corresponding to the multi-agent cooperation system based on a pre-constructed deep reinforcement learning network and the current observation data, obtaining current reward-and-punishment data based on the current allocation adjustment action, and optimizing the deep reinforcement learning network based on the current reward-and-punishment data, wherein the current allocation adjustment action is used for adjusting the allocation scheme of agents and clusters in the collaborative graph; and
a second training module, used for repeating the above steps until a preset convergence condition is reached or a preset number of learning iterations is reached, and taking the current deep reinforcement learning network as a target deep reinforcement learning network.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the multi-agent cooperation system based deep reinforcement learning method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the multi-agent cooperation system based deep reinforcement learning method according to any one of claims 1 to 7.
CN202210715660.2A 2022-06-23 2022-06-23 Deep reinforcement learning method and device based on multi-agent cooperation system Active CN114792133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210715660.2A CN114792133B (en) 2022-06-23 2022-06-23 Deep reinforcement learning method and device based on multi-agent cooperation system

Publications (2)

Publication Number Publication Date
CN114792133A true CN114792133A (en) 2022-07-26
CN114792133B CN114792133B (en) 2022-09-27

Family

ID=82463825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210715660.2A Active CN114792133B (en) 2022-06-23 2022-06-23 Deep reinforcement learning method and device based on multi-agent cooperation system

Country Status (1)

Country Link
CN (1) CN114792133B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117709806A (en) * 2024-02-05 2024-03-15 慧新全智工业互联科技(青岛)有限公司 Cooperative multi-equipment abnormality automatic detection method and detection system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning
CN111767789A (en) * 2020-05-13 2020-10-13 北京交通大学 Crowd evacuation method and system based on multi-carrier intelligent guidance
CN112364984A (en) * 2020-11-13 2021-02-12 南京航空航天大学 Cooperative multi-agent reinforcement learning method
CN113435564A (en) * 2021-05-25 2021-09-24 北京理工大学 Augmented reality multi-agent cooperative confrontation realization method based on reinforcement learning
CN114610024A (en) * 2022-02-25 2022-06-10 电子科技大学 Multi-agent collaborative search energy-saving method used in mountain environment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Meng Wei et al.: "Application of Reinforcement Learning in Robot Soccer Matches", Application Research of Computers *
Luo Qing et al.: "Multi-Agent Reinforcement Learning in Complex Environments", Journal of Shanghai Jiao Tong University *
Fan Bo et al.: "A Multi-Agent Coordination Method Based on Distributed Reinforcement Learning", Computer Simulation *
He Huilin: "Research on the Multi-Agent Coverage Problem Based on Heuristic Reinforcement Learning", Modern Computer (Professional Edition) *

Also Published As

Publication number Publication date
CN114792133B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
CN107209872B (en) Systems, methods, and storage media for training a reinforcement learning system
CN108122027B (en) Training method, device and chip of neural network model
US9373073B2 (en) Time-division multiplexed neurosynaptic module with implicit memory addressing for implementing a universal substrate of adaptation
CN109754060A (en) A kind of training method and device of neural network machine learning model
CN110222091A (en) A kind of mass data real-time statistic analysis method
CN108009642A (en) Distributed machines learning method and system
WO2019084560A1 (en) Neural architecture search
CN113412494A (en) Method and device for determining transmission strategy
CN107886158A (en) A kind of bat optimized algorithm based on Iterated Local Search and Stochastic inertia weight
CN113469325A (en) Layered federated learning method, computer equipment and storage medium for edge aggregation interval adaptive control
CN108111335A (en) A kind of method and system dispatched and link virtual network function
CN114792133B (en) Deep reinforcement learning method and device based on multi-agent cooperation system
US20220318412A1 (en) Privacy-aware pruning in machine learning
EP3968648A1 (en) Bitrate decision model training method and electronic device
CN111740925B (en) Deep reinforcement learning-based flow scheduling method
CN113015219B (en) Network resource selection method and device based on strategy gradient and storage medium
CN114546608A (en) Task scheduling method based on edge calculation
CN115080248A (en) Scheduling optimization method for scheduling device, and storage medium
CN114186671A (en) Large-batch decentralized distributed image classifier training method and system
CN104980209A (en) Dynamic information sending and adjusting method based on Beidou real-time feedback
CN105589896B (en) Data digging method and device
CN114611690A (en) Data processing method and related device
CN110991712B (en) Planning method and device for space debris removal task
CN110138670B (en) Load migration method based on dynamic path
CN111104569B (en) Method, device and storage medium for partitioning database table

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant