CN114545776A - Multi-agent control method and device - Google Patents

Multi-agent control method and device

Info

Publication number
CN114545776A
Authority
CN
China
Prior art keywords
agent
action
next moment
moment
intelligent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210199344.4A
Other languages
Chinese (zh)
Inventor
陈杰
唐振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shengjing Intelligent Technology Jiaxing Co ltd
Original Assignee
Shengjing Intelligent Technology Jiaxing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengjing Intelligent Technology Jiaxing Co ltd filed Critical Shengjing Intelligent Technology Jiaxing Co ltd
Priority to CN202210199344.4A priority Critical patent/CN114545776A/en
Publication of CN114545776A publication Critical patent/CN114545776A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 - Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides a multi-agent control method and device, relating to the technical field of artificial intelligence. The method comprises: inputting the state information of the current moment into a multi-agent control model and acquiring the target action of each agent at the next moment output by the model; generating a control instruction corresponding to each agent at the next moment based on that target action, and controlling each agent based on the control instruction corresponding to it at the next moment. Each agent is used for patrolling a target area, and the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model. By combining a Transformer network model with a reinforcement learning network model whose calculation process is simpler and whose calculation efficiency is higher, the method and device realize control of multiple agents and improve the efficiency of controlling them.

Description

Multi-agent control method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a multi-agent control method and device.
Background
Reinforcement Learning (RL), also known as evaluative learning, is one of the paradigms and methodologies of machine learning. It can be used to describe and solve the problem of an agent learning a strategy to maximize its return, or to achieve a specific goal, while interacting with its environment. Conventional reinforcement learning network models may include, but are not limited to, the DQN network model (Deep Q-Network), the PG network model (Policy Gradient), the AC network model (Actor-Critic), and the like. Based on a conventional reinforcement learning network model, control of a single agent can be realized.
In some service scenarios, for example a scenario in which several patrol robots patrol a target area, multiple agents need to be controlled simultaneously. In the prior art, multi-agent control can be realized based on a multi-agent control algorithm. However, existing multi-agent control algorithms have complex calculation processes and low calculation efficiency, so controlling multiple agents based on them is inefficient. How to control multiple agents more efficiently, based on a reinforcement learning network model whose calculation process is simpler and whose calculation efficiency is higher, is a technical problem to be solved urgently in the field.
Disclosure of Invention
The invention provides a multi-agent control method and device to overcome the low efficiency of multi-agent control in the prior art and to control multiple agents more efficiently.
The invention provides a multi-agent control method, which comprises the following steps:
acquiring attribute information of each agent and area information of a target area as state information of the current moment;
inputting the state information of the current moment into a multi-agent control model, and acquiring the target action of each agent at the next moment output by the multi-agent control model;
generating a control instruction corresponding to each agent at the next moment based on the target action of each agent at the next moment, so as to control each agent based on the control instruction corresponding to each agent at the next moment;
each intelligent agent is used for patrolling the target area; the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model.
According to a multi-agent control method provided by the invention, the multi-agent control model comprises the following steps: a Transformer layer and a reinforcement learning layer;
correspondingly, the inputting the state information of the current time into the multi-agent control model, and acquiring the target action of each agent at the next time output by the multi-agent control model specifically includes:
inputting the state information of the current moment into a Transformer layer, and acquiring the executable action of each intelligent agent at the next moment output by the Transformer layer;
and inputting the state information of the current moment and the executable action of each intelligent agent at the next moment into the reinforcement learning layer, determining one of the executable actions of each intelligent agent at the next moment as the target action of each intelligent agent at the next moment by the reinforcement learning layer, and further acquiring the target action of each intelligent agent at the next moment output by the reinforcement learning layer.
According to the multi-agent control method provided by the invention, the loss function of the multi-agent control model comprises the following steps: a reward loss function;
and the reward loss function is used to minimize, within a preset time length from when the agents start patrolling the target area, the intersection area between the patrolled areas of any two agents and the area of the non-patrolled region within the target area.
According to the multi-agent control method provided by the invention, the Transformer layer comprises the following steps: a format conversion unit and an action determination unit;
correspondingly, the inputting the state information of the current moment into the Transformer layer, and acquiring the executable action of each agent at the next moment output by the Transformer layer specifically includes:
inputting the state information of the current moment into the format conversion unit, and sorting the agents by the format conversion unit based on the position information of each agent at the current moment, so as to obtain the agent attribute information sequence of the current moment output by the format conversion unit; wherein the attribute information of an agent comprises the location information of the agent;
and inputting the attribute information sequence of the agents at the current moment and the area information of the target area into the action determining unit, and acquiring the executable action of each agent at the next moment output by the action determining unit.
According to the multi-agent control method provided by the invention, the reinforcement learning layer comprises the following steps: a probability distribution unit and a result output unit;
correspondingly, the inputting the state information of the current time and the executable action of each agent at the next time into the reinforcement learning layer, and acquiring the target action of each agent at the next time output by the reinforcement learning layer specifically includes:
inputting the state information of the current moment and the executable action of each agent at the next moment into the probability distribution unit, and acquiring the probability distribution of the executable action of each agent at the next moment, which is output by the probability distribution unit;
and inputting the probability distribution of the executable action of each agent at the next moment into the result output unit, and acquiring the target action of each agent at the next moment output by the result output unit.
According to a multi-agent control method provided by the present invention, the attribute information includes: the remaining power of the agent, the visible-range radius and the position information.
The present invention also provides a multi-agent control device comprising:
the information acquisition module is used for acquiring the attribute information of each agent and the area information of the target area as the state information of the current moment;
the model calculation module is used for inputting the state information of the current moment into the multi-agent control model and acquiring the target action of each agent at the next moment output by the multi-agent control model;
the agent control module is used for generating a control instruction corresponding to each agent at the next moment based on the target action of each agent at the next moment, so as to control each agent based on the control instruction corresponding to each agent at the next moment;
each intelligent agent is used for patrolling the target area; the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the multi-agent control method as described in any of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a multi-agent control method as described in any one of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a multi-agent control method as described in any one of the above.
According to the multi-agent control method and device provided by the invention, state information comprising the attribute information of each agent at the current moment and the area information of the target area is input into the multi-agent control model, and the target action of each agent at the next moment output by the model is obtained. A control instruction corresponding to each agent at the next moment is then generated based on that target action, and each agent is controlled based on the control instruction corresponding to it at the next moment.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a multi-agent control method provided by the present invention;
FIG. 2 is a schematic structural diagram of a conventional DQN network model in the multi-agent control method provided by the present invention;
FIG. 3 is a schematic structural diagram of a multi-agent control model in the multi-agent control method provided by the present invention;
FIG. 4 is a schematic structural diagram of an action determining unit in the multi-agent control method provided by the present invention;
FIG. 5 is a schematic structural diagram of a multi-agent control device provided by the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly, e.g., as a fixed connection, a removable connection, or an integral connection; the connection may be mechanical or electrical; it may be direct or indirect through an intermediate medium, or it may be internal between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.
FIG. 1 is a flow chart of a multi-agent control method provided by the present invention. The multi-agent control method of the present invention is described below in conjunction with FIG. 1. As shown in fig. 1, the method includes: step 101, acquiring attribute information of each agent and area information of a target area as state information of the current moment; wherein, each agent is used for patrolling the target area.
It should be noted that the agents in the embodiment of the present invention may include, but are not limited to, intelligent entities such as drones, vehicles, and patrol robots. A plurality of such agents can patrol a specified limited area or a specified limited space.
The target area is a predetermined area requiring patrol. Multiple agents may patrol the target area.
When any agent patrols the target area, the agent can respond to the received control command and complete corresponding actions.
The action of an agent may include a direction of movement. In the embodiment of the present invention, 1° may be used as the step length, and the moving direction may be defined as a clockwise rotation of 1° to 360°. For example, an agent's action may be to move forward at a certain speed in the direction obtained after rotating 30° clockwise.
It should be noted that the moving speed of each agent may be the same or different, and in the case that the moving speed of each agent is the same, the moving distance of each agent in the same time period is equal.
It should be noted that, in the embodiment of the present invention, the agents are of the same type, that is, their moving speeds are equal, and each agent keeps moving forward constantly. Thus, for any agent, its action may include only the moving direction, i.e., only the angle of clockwise rotation.
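As an illustrative sketch only (the function names and the coordinate convention are assumptions, not part of the disclosure), the discrete action space and the constant-speed forward motion described above could be modelled as follows:

    import math

    K = 360  # number of discrete actions: clockwise rotations of 1..360 degrees

    def action_to_heading(current_heading_deg: float, action: int) -> float:
        """Apply the action, a clockwise rotation of `action` degrees, to the current heading."""
        assert 1 <= action <= K
        return (current_heading_deg + action) % 360.0

    def step_position(x: float, y: float, heading_deg: float, speed: float, dt: float):
        """Move the agent forward along its heading at the shared constant speed (convention illustrative)."""
        rad = math.radians(heading_deg)
        return x + speed * dt * math.cos(rad), y + speed * dt * math.sin(rad)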
An agent observes an area within a certain range using a visual sensor mounted on it; the visible range of the agent is related to the performance parameters of the visual sensor and is an inherent property of the agent.
The attribute information of the agent at the current time may be used to describe the state of the agent at the current time. The attribute information of the agent may include at least one of a remaining power amount of the agent, a visible range, location information, a moving speed, and a distance from each other agent. The attribute information of the agent is not particularly limited in the embodiment of the present invention.
Alternatively, the attribute information of the agent may be vectorized, and the attribute information of the agent is represented by an attribute vector. Each dimension of the attribute vector may represent a type of attribute information for the agent.
For each agent, the attribute information of the agent at the current time may be obtained in various ways, for example: the attribute information of the agent at the current time may be obtained based on the controller of the agent and various sensors.
The area information of the target area at the current time may be used to describe the patrol status of the target area at the current time. The area information of the target area may include, but is not limited to, areas where each agent has patrolled and areas where the agent has not patrolled within a preset time period since each agent starts patrolling the target area, and the like.
The area information of the target area at the current time may be acquired in various ways, for example: the area information of the target area at the current moment can be acquired through various sensors.
102, inputting the state information of the current moment into a multi-agent control model, and acquiring the target action of each agent at the next moment output by the multi-agent control model; the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model.
It should be noted that, in general, problems with the finite Markov decision process (MDP) property can be solved well by a reinforcement learning network model. A sequential decision problem such as an automatic resource scheduling strategy has the finite Markov property and is therefore suited to a reinforcement learning network model.
Fig. 2 is a schematic structural diagram of a conventional DQN network model in the multi-agent control method provided by the present invention. As shown in fig. 2, the conventional DQN network model includes a Q-Network unit, a Target-Network unit, and an Experience Replay unit. The Q-Network unit and the Target-Network unit have the same network structure, and the weights of the Q-Network unit are copied to the Target-Network unit after multiple rounds of training.
For any agent used for patrolling a target area, it is difficult for the agent to pay attention to other agents in the target area based on a traditional DQN network model, that is, it is difficult to obtain a relationship between any two agents in the target area. The relationship may include a distance between the two agents at the current time, and the like.
Therefore, in the embodiment of the invention, based on the Transformer network model and the reinforcement learning network model, the multi-agent control model is constructed and trained to obtain the trained multi-agent control model. The trained multi-agent control model can be used for acquiring and outputting the target action of each agent at the next moment based on the state information at the current moment. Wherein the target action is the action that the agent needs to execute at the next moment.
Optionally, the reinforcement learning network model in the embodiment of the present invention may be a DQN network model, a PG network model (Policy Gradient), an AC network model (Actor-Critic), or the like. The embodiment of the present invention does not limit the specific reinforcement learning network model.
Preferably, since the DQN network model has the advantages of strong universality, a simpler calculation process and higher calculation efficiency, the multi-agent control model in the embodiment of the invention may be constructed based on the Transformer network model and the DQN network model.
A multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model and is then trained; after the trained multi-agent control model is obtained, the state information of the current moment is input into it.
The multi-agent control model can acquire and output the target action of each agent at the next moment based on the state information at the current moment.
And 103, generating a control instruction corresponding to each agent at the next moment based on the target action of each agent at the next moment, so as to control each agent based on the control instruction corresponding to each agent at the next moment.
Specifically, after the target action of each agent at the next time outputted by the multi-agent control model is obtained, the control instruction corresponding to each agent at the next time may be generated based on the target action of each agent at the next time.
For each agent, after generating the control instruction corresponding to the agent at the next time, the control instruction corresponding to the agent at the next time may be sent to the controller of the agent. The controller of the agent can respond to the control instruction corresponding to the agent at the next moment, and control the action of the agent at the next moment.
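The control cycle of steps 101 to 103 can be illustrated by the following minimal sketch; the agent, area and model interfaces (read_attributes, read_patrol_info, predict, send_instruction) are assumed names used only for illustration:

    def control_step(agents, target_area, model):
        # Step 101: collect attribute info of each agent and area info as the current state
        state = {
            "agents": [a.read_attributes() for a in agents],   # assumed controller/sensor API
            "area": target_area.read_patrol_info(),            # patrolled / non-patrolled regions
        }
        # Step 102: the multi-agent control model outputs each agent's target action
        target_actions = model.predict(state)                  # one rotation angle per agent
        # Step 103: turn target actions into control instructions and dispatch them
        for agent, action in zip(agents, target_actions):
            agent.send_instruction({"rotate_clockwise_deg": action})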
According to the embodiment of the invention, state information comprising the attribute information of each agent at the current moment and the area information of the target area is input into the multi-agent control model, and the target action of each agent at the next moment output by the model is obtained. A control instruction corresponding to each agent at the next moment is generated based on that target action, and each agent is controlled based on the control instruction corresponding to it at the next moment.
Based on the content of the above embodiments, the attribute information includes: remaining power, visual range, and location information.
Specifically, for any agent used for patrolling a target area, the attribute information of the agent may include the remaining power amount, the visible range, and the location information of the agent.
Alternatively, a planar rectangular coordinate system may be established in the target region, and the position information of any agent may be represented by coordinates in the planar rectangular coordinate system.
Alternatively, the remaining amount of electricity of the agent may be represented by e; r represents the visible radius of the agent; x, y represent the abscissa and ordinate of the agent in the above coordinate system.
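For illustration, and assuming the dimension ordering [e, r, x, y], the attribute vector of a single agent could be assembled as follows:

    import numpy as np

    def attribute_vector(e: float, r: float, x: float, y: float) -> np.ndarray:
        """Attribute vector g_i = [remaining power e, visible radius r, position x, y] (ordering assumed)."""
        return np.array([e, r, x, y], dtype=np.float32)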
The attribute information of the intelligent agent in the embodiment of the invention comprises the residual electric quantity, the visual range and the position information of the intelligent agent, and can provide a data basis for the multi-intelligent-agent control model to more efficiently and more accurately acquire the target action of each intelligent agent at the next moment.
FIG. 3 is a schematic structural diagram of a multi-agent control model in the multi-agent control method provided by the present invention. As shown in FIG. 3, a multi-agent control model, comprising: a Transformer layer and a reinforcement learning layer.
Correspondingly, the state information of the current moment is input into the multi-agent control model, and the target action of each agent at the next moment output by the multi-agent control model is obtained, which specifically comprises the following steps: and inputting the state information of the current moment into the Transformer layer, and acquiring the executable action of each agent at the next moment output by the Transformer layer.
The Transformer layer is constructed based on a Transformer network model. The reinforcement learning layer is constructed based on the DQN reinforcement learning network model.
As shown in fig. 3, Status in fig. 3 may indicate Status information of the current time. After the state information of the current time is acquired, the state information of the current time may be stored in the Center Storage unit.
It should be noted that the Center Storage unit may be a cache unit.
The state information of the current time can be acquired from the Center Storage unit and can be input to the Transformer layer.
The Transformer layer can acquire and output the executable action of each agent at the next moment based on the state information at the current moment.
It should be noted that, for any agent, the number of executable actions of the agent at the next time may be one or more.
Optionally, the executable action of each agent at the next time may be vectorized to obtain an action vector corresponding to each agent at the next time. Each dimension in the motion vector corresponding to any agent can represent an executable motion of the agent at the next moment.
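One possible vectorization, sketched below under the assumption of a binary mask over the k rotation angles, marks in each dimension whether the corresponding rotation is executable at the next moment:

    import numpy as np

    K = 360  # rotation angles 1..360 degrees

    def executable_action_vector(executable_angles) -> np.ndarray:
        """Binary action vector: dimension t-1 is 1 if a clockwise rotation of t degrees is executable."""
        v = np.zeros(K, dtype=np.float32)
        for t in executable_angles:
            v[t - 1] = 1.0
        return v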
And inputting the state information of the current moment and the executable action of each intelligent agent at the next moment into the reinforcement learning layer, determining one of the executable actions of each intelligent agent at the next moment as the target action of each intelligent agent at the next moment by the reinforcement learning layer, and further acquiring the target action of each intelligent agent at the next moment output by the reinforcement learning layer.
Specifically, the state information of the current time can be acquired from the Center Storage unit, and the state information of the current time and the executable action of each agent at the next time can be input into the reinforcement learning layer.
As shown in FIG. 3, the reinforcement learning layer comprises a Q-Network unit, a Target-Network unit and an Experience Replay unit. The state vector in the Q-Network unit represents the state information at the current moment.
The mechanism of the reinforcement learning layer is as follows:
S_0 → A_0 → R_1 → S_1 → … → S_{t-1} → A_{t-1} → R_t → S_t → …
where S_0 represents the state information at the initial moment, the initial moment being the moment at which each agent starts patrolling the target area; A_0 represents the target action of each agent at the initial moment; R_1 represents the reward value of each agent at moment 1 after each agent executes its target action; S_{t-1} represents the state information at moment t-1; A_{t-1} represents the target action of each agent at moment t-1; R_t represents the reward value of each agent at moment t after each agent executes its target action.
As shown in fig. 3, Reward in fig. 3 represents the prize value of each agent at the next time after each agent performs the target action. Status in fig. 3 indicates Status information at the next time.
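The interaction sequence above amounts to storing transition tuples (S_{t-1}, A_{t-1}, R_t, S_t) as they are produced; the following replay-buffer sketch (class name and capacity are illustrative assumptions) shows one possible realization of the Experience Replay unit:

    import random
    from collections import deque

    class ExperienceReplay:
        """Stores (S_{t-1}, A_{t-1}, R_t, S_t) transitions produced by the interaction loop."""
        def __init__(self, capacity: int = 100_000):
            self.buffer = deque(maxlen=capacity)

        def push(self, state, actions, reward, next_state):
            self.buffer.append((state, actions, reward, next_state))

        def sample(self, batch_size: int):
            return random.sample(self.buffer, batch_size)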
The Transformer layer and the reinforcement learning layer are combined to establish a mapping, parameterized by θ, from the state information and the executable actions of each agent to its target action:
(χ, a_i; θ) → a_i*
where a_i represents an executable action of agent i at the next moment, a_i ∈ {V_1, V_2, …, V_{k-1}, V_k}, i ∈ {1, 2, …, I}, and I represents the number of agents; V_k indicates a clockwise rotation of k degrees; a_i* represents the target action of agent i at the next moment; χ represents the state information at the current moment; θ represents the model parameters of the reinforcement learning layer.
For each agent, the reinforcement learning layer may determine one of the executable actions of the agent at the next time as the target action of the agent at the next time based on the state information of the agent at the current time and the executable action of the agent at the next time, and then output the target action of each agent at the next time.
The state information of the current moment is input into the Transformer layer of the multi-agent control model, and the executable actions of each agent at the next moment output by the Transformer layer are obtained. The state information of the current moment and those executable actions are then input into the reinforcement learning layer of the multi-agent control model, which determines one of the executable actions of each agent at the next moment as that agent's target action and outputs it. In this way, the Transformer layer determines the relation among the agents and, based on that relation, the executable actions of each agent at the next moment, so that control of multiple agents can be realized on the basis of a traditional reinforcement learning network model.
Based on the content of the above embodiments, the loss function of the multi-agent control model includes: a reward loss function.
And the reward loss function is used to minimize, within a preset time length from when the agents start patrolling the target area, the intersection area between the patrolled areas of any two agents and the area of the non-patrolled region within the target area.
In particular, when multiple agents patrol a target area, the patrol objectives may include patrolling the target area more efficiently and leaving fewer non-patrolled areas.
More efficient patrol can be represented by the intersection area between the patrolled areas of any two agents within a preset time length from when the agents start patrolling the target area: the smaller the intersection area, the higher the patrol efficiency.
Fewer non-patrolled areas can be represented by the area of the non-patrolled region within the target area within that preset time length: the smaller this area, the fewer the non-patrolled areas.
The reward loss function may be expressed by the following formula:
Υ(Γ,S,A)=αφ(Γ,S,A)+βψ(Γ,S,A),0<α,β≤1
where φ(Γ, S, A) represents the intersection area loss function; ψ(Γ, S, A) represents the non-patrolled area loss function; α and β represent weights.
The intersection area loss function φ (Γ, S, A) may be represented by the following equation:
Figure BDA0003528618810000121
wherein avg represents the average value; circle (i, j) represents the intersection area between the patrolled area of the agent i and the patrolled area of the agent j within a preset time length from the beginning of patrolling the target area by each agent; r represents the visible radius of the agent; ξ is 0.1.
The non-patrolled area loss function ψ(Γ, S, A) may be expressed by the following formula:
Figure BDA0003528618810000131
where ΔH_t represents the change in the area of the non-patrolled region within the target area at moment t, within the preset time length from when the agents start patrolling the target area.
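Since the exact formulas of φ and ψ appear in the original publication only as images, the following sketch evaluates Υ(Γ, S, A) = αφ + βψ with assumed forms of the two terms (an average of the pairwise intersection areas and an average of the ΔH_t values); it illustrates the structure of the loss rather than the patented formulas:

    from itertools import combinations

    def reward_loss(circle, unpatrolled_area_changes, num_agents, alpha=0.5, beta=0.5):
        """Υ(Γ,S,A) = α·φ + β·ψ with 0 < α, β <= 1 (φ and ψ approximated for illustration).

        circle(i, j): intersection area of the patrolled regions of agents i and j.
        unpatrolled_area_changes: ΔH_t values, changes of the non-patrolled area over time.
        """
        pairs = list(combinations(range(num_agents), 2))
        phi = sum(circle(i, j) for i, j in pairs) / max(len(pairs), 1)          # assumed: mean pairwise overlap
        psi = sum(unpatrolled_area_changes) / max(len(unpatrolled_area_changes), 1)  # assumed aggregation of ΔH_t
        return alpha * phi + beta * psi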
Based on the reward loss function, the multi-agent control model can be trained with the training objective of minimizing the reward loss function, i.e., minimizing the intersection area loss function and the non-patrolled area loss function. The specific process of training the multi-agent control model is as follows. Step one: set hyper-parameters such as the maximum number of iterations, the batch size, the numbers of encoders and decoders in the Transformer layer and the number of attention heads, initialize each sample agent, and initialize the Center Storage unit of the reinforcement learning layer.
Step two: start the iteration, input the sample state information of the current moment into the Transformer layer being trained, and obtain each predicted executable action of each sample agent at the next moment output by the Transformer layer. The sample state information of the current moment comprises the attribute information of each sample agent and the area information of the sample area. The attribute information of a sample agent includes its remaining power, visible range and position information. The area information of the sample area includes, but is not limited to, the areas each sample agent has patrolled and the areas not yet patrolled within a preset time length from when the sample agents start patrolling the sample area. Here t-1 represents the current moment.
Step three: the predicted executable actions of each sample agent at the next moment are input into the reinforcement learning layer being trained. The reinforcement learning layer traverses the sample agents (or processes them in parallel), determines the predicted target action of each sample agent at the next moment using an ε-greedy strategy, obtains the sample state information at the next moment and, after each sample agent executes its predicted target action, obtains the reward value of each sample agent at the next moment, i.e. S'_{t-1}, A'_{t-1}, R'_t and S'_t. Then S'_{t-1}, A'_{t-1}, R'_t and S'_t are stored in the Experience Replay unit of the reinforcement learning layer. Here t-1 represents the current moment; S'_{t-1} represents the sample state information at the current moment; A'_{t-1} represents the predicted target action of each sample agent at the current moment; R'_t represents the reward value of each sample agent at the next moment after executing the predicted target action; S'_t represents the sample state information at the next moment.
Step four: select a batch of data from the Experience Replay unit, take the output of the Target-Network unit in the reinforcement learning layer as the ground truth, and train the reinforcement learning layer and the Transformer layer based on the ground truth and the reward loss function.
Step five: update the hyper-parameters based on the iteration result and repeat steps two to five while the number of iterations does not exceed the maximum number of iterations; once it does, the trained multi-agent control model is obtained.
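Steps one to five can be summarized in the following hedged training-loop sketch; the environment and network interfaces (reset, step, best_actions, update, load_from) and the ε-greedy details are assumptions made only for illustration:

    import random

    def train(env, transformer_layer, q_network, target_network, replay,
              max_iterations=10_000, batch_size=64, epsilon=0.1, sync_every=100):
        state = env.reset()                                    # step 1: initialize sample agents / Center Storage
        for it in range(max_iterations):
            executable = transformer_layer(state)              # step 2: predicted executable actions per agent
            if random.random() < epsilon:                      # step 3: ε-greedy choice of predicted target actions
                actions = [random.choice(acts) for acts in executable]
            else:
                actions = q_network.best_actions(state, executable)
            next_state, reward = env.step(actions)
            replay.push(state, actions, reward, next_state)    # store (S'_{t-1}, A'_{t-1}, R'_t, S'_t)
            if len(replay.buffer) >= batch_size:               # step 4: train against Target-Network output
                batch = replay.sample(batch_size)
                q_network.update(batch, target_network, loss_fn=reward_loss)
            if it % sync_every == 0:                           # copy Q-Network weights to Target-Network
                target_network.load_from(q_network)
            state = next_state                                 # step 5 (hyper-parameter updates omitted)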
In the embodiment of the invention, the multi-agent control model is trained based on the reward loss function, and the reward loss function is determined based on the intersection area between the patrolled areas of any two agents and the area of the non-patrolled region within the target area, within a preset time length from when the agents start patrolling the target area. Based on the trained multi-agent control model, the target area can be patrolled more efficiently and with fewer non-patrolled areas.
Based on the content of the foregoing embodiments, the transform layer includes: a format conversion unit and an action determination unit.
Correspondingly, inputting the state information of the current moment into the Transformer layer and acquiring the executable action of each agent at the next moment output by the Transformer layer specifically includes: inputting the state information of the current moment into the format conversion unit, and sorting the agents by the format conversion unit based on the position information of each agent at the current moment, so as to obtain the agent attribute information sequence of the current moment output by the format conversion unit; wherein the attribute information of an agent includes the location information of the agent.
Specifically, for the agent i for patrolling the target area, the position information index (i) of the agent i at the current time is as follows:
index(i) = i_x + i_y
where i_x and i_y represent the coordinate values of agent i at the current moment in the planar rectangular coordinate system.
Based on the position information index (i) of each agent at the current time, the agents are sorted in an ascending order, and the obtained agent attribute information sequence at the current time is as follows:
G = [g_1, g_2, …, g_i, …, g_{I-1}, g_I]^T
where G represents the agent attribute information sequence at the current moment; g_i represents the attribute information (attribute vector) of agent i at the current moment.
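A minimal sketch of the sorting performed by the format conversion unit, assuming each attribute vector follows the ordering [e, r, x, y] used above, could read:

    import numpy as np

    def format_conversion(attribute_vectors):
        """Sort agent attribute vectors in ascending order of index(i) = i_x + i_y."""
        order = sorted(range(len(attribute_vectors)),
                       key=lambda i: attribute_vectors[i][2] + attribute_vectors[i][3])  # x + y
        return np.stack([attribute_vectors[i] for i in order])  # G = [g_1, ..., g_I]^T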
And inputting the attribute information sequence of the agent at the current moment and the area information of the target area into the action determining unit, and acquiring the executable action of each agent at the next moment output by the action determining unit.
Fig. 4 is a schematic structural diagram of the action determining unit in the multi-agent control method provided by the present invention. Since the input of the action determining unit is the agent attribute information sequence at the current moment, which is already vectorized and already includes the position information of each agent, the action determining unit in the embodiment of the present invention, as shown in fig. 4, removes the Embedding unit and the Positional Encoding unit of the conventional Transformer network model.
In fig. 4, Inputs of the action determining unit are agent attribute information sequences at the current time, and Outputs are agent attribute information sequences at the current time after right shift operation.
After the agent attribute information sequence and the area information of the target area at the current time are input to the action determining unit, the action determining unit may obtain and output an executable action of each agent at the next time based on the agent attribute information sequence and the area information of the target area at the current time.
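Because the input sequence is already vectorized and already carries the position information, a sketch of the action determination unit can feed the sorted attribute sequence directly into a Transformer encoder without any Embedding or Positional Encoding step; the layer sizes and the encoder-only structure below are assumptions for illustration:

    import torch
    import torch.nn as nn

    class ActionDeterminationUnit(nn.Module):
        """Maps the sorted agent attribute sequence (plus area features) to per-agent action scores."""
        def __init__(self, attr_dim=4, area_dim=16, d_model=64, nhead=4, num_layers=2, num_actions=360):
            super().__init__()
            self.input_proj = nn.Linear(attr_dim + area_dim, d_model)  # no Embedding / Positional Encoding unit
            encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.head = nn.Linear(d_model, num_actions)

        def forward(self, agent_seq: torch.Tensor, area_feat: torch.Tensor) -> torch.Tensor:
            # agent_seq: (batch, I, attr_dim); area_feat: (batch, area_dim) broadcast to every agent
            area = area_feat.unsqueeze(1).expand(-1, agent_seq.size(1), -1)
            h = self.encoder(self.input_proj(torch.cat([agent_seq, area], dim=-1)))
            return self.head(h)  # (batch, I, num_actions): scores over each agent's executable actions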
According to the embodiment of the invention, the state information of the current moment is input into the format conversion unit of the Transformer layer, and the agent attribute information sequence of the current moment output by the format conversion unit is obtained. The agent attribute information sequence of the current moment and the area information of the target area are then input into the action determining unit of the Transformer layer, and the executable actions of each agent at the next moment output by the action determining unit are obtained. The format conversion unit converts the state information of the current moment into a vectorized information sequence that the action determining unit can recognize; the action determining unit can determine the relation among the agents and, based on that relation, the executable actions of each agent at the next moment, so that control of multiple agents can be realized on the basis of a traditional reinforcement learning network model.
Based on the content of the foregoing embodiments, the reinforcement learning layer includes: a probability distribution unit and a result output unit.
Correspondingly, inputting the state information of the current moment and the executable action of each agent at the next moment into the reinforcement learning layer, and acquiring the target action of each agent at the next moment output by the reinforcement learning layer, specifically comprising: and inputting the state information of the current moment and the executable action of each agent at the next moment into the probability distribution unit, and acquiring the probability distribution of the executable action of each agent at the next moment output by the probability distribution unit.
Specifically, after the state information of the current time and the executable action of each agent at the next time are input into the probability distribution unit, the probability distribution unit may obtain and output a probability distribution of the executable action of each agent at the next time. The probability distribution of the executable actions of the agent at the next time may be used to describe the probability of executing each executable action by the agent at the next time. The probability distribution of the agent's executable actions at the next time may include a probability value of each executable action performed by the agent at the next time.
As shown in FIG. 3, the action vector in the Q-Network cell represents the probability distribution of the executable action for each agent at the next time. The probability distribution of the executable action of each agent at the next moment can be represented by the probability vector corresponding to each agent at the next moment. The probability vector corresponding to agent i at the next moment is as follows:
[p_1, p_2, …, p_k]
where p_t represents the probability that agent i rotates clockwise by t degrees at the next moment, t ∈ {1, 2, …, k}.
And inputting the probability distribution of the executable action of each agent at the next moment into the result output unit, and acquiring the target action of each agent at the next moment output by the result output unit.
Specifically, after the probability distribution of the executable actions of each agent at the next moment is input into the result output unit, the result output unit may take the executable action with the highest probability in that distribution as the target action of each agent at the next moment, and the target action of each agent at the next moment output by the result output unit is thereby obtained.
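The two units of the reinforcement learning layer can be sketched as a softmax over per-action scores restricted to the executable actions, followed by an argmax; the masking and softmax steps are assumptions about how the probability distribution is produced:

    import torch
    import torch.nn.functional as F

    def select_target_actions(action_scores: torch.Tensor, executable_mask: torch.Tensor):
        """Probability distribution unit + result output unit.

        action_scores: (I, k) per-agent scores over the k rotation angles.
        executable_mask: (I, k), 1 where an action is executable at the next moment, else 0.
        """
        masked = action_scores.masked_fill(executable_mask == 0, float("-inf"))
        probs = F.softmax(masked, dim=-1)      # probability distribution over executable actions
        target = probs.argmax(dim=-1) + 1      # highest-probability action; +1 maps index to degrees
        return probs, target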
According to the embodiment of the invention, the state information of the current moment and the executable actions of each agent at the next moment are input into the probability distribution unit of the reinforcement learning layer, and the probability distribution of the executable actions of each agent at the next moment output by the probability distribution unit is obtained. That probability distribution is then input into the result output unit of the reinforcement learning layer, and the target action of each agent at the next moment output by the result output unit is obtained. In this way, the target action of each agent at the next moment can be acquired more accurately and more efficiently on the basis of the reinforcement learning layer, and the multiple agents can be controlled more efficiently.
Fig. 5 is a schematic structural diagram of a multi-agent control device provided by the present invention. The multi-agent control device provided by the present invention is described below with reference to fig. 5, and the multi-agent control device described below and the multi-agent control method provided by the present invention described above may be referred to correspondingly. As shown in fig. 5, the apparatus includes: an information acquisition module 501, a model calculation module 502, and an agent control module 503.
An information obtaining module 501, configured to obtain attribute information of each agent and area information of the target area as state information of the current time.
The model calculation module 502 is configured to input the state information of the current time into the multi-agent control model, and obtain the target action of each agent at the next time output by the multi-agent control model.
The agent control module 503 is configured to generate a control instruction corresponding to each agent at the next time based on the target action of each agent at the next time, so as to control each agent based on the control instruction corresponding to each agent at the next time.
Each agent is used for patrolling a target area; the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model.
Specifically, the information acquisition module 501, the model calculation module 502, and the agent control module 503 are electrically connected.
For each agent, the information obtaining module 501 may obtain the attribute information of the agent at the current time in various ways, for example: the attribute information of the agent at the current time may be obtained based on the controller of the agent and various sensors. The information obtaining module 501 may obtain the area information of the target area at the current time in various ways, for example: the area information of the target area at the current moment can be acquired through various sensors.
The model calculation module 502 inputs the current state information into the multi-agent control model. The multi-agent control model can acquire and output the target action of each agent at the next moment based on the state information at the current moment.
Agent control module 503 may generate a control command corresponding to each agent at the next time based on the target action of each agent at the next time. For each agent, after generating the control instruction corresponding to the agent at the next time, the control instruction corresponding to the agent at the next time may be sent to the controller of the agent. The controller of the agent can respond to the control instruction corresponding to the agent at the next moment, and control the action of the agent at the next moment.
According to the embodiment of the invention, state information comprising the attribute information of each agent and the area information of the target area at the current moment is input into the multi-agent control model, and the target action of each agent at the next moment output by the model is obtained. A control instruction corresponding to each agent at the next moment is generated based on that target action, and each agent is controlled based on the control instruction corresponding to it at the next moment.
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 6: a processor (processor) 610, a communication interface (Communications Interface) 620, a memory (memory) 630 and a communication bus 640, wherein the processor 610, the communication interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a multi-agent control method comprising: acquiring attribute information of each agent and area information of a target area as state information of the current moment; inputting the state information of the current moment into the multi-agent control model, and acquiring the target action of each agent at the next moment output by the multi-agent control model; generating a control instruction corresponding to each agent at the next moment based on the target action of each agent at the next moment, and controlling each agent based on the control instruction corresponding to each agent at the next moment; each agent is used for patrolling a target area; the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model.
In addition, the logic instructions in the memory 630 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of executing the multi-agent control method provided by the above methods, the method comprising: acquiring attribute information of each agent and area information of a target area as state information of the current moment; inputting the state information of the current moment into the multi-agent control model, and acquiring the target action of each agent at the next moment output by the multi-agent control model; generating a control instruction corresponding to each agent at the next moment based on the target action of each agent at the next moment, and controlling each agent based on the control instruction corresponding to each agent at the next moment; each agent is used for patrolling a target area; the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-agent control method provided by the above methods, the method comprising: acquiring attribute information of each agent and area information of a target area as state information of the current moment; inputting the state information of the current moment into the multi-agent control model, and acquiring the target action of each agent at the next moment output by the multi-agent control model; generating a control instruction corresponding to each agent at the next moment based on the target action of each agent at the next moment, and controlling each agent based on the control instruction corresponding to each agent at the next moment; each agent is used for patrolling a target area; the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A multi-agent control method, comprising:
acquiring attribute information of each agent and area information of a target area as state information of the current moment;
inputting the state information of the current moment into a multi-agent control model, and acquiring the target action of each agent at the next moment output by the multi-agent control model;
generating a control instruction corresponding to each agent at the next moment based on the target action of each agent at the next moment, so as to control each agent based on the control instruction corresponding to each agent at the next moment;
each intelligent agent is used for patrolling the target area; the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model.
2. The multi-agent control method according to claim 1, wherein the multi-agent control model comprises: a Transformer layer and a reinforcement learning layer;
correspondingly, the inputting the state information of the current time into the multi-agent control model, and acquiring the target action of each agent at the next time output by the multi-agent control model specifically includes:
inputting the state information of the current moment into a Transformer layer, and acquiring the executable action of each intelligent agent at the next moment output by the Transformer layer;
and inputting the state information of the current moment and the executable action of each intelligent agent at the next moment into the reinforcement learning layer, determining one of the executable actions of each intelligent agent at the next moment as the target action of each intelligent agent at the next moment by the reinforcement learning layer, and further acquiring the target action of each intelligent agent at the next moment output by the reinforcement learning layer.
3. The multi-agent control method according to claim 2, wherein the loss function of the multi-agent control model comprises: a reward loss function;
and the reward loss function is used to minimize, from the moment each agent starts patrolling the target area and within a preset time length, both the intersection area between the regions patrolled by any two agents and the area of the region within the target area that has not yet been patrolled.
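One way to read claim 3 is as a reward that penalizes both pairwise overlap of patrolled regions and the remaining unpatrolled area. The sketch below approximates regions by sets of grid cells; the discretization and the equal weighting of the two penalty terms are assumptions, not part of the disclosure:

def patrol_reward(patrolled_cells_per_agent, target_area_cells):
    # patrolled_cells_per_agent: list of sets of grid cells, one set per agent,
    # covering the preset time length since patrolling began.
    # target_area_cells: set of grid cells making up the target area.
    sets = list(patrolled_cells_per_agent)
    # Overlap term: cells patrolled by more than one agent (pairwise intersections).
    overlap = 0
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            overlap += len(sets[i] & sets[j])
    # Uncovered term: cells of the target area not yet patrolled by any agent.
    covered = set().union(*sets) if sets else set()
    uncovered = len(target_area_cells - covered)
    # Minimizing overlap and uncovered area corresponds to maximizing this reward.
    return -(overlap + uncovered)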
4. The multi-agent control method of claim 2, wherein the Transformer layer comprises: a format conversion unit and an action determination unit;
correspondingly, the inputting the state information of the current moment into the Transformer layer, and acquiring the executable action of each agent at the next moment output by the Transformer layer specifically includes:
inputting the state information of the current moment into the format conversion unit, and sorting the agents by the format conversion unit based on the position information of each agent at the current moment, so as to obtain the agent attribute information sequence of the current moment output by the format conversion unit; wherein the attribute information of each agent comprises the position information of the agent;
and inputting the agent attribute information sequence of the current moment and the area information of the target area into the action determination unit, and acquiring the executable action of each agent at the next moment output by the action determination unit.
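The format conversion step of claim 4 can be pictured as ordering the per-agent attribute records by position before they are fed to the action determination unit. The lexicographic (x, y) sort key below is an assumption; the claim only states that the agents are sorted based on their position information:

def to_agent_sequence(agent_attributes):
    # agent_attributes: list of dicts, each containing at least a "position" (x, y) entry.
    return sorted(agent_attributes, key=lambda a: a["position"])

agents = [
    {"id": 2, "position": (5.0, 1.0), "battery": 0.7},
    {"id": 0, "position": (1.0, 3.0), "battery": 0.9},
    {"id": 1, "position": (1.0, 2.0), "battery": 0.4},
]
sequence = to_agent_sequence(agents)   # ordered attribute sequence for the action determination unit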
5. The multi-agent control method of claim 2, wherein the reinforcement learning layer comprises: a probability distribution unit and a result output unit;
correspondingly, the inputting the state information of the current moment and the executable action of each agent at the next moment into the reinforcement learning layer, and acquiring the target action of each agent at the next moment output by the reinforcement learning layer specifically includes:
inputting the state information of the current moment and the executable action of each agent at the next moment into the probability distribution unit, and acquiring the probability distribution over the executable actions of each agent at the next moment output by the probability distribution unit;
and inputting the probability distribution over the executable actions of each agent at the next moment into the result output unit, and acquiring the target action of each agent at the next moment output by the result output unit.
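As a hedged illustration of claim 5, the probability distribution unit can be thought of as assigning a probability to each executable action, and the result output unit as selecting one of them; whether the selection samples from the distribution or takes its maximum is not stated in the claim, so the argmax below is an assumption, as is the score_fn stand-in for a learned policy head:

import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_target_action(executable_actions, score_fn):
    # score_fn(action) -> float is an assumed scoring function (e.g. a policy head).
    probs = softmax([score_fn(a) for a in executable_actions])
    best = max(range(len(executable_actions)), key=lambda i: probs[i])
    return executable_actions[best], probs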
6. The multi-agent control method according to any one of claims 1 to 3, wherein the attribute information comprises: a remaining battery level of the agent, a radius of the visible range of the agent, and position information of the agent.
7. A multi-agent control device, comprising:
the information acquisition module is used for acquiring the attribute information of each agent and the area information of the target area as the state information of the current moment;
the model calculation module is used for inputting the state information of the current moment into the multi-agent control model and acquiring the target action of each agent at the next moment output by the multi-agent control model;
the agent control module is used for generating a control instruction corresponding to each agent at the next moment based on the target action of each agent at the next moment, so as to control each agent based on the control instruction corresponding to that agent at the next moment;
wherein each agent is used for patrolling the target area; and the multi-agent control model is constructed based on a Transformer network model and a reinforcement learning network model.
8. An electronic device comprising a memory, a processor and a computer program stored on said memory and executable on said processor, characterized in that said processor implements the multi-agent control method according to any of claims 1 to 6 when executing said program.
9. A non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the multi-agent control method of any of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program when executed by a processor implements the multi-agent control method as claimed in any one of claims 1 to 6.
CN202210199344.4A 2022-03-02 2022-03-02 Multi-agent control method and device Pending CN114545776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210199344.4A CN114545776A (en) 2022-03-02 2022-03-02 Multi-agent control method and device

Publications (1)

Publication Number Publication Date
CN114545776A true CN114545776A (en) 2022-05-27

Family

ID=81662445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210199344.4A Pending CN114545776A (en) 2022-03-02 2022-03-02 Multi-agent control method and device

Country Status (1)

Country Link
CN (1) CN114545776A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912356A (en) * 2023-09-13 2023-10-20 深圳大学 Hexagonal set visualization method and related device
CN116912356B (en) * 2023-09-13 2024-01-09 深圳大学 Hexagonal set visualization method and related device

Similar Documents

Publication Publication Date Title
CN110928189B (en) Robust control method based on reinforcement learning and Lyapunov function
CN111191934B (en) Multi-target cloud workflow scheduling method based on reinforcement learning strategy
Chen et al. Self-learning exploration and mapping for mobile robots via deep reinforcement learning
US20230367934A1 (en) Method and apparatus for constructing vehicle dynamics model and method and apparatus for predicting vehicle state information
CN115917564A (en) System and method for learning reusable options to transfer knowledge between tasks
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN114970351A (en) Power grid flow adjustment method based on attention mechanism and deep reinforcement learning
CN111830822A (en) System for configuring interaction with environment
CN114545776A (en) Multi-agent control method and device
KR20220154785A (en) Learning options for action selection using meta-gradients in multi-task reinforcement learning
CN116494247A (en) Mechanical arm path planning method and system based on depth deterministic strategy gradient
CN112613608A (en) Reinforced learning method and related device
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
CN114239974A (en) Multi-agent position prediction method and device, electronic equipment and storage medium
CN114037048A (en) Belief consistency multi-agent reinforcement learning method based on variational cycle network model
CN117784832A (en) Control method and device for testing vehicle speed of test vehicle and electronic equipment
CN112749617A (en) Determining output signals by aggregating parent instances
CN115630566B (en) Data assimilation method and system based on deep learning and dynamic constraint
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
CN116578080A (en) Local path planning method based on deep reinforcement learning
CN116533234A (en) Multi-axis hole assembly method and system based on layered reinforcement learning and distributed learning
CN116400726A (en) Rotor unmanned aerial vehicle escape method and system based on reinforcement learning
Li et al. Research on path planning of cloud robot in dynamic environment based on improved DDPG algorithm
CN111950691A (en) Reinforced learning strategy learning method based on potential action representation space
CN115617033B (en) Ship formation method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination