CN114298244A - Decision control method, device and system for intelligent agent group interaction - Google Patents
- Publication number
- CN114298244A (application number CN202111676244.8A)
- Authority
- CN
- China
- Prior art keywords
- decision control
- model
- behavior
- state
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a decision control method, a decision control device and a decision control system for intelligent agent group interaction. The decision control device comprises an initial interaction unit, a model training unit and a decision control unit. The decision control system also comprises a decision control module and a data storage module. By constructing an initial decision control model comprising a top-level learning model and a bottom-level learning model and performing top-level and bottom-level fusion training on the initial decision control model, a final decision control model is obtained and then decision control is performed.
Description
Technical Field
The invention belongs to the field of decision control of intelligent agent group interaction, and relates to a decision control method, a decision control device and a decision control system of intelligent agent group interaction.
Background
In large-scale group interaction scenarios, such as massively multiplayer online role-playing games, equity trading markets, online advertisement auctions, urban traffic flow, and military intelligent swarms, massive numbers of individuals act concurrently on the same environment and adjust their own strategies in real time; this dynamism and scale pose new challenges for multi-agent reinforcement learning algorithms.
In the prior art, decision control over group interaction is generally performed through the MADDPG algorithm based on the Centralized Training with Decentralized Execution (CTDE) learning paradigm, the VDN algorithm based on the value-decomposition idea, or a learning method based on mean field theory. The MADDPG algorithm uses a centrally controlled Critic network to acquire the states, behaviors, and target policies of all individuals during the training stage, while each agent's Actor makes decisions according to local information during the execution stage. In the VDN algorithm, each agent maximizes a local gain function so as to maximize the global gain function, thereby achieving cooperation among multiple agents by modeling the interaction among individuals. The learning method based on mean field theory expresses state and action information macroscopically at the group level, thereby better alleviating the problems of dimensional explosion and complex interaction in group decision-making.
However, the prior art still has the following defect: coordination among individuals, coordination between individuals and neighboring agents, and coordination between groups cannot be taken into account simultaneously, so the decision control effect during group interaction is poor.
Therefore, there is a need for a decision control method, device and system for intelligent agent group interaction that overcome the above-mentioned drawbacks of the prior art.
Disclosure of Invention
In view of the above technical problems, an object of the present invention is to provide a decision control method, device and system for agent group interaction, so as to improve the effectiveness of decision control during agent group interaction.
The invention provides a decision control method for group interaction of an agent, which comprises the following steps: acquiring a preset initial decision control model, and enabling an intelligent agent group to perform group interaction according to the initial decision control model so as to acquire an initial decision control data set; the initial decision control model comprises a top-layer learning model and a bottom-layer learning model; training the top-level learning model and the bottom-level learning model by using the initial decision control data set so as to obtain a final decision control model; and carrying out decision control on the group interaction of the intelligent agent according to the final decision control model.
In one embodiment, acquiring a preset initial decision control model and enabling an agent group to perform group interaction according to the initial decision control model, thereby obtaining an initial decision control data set, specifically comprises: acquiring a preset initial decision control model and a preset opponent model of an opponent, initializing a preset group interaction platform, and acquiring a first state of an agent and a second state of the opponent, the initial decision control model comprising a local neural network; inputting the first state into the local neural network to obtain a first behavior and a first reward, inputting the second state into the opponent model to obtain a second behavior and a second reward, and storing the first state, the second state, the first behavior, the second behavior, the first reward, and the second reward into the initial decision control data set; inputting the first behavior and the second behavior into the group interaction platform, thereby correspondingly obtaining a third state of the agent and a fourth state of the opponent; and inputting the third state into the local neural network to obtain a third behavior and a third reward, inputting the fourth state into the opponent model to obtain a fourth behavior and a fourth reward, and storing the third state, the fourth state, the third behavior, the fourth behavior, the third reward, and the fourth reward into the initial decision control data set.
In one embodiment, training the top-level learning model and the bottom-level learning model using the initial decision control data set to obtain a final decision control model specifically comprises: dividing the agent group into a corresponding number of groups according to a preset group count, and acquiring the average behavior value and the reward sum of each group according to the initial decision control data set; acquiring a learning target according to the average behavior value and the reward sum of each group; training the top-level learning model according to the learning target and the initial decision control data set to obtain a first top-level model and a corresponding first mean neural network, training the bottom-level learning model according to the first mean neural network and the initial decision control data set, and recording the number of training iterations; judging whether the number of training iterations reaches a preset threshold; and when it does, stopping training and outputting the final decision control model.
In one embodiment, after judging whether the number of training iterations reaches the preset threshold, the method further comprises: when the number of training iterations has not reached the preset threshold, continuing the model training.
The invention also provides a decision control device for group interaction of the intelligent agents, which comprises an initial interaction unit, a model training unit and a decision control unit, wherein the initial interaction unit is used for acquiring a preset initial decision control model, so that the group interaction of the intelligent agent group is carried out according to the initial decision control model, and an initial decision control data set is acquired; the model training unit is used for training a preset top-layer learning model and a preset bottom-layer learning model by using the initial decision control data set so as to obtain a final decision control model; and the decision control unit is used for carrying out decision control on the group interaction of the intelligent agent according to the final decision control model.
In one embodiment, the initial interaction unit is further configured to: acquire a preset initial decision control model and a preset opponent model of an opponent, initialize a preset group interaction platform, and acquire a first state of an agent and a second state of the opponent, the initial decision control model comprising a local neural network; input the first state into the local neural network to obtain a first behavior and a first reward, input the second state into the opponent model to obtain a second behavior and a second reward, and store the first state, the second state, the first behavior, the second behavior, the first reward, and the second reward into the initial decision control data set; input the first behavior and the second behavior into the group interaction platform, thereby correspondingly obtaining a third state of the agent and a fourth state of the opponent; and input the third state into the local neural network to obtain a third behavior and a third reward, input the fourth state into the opponent model to obtain a fourth behavior and a fourth reward, and store the third state, the fourth state, the third behavior, the fourth behavior, the third reward, and the fourth reward into the initial decision control data set.
In one embodiment, the model training unit is further configured to: divide the agent group into a corresponding number of groups according to a preset group count, and acquire the average behavior value and the reward sum of each group according to the initial decision control data set; acquire a learning target according to the average behavior value and the reward sum of each group; train the top-level learning model according to the learning target and the initial decision control data set to obtain a first top-level model and a corresponding first mean neural network, train the bottom-level learning model according to the first mean neural network and the initial decision control data set, and record the number of training iterations; judge whether the number of training iterations reaches a preset threshold; and when it does, stop training and output the final decision control model.
The invention also provides a decision control system for agent group interaction, comprising a decision control module and a data storage module, the decision control module being communicatively connected to the data storage module, wherein the decision control module is configured to perform decision control of group interaction on the agent group according to the above decision control method for agent group interaction, and the data storage module is configured to store all data.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
The invention provides a decision control method, device and system for agent group interaction which, by constructing an initial decision control model comprising a top-level learning model and a bottom-level learning model and performing top-bottom fusion training on it, obtain a final decision control model used for decision control, thereby improving the effectiveness of decision control during agent group interaction.
Drawings
The invention will be further described with reference to the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of one embodiment of a method for decision control of agent population interaction in accordance with the present invention;
FIG. 2 shows a schematic of the training of the top-level learning model and the bottom-level learning model;
FIG. 3 is a block diagram illustrating one embodiment of an intelligent agent group interaction decision control apparatus in accordance with the present invention;
FIG. 4 is a block diagram illustrating one embodiment of an intelligent agent group interaction decision control system in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First Embodiment
The embodiment of the invention first describes a decision control method for intelligent agent group interaction. FIG. 1 is a flow chart illustrating one embodiment of a method for decision control of agent population interaction in accordance with the present invention.
As shown in fig. 1, the decision control method includes the following steps:
S1: acquiring a preset initial decision control model, and enabling the agent group to perform group interaction according to the initial decision control model, thereby acquiring an initial decision control data set.
The initial decision control model includes a top-level learning model and a bottom-level learning model.
In one embodiment, acquiring a preset initial decision control model and enabling an agent group to perform group interaction according to the initial decision control model, thereby obtaining an initial decision control data set, specifically comprises: acquiring a preset initial decision control model and a preset opponent model of an opponent, initializing a preset group interaction platform, and acquiring a first state of an agent and a second state of the opponent, the initial decision control model comprising a local neural network; inputting the first state into the local neural network to obtain a first behavior and a first reward, inputting the second state into the opponent model to obtain a second behavior and a second reward, and storing the first state, the second state, the first behavior, the second behavior, the first reward, and the second reward into the initial decision control data set; inputting the first behavior and the second behavior into the group interaction platform, thereby correspondingly obtaining a third state of the agent and a fourth state of the opponent; and inputting the third state into the local neural network to obtain a third behavior and a third reward, inputting the fourth state into the opponent model to obtain a fourth behavior and a fourth reward, and storing the third state, the fourth state, the third behavior, the fourth behavior, the third reward, and the fourth reward into the initial decision control data set.
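The data-collection procedure above can be sketched as follows. Here `local_net`, `opponent_model`, and `platform_step` are toy stand-ins for the local neural network, the opponent model, and the group interaction platform — all hypothetical placeholders, not the patent's actual components:

```python
def local_net(state):
    """Toy local neural network: maps a state to (behavior, reward)."""
    return state * 2, float(state >= 0)

def opponent_model(state):
    """Toy opponent model: maps a state to (behavior, reward)."""
    return -state, float(state < 0)

def platform_step(behavior_a, behavior_b):
    """Toy group-interaction platform: returns the next states of both sides."""
    return behavior_a + 1, behavior_b - 1

def collect_initial_dataset(first_state, second_state):
    """Two interaction rounds stored into the initial decision control data set."""
    dataset = []
    b1, r1 = local_net(first_state)        # first behavior / first reward
    b2, r2 = opponent_model(second_state)  # second behavior / second reward
    dataset.append((first_state, second_state, b1, b2, r1, r2))
    third_state, fourth_state = platform_step(b1, b2)
    b3, r3 = local_net(third_state)        # third behavior / third reward
    b4, r4 = opponent_model(fourth_state)  # fourth behavior / fourth reward
    dataset.append((third_state, fourth_state, b3, b4, r3, r4))
    return dataset
```

The two appended tuples mirror the two storage steps of the embodiment: one before and one after the platform transitions the states.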
S2: training the top-level learning model and the bottom-level learning model using the initial decision control data set, thereby obtaining a final decision control model.
To further illustrate the initial decision control model, FIG. 2 shows a schematic of the training of the top-level learning model and the bottom-level learning model.
Here, top-level learning coordinates the collaboration among the several groups from a macroscopic perspective. Using the CTDE learning paradigm, the learned cooperative information is applied to the mean field Q-value network of each group: the mean field information of the multiple groups is used to obtain the mean field Q value corresponding to each group, and these are summed to obtain a global Q value. Meanwhile, a global reward is obtained from the true rewards received by each agent, forming the top-level learning target. The parameters of the mean field Q-value network are updated based on this target, and the updated network is passed down to the bottom-level groups for learning. The bottom-level model realizes cooperation among the agents within a group based on the mean field idea and an attention mechanism. Bottom-level learning is the collaborative learning of agents within the same group; in this process a group receives only the mean field Q network whose top-level update has been completed, and receives no information about the agents of other groups. Each group contains a plurality of agents.
In practical applications, in the top-level learning phase, the top layer relies on the state information set $o_i$ and the action information set $a_i$ of each group, the weight matrix $w(x_i)$ passed up by the bottom layer, and the statistical group information $\mu(x_i)$, where $i$ is the group index, $k$ is the agent index, and $x_i = \mathrm{concatenate}(o_i, a_i)$, i.e., the concatenation of the state information and the action information. $\mu(x_i)$ is computed as:

$$\mu(x_i) = \frac{1}{N(i)} \sum_{k=1}^{N(i)} x_i^k, \qquad x_i^k = \mathrm{concatenate}(o_i^k, a_i^k)$$

where $N(i)$ denotes the number of agents in the $i$-th group, $o_i^k$ denotes the state information of the $k$-th agent in the $i$-th group, and $a_i^k$ denotes the action information of the $k$-th agent in the $i$-th group.
Subsequently, the statistical information $\mu(x_i)$ is passed to the mean field Q-value network, which computes the $Q^{MF}(\mu(x_i))$ corresponding to each group; finally a global Q value $Q_{tot}$ is obtained through the summation module:

$$Q_{tot} = \sum_{i=1}^{m} Q^{MF}(\mu(x_i))$$

where $m$ is the number of groups.
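The summation module then reduces to a plain sum over the groups; `mf_q_net` below is a hypothetical stand-in for the trained mean field Q-value network:

```python
def global_q(mean_fields, mf_q_net):
    """Q_tot: sum over groups of the mean-field Q values Q^MF(mu(x_i))."""
    return sum(mf_q_net(mu) for mu in mean_fields)
```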
After the global Q value is obtained, based on the CTDE idea, $Q_{tot}$ is optimized by minimizing a loss function $L_{tot}$, which in turn optimizes $Q^{MF}(\mu(x_i))$. The loss function takes the form:

$$L_{tot} = \left(y_{tot} - Q_{tot}\right)^2$$

where $y_{tot}$ is the top-level learning target formed from the global reward.
the independent local maximum (IGM) principle is used here, i.e. when the average field Q value Q of each packet is QMF(μ(xi) When all reach the maximum, the global Q value QtotAnd is maximized accordingly. This model is also based on the idea of CTDE to combine the mean field Q values Q of other packets during the top-level learning processMF(μ(xi) ) is centrally trained, but only the mean field Q of the packet is used in the underlying packetMF(μ(xi) Average field Q values for all packets are not collected.
In practical application, in the bottom-level learning phase, the group mean information $\mu(x_i)$ serves as the central node. The importance weight $w_i^k$ of each agent in the $i$-th group is computed based on the attention mechanism, where $k$ is the agent index:

$$w_i^k = \mathrm{softmax}_k\!\left( (W_Q\, \mu(x_i))^{\top} (W_K\, x_i^k) \right)$$

where $W_K$ is the parameter of the $K_{key}$ network and $W_Q$ is the parameter of the $K_{query}$ network; the $K_{key}$ and $K_{query}$ networks are embedding networks internal to the attention module. Each agent obtains its local Q value $Q_i^k$ through a local Q-value network, where $t$ denotes the time step.
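The importance weights can be sketched with scaled dot-product attention. The exact score function is an assumption; `W_K` and `W_Q` stand in for the $K_{key}$ and $K_{query}$ network parameters:

```python
import numpy as np

def attention_weights(mu, xs, W_K, W_Q):
    """Softmax importance weight w_i^k of each agent in one group.
    mu: group mean (central node); xs: one row x_i^k per agent."""
    query = W_Q @ mu                              # query from the central node
    keys = xs @ W_K.T                             # one key per agent
    scores = keys @ query / np.sqrt(len(query))   # scaled dot-product scores
    e = np.exp(scores - scores.max())             # numerically stable softmax
    return e / e.sum()
```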
Subsequently, the mean field Q value $Q^{MF}(\mu(x_i))$ passed down from the top layer is fused, and the group local Q value $Q_i$ is then obtained through the weighted summation module:

$$Q_i = Q^{MF}(\mu(x_i)) + \sum_{k=1}^{N(i)} w_i^k\, Q_i^k$$

where $w_i^k$ is the importance of the $k$-th agent in group $i$. Through the group local Q value $Q_i$, both the cooperation of the multiple agents within the group and the cooperation between the bottom layer and the top layer are realized.
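The weighted summation module can be sketched as below; the additive fusion of $Q^{MF}$ with the weighted agent Q values is an assumption, since the text names only a fusion step followed by a weighted summation:

```python
import numpy as np

def group_local_q(weights, agent_qs, q_mf):
    """Group local Q: top-level mean-field Q fused with the
    attention-weighted sum of the agents' local Q values."""
    return q_mf + float(np.dot(weights, agent_qs))
```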
On this basis, the group Q function is trained by minimizing the loss function $L_i$:

$$L_i = \left(y_i - Q_i\right)^2, \qquad y_i = r_i + \gamma\, Q^{MF}(\mu(x_i'))$$

where $Q^{MF}(\mu(x_i))$ is the guideline passed from the top layer to the bottom layer, representing the mean field Q value delivered by the top layer; $r_i$ is the true reward value returned by the environment for group $i$; $\gamma$ is the discount factor; and $y_i$ is the training target of group $i$.
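A sketch of the group-level objective under the same hedged one-step form; using the top layer's mean field Q at the next step as the guideline inside the target is an assumption consistent with the description:

```python
def group_loss(q_group, reward_i, q_mf_next, gamma=0.99):
    """Squared TD error L_i = (y_i - Q_i)^2 with
    y_i = r_i + gamma * Q^MF(mu(x_i')) as the training target of group i."""
    y_i = reward_i + gamma * q_mf_next
    return (y_i - q_group) ** 2
```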
In one embodiment, training the top-level learning model and the bottom-level learning model using the initial decision control data set to obtain a final decision control model specifically comprises: dividing the agent group into a corresponding number of groups according to a preset group count, and acquiring the average behavior value and the reward sum of each group according to the initial decision control data set; acquiring a learning target according to the average behavior value and the reward sum of each group; training the top-level learning model according to the learning target and the initial decision control data set to obtain a first top-level model and a corresponding first mean neural network, training the bottom-level learning model according to the first mean neural network and the initial decision control data set, and recording the number of training iterations; judging whether the number of training iterations reaches a preset threshold; and when it does, stopping training and outputting the final decision control model.
S3: performing decision control on the group interaction of the agents according to the final decision control model.
In one embodiment, after judging whether the number of training iterations reaches the preset threshold, the method further comprises: when the number of training iterations has not reached the preset threshold, continuing the model training.
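The stopping rule above is a simple count check; `train_step` is a hypothetical callable performing one fused top-level plus bottom-level update:

```python
def train_until_threshold(train_step, threshold):
    """Run training steps and stop once the recorded number of
    training iterations reaches the preset threshold."""
    count = 0
    while count < threshold:
        train_step()     # one top-level + bottom-level fused update
        count += 1
    return count         # the final decision control model is output here
```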
This embodiment of the invention describes a decision control method for agent group interaction: by constructing an initial decision control model comprising a top-level learning model and a bottom-level learning model and performing top-bottom fusion training on it, a final decision control model is obtained and used for decision control, which improves the effectiveness of decision control during agent group interaction.
Second Embodiment
Besides the method, the embodiment of the invention also describes a decision control device for intelligent agent group interaction. Fig. 3 is a block diagram of an embodiment of a decision control device for group interaction of agents according to the present invention.
As shown, the decision control device includes an initial interaction unit 11, a model training unit 12, and a decision control unit 13.
The initial interaction unit 11 is configured to obtain a preset initial decision control model, so that the agent group performs group interaction according to the initial decision control model, thereby obtaining an initial decision control data set. In one embodiment, the initial interaction unit 11 is further configured to: acquire a preset initial decision control model and a preset opponent model of an opponent, initialize a preset group interaction platform, and acquire a first state of an agent and a second state of the opponent, the initial decision control model comprising a local neural network; input the first state into the local neural network to obtain a first behavior and a first reward, input the second state into the opponent model to obtain a second behavior and a second reward, and store the first state, the second state, the first behavior, the second behavior, the first reward, and the second reward into the initial decision control data set; input the first behavior and the second behavior into the group interaction platform, thereby correspondingly obtaining a third state of the agent and a fourth state of the opponent; and input the third state into the local neural network to obtain a third behavior and a third reward, input the fourth state into the opponent model to obtain a fourth behavior and a fourth reward, and store the third state, the fourth state, the third behavior, the fourth behavior, the third reward, and the fourth reward into the initial decision control data set.
The model training unit 12 is configured to train a preset top-level learning model and a preset bottom-level learning model using the initial decision control data set, thereby obtaining a final decision control model. In one embodiment, the model training unit 12 is further configured to: divide the agent group into a corresponding number of groups according to a preset group count, and acquire the average behavior value and the reward sum of each group according to the initial decision control data set; acquire a learning target according to the average behavior value and the reward sum of each group; train the top-level learning model according to the learning target and the initial decision control data set to obtain a first top-level model and a corresponding first mean neural network, train the bottom-level learning model according to the first mean neural network and the initial decision control data set, and record the number of training iterations; judge whether the number of training iterations reaches a preset threshold; and when it does, stop training and output the final decision control model.
And the decision control unit 13 is configured to perform decision control on group interaction of the agents according to the final decision control model.
The embodiment of the invention discloses a decision control device for agent group interaction. By constructing an initial decision control model comprising a top-level learning model and a bottom-level learning model and performing top-bottom fusion training on it, a final decision control model is obtained and decision control is then performed; the device thereby improves the effectiveness of decision control during agent group interaction.
Third Embodiment
Besides the method and the device, the invention also describes a decision control system for intelligent agent group interaction. FIG. 4 is a block diagram illustrating one embodiment of an intelligent agent group interaction decision control system in accordance with the present invention.
As shown in the figure, the decision control system includes a decision control module 1 and a data storage module 2, the decision control module 1 being communicatively connected to the data storage module 2; the decision control module 1 is configured to perform decision control of group interaction on the agent group according to the above decision control method for agent group interaction, and the data storage module 2 is configured to store all data.
In practical application, the decision control module 1 and the data storage module 2 are respectively connected to the intelligent agent group in a communication manner, so that the decision control module 1 can perform group interaction decision control on the intelligent agent group.
The embodiment of the invention discloses a decision control system for agent group interaction. By constructing an initial decision control model comprising a top-level learning model and a bottom-level learning model and performing top-bottom fusion training on it, a final decision control model is obtained and decision control is then performed; the system thereby improves the effectiveness of decision control during agent group interaction.
The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and are not intended to limit the scope of the present invention. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the invention, may occur to those skilled in the art and are intended to be included within the scope of the invention.
Claims (8)
1. A decision control method for group interaction of intelligent agents is characterized by comprising the following steps:
acquiring a preset initial decision control model, and enabling an intelligent agent group to perform group interaction according to the initial decision control model so as to acquire an initial decision control data set; the initial decision control model comprises a top-layer learning model and a bottom-layer learning model;
training the top-level learning model and the bottom-level learning model by using the initial decision control data set so as to obtain a final decision control model;
and carrying out decision control on the group interaction of the intelligent agent according to the final decision control model.
2. The method for controlling decision making for agent group interaction according to claim 1, wherein obtaining a preset initial decision control model, and enabling an agent group to perform group interaction according to the initial decision control model, thereby obtaining an initial decision control data set specifically comprises:
acquiring a preset initial decision control model and a preset opponent model of an opponent, initializing a preset group interaction platform, and acquiring a first state of an agent and a second state of the opponent; the initial decision control model comprises a local neural network;
inputting the first state into the local neural network to obtain a first behavior and a first reward, inputting the second state into the opponent model to obtain a second behavior and a second reward, and storing the first state, the second state, the first behavior, the second behavior, the first reward, and the second reward into the initial decision control data set;
inputting the first behavior and the second behavior into the group interaction platform, so as to correspondingly obtain a third state of the agent and a fourth state of the opponent;
inputting the third state into the local neural network to obtain a third behavior and a third reward, inputting the fourth state into the opponent model to obtain a fourth behavior and a fourth reward, and storing the third state, the fourth state, the third behavior, the fourth behavior, the third reward, and the fourth reward into the initial decision control data set.
3. The method according to claim 2, wherein the training of the top-level learning model and the bottom-level learning model using the initial decision control data set to obtain a final decision control model comprises:
dividing the agent group into the corresponding number of groups according to a preset group number, and acquiring the average behavior value and the reward sum of each group from the initial decision control data set;
acquiring a learning objective from the average behavior value and the reward sum of each group;
training the top-level learning model according to the learning objective and the initial decision control data set to obtain a first top-level model and a corresponding first mean neural network, training the bottom-level learning model according to the first mean neural network and the initial decision control data set, and recording the number of training iterations;
determining whether the number of training iterations reaches a preset threshold;
and when the number of training iterations reaches the preset threshold, stopping training and outputting the final decision control model.
4. The method according to claim 3, wherein after determining whether the number of training iterations reaches the preset threshold, the method further comprises:
and when the number of training iterations does not reach the preset threshold, continuing the model training.
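The training loop of claims 3 and 4 can likewise be sketched. The grouping and per-group statistics follow the claim wording (average behavior value and reward sum per group, iteration until a count threshold); the `train` internals that stand in for the top-level and bottom-level models are illustrative assumptions, not the patent's actual networks.

```python
import numpy as np

def group_statistics(behaviors, rewards, n_groups):
    """Split agents into n_groups and compute each group's average behavior
    value and reward sum, as recited in claim 3."""
    groups = np.array_split(np.arange(len(behaviors)), n_groups)
    mean_behaviors = [behaviors[g].mean() for g in groups]
    reward_sums = [rewards[g].sum() for g in groups]
    return mean_behaviors, reward_sums

def train(dataset, n_groups=2, max_iters=5):
    """Iterate until the preset threshold number of training iterations
    (claims 3-4). The model updates are placeholders."""
    iters = 0
    model = {"top": None, "bottom": None}
    while iters < max_iters:  # claim 4: continue training below the threshold
        # dataset tuples are (s_agent, s_opp, a_agent, a_opp, r_agent, r_opp)
        behaviors = np.concatenate([t[2] for t in dataset]).astype(float)
        rewards = np.concatenate([t[4] for t in dataset])
        mean_b, sum_r = group_statistics(behaviors, rewards, n_groups)
        # Learning objective from the per-group statistics (illustrative form).
        objective = sum(m + s for m, s in zip(mean_b, sum_r))
        model["top"] = objective          # stand-in for the first top-level model
        model["bottom"] = objective / 2   # stand-in for the bottom-level update
        iters += 1
    return model, iters
```

The real top-level training would yield the first mean neural network that conditions the bottom-level update; here both are collapsed into scalar placeholders to show only the control flow.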
5. A decision control device for intelligent agent group interaction, characterized by comprising an initial interaction unit, a model training unit, and a decision control unit, wherein:
the initial interaction unit is configured to acquire a preset initial decision control model, cause the agent group to perform group interaction according to the initial decision control model, and acquire an initial decision control data set;
the model training unit is configured to train a preset top-level learning model and a preset bottom-level learning model using the initial decision control data set to obtain a final decision control model;
and the decision control unit is configured to perform decision control on the group interaction of the agent group according to the final decision control model.
6. The decision control device for intelligent agent group interaction according to claim 5, wherein the initial interaction unit is further configured to:
acquire a preset initial decision control model and a preset opponent model of an opponent, initialize a preset group interaction platform, and acquire a first state of the agent group and a second state of the opponent, wherein the initial decision control model comprises a local neural network;
input the first state into the local neural network to obtain a first behavior and a first reward, input the second state into the opponent model to obtain a second behavior and a second reward, and store the first state, the second state, the first behavior, the second behavior, the first reward, and the second reward in the initial decision control data set;
input the first behavior and the second behavior into the group interaction platform to correspondingly obtain a third state of the agent group and a fourth state of the opponent;
input the third state into the local neural network to obtain a third behavior and a third reward, input the fourth state into the opponent model to obtain a fourth behavior and a fourth reward, and store the third state, the fourth state, the third behavior, the fourth behavior, the third reward, and the fourth reward in the initial decision control data set.
7. The decision control device for intelligent agent group interaction according to claim 6, wherein the model training unit is further configured to:
divide the agent group into the corresponding number of groups according to a preset group number, and acquire the average behavior value and the reward sum of each group from the initial decision control data set;
acquire a learning objective from the average behavior value and the reward sum of each group;
train the top-level learning model according to the learning objective and the initial decision control data set to obtain a first top-level model and a corresponding first mean neural network, train the bottom-level learning model according to the first mean neural network and the initial decision control data set, and record the number of training iterations;
determine whether the number of training iterations reaches a preset threshold;
and when the number of training iterations reaches the preset threshold, stop training and output the final decision control model.
8. A decision control system for intelligent agent group interaction, characterized by comprising a decision control module and a data storage module, wherein the decision control module is in communication connection with the data storage module; the decision control module is configured to perform group interaction decision control on the agent group according to the decision control method for intelligent agent group interaction of any one of claims 1 to 4, and the data storage module is configured to store all data.
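The system of claim 8 is two communicating modules. A minimal sketch of that wiring, with entirely illustrative class and method names (the patent specifies only the modules and their connection, not an API):

```python
class DataStorageModule:
    """Stores all data produced during decision control (claim 8)."""
    def __init__(self):
        self._records = []

    def store(self, record):
        self._records.append(record)

    def all_data(self):
        return list(self._records)

class DecisionControlModule:
    """Performs group interaction decision control and forwards records
    to the connected storage module."""
    def __init__(self, storage, model):
        self.storage = storage   # communication link to the storage module
        self.model = model       # the final decision control model (a callable here)

    def decide(self, state):
        behavior = self.model(state)
        self.storage.store((state, behavior))
        return behavior
```

Here the "communication connection" is a direct object reference; in a deployed system it could equally be a message queue or RPC channel.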
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111676244.8A CN114298244A (en) | 2021-12-31 | 2021-12-31 | Decision control method, device and system for intelligent agent group interaction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114298244A true CN114298244A (en) | 2022-04-08 |
Family
ID=80975692
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111676244.8A Pending CN114298244A (en) | 2021-12-31 | 2021-12-31 | Decision control method, device and system for intelligent agent group interaction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114298244A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115840892A (*) | 2022-12-09 | 2023-03-24 | 中山大学 | Multi-agent hierarchical autonomous decision-making method and system in complex environment |
CN115840892B (*) | 2022-12-09 | 2024-04-19 | 中山大学 | Multi-agent hierarchical autonomous decision-making method and system in complex environment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7537523B2 (en) | Dynamic player groups for interest management in multi-character virtual environments | |
CN110852448A (en) | Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning | |
CN111625361A (en) | Joint learning framework based on cooperation of cloud server and IoT (Internet of things) equipment | |
Xu et al. | Learning multi-agent coordination for enhancing target coverage in directional sensor networks | |
CN114415735B (en) | Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method | |
CN114546608B (en) | Task scheduling method based on edge calculation | |
CN114896899B (en) | Multi-agent distributed decision method and system based on information interaction | |
CN113642233B (en) | Group intelligent collaboration method for optimizing communication mechanism | |
CN112634019A (en) | Default probability prediction method for optimizing grey neural network based on bacterial foraging algorithm | |
CN117289691A (en) | Training method for path planning agent for reinforcement learning in navigation scene | |
CN114757362A (en) | Multi-agent system communication method based on edge enhancement and related device | |
CN115022231B (en) | Optimal path planning method and system based on deep reinforcement learning | |
CN114298244A (en) | Decision control method, device and system for intelligent agent group interaction | |
Kamra et al. | Deep fictitious play for games with continuous action spaces | |
Liu et al. | Learning communication for cooperation in dynamic agent-number environment | |
CN116340737A (en) | Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning | |
CN116992928A (en) | Multi-agent reinforcement learning method for fair self-adaptive traffic signal control | |
CN113592079B (en) | Collaborative multi-agent communication method oriented to large-scale task space | |
CN116367190A (en) | Digital twin function virtualization method for 6G mobile network | |
CN114757092A (en) | System and method for training multi-agent cooperative communication strategy based on teammate perception | |
CN110598835B (en) | Automatic path-finding method for trolley based on Gaussian variation genetic algorithm optimization neural network | |
Ebrahimi et al. | Dynamic difficulty adjustment in games by using an interactive self-organizing architecture | |
Liu | Shortest path selection algorithm for cold chain logistics transportation based on improved artificial bee colony | |
Dai et al. | Evolutionary neural network for ghost in Ms. Pac-Man | |
CN118366009B (en) | Pedestrian track prediction method and system based on human group behavior characteristic guidance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||