CN117009811A - Multi-agent training method and system based on reinforcement learning

Multi-agent training method and system based on reinforcement learning

Info

Publication number
CN117009811A
Authority
CN
China
Prior art keywords
agent
intelligent
reinforcement learning
information
ball
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310779728.8A
Other languages
Chinese (zh)
Inventor
胡斌
莫小山
郭慧
陆红艳
庞怡宁
蒙颖姗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuzhou University
Original Assignee
Wuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuzhou University filed Critical Wuzhou University
Priority to CN202310779728.8A priority Critical patent/CN117009811A/en
Publication of CN117009811A publication Critical patent/CN117009811A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a multi-agent training method and system based on reinforcement learning, comprising the following steps: building a reinforcement learning environment; designing the agents' attribute values and behaviors; acquiring information through a perception system and passing it to a decision system, where each agent in the cluster makes a judgment from the incoming information and selects a corresponding state to execute; judging whether a pass succeeds, rewarding the individual on a successful pass and rewarding the group on a goal; collecting each agent's behavior and environment information through a neural network as the agents continuously interact with the environment, and fitting each agent's expected cumulative return function; during reinforcement learning, training the N agents together, storing the decision information in the neural network, and forming a joint strategy with a common goal, completing multi-agent reinforcement learning. The invention achieves the aim of applying multi-agent cooperation more effectively in practice.

Description

Multi-agent training method and system based on reinforcement learning
Technical Field
The invention relates to the technical field of multi-agent training, and in particular to a multi-agent training method and system based on reinforcement learning.
Background
Multi-agent systems can be used to solve problems in fields such as robotics, distributed decision making, traffic control, and business management. A virtual human cluster is a typical and complex multi-agent system, exhibiting both individual intelligence and cooperative swarm intelligence. Multi-agent reinforcement learning is an important branch of virtual crowd behavior research: by fully applying reinforcement learning techniques, game theory, and related methods to the multi-agent system, the agents can complete complicated computational tasks through interaction and decision making in high-dimensional, dynamic, realistic scenes. Simulation of virtual human clusters mainly comprises real-time rendering, motion control, and behavior control techniques, and reflects basic human behavioral capabilities.
Reinforcement learning is an area of machine learning that emphasizes how to act based on the environment so as to maximize expected benefit. It describes and solves the problem of an agent maximizing its return or achieving a specific goal by learning a strategy while interacting with its environment. Reinforcement learning has many applications in artificial intelligence, such as decision-making scenarios in computer vision and interactive environments.
Multi-agent reinforcement learning is a sub-field of the reinforcement learning framework and an expansion of the traditional single-agent reinforcement learning method. Unlike a single agent that makes decisions centrally, the agents of a multi-agent system in virtual-crowd cooperation stand in many complex relationships with one another, such as competition and collaboration. To better apply multi-agent collaboration in practice, the collaboration and competition relationships between the agents in a virtual crowd must be resolved.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-agent training method and system based on reinforcement learning, solving the technical problem that the prior art cannot effectively resolve the cooperation and competition relationships between agents in a virtual crowd, and thereby achieving the purpose of applying multi-agent cooperation more effectively in practice.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
a multi-agent training method based on reinforcement learning comprises the following steps:
taking football as a research object, and constructing a reinforcement learning environment;
abstracting football players into agents, and designing the agents' attribute values and behaviors;
acquiring information through the agents' perception system, passing the acquired information to the agents' decision system, where each agent in the cluster makes a judgment based on the incoming information and selects a corresponding state to execute;
judging whether a pass is successful based on the optimized passing-algorithm logic, rewarding the individual when the pass succeeds, and rewarding the group when a goal is scored;
collecting each agent's behavior information and environment information through a neural network as the agents continuously interact with the environment, and fitting each agent's expected cumulative return function from the collected information;
in the reinforcement learning process, adopting a centralized-training, decentralized-execution framework to train the N agents together; each agent makes independent decisions according to its own expected cumulative return function, the decision information is stored in the neural network, and a joint strategy with a common goal is formed, completing multi-agent reinforcement learning;
wherein, when rewarding the individual, the pass reward value is reduced.
As a preferred embodiment of the present invention, constructing the reinforcement learning environment includes:
a fence and the ground form a simple football field; small cube figures stand in for virtual humans, the two teams' figures are distinguished by different colors, and when the environment is built only four figures per team are used for training;
wherein the scene is reset when the time reaches the training-step duration or when a team scores a goal.
As a preferred embodiment of the present invention, designing the agent attribute values and agent behaviors includes:
designing the agent's speed, passing and shooting power, stamina, and ball-stealing force;
designing the agent's movement, ball holding, ball stealing, passing, sprinting, and shooting;
wherein the agent attribute values include speed, passing and shooting power, stamina, and ball-stealing force, and the agent behaviors include movement, ball holding, ball stealing, passing, sprinting, and shooting.
In a preferred embodiment of the present invention, designing the agent's passing and shooting includes:
when passing or shooting, judging whether the agent is in the ball-holding state; if so, calling the ball-management script, passing the required parameters through its entry function, applying to the ball a force consistent with the agent's forward direction according to the passed-in parameters, and setting the team attribute on the ball operation;
designing the agent's ball stealing includes:
performing collision detection and judging whether the collided agent belongs to another team; if so, calling the ball-management script, setting the ball's holder state to the neutral state, and giving the ball a team ID with no holding attribute and a smaller force.
In a preferred embodiment of the present invention, acquiring information through the agent's perception system includes:
different targets are assigned different Tags; several rays are emitted from the agent, and when an emitted ray hits a target Tag, the target's information is obtained and passed into the agent's decision system;
wherein emitting the rays includes:
sorting all targets into groups of different types, each group using its own separate ray detection.
In a preferred embodiment of the present invention, judging whether a pass is successful based on the optimized passing-algorithm logic includes:
when passing, judging whether an agent holds the ball; if so, calling the ball-management script and passing the required parameters through its entry function, then judging from the passed-in parameters whether the ball has collided with a wall; if not, judging whether the receiver is a teammate; if so, the pass is judged successful and the individual is rewarded;
rewarding the group includes:
storing all players of a team in the player-class array; when a goal is scored, traversing the players in the array and assigning each a reward value, completing the group reward.
As a preferred embodiment of the present invention, fitting each agent's expected cumulative return function from the collected behavior and environment information includes:
evaluating, according to the Markov decision process, whether each agent's behavior benefits the multi-agent system as a whole; if so, recording the behavior for learning and memory, and fitting each agent's expected cumulative return function over the course of this continuous learning and memory.
As a preferred embodiment of the present invention, evaluating according to the Markov decision process comprises:
the Markov decision process is described by defining a five-tuple comprising: a finite state set, a finite action set, a state transition probability matrix, a reward function, and a discount coefficient;
the state transition probability matrix is specifically shown in Equation 1:

$$P^{a}_{ss'} = \mathbb{P}\left[\, S_{t+1} = s' \mid S_t = s,\ A_t = a \,\right] \tag{1}$$

where P denotes the state transition probability matrix, A the finite action set, and S the finite state set; at a given time t the state $S_t$ is s and the action taken $A_t$ is a, and at time t+1 the state $S_{t+1}$ is s';
when the state is s and the action is a, the reward function is specifically shown in Equation 2:

$$R^{a}_{s} = \mathbb{E}\left[\, R_{t+1} \mid S_t = s,\ A_t = a \,\right] \tag{2}$$

further, the expected cumulative return function represents the expected cumulative return when an agent performs action a in state s following a particular policy μ, as shown in Equation 3:

$$Q^{\mu}(s,a) = \mathbb{E}_{\mu}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\ A_t = a \,\right] \tag{3}$$

where t denotes a given time.
A reinforcement learning-based multi-agent training system, comprising:
reinforcement learning environment building unit: used for constructing the reinforcement learning environment with football as the research object;
functional logic implementation unit: used for abstracting football players into agents and designing the agents' attribute values and behaviors;
information acquisition and communication implementation unit: used for acquiring information through the agents' perception system, passing the acquired information to the agents' decision system, where each agent in the cluster makes a judgment based on the incoming information and selects a corresponding state to execute;
rewarding behavior implementation unit: used for judging whether a pass is successful based on the optimized passing-algorithm logic, rewarding the individual when the pass succeeds, and rewarding the group when a goal is scored;
cluster decision implementation unit: used for collecting each agent's behavior information and environment information through a neural network as the agents continuously interact with the environment, and fitting each agent's expected cumulative return function from the collected information; in the reinforcement learning process, a centralized-training, decentralized-execution framework trains the N agents together, each agent makes independent decisions according to its own expected cumulative return function, the decision information is stored in the neural network, and a joint strategy with a common goal is formed, completing multi-agent reinforcement learning;
wherein, when rewarding the individual, the pass reward value is reduced.
Compared with the prior art, the invention has the beneficial effects that:
(1) Aiming at the problem that reinforcement learning for a multi-agent system is especially difficult because additional agents make the environment unstable, the invention redesigns the multi-agent system and proposes a neural network suited to multi-agent training, distinguishing it from single-agent reinforcement learning; it uses reinforcement learning to train a model of the multi-agent system and realizes team cooperation among the agents of the virtual crowd;
(2) By optimizing the personal strategy rewards of the agents in the virtual crowd, redesigning the passing logic, and optimizing the virtual crowd's training parameters, the invention makes team cooperation of the multi-agent system in reinforcement learning feasible, with better stability and convergence;
(3) The virtual crowd based on cognitive behavior modeling can effectively improve its own behavior from its interaction experience with the environment and the other agents. A reinforcement learning method is adopted to study virtual-crowd football cooperation with football as the object; through the design of communication, coordination and cooperation, and learning modes within the virtual crowd, each agent learns during training to coordinate with the other agents, to select its own behaviors, to anticipate the behaviors the other agents select, and to capture the virtual players' targets;
(4) The virtual-crowd multi-agent training model obtained by the invention can address crowd problems such as game AI, unmanned aerial vehicle combat, and airplane formation flight;
(5) The invention takes football as the research object to study the team-cooperation and team-game problems of virtual human clusters in a specific environment, and resolves the cooperation and competition relationships among the agents through reinforcement-learning-based multi-agent training.
The invention is described in further detail below with reference to the drawings and the detailed description.
Drawings
FIG. 1 is a reinforcement learning schematic diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram of decentralized training of an embodiment of the present invention;
FIG. 3 is a schematic diagram of centralized training of an embodiment of the present invention;
FIG. 4 is a schematic diagram of the virtual human reinforcement learning training environment in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of the Markov decision process model of an embodiment of the present invention;
FIG. 6 is a schematic diagram of a pass between teammates in accordance with an embodiment of the present invention;
FIG. 7 is a pass (shoot) flow chart of an embodiment of the present invention;
FIG. 8 is a ball-stealing flow chart in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of grouped detection objects according to an embodiment of the invention;
FIG. 10 is a schematic diagram of all detection objects according to an embodiment of the invention;
FIG. 11 is a flow chart of the pass reward of an embodiment of the invention;
FIG. 12 is a logic diagram of the optimized passing algorithm according to an embodiment of the present invention;
FIG. 13 is a block diagram of the agent neural network of an embodiment of the present invention;
FIG. 14 is a training scenario diagram of an embodiment of the present invention;
FIG. 15 is a data collection diagram of an embodiment of the present invention;
FIG. 16 is a system cumulative reward diagram of an embodiment of the invention;
FIG. 17 is a system team rewards diagram of an embodiment of the invention;
FIG. 18 is a system policy evaluation diagram of an embodiment of the present invention;
FIG. 19 is a cumulative reward comparison diagram of an embodiment of the invention;
FIG. 20 is a team rewards comparison chart of an embodiment of the invention;
FIG. 21 is a policy evaluation comparison chart according to an embodiment of the present invention;
FIG. 22 is a step diagram of the reinforcement learning-based multi-agent training method in accordance with an embodiment of the present invention.
Detailed Description
The multi-agent training method based on reinforcement learning provided by the invention, as shown in fig. 22, comprises the following steps:
step S1: taking football as a research object, and constructing a reinforcement learning environment;
step S2: abstracting football players into agents, and designing the agents' attribute values and behaviors;
step S3: acquiring information through the agents' perception system, passing the acquired information to the agents' decision system, where each agent in the cluster makes a judgment based on the incoming information and selects a corresponding state to execute;
step S4: judging whether a pass is successful based on the optimized passing-algorithm logic, rewarding the individual when the pass succeeds, and rewarding the group when a goal is scored;
step S5: collecting each agent's behavior information and environment information through a neural network as the agents continuously interact with the environment, and fitting each agent's expected cumulative return function from the collected information;
step S6: in the reinforcement learning process, adopting a centralized-training, decentralized-execution framework to train the N agents together; each agent makes independent decisions according to its own expected cumulative return function, the decision information is stored in the neural network, and a joint strategy with a common goal is formed, completing multi-agent reinforcement learning;
wherein, when rewarding the individual, the pass reward value is reduced.
In the above step S1, the invention is directed to football for the following reasons: football is a team sport with the characteristics of integrity, antagonism, variability, and feasibility; it exhibits both individual and group intelligence and is a typical application scenario for multi-agent cooperation. Football matches require facing different teams and formulating different strategies, so individuals need a degree of flexibility and adaptability. The invention therefore selects football matches as the research object of virtual human cluster cooperation, which can fully verify the validity of the multi-agent learning rationality and role-allocation strategies under conditions of many participating entities, a large state space, and complex real-time decisions.
In step S1, setting up the reinforcement learning environment includes:
a fence and the ground form a simple football field; small cube figures stand in for virtual humans, the two teams' figures are distinguished by different colors, and when the environment is built only four figures per team are used for training;
wherein the scene is reset when the time reaches the training-step duration or when a team scores a goal.
Specifically, the invention uses the Unity development engine as the experimental carrier and football as the study object of the virtual-crowd cooperation technology. A fence and the ground enclose a simple football field, as shown in Fig. 4, and small cube figures replace virtual humans. To reduce the complexity of the multi-agent environment, the virtual-crowd training environment is not designed strictly according to football rules: when the environment is built, only four players per team are used for training, and the scene is reset when the time reaches the training-step duration or when a team scores.
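As an illustration of this reset rule, the following sketch shows how a match controller might reset the scene when the step budget is exhausted or a goal is scored; the class name, field names, and numeric values are assumptions for illustration, not the patent's actual script:

```csharp
using UnityEngine;

// Reset-rule sketch: the scene resets when the step budget is exhausted or a
// team scores. Class, field names, and numeric values are illustrative.
public class MatchController : MonoBehaviour
{
    public int maxEnvironmentSteps = 5000;   // assumed training-step duration
    public Transform[] redTeam, blueTeam;    // four cube figures per team
    public Transform ball;
    private int stepCount;

    private void FixedUpdate()
    {
        stepCount++;
        if (stepCount >= maxEnvironmentSteps) ResetScene();
    }

    // Called by goal-detection logic when a team scores.
    public void OnGoal() => ResetScene();

    private void ResetScene()
    {
        stepCount = 0;
        ball.localPosition = Vector3.zero;   // ball back to the center spot
        foreach (var t in redTeam)  t.localPosition = RandomSpawn(-1);
        foreach (var t in blueTeam) t.localPosition = RandomSpawn(+1);
    }

    private static Vector3 RandomSpawn(int side)
    {
        // Spawn each figure on its own half at a randomized position.
        return new Vector3(side * Random.Range(2f, 10f), 0.5f, Random.Range(-5f, 5f));
    }
}
```

In ML-Agents the same effect can also be achieved per agent through the MaxStep property and EndEpisode().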
In step S2, designing the agent attribute values and agent behaviors includes:
designing the agent's speed, passing and shooting power, stamina, and ball-stealing force;
designing the agent's movement, ball holding, ball stealing, passing, sprinting, and shooting;
wherein the agent attribute values include speed, passing and shooting power, stamina, and ball-stealing force, and the agent behaviors include movement, ball holding, ball stealing, passing, sprinting, and shooting.
Further, designing the agent's passing and shooting includes:
when passing or shooting, judging whether the agent is in the ball-holding state; if so, calling the ball-management script, passing the required parameters through its entry function, applying to the ball a force consistent with the agent's forward direction according to the passed-in parameters, and setting the team attribute on the ball operation;
designing the agent's ball stealing includes:
performing collision detection and judging whether the collided agent belongs to another team; if so, calling the ball-management script, setting the ball's holder state to the neutral state, and giving the ball a team ID with no holding attribute and a smaller force.
Specifically, according to football's characteristics of integrity, antagonism, variability, and feasibility, football players are abstracted into agents. Based on the players' speed, passing and shooting power, stamina, and ball-stealing force, tactics are designed according to the team's characteristics, and attribute values and behaviors are designed for the agents, as shown in Tables 1 and 2 respectively:
TABLE 1. Agent attribute values

Attribute value        Description
Speed                  the agent's movement speed; it can be increased above a specific stamina value
Pass and shot power    the force applied when the agent passes or shoots
Stamina                the physical strength the agent possesses; it enables acceleration (sprinting)
Ball-stealing force    the force the ball receives after the agent steals it
TABLE 2. Agent behavior design (the designed behaviors are movement, ball holding, ball stealing, passing, sprinting, and shooting)
Because the multi-agent system is a combination of N agents, only a single agent needs to be modeled to obtain the multi-agent system model. Every reinforcement learning problem follows a Markov decision process, as shown in Fig. 5, so each agent is modeled as a Markov decision process.
In the invention, each agent in the virtual crowd learns through three scripts: an Agent script, a Behavior Parameters script, and a Decision Requester script. The learning process is completed by setting learning parameters in the Behavior Parameters and Decision Requester scripts, while the Agent script collects the environment information, implements the action behaviors, passes the observations to the decision function, and rewards the decision function's actions.
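By way of illustration, a minimal Agent subclass of this kind might look as follows; the Behavior Parameters and Decision Requester components are attached in the Unity editor rather than written in code, and the class and field names here are assumptions, not the patent's actual script:

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;
using UnityEngine;

// Minimal agent script: observations collected here are handed to the
// decision system, and the chosen actions come back through OnActionReceived.
public class SoccerAgent : Agent
{
    public Rigidbody body;      // the agent's physics body (assigned in the editor)
    public Transform ball;      // reference to the ball (assigned in the editor)
    public float moveForce = 5f;

    public override void CollectObservations(VectorSensor sensor)
    {
        // Environment information passed to the decision function.
        sensor.AddObservation(transform.localPosition);
        sensor.AddObservation(ball.localPosition - transform.localPosition);
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Map the decision output to a movement behaviour.
        float forward = actions.ContinuousActions[0];
        float turn    = actions.ContinuousActions[1];
        body.AddForce(transform.forward * forward * moveForce);
        transform.Rotate(Vector3.up, turn * 2f);
    }
}
```

In this layout, the Decision Requester periodically asks the decision system for an action, and the Behavior Parameters component names the behavior and sets its observation and action sizes.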
For the passing and shooting actions, the agent first judges whether it owns the ball; only in the ball-holding state can the pass function be executed, as shown in Fig. 6. The ball-management script is called and the ball's parameters are set, so that when an agent touches the ball it can judge from the ball's attributes whether the ball was passed by a teammate.
In the present invention, the essence of passing is simply sending the ball away, so both actions are implemented by defining a single function. A ThrowBall script is defined for management, and its Throw function takes three parameters: the ball, the action executor, and the acting team's ID, as shown in Fig. 7. When passing or shooting, the agent judges whether it is in the ball-holding state (its tag is hasBallAgent); if so, it may pass or shoot, a force consistent with the agent's forward direction is applied to the ball, and the team attribute of the ball operation is set.
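A minimal sketch of such a ThrowBall script is given below, assuming a BallState helper component for the ball's team attribute; the patent does not disclose the exact signatures, so the parameter and field names are illustrative:

```csharp
using UnityEngine;

// Sketch of the ball-management script: one Throw function covers both
// passing and shooting, since either action amounts to sending the ball away.
public class ThrowBall : MonoBehaviour
{
    public float throwForce = 10f;   // the "pass and shot power" attribute

    // ball: the ball's rigidbody; executor: the acting agent;
    // teamId: the acting agent's team, recorded on the ball.
    public void Throw(Rigidbody ball, GameObject executor, int teamId)
    {
        // Only an agent in the ball-holding state may pass or shoot.
        if (!executor.CompareTag("hasBallAgent")) return;

        // Apply a force consistent with the agent's forward direction.
        ball.AddForce(executor.transform.forward * throwForce, ForceMode.Impulse);

        // Record the team attribute so a later toucher can tell whether
        // the ball came from a teammate.
        ball.GetComponent<BallState>().lastTeamId = teamId;
    }
}

// Assumed helper holding the ball's team attribute.
public class BallState : MonoBehaviour
{
    public int lastTeamId = -1;   // -1 means no team owns the ball
}
```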
For the ball-stealing action, to simplify the mechanics, the invention uses collision detection and defines a steal as a collision with an opposing player: the ball's holder state is set to the neutral state, and the ball is given a team ID with no holding attribute and a smaller force. In this way the opposing player's ball-holding state is cleared, achieving the effect of a steal, as shown in Fig. 8.
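A sketch of this collision-triggered steal is shown below, reusing the BallState helper assumed in the previous sketch; the tag and field names are again illustrative:

```csharp
using UnityEngine;

// Steal sketch: collision with an opposing ball holder clears that holder's
// state and gives the ball a neutral team ID plus a small force.
public class StealBall : MonoBehaviour
{
    public int teamId;
    public float nudgeForce = 2f;   // the "smaller force" given to the ball
    public Rigidbody ball;          // scene ball (assigned in the editor)

    private void OnCollisionEnter(Collision collision)
    {
        var other = collision.gameObject;
        var otherSteal = other.GetComponent<StealBall>();

        // Steal only from a ball-holding agent of another team.
        if (otherSteal == null || otherSteal.teamId == teamId) return;
        if (!other.CompareTag("hasBallAgent")) return;

        other.tag = "Agent";                             // clear the holding state
        ball.GetComponent<BallState>().lastTeamId = -1;  // neutral team ID
        ball.AddForce(transform.forward * nudgeForce, ForceMode.Impulse);
    }
}
```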
In step S3, acquiring information through the agent's perception system includes:
different targets are assigned different Tags; several rays are emitted from the agent, and when an emitted ray hits a target Tag, the target's information is obtained and passed into the agent's decision system;
wherein emitting the rays includes:
sorting all targets into groups of different types, each group using its own separate ray detection.
Specifically, when the virtual crowd is trained in the reinforcement learning environment using the target-based ray detection component, the corresponding target information must be delivered to the agent's sensors. To this end, different Tags are defined in the reinforcement learning environment to distinguish different targets, such as goals, balls, red players, blue players, and field boundaries; the specific Tag values and corresponding objects are shown in Table 3. Information is acquired and transmitted through the agent's perception system: several rays are emitted from the agent, and when an emitted ray hits a target Tag, the agent can "see" the target and obtain its information, such as position and state, which is passed into the agent's decision system. The agents in the cluster judge according to the incoming information and select a corresponding state to execute, effecting the state change.
TABLE 3. Target information table
Using the same ray detection for multiple target types increases the precision requirements and computational load of target detection, which affects the agent's judgment of target points. To solve this, the invention adopts a layered detection method, as shown in Figs. 9 and 10: all targets are sorted into groups of different types, and each group uses its own separate ray detection, effectively unifying the detection parameters and controlling the detection precision.
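The sketch below illustrates the idea of grouped ray detection: each target type gets its own set of rays, so detection parameters stay uniform within a group. The tag names follow Table 3's categories, while the grouping, ray counts, and angles are illustrative assumptions:

```csharp
using System.Collections.Generic;
using UnityEngine;

// Layered detection sketch: each target type gets its own group of rays,
// so detection parameters stay uniform within a group.
public class GroupedRayPerception : MonoBehaviour
{
    public float rayLength = 20f;

    // One group per target type, detected independently.
    private readonly string[][] tagGroups =
    {
        new[] { "ball" },
        new[] { "redPlayer", "bluePlayer" },
        new[] { "goal", "boundary" },
    };

    public List<Vector3> Sense()
    {
        var seen = new List<Vector3>();
        foreach (var group in tagGroups)
        {
            // Fan several rays out in front of the agent for this group only.
            for (int angle = -60; angle <= 60; angle += 30)
            {
                var dir = Quaternion.Euler(0f, angle, 0f) * transform.forward;
                if (Physics.Raycast(transform.position, dir, out RaycastHit hit, rayLength)
                    && System.Array.IndexOf(group, hit.collider.tag) >= 0)
                {
                    // The hit target's information goes to the decision system.
                    seen.Add(hit.point);
                }
            }
        }
        return seen;
    }
}
```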
In step S4, judging whether a pass is successful based on the optimized passing-algorithm logic includes:
when passing, judging whether an agent holds the ball; if so, calling the ball-management script and passing the required parameters through its entry function, then judging whether the ball has collided with a wall; if not, judging whether the receiver is a teammate; if so, the pass is judged successful and the individual is rewarded;
rewarding the group includes:
storing all players of a team in the player-class array; when a goal is scored, traversing the players in the array and assigning each a reward value, completing the group reward.
Specifically, in reinforcement learning the reward is the only criterion for measuring an agent's actions. Through continuous trial and error, the agent gradually learns to avoid behavior that does not benefit it; the reward measures whether a behavior is beneficial, and when it is, the agent biases toward that behavior. To make the agents cooperate in playing the football game, the invention rewards the team and the individual separately, so that the agents achieve the goal of cooperation. Because an individual agent's behavior is weighted according to the magnitude of its reward, the larger a certain action's reward value, the greater the weight of taking that action. In the invention, so that the agents primarily aim at scoring goals, the pass reward value of the personal strategy is reduced to 0.1, as shown in Fig. 11. The passing algorithm logic is redesigned and optimized as shown in Fig. 12, preventing agents from earning pass rewards by playing the ball off walls, corners, and the like.
An array object of the player class is defined to store all players of the team; it is declared public so that the number of team members can be modified externally, allowing matches between different numbers of players and convenient later management and use. When a goal is scored, the player array is traversed and the reward values are assigned one by one, realizing the team reward.
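The two reward paths might be combined as in the following sketch: a reduced individual reward of 0.1 for a successful pass that neither hit a wall nor reached an opponent, and a group reward assigned by traversing the public player array. The class and method names are assumptions:

```csharp
using Unity.MLAgents;
using UnityEngine;

// Reward sketch: a reduced individual reward (0.1) for a clean pass to a
// teammate, and a group reward assigned to every member of the scoring team
// by traversing the public player array.
public class TeamRewards : MonoBehaviour
{
    public Agent[] players;          // public so the squad size can be changed externally
    public float passReward = 0.1f;  // reduced so that goals dominate passing
    public float goalReward = 1.0f;  // illustrative group reward value

    // Optimized pass logic: no reward if the ball bounced off a wall or
    // reached an opponent.
    public void OnPassCompleted(Agent passer, bool hitWall, bool receiverIsTeammate)
    {
        if (hitWall || !receiverIsTeammate) return;
        passer.AddReward(passReward);   // individual reward
    }

    // Traverse the player array and assign the group reward on a goal.
    public void OnGoalScored()
    {
        foreach (var p in players)
        {
            p.AddReward(goalReward);
        }
    }
}
```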
In step S5, to conform to the centralized-training, decentralized-execution framework, the invention defines a master control brain Q (a neural network) to process the states of the individual agents in the multi-agent system and evaluates, according to the Markov decision process, whether those states benefit the multi-agent system as a whole; the agents can make decisions among themselves through the environment information. The overall architecture is shown in Fig. 13.
In multi-agent reinforcement learning there may be N agents, each able to make decisions independently from the environment information, while the neural network can access each agent's behavior and environment information; when an agent's behavior benefits the team it is recorded, achieving the effect of learning and memory. From the data collected during the agents' continuous interaction with the environment, a Q-value function Q: S × A → R is fitted. This function represents the expected cumulative return when the agent performs action a in state s following a particular policy μ, as shown in Equation 3 (see below).
In step S5, fitting each agent's expected cumulative return function from the collected behavior and environment information includes:
evaluating, according to the Markov decision process, whether each agent's behavior benefits the multi-agent system as a whole; if so, recording the behavior for learning and memory, and fitting each agent's expected cumulative return function over the course of this continuous learning and memory.
Further, evaluating according to the Markov decision process includes:
the Markov decision process is described by defining a five-tuple comprising: a finite state set, a finite action set, a state transition probability matrix, a reward function, and a discount coefficient;
the state transition probability matrix is specifically shown in Equation 1:

$$P^{a}_{ss'} = \mathbb{P}\left[\, S_{t+1} = s' \mid S_t = s,\ A_t = a \,\right] \tag{1}$$

where P denotes the state transition probability matrix, A the finite action set, and S the finite state set; at a given time t the state $S_t$ is s and the action taken $A_t$ is a, and at time t+1 the state $S_{t+1}$ is s';
when the state is s and the action is a, the reward function is specifically shown in Equation 2:

$$R^{a}_{s} = \mathbb{E}\left[\, R_{t+1} \mid S_t = s,\ A_t = a \,\right] \tag{2}$$
further, the expected cumulative return function represents the expected cumulative return for an agent performing action a in state s following a particular policy μ, as shown in equation 3:
wherein t represents a certain time.
Specifically, when the invention applies a Markov decision process within reinforcement learning to solve the sequential problem, the next action is selected according to the current environment, a reward is given, the state is updated, and the result acts back on the environment. The Markov decision process is thus described by defining a five-tuple ⟨S, A, P, R, γ⟩, where S is the finite state set, A the finite action set, P the state transition probability matrix, R the reward function, and γ the discount coefficient. In reinforcement learning, the agent's state change is a transition to the next state according to the current state and the corresponding action taken; the state transition probability matrix P and the environment reward feedback R are therefore as shown in Equations 1 and 2 above.
When the reinforcement learning sequential problem is described with a Markov decision process: at some time t the agent, in its current state, selects an action from the action set; after the action completes, the agent's state changes to the state at time t+1, and the reward function rewards the state at time t. Action and state evaluation can therefore be performed according to the Markov decision process to achieve the learning effect.
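For reference, the expected cumulative return of Equation 3 satisfies the standard Bellman expectation recursion, a textbook identity that follows from the five-tuple above rather than an additional element of the invention:

$$Q^{\mu}(s,a) = \mathbb{E}\!\left[\, R_{t+1} + \gamma\, Q^{\mu}\!\left(S_{t+1}, A_{t+1}\right) \,\middle|\, S_t = s,\ A_t = a \,\right]$$

This recursive form is what makes it possible to fit Q from transitions sampled during the agents' continuous interaction with the environment.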
In the above step S6, reinforcement learning treats the virtual-crowd agents and the environment as a whole, as shown in Fig. 1. In reinforcement learning, the agents learn by "trial and error", and the rewards obtained by interacting with the environment guide behavior, the goal being for the agents to obtain the maximum reward. If one of an agent's behavior strategies earns a positive reward, the agent's tendency toward that strategy is strengthened, and the optimal behavior strategy is obtained by continuously interacting with the environment, achieving the learning effect. In reinforcement learning the agent's task comes with no data set: the agent can only interact with the environment continuously, evaluating its current action from the environment's reward feedback, so that the current environment corresponds to a state value, which is described through the Markov decision process.
In the above step S6, when reinforcement learning is used to solve the virtual crowd's behavior problem, two training frameworks are available: centralized and decentralized. In the decentralized architecture, each agent in the virtual crowd trains independently of the others, and its policy network outputs actions based on local observations. Decentralized training, shown in Fig. 2, faces the problem of environmental instability: during training, one agent treats the other agents as part of the environment, yet the other agents' training changes the environment's state transition function, breaking the Markov assumption the reinforcement learning algorithm relies on. Centralized training, shown in Fig. 3, solves this non-stationarity by jointly modeling all agents to learn a joint strategy: the strategy's input is the joint observation of all agents and its output is their joint action, so team cooperation can be executed. One of the invention's innovations is the "centralized training + decentralized execution" mode of multi-agent training: a centralized mode is used during training, and after training each agent makes decisions from its own state using the trained decision network. This reduces the decision input/output information, lets the group of agents learn collectively while executing individual strategies, keeps the environment stable, and to a certain extent overcomes both the unstable-environment and large-scale-agent problems.
In the reinforcement learning process, the N agents are trained together under this framework; each agent is relatively independent and can make independent decisions, and the agents' decision information is stored in the neural network to form a joint strategy with a common goal, achieving the aim of multi-agent reinforcement learning.
The multi-agent training system based on reinforcement learning provided by the invention comprises: the system comprises a reinforcement learning environment building unit, a functional logic realizing unit, an information acquisition and communication realizing unit, a rewarding behavior realizing unit and a cluster decision realizing unit.
The reinforcement learning environment building unit is used for building the reinforcement learning environment by taking football as a research object.
The functional logic implementation unit is used for abstracting football players into agents and designing the agents' attribute values and behaviors.
The information acquisition and communication implementation unit is used for acquiring information through the agents' perception system and passing the acquired information to the agents' decision system, where each agent in the cluster makes a judgment based on the incoming information and selects a corresponding state to execute.
The rewarding behavior implementation unit is used for judging whether a pass is successful based on the optimized passing-algorithm logic, rewarding the individual when the pass succeeds, and rewarding the group when a goal is scored.
The cluster decision implementation unit is used for collecting each agent's behavior information and environment information through a neural network as the agents continuously interact with the environment, and fitting each agent's expected cumulative return function from the collected information; in the reinforcement learning process, a centralized-training, decentralized-execution framework trains the N agents together, each agent makes independent decisions according to its own expected cumulative return function, the decision information is stored in the neural network, and a joint strategy with a common goal is formed, completing multi-agent reinforcement learning.
Wherein, when rewarding the individual, the pass reward value is reduced.
Experimental design and optimization
Experimental parameters
The algorithm types currently used for machine-learning training are mainly the proximal policy optimization (PPO) algorithm, the Soft Actor-Critic (SAC) algorithm, and the MA-POCA algorithm, the first two of which are single-agent training algorithms. The training parameters govern the maximum number of training steps, the interval at which statistics are reported, the number of experience steps collected, and the number of checkpoints of the decision network that are saved.
In the invention, the experimental parameters of the virtual crowd's cooperative training include the training algorithm type, maximum training steps, statistics reporting interval, experience steps collected, number of saved checkpoints, and checkpoint interval, as shown in Table 4:

Table 4. Experimental training parameters

Field                  Attribute                          Default value
trainer_type           training algorithm type            PPO
max_steps              maximum training steps             500000
summary_freq           statistics reporting interval      50000
time_horizon           experience steps collected         64
keep_checkpoints       number of saved checkpoints        5
checkpoint_interval    steps between saved checkpoints    500000
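For orientation, these fields correspond to trainer-configuration keys of the Unity ML-Agents toolkit; a minimal YAML sketch using Table 4's defaults might look as follows (the behavior name and the choice of the multi-agent poca trainer are assumptions, not values stated in the patent):

```yaml
behaviors:
  SoccerPlayer:            # behavior name set in the Behavior Parameters script
    trainer_type: poca     # MA-POCA; ppo and sac are the single-agent options
    max_steps: 500000
    summary_freq: 50000
    time_horizon: 64
    keep_checkpoints: 5
    checkpoint_interval: 500000
```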
Experiment verification
When the agents in the virtual crowd are trained with this system, the multi-agent training algorithm is used to experimentally verify the defined multi-agent training framework, and the multi-agent reinforcement model is trained in the Unity environment. To save learning time, the training scene is duplicated so that several scenes train simultaneously, and the training data are output to TensorBoard, as shown in Figs. 14 and 15.
Experimental results
All data are collected through the TensorBoard data-visualization panel and tabulated. The data show that as the training steps increase, the multi-agent system's reward value rises continuously and then settles into a stable state, so that overall the data fluctuate within a small range and the convergence is high, as shown in Figs. 16, 17, and 18.
Analysis of the experimental results shows that the multi-agent system's rewards trend upward, indicating that the system is learning to play the ball in the intended direction, and the team rewards reflect the multi-agent society's team strategy; strategies that gradually prove beneficial to the agents receive larger weights and are therefore adopted.
Experimental optimization and analysis
Analysis of the results shows that the initially adopted training parameters did not produce the expected reward values, and the design of the personal strategy's pass and goal rewards was unreasonable: during team training, the personal reward could exceed the team reward, and agents of the same team would pass the ball off walls to collect personal rewards. In response, the invention optimizes the personal strategy rewards of the virtual-crowd agents, redesigns the passing logic (as in Fig. 12), and optimizes the virtual crowd's training parameters for the experiment.
After the experimental optimization, comparing the two training runs shows that training with the improved parameters is better than before. The experiments also verify the feasibility of the multi-agent system in reinforcement learning: team cooperation can be performed by defining a neural network and a team strategy, and the experimental data reflect the feasibility of the training, as shown in Figs. 19, 20, and 21.
Summary
The invention provides a multi-agent training method based on reinforcement learning and, using football as both the research and verification object, presents the algorithmic analysis of multi-agent reinforcement learning and the implementation flow of multi-agent system training in a football environment. Through reinforcement-learning training of the multi-agent system in the football environment, the experimental data verify the feasibility of team cooperation of the multi-agent system in reinforcement learning, with good stability and convergence.
The above embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, but any insubstantial changes and substitutions made by those skilled in the art on the basis of the present invention are intended to be within the scope of the present invention as claimed.

Claims (10)

1. A multi-agent training method based on reinforcement learning, characterized by comprising the following steps:
taking football as a research object, and constructing a reinforcement learning environment;
abstracting football players into agents, and designing the agents' attribute values and behaviors;
acquiring information through the agents' perception system, passing the acquired information to the agents' decision system, wherein each agent in the cluster makes a judgment based on the incoming information and selects a corresponding state to execute;
judging whether a pass is successful based on the optimized passing-algorithm logic, rewarding the individual when the pass succeeds, and rewarding the group when a goal is scored;
collecting each agent's behavior information and environment information through a neural network as the agents continuously interact with the environment, and fitting each agent's expected cumulative return function from the collected information;
in the reinforcement learning process, adopting a centralized-training, decentralized-execution framework to train the N agents together, wherein each agent makes independent decisions according to its own expected cumulative return function, the decision information is stored in the neural network, and a joint strategy with a common goal is formed, completing multi-agent reinforcement learning;
wherein, when rewarding the individual, the pass reward value is reduced.
2. The reinforcement learning-based multi-agent training method according to claim 1, wherein constructing the reinforcement learning environment comprises:
a fence and the ground form a simple football field; small cube figures stand in for virtual humans, the two teams' figures are distinguished by different colors, and when the environment is built only four figures per team are used for training;
wherein the scene is reset when the time reaches the training-step duration or when a team scores a goal.
3. The reinforcement learning-based multi-agent training method according to claim 1, wherein designing the agent attribute values and agent behaviors comprises:
designing the agent's speed, passing and shooting power, stamina, and ball-stealing force;
designing the agent's movement, ball holding, ball stealing, passing, sprinting, and shooting;
wherein the agent attribute values include speed, passing and shooting power, stamina, and ball-stealing force, and the agent behaviors include movement, ball holding, ball stealing, passing, sprinting, and shooting.
4. The reinforcement learning-based multi-agent training method according to claim 3, wherein designing the agent's passing and shooting comprises:
when passing or shooting, judging whether the agent is in the ball-holding state; if so, calling the ball-management script, passing the required parameters through its entry function, applying to the ball a force consistent with the agent's forward direction according to the passed-in parameters, and setting the team attribute on the ball operation;
wherein designing the agent's ball stealing comprises:
performing collision detection and judging whether the collided agent belongs to another team; if so, calling the ball-management script, setting the ball's holder state to the neutral state, and giving the ball a team ID with no holding attribute and a smaller force.
5. The reinforcement learning-based multi-agent training method according to claim 1, wherein acquiring information through the agent's perception system comprises:
different targets are assigned different Tags; several rays are emitted from the agent, and when an emitted ray hits a target Tag, the target's information is obtained and passed into the agent's decision system;
wherein emitting the rays comprises:
sorting all targets into groups of different types, each group using its own separate ray detection.
6. The reinforcement learning-based multi-agent training method according to claim 1, wherein judging whether a pass is successful based on the optimized passing-algorithm logic comprises:
when passing, judging whether an agent holds the ball; if so, calling the ball-management script and passing the required parameters through its entry function, then judging from the passed-in parameters whether the ball has collided with a wall; if not, judging whether the receiver is a teammate; if so, the pass is judged successful and the individual is rewarded;
wherein rewarding the group comprises:
storing all players of a team in the player-class array; when a goal is scored, traversing the players in the array and assigning each a reward value, completing the group reward.
7. The reinforcement learning-based multi-agent training method of claim 1, wherein fitting each agent's expected cumulative return function from the collected behavior and environment information comprises:
evaluating, according to the Markov decision process, whether each agent's behavior benefits the multi-agent system as a whole; if so, recording the behavior for learning and memory, and fitting each agent's expected cumulative return function over the course of this continuous learning and memory.
8. The reinforcement learning-based multi-agent training method of claim 7, wherein evaluating according to the Markov decision process comprises:
the Markov decision process is described by defining a five-tuple comprising: a finite state set, a finite action set, a state transition probability matrix, a reward function, and a discount coefficient;
the state transition probability matrix is specifically shown in Equation 1:

$$P^{a}_{ss'} = \mathbb{P}\left[\, S_{t+1} = s' \mid S_t = s,\ A_t = a \,\right] \tag{1}$$

wherein P denotes the state transition probability matrix, A the finite action set, and S the finite state set; at a given time t the state $S_t$ is s and the action taken $A_t$ is a, and at time t+1 the state $S_{t+1}$ is s';
when the state is s and the action is a, the reward function is specifically shown in Equation 2:

$$R^{a}_{s} = \mathbb{E}\left[\, R_{t+1} \mid S_t = s,\ A_t = a \,\right] \tag{2}$$
9. The reinforcement learning-based multi-agent training method of claim 8, wherein the expected cumulative return function represents the expected cumulative return when an agent performs action a in state s following a particular policy μ, as shown in Equation 3:

$$Q^{\mu}(s,a) = \mathbb{E}_{\mu}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\ A_t = a \,\right] \tag{3}$$

wherein t denotes a given time.
10. A multi-agent training system based on reinforcement learning, comprising:
reinforcement learning environment building unit: used for constructing the reinforcement learning environment with football as the research object;
functional logic implementation unit: used for abstracting football players into agents and designing the agents' attribute values and behaviors;
information acquisition and communication implementation unit: used for acquiring information through the agents' perception system, passing the acquired information to the agents' decision system, wherein each agent in the cluster makes a judgment based on the incoming information and selects a corresponding state to execute;
rewarding behavior implementation unit: used for judging whether a pass is successful based on the optimized passing-algorithm logic, rewarding the individual when the pass succeeds, and rewarding the group when a goal is scored;
cluster decision implementation unit: used for collecting each agent's behavior information and environment information through a neural network as the agents continuously interact with the environment, and fitting each agent's expected cumulative return function from the collected information; in the reinforcement learning process, a centralized-training, decentralized-execution framework trains the N agents together, each agent makes independent decisions according to its own expected cumulative return function, the decision information is stored in the neural network, and a joint strategy with a common goal is formed, completing multi-agent reinforcement learning;
wherein, when rewarding the individual, the pass reward value is reduced.
CN202310779728.8A 2023-06-29 2023-06-29 Multi-agent training method and system based on reinforcement learning Pending CN117009811A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310779728.8A CN117009811A (en) 2023-06-29 2023-06-29 Multi-agent training method and system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310779728.8A CN117009811A (en) 2023-06-29 2023-06-29 Multi-agent training method and system based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN117009811A true CN117009811A (en) 2023-11-07

Family

ID=88570168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310779728.8A Pending CN117009811A (en) 2023-06-29 2023-06-29 Multi-agent training method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN117009811A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117807894A (en) * 2024-02-28 2024-04-02 南京信息工程大学 Data-driven reinforcement learning method for football fight
CN117807894B (en) * 2024-02-28 2024-06-04 南京信息工程大学 Data-driven reinforcement learning method for football fight



Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication