CN113269297A - Multi-agent scheduling method facing time constraint - Google Patents


Info

Publication number
CN113269297A
CN113269297A
Authority
CN
China
Prior art keywords
agent
random
representing
state
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110810946.4A
Other languages
Chinese (zh)
Other versions
CN113269297B (en)
Inventor
Zhu Chenyang (朱晨阳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghe Software Jiangsu Co ltd
Original Assignee
Donghe Software Jiangsu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghe Software Jiangsu Co ltd
Priority to CN202110810946.4A
Publication of CN113269297A
Application granted
Publication of CN113269297B
Active legal-status: Current
Anticipated expiration legal-status

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N20/00: Machine learning

Abstract

The invention relates to a time-constraint-oriented multi-agent scheduling method, which comprises the following steps: establishing a dispatching center; the dispatching center collects real-time data on the states and actions of the multi-agents and the random environment; the dispatching center processes the collected data and sends action instructions to the multi-agents. By introducing time constraints into the random game model, the real-time, non-deterministic and probabilistic behaviors exhibited among the multi-agents, or in the interaction between the multi-agents and the random environment, can be described, the time-related reward function can be quantified, and a multi-objective optimization strategy is determined through the reward function. The designed algorithm improves the efficiency of calculating the maximum reward expectation of the model and of fitting the pareto curve based on weight combinations, thereby improving the reaction speed of the multi-agents. By giving different weights to the multiple targets, the priorities of the targets are distinguished, thereby improving the running reliability of the multi-agents.

Description

Multi-agent scheduling method facing time constraint
Technical Field
The invention relates to the technical field of multi-agent interaction, and in particular to a time-constraint-oriented multi-agent scheduling method.
Background
As interaction among multi-agents (robots, robot dogs, drones, etc.) becomes increasingly close, the errors that occur during interaction continue to grow with the size and complexity of multi-agent systems. How to design a multi-agent scheduling system that meets multi-objective design requirements under an uncertain environment and the corresponding time constraints has become a key scientific problem that urgently needs to be solved.
At present, research on multi-agent scheduling systems mainly verifies the quantitative properties of the model and the properties related to the reward function through model checking, and approaches the pareto optimum of the model through value iteration. However, the following problems remain unsolved in multi-objective optimization for time-constraint-oriented multi-agent scheduling:
(1) model checking requires an exhaustive search of the state space of the multi-agents and the random environment, and the number of model states grows exponentially with the number of concurrent components, causing the state-space explosion problem;
(2) in a time-constraint-oriented random game model, the reward function can depend on time points, and when the running time is uncertain the reward function is variable, so that model-based value iteration and policy iteration algorithms are not suitable for this scenario;
(3) when combining multiple target strategies for the multi-agents, there is a lack of description of differences in target priority, and a lack of research on trading off multi-objective optimization strategies based on weight combinations.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provide a time-constraint-oriented multi-agent scheduling method that is conceptually advanced, highly reliable and fast.
The technical scheme for realizing the purpose of the invention is as follows: a time-constraint-oriented multi-agent scheduling method includes the following steps:
s1, establishing a scheduling center, which specifically comprises the following steps:
S11, establishing, based on a time-constraint-oriented multi-objective random game template, a random game model between the multi-agents and the random environment or among the multi-agents;
S12, simulating the running traces of the random game model by statistical model checking, and designing a value function learning method that is not based on the model to calculate the maximum reward expectation of the multi-agents taking different actions in the various states;
S13, iterating the algorithm according to the convergence conditions of the zero-sum random game between the multi-agents and the random environment and of the general-sum random game among the multi-agents;
s14, fitting a multi-target pareto curve based on weight combination according to a convex optimization hyperplane separation theorem;
s2, the dispatching center collects real-time data of states and actions of the multi-agent and random environments;
and S3, the dispatching center processes the acquired data and sends the action instruction to the multi-agent.
Further, step S11 is specifically:
S111, the time-constraint-oriented multi-objective random game template is a ten-tuple (Π, S, s0, A, X, C, Inv, en, δ, r), wherein:
Π represents the finite set of participants, namely the multi-agents and the random environment taking part in the random game;
S represents the finite set of states of the multi-agents and the random environment;
s0 represents the initial state of the multi-agents and the random environment, s0 ∈ S; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ S and the union of all S_i equal to S;
A represents the finite set of actions of the multi-agents;
X represents the finite set of all clocks;
C represents the set of clock constraints;
Inv indicates the invariance condition, a clock constraint, attached to each multi-agent state in S;
en indicates the clock constraint that must hold when action a is taken in multi-agent state s;
δ represents the state transition function by which the multi-agents move from state s through action a to state s', δ: S × A → Dist(S), where Dist(S) denotes a probability distribution over S;
r represents the reward function corresponding to the states and actions of the multi-agents, taking values in the real numbers ℝ;
S112, establishing the time-constraint-oriented multi-objective random game model between the multi-agents and the random environment or among the multi-agents, and taking σ as the strategy with which the multi-agents select actions from the action set A along a path ω; the reward expectation of a strategy is as follows:

E^σ[r] = Σ_ω P^σ(ω) · Σ_{(s,a)∈ω} ( r_S(s) + r_A(a) )

in the formula: r_S(s) denotes the reward function corresponding to multi-agent state s; r_A(a) denotes the reward function corresponding to multi-agent action a; E^σ[r] denotes the expected reward of the multi-agents; σ denotes a strategy; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ.
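Where the reward expectation above is estimated from sampled game paths rather than computed exactly, the estimator can be sketched as follows; this is a minimal illustration in which `sample_path` and the reward dictionaries are assumed placeholders, not part of the patent:

```python
import random

def estimate_reward_expectation(sample_path, state_reward, action_reward, episodes=10000):
    """Monte Carlo estimate of E^sigma[r]: the average, over sampled paths,
    of the accumulated state rewards r_S(s) and action rewards r_A(a)."""
    total = 0.0
    for _ in range(episodes):
        path = sample_path()  # list of (state, action) pairs generated under strategy sigma
        total += sum(state_reward.get(s, 0.0) + action_reward.get(a, 0.0) for s, a in path)
    return total / episodes

# Toy usage with a hand-written two-state path generator (illustrative only).
if __name__ == "__main__":
    def sample_path():
        return [("idle", "trigger"), ("wait", "assign")] if random.random() < 0.5 else [("idle", "trigger")]
    print(estimate_reward_expectation(sample_path, {"wait": -1.0}, {"assign": 2.0}))
```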
Further, a clock constraint c in the set of clock constraints C of step S111 is defined by the following grammar:

c ::= x ∼ k | c1 ∧ c2, with ∼ ∈ {<, ≤, =, ≥, >}

in the formula: x is one of the clocks in X, and k is a constant.
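For illustration, such atomic clock constraints, their conjunctions and their negations could be represented and evaluated against a clock valuation as in the following sketch; the data layout is an assumption of this example, not a structure defined by the patent:

```python
from dataclasses import dataclass
from typing import Callable, Dict

Valuation = Dict[str, float]  # current value of each clock in X

@dataclass
class Atomic:
    clock: str      # x, one of the clocks in X
    op: str         # one of <, <=, ==, >=, >
    bound: float    # the constant k

    def holds(self, v: Valuation) -> bool:
        x = v[self.clock]
        return {"<": x < self.bound, "<=": x <= self.bound, "==": x == self.bound,
                ">=": x >= self.bound, ">": x > self.bound}[self.op]

def conj(*cs) -> Callable[[Valuation], bool]:
    return lambda v: all(c.holds(v) if isinstance(c, Atomic) else c(v) for c in cs)

def neg(c) -> Callable[[Valuation], bool]:
    return lambda v: not (c.holds(v) if isinstance(c, Atomic) else c(v))

# Example: a delay constraint x >= 5 combined with a deadline x <= 9.
window = conj(Atomic("x", ">=", 5), Atomic("x", "<=", 9))
print(window({"x": 7.0}), window({"x": 12.0}))  # True False
```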
further, step S12 is specifically:
s121, acquiring initial data of all states and actions of the multi-agent in a random environment;
S122, establishing the time-constraint-oriented random game model based on the acquired data, simulating the running traces of the random game model through UPPAAL-SMC, and exploring all states and actions of the multi-agents in the random environment and training the target strategy;
S123, establishing the state-action value function table Q(s, a) of the multi-agents by off-line learning from the simulated running traces; the value function table Q(s, a) is defined as the value of taking action a in state s, wherein: s denotes a state tuple, a denotes an action tuple, and the state tuple records the different classifications of the state and the game participant to which the current state belongs.
Further, step S13 is specifically:
S131, for the two-player zero-sum random game, a state-action value function table Q(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agent or the random environment selects the action for s by the ε-greedy method, and the value function is finally updated by the method of cumulatively updating the average value, according to the formula:

Q(s, a) ← Q(s, a) + (1/k) · ( G − Q(s, a) )

in the formula: k denotes the approximate number of cumulative computations, which may be regarded as a step size, k ≥ 1; G denotes the estimated return, i.e. the sum of the future rewards with attenuation;
S132, for the multi-player general-sum random game, a state-action value function table Q_i(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agents select the action for s by the ε-greedy method, and the value function is finally updated by the Nash equilibrium function, according to the formula:

Q_i(s, a) ← Q_i(s, a) + (1/k) · ( r_i + γ · Nash_i(s') − Q_i(s, a) )

in the formula: k denotes the approximate number of cumulative computations, k ≥ 1; n denotes the number of multi-agents, i ∈ {1, …, n}; γ denotes the attenuation value; r_i denotes the reward currently earned by the multi-agent; s' denotes the new state obtained after the selected action a is executed in state s; Nash_i(s') denotes the long-term average return calculated from state s' when the multi-agents adopt the joint strategy (π_1, …, π_n).
Further, the Nash equilibrium function Nash_i(s') of an agent i in the general-sum random game of step S132 satisfies the Nash equilibrium condition over the agents' strategy sets: under the joint strategy (π_1, …, π_n), no agent i can increase its long-term return by unilaterally changing its own strategy within its strategy set, in the formula: Θ_i denotes the strategy set of a certain agent i; n denotes the number of multi-agents.
Further, step S14 is specifically:
S141, taking the weighted sum of the multi-objective rewards as the optimization target, the weighted sum of the multi-objective optimization is calculated according to the formula:

E^σ[w · r] = Σ_ω P^σ(ω) · ( w · r(ω) )

in the formula: w denotes the weight vector; r denotes the reward vector; σ denotes the strategy; E^σ[w · r] denotes the expected reward function to which the weight combination is added; w · r denotes the target reward weighted sum under strategy σ; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ;
and S142, fitting the multi-target pareto curves with different weight combinations according to the convex optimization hyperplane separation theorem.
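A minimal sketch of the weighted-sum scalarization of S141, collapsing a reward vector into one scalar objective for a given weight vector w, follows; the function and variable names are illustrative assumptions:

```python
def weighted_reward(weights, reward_vector):
    """Target reward weighted sum w . r for one path or one strategy evaluation."""
    if len(weights) != len(reward_vector):
        raise ValueError("weight and reward vectors must have equal length")
    return sum(w * r for w, r in zip(weights, reward_vector))

# Example: two objectives (jobs finished, negative energy consumed), weighted 0.7 / 0.3.
print(weighted_reward([0.7, 0.3], [5.0, -2.4]))
```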
After the technical scheme is adopted, the invention has the following positive effects:
(1) The invention introduces time constraints into the random game model; on the one hand, the real-time, non-deterministic and probabilistic behaviors exhibited among the multi-agents, or in the interaction between the multi-agents and the random environment, can be described; on the other hand, the time-related reward function can be quantified, and the multi-objective optimization strategy is determined through the reward function.
(2) The invention designs an off-line algorithm that calculates the reward expectation from Monte Carlo simulated traces, avoiding the state-space explosion problem generated when calculating the maximum reward expectation, and reduces the number of algorithm iterations according to the convergence conditions of the zero-sum random game and the general-sum random game, thereby reducing the energy consumption of the system and improving the reaction speed of the multi-agents.
(3) The invention gives different weights to the multiple targets and distinguishes the priorities of the targets, thereby improving the running reliability of the multi-agents.
Drawings
In order that the present disclosure may be more readily and clearly understood, the following detailed description of the present disclosure is provided in connection with specific embodiments thereof and with the accompanying drawings, in which:
FIG. 1 is a block diagram of a dispatch center according to the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 shows the value function table generation method for the two-player zero-sum random game of the present invention;
FIG. 4 shows the value function table generation method for the multi-player general-sum random game of the present invention;
FIG. 5 illustrates the pareto curve generation method of the present invention;
FIG. 6 is a graph of a pareto curve fit based on weight combinations according to the present invention;
FIG. 7 is a schematic diagram of the dynamic game model between the multiple robots and the random environment in embodiment 1;
FIG. 8 is a schematic diagram of the dynamic game model among the multiple robots in embodiment 2.
Detailed Description
As shown in figs. 1-5, a time-constraint-oriented multi-agent scheduling method includes the following steps:
S1, establishing a scheduling center, which specifically comprises the following steps:
S11, establishing, based on a time-constraint-oriented multi-objective random game template, a random game model between the multi-agents and the random environment or among the multi-agents, which specifically comprises the following steps:
S111, the time-constraint-oriented multi-objective random game template is a ten-tuple (Π, S, s0, A, X, C, Inv, en, δ, r), wherein:
Π represents the finite set of participants, namely the multi-agents and the random environment taking part in the random game;
S represents the finite set of states of the multi-agents and the random environment;
s0 represents the initial state of the multi-agents and the random environment, s0 ∈ S; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ S and the union of all S_i equal to S;
A represents the finite set of actions of the multi-agents;
X represents the finite set of all clocks;
C represents the set of clock constraints; a clock constraint c is defined by the grammar c ::= x ∼ k | c1 ∧ c2, where x is one of the clocks in X, k is a constant and ∼ ∈ {<, ≤, =, ≥, >}. For example, if a state requires a delay d, the actions in A corresponding to that state of S carry a time constraint of the form x ≥ d; if a state is subject to a cutoff (deadline) time d, the corresponding actions in A carry a constraint of the form x ≤ d. At the same time, c can also be a combination of different time constraints, e.g. x ≥ d1 ∧ x ≤ d2, and the logical negation of a constraint c is also accepted.
Inv indicates the invariance condition, a clock constraint, attached to each multi-agent state in S;
en indicates the clock constraint that must hold when action a is taken in multi-agent state s;
δ represents the state transition function by which the multi-agents move from state s through action a to state s', δ: S × A → Dist(S), where Dist(S) denotes a probability distribution over S;
r represents the reward function corresponding to the states and actions of the multi-agents, taking values in the real numbers ℝ.
S112, establishing the time-constraint-oriented multi-objective random game model between the multi-agents and the random environment or among the multi-agents, and taking σ as the strategy with which the multi-agents select actions from the action set A along a path ω; the reward expectation of a strategy is as follows:

E^σ[r] = Σ_ω P^σ(ω) · Σ_{(s,a)∈ω} ( r_S(s) + r_A(a) )

in the formula: r_S(s) denotes the reward function corresponding to multi-agent state s; r_A(a) denotes the reward function corresponding to multi-agent action a; E^σ[r] denotes the expected reward of the multi-agents; σ denotes a strategy; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ.
S12, simulating the running traces ω of the random game model by statistical model checking, and designing a value function learning method that is not based on the model to calculate the maximum reward expectation of the multi-agents taking different actions in the various states, which is as follows:
S121, acquiring initial data of all states and actions of the multi-agents in the random environment;
S122, establishing the time-constraint-oriented random game model based on the acquired data, simulating the running traces ω of the random game model through UPPAAL-SMC (a statistical model checking tool), and exploring all states and actions of the multi-agents in the random environment and training the target strategy;
S123, establishing the state-action value function table Q(s, a) of the multi-agents by off-line learning from the simulated running traces ω; the value function table Q(s, a) is defined as the value of taking action a in state s, wherein: s denotes a state tuple, a denotes an action tuple, and the state tuple records the different classifications of the state and the game participant to which the current state belongs.
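One possible realization of the off-line learning of S123, building the state-action value table by averaging discounted returns over simulated traces, is sketched below; the trace format and names are assumptions of this example, and in the method described here the traces themselves would come from UPPAAL-SMC:

```python
from collections import defaultdict

def build_q_table(traces, gamma=0.95):
    """Every-visit Monte Carlo: Q(s, a) is the running average of the
    discounted return observed after taking action a in state s."""
    q = defaultdict(float)
    visits = defaultdict(int)
    for trace in traces:                      # trace: list of (state, action, reward)
        g = 0.0
        returns = []
        for state, action, reward in reversed(trace):
            g = reward + gamma * g            # discounted return from this step onward
            returns.append((state, action, g))
        for state, action, g in returns:
            visits[(state, action)] += 1
            k = visits[(state, action)]
            q[(state, action)] += (g - q[(state, action)]) / k   # incremental average
    return q

# Toy usage with two hand-written traces.
traces = [[("idle", "assign", 0.0), ("run", "finish", 1.0)],
          [("idle", "assign", 0.0), ("run", "fault", -1.0)]]
print(dict(build_q_table(traces)))
```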
S13, iterating the algorithm according to the convergence conditions of the zero-sum random game between the multi-agents and the random environment and of the general-sum random game among the multi-agents, which is as follows:
S131, for the two-player zero-sum random game, the state-action value function table Q(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agent or the random environment selects the action for s by the ε-greedy method, i.e. if A is the action set corresponding to s, the action that maximizes the value function table is selected with probability 1 − ε, and an action is selected at random with probability ε; after the selected action a is executed in state s, a new state s' and the corresponding reward r are obtained. Suppose that the two game participants are max and min, that their state sets are S_max and S_min respectively, and that the target of the model is to maximize the gain of participant max. If the next-step state belongs to max, the reward needs to be maximized, as shown in equation (1); if the next-step state belongs to min, the reward needs to be minimized, as shown in equation (2):

G = r + γ · max_{a'} Q(s', a')   (1)
G = r + γ · min_{a'} Q(s', a')   (2)

in the formula: r denotes the currently obtained reward, max_{a'} Q(s', a') denotes currently maximizing the gain of the next step, min_{a'} Q(s', a') denotes currently minimizing the gain of the next step, and γ denotes the attenuation value;
finally the value function is updated by the method of cumulatively updating the average value, according to the formula:

Q(s, a) ← Q(s, a) + (1/k) · ( G − Q(s, a) )

in the formula: k denotes the approximate number of cumulative computations, which may be regarded as a step size, k ≥ 1; G denotes the estimated return, i.e. the sum of the future rewards with attenuation;
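The S131 update can be sketched as follows for a two-player zero-sum game: ε-greedy action selection plus the incremental-average update, with the next-step target taken as a max or a min depending on which participant owns the next state; state ownership, the action sets and the toy usage values are assumptions of this sketch:

```python
import random
from collections import defaultdict

def eps_greedy(q, state, actions, eps=0.1, maximize=True):
    """With probability eps pick a random action, otherwise the best one for this state."""
    if random.random() < eps:
        return random.choice(actions)
    best = max if maximize else min
    return best(actions, key=lambda a: q[(state, a)])

def zero_sum_update(q, counts, s, a, reward, s_next, next_actions, next_is_max, gamma=0.9):
    """Q(s,a) <- Q(s,a) + (1/k) * (G - Q(s,a)), with
    G = r + gamma * max_a' Q(s',a') if the next state belongs to the maximizer,
    G = r + gamma * min_a' Q(s',a') otherwise."""
    pick = max if next_is_max else min
    g = reward + gamma * pick(q[(s_next, a2)] for a2 in next_actions)
    counts[(s, a)] += 1
    k = counts[(s, a)]
    q[(s, a)] += (g - q[(s, a)]) / k

q, counts = defaultdict(float), defaultdict(int)
zero_sum_update(q, counts, "wait", "assign", 1.0, "run", ["finish", "fault"], next_is_max=False)
print(dict(q))
print(eps_greedy(q, "wait", ["assign", "idle"]))
```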
S132, for the multi-player general-sum random game, the state-action value function table Q_i(s, a) is first initialized; that is, for a given state, different agents will take different actions, and each agent generates an optimal strategy by observing the actions of the other agents and the corresponding reward values. When selecting the action corresponding to each state s, the different agents select the action for s by the ε-greedy method; after the selected action a is executed in state s, a new state s' and the corresponding reward r are obtained; finally the value function is updated by the Nash equilibrium function, according to the formula:

Q_i(s, a) ← Q_i(s, a) + (1/k) · ( r_i + γ · Nash_i(s') − Q_i(s, a) )

in the formula: k denotes the approximate number of cumulative computations, k ≥ 1; n denotes the number of multi-agents, i ∈ {1, …, n}; γ denotes the attenuation value; r_i denotes the reward currently earned by the multi-agent; s' denotes the new state obtained after the selected action a is executed in state s; Nash_i(s') denotes the long-term average return calculated from state s' when the multi-agents adopt the joint strategy (π_1, …, π_n);
wherein Nash_i(s'), the long-term average return calculated from s' under the joint strategy (π_1, …, π_n), satisfies the Nash equilibrium condition over the agents' strategy sets: no agent i can increase its long-term return by unilaterally changing its own strategy within its strategy set Θ_i, where Θ_i denotes the strategy set of agent i.
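A simplified sketch of the S132 update for the special case of two agents, where the stage game at the next state is solved only over pure joint actions; the patent's Nash equilibrium function is not restricted to pure equilibria, so this is an illustrative approximation rather than the method itself:

```python
from collections import defaultdict
from itertools import product

def pure_nash_values(q1, q2, s, acts1, acts2):
    """Return (Nash_1(s), Nash_2(s)) for the first pure-strategy equilibrium found,
    falling back to the best joint action for agent 1 if none exists."""
    for a1, a2 in product(acts1, acts2):
        if all(q1[(s, a1, a2)] >= q1[(s, b1, a2)] for b1 in acts1) and \
           all(q2[(s, a1, a2)] >= q2[(s, a1, b2)] for b2 in acts2):
            return q1[(s, a1, a2)], q2[(s, a1, a2)]
    a1, a2 = max(product(acts1, acts2), key=lambda j: q1[(s,) + j])
    return q1[(s, a1, a2)], q2[(s, a1, a2)]

def nash_q_update(q1, q2, counts, s, joint, r1, r2, s_next, acts1, acts2, gamma=0.9):
    """Q_i(s,a) <- Q_i(s,a) + (1/k) * (r_i + gamma * Nash_i(s') - Q_i(s,a))."""
    n1, n2 = pure_nash_values(q1, q2, s_next, acts1, acts2)
    counts[(s, joint)] += 1
    k = counts[(s, joint)]
    q1[(s,) + joint] += (r1 + gamma * n1 - q1[(s,) + joint]) / k
    q2[(s,) + joint] += (r2 + gamma * n2 - q2[(s,) + joint]) / k

q1, q2, counts = defaultdict(float), defaultdict(float), defaultdict(int)
nash_q_update(q1, q2, counts, "p1", ("move", "wait"), 1.0, 0.5, "p2", ["move", "wait"], ["move", "wait"])
print(q1[("p1", "move", "wait")], q2[("p1", "move", "wait")])
```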
S14, fitting the multi-objective pareto curves under different weight combinations according to the convex optimization hyperplane separation theorem, which is as follows:
S141, taking the weighted sum of the multi-objective rewards as the optimization target, the weighted sum of the multi-objective optimization is calculated according to the formula:

E^σ[w · r] = Σ_ω P^σ(ω) · ( w · r(ω) )

in the formula: w denotes the weight vector; r denotes the reward vector; σ denotes the strategy; E^σ[w · r] denotes the expected reward function to which the weight combination is added; w · r denotes the target reward weighted sum under strategy σ; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ;
S142, if the goal is to calculate the maximum reward expectation, the feasible domain of the reward expectation is the set of points dominated by some already-computed reward expectation, i.e. the set of points whose values are all less than or equal to some reward expectation E in the set Y of computed reward expectations, E ∈ Y; if the goal is to calculate the minimum reward expectation (e.g. a minimized energy-consumption scenario), the feasible domain of the reward expectation is defined symmetrically, with the inequality reversed;
S143, if the goal is to calculate the maximum reward expectation, the infeasible domain of the reward expectation is the set of reward expectation vectors q such that, for every weight vector w in the set of weight vectors W, the inner product w · q exceeds the inner products w · E attainable by the computed reward expectation vectors E; if the goal is to calculate the minimum reward expectation, the infeasible domain of the reward expectation is defined symmetrically;
S144, calculating the weight vector w with the maximum variance, and calculating the distance d between the feasible domain and the infeasible domain;
S145, if the distance d is larger than the set distance threshold θ, calculating w so that the maximum separating hyperplane constructed from w can maximally separate the regions where the two sets are located, i.e. the coverage of the reachable set corresponding to the weight is enlarged; based on the convergence of the distance function, the newly generated reward expectation E is added to the reward expectation set Y, and the iteration continues until d is less than θ.
S2, the dispatching center collects real-time data of states and actions of the multi-agent and random environments;
and S3, the dispatching center processes the acquired data and sends the action instruction to the multi-agent.
Fitting a pareto curve based on the multi-objective reward weights for different weight vector combinations requires traversing the weight vector combinations and calculating the corresponding reward expectations many times. If there are n target rewards, exhaustively searching the weights and the corresponding optimal reward expectations requires calculating the reward expectation of every group of weight vector combinations, at a correspondingly high time complexity. Because the reachable point sets generated by different weight combinations intersect, the invention selects the weight combination w by approximating the unreachable point set of the reward expectation toward the reachable point set, so that the coverage of the reachable point set corresponding to the weight is enlarged as much as possible, thereby reducing the number of reward expectation calculations and improving the efficiency of the whole algorithm.

As shown in FIG. 6, fitting a weight-combination-based pareto curve that minimizes two objectives first computes the reward expectations in the two extreme cases, namely the case where only objective 2 is considered and the case where only objective 1 is considered. A corresponding straight line is first generated from each weight and its reward expectation vector; the area enclosed by the intersection point of the two straight lines and the coordinate axes is the unreachable set, the area enclosed by the two reachable points and the upper limit of the reward expectation is the reachable set, and different weights correspond to different reachable point sets and maximum separating hyperplanes. The slope of the maximum separating hyperplane at the unreachable point of the reachable region furthest from the origin can be taken as a weight, reachable points are generated from the reward expectations, and a set of reachable points is formed. As shown in FIG. 6, the point of the unreachable set furthest from the origin is taken as q; by the theorem on the separation of a convex set by a hyperplane, a hyperplane must exist that separates the point q from the reachable set. Therefore, by calculating the maximum separating hyperplane, the corresponding weight value w is obtained. The maximum reward expectation is calculated from w and added to the reward expectation set Y, and the corresponding reachable set and unreachable set are generated. As can be seen from FIG. 6, the maximum values of the reachable set and the unreachable set approach each other; when the distance between the reachable set and the unreachable set is smaller than θ, the corresponding reachable point set is output, and the boundary points of the reachable point set form the pareto curve.
Example 1
As shown in fig. 7, multiple robots (robot M, robot N) complete multiple jobs (job T, job T+1, job T+2) in a random environment, and the process of scheduling these robots to complete the jobs generates corresponding energy consumption and time delay. Since a robot may malfunction in the middle of completing a job, different robots may complete different jobs at different times. Therefore, firstly, based on the time-constraint-oriented multi-objective random game template, the system in which multiple robots complete multiple jobs in a random environment is modeled by the time-constraint-oriented zero-sum random game method, the two game parties being the multiple robots and the random environment. The system mainly comprises a job model and a multi-machine scheduler model.

The job model has three states, namely an idle state, a waiting state and an executing state; each job is triggered by the random environment to enter the waiting state from the idle state, and if the scheduler determines the robot executing the job and distributes the job to the corresponding robot for execution, the job enters the executing state; if the robot fails midway and cannot complete the job, the job returns from the executing state to the waiting state to wait for the next available robot, and after the job is completed, the job returns from the executing state to the idle state. Each robot has three states, namely an idle state, a running state and a fault state. When the robot is in the idle state and the running state, it fails with probability (1 − p) and probability (1 − q), respectively. In the idle state, if a job waiting to be executed is allocated by the scheduler, the robot enters the running state. Each robot can only be allocated jobs within its execution range and returns to the idle state after the job is finished. If the robot recovers from the fault state and works normally, it returns to the idle state.

Secondly, the running traces ω of the model are simulated through UPPAAL-SMC, all states and actions in the random environment are explored, the target strategy is trained through the acquired data, the simulated running traces ω are learned off-line, and the state-action value function table Q(s, a) is established; Q(s, a) is defined as the value of taking action a in state s, wherein: s denotes a state tuple, in which the job component takes three different values, namely idle, waiting and executing, the robot component takes three different values, namely idle, running and fault, and the control attribution of the current state is recorded, namely the environment or the scheduler; a denotes an action tuple, comprising: scheduling a job to be executed to run on a robot; a job being triggered; a robot fault; a job becoming ready to be executed; a job being completed; and a robot recovering from a fault. The state-action value function table is initialized; when selecting the action corresponding to each state s, the action for s is selected by the ε-greedy method; after the selected action a is executed in state s, a new state s' and the corresponding reward r are obtained; and the value function Q(s, a) is updated by the method of cumulatively updating the average value. Finally, the weighted sum of the number of finished jobs, the consumed energy and the time for finishing the jobs is taken as the optimization target, and the multi-objective pareto curve is fitted according to the convex optimization hyperplane separation theorem, thereby generating the scheduling strategy by which the multiple robots complete the multiple jobs in the random environment.
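A toy sketch of the robot state machine of this embodiment, with failure probabilities (1 − p) in the idle state and (1 − q) in the running state; the transition back from the fault state and all probability values are illustrative assumptions:

```python
import random

P_IDLE_OK, P_RUN_OK = 0.95, 0.90   # p and q from the embodiment, chosen arbitrarily here

def step_robot(robot_state, has_assigned_job):
    """One step of a single robot's state machine: idle -> running on assignment,
    random faults from idle/running, recovery back to idle."""
    if robot_state == "fault":
        return "idle" if random.random() < 0.5 else "fault"       # recover with some probability
    if robot_state == "idle":
        if random.random() > P_IDLE_OK:
            return "fault"
        return "running" if has_assigned_job else "idle"
    # running
    if random.random() > P_RUN_OK:
        return "fault"                                            # job goes back to waiting
    return "idle"                                                 # job finished

state = "idle"
for t in range(5):
    state = step_robot(state, has_assigned_job=True)
    print(t, state)
```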
Example 2
This embodiment uses the same method as embodiment 1 and is directed to the task in which multiple robots cooperatively complete specimen collection and transportation. As shown in fig. 8, multiple robots (robot M, robot M+1, robot M+2, …) need to collect or process specimens at different task points (task points 1 to 6) and then transport the specimens to target points (target point 1 and target point 2). When one robot carries out a task at a certain task point, that task point is not open to other robots, and there is an order among the task points; for example, task point 4 is open only to a robot that has completed the task of task point 1, task point 5 is open only to a robot that has completed task point 1, task point 2 or task point 3, and task point 6 is open only to a robot that has completed task point 3. The uncertainty of the whole system comprises the uncertainty of the task time of different robots at different task points and the uncertainty of the moving time of the robots between different task points. In the process of executing tasks and moving, the robots need to avoid static obstacles and dynamic obstacles, ensure that the total power consumption is minimum under the condition that the electric power used by each robot is different, and finally reach the target place.

The robot has three states for executing the task at a task point: when the robot reaches the task point, waiting is triggered first; if another robot is already executing the task at the task point, the robot waits for that task to be completed, and when the other robot completes the task, the robot starts executing its own task. If the task fails midway, the robot returns to the waiting state and continues to wait for execution. After the robot completes the task at the task point, it searches for the next task point at which to complete a task.

In order to establish a multi-objective optimized task scheduling strategy, namely a strategy that completes the collection, processing and transportation of all specimens in a short time and with low energy consumption, firstly, a time-constraint-oriented general-sum random game model is established based on the time-constraint-oriented multi-objective random game template, the participants of the game being the multiple robots. Secondly, the running traces ω of the model are simulated through UPPAAL-SMC, all states and actions of the multiple robots in the random environment are explored, simulation data are then collected to train the multi-objective optimization strategy, the simulated running traces ω are learned off-line, and the state-action value function table Q_i(s, a) is established; Q_i(s, a) is defined as the value obtained when robot i of a given type takes action a in state s, wherein: s denotes a state tuple, in which the task component takes three different values, namely idle, waiting and executing, and the task point or target point at which the current robot i is located is also recorded; a denotes an action tuple, wherein the move action indicates that the robot moves between task points or between a task point and a target point, and this process can adopt an existing path-search algorithm to find the shortest path between the task points; the wait action indicates that the current robot is ready to perform the task at the task point, and if the task point is occupied by another robot, the robot waits until the task point is idle; the execute action indicates that the robot has entered the task-execution state from waiting; the fault action indicates that the current task of the robot fails due to internal or external factors and needs to be re-executed; the finish action indicates that the robot finishes the current task and can move to the next task point to execute a task. All robots select the action corresponding to each state s by the ε-greedy method; after the selected action a is executed in state s, a new state s' and the corresponding reward r are obtained; and the value function Q_i(s, a) is updated by the Nash equilibrium function. Finally, the weighted sum of the task execution time, the total energy consumption and the task completion degree is taken as the optimization target, and the multi-objective pareto curve is fitted according to the convex optimization hyperplane separation theorem, thereby generating the multi-objective optimization strategy by which the multiple robots cooperatively collect and transport specimens in the random environment.
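The opening rules among task points in this embodiment (task point 4 after task point 1; task point 5 after task point 1, 2 or 3; task point 6 after task point 3) can be encoded as a simple precedence check, for example:

```python
# Which completed task points open each later task point (from the embodiment text).
OPENS_AFTER = {4: {1}, 5: {1, 2, 3}, 6: {3}}

def task_point_open(task_point, completed_by_robot):
    """A task point with no listed prerequisite is always open; otherwise it is
    open once the robot has completed at least one of the listed task points."""
    prereq = OPENS_AFTER.get(task_point)
    return prereq is None or bool(prereq & completed_by_robot)

print(task_point_open(4, {1}), task_point_open(5, {2}), task_point_open(6, {1, 2}))  # True True False
```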
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A scheduling method of multi-agent facing time constraint is characterized in that the method comprises the following steps:
s1, establishing a scheduling center, which specifically comprises the following steps:
S11, establishing, based on a time-constraint-oriented multi-objective random game template, a random game model between the multi-agents and the random environment or among the multi-agents;
S12, simulating the running traces of the random game model by statistical model checking, and designing a value function learning method that is not based on the model to calculate the maximum reward expectation of the multi-agents taking different actions in the various states;
S13, iterating the algorithm according to the convergence conditions of the zero-sum random game between the multi-agents and the random environment and of the general-sum random game among the multi-agents;
s14, fitting a multi-target pareto curve based on weight combination according to a convex optimization hyperplane separation theorem;
s2, the dispatching center collects real-time data of states and actions of the multi-agent and random environments;
and S3, the dispatching center processes the acquired data and sends the action instruction to the multi-agent.
2. The scheduling method of multi-agent oriented to time constraints as claimed in claim 1, wherein the step S11 is specifically as follows:
S111, the time-constraint-oriented multi-objective random game template is a ten-tuple (Π, S, s0, A, X, C, Inv, en, δ, r), wherein:
Π represents the finite set of participants, namely the multi-agents and the random environment taking part in the random game;
S represents the finite set of states of the multi-agents and the random environment;
s0 represents the initial state of the multi-agents and the random environment, s0 ∈ S; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ S and the union of all S_i equal to S;
A represents the finite set of actions of the multi-agents;
X represents the finite set of all clocks;
C represents the set of clock constraints;
Inv indicates the invariance condition, a clock constraint, attached to each multi-agent state in S;
en indicates the clock constraint that must hold when action a is taken in multi-agent state s;
δ represents the state transition function by which the multi-agents move from state s through action a to state s', δ: S × A → Dist(S), where Dist(S) denotes a probability distribution over S;
r represents the reward function corresponding to the states and actions of the multi-agents, taking values in the real numbers ℝ;
S112, establishing the time-constraint-oriented multi-objective random game model between the multi-agents and the random environment or among the multi-agents, and taking σ as the strategy with which the multi-agents select actions from the action set A along a path ω; the reward expectation of a strategy is as follows:

E^σ[r] = Σ_ω P^σ(ω) · Σ_{(s,a)∈ω} ( r_S(s) + r_A(a) )

in the formula: r_S(s) denotes the reward function corresponding to multi-agent state s; r_A(a) denotes the reward function corresponding to multi-agent action a; E^σ[r] denotes the expected reward of the multi-agents; σ denotes a strategy; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ.
3. The time-constraint-oriented multi-agent scheduling method as claimed in claim 2, characterized in that: a clock constraint c in the set of clock constraints C of step S111 is defined by the following grammar:

c ::= x ∼ k | c1 ∧ c2, with ∼ ∈ {<, ≤, =, ≥, >}

in the formula: x is one of the clocks in X, and k is a constant.
4. the scheduling method of multi-agent oriented to time constraints as claimed in claim 1, wherein the step S12 is specifically as follows:
s121, acquiring initial data of all states and actions of the multi-agent in a random environment;
S122, establishing the time-constraint-oriented random game model based on the acquired data, simulating the running traces of the random game model through UPPAAL-SMC, and exploring all states and actions of the multi-agents in the random environment and training the target strategy;
S123, establishing the state-action value function table Q(s, a) of the multi-agents by off-line learning from the simulated running traces; the value function table Q(s, a) is defined as the value of taking action a in state s, wherein: s denotes a state tuple, a denotes an action tuple, and the state tuple records the different classifications of the state and the game participant to which the current state belongs.
5. The scheduling method of multi-agent oriented to time constraints as claimed in claim 4, wherein the step S13 is specifically as follows:
S131, for the two-player zero-sum random game, a state-action value function table Q(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agent or the random environment selects the action for s by the ε-greedy method, and the value function is finally updated by the method of cumulatively updating the average value, according to the formula:

Q(s, a) ← Q(s, a) + (1/k) · ( G − Q(s, a) )

in the formula: k denotes the approximate number of cumulative computations, which may be regarded as a step size, k ≥ 1; G denotes the estimated return, i.e. the sum of the future rewards with attenuation;
S132, for the multi-player general-sum random game, a state-action value function table Q_i(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agents select the action for s by the ε-greedy method, and the value function is finally updated by the Nash equilibrium function, according to the formula:

Q_i(s, a) ← Q_i(s, a) + (1/k) · ( r_i + γ · Nash_i(s') − Q_i(s, a) )

in the formula: k denotes the approximate number of cumulative computations, k ≥ 1; n denotes the number of multi-agents, i ∈ {1, …, n}; γ denotes the attenuation value; r_i denotes the reward currently earned by the multi-agent; s' denotes the new state obtained after the selected action a is executed in state s; Nash_i(s') denotes the long-term average return calculated from state s' when the multi-agents adopt the joint strategy (π_1, …, π_n).
6. The time-constraint-oriented multi-agent scheduling method as claimed in claim 5, characterized in that: the Nash equilibrium function Nash_i(s') of an agent i in the general-sum random game of step S132 satisfies the Nash equilibrium condition over the agents' strategy sets: under the joint strategy (π_1, …, π_n), no agent i can increase its long-term return by unilaterally changing its own strategy within its strategy set, in the formula: Θ_i denotes the strategy set of a certain agent i; n denotes the number of multi-agents.
7. The scheduling method of multi-agent oriented to time constraints as claimed in claim 1, wherein the step S14 is specifically as follows:
S141, taking the weighted sum of the multi-objective rewards as the optimization target, the weighted sum of the multi-objective optimization is calculated according to the formula:

E^σ[w · r] = Σ_ω P^σ(ω) · ( w · r(ω) )

in the formula: w denotes the weight vector; r denotes the reward vector; σ denotes the strategy; E^σ[w · r] denotes the expected reward function to which the weight combination is added; w · r denotes the target reward weighted sum under strategy σ; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ;
and S142, fitting the multi-target pareto curves with different weight combinations according to the convex optimization hyperplane separation theorem.
CN202110810946.4A 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint Active CN113269297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110810946.4A CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110810946.4A CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Publications (2)

Publication Number Publication Date
CN113269297A 2021-08-17
CN113269297B CN113269297B (en) 2021-11-05

Family

ID=77236924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110810946.4A Active CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Country Status (1)

Country Link
CN (1) CN113269297B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563527A (en) * 2022-09-27 2023-01-03 西南交通大学 Multi-Agent deep reinforcement learning framework and method based on state classification and assignment
CN115576278A (en) * 2022-09-30 2023-01-06 常州大学 Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107045655A (en) * 2016-12-07 2017-08-15 三峡大学 Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan
CN110471297A (en) * 2019-07-30 2019-11-19 清华大学 Multiple agent cooperative control method, system and equipment
CN110728406A (en) * 2019-10-15 2020-01-24 南京邮电大学 Multi-agent power generation optimization scheduling method based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning
CN112132263A (en) * 2020-09-11 2020-12-25 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045655A (en) * 2016-12-07 2017-08-15 三峡大学 Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN110471297A (en) * 2019-07-30 2019-11-19 清华大学 Multiple agent cooperative control method, system and equipment
CN110728406A (en) * 2019-10-15 2020-01-24 南京邮电大学 Multi-agent power generation optimization scheduling method based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning
CN112132263A (en) * 2020-09-11 2020-12-25 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIFU DING et al.: "Multi-agent Deep Reinforcement Learning Algorithm for Distributed Economic Dispatch in Smart Grid", IECON 2020, The 46th Annual Conference of the IEEE Industrial Electronics Society *
LI Fangyuan (李方圆): "Distributed Scheduling and Optimization of Smart Grid Based on Multi-Agent Cooperative Algorithms", China Doctoral Dissertations Full-text Database, Engineering Science and Technology II *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563527A (en) * 2022-09-27 2023-01-03 西南交通大学 Multi-Agent deep reinforcement learning framework and method based on state classification and assignment
CN115563527B (en) * 2022-09-27 2023-06-16 西南交通大学 Multi-Agent deep reinforcement learning system and method based on state classification and assignment
CN115576278A (en) * 2022-09-30 2023-01-06 常州大学 Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis
CN115576278B (en) * 2022-09-30 2023-08-04 常州大学 Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis
WO2024066675A1 (en) * 2022-09-30 2024-04-04 常州大学 Multi-agent multi-task hierarchical continuous control method based on temporal equilibrium analysis

Also Published As

Publication number Publication date
CN113269297B (en) 2021-11-05

Similar Documents

Publication Publication Date Title
Cao et al. Scheduling semiconductor testing facility by using cuckoo search algorithm with reinforcement learning and surrogate modeling
Choong et al. Automatic design of hyper-heuristic based on reinforcement learning
Zhao et al. A heuristic distributed task allocation method for multivehicle multitask problems and its application to search and rescue scenario
CN113269297B (en) Multi-agent scheduling method facing time constraint
Khan et al. Learning safe unlabeled multi-robot planning with motion constraints
Kannan et al. The autonomous recharging problem: Formulation and a market-based solution
Schillinger et al. Auctioning over probabilistic options for temporal logic-based multi-robot cooperation under uncertainty
Könighofer et al. Online shielding for stochastic systems
Chen et al. A bi-criteria nonlinear fluctuation smoothing rule incorporating the SOM–FBPN remaining cycle time estimator for scheduling a wafer fab—a simulation study
Sun et al. An intelligent controller for manufacturing cells
Liu et al. Multi-agent reinforcement learning-based coordinated dynamic task allocation for heterogenous UAVs
Zaidi et al. Task allocation based on shared resource constraint for multi-robot systems in manufacturing industry
CN114819316A (en) Complex optimization method for multi-agent task planning
Shriyam et al. Task assignment and scheduling for mobile robot teams
Gaggero et al. When time matters: Predictive mission planning in cyber-physical scenarios
Yang et al. Learning graph-enhanced commander-executor for multi-agent navigation
Bøgh et al. Distributed fleet management in noisy environments via model-predictive control
Bahgat et al. A multi-level architecture for solving the multi-robot task allocation problem using a market-based approach
Oliver et al. Auction and swarm multi-robot task allocation algorithms in real time scenarios
Shi et al. Efficient hierarchical policy network with fuzzy rules
Hong et al. Deterministic policy gradient based formation control for multi-agent systems
Jungbluth et al. Reinforcement Learning-based Scheduling of a Job-Shop Process with Distributedly Controlled Robotic Manipulators for Transport Operations
Chandana et al. RANFIS: Rough adaptive neuro-fuzzy inference system
Kim et al. Safety-aware unsupervised skill discovery
Tran et al. Real-time verification for distributed cyber-physical systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant