CN113269297B - Multi-agent scheduling method facing time constraint - Google Patents


Info

Publication number
CN113269297B
CN113269297B (application CN202110810946.4A)
Authority
CN
China
Prior art keywords: agent, state, random, action, reward
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN202110810946.4A
Other languages: Chinese (zh)
Other versions: CN113269297A (en)
Inventor: 朱晨阳
Current Assignee: Donghe Software Jiangsu Co ltd
Original Assignee: Donghe Software Jiangsu Co ltd
Priority date: 2021-07-19
Filing date: 2021-07-19
Publication date: 2021-11-05
Application filed by Donghe Software Jiangsu Co ltd
Priority to CN202110810946.4A
Publication of CN113269297A
Application granted
Publication of CN113269297B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a time-constraint-oriented multi-agent scheduling method comprising the following steps: establishing a dispatching center; the dispatching center collecting real-time data on the states and actions of the multiple agents and the random environment; and the dispatching center processing the collected data and sending action instructions to the multiple agents. By introducing time constraints into the stochastic game model, the real-time, non-deterministic and probabilistic behaviours exhibited among the multiple agents, or in their interaction with the random environment, can be described; time-related reward functions can be quantified, and a multi-objective optimization strategy is determined through these reward functions. The designed algorithm improves the efficiency of computing the model's maximum reward expectation and of fitting the weight-combination-based Pareto curve, thereby improving the reaction speed of the multiple agents. By assigning different weights to the multiple objectives, the priorities of the objectives are distinguished, thereby improving the operational reliability of the multi-agent system.

Description

Multi-agent scheduling method facing time constraint
Technical Field
The invention relates to the technical field of multi-agent interaction, and in particular to a time-constraint-oriented multi-agent scheduling method.
Background
As interactions among multiple agents (robots, robot dogs, drones, etc.) become increasingly close, the errors occurring during interaction continue to grow with the size and complexity of multi-agent systems. How to design a multi-agent scheduling system that meets multi-objective design requirements under an uncertain environment and the corresponding time constraints has become a key scientific problem in urgent need of a solution.
At present, research on multi-agent scheduling systems mainly verifies quantitative properties of the model and properties related to the reward function through model checking, and approaches the Pareto optimum of the model through value iteration. However, the following problems remain unsolved in multi-objective optimization for time-constraint-oriented multi-agent scheduling:
(1) model checking requires an exhaustive search of the state space of the multiple agents and the random environment, and the number of model states grows exponentially with the number of concurrent components, causing the state-space explosion problem;
(2) in a time-constraint-oriented stochastic game model, the reward function may be tied to time points; since running times are uncertain, the reward function is variable, so model-based value iteration and policy iteration algorithms are not suited to this scenario;
(3) when combining multiple objective strategies for multiple agents, there is a lack of description of differences in objective priority, and a lack of research on trading off multi-objective optimization strategies based on weight combinations.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and to provide a time-constraint-oriented multi-agent scheduling method that is conceptually advanced, highly reliable and fast.
The technical scheme for achieving the purpose of the invention is as follows: a time-constraint-oriented multi-agent scheduling method comprising the following steps:
S1, establishing a scheduling center, which specifically comprises the following steps:
S11, based on a time-constraint-oriented multi-objective stochastic game template, establishing a time-constraint-oriented stochastic game model in which multiple agents cooperatively collect and transport specimens;
S12, simulating running trajectories of the stochastic game model by statistical model checking, and designing a model-free value-function learning method to compute the maximum reward expectation for the multiple agents collecting or processing specimens at different task points and then transporting the specimens to a target point, wherein the rewards include task execution time, total energy consumption and task completion degree;
S13, iterating the algorithm according to the convergence conditions of the stochastic games among the multiple agents;
S14, assigning weights to the rewards, computing the weighted sum, fitting the weight-combination-based multi-objective Pareto curve according to the convex-optimization hyperplane separation theorem, and generating the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens;
S2, the dispatching center collects real-time data on the states and actions of the multiple agents and the random environment;
S3, the dispatching center processes the collected data and sends action instructions to the multiple agents.
Further, step S11 is specifically:
S111, the time-constraint-oriented multi-objective stochastic game template is a ten-tuple
⟨Π, L, l_0, A, C, Φ(C), inv, grd, Δ, h⟩
wherein:
Π represents the finite set of participants in the stochastic game, namely the multiple agents and the random environment;
L represents the finite set of states of the multiple agents and the random environment;
l_0 ∈ L represents the initial state of the multiple agents and the random environment; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ L and i ∈ Π;
A represents the finite set of actions of the multiple agents;
C represents the finite set of all clocks;
Φ(C) represents the set of clock constraints;
inv: L → Φ(C) represents the invariant condition of the multiple agents with respect to the clock constraints in state l ∈ L;
grd: L × A → Φ(C) represents the clock constraint when the multiple agents take action a ∈ A in state l ∈ L;
Δ: L × A → Dist(2^C × L) represents the state transition function by which the multiple agents move from state l ∈ L, through action a ∈ A, to state l′ ∈ L, where Dist(2^C × L) denotes the probability distributions over 2^C × L;
h: L ∪ A → R represents the reward function corresponding to the states and actions of the multiple agents, where R denotes the real numbers;
S112, a time-constraint-oriented multi-objective stochastic game model for cooperative collection and transportation by the multiple agents is established, adopting λ ∈ (L A)* L → Dist(A) as the strategy by which the multiple agents select from the action set A along a path (l_0 a_0 l_1 a_1 … l_i); with λ as the task scheduling strategy, the reward expectation formula is as follows:
E[rew(h)] = ∫ ( Σ (h(l_i) + h(a_i)) ) dPr(λ)
in the formula: h(l_i) denotes the reward function corresponding to the multiple agents in state l_i; h(a_i) denotes the reward function corresponding to the multiple agents at action a_i; l_i ∈ L ∧ a_i ∈ A; E[rew(h)] denotes the expected reward function of the multiple agents; λ denotes the task scheduling strategy; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ.
Further, the clock constraint g in the set of clock constraints Φ(C) in step S111 is defined inductively by the following grammar:
g ::= x ≤ c | c ≤ x | ¬g_1 | g_1 ∧ g_2
in the formula: x is a clock in C, c is a constant, g ∈ Φ(C), g_1 ∈ Φ(C), g_2 ∈ Φ(C).
Further, step S12 is specifically:
S121, acquiring initial data on all states and actions of the multiple agents in the random environment;
S122, establishing, based on the collected data, a time-constraint-oriented stochastic game model in which the multiple agents cooperatively collect and transport specimens, simulating running trajectories of the stochastic game model by statistical model checking, exploring all states and actions of the multiple agents in the random environment, and training the target strategy;
S123, establishing a state-action value function table Q(s, a) for the multiple agents by offline learning on the simulated running trajectories, the value function table Q(s, a) being defined as the value of taking action a ∈ QA in state s ∈ QS, wherein: QS = ⟨S, SS⟩ denotes the state tuple, QA = ⟨A⟩ denotes the action tuple, S denotes the set of state classifications, and SS denotes the game participant to which the current state belongs.
Further, step S13 is specifically:
S131, for the two-player zero-sum stochastic game, initializing the state-action value function table Q(s, a), selecting the action corresponding to each state s in the multi-agent side or the random environment according to the ε-greedy method, and updating the value function by the cumulative-average update method, with the formula:
Q(s, a) = Q(s, a) + α[v - Q(s, a)]
in the formula: α denotes the coefficient of the approximate cumulative-average update, which can be regarded as a step size, α ∈ (0, 1];
v denotes the estimated return, i.e. the discounted sum of future payoffs;
S132, for the multi-player general-sum stochastic game, first initializing the state-action value function table Q(s, a_1, …, a_n); when the action corresponding to each state s is selected, the multiple agents select actions according to the ε-greedy method, and finally the value function is updated with the Nash equilibrium function, with the formula:
Q_i(s, a_1, …, a_n) = Q_i(s, a_1, …, a_n) + α[ H_s + γ·Nash_i Q(s′) - Q_i(s, a_1, …, a_n) ]
in the formula: α denotes the coefficient of the approximate cumulative-average update, α ∈ (0, 1]; n denotes the number of agents; γ denotes the attenuation (discount) factor; H_s denotes the reward currently obtained by the multiple agents; s′ denotes the new state obtained after state s performs the selected joint action a = [a_1, …, a_n]; Nash_i Q(s′) denotes the long-term average return computed when the multiple agents adopt the joint task scheduling strategy (π_1, …, π_n) from state s′.
Further, the Nash equilibrium function Nash_i Q(s′) of a certain agent i in the general-sum stochastic game in step S132 satisfies the following formula, where v_i(s′, π_1, …, π_n) denotes the long-term average return to agent i from state s′ under the joint strategy (π_1, …, π_n):
Nash_i Q(s′) = v_i(s′, π_1, …, π_n), with v_i(s′, π_1, …, π_n) ≥ v_i(s′, π_1, …, π_{i-1}, x, π_{i+1}, …, π_n) for every x ∈ X
in the formula: X is the task scheduling strategy set of the agent i (x ∈ X denotes a strategy in that set); n denotes the number of agents.
Further, step S14 is specifically:
S141, taking the weighted sum of the multi-objective rewards as the optimization target and computing the weighted sum of the multi-objective optimization, with the formula:
E[rew(w·h)] = ∫ rew(w·h)(λ) dPr(λ)
in the formula: w denotes a weight vector, h denotes a reward vector, λ denotes a task scheduling strategy, and E[rew(w·h)] denotes the expected reward function with the weight combination applied; rew(w·h)(λ) denotes the weighted sum of the objective rewards under task scheduling strategy λ; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ;
S142, fitting the multi-objective Pareto curves for different weight combinations according to the convex-optimization hyperplane separation theorem to generate the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens, the task scheduling strategy comprising the actions that the different agents need to select in their different states.
After the technical scheme is adopted, the invention has the following positive effects:
(1) The invention introduces time constraints into the stochastic game model; on the one hand, the real-time, non-deterministic and probabilistic behaviours exhibited among the multiple agents, or in their interaction with the random environment, can be described; on the other hand, time-related reward functions can be quantified, and the multi-objective optimization strategy is determined through these reward functions.
(2) By designing an offline algorithm, the invention computes the expected reward from Monte Carlo simulation trajectories, avoiding the state-space explosion that arises when computing the maximum reward expectation, and reduces the number of algorithm iterations according to the convergence conditions of the zero-sum and general-sum stochastic games, thereby reducing the energy consumption of the system and improving the reaction speed of the multiple agents.
(3) The invention gives different weights to a plurality of targets and distinguishes the priorities of the targets, thereby improving the running reliability of the multi-agent.
Drawings
In order that the present disclosure may be more readily and clearly understood, the following detailed description of the present disclosure is provided in connection with specific embodiments thereof and with the accompanying drawings, in which:
FIG. 1 is a block diagram of a dispatch center according to the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 shows the value-function table generation method for the two-player zero-sum stochastic game of the present invention;
FIG. 4 shows the value-function table generation method for the multi-player general-sum stochastic game of the present invention;
FIG. 5 illustrates the Pareto curve generation method according to the present invention;
FIG. 6 is a graph of a Pareto curve fitted based on weight combinations according to the present invention;
FIG. 7 is a schematic diagram of the dynamic game model between the multiple robots and the random environment in Embodiment 1;
FIG. 8 is a schematic diagram of the dynamic game model among the multiple robots in Embodiment 2.
Detailed Description
As shown in FIGS. 1-5, a time-constraint-oriented multi-agent scheduling method comprises the following steps:
S1, establishing a scheduling center, which specifically comprises the following steps:
S11, based on a time-constraint-oriented multi-objective stochastic game template, establishing a time-constraint-oriented stochastic game model in which multiple agents cooperatively collect and transport specimens; the specific steps are as follows:
S111, the time-constraint-oriented multi-objective stochastic game template is a ten-tuple
⟨Π, L, l_0, A, C, Φ(C), inv, grd, Δ, h⟩
wherein:
Π represents the finite set of participants in the stochastic game, namely the multiple agents and the random environment;
L represents the finite set of states of the multiple agents and the random environment;
l_0 ∈ L represents the initial state of the multiple agents and the random environment; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ L and i ∈ Π;
A represents the finite set of actions of the multiple agents;
C represents the finite set of all clocks;
Φ(C) represents the set of clock constraints; a clock constraint g is defined inductively by the grammar
g ::= x ≤ c | c ≤ x | ¬g_1 | g_1 ∧ g_2
in which x is a clock in C, c is a constant, and g, g_1, g_2 ∈ Φ(C); for example, if a state requires a delay of c = 3 s, the clock constraint corresponding to state l is 3 ≤ x, and if a state is subject to a deadline of 5 s, the constraint corresponding to state l is x ≤ 5; likewise, g can be a combination of different time constraints, such as 3 ≤ x ≤ 5, and g also admits logical negation (a minimal sketch of one way to represent and evaluate such constraints is given after this list);
inv: L → Φ(C) represents the invariant condition of the multiple agents with respect to the clock constraints in state l ∈ L;
grd: L × A → Φ(C) represents the clock constraint when the multiple agents take action a ∈ A in state l ∈ L;
Δ: L × A → Dist(2^C × L) represents the state transition function by which the multiple agents move from state l ∈ L, through action a ∈ A, to state l′ ∈ L, where Dist(2^C × L) denotes the probability distributions over 2^C × L;
h: L ∪ A → R represents the reward function corresponding to the states and actions of the multiple agents, where R denotes the real numbers;
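As an illustrative aid only (not part of the claimed method), the following Python sketch shows one possible way to represent the clock-constraint grammar above and to evaluate a constraint against a clock valuation; all class and function names are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Union

Clock = str   # a clock name from the finite set C

@dataclass
class Atomic:                 # x <= c (upper=True) or c <= x (upper=False)
    clock: Clock
    bound: float
    upper: bool

@dataclass
class Not:                    # logical negation of a constraint
    inner: "Constraint"

@dataclass
class And:                    # conjunction g1 AND g2
    left: "Constraint"
    right: "Constraint"

Constraint = Union[Atomic, Not, And]

def holds(g: Constraint, valuation: Dict[Clock, float]) -> bool:
    """Check whether a clock valuation satisfies constraint g."""
    if isinstance(g, Atomic):
        x = valuation[g.clock]
        return x <= g.bound if g.upper else g.bound <= x
    if isinstance(g, Not):
        return not holds(g.inner, valuation)
    if isinstance(g, And):
        return holds(g.left, valuation) and holds(g.right, valuation)
    raise TypeError("unknown constraint")

# Example from the description: 3 <= x <= 5
g = And(Atomic("x", 3.0, upper=False), Atomic("x", 5.0, upper=True))
print(holds(g, {"x": 4.0}))   # True
print(holds(g, {"x": 6.0}))   # False
```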
S112, a time-constraint-oriented multi-objective stochastic game model for cooperative collection and transportation by the multiple agents is established, adopting λ ∈ (L A)* L → Dist(A) as the strategy by which the multiple agents select from the action set A along a path (l_0 a_0 l_1 a_1 … l_i); with λ as the task scheduling strategy, the reward expectation formula is as follows:
E[rew(h)] = ∫ ( Σ (h(l_i) + h(a_i)) ) dPr(λ)
in the formula: h(l_i) denotes the reward function corresponding to the multiple agents in state l_i; h(a_i) denotes the reward function corresponding to the multiple agents at action a_i; l_i ∈ L ∧ a_i ∈ A; E[rew(h)] denotes the expected reward function of the multiple agents; λ denotes the task scheduling strategy; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ.
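For intuition only, the following sketch estimates the expected path reward E[rew(h)] by Monte Carlo averaging over runs simulated under a fixed strategy λ, in the spirit of the statistical-model-checking step described below in S12; the simulator interface and all names are assumptions, not part of the patent.

```python
import random
from typing import Callable, List, Tuple

# A simulated run is a list of (state, action) pairs (l0, a0), (l1, a1), ...
Run = List[Tuple[str, str]]

def estimate_expected_reward(simulate: Callable[[], Run],
                             state_reward: Callable[[str], float],
                             action_reward: Callable[[str], float],
                             episodes: int = 2000) -> float:
    """Monte Carlo estimate of E[rew(h)] = E[ sum_i h(l_i) + h(a_i) ]."""
    total = 0.0
    for _ in range(episodes):
        run = simulate()   # one trajectory sampled under the fixed strategy lambda
        total += sum(state_reward(l) + action_reward(a) for l, a in run)
    return total / episodes

# Toy usage with a hypothetical two-state simulator
def toy_simulate() -> Run:
    run, state = [], "idle"
    for _ in range(5):
        action = random.choice(["collect", "transport"])
        run.append((state, action))
        state = "busy" if action == "collect" else "idle"
    return run

print(estimate_expected_reward(toy_simulate,
                               lambda l: 0.0 if l == "idle" else -1.0,
                               lambda a: -2.0 if a == "transport" else -0.5))
```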
S12, simulating running trajectories of the stochastic game model by statistical model checking, and designing a model-free value-function learning method to compute the maximum reward expectation for the multiple agents collecting or processing specimens at different task points and then transporting the specimens to a target point, wherein the rewards include task execution time, total energy consumption and task completion degree, specifically as follows:
S121, acquiring initial data on all states and actions of the multiple agents in the random environment;
S122, establishing, based on the collected data, a time-constraint-oriented stochastic game model in which the multiple agents cooperatively collect and transport specimens, simulating running trajectories of the stochastic game model with UPPAAL-SMC (a statistical model checking tool), exploring all states and actions of the multiple agents in the random environment, and training the target strategy;
S123, establishing a state-action value function table Q(s, a) for the multiple agents by offline learning on the simulated running trajectories; Q(s, a) is defined as the value of taking action a ∈ QA in state s ∈ QS, wherein: QS = ⟨S, SS⟩ denotes the state tuple, QA = ⟨A⟩ denotes the action tuple, S denotes the set of state classifications, and SS denotes the game participant to which the current state belongs.
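A minimal illustration of the state-action value table of S123, with the QS/QA layout assumed from the description above (a dictionary keyed by state and action, plus the cumulative-average style update used later):

```python
from collections import defaultdict
from typing import Dict, Tuple

# QS = <S, SS>: (state classification, owning game participant); QA = <A>: action name
State = Tuple[str, str]
Action = str

# Q-table: maps (state, action) to a learned value, defaulting to 0.0
q_table: Dict[Tuple[State, Action], float] = defaultdict(float)

def q(s: State, a: Action) -> float:
    return q_table[(s, a)]

def update(s: State, a: Action, v: float, alpha: float = 0.1) -> None:
    """Cumulative-average style update: Q(s,a) = Q(s,a) + alpha * (v - Q(s,a))."""
    q_table[(s, a)] += alpha * (v - q_table[(s, a)])

# Example: a 'waiting' state owned by the scheduler, trying the hypothetical action 'Schedule'
s0: State = ("waiting", "scheduler")
update(s0, "Schedule", v=5.0)
print(q(s0, "Schedule"))
```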
S13, iterating the algorithm according to the convergence conditions of the stochastic games among the multiple agents, specifically as follows:
S131, for the two-player zero-sum stochastic game, first initialize the state-action value function table Q(s, a). When the action corresponding to each state s is selected, the multi-agent side or the random environment selects the action according to the ε-greedy method: given the action set A corresponding to s, the action maximizing the value function table is chosen with probability 1 - ε, and an action is chosen at random with probability ε. After performing the selected action a, state s yields a new state s′ and a corresponding reward H. Suppose the two game participants are e and s, with state sets L_e and L_s respectively, and that the model aims to maximize the payoff of party e. If the next state belongs to e, the reward needs to be maximized, as in formula (1); if the next state belongs to s, the reward needs to be minimized, as in formula (2):
v = H_s + γ·max_{a′} Q(s′, a′)   (1)
v = H_s + γ·min_{a′} Q(s′, a′)   (2)
in the formula: H_s denotes the reward currently obtained; γ·max_{a′} Q(s′, a′) denotes the maximized next-step return; γ·min_{a′} Q(s′, a′) denotes the minimized next-step return; γ denotes the attenuation (discount) factor.
Finally, the value function is updated by the cumulative-average update method, with the formula:
Q(s, a) = Q(s, a) + α[v - Q(s, a)]
in the formula: α denotes the coefficient of the approximate cumulative-average update, which can be regarded as a step size, α ∈ (0, 1];
v denotes the estimated return, i.e. the discounted sum of future payoffs;
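The following Python sketch illustrates the ε-greedy selection and the max/min backup of S131 for the two-player zero-sum game; the environment interface and all names are assumptions made for illustration only.

```python
import random
from collections import defaultdict

q = defaultdict(float)              # Q[(state, action)] -> value
gamma, alpha, eps = 0.9, 0.1, 0.1

def epsilon_greedy(state, actions, maximize=True):
    """With probability 1 - eps take the best table action, otherwise a random one."""
    if random.random() < eps:
        return random.choice(actions)
    pick = max if maximize else min
    return pick(actions, key=lambda a: q[(state, a)])

def backup(next_state, next_actions, next_owner_is_e):
    """gamma * max Q(s', .) if the next state belongs to party e, else gamma * min Q(s', .)."""
    values = [q[(next_state, a)] for a in next_actions]
    return gamma * (max(values) if next_owner_is_e else min(values))

def update(state, action, reward, next_state, next_actions, next_owner_is_e):
    v = reward + backup(next_state, next_actions, next_owner_is_e)      # formulas (1)/(2)
    q[(state, action)] += alpha * (v - q[(state, action)])              # cumulative-average update

# Toy usage with hypothetical state and action names
a = epsilon_greedy("waiting", ["Schedule", "Trigger"])
update("waiting", a, 1.0, "executing", ["Finish", "Fail"], next_owner_is_e=True)
print(a, q[("waiting", a)])
```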
S132, for the multi-player general-sum stochastic game, first initialize the state-action value function table Q(s, a_1, …, a_n); that is, for a given state, different agents take different actions, and each agent generates its optimal task scheduling strategy by observing the actions of the other agents and the corresponding reward values. When the action corresponding to each state s is selected, the different agents select actions according to the ε-greedy method. After performing the selected joint action a, state s yields a new state s′ and a corresponding reward H. Finally, the value function is updated with the Nash equilibrium function, with the formula:
Q_i(s, a_1, …, a_n) = Q_i(s, a_1, …, a_n) + α[ H_s + γ·Nash_i Q(s′) - Q_i(s, a_1, …, a_n) ]
in the formula: α denotes the coefficient of the approximate cumulative-average update, α ∈ (0, 1]; n denotes the number of agents; γ denotes the attenuation (discount) factor; H_s denotes the reward currently obtained by the multiple agents; s′ denotes the new state obtained after state s performs the selected joint action a = [a_1, …, a_n]; Nash_i Q(s′) denotes the long-term average return to agent i computed when the multiple agents adopt the joint task scheduling strategy (π_1, …, π_n) from state s′;
wherein, denoting by v_i(s′, π_1, …, π_n) that long-term average return, the joint strategy satisfies the following formula, X being the task scheduling strategy set of agent i:
v_i(s′, π_1, …, π_n) ≥ v_i(s′, π_1, …, π_{i-1}, x, π_{i+1}, …, π_n) for every x ∈ X
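Purely as an illustrative sketch of the S132 update (not the patented algorithm itself), the snippet below applies a Nash-Q style backup for agent i, leaving the stage-game equilibrium value `nash_value` as a pluggable function, since computing a Nash equilibrium of the stage game is a separate subproblem; all names are assumptions.

```python
from collections import defaultdict
from typing import Callable, Tuple

JointAction = Tuple[str, ...]        # (a_1, ..., a_n)

# One Q-table per agent i: q_tables[i][(state, joint_action)] -> value
q_tables = defaultdict(lambda: defaultdict(float))
alpha, gamma = 0.1, 0.9

def nash_q_update(i: int, state: str, joint_action: JointAction,
                  reward_i: float, next_state: str,
                  nash_value: Callable[[int, str], float]) -> None:
    """Q_i(s,a) += alpha * (H_s + gamma * Nash_i Q(s') - Q_i(s,a))."""
    q_i = q_tables[i]
    target = reward_i + gamma * nash_value(i, next_state)
    q_i[(state, joint_action)] += alpha * (target - q_i[(state, joint_action)])

# Toy usage: a placeholder equilibrium value of 0.0 for every state
nash_q_update(0, "s0", ("Move", "Trigger"), reward_i=2.0, next_state="s1",
              nash_value=lambda i, s: 0.0)
print(q_tables[0][("s0", ("Move", "Trigger"))])
```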
S14, assigning weights to the rewards, computing the weighted sum, fitting the weight-combination-based multi-objective Pareto curve according to the convex-optimization hyperplane separation theorem, and generating the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens, specifically as follows:
S141, taking the weighted sum of the multi-objective rewards as the optimization target and computing the weighted sum of the multi-objective optimization, with the formula:
E[rew(w·h)] = ∫ rew(w·h)(λ) dPr(λ)
in the formula: w denotes a weight vector, h denotes a reward vector, λ denotes a task scheduling strategy, and E[rew(w·h)] denotes the expected reward function with the weight combination applied; rew(w·h)(λ) denotes the weighted sum of the objective rewards under task scheduling strategy λ; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ;
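A small illustrative helper (assumed names, not taken from the patent) showing the scalarization of S141: each objective's reward is multiplied by its weight and summed, then averaged over simulated runs.

```python
from typing import Dict, List

def weighted_reward(weights: Dict[str, float], rewards: Dict[str, float]) -> float:
    """Scalarize a multi-objective reward vector h with weight vector w: w . h."""
    return sum(weights[k] * rewards[k] for k in weights)

def expected_weighted_reward(weights: Dict[str, float],
                             sampled_rewards: List[Dict[str, float]]) -> float:
    """Monte Carlo estimate of E[rew(w . h)] over runs sampled under a strategy lambda."""
    return sum(weighted_reward(weights, h) for h in sampled_rewards) / len(sampled_rewards)

# Example: time, energy and completion objectives, with priorities expressed as weights
w = {"time": 0.5, "energy": 0.2, "completion": 0.3}
runs = [{"time": -10.0, "energy": -4.0, "completion": 3.0},
        {"time": -12.0, "energy": -3.5, "completion": 3.0}]
print(expected_weighted_reward(w, runs))
```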
S142, if the goal is to compute the maximum reward expectation, the feasible region of the reward expectations H is computed as
down(H) = { z | ∃ p ∈ P, z ≤ p }
wherein: p denotes a reward expectation, P denotes the set of reward expectations, p ∈ P, down(H) denotes the feasible region, and z ∈ down(H) means that there exists p ∈ P such that every value z in the feasible region is no greater than p; if the goal is to compute the minimum reward expectation (e.g. a minimized-energy-consumption scenario), the feasible region of the reward expectations H is computed analogously with the inequality reversed (z ≥ p);
S143, if the goal is to compute the maximum reward expectation, the infeasible region of the reward expectations H is computed as
UReach(H) = { y | w·y > w·q, w ∈ W, q ∈ Q }
wherein: w denotes a weight vector, W denotes the set of weight vectors, w ∈ W; q denotes a reward expectation vector, Q denotes the set of reward expectation vectors, q ∈ Q; w·y and w·q both denote vector inner products; UReach(H) denotes the infeasible (unreachable) region, and y ∈ UReach(H) means that for any weight combination w and its corresponding reward expectation q, w·y > w·q; if the goal is to compute the minimum reward expectation, the infeasible region of the reward expectations H is computed analogously with the inequality reversed (w·y < w·q);
S144, computing the reward expectation expected for the weight vector w with the largest variance, and computing the distance dist(down(H), UReach(H)) between the feasible region and the infeasible region;
S145, if the distance is greater than the set threshold θ, computing w such that the maximum separating hyperplane constructed from w maximally separates the regions occupied by the two sets, i.e. enlarges the coverage of the reachable set corresponding to the weights; according to the convergence of the distance function, the newly generated reward expectation p is added to the reward expectation set P and the procedure iterates until dist(down(H), UReach(H)) is less than θ.
S2, the dispatching center collects real-time data on the states and actions of the multiple agents and the random environment;
S3, the dispatching center processes the collected data and sends action instructions to the multiple agents.
Fitting a Pareto curve based on multi-objective reward weights for different weight-vector combinations requires traversing the weight-vector combinations and computing the corresponding reward expectations many times. If there are k reward objectives, exhaustively searching the weights and the corresponding optimal reward expectations may require computing the reward expectation of each weight-vector combination at a time complexity of O(n^k). Since the reachable point sets generated by different weight combinations can intersect, the invention selects the weight combination w by letting the expected-reward unreachable point set and reachable point set approach each other, enlarging the coverage of the reachable point set corresponding to the weights as far as possible, thereby reducing the number of reward-expectation computations and improving the efficiency of the whole algorithm.
As shown in FIG. 6, a Pareto curve minimizing two objectives is fitted based on weight combinations. The reward expectations p1 and p2 in the extreme cases w = (0, 1) and w = (1, 0) are computed first; they represent the cases where only objective 2 and only objective 1 are considered, respectively. Corresponding straight lines are then generated from w·(x, y) > w·p, where p denotes a reward expectation vector; the region bounded by the intersection of the two lines and the coordinate axes is the unreachable set, the region bounded by the two reachable points and the upper limit of the reward expectations is the reachable set, and different weights correspond to different reachable point sets and maximum separating hyperplanes. The slope of the maximum separating hyperplane at the unreachable point farthest from the origin can be taken as a weight, reachable points are generated from the reward expectations, and a reachable point set is formed. As shown in FIG. 6, the point o of the unreachable set farthest from the origin is selected; by the hyperplane separation theorem, a hyperplane can separate point o from the reachable set down(H). From this the maximum separating hyperplane is computed and the corresponding weight value w is obtained. The maximum reward expectation is then computed from w and added to the set P, generating the corresponding reachable and unreachable sets. It can be seen from FIG. 6 that the extreme points of the reachable set and the unreachable set continually approach each other; when the distance between the two is smaller than θ, the corresponding reachable point set is output, and the boundary points of the reachable point set form the Pareto curve.
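To make the weight-selection idea concrete, here is a deliberately simplified two-objective sketch (maximization case; the `best_response` interface and all names are assumptions): the weight for the next reward-expectation computation is taken as the normal of the chord between two adjacent reachable points, and refinement stops once the feasible/infeasible gap falls below θ.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Weight = Tuple[float, float]

def next_weight(p1: Point, p2: Point) -> Weight:
    """Weight orthogonal to the chord p1-p2 (L1-normalized), i.e. the normal of the
    candidate separating hyperplane between two known reachable points."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    w = (abs(dy), abs(dx))
    norm = w[0] + w[1] or 1.0
    return (w[0] / norm, w[1] / norm)

def fit_pareto(best_response: Callable[[Weight], Point],
               theta: float = 1e-3, max_iter: int = 50) -> List[Point]:
    """Sketch of the weight-driven Pareto fitting loop for two maximized objectives.
    best_response(w) returns the reward-expectation point maximizing w . q."""
    pts = sorted({best_response((1.0, 0.0)), best_response((0.0, 1.0))})
    for _ in range(max_iter):
        improved, new_pts = False, list(pts)
        for a, b in zip(pts, pts[1:]):
            w = next_weight(a, b)
            q = best_response(w)          # new reward expectation for this weight
            # gap between the new point and the chord a-b along w: feasible/infeasible distance
            gap = w[0] * q[0] + w[1] * q[1] - (w[0] * a[0] + w[1] * a[1])
            if gap > theta and q not in new_pts:
                new_pts.append(q)
                improved = True
        pts = sorted(set(new_pts))
        if not improved:
            break
    return pts   # the boundary points approximate the Pareto curve

# Toy usage: best responses lying on the quarter circle x^2 + y^2 = 1
print(fit_pareto(lambda w: (w[0] / math.hypot(*w), w[1] / math.hypot(*w))))
```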
Example 1
As shown in FIG. 7, multiple robots (robot M, robot N) complete multiple jobs (job T, job T+1, job T+2) in a random environment, and scheduling these robots to complete the jobs generates corresponding energy consumption and time delay. Since a robot may fail partway through a job, different robots may complete different jobs at different times. Therefore, based on the time-constraint-oriented multi-objective stochastic game template, the system in which multiple robots complete multiple jobs in a random environment is first modelled with the time-constraint-oriented zero-sum stochastic game method, the two game parties being the multiple robots and the random environment. The system mainly comprises two parts, namely a job model and a multi-machine scheduler model. The job model has three states, namely idle, waiting and executing. Each job is triggered by the random environment to enter the waiting state from the idle state; if the scheduler determines the robot to execute the job and assigns the job to the corresponding robot, the job enters the executing state; if the robot fails partway and cannot complete the job, the job returns from the executing state to the waiting state to wait for the next available robot; after the job is completed, the job returns from the executing state to the idle state. Each robot has three states, namely idle, running and faulty. In the idle state and the running state the robot fails with probability 1-p and 1-q respectively. In the idle state, if a job waiting to be executed is assigned by the scheduler, the robot enters the running state. Each robot can only be assigned jobs within its execution range and returns to the idle state after the run finishes. If a robot in the faulty state recovers and works normally again, it returns to the idle state.
Secondly, the running trajectories of the model are simulated with UPPAAL-SMC, all states and actions in the random environment are explored, the target strategy is trained from the acquired data, and a state-action value function table Q(s, a) is established by offline learning on the simulated running trajectories. Q(s, a) is defined as the value of taking action a ∈ QA in state s ∈ QS, wherein: QS = ⟨TS, MS, SS⟩ denotes the state tuple; TS denotes the three different job states, namely idle, waiting and executing; MS denotes the three different robot states, namely idle, running and faulty; SS denotes the control attribution of the current state, namely the environment (s ∈ L_e) or the scheduler (s ∈ L_s); QA = ⟨Schedule, Trigger, Fail, Ready, Finish, Recover⟩ denotes the action tuple, where Schedule denotes scheduling a pending job to a robot for execution, Trigger denotes a job being triggered, Fail denotes a robot failure, Ready denotes that a job is ready for execution, Finish denotes that a job is completed, and Recover denotes that a robot recovers from a fault. The state-action value function table is initialized; when the action corresponding to each state s is selected, the action is selected according to the ε-greedy method; after performing the selected action a, state s yields a new state s′ and a corresponding reward H; and the value function Q(s, a) is updated by the cumulative-average update method. Finally, taking the weighted sum of the number of completed jobs, the consumed energy and the time to complete the jobs as the optimization target, the multi-objective Pareto curve is fitted according to the convex-optimization hyperplane separation theorem, thereby generating the scheduling strategy for multiple robots to complete multiple jobs in a random environment.
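A compact illustration, with hypothetical names mirroring the tuples just described, of how the QS and QA tuples of this embodiment might be encoded:

```python
from enum import Enum, auto
from typing import NamedTuple

class TS(Enum):              # job states
    IDLE = auto()
    WAITING = auto()
    EXECUTING = auto()

class MS(Enum):              # robot states
    IDLE = auto()
    RUNNING = auto()
    FAULTY = auto()

class SS(Enum):              # control attribution of the current state
    ENVIRONMENT = auto()
    SCHEDULER = auto()

class QA(Enum):              # actions of the zero-sum game
    SCHEDULE = auto()
    TRIGGER = auto()
    FAIL = auto()
    READY = auto()
    FINISH = auto()
    RECOVER = auto()

class QS(NamedTuple):        # state tuple <TS, MS, SS>
    job: TS
    robot: MS
    owner: SS

# Example state: a waiting job, an idle robot, and the decision belongs to the scheduler
s = QS(TS.WAITING, MS.IDLE, SS.SCHEDULER)
print(s, QA.SCHEDULE)
```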
Example 2
This embodiment uses the same method as Embodiment 1 and is directed to the task of multiple robots cooperatively completing specimen collection and transportation. As shown in FIG. 8, multiple robots (robot M, robot M+1, robot M+2, …) need to collect or process specimens at different task points (task points 1 to 6) and then transport the specimens to target points (target point 1 and target point 2). While one robot is performing a task at a task point, that task point is not open to other robots, and there is an ordering among task points: for example, task point 4 is open only to a robot that has completed the task at task point 1, task point 5 is open only to a robot that has completed task point 1, task point 2 or task point 3, and task point 6 is open only to a robot that has completed task point 3. The uncertainty of the whole system includes the uncertainty of the task times of different robots at different task points and the uncertainty of the travel times of the robots between task points. While executing tasks and moving, the robots need to avoid static and dynamic obstacles, ensure that the total power consumption is minimal given that each robot consumes a different amount of power, and finally reach the target location. A robot executing a task at a task point has three states: when the robot reaches the task point it first triggers waiting; if another robot is already executing the task at that task point, it waits for that task to be completed, and once the other robot finishes, it starts executing its own task. If an error is reported partway through the task, the robot returns to the waiting state and continues to wait to execute. After the robot completes the task at the task point, it searches for the next task point at which to complete a task.
In order to establish a multi-objective optimized task scheduling strategy, i.e. a strategy that completes the collection, processing and transportation of all specimens in a short time and with little energy consumption, a time-constraint-oriented general-sum stochastic game model is first established based on the time-constraint-oriented multi-objective stochastic game template, the game participants being the multiple robots. Secondly, the running trajectories of the model are simulated with UPPAAL-SMC, all states and actions of the multiple robots in the random environment are explored, the simulation data are collected to train the multi-objective optimization strategy, and a state-action value function table Q(s, a_1, …, a_n) is established by offline learning on the simulated running trajectories. Q(s, a_1, …, a_n) is defined as the value of the different robots i ∈ 1, …, n taking actions a_i ∈ QA in state s ∈ QS, wherein: QS = ⟨TS, MS⟩ denotes the state tuple, TS denotes the three different task states, namely idle, waiting and executing, and MS denotes the task point or target point at which the current robot i is located; QA = ⟨Move, Trigger, Ready, Fail, Finish⟩ denotes the action tuple, where Move denotes the robot moving between task points or between a task point and a target point, for which an existing 2D routing algorithm can be used to find the shortest path between task points; Trigger denotes that the current robot is ready to perform the task at the task point, waiting until the task point is idle if it is occupied by another robot; Ready denotes that the robot enters the task-execution state from waiting; Fail denotes that the robot's current task fails due to internal or external factors and needs to be executed again; and Finish denotes that the robot completes its current task and can move to the next task point to execute a task. All robots select the action corresponding to s according to the ε-greedy method; after performing the selected joint action a, state s yields a new state s′ and a corresponding reward H; and the value function Q(s, a_1, …, a_n) is updated with the Nash equilibrium function. Finally, taking the weighted sum of task execution time, total energy consumption and task completion degree as the optimization target, the multi-objective Pareto curve is fitted according to the convex-optimization hyperplane separation theorem, thereby generating the multi-objective optimization strategy for multiple robots to cooperatively collect and transport specimens in a random environment.
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A time-constraint-oriented multi-agent scheduling method, characterized in that the method comprises the following steps:
S1, establishing a scheduling center, which specifically comprises the following steps:
S11, based on a time-constraint-oriented multi-objective stochastic game template, establishing a time-constraint-oriented stochastic game model in which multiple agents cooperatively collect and transport specimens;
S12, simulating running trajectories of the stochastic game model by statistical model checking, and designing a model-free value-function learning method to compute the maximum reward expectation for the multiple agents collecting or processing specimens at different task points and then transporting the specimens to a target point, wherein the rewards include task execution time, total energy consumption and task completion degree;
S13, iterating the algorithm according to the convergence conditions of the stochastic games among the multiple agents;
S14, assigning weights to the rewards, computing the weighted sum, fitting the weight-combination-based multi-objective Pareto curve according to the convex-optimization hyperplane separation theorem, and generating the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens;
S2, the dispatching center collects real-time data on the states and actions of the multiple agents and the random environment;
S3, the dispatching center processes the collected data and sends action instructions to the multiple agents.
2. The time-constraint-oriented multi-agent scheduling method as claimed in claim 1, wherein step S11 specifically comprises:
S111, the time-constraint-oriented multi-objective stochastic game template is a ten-tuple
⟨Π, L, l_0, A, C, Φ(C), inv, grd, Δ, h⟩
wherein:
Π represents the finite set of participants in the stochastic game, namely the multiple agents and the random environment;
L represents the finite set of states of the multiple agents and the random environment;
l_0 ∈ L represents the initial state of the multiple agents and the random environment; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ L and i ∈ Π;
A represents the finite set of actions of the multiple agents;
C represents the finite set of all clocks;
Φ(C) represents the set of clock constraints;
inv: L → Φ(C) represents the invariant condition of the multiple agents with respect to the clock constraints in state l ∈ L;
grd: L × A → Φ(C) represents the clock constraint when the multiple agents take action a ∈ A in state l ∈ L;
Δ: L × A → Dist(2^C × L) represents the state transition function by which the multiple agents move from state l ∈ L, through action a ∈ A, to state l′ ∈ L, where Dist(2^C × L) denotes the probability distributions over 2^C × L;
h: L ∪ A → R represents the reward function corresponding to the states and actions of the multiple agents, where R denotes the real numbers;
S112, a time-constraint-oriented multi-objective stochastic game model for cooperative collection and transportation by the multiple agents is established, adopting λ ∈ (L A)* L → Dist(A) as the strategy by which the multiple agents select from the action set A along a path (l_0 a_0 l_1 a_1 … l_i); with λ as the task scheduling strategy, the reward expectation formula is as follows:
E[rew(h)] = ∫ ( Σ (h(l_i) + h(a_i)) ) dPr(λ)
in the formula: h(l_i) denotes the reward function corresponding to the multiple agents in state l_i; h(a_i) denotes the reward function corresponding to the multiple agents at action a_i; l_i ∈ L ∧ a_i ∈ A; E[rew(h)] denotes the expected reward function of the multiple agents; λ denotes the task scheduling strategy; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ.
3. The time-constraint-oriented multi-agent scheduling method as claimed in claim 2, characterized in that: the clock constraint g in the set of clock constraints Φ(C) in step S111 is defined inductively by the following grammar:
g ::= x ≤ c | c ≤ x | ¬g_1 | g_1 ∧ g_2
in the formula: x is a clock in C, c is a constant, g ∈ Φ(C), g_1 ∈ Φ(C), g_2 ∈ Φ(C).
4. The time-constraint-oriented multi-agent scheduling method as claimed in claim 1, wherein step S12 specifically comprises:
S121, acquiring initial data on all states and actions of the multiple agents in the random environment;
S122, establishing, based on the collected data, a time-constraint-oriented stochastic game model in which the multiple agents cooperatively collect and transport specimens, simulating running trajectories of the stochastic game model by statistical model checking, exploring all states and actions of the multiple agents in the random environment, and training the target strategy;
S123, establishing a state-action value function table Q(s, a) for the multiple agents by offline learning on the simulated running trajectories, the value function table Q(s, a) being defined as the value of taking action a ∈ QA in state s ∈ QS, wherein: QS = ⟨S, SS⟩ denotes the state tuple, QA = ⟨A⟩ denotes the action tuple, S denotes the set of state classifications, and SS denotes the game participant to which the current state belongs.
5. The time-constraint-oriented multi-agent scheduling method as claimed in claim 4, wherein step S13 specifically comprises:
S131, for the two-player zero-sum stochastic game, initializing the state-action value function table Q(s, a), selecting the action corresponding to each state s in the multi-agent side or the random environment according to the ε-greedy method, and updating the value function by the cumulative-average update method, with the formula:
Q(s, a) = Q(s, a) + α[v - Q(s, a)]
in the formula: α denotes the coefficient of the approximate cumulative-average update, which can be regarded as a step size, α ∈ (0, 1];
v denotes the estimated return, i.e. the discounted sum of future payoffs;
S132, for the multi-player general-sum stochastic game, first initializing the state-action value function table Q(s, a_1, …, a_n); when the action corresponding to each state s is selected, the multiple agents select actions according to the ε-greedy method, and finally the value function is updated with the Nash equilibrium function, with the formula:
Q_i(s, a_1, …, a_n) = Q_i(s, a_1, …, a_n) + α[ H_s + γ·Nash_i Q(s′) - Q_i(s, a_1, …, a_n) ]
in the formula: α denotes the coefficient of the approximate cumulative-average update, α ∈ (0, 1]; n denotes the number of agents; γ denotes the attenuation (discount) factor; H_s denotes the reward currently obtained by the multiple agents; s′ denotes the new state obtained after state s performs the selected joint action a = [a_1, …, a_n]; Nash_i Q(s′) denotes the long-term average return computed when the multiple agents adopt the joint task scheduling strategy (π_1, …, π_n) from state s′.
6. The time-constraint-oriented multi-agent scheduling method as claimed in claim 5, characterized in that: the Nash equilibrium function Nash_i Q(s′) of a certain agent i in the general-sum stochastic game in step S132 satisfies the following formula, where v_i(s′, π_1, …, π_n) denotes the long-term average return to agent i from state s′ under the joint strategy (π_1, …, π_n):
Nash_i Q(s′) = v_i(s′, π_1, …, π_n), with v_i(s′, π_1, …, π_n) ≥ v_i(s′, π_1, …, π_{i-1}, x, π_{i+1}, …, π_n) for every x ∈ X
in the formula: X is the task scheduling strategy set of the agent i (x ∈ X denotes a strategy in that set); n denotes the number of agents.
7. The time-constraint-oriented multi-agent scheduling method as claimed in claim 1, wherein step S14 specifically comprises:
S141, taking the weighted sum of the multi-objective rewards as the optimization target and computing the weighted sum of the multi-objective optimization, with the formula:
E[rew(w·h)] = ∫ rew(w·h)(λ) dPr(λ)
in the formula: w denotes a weight vector, h denotes a reward vector, λ denotes a task scheduling strategy, and E[rew(w·h)] denotes the expected reward function with the weight combination applied; rew(w·h)(λ) denotes the weighted sum of the objective rewards under task scheduling strategy λ; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ;
S142, fitting the multi-objective Pareto curves for different weight combinations according to the convex-optimization hyperplane separation theorem to generate the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens, the task scheduling strategy comprising the actions that the different agents need to select in their different states.
CN202110810946.4A 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint Active CN113269297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110810946.4A CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110810946.4A CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Publications (2)

Publication Number Publication Date
CN113269297A CN113269297A (en) 2021-08-17
CN113269297B true CN113269297B (en) 2021-11-05

Family

ID=77236924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110810946.4A Active CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Country Status (1)

Country Link
CN (1) CN113269297B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563527B (en) * 2022-09-27 2023-06-16 西南交通大学 Multi-Agent deep reinforcement learning system and method based on state classification and assignment
CN115576278B (en) * 2022-09-30 2023-08-04 常州大学 Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728406A (en) * 2019-10-15 2020-01-24 南京邮电大学 Multi-agent power generation optimization scheduling method based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045655A (en) * 2016-12-07 2017-08-15 三峡大学 Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN110471297B (en) * 2019-07-30 2020-08-11 清华大学 Multi-agent cooperative control method, system and equipment
CN112132263B (en) * 2020-09-11 2022-09-16 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728406A (en) * 2019-10-15 2020-01-24 南京邮电大学 Multi-agent power generation optimization scheduling method based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning

Also Published As

Publication number Publication date
CN113269297A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN113269297B (en) Multi-agent scheduling method facing time constraint
Zhao et al. A heuristic distributed task allocation method for multivehicle multitask problems and its application to search and rescue scenario
JP6938791B2 (en) Methods for operating robots in multi-agent systems, robots and multi-agent systems
Whitbrook et al. Reliable, distributed scheduling and rescheduling for time-critical, multiagent systems
Goldberg et al. Coordinating mobile robot group behavior using a model of interaction dynamics
CN107562066B (en) Multi-target heuristic sequencing task planning method for spacecraft
Rybkin et al. Model-based reinforcement learning via latent-space collocation
Arai et al. Experience-based reinforcement learning to acquire effective behavior in a multi-agent domain
Kannan et al. The autonomous recharging problem: Formulation and a market-based solution
Schillinger et al. Auctioning over probabilistic options for temporal logic-based multi-robot cooperation under uncertainty
Yu et al. Asynchronous multi-agent reinforcement learning for efficient real-time multi-robot cooperative exploration
CN117103282B (en) Double-arm robot cooperative motion control method based on MATD3 algorithm
Matikainen et al. Multi-armed recommendation bandits for selecting state machine policies for robotic systems
Schillinger et al. Hierarchical ltl-task mdps for multi-agent coordination through auctioning and learning
CN113341706A (en) Man-machine cooperation assembly line system based on deep reinforcement learning
Chen et al. A bi-criteria nonlinear fluctuation smoothing rule incorporating the SOM–FBPN remaining cycle time estimator for scheduling a wafer fab—a simulation study
Zaidi et al. Task allocation based on shared resource constraint for multi-robot systems in manufacturing industry
Bøgh et al. Distributed fleet management in noisy environments via model-predictive control
Xiao et al. Local advantage actor-critic for robust multi-agent deep reinforcement learning
Li et al. Decentralized Refinement Planning and Acting
Shriyam et al. Task assignment and scheduling for mobile robot teams
Li et al. Multi-mode filter target tracking method for mobile robot using multi-agent reinforcement learning
Zhang et al. Multi-task Actor-Critic with Knowledge Transfer via a Shared Critic
Dracopoulos Robot path planning for maze navigation
Yoon et al. Learning to improve performance during non-repetitive tasks performed by robots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant