CN113269297B - Multi-agent scheduling method facing time constraint - Google Patents


Info

Publication number
CN113269297B
CN113269297B (application CN202110810946.4A)
Authority
CN
China
Prior art keywords: agent, state, random, action, reward
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN202110810946.4A
Other languages: Chinese (zh)
Other versions: CN113269297A (en)
Inventor: 朱晨阳
Current Assignee: Donghe Software Jiangsu Co ltd
Original Assignee: Donghe Software Jiangsu Co ltd
Priority date: 2021-07-19
Filing date: 2021-07-19
Publication date: 2021-11-05
Application filed by Donghe Software Jiangsu Co ltd
Priority to CN202110810946.4A
Publication of CN113269297A
Application granted
Publication of CN113269297B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a time-constraint-oriented multi-agent scheduling method comprising the following steps: establishing a dispatching center; the dispatching center collecting real-time data on the states and actions of the multiple agents and the random environment; and the dispatching center processing the collected data and sending action instructions to the multiple agents. By introducing time constraints into the stochastic game model, the real-time, non-deterministic and probabilistic behaviours exhibited among the multiple agents, or in their interaction with the random environment, can be described; time-related reward functions can be quantified, and a multi-objective optimization strategy is determined through these reward functions. The designed algorithm improves the efficiency of computing the model's maximum reward expectation and of fitting the weight-combination-based Pareto curve, thereby improving the reaction speed of the multiple agents. By assigning different weights to the multiple objectives, the priorities of the objectives are distinguished, thereby improving the operational reliability of the multi-agent system.

Description

Multi-agent scheduling method facing time constraint
Technical Field
The invention relates to the technical field of multi-agent interaction, and in particular to a time-constraint-oriented multi-agent scheduling method.
Background
As interactions among multiple agents (robots, robot dogs, drones, etc.) become increasingly close, the errors occurring during interaction continue to grow with the size and complexity of multi-agent systems. How to design a multi-agent scheduling system that meets multi-objective design requirements under an uncertain environment and the corresponding time constraints has become a key scientific problem in urgent need of a solution.
At present, research on multi-agent scheduling systems mainly verifies quantitative properties of the model and properties related to the reward function through model checking, and approaches the Pareto optimum of the model through value iteration. However, the following problems remain unsolved in multi-objective optimization for time-constraint-oriented multi-agent scheduling:
(1) model checking requires an exhaustive search of the state space of the multiple agents and the random environment, and the number of model states grows exponentially with the number of concurrent components, causing the state-space explosion problem;
(2) in a time-constraint-oriented stochastic game model, the reward function may be tied to time points; since running times are uncertain, the reward function is variable, so model-based value iteration and policy iteration algorithms are not suited to this scenario;
(3) when combining multiple objective strategies for multiple agents, there is a lack of description of differences in objective priority, and a lack of research on trading off multi-objective optimization strategies based on weight combinations.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and to provide a time-constraint-oriented multi-agent scheduling method that is conceptually advanced, highly reliable and fast.
The technical scheme for achieving the purpose of the invention is as follows: a time-constraint-oriented multi-agent scheduling method comprising the following steps:
S1, establishing a scheduling center, which specifically comprises the following steps:
S11, based on a time-constraint-oriented multi-objective stochastic game template, establishing a time-constraint-oriented stochastic game model in which multiple agents cooperatively collect and transport specimens;
S12, simulating running trajectories of the stochastic game model by statistical model checking, and designing a model-free value-function learning method to compute the maximum reward expectation for the multiple agents collecting or processing specimens at different task points and then transporting the specimens to a target point, wherein the rewards include task execution time, total energy consumption and task completion degree;
S13, iterating the algorithm according to the convergence conditions of the stochastic games among the multiple agents;
S14, assigning weights to the rewards, computing the weighted sum, fitting the weight-combination-based multi-objective Pareto curve according to the convex-optimization hyperplane separation theorem, and generating the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens;
S2, the dispatching center collects real-time data on the states and actions of the multiple agents and the random environment;
S3, the dispatching center processes the collected data and sends action instructions to the multiple agents.
Further, step S11 is specifically:
S111, the time-constraint-oriented multi-objective stochastic game template is a ten-tuple
⟨Π, L, l_0, A, C, Φ(C), inv, grd, Δ, h⟩
wherein:
Π represents the finite set of participants in the stochastic game, namely the multiple agents and the random environment;
L represents the finite set of states of the multiple agents and the random environment;
l_0 ∈ L represents the initial state of the multiple agents and the random environment; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ L and i ∈ Π;
A represents the finite set of actions of the multiple agents;
C represents the finite set of all clocks;
Φ(C) represents the set of clock constraints;
inv: L → Φ(C) represents the invariant condition of the multiple agents with respect to the clock constraints in state l ∈ L;
grd: L × A → Φ(C) represents the clock constraint when the multiple agents take action a ∈ A in state l ∈ L;
Δ: L × A → Dist(2^C × L) represents the state transition function by which the multiple agents move from state l ∈ L, through action a ∈ A, to state l′ ∈ L, where Dist(2^C × L) denotes the probability distributions over 2^C × L;
h: L ∪ A → R represents the reward function corresponding to the states and actions of the multiple agents, where R denotes the real numbers;
S112, a time-constraint-oriented multi-objective stochastic game model for cooperative collection and transportation by the multiple agents is established, adopting λ ∈ (L A)* L → Dist(A) as the strategy by which the multiple agents select from the action set A along a path (l_0 a_0 l_1 a_1 … l_i); with λ as the task scheduling strategy, the reward expectation formula is as follows:
E[rew(h)] = ∫ ( Σ (h(l_i) + h(a_i)) ) dPr(λ)
in the formula: h(l_i) denotes the reward function corresponding to the multiple agents in state l_i; h(a_i) denotes the reward function corresponding to the multiple agents at action a_i; l_i ∈ L ∧ a_i ∈ A; E[rew(h)] denotes the expected reward function of the multiple agents; λ denotes the task scheduling strategy; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ.
Further, the clock constraint g in the set of clock constraints Φ(C) in step S111 is defined inductively by the following grammar:
g ::= x ≤ c | c ≤ x | ¬g_1 | g_1 ∧ g_2
in the formula: x is a clock in C, c is a constant, g ∈ Φ(C), g_1 ∈ Φ(C), g_2 ∈ Φ(C).
Further, step S12 is specifically:
S121, acquiring initial data on all states and actions of the multiple agents in the random environment;
S122, establishing, based on the collected data, a time-constraint-oriented stochastic game model in which the multiple agents cooperatively collect and transport specimens, simulating running trajectories of the stochastic game model by statistical model checking, exploring all states and actions of the multiple agents in the random environment, and training the target strategy;
S123, establishing a state-action value function table Q(s, a) for the multiple agents by offline learning on the simulated running trajectories, the value function table Q(s, a) being defined as the value of taking action a ∈ QA in state s ∈ QS, wherein: QS = ⟨S, SS⟩ denotes the state tuple, QA = ⟨A⟩ denotes the action tuple, S denotes the set of state classifications, and SS denotes the game participant to which the current state belongs.
Further, step S13 is specifically:
S131, for the two-player zero-sum stochastic game, initializing the state-action value function table Q(s, a), selecting the action corresponding to each state s in the multi-agent side or the random environment according to the ε-greedy method, and updating the value function by the cumulative-average update method, with the formula:
Q(s, a) = Q(s, a) + α[v - Q(s, a)]
in the formula: α denotes the coefficient of the approximate cumulative-average update, which can be regarded as a step size, α ∈ (0, 1];
v denotes the estimated return, i.e. the discounted sum of future payoffs;
S132, for the multi-player general-sum stochastic game, first initializing the state-action value function table Q(s, a_1, …, a_n); when the action corresponding to each state s is selected, the multiple agents select actions according to the ε-greedy method, and finally the value function is updated with the Nash equilibrium function, with the formula:
Q_i(s, a_1, …, a_n) = Q_i(s, a_1, …, a_n) + α[ H_s + γ·Nash_i Q(s′) - Q_i(s, a_1, …, a_n) ]
in the formula: α denotes the coefficient of the approximate cumulative-average update, α ∈ (0, 1]; n denotes the number of agents; γ denotes the attenuation (discount) factor; H_s denotes the reward currently obtained by the multiple agents; s′ denotes the new state obtained after state s performs the selected joint action a = [a_1, …, a_n]; Nash_i Q(s′) denotes the long-term average return computed when the multiple agents adopt the joint task scheduling strategy (π_1, …, π_n) from state s′.
Further, the Nash equilibrium function Nash_i Q(s′) of a certain agent i in the general-sum stochastic game in step S132 satisfies the following formula, where v_i(s′, π_1, …, π_n) denotes the long-term average return to agent i from state s′ under the joint strategy (π_1, …, π_n):
Nash_i Q(s′) = v_i(s′, π_1, …, π_n), with v_i(s′, π_1, …, π_n) ≥ v_i(s′, π_1, …, π_{i-1}, x, π_{i+1}, …, π_n) for every x ∈ X
in the formula: X is the task scheduling strategy set of the agent i (x ∈ X denotes a strategy in that set); n denotes the number of agents.
Further, step S14 is specifically:
S141, taking the weighted sum of the multi-objective rewards as the optimization target and computing the weighted sum of the multi-objective optimization, with the formula:
E[rew(w·h)] = ∫ rew(w·h)(λ) dPr(λ)
in the formula: w denotes a weight vector, h denotes a reward vector, λ denotes a task scheduling strategy, and E[rew(w·h)] denotes the expected reward function with the weight combination applied; rew(w·h)(λ) denotes the weighted sum of the objective rewards under task scheduling strategy λ; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ;
S142, fitting the multi-objective Pareto curves for different weight combinations according to the convex-optimization hyperplane separation theorem to generate the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens, the task scheduling strategy comprising the actions that the different agents need to select in their different states.
After the technical scheme is adopted, the invention has the following positive effects:
(1) The invention introduces time constraints into the stochastic game model; on the one hand, the real-time, non-deterministic and probabilistic behaviours exhibited among the multiple agents, or in their interaction with the random environment, can be described; on the other hand, time-related reward functions can be quantified, and the multi-objective optimization strategy is determined through these reward functions.
(2) By designing an offline algorithm, the invention computes the expected reward from Monte Carlo simulation trajectories, avoiding the state-space explosion that arises when computing the maximum reward expectation, and reduces the number of algorithm iterations according to the convergence conditions of the zero-sum and general-sum stochastic games, thereby reducing the energy consumption of the system and improving the reaction speed of the multiple agents.
(3) The invention gives different weights to a plurality of targets and distinguishes the priorities of the targets, thereby improving the running reliability of the multi-agent.
Drawings
In order that the present disclosure may be more readily and clearly understood, the following detailed description of the present disclosure is provided in connection with specific embodiments thereof and with the accompanying drawings, in which:
FIG. 1 is a block diagram of a dispatch center according to the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 shows the value-function table generation method for the two-player zero-sum stochastic game of the present invention;
FIG. 4 shows the value-function table generation method for the multi-player general-sum stochastic game of the present invention;
FIG. 5 illustrates the Pareto curve generation method according to the present invention;
FIG. 6 is a graph of a Pareto curve fitted based on weight combinations according to the present invention;
FIG. 7 is a schematic diagram of the dynamic game model between the multiple robots and the random environment in Embodiment 1;
FIG. 8 is a schematic diagram of the dynamic game model among the multiple robots in Embodiment 2.
Detailed Description
As shown in FIGS. 1-5, a time-constraint-oriented multi-agent scheduling method comprises the following steps:
S1, establishing a scheduling center, which specifically comprises the following steps:
S11, based on a time-constraint-oriented multi-objective stochastic game template, establishing a time-constraint-oriented stochastic game model in which multiple agents cooperatively collect and transport specimens; the specific steps are as follows:
S111, the time-constraint-oriented multi-objective stochastic game template is a ten-tuple
⟨Π, L, l_0, A, C, Φ(C), inv, grd, Δ, h⟩
wherein:
Π represents the finite set of participants in the stochastic game, namely the multiple agents and the random environment;
L represents the finite set of states of the multiple agents and the random environment;
l_0 ∈ L represents the initial state of the multiple agents and the random environment; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ L and i ∈ Π;
A represents the finite set of actions of the multiple agents;
C represents the finite set of all clocks;
Φ(C) represents the set of clock constraints; a clock constraint g is defined inductively by the grammar
g ::= x ≤ c | c ≤ x | ¬g_1 | g_1 ∧ g_2
in which x is a clock in C, c is a constant, and g, g_1, g_2 ∈ Φ(C); for example, if a state requires a delay of c = 3 s, the clock constraint corresponding to state l is 3 ≤ x, and if a state is subject to a deadline of 5 s, the constraint corresponding to state l is x ≤ 5; likewise, g can be a combination of different time constraints, such as 3 ≤ x ≤ 5, and g also admits logical negation (a minimal sketch of one way to represent and evaluate such constraints is given after this list);
inv: L → Φ(C) represents the invariant condition of the multiple agents with respect to the clock constraints in state l ∈ L;
grd: L × A → Φ(C) represents the clock constraint when the multiple agents take action a ∈ A in state l ∈ L;
Δ: L × A → Dist(2^C × L) represents the state transition function by which the multiple agents move from state l ∈ L, through action a ∈ A, to state l′ ∈ L, where Dist(2^C × L) denotes the probability distributions over 2^C × L;
h: L ∪ A → R represents the reward function corresponding to the states and actions of the multiple agents, where R denotes the real numbers;
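As an illustrative aid only (not part of the claimed method), the following Python sketch shows one possible way to represent the clock-constraint grammar above and to evaluate a constraint against a clock valuation; all class and function names are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Union

Clock = str   # a clock name from the finite set C

@dataclass
class Atomic:                 # x <= c (upper=True) or c <= x (upper=False)
    clock: Clock
    bound: float
    upper: bool

@dataclass
class Not:                    # logical negation of a constraint
    inner: "Constraint"

@dataclass
class And:                    # conjunction g1 AND g2
    left: "Constraint"
    right: "Constraint"

Constraint = Union[Atomic, Not, And]

def holds(g: Constraint, valuation: Dict[Clock, float]) -> bool:
    """Check whether a clock valuation satisfies constraint g."""
    if isinstance(g, Atomic):
        x = valuation[g.clock]
        return x <= g.bound if g.upper else g.bound <= x
    if isinstance(g, Not):
        return not holds(g.inner, valuation)
    if isinstance(g, And):
        return holds(g.left, valuation) and holds(g.right, valuation)
    raise TypeError("unknown constraint")

# Example from the description: 3 <= x <= 5
g = And(Atomic("x", 3.0, upper=False), Atomic("x", 5.0, upper=True))
print(holds(g, {"x": 4.0}))   # True
print(holds(g, {"x": 6.0}))   # False
```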
S112, a time-constraint-oriented multi-objective stochastic game model for cooperative collection and transportation by the multiple agents is established, adopting λ ∈ (L A)* L → Dist(A) as the strategy by which the multiple agents select from the action set A along a path (l_0 a_0 l_1 a_1 … l_i); with λ as the task scheduling strategy, the reward expectation formula is as follows:
E[rew(h)] = ∫ ( Σ (h(l_i) + h(a_i)) ) dPr(λ)
in the formula: h(l_i) denotes the reward function corresponding to the multiple agents in state l_i; h(a_i) denotes the reward function corresponding to the multiple agents at action a_i; l_i ∈ L ∧ a_i ∈ A; E[rew(h)] denotes the expected reward function of the multiple agents; λ denotes the task scheduling strategy; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ.
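For intuition only, the following sketch estimates the expected path reward E[rew(h)] by Monte Carlo averaging over runs simulated under a fixed strategy λ, in the spirit of the statistical-model-checking step described below in S12; the simulator interface and all names are assumptions, not part of the patent.

```python
import random
from typing import Callable, List, Tuple

# A simulated run is a list of (state, action) pairs (l0, a0), (l1, a1), ...
Run = List[Tuple[str, str]]

def estimate_expected_reward(simulate: Callable[[], Run],
                             state_reward: Callable[[str], float],
                             action_reward: Callable[[str], float],
                             episodes: int = 2000) -> float:
    """Monte Carlo estimate of E[rew(h)] = E[ sum_i h(l_i) + h(a_i) ]."""
    total = 0.0
    for _ in range(episodes):
        run = simulate()   # one trajectory sampled under the fixed strategy lambda
        total += sum(state_reward(l) + action_reward(a) for l, a in run)
    return total / episodes

# Toy usage with a hypothetical two-state simulator
def toy_simulate() -> Run:
    run, state = [], "idle"
    for _ in range(5):
        action = random.choice(["collect", "transport"])
        run.append((state, action))
        state = "busy" if action == "collect" else "idle"
    return run

print(estimate_expected_reward(toy_simulate,
                               lambda l: 0.0 if l == "idle" else -1.0,
                               lambda a: -2.0 if a == "transport" else -0.5))
```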
S12, simulating running trajectories of the stochastic game model by statistical model checking, and designing a model-free value-function learning method to compute the maximum reward expectation for the multiple agents collecting or processing specimens at different task points and then transporting the specimens to a target point, wherein the rewards include task execution time, total energy consumption and task completion degree, specifically as follows:
S121, acquiring initial data on all states and actions of the multiple agents in the random environment;
S122, establishing, based on the collected data, a time-constraint-oriented stochastic game model in which the multiple agents cooperatively collect and transport specimens, simulating running trajectories of the stochastic game model with UPPAAL-SMC (a statistical model checking tool), exploring all states and actions of the multiple agents in the random environment, and training the target strategy;
S123, establishing a state-action value function table Q(s, a) for the multiple agents by offline learning on the simulated running trajectories; Q(s, a) is defined as the value of taking action a ∈ QA in state s ∈ QS, wherein: QS = ⟨S, SS⟩ denotes the state tuple, QA = ⟨A⟩ denotes the action tuple, S denotes the set of state classifications, and SS denotes the game participant to which the current state belongs.
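A minimal illustration of the state-action value table of S123, with the QS/QA layout assumed from the description above (a dictionary keyed by state and action, plus the cumulative-average style update used later):

```python
from collections import defaultdict
from typing import Dict, Tuple

# QS = <S, SS>: (state classification, owning game participant); QA = <A>: action name
State = Tuple[str, str]
Action = str

# Q-table: maps (state, action) to a learned value, defaulting to 0.0
q_table: Dict[Tuple[State, Action], float] = defaultdict(float)

def q(s: State, a: Action) -> float:
    return q_table[(s, a)]

def update(s: State, a: Action, v: float, alpha: float = 0.1) -> None:
    """Cumulative-average style update: Q(s,a) = Q(s,a) + alpha * (v - Q(s,a))."""
    q_table[(s, a)] += alpha * (v - q_table[(s, a)])

# Example: a 'waiting' state owned by the scheduler, trying the hypothetical action 'Schedule'
s0: State = ("waiting", "scheduler")
update(s0, "Schedule", v=5.0)
print(q(s0, "Schedule"))
```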
S13, iterating the algorithm according to the convergence conditions of the stochastic games among the multiple agents, specifically as follows:
S131, for the two-player zero-sum stochastic game, first initialize the state-action value function table Q(s, a). When the action corresponding to each state s is selected, the multi-agent side or the random environment selects the action according to the ε-greedy method: given the action set A corresponding to s, the action maximizing the value function table is chosen with probability 1 - ε, and an action is chosen at random with probability ε. After performing the selected action a, state s yields a new state s′ and a corresponding reward H. Suppose the two game participants are e and s, with state sets L_e and L_s respectively, and that the model aims to maximize the payoff of party e. If the next state belongs to e, the reward needs to be maximized, as in formula (1); if the next state belongs to s, the reward needs to be minimized, as in formula (2):
v = H_s + γ·max_{a′} Q(s′, a′)   (1)
v = H_s + γ·min_{a′} Q(s′, a′)   (2)
in the formula: H_s denotes the reward currently obtained; γ·max_{a′} Q(s′, a′) denotes the maximized next-step return; γ·min_{a′} Q(s′, a′) denotes the minimized next-step return; γ denotes the attenuation (discount) factor.
Finally, the value function is updated by the cumulative-average update method, with the formula:
Q(s, a) = Q(s, a) + α[v - Q(s, a)]
in the formula: α denotes the coefficient of the approximate cumulative-average update, which can be regarded as a step size, α ∈ (0, 1];
v denotes the estimated return, i.e. the discounted sum of future payoffs;
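The following Python sketch illustrates the ε-greedy selection and the max/min backup of S131 for the two-player zero-sum game; the environment interface and all names are assumptions made for illustration only.

```python
import random
from collections import defaultdict

q = defaultdict(float)              # Q[(state, action)] -> value
gamma, alpha, eps = 0.9, 0.1, 0.1

def epsilon_greedy(state, actions, maximize=True):
    """With probability 1 - eps take the best table action, otherwise a random one."""
    if random.random() < eps:
        return random.choice(actions)
    pick = max if maximize else min
    return pick(actions, key=lambda a: q[(state, a)])

def backup(next_state, next_actions, next_owner_is_e):
    """gamma * max Q(s', .) if the next state belongs to party e, else gamma * min Q(s', .)."""
    values = [q[(next_state, a)] for a in next_actions]
    return gamma * (max(values) if next_owner_is_e else min(values))

def update(state, action, reward, next_state, next_actions, next_owner_is_e):
    v = reward + backup(next_state, next_actions, next_owner_is_e)      # formulas (1)/(2)
    q[(state, action)] += alpha * (v - q[(state, action)])              # cumulative-average update

# Toy usage with hypothetical state and action names
a = epsilon_greedy("waiting", ["Schedule", "Trigger"])
update("waiting", a, 1.0, "executing", ["Finish", "Fail"], next_owner_is_e=True)
print(a, q[("waiting", a)])
```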
S132, for the multi-player general-sum stochastic game, first initialize the state-action value function table Q(s, a_1, …, a_n); that is, for a given state, different agents take different actions, and each agent generates its optimal task scheduling strategy by observing the actions of the other agents and the corresponding reward values. When the action corresponding to each state s is selected, the different agents select actions according to the ε-greedy method. After performing the selected joint action a, state s yields a new state s′ and a corresponding reward H. Finally, the value function is updated with the Nash equilibrium function, with the formula:
Q_i(s, a_1, …, a_n) = Q_i(s, a_1, …, a_n) + α[ H_s + γ·Nash_i Q(s′) - Q_i(s, a_1, …, a_n) ]
in the formula: α denotes the coefficient of the approximate cumulative-average update, α ∈ (0, 1]; n denotes the number of agents; γ denotes the attenuation (discount) factor; H_s denotes the reward currently obtained by the multiple agents; s′ denotes the new state obtained after state s performs the selected joint action a = [a_1, …, a_n]; Nash_i Q(s′) denotes the long-term average return to agent i computed when the multiple agents adopt the joint task scheduling strategy (π_1, …, π_n) from state s′;
wherein, denoting by v_i(s′, π_1, …, π_n) that long-term average return, the joint strategy satisfies the following formula, X being the task scheduling strategy set of agent i:
v_i(s′, π_1, …, π_n) ≥ v_i(s′, π_1, …, π_{i-1}, x, π_{i+1}, …, π_n) for every x ∈ X
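Purely as an illustrative sketch of the S132 update (not the patented algorithm itself), the snippet below applies a Nash-Q style backup for agent i, leaving the stage-game equilibrium value `nash_value` as a pluggable function, since computing a Nash equilibrium of the stage game is a separate subproblem; all names are assumptions.

```python
from collections import defaultdict
from typing import Callable, Tuple

JointAction = Tuple[str, ...]        # (a_1, ..., a_n)

# One Q-table per agent i: q_tables[i][(state, joint_action)] -> value
q_tables = defaultdict(lambda: defaultdict(float))
alpha, gamma = 0.1, 0.9

def nash_q_update(i: int, state: str, joint_action: JointAction,
                  reward_i: float, next_state: str,
                  nash_value: Callable[[int, str], float]) -> None:
    """Q_i(s,a) += alpha * (H_s + gamma * Nash_i Q(s') - Q_i(s,a))."""
    q_i = q_tables[i]
    target = reward_i + gamma * nash_value(i, next_state)
    q_i[(state, joint_action)] += alpha * (target - q_i[(state, joint_action)])

# Toy usage: a placeholder equilibrium value of 0.0 for every state
nash_q_update(0, "s0", ("Move", "Trigger"), reward_i=2.0, next_state="s1",
              nash_value=lambda i, s: 0.0)
print(q_tables[0][("s0", ("Move", "Trigger"))])
```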
S14, assigning weights to the rewards, computing the weighted sum, fitting the weight-combination-based multi-objective Pareto curve according to the convex-optimization hyperplane separation theorem, and generating the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens, specifically as follows:
S141, taking the weighted sum of the multi-objective rewards as the optimization target and computing the weighted sum of the multi-objective optimization, with the formula:
E[rew(w·h)] = ∫ rew(w·h)(λ) dPr(λ)
in the formula: w denotes a weight vector, h denotes a reward vector, λ denotes a task scheduling strategy, and E[rew(w·h)] denotes the expected reward function with the weight combination applied; rew(w·h)(λ) denotes the weighted sum of the objective rewards under task scheduling strategy λ; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ;
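A small illustrative helper (assumed names, not taken from the patent) showing the scalarization of S141: each objective's reward is multiplied by its weight and summed, then averaged over simulated runs.

```python
from typing import Dict, List

def weighted_reward(weights: Dict[str, float], rewards: Dict[str, float]) -> float:
    """Scalarize a multi-objective reward vector h with weight vector w: w . h."""
    return sum(weights[k] * rewards[k] for k in weights)

def expected_weighted_reward(weights: Dict[str, float],
                             sampled_rewards: List[Dict[str, float]]) -> float:
    """Monte Carlo estimate of E[rew(w . h)] over runs sampled under a strategy lambda."""
    return sum(weighted_reward(weights, h) for h in sampled_rewards) / len(sampled_rewards)

# Example: time, energy and completion objectives, with priorities expressed as weights
w = {"time": 0.5, "energy": 0.2, "completion": 0.3}
runs = [{"time": -10.0, "energy": -4.0, "completion": 3.0},
        {"time": -12.0, "energy": -3.5, "completion": 3.0}]
print(expected_weighted_reward(w, runs))
```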
S142, if the goal is to compute the maximum reward expectation, the feasible region of the reward expectations H is computed as
down(H) = { z | ∃ p ∈ P, z ≤ p }
wherein: p denotes a reward expectation, P denotes the set of reward expectations, p ∈ P, down(H) denotes the feasible region, and z ∈ down(H) means that there exists p ∈ P such that every value z in the feasible region is no greater than p; if the goal is to compute the minimum reward expectation (e.g. a minimized-energy-consumption scenario), the feasible region of the reward expectations H is computed analogously with the inequality reversed (z ≥ p);
S143, if the goal is to compute the maximum reward expectation, the infeasible region of the reward expectations H is computed as
UReach(H) = { y | w·y > w·q, w ∈ W, q ∈ Q }
wherein: w denotes a weight vector, W denotes the set of weight vectors, w ∈ W; q denotes a reward expectation vector, Q denotes the set of reward expectation vectors, q ∈ Q; w·y and w·q both denote vector inner products; UReach(H) denotes the infeasible (unreachable) region, and y ∈ UReach(H) means that for any weight combination w and its corresponding reward expectation q, w·y > w·q; if the goal is to compute the minimum reward expectation, the infeasible region of the reward expectations H is computed analogously with the inequality reversed (w·y < w·q);
S144, computing the reward expectation expected for the weight vector w with the largest variance, and computing the distance dist(down(H), UReach(H)) between the feasible region and the infeasible region;
S145, if the distance is greater than the set threshold θ, computing w such that the maximum separating hyperplane constructed from w maximally separates the regions occupied by the two sets, i.e. enlarges the coverage of the reachable set corresponding to the weights; according to the convergence of the distance function, the newly generated reward expectation p is added to the reward expectation set P and the procedure iterates until dist(down(H), UReach(H)) is less than θ.
S2, the dispatching center collects real-time data on the states and actions of the multiple agents and the random environment;
S3, the dispatching center processes the collected data and sends action instructions to the multiple agents.
Fitting a Pareto curve based on multi-objective reward weights for different weight-vector combinations requires traversing the weight-vector combinations and computing the corresponding reward expectations many times. If there are k reward objectives, exhaustively searching the weights and the corresponding optimal reward expectations may require computing the reward expectation of each weight-vector combination at a time complexity of O(n^k). Since the reachable point sets generated by different weight combinations can intersect, the invention selects the weight combination w by letting the expected-reward unreachable point set and reachable point set approach each other, enlarging the coverage of the reachable point set corresponding to the weights as far as possible, thereby reducing the number of reward-expectation computations and improving the efficiency of the whole algorithm.
As shown in FIG. 6, a Pareto curve minimizing two objectives is fitted based on weight combinations. The reward expectations p1 and p2 in the extreme cases w = (0, 1) and w = (1, 0) are computed first; they represent the cases where only objective 2 and only objective 1 are considered, respectively. Corresponding straight lines are then generated from w·(x, y) > w·p, where p denotes a reward expectation vector; the region bounded by the intersection of the two lines and the coordinate axes is the unreachable set, the region bounded by the two reachable points and the upper limit of the reward expectations is the reachable set, and different weights correspond to different reachable point sets and maximum separating hyperplanes. The slope of the maximum separating hyperplane at the unreachable point farthest from the origin can be taken as a weight, reachable points are generated from the reward expectations, and a reachable point set is formed. As shown in FIG. 6, the point o of the unreachable set farthest from the origin is selected; by the hyperplane separation theorem, a hyperplane can separate point o from the reachable set down(H). From this the maximum separating hyperplane is computed and the corresponding weight value w is obtained. The maximum reward expectation is then computed from w and added to the set P, generating the corresponding reachable and unreachable sets. It can be seen from FIG. 6 that the extreme points of the reachable set and the unreachable set continually approach each other; when the distance between the two is smaller than θ, the corresponding reachable point set is output, and the boundary points of the reachable point set form the Pareto curve.
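To make the weight-selection idea concrete, here is a deliberately simplified two-objective sketch (maximization case; the `best_response` interface and all names are assumptions): the weight for the next reward-expectation computation is taken as the normal of the chord between two adjacent reachable points, and refinement stops once the feasible/infeasible gap falls below θ.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Weight = Tuple[float, float]

def next_weight(p1: Point, p2: Point) -> Weight:
    """Weight orthogonal to the chord p1-p2 (L1-normalized), i.e. the normal of the
    candidate separating hyperplane between two known reachable points."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    w = (abs(dy), abs(dx))
    norm = w[0] + w[1] or 1.0
    return (w[0] / norm, w[1] / norm)

def fit_pareto(best_response: Callable[[Weight], Point],
               theta: float = 1e-3, max_iter: int = 50) -> List[Point]:
    """Sketch of the weight-driven Pareto fitting loop for two maximized objectives.
    best_response(w) returns the reward-expectation point maximizing w . q."""
    pts = sorted({best_response((1.0, 0.0)), best_response((0.0, 1.0))})
    for _ in range(max_iter):
        improved, new_pts = False, list(pts)
        for a, b in zip(pts, pts[1:]):
            w = next_weight(a, b)
            q = best_response(w)          # new reward expectation for this weight
            # gap between the new point and the chord a-b along w: feasible/infeasible distance
            gap = w[0] * q[0] + w[1] * q[1] - (w[0] * a[0] + w[1] * a[1])
            if gap > theta and q not in new_pts:
                new_pts.append(q)
                improved = True
        pts = sorted(set(new_pts))
        if not improved:
            break
    return pts   # the boundary points approximate the Pareto curve

# Toy usage: best responses lying on the quarter circle x^2 + y^2 = 1
print(fit_pareto(lambda w: (w[0] / math.hypot(*w), w[1] / math.hypot(*w))))
```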
Example 1
As shown in FIG. 7, multiple robots (robot M, robot N) complete multiple jobs (job T, job T+1, job T+2) in a random environment, and scheduling these robots to complete the jobs generates corresponding energy consumption and time delay. Since a robot may fail partway through a job, different robots may complete different jobs at different times. Therefore, based on the time-constraint-oriented multi-objective stochastic game template, the system in which multiple robots complete multiple jobs in a random environment is first modelled with the time-constraint-oriented zero-sum stochastic game method, the two game parties being the multiple robots and the random environment. The system mainly comprises two parts, namely a job model and a multi-machine scheduler model. The job model has three states, namely idle, waiting and executing. Each job is triggered by the random environment to enter the waiting state from the idle state; if the scheduler determines the robot to execute the job and assigns the job to the corresponding robot, the job enters the executing state; if the robot fails partway and cannot complete the job, the job returns from the executing state to the waiting state to wait for the next available robot; after the job is completed, the job returns from the executing state to the idle state. Each robot has three states, namely idle, running and faulty. In the idle state and the running state the robot fails with probability 1-p and 1-q respectively. In the idle state, if a job waiting to be executed is assigned by the scheduler, the robot enters the running state. Each robot can only be assigned jobs within its execution range and returns to the idle state after the run finishes. If a robot in the faulty state recovers and works normally again, it returns to the idle state.
Secondly, the running trajectories of the model are simulated with UPPAAL-SMC, all states and actions in the random environment are explored, the target strategy is trained from the acquired data, and a state-action value function table Q(s, a) is established by offline learning on the simulated running trajectories. Q(s, a) is defined as the value of taking action a ∈ QA in state s ∈ QS, wherein: QS = ⟨TS, MS, SS⟩ denotes the state tuple; TS denotes the three different job states, namely idle, waiting and executing; MS denotes the three different robot states, namely idle, running and faulty; SS denotes the control attribution of the current state, namely the environment (s ∈ L_e) or the scheduler (s ∈ L_s); QA = ⟨Schedule, Trigger, Fail, Ready, Finish, Recover⟩ denotes the action tuple, where Schedule denotes scheduling a pending job to a robot for execution, Trigger denotes a job being triggered, Fail denotes a robot failure, Ready denotes that a job is ready for execution, Finish denotes that a job is completed, and Recover denotes that a robot recovers from a fault. The state-action value function table is initialized; when the action corresponding to each state s is selected, the action is selected according to the ε-greedy method; after performing the selected action a, state s yields a new state s′ and a corresponding reward H; and the value function Q(s, a) is updated by the cumulative-average update method. Finally, taking the weighted sum of the number of completed jobs, the consumed energy and the time to complete the jobs as the optimization target, the multi-objective Pareto curve is fitted according to the convex-optimization hyperplane separation theorem, thereby generating the scheduling strategy for multiple robots to complete multiple jobs in a random environment.
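A compact illustration, with hypothetical names mirroring the tuples just described, of how the QS and QA tuples of this embodiment might be encoded:

```python
from enum import Enum, auto
from typing import NamedTuple

class TS(Enum):              # job states
    IDLE = auto()
    WAITING = auto()
    EXECUTING = auto()

class MS(Enum):              # robot states
    IDLE = auto()
    RUNNING = auto()
    FAULTY = auto()

class SS(Enum):              # control attribution of the current state
    ENVIRONMENT = auto()
    SCHEDULER = auto()

class QA(Enum):              # actions of the zero-sum game
    SCHEDULE = auto()
    TRIGGER = auto()
    FAIL = auto()
    READY = auto()
    FINISH = auto()
    RECOVER = auto()

class QS(NamedTuple):        # state tuple <TS, MS, SS>
    job: TS
    robot: MS
    owner: SS

# Example state: a waiting job, an idle robot, and the decision belongs to the scheduler
s = QS(TS.WAITING, MS.IDLE, SS.SCHEDULER)
print(s, QA.SCHEDULE)
```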
Example 2
This embodiment uses the same method as Embodiment 1 and is directed to the task of multiple robots cooperatively completing specimen collection and transportation. As shown in FIG. 8, multiple robots (robot M, robot M+1, robot M+2, …) need to collect or process specimens at different task points (task points 1 to 6) and then transport the specimens to target points (target point 1 and target point 2). While one robot is performing a task at a task point, that task point is not open to other robots, and there is an ordering among task points: for example, task point 4 is open only to a robot that has completed the task at task point 1, task point 5 is open only to a robot that has completed task point 1, task point 2 or task point 3, and task point 6 is open only to a robot that has completed task point 3. The uncertainty of the whole system includes the uncertainty of the task times of different robots at different task points and the uncertainty of the travel times of the robots between task points. While executing tasks and moving, the robots need to avoid static and dynamic obstacles, ensure that the total power consumption is minimal given that each robot consumes a different amount of power, and finally reach the target location. A robot executing a task at a task point has three states: when the robot reaches the task point it first triggers waiting; if another robot is already executing the task at that task point, it waits for that task to be completed, and once the other robot finishes, it starts executing its own task. If an error is reported partway through the task, the robot returns to the waiting state and continues to wait to execute. After the robot completes the task at the task point, it searches for the next task point at which to complete a task.
In order to establish a multi-objective optimized task scheduling strategy, i.e. a strategy that completes the collection, processing and transportation of all specimens in a short time and with little energy consumption, a time-constraint-oriented general-sum stochastic game model is first established based on the time-constraint-oriented multi-objective stochastic game template, the game participants being the multiple robots. Secondly, the running trajectories of the model are simulated with UPPAAL-SMC, all states and actions of the multiple robots in the random environment are explored, the simulation data are collected to train the multi-objective optimization strategy, and a state-action value function table Q(s, a_1, …, a_n) is established by offline learning on the simulated running trajectories. Q(s, a_1, …, a_n) is defined as the value of the different robots i ∈ 1, …, n taking actions a_i ∈ QA in state s ∈ QS, wherein: QS = ⟨TS, MS⟩ denotes the state tuple, TS denotes the three different task states, namely idle, waiting and executing, and MS denotes the task point or target point at which the current robot i is located; QA = ⟨Move, Trigger, Ready, Fail, Finish⟩ denotes the action tuple, where Move denotes the robot moving between task points or between a task point and a target point, for which an existing 2D routing algorithm can be used to find the shortest path between task points; Trigger denotes that the current robot is ready to perform the task at the task point, waiting until the task point is idle if it is occupied by another robot; Ready denotes that the robot enters the task-execution state from waiting; Fail denotes that the robot's current task fails due to internal or external factors and needs to be executed again; and Finish denotes that the robot completes its current task and can move to the next task point to execute a task. All robots select the action corresponding to s according to the ε-greedy method; after performing the selected joint action a, state s yields a new state s′ and a corresponding reward H; and the value function Q(s, a_1, …, a_n) is updated with the Nash equilibrium function. Finally, taking the weighted sum of task execution time, total energy consumption and task completion degree as the optimization target, the multi-objective Pareto curve is fitted according to the convex-optimization hyperplane separation theorem, thereby generating the multi-objective optimization strategy for multiple robots to cooperatively collect and transport specimens in a random environment.
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A time-constraint-oriented multi-agent scheduling method, characterized in that the method comprises the following steps:
S1, establishing a scheduling center, which specifically comprises the following steps:
S11, based on a time-constraint-oriented multi-objective stochastic game template, establishing a time-constraint-oriented stochastic game model in which multiple agents cooperatively collect and transport specimens;
S12, simulating running trajectories of the stochastic game model by statistical model checking, and designing a model-free value-function learning method to compute the maximum reward expectation for the multiple agents collecting or processing specimens at different task points and then transporting the specimens to a target point, wherein the rewards include task execution time, total energy consumption and task completion degree;
S13, iterating the algorithm according to the convergence conditions of the stochastic games among the multiple agents;
S14, assigning weights to the rewards, computing the weighted sum, fitting the weight-combination-based multi-objective Pareto curve according to the convex-optimization hyperplane separation theorem, and generating the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens;
S2, the dispatching center collects real-time data on the states and actions of the multiple agents and the random environment;
S3, the dispatching center processes the collected data and sends action instructions to the multiple agents.
2. The time-constraint-oriented multi-agent scheduling method as claimed in claim 1, wherein step S11 specifically comprises:
S111, the time-constraint-oriented multi-objective stochastic game template is a ten-tuple
⟨Π, L, l_0, A, C, Φ(C), inv, grd, Δ, h⟩
wherein:
Π represents the finite set of participants in the stochastic game, namely the multiple agents and the random environment;
L represents the finite set of states of the multiple agents and the random environment;
l_0 ∈ L represents the initial state of the multiple agents and the random environment; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ L and i ∈ Π;
A represents the finite set of actions of the multiple agents;
C represents the finite set of all clocks;
Φ(C) represents the set of clock constraints;
inv: L → Φ(C) represents the invariant condition of the multiple agents with respect to the clock constraints in state l ∈ L;
grd: L × A → Φ(C) represents the clock constraint when the multiple agents take action a ∈ A in state l ∈ L;
Δ: L × A → Dist(2^C × L) represents the state transition function by which the multiple agents move from state l ∈ L, through action a ∈ A, to state l′ ∈ L, where Dist(2^C × L) denotes the probability distributions over 2^C × L;
h: L ∪ A → R represents the reward function corresponding to the states and actions of the multiple agents, where R denotes the real numbers;
S112, a time-constraint-oriented multi-objective stochastic game model for cooperative collection and transportation by the multiple agents is established, adopting λ ∈ (L A)* L → Dist(A) as the strategy by which the multiple agents select from the action set A along a path (l_0 a_0 l_1 a_1 … l_i); with λ as the task scheduling strategy, the reward expectation formula is as follows:
E[rew(h)] = ∫ ( Σ (h(l_i) + h(a_i)) ) dPr(λ)
in the formula: h(l_i) denotes the reward function corresponding to the multiple agents in state l_i; h(a_i) denotes the reward function corresponding to the multiple agents at action a_i; l_i ∈ L ∧ a_i ∈ A; E[rew(h)] denotes the expected reward function of the multiple agents; λ denotes the task scheduling strategy; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ.
3. The time-constraint-oriented multi-agent scheduling method as claimed in claim 2, characterized in that: the clock constraint g in the set of clock constraints Φ(C) in step S111 is defined inductively by the following grammar:
g ::= x ≤ c | c ≤ x | ¬g_1 | g_1 ∧ g_2
in the formula: x is a clock in C, c is a constant, g ∈ Φ(C), g_1 ∈ Φ(C), g_2 ∈ Φ(C).
4. The time-constraint-oriented multi-agent scheduling method as claimed in claim 1, wherein step S12 specifically comprises:
S121, acquiring initial data on all states and actions of the multiple agents in the random environment;
S122, establishing, based on the collected data, a time-constraint-oriented stochastic game model in which the multiple agents cooperatively collect and transport specimens, simulating running trajectories of the stochastic game model by statistical model checking, exploring all states and actions of the multiple agents in the random environment, and training the target strategy;
S123, establishing a state-action value function table Q(s, a) for the multiple agents by offline learning on the simulated running trajectories, the value function table Q(s, a) being defined as the value of taking action a ∈ QA in state s ∈ QS, wherein: QS = ⟨S, SS⟩ denotes the state tuple, QA = ⟨A⟩ denotes the action tuple, S denotes the set of state classifications, and SS denotes the game participant to which the current state belongs.
5. The time-constraint-oriented multi-agent scheduling method as claimed in claim 4, wherein step S13 specifically comprises:
S131, for the two-player zero-sum stochastic game, initializing the state-action value function table Q(s, a), selecting the action corresponding to each state s in the multi-agent side or the random environment according to the ε-greedy method, and updating the value function by the cumulative-average update method, with the formula:
Q(s, a) = Q(s, a) + α[v - Q(s, a)]
in the formula: α denotes the coefficient of the approximate cumulative-average update, which can be regarded as a step size, α ∈ (0, 1];
v denotes the estimated return, i.e. the discounted sum of future payoffs;
S132, for the multi-player general-sum stochastic game, first initializing the state-action value function table Q(s, a_1, …, a_n); when the action corresponding to each state s is selected, the multiple agents select actions according to the ε-greedy method, and finally the value function is updated with the Nash equilibrium function, with the formula:
Q_i(s, a_1, …, a_n) = Q_i(s, a_1, …, a_n) + α[ H_s + γ·Nash_i Q(s′) - Q_i(s, a_1, …, a_n) ]
in the formula: α denotes the coefficient of the approximate cumulative-average update, α ∈ (0, 1]; n denotes the number of agents; γ denotes the attenuation (discount) factor; H_s denotes the reward currently obtained by the multiple agents; s′ denotes the new state obtained after state s performs the selected joint action a = [a_1, …, a_n]; Nash_i Q(s′) denotes the long-term average return computed when the multiple agents adopt the joint task scheduling strategy (π_1, …, π_n) from state s′.
6. The time-constraint-oriented multi-agent scheduling method as claimed in claim 5, characterized in that: the Nash equilibrium function Nash_i Q(s′) of a certain agent i in the general-sum stochastic game in step S132 satisfies the following formula, where v_i(s′, π_1, …, π_n) denotes the long-term average return to agent i from state s′ under the joint strategy (π_1, …, π_n):
Nash_i Q(s′) = v_i(s′, π_1, …, π_n), with v_i(s′, π_1, …, π_n) ≥ v_i(s′, π_1, …, π_{i-1}, x, π_{i+1}, …, π_n) for every x ∈ X
in the formula: X is the task scheduling strategy set of the agent i (x ∈ X denotes a strategy in that set); n denotes the number of agents.
7. The time-constraint-oriented multi-agent scheduling method as claimed in claim 1, wherein step S14 specifically comprises:
S141, taking the weighted sum of the multi-objective rewards as the optimization target and computing the weighted sum of the multi-objective optimization, with the formula:
E[rew(w·h)] = ∫ rew(w·h)(λ) dPr(λ)
in the formula: w denotes a weight vector, h denotes a reward vector, λ denotes a task scheduling strategy, and E[rew(w·h)] denotes the expected reward function with the weight combination applied; rew(w·h)(λ) denotes the weighted sum of the objective rewards under task scheduling strategy λ; Pr(λ) denotes the probability distribution induced when the multiple agents select task scheduling strategy λ;
S142, fitting the multi-objective Pareto curves for different weight combinations according to the convex-optimization hyperplane separation theorem to generate the optimal task scheduling strategy for the multiple agents to cooperatively collect and transport specimens, the task scheduling strategy comprising the actions that the different agents need to select in their different states.
CN202110810946.4A 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint Active CN113269297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110810946.4A CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110810946.4A CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Publications (2)

Publication Number Publication Date
CN113269297A CN113269297A (en) 2021-08-17
CN113269297B true CN113269297B (en) 2021-11-05

Family

ID=77236924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110810946.4A Active CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Country Status (1)

Country Link
CN (1) CN113269297B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563527B (en) * 2022-09-27 2023-06-16 西南交通大学 Multi-Agent deep reinforcement learning system and method based on state classification and assignment
CN115576278B (en) * 2022-09-30 2023-08-04 常州大学 Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728406A (en) * 2019-10-15 2020-01-24 南京邮电大学 Multi-agent power generation optimization scheduling method based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045655A (en) * 2016-12-07 2017-08-15 三峡大学 Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN110471297B (en) * 2019-07-30 2020-08-11 清华大学 Multi-agent cooperative control method, system and equipment
CN112132263B (en) * 2020-09-11 2022-09-16 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728406A (en) * 2019-10-15 2020-01-24 南京邮电大学 Multi-agent power generation optimization scheduling method based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning

Also Published As

Publication number Publication date
CN113269297A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN113269297B (en) Multi-agent scheduling method facing time constraint
Zhao et al. A heuristic distributed task allocation method for multivehicle multitask problems and its application to search and rescue scenario
JP6938791B2 (en) Methods for operating robots in multi-agent systems, robots and multi-agent systems
Whitbrook et al. Reliable, distributed scheduling and rescheduling for time-critical, multiagent systems
Goldberg et al. Coordinating mobile robot group behavior using a model of interaction dynamics
CN107562066B (en) Multi-target heuristic sequencing task planning method for spacecraft
Rybkin et al. Model-based reinforcement learning via latent-space collocation
Arai et al. Experience-based reinforcement learning to acquire effective behavior in a multi-agent domain
Kannan et al. The autonomous recharging problem: Formulation and a market-based solution
Schillinger et al. Auctioning over probabilistic options for temporal logic-based multi-robot cooperation under uncertainty
Yu et al. Asynchronous multi-agent reinforcement learning for efficient real-time multi-robot cooperative exploration
CN117103282B (en) Double-arm robot cooperative motion control method based on MATD3 algorithm
Matikainen et al. Multi-armed recommendation bandits for selecting state machine policies for robotic systems
Schillinger et al. Hierarchical ltl-task mdps for multi-agent coordination through auctioning and learning
CN113341706A (en) Man-machine cooperation assembly line system based on deep reinforcement learning
Chen et al. A bi-criteria nonlinear fluctuation smoothing rule incorporating the SOM–FBPN remaining cycle time estimator for scheduling a wafer fab—a simulation study
Zaidi et al. Task allocation based on shared resource constraint for multi-robot systems in manufacturing industry
Bøgh et al. Distributed fleet management in noisy environments via model-predictive control
Xiao et al. Local advantage actor-critic for robust multi-agent deep reinforcement learning
Li et al. Decentralized Refinement Planning and Acting
Shriyam et al. Task assignment and scheduling for mobile robot teams
Li et al. Multi-mode filter target tracking method for mobile robot using multi-agent reinforcement learning
Zhang et al. Multi-task Actor-Critic with Knowledge Transfer via a Shared Critic
Dracopoulos Robot path planning for maze navigation
Yoon et al. Learning to improve performance during non-repetitive tasks performed by robots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant