CN113269297A - Multi-agent scheduling method facing time constraint - Google Patents


Info

Publication number
CN113269297A
CN113269297A
Authority
CN
China
Prior art keywords
agent
random
representing
state
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110810946.4A
Other languages
Chinese (zh)
Other versions
CN113269297B (en)
Inventor
Zhu Chenyang (朱晨阳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghe Software Jiangsu Co ltd
Original Assignee
Donghe Software Jiangsu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghe Software Jiangsu Co ltd
Priority to CN202110810946.4A
Publication of CN113269297A
Application granted
Publication of CN113269297B
Active legal-status: Current
Anticipated expiration legal-status

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N20/00: Machine learning

Abstract

The invention relates to a time-constraint-oriented multi-agent scheduling method, which comprises the following steps: establishing a dispatching center; the dispatching center collects real-time data on the states and actions of the multi-agents and the random environment; the dispatching center processes the collected data and sends action instructions to the multi-agents. By introducing time constraints into the random game model, the real-time, non-deterministic and probabilistic behaviors exhibited among the multi-agents, or in the interaction between the multi-agents and the random environment, can be described, the time-related reward function can be quantified, and a multi-objective optimization strategy is determined through the reward function. The designed algorithm improves the efficiency of calculating the maximum reward expectation of the model and of fitting the pareto curve based on weight combinations, thereby improving the reaction speed of the multi-agents. By giving different weights to the multiple targets, the priorities of the targets are distinguished, thereby improving the running reliability of the multi-agents.

Description

Multi-agent scheduling method facing time constraint
Technical Field
The invention relates to the technical field of multi-agent interaction, and in particular to a time-constraint-oriented multi-agent scheduling method.
Background
As interaction among multi-agents (robots, robot dogs, drones, etc.) becomes increasingly close, the errors that occur during interaction continue to grow with the size and complexity of multi-agent systems. How to design a multi-agent scheduling system that meets multi-objective design requirements under an uncertain environment and the corresponding time constraints has become a key scientific problem that urgently needs to be solved.
At present, research on multi-agent scheduling systems mainly verifies the quantitative properties of the model and the properties related to the reward function through model checking, and approaches the pareto optimum of the model through value iteration. However, the following problems remain unsolved in multi-objective optimization for time-constraint-oriented multi-agent scheduling:
(1) model checking requires an exhaustive search of the state space of the multi-agents and the random environment, and the number of model states grows exponentially with the number of concurrent components, causing the state-space explosion problem;
(2) in a time-constraint-oriented random game model, the reward function can depend on time points, and when the running time is uncertain the reward function is variable, so that model-based value iteration and policy iteration algorithms are not suitable for this scenario;
(3) when combining multiple target strategies for the multi-agents, there is a lack of description of differences in target priority, and a lack of research on trading off multi-objective optimization strategies based on weight combinations.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provide a time-constraint-oriented multi-agent scheduling method that is conceptually advanced, highly reliable and fast.
The technical scheme for realizing the purpose of the invention is as follows: a time-constraint-oriented multi-agent scheduling method includes the following steps:
s1, establishing a scheduling center, which specifically comprises the following steps:
S11, establishing, based on a time-constraint-oriented multi-objective random game template, a random game model between the multi-agents and the random environment or among the multi-agents;
S12, simulating the running traces of the random game model by statistical model checking, and designing a value function learning method that is not based on the model to calculate the maximum reward expectation of the multi-agents taking different actions in the various states;
S13, iterating the algorithm according to the convergence conditions of the zero-sum random game between the multi-agents and the random environment and of the general-sum random game among the multi-agents;
s14, fitting a multi-target pareto curve based on weight combination according to a convex optimization hyperplane separation theorem;
s2, the dispatching center collects real-time data of states and actions of the multi-agent and random environments;
and S3, the dispatching center processes the acquired data and sends the action instruction to the multi-agent.
Further, step S11 is specifically:
S111, the time-constraint-oriented multi-objective random game template is a ten-tuple (Π, S, s0, A, X, C, Inv, en, δ, r), wherein:
Π represents the finite set of participants, namely the multi-agents and the random environment taking part in the random game;
S represents the finite set of states of the multi-agents and the random environment;
s0 represents the initial state of the multi-agents and the random environment, s0 ∈ S; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ S and the union of all S_i equal to S;
A represents the finite set of actions of the multi-agents;
X represents the finite set of all clocks;
C represents the set of clock constraints;
Inv indicates the invariance condition, a clock constraint, attached to each multi-agent state in S;
en indicates the clock constraint that must hold when action a is taken in multi-agent state s;
δ represents the state transition function by which the multi-agents move from state s through action a to state s', δ: S × A → Dist(S), where Dist(S) denotes a probability distribution over S;
r represents the reward function corresponding to the states and actions of the multi-agents, taking values in the real numbers ℝ;
S112, establishing the time-constraint-oriented multi-objective random game model between the multi-agents and the random environment or among the multi-agents, and taking σ as the strategy with which the multi-agents select actions from the action set A along a path ω; the reward expectation of a strategy is as follows:

E^σ[r] = Σ_ω P^σ(ω) · Σ_{(s,a)∈ω} ( r_S(s) + r_A(a) )

in the formula: r_S(s) denotes the reward function corresponding to multi-agent state s; r_A(a) denotes the reward function corresponding to multi-agent action a; E^σ[r] denotes the expected reward of the multi-agents; σ denotes a strategy; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ.
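Where the reward expectation above is estimated from sampled game paths rather than computed exactly, the estimator can be sketched as follows; this is a minimal illustration in which `sample_path` and the reward dictionaries are assumed placeholders, not part of the patent:

```python
import random

def estimate_reward_expectation(sample_path, state_reward, action_reward, episodes=10000):
    """Monte Carlo estimate of E^sigma[r]: the average, over sampled paths,
    of the accumulated state rewards r_S(s) and action rewards r_A(a)."""
    total = 0.0
    for _ in range(episodes):
        path = sample_path()  # list of (state, action) pairs generated under strategy sigma
        total += sum(state_reward.get(s, 0.0) + action_reward.get(a, 0.0) for s, a in path)
    return total / episodes

# Toy usage with a hand-written two-state path generator (illustrative only).
if __name__ == "__main__":
    def sample_path():
        return [("idle", "trigger"), ("wait", "assign")] if random.random() < 0.5 else [("idle", "trigger")]
    print(estimate_reward_expectation(sample_path, {"wait": -1.0}, {"assign": 2.0}))
```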
Further, a clock constraint c in the set of clock constraints C of step S111 is defined by the following grammar:

c ::= x ∼ k | c1 ∧ c2, with ∼ ∈ {<, ≤, =, ≥, >}

in the formula: x is one of the clocks in X, and k is a constant.
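For illustration, such atomic clock constraints, their conjunctions and their negations could be represented and evaluated against a clock valuation as in the following sketch; the data layout is an assumption of this example, not a structure defined by the patent:

```python
from dataclasses import dataclass
from typing import Callable, Dict

Valuation = Dict[str, float]  # current value of each clock in X

@dataclass
class Atomic:
    clock: str      # x, one of the clocks in X
    op: str         # one of <, <=, ==, >=, >
    bound: float    # the constant k

    def holds(self, v: Valuation) -> bool:
        x = v[self.clock]
        return {"<": x < self.bound, "<=": x <= self.bound, "==": x == self.bound,
                ">=": x >= self.bound, ">": x > self.bound}[self.op]

def conj(*cs) -> Callable[[Valuation], bool]:
    return lambda v: all(c.holds(v) if isinstance(c, Atomic) else c(v) for c in cs)

def neg(c) -> Callable[[Valuation], bool]:
    return lambda v: not (c.holds(v) if isinstance(c, Atomic) else c(v))

# Example: a delay constraint x >= 5 combined with a deadline x <= 9.
window = conj(Atomic("x", ">=", 5), Atomic("x", "<=", 9))
print(window({"x": 7.0}), window({"x": 12.0}))  # True False
```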
further, step S12 is specifically:
s121, acquiring initial data of all states and actions of the multi-agent in a random environment;
S122, establishing the time-constraint-oriented random game model based on the acquired data, simulating the running traces of the random game model through UPPAAL-SMC, and exploring all states and actions of the multi-agents in the random environment and training the target strategy;
S123, establishing the state-action value function table Q(s, a) of the multi-agents by off-line learning from the simulated running traces; the value function table Q(s, a) is defined as the value of taking action a in state s, wherein: s denotes a state tuple, a denotes an action tuple, and the state tuple records the different classifications of the state and the game participant to which the current state belongs.
Further, step S13 is specifically:
S131, for the two-player zero-sum random game, a state-action value function table Q(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agent or the random environment selects the action for s by the ε-greedy method, and the value function is finally updated by the method of cumulatively updating the average value, according to the formula:

Q(s, a) ← Q(s, a) + (1/k) · ( G − Q(s, a) )

in the formula: k denotes the approximate number of cumulative computations, which may be regarded as a step size, k ≥ 1; G denotes the estimated return, i.e. the sum of the future rewards with attenuation;
S132, for the multi-player general-sum random game, a state-action value function table Q_i(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agents select the action for s by the ε-greedy method, and the value function is finally updated by the Nash equilibrium function, according to the formula:

Q_i(s, a) ← Q_i(s, a) + (1/k) · ( r_i + γ · Nash_i(s') − Q_i(s, a) )

in the formula: k denotes the approximate number of cumulative computations, k ≥ 1; n denotes the number of multi-agents, i ∈ {1, …, n}; γ denotes the attenuation value; r_i denotes the reward currently earned by the multi-agent; s' denotes the new state obtained after the selected action a is executed in state s; Nash_i(s') denotes the long-term average return calculated from state s' when the multi-agents adopt the joint strategy (π_1, …, π_n).
Further, the Nash equilibrium function Nash_i(s') of an agent i in the general-sum random game of step S132 satisfies the Nash equilibrium condition over the agents' strategy sets: under the joint strategy (π_1, …, π_n), no agent i can increase its long-term return by unilaterally changing its own strategy within its strategy set, in the formula: Θ_i denotes the strategy set of a certain agent i; n denotes the number of multi-agents.
Further, step S14 is specifically:
S141, taking the weighted sum of the multi-objective rewards as the optimization target, the weighted sum of the multi-objective optimization is calculated according to the formula:

E^σ[w · r] = Σ_ω P^σ(ω) · ( w · r(ω) )

in the formula: w denotes the weight vector; r denotes the reward vector; σ denotes the strategy; E^σ[w · r] denotes the expected reward function to which the weight combination is added; w · r denotes the target reward weighted sum under strategy σ; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ;
and S142, fitting the multi-target pareto curves with different weight combinations according to the convex optimization hyperplane separation theorem.
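A minimal sketch of the weighted-sum scalarization of S141, collapsing a reward vector into one scalar objective for a given weight vector w, follows; the function and variable names are illustrative assumptions:

```python
def weighted_reward(weights, reward_vector):
    """Target reward weighted sum w . r for one path or one strategy evaluation."""
    if len(weights) != len(reward_vector):
        raise ValueError("weight and reward vectors must have equal length")
    return sum(w * r for w, r in zip(weights, reward_vector))

# Example: two objectives (jobs finished, negative energy consumed), weighted 0.7 / 0.3.
print(weighted_reward([0.7, 0.3], [5.0, -2.4]))
```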
After the technical scheme is adopted, the invention has the following positive effects:
(1) The invention introduces time constraints into the random game model; on the one hand, the real-time, non-deterministic and probabilistic behaviors exhibited among the multi-agents, or in the interaction between the multi-agents and the random environment, can be described; on the other hand, the time-related reward function can be quantified, and the multi-objective optimization strategy is determined through the reward function.
(2) The invention designs an off-line algorithm that calculates the reward expectation from Monte Carlo simulated traces, avoiding the state-space explosion problem generated when calculating the maximum reward expectation, and reduces the number of algorithm iterations according to the convergence conditions of the zero-sum random game and the general-sum random game, thereby reducing the energy consumption of the system and improving the reaction speed of the multi-agents.
(3) The invention gives different weights to the multiple targets and distinguishes the priorities of the targets, thereby improving the running reliability of the multi-agents.
Drawings
In order that the present disclosure may be more readily and clearly understood, the following detailed description of the present disclosure is provided in connection with specific embodiments thereof and with the accompanying drawings, in which:
FIG. 1 is a block diagram of a dispatch center according to the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 shows the value function table generation method for the two-player zero-sum random game of the present invention;
FIG. 4 shows the value function table generation method for the multi-player general-sum random game of the present invention;
FIG. 5 illustrates the pareto curve generation method of the present invention;
FIG. 6 is a graph of a pareto curve fit based on weight combinations according to the present invention;
FIG. 7 is a schematic diagram of the dynamic game model between the multiple robots and the random environment in embodiment 1;
FIG. 8 is a schematic diagram of the dynamic game model among the multiple robots in embodiment 2.
Detailed Description
As shown in figs. 1-5, a time-constraint-oriented multi-agent scheduling method includes the following steps:
S1, establishing a scheduling center, which specifically comprises the following steps:
S11, establishing, based on a time-constraint-oriented multi-objective random game template, a random game model between the multi-agents and the random environment or among the multi-agents, which specifically comprises the following steps:
S111, the time-constraint-oriented multi-objective random game template is a ten-tuple (Π, S, s0, A, X, C, Inv, en, δ, r), wherein:
Π represents the finite set of participants, namely the multi-agents and the random environment taking part in the random game;
S represents the finite set of states of the multi-agents and the random environment;
s0 represents the initial state of the multi-agents and the random environment, s0 ∈ S; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ S and the union of all S_i equal to S;
A represents the finite set of actions of the multi-agents;
X represents the finite set of all clocks;
C represents the set of clock constraints; a clock constraint c is defined by the grammar c ::= x ∼ k | c1 ∧ c2, where x is one of the clocks in X, k is a constant and ∼ ∈ {<, ≤, =, ≥, >}. For example, if a state requires a delay d, the actions in A corresponding to that state of S carry a time constraint of the form x ≥ d; if a state is subject to a cutoff (deadline) time d, the corresponding actions in A carry a constraint of the form x ≤ d. At the same time, c can also be a combination of different time constraints, e.g. x ≥ d1 ∧ x ≤ d2, and the logical negation of a constraint c is also accepted.
Inv indicates the invariance condition, a clock constraint, attached to each multi-agent state in S;
en indicates the clock constraint that must hold when action a is taken in multi-agent state s;
δ represents the state transition function by which the multi-agents move from state s through action a to state s', δ: S × A → Dist(S), where Dist(S) denotes a probability distribution over S;
r represents the reward function corresponding to the states and actions of the multi-agents, taking values in the real numbers ℝ.
S112, establishing the time-constraint-oriented multi-objective random game model between the multi-agents and the random environment or among the multi-agents, and taking σ as the strategy with which the multi-agents select actions from the action set A along a path ω; the reward expectation of a strategy is as follows:

E^σ[r] = Σ_ω P^σ(ω) · Σ_{(s,a)∈ω} ( r_S(s) + r_A(a) )

in the formula: r_S(s) denotes the reward function corresponding to multi-agent state s; r_A(a) denotes the reward function corresponding to multi-agent action a; E^σ[r] denotes the expected reward of the multi-agents; σ denotes a strategy; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ.
S12, simulating the running traces ω of the random game model by statistical model checking, and designing a value function learning method that is not based on the model to calculate the maximum reward expectation of the multi-agents taking different actions in the various states, which is as follows:
S121, acquiring initial data of all states and actions of the multi-agents in the random environment;
S122, establishing the time-constraint-oriented random game model based on the acquired data, simulating the running traces ω of the random game model through UPPAAL-SMC (a statistical model checking tool), and exploring all states and actions of the multi-agents in the random environment and training the target strategy;
S123, establishing the state-action value function table Q(s, a) of the multi-agents by off-line learning from the simulated running traces ω; the value function table Q(s, a) is defined as the value of taking action a in state s, wherein: s denotes a state tuple, a denotes an action tuple, and the state tuple records the different classifications of the state and the game participant to which the current state belongs.
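One possible realization of the off-line learning of S123, building the state-action value table by averaging discounted returns over simulated traces, is sketched below; the trace format and names are assumptions of this example, and in the method described here the traces themselves would come from UPPAAL-SMC:

```python
from collections import defaultdict

def build_q_table(traces, gamma=0.95):
    """Every-visit Monte Carlo: Q(s, a) is the running average of the
    discounted return observed after taking action a in state s."""
    q = defaultdict(float)
    visits = defaultdict(int)
    for trace in traces:                      # trace: list of (state, action, reward)
        g = 0.0
        returns = []
        for state, action, reward in reversed(trace):
            g = reward + gamma * g            # discounted return from this step onward
            returns.append((state, action, g))
        for state, action, g in returns:
            visits[(state, action)] += 1
            k = visits[(state, action)]
            q[(state, action)] += (g - q[(state, action)]) / k   # incremental average
    return q

# Toy usage with two hand-written traces.
traces = [[("idle", "assign", 0.0), ("run", "finish", 1.0)],
          [("idle", "assign", 0.0), ("run", "fault", -1.0)]]
print(dict(build_q_table(traces)))
```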
S13, iterating the algorithm according to the convergence conditions of the zero-sum random game between the multi-agents and the random environment and of the general-sum random game among the multi-agents, which is as follows:
S131, for the two-player zero-sum random game, the state-action value function table Q(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agent or the random environment selects the action for s by the ε-greedy method, i.e. if A is the action set corresponding to s, the action that maximizes the value function table is selected with probability 1 − ε, and an action is selected at random with probability ε; after the selected action a is executed in state s, a new state s' and the corresponding reward r are obtained. Suppose that the two game participants are max and min, that their state sets are S_max and S_min respectively, and that the target of the model is to maximize the gain of participant max. If the next-step state belongs to max, the reward needs to be maximized, as shown in equation (1); if the next-step state belongs to min, the reward needs to be minimized, as shown in equation (2):

G = r + γ · max_{a'} Q(s', a')   (1)
G = r + γ · min_{a'} Q(s', a')   (2)

in the formula: r denotes the currently obtained reward, max_{a'} Q(s', a') denotes currently maximizing the gain of the next step, min_{a'} Q(s', a') denotes currently minimizing the gain of the next step, and γ denotes the attenuation value;
finally the value function is updated by the method of cumulatively updating the average value, according to the formula:

Q(s, a) ← Q(s, a) + (1/k) · ( G − Q(s, a) )

in the formula: k denotes the approximate number of cumulative computations, which may be regarded as a step size, k ≥ 1; G denotes the estimated return, i.e. the sum of the future rewards with attenuation;
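The S131 update can be sketched as follows for a two-player zero-sum game: ε-greedy action selection plus the incremental-average update, with the next-step target taken as a max or a min depending on which participant owns the next state; state ownership, the action sets and the toy usage values are assumptions of this sketch:

```python
import random
from collections import defaultdict

def eps_greedy(q, state, actions, eps=0.1, maximize=True):
    """With probability eps pick a random action, otherwise the best one for this state."""
    if random.random() < eps:
        return random.choice(actions)
    best = max if maximize else min
    return best(actions, key=lambda a: q[(state, a)])

def zero_sum_update(q, counts, s, a, reward, s_next, next_actions, next_is_max, gamma=0.9):
    """Q(s,a) <- Q(s,a) + (1/k) * (G - Q(s,a)), with
    G = r + gamma * max_a' Q(s',a') if the next state belongs to the maximizer,
    G = r + gamma * min_a' Q(s',a') otherwise."""
    pick = max if next_is_max else min
    g = reward + gamma * pick(q[(s_next, a2)] for a2 in next_actions)
    counts[(s, a)] += 1
    k = counts[(s, a)]
    q[(s, a)] += (g - q[(s, a)]) / k

q, counts = defaultdict(float), defaultdict(int)
zero_sum_update(q, counts, "wait", "assign", 1.0, "run", ["finish", "fault"], next_is_max=False)
print(dict(q))
print(eps_greedy(q, "wait", ["assign", "idle"]))
```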
S132, for the multi-player general-sum random game, the state-action value function table Q_i(s, a) is first initialized; that is, for a given state, different agents will take different actions, and each agent generates an optimal strategy by observing the actions of the other agents and the corresponding reward values. When selecting the action corresponding to each state s, the different agents select the action for s by the ε-greedy method; after the selected action a is executed in state s, a new state s' and the corresponding reward r are obtained; finally the value function is updated by the Nash equilibrium function, according to the formula:

Q_i(s, a) ← Q_i(s, a) + (1/k) · ( r_i + γ · Nash_i(s') − Q_i(s, a) )

in the formula: k denotes the approximate number of cumulative computations, k ≥ 1; n denotes the number of multi-agents, i ∈ {1, …, n}; γ denotes the attenuation value; r_i denotes the reward currently earned by the multi-agent; s' denotes the new state obtained after the selected action a is executed in state s; Nash_i(s') denotes the long-term average return calculated from state s' when the multi-agents adopt the joint strategy (π_1, …, π_n);
wherein Nash_i(s'), the long-term average return calculated from s' under the joint strategy (π_1, …, π_n), satisfies the Nash equilibrium condition over the agents' strategy sets: no agent i can increase its long-term return by unilaterally changing its own strategy within its strategy set Θ_i, where Θ_i denotes the strategy set of agent i.
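A simplified sketch of the S132 update for the special case of two agents, where the stage game at the next state is solved only over pure joint actions; the patent's Nash equilibrium function is not restricted to pure equilibria, so this is an illustrative approximation rather than the method itself:

```python
from collections import defaultdict
from itertools import product

def pure_nash_values(q1, q2, s, acts1, acts2):
    """Return (Nash_1(s), Nash_2(s)) for the first pure-strategy equilibrium found,
    falling back to the best joint action for agent 1 if none exists."""
    for a1, a2 in product(acts1, acts2):
        if all(q1[(s, a1, a2)] >= q1[(s, b1, a2)] for b1 in acts1) and \
           all(q2[(s, a1, a2)] >= q2[(s, a1, b2)] for b2 in acts2):
            return q1[(s, a1, a2)], q2[(s, a1, a2)]
    a1, a2 = max(product(acts1, acts2), key=lambda j: q1[(s,) + j])
    return q1[(s, a1, a2)], q2[(s, a1, a2)]

def nash_q_update(q1, q2, counts, s, joint, r1, r2, s_next, acts1, acts2, gamma=0.9):
    """Q_i(s,a) <- Q_i(s,a) + (1/k) * (r_i + gamma * Nash_i(s') - Q_i(s,a))."""
    n1, n2 = pure_nash_values(q1, q2, s_next, acts1, acts2)
    counts[(s, joint)] += 1
    k = counts[(s, joint)]
    q1[(s,) + joint] += (r1 + gamma * n1 - q1[(s,) + joint]) / k
    q2[(s,) + joint] += (r2 + gamma * n2 - q2[(s,) + joint]) / k

q1, q2, counts = defaultdict(float), defaultdict(float), defaultdict(int)
nash_q_update(q1, q2, counts, "p1", ("move", "wait"), 1.0, 0.5, "p2", ["move", "wait"], ["move", "wait"])
print(q1[("p1", "move", "wait")], q2[("p1", "move", "wait")])
```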
S14, fitting the multi-objective pareto curves under different weight combinations according to the convex optimization hyperplane separation theorem, which is as follows:
S141, taking the weighted sum of the multi-objective rewards as the optimization target, the weighted sum of the multi-objective optimization is calculated according to the formula:

E^σ[w · r] = Σ_ω P^σ(ω) · ( w · r(ω) )

in the formula: w denotes the weight vector; r denotes the reward vector; σ denotes the strategy; E^σ[w · r] denotes the expected reward function to which the weight combination is added; w · r denotes the target reward weighted sum under strategy σ; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ;
S142, if the goal is to calculate the maximum reward expectation, the feasible domain of the reward expectation is the set of points dominated by some already-computed reward expectation, i.e. the set of points whose values are all less than or equal to some reward expectation E in the set Y of computed reward expectations, E ∈ Y; if the goal is to calculate the minimum reward expectation (e.g. a minimized energy-consumption scenario), the feasible domain of the reward expectation is defined symmetrically, with the inequality reversed;
S143, if the goal is to calculate the maximum reward expectation, the infeasible domain of the reward expectation is the set of reward expectation vectors q such that, for every weight vector w in the set of weight vectors W, the inner product w · q exceeds the inner products w · E attainable by the computed reward expectation vectors E; if the goal is to calculate the minimum reward expectation, the infeasible domain of the reward expectation is defined symmetrically;
S144, calculating the weight vector w with the maximum variance, and calculating the distance d between the feasible domain and the infeasible domain;
S145, if the distance d is larger than the set distance threshold θ, calculating w so that the maximum separating hyperplane constructed from w can maximally separate the regions where the two sets are located, i.e. the coverage of the reachable set corresponding to the weight is enlarged; based on the convergence of the distance function, the newly generated reward expectation E is added to the reward expectation set Y, and the iteration continues until d is less than θ.
S2, the dispatching center collects real-time data of states and actions of the multi-agent and random environments;
and S3, the dispatching center processes the acquired data and sends the action instruction to the multi-agent.
Fitting a pareto curve based on the multi-objective reward weights for different weight vector combinations requires traversing the weight vector combinations and calculating the corresponding reward expectations many times. If there are n target rewards, exhaustively searching the weights and the corresponding optimal reward expectations requires calculating the reward expectation of every group of weight vector combinations, at a correspondingly high time complexity. Because the reachable point sets generated by different weight combinations intersect, the invention selects the weight combination w by approximating the unreachable point set of the reward expectation toward the reachable point set, so that the coverage of the reachable point set corresponding to the weight is enlarged as much as possible, thereby reducing the number of reward expectation calculations and improving the efficiency of the whole algorithm.

As shown in FIG. 6, fitting a weight-combination-based pareto curve that minimizes two objectives first computes the reward expectations in the two extreme cases, namely the case where only objective 2 is considered and the case where only objective 1 is considered. A corresponding straight line is first generated from each weight and its reward expectation vector; the area enclosed by the intersection point of the two straight lines and the coordinate axes is the unreachable set, the area enclosed by the two reachable points and the upper limit of the reward expectation is the reachable set, and different weights correspond to different reachable point sets and maximum separating hyperplanes. The slope of the maximum separating hyperplane at the unreachable point of the reachable region furthest from the origin can be taken as a weight, reachable points are generated from the reward expectations, and a set of reachable points is formed. As shown in FIG. 6, the point of the unreachable set furthest from the origin is taken as q; by the theorem on the separation of a convex set by a hyperplane, a hyperplane must exist that separates the point q from the reachable set. Therefore, by calculating the maximum separating hyperplane, the corresponding weight value w is obtained. The maximum reward expectation is calculated from w and added to the reward expectation set Y, and the corresponding reachable set and unreachable set are generated. As can be seen from FIG. 6, the maximum values of the reachable set and the unreachable set approach each other; when the distance between the reachable set and the unreachable set is smaller than θ, the corresponding reachable point set is output, and the boundary points of the reachable point set form the pareto curve.
Example 1
As shown in fig. 7, multiple robots (robot M, robot N) complete multiple jobs (job T, job T+1, job T+2) in a random environment, and the process of scheduling these robots to complete the jobs generates corresponding energy consumption and time delay. Since a robot may malfunction in the middle of completing a job, different robots may complete different jobs at different times. Therefore, firstly, based on the time-constraint-oriented multi-objective random game template, the system in which multiple robots complete multiple jobs in a random environment is modeled by the time-constraint-oriented zero-sum random game method, the two game parties being the multiple robots and the random environment. The system mainly comprises a job model and a multi-machine scheduler model.

The job model has three states, namely an idle state, a waiting state and an executing state; each job is triggered by the random environment to enter the waiting state from the idle state, and if the scheduler determines the robot executing the job and distributes the job to the corresponding robot for execution, the job enters the executing state; if the robot fails midway and cannot complete the job, the job returns from the executing state to the waiting state to wait for the next available robot, and after the job is completed, the job returns from the executing state to the idle state. Each robot has three states, namely an idle state, a running state and a fault state. When the robot is in the idle state and the running state, it fails with probability (1 − p) and probability (1 − q), respectively. In the idle state, if a job waiting to be executed is allocated by the scheduler, the robot enters the running state. Each robot can only be allocated jobs within its execution range and returns to the idle state after the job is finished. If the robot recovers from the fault state and works normally, it returns to the idle state.

Secondly, the running traces ω of the model are simulated through UPPAAL-SMC, all states and actions in the random environment are explored, the target strategy is trained through the acquired data, the simulated running traces ω are learned off-line, and the state-action value function table Q(s, a) is established; Q(s, a) is defined as the value of taking action a in state s, wherein: s denotes a state tuple, in which the job component takes three different values, namely idle, waiting and executing, the robot component takes three different values, namely idle, running and fault, and the control attribution of the current state is recorded, namely the environment or the scheduler; a denotes an action tuple, comprising: scheduling a job to be executed to run on a robot; a job being triggered; a robot fault; a job becoming ready to be executed; a job being completed; and a robot recovering from a fault. The state-action value function table is initialized; when selecting the action corresponding to each state s, the action for s is selected by the ε-greedy method; after the selected action a is executed in state s, a new state s' and the corresponding reward r are obtained; and the value function Q(s, a) is updated by the method of cumulatively updating the average value. Finally, the weighted sum of the number of finished jobs, the consumed energy and the time for finishing the jobs is taken as the optimization target, and the multi-objective pareto curve is fitted according to the convex optimization hyperplane separation theorem, thereby generating the scheduling strategy by which the multiple robots complete the multiple jobs in the random environment.
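A toy sketch of the robot state machine of this embodiment, with failure probabilities (1 − p) in the idle state and (1 − q) in the running state; the transition back from the fault state and all probability values are illustrative assumptions:

```python
import random

P_IDLE_OK, P_RUN_OK = 0.95, 0.90   # p and q from the embodiment, chosen arbitrarily here

def step_robot(robot_state, has_assigned_job):
    """One step of a single robot's state machine: idle -> running on assignment,
    random faults from idle/running, recovery back to idle."""
    if robot_state == "fault":
        return "idle" if random.random() < 0.5 else "fault"       # recover with some probability
    if robot_state == "idle":
        if random.random() > P_IDLE_OK:
            return "fault"
        return "running" if has_assigned_job else "idle"
    # running
    if random.random() > P_RUN_OK:
        return "fault"                                            # job goes back to waiting
    return "idle"                                                 # job finished

state = "idle"
for t in range(5):
    state = step_robot(state, has_assigned_job=True)
    print(t, state)
```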
Example 2
This embodiment uses the same method as embodiment 1 and is directed to the task in which multiple robots cooperatively complete specimen collection and transportation. As shown in fig. 8, multiple robots (robot M, robot M+1, robot M+2, …) need to collect or process specimens at different task points (task points 1 to 6) and then transport the specimens to target points (target point 1 and target point 2). When one robot carries out a task at a certain task point, that task point is not open to other robots, and there is an order among the task points; for example, task point 4 is open only to a robot that has completed the task of task point 1, task point 5 is open only to a robot that has completed task point 1, task point 2 or task point 3, and task point 6 is open only to a robot that has completed task point 3. The uncertainty of the whole system comprises the uncertainty of the task time of different robots at different task points and the uncertainty of the moving time of the robots between different task points. In the process of executing tasks and moving, the robots need to avoid static obstacles and dynamic obstacles, ensure that the total power consumption is minimum under the condition that the electric power used by each robot is different, and finally reach the target place.

The robot has three states for executing the task at a task point: when the robot reaches the task point, waiting is triggered first; if another robot is already executing the task at the task point, the robot waits for that task to be completed, and when the other robot completes the task, the robot starts executing its own task. If the task fails midway, the robot returns to the waiting state and continues to wait for execution. After the robot completes the task at the task point, it searches for the next task point at which to complete a task.

In order to establish a multi-objective optimized task scheduling strategy, namely a strategy that completes the collection, processing and transportation of all specimens in a short time and with low energy consumption, firstly, a time-constraint-oriented general-sum random game model is established based on the time-constraint-oriented multi-objective random game template, the participants of the game being the multiple robots. Secondly, the running traces ω of the model are simulated through UPPAAL-SMC, all states and actions of the multiple robots in the random environment are explored, simulation data are then collected to train the multi-objective optimization strategy, the simulated running traces ω are learned off-line, and the state-action value function table Q_i(s, a) is established; Q_i(s, a) is defined as the value obtained when robot i of a given type takes action a in state s, wherein: s denotes a state tuple, in which the task component takes three different values, namely idle, waiting and executing, and the task point or target point at which the current robot i is located is also recorded; a denotes an action tuple, wherein the move action indicates that the robot moves between task points or between a task point and a target point, and this process can adopt an existing path-search algorithm to find the shortest path between the task points; the wait action indicates that the current robot is ready to perform the task at the task point, and if the task point is occupied by another robot, the robot waits until the task point is idle; the execute action indicates that the robot has entered the task-execution state from waiting; the fault action indicates that the current task of the robot fails due to internal or external factors and needs to be re-executed; the finish action indicates that the robot finishes the current task and can move to the next task point to execute a task. All robots select the action corresponding to each state s by the ε-greedy method; after the selected action a is executed in state s, a new state s' and the corresponding reward r are obtained; and the value function Q_i(s, a) is updated by the Nash equilibrium function. Finally, the weighted sum of the task execution time, the total energy consumption and the task completion degree is taken as the optimization target, and the multi-objective pareto curve is fitted according to the convex optimization hyperplane separation theorem, thereby generating the multi-objective optimization strategy by which the multiple robots cooperatively collect and transport specimens in the random environment.
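The opening rules among task points in this embodiment (task point 4 after task point 1; task point 5 after task point 1, 2 or 3; task point 6 after task point 3) can be encoded as a simple precedence check, for example:

```python
# Which completed task points open each later task point (from the embodiment text).
OPENS_AFTER = {4: {1}, 5: {1, 2, 3}, 6: {3}}

def task_point_open(task_point, completed_by_robot):
    """A task point with no listed prerequisite is always open; otherwise it is
    open once the robot has completed at least one of the listed task points."""
    prereq = OPENS_AFTER.get(task_point)
    return prereq is None or bool(prereq & completed_by_robot)

print(task_point_open(4, {1}), task_point_open(5, {2}), task_point_open(6, {1, 2}))  # True True False
```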
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A scheduling method of multi-agent facing time constraint is characterized in that the method comprises the following steps:
s1, establishing a scheduling center, which specifically comprises the following steps:
S11, establishing, based on a time-constraint-oriented multi-objective random game template, a random game model between the multi-agents and the random environment or among the multi-agents;
S12, simulating the running traces of the random game model by statistical model checking, and designing a value function learning method that is not based on the model to calculate the maximum reward expectation of the multi-agents taking different actions in the various states;
S13, iterating the algorithm according to the convergence conditions of the zero-sum random game between the multi-agents and the random environment and of the general-sum random game among the multi-agents;
s14, fitting a multi-target pareto curve based on weight combination according to a convex optimization hyperplane separation theorem;
s2, the dispatching center collects real-time data of states and actions of the multi-agent and random environments;
and S3, the dispatching center processes the acquired data and sends the action instruction to the multi-agent.
2. The scheduling method of multi-agent oriented to time constraints as claimed in claim 1, wherein the step S11 is specifically as follows:
S111, the time-constraint-oriented multi-objective random game template is a ten-tuple (Π, S, s0, A, X, C, Inv, en, δ, r), wherein:
Π represents the finite set of participants, namely the multi-agents and the random environment taking part in the random game;
S represents the finite set of states of the multi-agents and the random environment;
s0 represents the initial state of the multi-agents and the random environment, s0 ∈ S; S_i represents the finite set of states of a certain agent or of the random environment i, with S_i ⊆ S and the union of all S_i equal to S;
A represents the finite set of actions of the multi-agents;
X represents the finite set of all clocks;
C represents the set of clock constraints;
Inv indicates the invariance condition, a clock constraint, attached to each multi-agent state in S;
en indicates the clock constraint that must hold when action a is taken in multi-agent state s;
δ represents the state transition function by which the multi-agents move from state s through action a to state s', δ: S × A → Dist(S), where Dist(S) denotes a probability distribution over S;
r represents the reward function corresponding to the states and actions of the multi-agents, taking values in the real numbers ℝ;
S112, establishing the time-constraint-oriented multi-objective random game model between the multi-agents and the random environment or among the multi-agents, and taking σ as the strategy with which the multi-agents select actions from the action set A along a path ω; the reward expectation of a strategy is as follows:

E^σ[r] = Σ_ω P^σ(ω) · Σ_{(s,a)∈ω} ( r_S(s) + r_A(a) )

in the formula: r_S(s) denotes the reward function corresponding to multi-agent state s; r_A(a) denotes the reward function corresponding to multi-agent action a; E^σ[r] denotes the expected reward of the multi-agents; σ denotes a strategy; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ.
3. The time-constraint-oriented multi-agent scheduling method as claimed in claim 2, characterized in that: a clock constraint c in the set of clock constraints C of step S111 is defined by the following grammar:

c ::= x ∼ k | c1 ∧ c2, with ∼ ∈ {<, ≤, =, ≥, >}

in the formula: x is one of the clocks in X, and k is a constant.
4. the scheduling method of multi-agent oriented to time constraints as claimed in claim 1, wherein the step S12 is specifically as follows:
s121, acquiring initial data of all states and actions of the multi-agent in a random environment;
S122, establishing the time-constraint-oriented random game model based on the acquired data, simulating the running traces of the random game model through UPPAAL-SMC, and exploring all states and actions of the multi-agents in the random environment and training the target strategy;
S123, establishing the state-action value function table Q(s, a) of the multi-agents by off-line learning from the simulated running traces; the value function table Q(s, a) is defined as the value of taking action a in state s, wherein: s denotes a state tuple, a denotes an action tuple, and the state tuple records the different classifications of the state and the game participant to which the current state belongs.
5. The scheduling method of multi-agent oriented to time constraints as claimed in claim 4, wherein the step S13 is specifically as follows:
S131, for the two-player zero-sum random game, a state-action value function table Q(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agent or the random environment selects the action for s by the ε-greedy method, and the value function is finally updated by the method of cumulatively updating the average value, according to the formula:

Q(s, a) ← Q(s, a) + (1/k) · ( G − Q(s, a) )

in the formula: k denotes the approximate number of cumulative computations, which may be regarded as a step size, k ≥ 1; G denotes the estimated return, i.e. the sum of the future rewards with attenuation;
S132, for the multi-player general-sum random game, a state-action value function table Q_i(s, a) is first initialized; when selecting the action corresponding to each state s, the multi-agents select the action for s by the ε-greedy method, and the value function is finally updated by the Nash equilibrium function, according to the formula:

Q_i(s, a) ← Q_i(s, a) + (1/k) · ( r_i + γ · Nash_i(s') − Q_i(s, a) )

in the formula: k denotes the approximate number of cumulative computations, k ≥ 1; n denotes the number of multi-agents, i ∈ {1, …, n}; γ denotes the attenuation value; r_i denotes the reward currently earned by the multi-agent; s' denotes the new state obtained after the selected action a is executed in state s; Nash_i(s') denotes the long-term average return calculated from state s' when the multi-agents adopt the joint strategy (π_1, …, π_n).
6. The time-constraint-oriented multi-agent scheduling method as claimed in claim 5, characterized in that: the Nash equilibrium function Nash_i(s') of an agent i in the general-sum random game of step S132 satisfies the Nash equilibrium condition over the agents' strategy sets: under the joint strategy (π_1, …, π_n), no agent i can increase its long-term return by unilaterally changing its own strategy within its strategy set, in the formula: Θ_i denotes the strategy set of a certain agent i; n denotes the number of multi-agents.
7. The scheduling method of multi-agent oriented to time constraints as claimed in claim 1, wherein the step S14 is specifically as follows:
S141, taking the weighted sum of the multi-objective rewards as the optimization target, the weighted sum of the multi-objective optimization is calculated according to the formula:

E^σ[w · r] = Σ_ω P^σ(ω) · ( w · r(ω) )

in the formula: w denotes the weight vector; r denotes the reward vector; σ denotes the strategy; E^σ[w · r] denotes the expected reward function to which the weight combination is added; w · r denotes the target reward weighted sum under strategy σ; P^σ denotes the probability distribution induced by the multi-agents selecting strategy σ;
and S142, fitting the multi-target pareto curves with different weight combinations according to the convex optimization hyperplane separation theorem.
CN202110810946.4A 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint Active CN113269297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110810946.4A CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110810946.4A CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Publications (2)

Publication Number Publication Date
CN113269297A 2021-08-17
CN113269297B CN113269297B (en) 2021-11-05

Family

ID=77236924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110810946.4A Active CN113269297B (en) 2021-07-19 2021-07-19 Multi-agent scheduling method facing time constraint

Country Status (1)

Country Link
CN (1) CN113269297B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563527A (en) * 2022-09-27 2023-01-03 西南交通大学 Multi-Agent deep reinforcement learning framework and method based on state classification and assignment
CN115576278A (en) * 2022-09-30 2023-01-06 常州大学 Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107045655A (en) * 2016-12-07 2017-08-15 三峡大学 Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan
CN110471297A (en) * 2019-07-30 2019-11-19 清华大学 Multiple agent cooperative control method, system and equipment
CN110728406A (en) * 2019-10-15 2020-01-24 南京邮电大学 Multi-agent power generation optimization scheduling method based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning
CN112132263A (en) * 2020-09-11 2020-12-25 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045655A (en) * 2016-12-07 2017-08-15 三峡大学 Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN110471297A (en) * 2019-07-30 2019-11-19 清华大学 Multiple agent cooperative control method, system and equipment
CN110728406A (en) * 2019-10-15 2020-01-24 南京邮电大学 Multi-agent power generation optimization scheduling method based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning
CN112132263A (en) * 2020-09-11 2020-12-25 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIFU DING et al.: "Multi-agent Deep Reinforcement Learning Algorithm for Distributed Economic Dispatch in Smart Grid", IECON 2020, The 46th Annual Conference of the IEEE Industrial Electronics Society *
LI Fangyuan (李方圆): "Distributed Scheduling and Optimization of Smart Grid Based on Multi-Agent Cooperative Algorithms", China Doctoral Dissertations Full-text Database, Engineering Science and Technology II *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563527A (en) * 2022-09-27 2023-01-03 西南交通大学 Multi-Agent deep reinforcement learning framework and method based on state classification and assignment
CN115563527B (en) * 2022-09-27 2023-06-16 西南交通大学 Multi-Agent deep reinforcement learning system and method based on state classification and assignment
CN115576278A (en) * 2022-09-30 2023-01-06 常州大学 Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis
CN115576278B (en) * 2022-09-30 2023-08-04 常州大学 Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis
WO2024066675A1 (en) * 2022-09-30 2024-04-04 常州大学 Multi-agent multi-task hierarchical continuous control method based on temporal equilibrium analysis

Also Published As

Publication number Publication date
CN113269297B (en) 2021-11-05

Similar Documents

Publication Publication Date Title
Cao et al. Scheduling semiconductor testing facility by using cuckoo search algorithm with reinforcement learning and surrogate modeling
Choong et al. Automatic design of hyper-heuristic based on reinforcement learning
Zhao et al. A heuristic distributed task allocation method for multivehicle multitask problems and its application to search and rescue scenario
CN113269297B (en) Multi-agent scheduling method facing time constraint
Khan et al. Learning safe unlabeled multi-robot planning with motion constraints
Kannan et al. The autonomous recharging problem: Formulation and a market-based solution
Schillinger et al. Auctioning over probabilistic options for temporal logic-based multi-robot cooperation under uncertainty
Könighofer et al. Online shielding for stochastic systems
Chen et al. A bi-criteria nonlinear fluctuation smoothing rule incorporating the SOM–FBPN remaining cycle time estimator for scheduling a wafer fab—a simulation study
Sun et al. An intelligent controller for manufacturing cells
Liu et al. Multi-agent reinforcement learning-based coordinated dynamic task allocation for heterogenous UAVs
Zaidi et al. Task allocation based on shared resource constraint for multi-robot systems in manufacturing industry
CN114819316A (en) Complex optimization method for multi-agent task planning
Shriyam et al. Task assignment and scheduling for mobile robot teams
Gaggero et al. When time matters: Predictive mission planning in cyber-physical scenarios
Yang et al. Learning graph-enhanced commander-executor for multi-agent navigation
Bøgh et al. Distributed fleet management in noisy environments via model-predictive control
Bahgat et al. A multi-level architecture for solving the multi-robot task allocation problem using a market-based approach
Oliver et al. Auction and swarm multi-robot task allocation algorithms in real time scenarios
Shi et al. Efficient hierarchical policy network with fuzzy rules
Hong et al. Deterministic policy gradient based formation control for multi-agent systems
Jungbluth et al. Reinforcement Learning-based Scheduling of a Job-Shop Process with Distributedly Controlled Robotic Manipulators for Transport Operations
Chandana et al. RANFIS: Rough adaptive neuro-fuzzy inference system
Kim et al. Safety-aware unsupervised skill discovery
Tran et al. Real-time verification for distributed cyber-physical systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant