CN113269297A - Multi-agent scheduling method facing time constraint - Google Patents
- Publication number
- CN113269297A (application CN202110810946.4A)
- Authority
- CN
- China
- Prior art keywords
- agent
- random
- representing
- state
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
- G06N20/00—Machine learning
Abstract
The invention relates to a time-constraint-oriented multi-agent scheduling method, which comprises the following steps: establishing a dispatching center; the dispatching center collecting real-time data on the states and actions of the multi-agents and the random environment; and the dispatching center processing the collected data and sending action instructions to the multi-agents. By introducing time constraints into the stochastic game model, the real-time, nondeterministic, and probabilistic behaviors exhibited among the multi-agents, or during the interaction between the multi-agents and the random environment, can be described; time-related reward functions can be quantified, and a multi-objective optimization strategy is determined through the reward function. The designed algorithms improve the efficiency of calculating the model's maximum reward expectation and of fitting the weight-combination-based pareto curve, thereby improving the reaction speed of the multi-agents. By giving different weights to multiple objectives, the priorities of the objectives are distinguished, improving the operational reliability of the multi-agents.
Description
Technical Field
The invention relates to the technical field of multi-agent interaction, and in particular to a time-constraint-oriented multi-agent scheduling method.
Background
As interactions among multi-agents (robots, robot dogs, drones, etc.) become increasingly close, the errors that occur during interaction also keep increasing with the size and complexity of multi-agent systems. How to design a multi-agent scheduling system that satisfies multi-objective design requirements under an uncertain environment and the corresponding time constraints has become a key scientific problem in urgent need of a solution.
At present, research on multi-agent scheduling systems mainly verifies the quantitative properties of the model and the properties related to the reward function through model checking, and approaches the pareto optimum of the model through value iteration. However, the following problems remain unsolved in multi-objective optimization for time-constraint-oriented multi-agent scheduling:
(1) model checking requires an exhaustive search of the state space of the multi-agents and the random environment, and the number of model states grows exponentially with the number of concurrent components, causing the state-space explosion problem;
(2) in a time-constraint-oriented stochastic game model, the reward function may depend on time points; since the running time is uncertain, the reward function is variable, so model-based value iteration and policy iteration algorithms are unsuitable for this scenario;
(3) descriptions of differences in objective priority are lacking when combining multiple objective strategies for multi-agents, and research on trading off multi-objective optimization strategies based on weight combinations is lacking.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and to provide a time-constraint-oriented multi-agent scheduling method that is conceptually advanced, highly reliable, and fast.
The technical scheme realizing the purpose of the invention is as follows: a time-constraint-oriented multi-agent scheduling method comprising the following steps:
S1, establishing a dispatching center, which specifically comprises the following steps:
S11, establishing, based on a time-constraint-oriented multi-objective stochastic game template, a stochastic game model between the multi-agents and a random environment or among the multi-agents;
S12, simulating the running trajectories of the stochastic game model through statistical model checking, and designing a model-free value function learning method to calculate the maximum reward expectation of the multi-agents taking different actions in various states;
S13, iterating the algorithm according to the convergence conditions of the zero-sum stochastic game between the multi-agents and the random environment and of the general-sum stochastic game among the multi-agents;
S14, fitting a weight-combination-based multi-objective pareto curve according to the convex-optimization hyperplane separation theorem;
S2, the dispatching center collects real-time data on the states and actions of the multi-agents and the random environment;
S3, the dispatching center processes the collected data and sends action instructions to the multi-agents.
Further, step S11 is specifically:
S111, defining the elements of the stochastic game:
P represents the finite set of participants in the stochastic game, namely the multi-agents and the random environment;
δ: S × A → Dist(S) represents the state transition function by which the multi-agents move from a state s through an action a to a state s′, where Dist(S) represents a probability distribution over the state set S;
R: S ∪ A → ℝ represents the reward function over the states and actions of the multi-agents, where ℝ represents the real numbers;
S112, establishing the time-constraint-oriented multi-objective stochastic game model between the multi-agents and the random environment or among the multi-agents, and adopting rew(ω) as the cumulative reward of the multi-agents along a path ω; for the action set A and a selection strategy π, the reward expectation of the strategy is as follows:
E^π(rew) = Σ_ω Pr^π(ω) · rew(ω), with rew(ω) = Σ_i (R(s_i) + R(a_i))
where R(s) indicates the reward function corresponding to a multi-agent state s; R(a) represents the reward function corresponding to a multi-agent action a; E^π represents the expected reward function of the multi-agents; π represents a policy; and Pr^π represents the probability distribution over paths induced by the multi-agents selecting strategy π.
Further, a clock constraint φ in the clock constraint set C of step S111 is defined by the following formula:
φ ::= x ⋈ c
where x is one of the clocks in C, c is a constant, and ⋈ ∈ {<, ≤, =, ≥, >}.
Further, step S12 is specifically:
S121, acquiring initial data of all states and actions of the multi-agents in the random environment;
S122, establishing the time-constraint-oriented stochastic game model based on the acquired data, simulating the running trajectories of the stochastic game model through UPPAAL-SMC, exploring all states and actions of the multi-agents in the random environment, and training a target strategy;
S123, establishing the state-action value function table Q(s, a) of the multi-agents by learning the simulated running trajectories offline, the value function table Q(s, a) being defined as the value of taking an action a in a state s, where S represents the set of state tuples, A represents the set of action tuples, and a classification recorded in each state indicates the game participant to which the current state belongs.
Further, step S13 is specifically:
S131, for the two-player zero-sum stochastic game, first initializing the state-action value function table Q(s, a); when selecting the action corresponding to each state s, the multi-agent or the random environment selects the corresponding action by the ε-greedy method, and the value function is finally updated by cumulatively updating the average value, with the following formula:
Q(s, a) ← Q(s, a) + (1/n) · [G − Q(s, a)]
where n represents the number of approximate cumulative computations, which may be regarded as a step size, and G is the newly observed return;
S132, for the multi-player general-sum stochastic game, first initializing the state-action value function tables Q_i(s, a); when selecting the action corresponding to each state s, the multi-agents select corresponding actions by the ε-greedy method, and the value function is finally updated by the Nash equilibrium function, with the following formula:
Q_i(s, a) ← (1 − 1/n) · Q_i(s, a) + (1/n) · [r_i + γ · Nash_i(s′)]
where n indicates the number of approximate cumulative calculations; i represents the index of the multi-agent; γ represents the attenuation value; r_i represents the reward currently earned by the multi-agent; s′ indicates the new state obtained after executing the selected action a in state s; and Nash_i(s′) represents the long-term average return calculated when the multi-agents adopt the joint strategy starting from s′.
Further, the Nash equilibrium function Nash_i(s′) of an agent i in the general-sum stochastic game of step S132 satisfies the following formula:
v_i(s′, π_1*, …, π_k*) ≥ v_i(s′, π_i, π_{−i}*) for every π_i ∈ Π_i
where Π_i is the policy set of agent i; that is, no agent can increase its long-term average return by unilaterally deviating from the joint equilibrium strategy (π_1*, …, π_k*).
Further, step S14 is specifically:
S141, taking the weighted sum of the multi-objective rewards as the optimization target and calculating the weighted sum of the multi-objective optimization, with the following formula:
E_w^π(rew) = w · x = Σ_j w_j x_j
where w represents the weight vector, x represents the reward vector, π represents the policy, E_w^π represents the expected reward function after the weight combination is added, w · x is the weighted sum of the objective rewards under policy π, and Pr^π represents the probability distribution induced by the multi-agents selecting strategy π;
S142, fitting the multi-objective pareto curves under different weight combinations according to the convex-optimization hyperplane separation theorem.
After the above technical scheme is adopted, the invention has the following positive effects:
(1) The invention introduces time constraints into the stochastic game model; on the one hand, the real-time, nondeterministic, and probabilistic behaviors exhibited among the multi-agents or during the interaction between the multi-agents and the random environment can be described, and on the other hand, time-related reward functions can be quantified and a multi-objective optimization strategy determined through the reward function.
(2) The invention designs an offline algorithm that calculates the reward expectation from Monte Carlo simulation trajectories, avoiding the state-space explosion problem that arises when computing the maximum reward expectation, and reduces the number of algorithm iterations according to the convergence conditions of the zero-sum and general-sum stochastic games, thereby reducing system energy consumption and improving the reaction speed of the multi-agents.
(3) The invention gives different weights to multiple objectives and distinguishes the priorities of the objectives, thereby improving the operational reliability of the multi-agents.
Drawings
In order that the present disclosure may be more readily and clearly understood, the following detailed description of the present disclosure is provided in connection with specific embodiments thereof and with the accompanying drawings, in which:
FIG. 1 is a block diagram of a dispatch center according to the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 shows the value-function table generation method for the two-player zero-sum stochastic game of the present invention;
FIG. 4 shows the value-function table generation method for the multi-player general-sum stochastic game of the present invention;
FIG. 5 illustrates a pareto curve generation method according to the present invention;
FIG. 6 is a graph of a pareto curve fit based on weight combinations according to the present invention;
FIG. 7 is a schematic diagram of the dynamic game model between multiple robots and the random environment in Embodiment 1;
FIG. 8 is a schematic diagram of the dynamic game model among multiple robots in Embodiment 2.
Detailed Description
As shown in FIGS. 1-5, a time-constraint-oriented multi-agent scheduling method includes the following steps:
S1, establishing a dispatching center, which specifically comprises the following steps:
S11, establishing, based on a time-constraint-oriented multi-objective stochastic game template, a stochastic game model between the multi-agents and a random environment or among the multi-agents, specifically as follows:
P represents the finite set of participants in the stochastic game, namely the multi-agents and the random environment;
C represents the set of clock constraints; a clock constraint φ is defined by the formula φ ::= x ⋈ c, where x is one of the clocks, c is a constant, and ⋈ ∈ {<, ≤, =, ≥, >}. For example, if a state requires a delay of t, the invariant corresponding to that state carries the time constraint x ≥ t, and if a state is subject to a deadline d, the corresponding invariant carries the constraint x ≤ d. Meanwhile, φ may also be a conjunction of different time constraints, e.g. φ1 ∧ φ2, and the logical negation ¬φ is also accepted, as in the sketch below.
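To make the constraint grammar concrete, the following minimal Python sketch (not part of the patent; the names `atom`, `conj`, and `neg` are illustrative) evaluates atomic constraints x ⋈ c, conjunctions, and negations against a clock valuation:

```python
import operator

# A sketch only: maps the comparison symbols of the grammar phi ::= x ~ c
# onto Python comparisons; "atom", "conj", and "neg" are illustrative names.
OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq,
       ">=": operator.ge, ">": operator.gt}

def atom(clock, op, const):
    """Atomic constraint x ~ c, evaluated against a clock-valuation dict."""
    return lambda v: OPS[op](v[clock], const)

def conj(*phis):
    """Conjunction phi1 AND phi2 AND ... of constraints."""
    return lambda v: all(phi(v) for phi in phis)

def neg(phi):
    """Logical negation of a constraint."""
    return lambda v: not phi(v)

# Example: a state requiring a delay of 2 time units with a deadline of 5
# carries the invariant x >= 2 AND x <= 5.
invariant = conj(atom("x", ">=", 2), atom("x", "<=", 5))
print(invariant({"x": 3.0}))        # True
print(neg(invariant)({"x": 6.0}))   # True: the deadline has passed
```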
δ: S × A → Dist(S) represents the state transition function by which the multi-agents move from a state s through an action a to a state s′, where Dist(S) represents a probability distribution over the state set S;
R: S ∪ A → ℝ represents the reward function over the states and actions of the multi-agents, where ℝ represents the real numbers;
S112, establishing the time-constraint-oriented multi-objective stochastic game model between the multi-agents and the random environment or among the multi-agents, and adopting rew(ω) as the cumulative reward of the multi-agents along a path ω; for the action set A and a selection strategy π, the reward expectation of the strategy is as follows:
E^π(rew) = Σ_ω Pr^π(ω) · rew(ω), with rew(ω) = Σ_i (R(s_i) + R(a_i))
where R(s) indicates the reward function corresponding to a multi-agent state s; R(a) represents the reward function corresponding to a multi-agent action a; E^π represents the expected reward function of the multi-agents; π represents a policy; and Pr^π represents the probability distribution over paths induced by the multi-agents selecting strategy π.
S12, simulating the running trajectories of the stochastic game model through statistical model checking, and designing a model-free value function learning method to calculate the maximum reward expectation of the multi-agents taking different actions in various states, specifically as follows:
S121, acquiring initial data of all states and actions of the multi-agents in the random environment;
S122, establishing the time-constraint-oriented stochastic game model based on the acquired data, simulating the running trajectories ω of the stochastic game model through UPPAAL-SMC (a statistical model checking tool), exploring all states and actions of the multi-agents in the random environment, and training a target strategy;
S123, establishing the state-action value function table Q(s, a) of the multi-agents by learning the simulated running trajectories offline, the value function table Q(s, a) being defined as the value of taking an action a in a state s, where S represents the set of state tuples, A represents the set of action tuples, and a classification recorded in each state indicates the game participant to which the current state belongs; a sketch of building such a table from the simulated traces is given below.
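The following minimal sketch illustrates one way such a table could be initialized, assuming the UPPAAL-SMC trajectories have already been parsed into (player, state, action, reward, next state) tuples; all names are illustrative:

```python
from collections import defaultdict

def build_q_table(trajectories):
    """Initialise Q(s, a) = 0 for every state-action pair seen in the traces
    and record which game participant controls each state."""
    q = defaultdict(float)   # (state, action) -> estimated value, default 0.0
    owner = {}               # state -> controlling game participant
    for trace in trajectories:
        for player, state, action, reward, next_state in trace:
            _ = q[(state, action)]   # touch the entry so it exists with 0.0
            owner[state] = player
    return q, owner
```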
S13, iterating the algorithm according to the convergence conditions of the zero-sum stochastic game between the multi-agents and the random environment and the general-sum stochastic game among the multi-agents, specifically as follows:
S131, for the two-player zero-sum stochastic game, first initialize the state-action value function table Q(s, a). When selecting the action corresponding to each state s, the multi-agent or the random environment selects the corresponding action by the ε-greedy method: with probability 1 − ε it selects, from the action set of s, the action that maximizes the value function table, and with probability ε it selects an action at random. Executing the selected action a in state s yields a new state s′ and a corresponding reward r. Suppose the game participants are P1 and P2, with state sets S1 and S2 respectively, and the model aims to maximize the gain of participant P1. If the next state belongs to S1, the reward is to be maximized, as shown in formula (1); if the next state belongs to S2, the reward is to be minimized, as shown in formula (2):
G = r + γ · max_{a′} Q(s′, a′)   (1)
G = r + γ · min_{a′} Q(s′, a′)   (2)
where r indicates the currently obtained reward, max_{a′} Q(s′, a′) indicates maximizing the gain of the next step, min_{a′} Q(s′, a′) indicates minimizing the gain of the next step, and γ represents the attenuation value.
Finally, the value function is updated by cumulatively updating the average value, with the following formula:
Q(s, a) ← Q(s, a) + (1/n) · [G − Q(s, a)]
where n represents the number of approximate cumulative computations, which may be regarded as a step size; a sketch of this selection-and-update loop is given below.
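The following sketch shows the S131 loop under stated assumptions: `actions(s)` lists the actions available in s, `owner(s)` says which player controls s, and the environment step is simulated elsewhere; the function names are illustrative, and terminal-state handling is omitted for brevity:

```python
import random
from collections import defaultdict

def epsilon_greedy(q, s, acts, eps=0.1):
    if random.random() < eps:
        return random.choice(acts)               # explore with probability eps
    return max(acts, key=lambda a: q[(s, a)])    # otherwise exploit

def zero_sum_update(q, n, s, a, r, s_next, actions, owner, gamma=0.95):
    vals = [q[(s_next, b)] for b in actions(s_next)]
    if owner(s_next) == "max":
        target = r + gamma * max(vals)           # formula (1): maximize next step
    else:
        target = r + gamma * min(vals)           # formula (2): minimize next step
    n[(s, a)] += 1
    q[(s, a)] += (target - q[(s, a)]) / n[(s, a)]  # cumulative-average update

q, n = defaultdict(float), defaultdict(int)      # Q(s, a) and visit counts
```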
S132, for the multi-player general-sum stochastic game, first initialize the state-action value function tables Q_i(s, a); that is, for a given state, different agents take different actions, and each agent generates an optimal strategy by observing the actions of the other agents and the corresponding reward values. When selecting the action corresponding to each state s, the different agents select corresponding actions by the ε-greedy method; executing the selected action a in state s yields a new state s′ and corresponding rewards. Finally, the value function is updated by the Nash equilibrium function, with the following formula:
Q_i(s, a) ← (1 − 1/n) · Q_i(s, a) + (1/n) · [r_i + γ · Nash_i(s′)]
where n indicates the number of approximate cumulative calculations; i represents the index of the multi-agent; γ represents the attenuation value; r_i represents the reward currently earned by the multi-agent; s′ indicates the new state obtained after executing the selected action a in state s; and Nash_i(s′) indicates the long-term average return calculated when the multi-agents adopt the joint strategy starting from state s′;
wherein Nash_i(s′), the long-term average return calculated when the multi-agents adopt the joint Nash equilibrium strategy (π_1*, …, π_k*) starting from s′, satisfies the following formula, with Π_i as the policy set of agent i:
v_i(s′, π_1*, …, π_k*) ≥ v_i(s′, π_i, π_{−i}*) for every π_i ∈ Π_i,
i.e., no agent can increase its long-term average return by unilaterally deviating; a sketch of this update is given below.
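The update itself is compact once an equilibrium solver is available. The sketch below assumes a helper `nash_value(s, i)` that returns Nash_i(s), agent i's long-run equilibrium return from state s, as computed by an external stage-game equilibrium solver; all names are assumptions for illustration:

```python
def nash_q_update(q, n, i, s, joint_action, r_i, s_next, nash_value, gamma=0.95):
    """Update agent i's value for (s, joint_action) toward r_i + gamma * Nash_i(s')."""
    key = (i, s, joint_action)
    n[key] += 1
    alpha = 1.0 / n[key]                        # step size 1/n
    target = r_i + gamma * nash_value(s_next, i)  # Nash_i(s'): equilibrium return
    q[key] = (1 - alpha) * q[key] + alpha * target
```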
S14, fitting the multi-objective pareto curves under different weight combinations according to the convex-optimization hyperplane separation theorem, specifically as follows:
S141, taking the weighted sum of the multi-objective rewards as the optimization target and calculating the weighted sum of the multi-objective optimization, with the following formula:
E_w^π(rew) = w · x = Σ_j w_j x_j
where w represents the weight vector, x represents the reward vector, π represents the policy, E_w^π represents the expected reward function after the weight combination is added, w · x is the weighted sum of the objective rewards under policy π, and Pr^π represents the probability distribution induced by the multi-agents selecting strategy π;
S142, if the goal is to calculate the maximum reward expectation, the feasible domain of the reward expectations is F = {y | ∃x ∈ X, y ≤ x}, where x denotes a reward expectation, X denotes the set of reward expectations obtained so far, and F denotes the feasible domain, i.e., there exists an x such that every value y in the feasible domain is no greater than x; if the goal is to calculate the minimum reward expectation (e.g., a minimized energy-consumption scenario), the feasible domain of the reward expectations is F = {y | ∃x ∈ X, y ≥ x};
S143, if the goal is to calculate the maximum reward expectation, the infeasible domain of the reward expectations is U = {y | ∃w ∈ W, w · y > w · x for all x ∈ X}, where w represents a weight vector, W represents the set of weight vectors, y represents a reward expectation vector, X represents the set of reward expectation vectors, w · y and w · x are vector inner products, and U represents the infeasible domain, i.e., U contains the reward expectations that for some weight combination exceed every achieved expectation; if the goal is to calculate the minimum reward expectation, the infeasible domain is U = {y | ∃w ∈ W, w · y < w · x for all x ∈ X};
S144, calculating the weight vector w with the maximum variance and calculating the distance between the feasible and infeasible domains;
S145, if the distance is larger than the set threshold ε, calculating the maximum reward expectation under w, so that the maximum separation hyperplane constructed from w separates the regions of the two sets as far as possible, i.e., the coverage of the reachable set corresponding to the weights is enlarged; based on the convergence of the distance function, the newly generated reward expectation is added to the reward expectation set X, and the iteration continues until the distance between the feasible and infeasible domains is less than ε.
S2, the dispatching center collects real-time data on the states and actions of the multi-agents and the random environment;
S3, the dispatching center processes the collected data and sends action instructions to the multi-agents.
Fitting a pareto curve based on multi-objective reward weights for different weight vector combinations requires traversing the weight vector combinations and calculating the corresponding reward expectations many times. With one reward per objective, exhaustively searching the weights and the corresponding optimal reward expectations requires computing the reward expectation of every set of weight vector combinations, at prohibitive time complexity. Because the reachable point sets generated by different weight combinations intersect, the invention selects weight combinations by letting the unreachable point set of reward expectations approach the reachable point set, enlarging the coverage of the reachable point set corresponding to the weights as much as possible, thereby reducing the number of reward-expectation computations and improving the efficiency of the whole algorithm. As shown in FIG. 6, to fit a weight-combination-based pareto curve that minimizes two objectives, the reward expectations are first computed in the two extreme cases, i.e., considering only objective 2 and considering only objective 1. A corresponding straight line is generated from each extreme weight; the area enclosed by the intersection point of the two straight lines and the coordinate axes is the unreachable set, while the area enclosed by the two reachable points and the upper limit of the reward expectation is the reachable set; different weights correspond to different reachable point sets and maximum separation hyperplanes. The slope of the maximum separation hyperplane at the unreachable point farthest from the origin can be taken as a weight; reachable points are generated from the reward expectations and form a set of reachable points. Taking the point of the unreachable set farthest from the origin, it follows from the hyperplane separation theorem for convex sets that a hyperplane must exist separating that point from the reachable set; by computing the maximum separation hyperplane, the corresponding weight value is obtained. The maximum reward expectation is then calculated under this weight and added to the set, generating the corresponding reachable and unreachable sets. As seen in FIG. 6, the maxima of the reachable and unreachable sets approach each other; when the distance between the two sets is smaller than the threshold, the corresponding reachable point set is output, and the boundary points of the reachable point set form the pareto curve. A sketch of this procedure for two objectives is given below.
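A minimal sketch of this procedure for two minimized objectives, assuming a helper `min_expected_reward(w)` (an assumption, not named in the patent) that returns the reward-expectation vector of the policy optimizing the weighted sum w · x, e.g. from the learned value-function tables:

```python
import numpy as np

def fit_pareto(min_expected_reward, eps=1e-3, max_iter=50):
    # Extreme cases: consider only objective 1, then only objective 2.
    points = {tuple(min_expected_reward(np.array(w)))
              for w in ([1.0, 0.0], [0.0, 1.0])}
    for _ in range(max_iter):
        pts = sorted(points)                 # frontier candidates, by objective 1
        if len(pts) < 2:
            break
        # Widest gap between neighbouring reachable points approximates the
        # distance between the reachable and unreachable sets.
        dist, p, q = max((np.hypot(b[0] - a[0], b[1] - a[1]), a, b)
                         for a, b in zip(pts, pts[1:]))
        if dist < eps:                       # S145 stopping criterion
            break
        # Normal of the line through p and q gives the next weight vector,
        # i.e. the maximum-separation hyperplane for that gap.
        w = np.abs(np.array([q[1] - p[1], p[0] - q[0]]))
        w /= w.sum()
        points.add(tuple(min_expected_reward(w)))
    return sorted(points)                    # boundary points: the pareto curve
```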
Example 1
As shown in FIG. 7, multiple robots (robot M, robot N) complete multiple jobs (job T, job T+1, job T+2) in a random environment, and scheduling these robots to complete the jobs generates corresponding energy consumption and time delay. Since a robot may fail in the middle of a job, different robots may complete different jobs at different times. Therefore, the system in which multiple robots complete multiple jobs in a random environment is first modeled, based on the time-constraint-oriented multi-objective stochastic game template, as a time-constrained zero-sum stochastic game, the two game parties being the multiple robots and the random environment. The system mainly comprises a job model and a multi-machine scheduler model. The job model has three states: idle, waiting, and executing. Each job is triggered by the random environment to enter the waiting state from the idle state; once the scheduler determines the robot to execute the job and dispatches the job to that robot, the job enters the executing state. If the robot fails midway and cannot complete the job, the job returns from the executing state to the waiting state to await the next available robot; after the job is completed, it returns from the executing state to the idle state. Each robot likewise has three states: idle, running, and failed. In the idle and running states the robot fails with probability (1 − p) and (1 − q), respectively. In the idle state, if a waiting job is assigned by the scheduler, the robot enters the running state. Each robot can only be assigned jobs within its execution range and returns to the idle state after the job finishes. If a failed robot recovers, it returns to the idle state (a sketch of these state machines is given below). Secondly, the running trajectories of the model are simulated through UPPAAL-SMC, all states and actions in the random environment are explored, the target strategy is trained on the collected data, and the state-action value function table Q(s, a) is established by learning the simulated trajectories offline. Q(s, a) is defined as the value of taking action a in state s, where the state tuple records the three job states (idle, waiting, executing), the three robot states (idle, running, failed), and the control attribution of the current state (the environment or the scheduler), while the action tuple comprises scheduling a waiting job to run on a robot, a job trigger, a robot failure, a job becoming ready to execute, a job completing, and a robot recovering from failure. The state-action value function table is initialized; when the action corresponding to each state s is selected, the corresponding action is chosen by the ε-greedy method, executing the selected action a in state s yields a new state s′ and a corresponding reward, and the value function is updated by cumulatively updating the average.
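As a rough illustration of the job and robot state machines described above (the event names and the failure parameters p and q are placeholders, not values from the patent):

```python
import random

def job_step(state, event):
    """One transition of the job model: idle -> waiting -> executing."""
    transitions = {
        ("idle", "trigger"): "waiting",        # random environment triggers the job
        ("waiting", "assign"): "executing",    # scheduler dispatches a robot
        ("executing", "robot_failed"): "waiting",
        ("executing", "done"): "idle",
    }
    return transitions.get((state, event), state)

def robot_step(state, assigned, p=0.95, q=0.98, recovered=False):
    """One stochastic step of the robot model; fails with prob. 1-p (idle)
    or 1-q (running), and a failed robot returns to idle on recovery."""
    if state == "idle":
        if random.random() > p:
            return "failed"
        return "running" if assigned else "idle"
    if state == "running":
        return "running" if random.random() <= q else "failed"
    return "idle" if recovered else "failed"
```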
Finally, taking the weighted sum of the number of completed jobs, the energy consumed, and the time to complete the jobs as the optimization target, the multi-objective pareto curve is fitted according to the convex-optimization hyperplane separation theorem, thereby generating a scheduling strategy for multiple robots completing multiple jobs in a random environment.
Example 2
This embodiment follows the method of Embodiment 1 and addresses the task of multiple robots cooperatively collecting and transporting specimens. As shown in FIG. 8, multiple robots (robot M, robot M+1, robot M+2, …) need to collect or process specimens at different task points (task points 1 to 6) and then transport them to target points (target point 1 and target point 2). While one robot performs a task at a task point, that task point is not open to other robots, and an ordering exists among the task points: for example, task point 4 is open only to a robot that has completed the task at task point 1; task point 5 is open only to a robot that has completed task point 1, task point 2, or task point 3; and task point 6 is open only to a robot that has completed task point 3 (a sketch of these precedence rules is given below). The uncertainty of the whole system comprises the uncertainty of the task times of different robots at different task points and the uncertainty of the robots' travel times between task points. While executing tasks and moving, the robots must avoid static and dynamic obstacles, ensure that total power consumption is minimal given that each robot consumes different power, and finally reach the target points. A robot's task execution at a task point has three states: when the robot reaches a task point it first triggers waiting; if another robot is already executing the task at that point, it waits for that task to complete, and once the other robot finishes, it starts executing. If the task fails midway, it returns to the waiting state to await re-execution. After completing the task at a task point, the robot searches for the next task point. To establish a multi-objective optimized task scheduling strategy, i.e., a strategy that completes the collection, processing, and transport of all specimens in a short time and with low energy consumption, a time-constraint-oriented general-sum stochastic game model is first established based on the time-constraint-oriented multi-objective stochastic game template, the game participants being the multiple robots.
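A minimal sketch of the task-point precedence rules (the data layout is illustrative):

```python
# Task point -> set of alternative prerequisite task points; completing any
# one of them opens the point to that robot. Points 1-3 have no prerequisites.
PREREQS = {4: {1}, 5: {1, 2, 3}, 6: {3}}

def is_open(point, completed_by_robot):
    """True if `point` is open to a robot that completed `completed_by_robot`."""
    req = PREREQS.get(point, set())
    return not req or bool(req & completed_by_robot)

print(is_open(4, {1, 2}))  # True: task point 1 was completed
print(is_open(6, {1, 2}))  # False: task point 3 has not been completed
```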
Secondly, the running trajectories of the model are simulated through UPPAAL-SMC, all states and actions of the multiple robots in the random environment are explored, the simulation data are collected to train the multi-objective optimization strategy, and the state-action value function table Q(s, a) is established by learning the simulated trajectories offline. Q(s, a) is defined as the value of a robot taking action a in state s, where the state tuple records the three task states (idle, waiting, executing) and the task point or target point where the current robot is located, while the action tuple comprises: moving between task points or between a task point and a target point, for which an existing path-search algorithm can be used to find the shortest path between task points; preparing to perform the task at a task point, waiting until the point is free if it is occupied by another robot; entering the task-execution state from waiting; re-executing the current task after it fails due to internal or external factors; and finishing the current task and moving on to the next task point. All robots select corresponding actions by the ε-greedy method; executing the selected action a in state s yields a new state s′ and a corresponding reward, and the value function is updated by the Nash equilibrium function. Finally, taking the weighted sum of the task execution time, the total energy consumption, and the task completion degree as the optimization target, the multi-objective pareto curve is fitted according to the convex-optimization hyperplane separation theorem, thereby generating a multi-objective optimization strategy for multiple robots cooperatively collecting and transporting specimens in a random environment.
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (7)
1. A time-constraint-oriented multi-agent scheduling method, characterized by comprising the following steps:
S1, establishing a dispatching center, which specifically comprises the following steps:
S11, establishing, based on a time-constraint-oriented multi-objective stochastic game template, a stochastic game model between the multi-agents and a random environment or among the multi-agents;
S12, simulating the running trajectories of the stochastic game model through statistical model checking, and designing a model-free value function learning method to calculate the maximum reward expectation of the multi-agents taking different actions in various states;
S13, iterating the algorithm according to the convergence conditions of the zero-sum stochastic game between the multi-agents and the random environment and of the general-sum stochastic game among the multi-agents;
S14, fitting a weight-combination-based multi-objective pareto curve according to the convex-optimization hyperplane separation theorem;
S2, the dispatching center collects real-time data on the states and actions of the multi-agents and the random environment;
S3, the dispatching center processes the collected data and sends action instructions to the multi-agents.
2. The time-constraint-oriented multi-agent scheduling method according to claim 1, wherein step S11 is specifically:
S111, defining the elements of the stochastic game: P represents the finite set of participants in the stochastic game, namely the multi-agents and the random environment; δ: S × A → Dist(S) represents the state transition function by which the multi-agents move from a state s through an action a to a state s′, where Dist(S) represents a probability distribution over the state set S; R: S ∪ A → ℝ represents the reward function over the states and actions of the multi-agents, where ℝ represents the real numbers;
S112, establishing the time-constraint-oriented multi-objective stochastic game model between the multi-agents and the random environment or among the multi-agents, and adopting rew(ω) as the cumulative reward of the multi-agents along a path ω; for the action set A and a selection strategy π, the reward expectation of the strategy is as follows:
E^π(rew) = Σ_ω Pr^π(ω) · rew(ω), with rew(ω) = Σ_i (R(s_i) + R(a_i))
where R(s) indicates the reward function corresponding to a multi-agent state s; R(a) represents the reward function corresponding to a multi-agent action a; E^π represents the expected reward function of the multi-agents; π represents a policy; and Pr^π represents the probability distribution over paths induced by the multi-agents selecting strategy π.
3. The time-constraint-oriented multi-agent scheduling method according to claim 2, characterized in that a clock constraint φ in the clock constraint set C of step S111 is defined by the formula φ ::= x ⋈ c, where x is one of the clocks, c is a constant, and ⋈ ∈ {<, ≤, =, ≥, >}.
4. The time-constraint-oriented multi-agent scheduling method according to claim 1, wherein step S12 is specifically:
S121, acquiring initial data of all states and actions of the multi-agents in the random environment;
S122, establishing the time-constraint-oriented stochastic game model based on the acquired data, simulating the running trajectories of the stochastic game model through UPPAAL-SMC, exploring all states and actions of the multi-agents in the random environment, and training a target strategy;
S123, establishing the state-action value function table Q(s, a) of the multi-agents by learning the simulated running trajectories offline, the value function table Q(s, a) being defined as the value of taking an action a in a state s, where S represents the set of state tuples, A represents the set of action tuples, and a classification recorded in each state indicates the game participant to which the current state belongs.
5. The time-constraint-oriented multi-agent scheduling method according to claim 4, wherein step S13 is specifically:
S131, for the two-player zero-sum stochastic game, first initializing the state-action value function table Q(s, a); when selecting the action corresponding to each state s, the multi-agent or the random environment selects the corresponding action by the ε-greedy method, and the value function is finally updated by cumulatively updating the average value, with the following formula:
Q(s, a) ← Q(s, a) + (1/n) · [G − Q(s, a)]
where n represents the number of approximate cumulative computations, which may be regarded as a step size, and G is the newly observed return;
S132, for the multi-player general-sum stochastic game, first initializing the state-action value function tables Q_i(s, a); when selecting the action corresponding to each state s, the multi-agents select corresponding actions by the ε-greedy method, and the value function is finally updated by the Nash equilibrium function, with the following formula:
Q_i(s, a) ← (1 − 1/n) · Q_i(s, a) + (1/n) · [r_i + γ · Nash_i(s′)]
where n indicates the number of approximate cumulative calculations; i represents the index of the multi-agent; γ represents the attenuation value; r_i represents the reward currently earned by the multi-agent; s′ indicates the new state obtained after executing the selected action a in state s; and Nash_i(s′) indicates the long-term average return calculated when the multi-agents adopt the joint strategy starting from state s′.
6. The time-constraint-oriented multi-agent scheduling method according to claim 5, characterized in that the Nash equilibrium function Nash_i(s′) of an agent i in the general-sum stochastic game of step S132 satisfies the following formula:
v_i(s′, π_1*, …, π_k*) ≥ v_i(s′, π_i, π_{−i}*) for every π_i ∈ Π_i, where Π_i is the policy set of agent i.
7. The time-constraint-oriented multi-agent scheduling method according to claim 1, wherein step S14 is specifically:
S141, taking the weighted sum of the multi-objective rewards as the optimization target and calculating the weighted sum of the multi-objective optimization, with the following formula:
E_w^π(rew) = w · x = Σ_j w_j x_j
where w represents the weight vector, x represents the reward vector, π represents the policy, E_w^π represents the expected reward function after the weight combination is added, w · x is the weighted sum of the objective rewards under policy π, and Pr^π represents the probability distribution induced by the multi-agents selecting strategy π;
S142, fitting the multi-objective pareto curves under different weight combinations according to the convex-optimization hyperplane separation theorem.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110810946.4A CN113269297B (en) | 2021-07-19 | 2021-07-19 | Multi-agent scheduling method facing time constraint |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113269297A (en) | 2021-08-17
CN113269297B CN113269297B (en) | 2021-11-05 |
Family
ID=77236924
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110810946.4A Active CN113269297B (en) | 2021-07-19 | 2021-07-19 | Multi-agent scheduling method facing time constraint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113269297B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107045655A (en) * | 2016-12-07 | 2017-08-15 | 三峡大学 | Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on the multiple agent intensified learning with time warp thought |
CN110471297A (en) * | 2019-07-30 | 2019-11-19 | 清华大学 | Multiple agent cooperative control method, system and equipment |
CN110728406A (en) * | 2019-10-15 | 2020-01-24 | 南京邮电大学 | Multi-agent power generation optimization scheduling method based on reinforcement learning |
CN111860649A (en) * | 2020-07-21 | 2020-10-30 | 赵佳 | Action set output method and system based on multi-agent reinforcement learning |
CN112132263A (en) * | 2020-09-11 | 2020-12-25 | 大连理工大学 | Multi-agent autonomous navigation method based on reinforcement learning |
Non-Patent Citations (2)
Title |
---|
LIFU DING et al.: "Multi-agent Deep Reinforcement Learning Algorithm for Distributed Economic Dispatch in Smart Grid", IECON 2020 The 46th Annual Conference of the IEEE Industrial Electronics Society *
LI Fangyuan: "Distributed Scheduling and Optimization of Smart Grid Based on Multi-agent Cooperative Algorithm", China Doctoral Dissertations Full-text Database, Engineering Science and Technology II *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115563527A (en) * | 2022-09-27 | 2023-01-03 | 西南交通大学 | Multi-Agent deep reinforcement learning framework and method based on state classification and assignment |
CN115563527B (en) * | 2022-09-27 | 2023-06-16 | 西南交通大学 | Multi-Agent deep reinforcement learning system and method based on state classification and assignment |
CN115576278A (en) * | 2022-09-30 | 2023-01-06 | 常州大学 | Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis |
CN115576278B (en) * | 2022-09-30 | 2023-08-04 | 常州大学 | Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis |
WO2024066675A1 (en) * | 2022-09-30 | 2024-04-04 | 常州大学 | Multi-agent multi-task hierarchical continuous control method based on temporal equilibrium analysis |
Also Published As
Publication number | Publication date |
---|---|
CN113269297B (en) | 2021-11-05 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |