CN113377655A - MAS-Q-learning-based task allocation method - Google Patents

MAS-Q-learning-based task allocation method

Info

Publication number
CN113377655A
Authority
CN
China
Prior art keywords
agent
state
intelligent
decision
mas
Prior art date
Legal status
Granted
Application number
CN202110664158.9A
Other languages
Chinese (zh)
Other versions
CN113377655B (en)
Inventor
王崇骏
张杰
乔羽
曹亦康
李宁
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202110664158.9A
Publication of CN113377655A
Application granted
Publication of CN113377655B
Active (current legal status)
Anticipated expiration

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3684Test management for test design, e.g. generating new test cases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a task allocation method based on MAS-Q-learning. User data in a real application scenario is obtained and modeled as a Markov decision process; each crowdsourcing worker is represented as an agent five-tuple, and the workers' global return is computed by a Q-value learning method. The states of neighboring agents and their successor states are located, the association relations among agent members are described with a Laplacian matrix, a multi-attribute decision method is applied, and the results are weighted and aggregated. An action-value function is estimated by the temporal-difference method, and an agent state function satisfying rationality and completeness conditions is provided. The invention has good robustness and adaptability.

Description

MAS-Q-learning-based task allocation method
Technical Field
The invention relates to the field of task allocation, is mainly applied to crowdsourcing scenarios, and particularly concerns the cost-optimization problem of complex task allocation in such scenarios.
Background
The motivation of the invention derives from the emerging application of software-testing work in current crowdsourcing and from the general crowdsourcing process: in that process, task allocation is ambiguous, and crowdsourcing workers cannot maximize their personal income.
Disclosure of Invention
The purpose of the invention is as follows: to avoid problems in the crowdsourcing process such as ambiguous task allocation and workers being unable to maximize personal income, the invention provides a task allocation method based on MAS-Q-learning. The Q-value learning method is used together with a designed knowledge-sharing mechanism, which improves the robustness of the model; allowing partial knowledge sharing among agents exploits their interactions to improve the scalability of the solution, since most agents are similar to one another and influence one another through the agents' collective states. Second, the method trains and solves on small-sample data in a semi-supervised manner and models the uncertain regions. Moreover, the model can exploit the symmetry of large-scale multi-agent systems to cast task allocation as a difference-of-convex (DC) programming problem, improving the convergence of the algorithm. Finally, to verify the algorithm, a simulator developed for multi-agent systems performs transfer learning between the task allocation problem and the mountain-climbing problem, and multi-agent systems and environments of different scales are tested, showing that the disclosed algorithm learns better than conventional multi-agent Q-value learning.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme.
a task allocation method based on MAS-Q-learning comprises the following steps:
step 1, data acquisition: user data in a real application scenario is obtained, the user data including user generated data having a state set, an action function, a selection probability, and a reward function.
Step 2, data preprocessing: the user data obtained in step 1 is modeled as a Markov decision process, the crowdsourcing workers' capability data for different task types are normalized, each worker is represented as an agent five-tuple, and the workers' global return is computed by a Q-value learning method.
Step 3, state transition: the states of neighboring agents and their successor states are located so that the neighbors' target estimated states assist the agent's own state transition. Neighbor nodes are located by computing with distance observations and the information the neighbor nodes transmit.
Step 4, modeling the multi-agent system: a Laplacian matrix describes the association relations among agent members; the purpose is to construct a mechanism for information interaction among member agents in the multi-agent system, together with a corresponding topological model, thereby reducing the difficulty of solving complex problems.
In step 4, the multi-agent system is modeled as follows:
Step 4a), the agent system comprises two or more agents; its topological structure is represented by a graph [formula image omitted in the source], and the dynamics equation and edge-state definition of a single agent are obtained by calculation.
Step 4b), the dynamics equation of a single agent is updated, the corresponding incidence matrix is computed, the Laplacian matrix is derived, and an information feedback model is established, thereby obtaining the agents' information-interaction feedback.
Step 4c), after the information feedback models among the agents in the multi-agent system are obtained, the multi-agent system is reduced in order, lowering the solving complexity based on a spanning-tree subgraph structure. A linear transformation of the spanning tree yields the spanning co-tree, which serves as the internal feedback term of the multi-agent system, finally giving the reduced-order multi-agent system model.
Step 5, the multi-attribute decision stage: first a decision matrix is given and it is judged whether the attribute weights are known; the weights are determined, and an aggregation operator for the attribute matrix is obtained from the decision matrix's attribute values. A suitable multi-attribute decision method is selected according to the form of the objective and of the decision matrix; the results are weighted and aggregated, and the decision is made according to the final scores of the candidate schemes.
Step 6, the method-optimization stage: an action-value function is estimated by the temporal-difference method, and an agent state function satisfying the rationality and completeness conditions is provided.
Preferably: the data preprocessing method in step 2 is as follows:
Step 2a), represent each crowdsourcing worker as an agent five-tuple ⟨S, A, P, γ, R⟩, where S is the state, A the action function, P the selection probability, γ ∈ (0, 1) the discount factor, and R the reward function.
Step 2b), at a time t the agent is in state S_t, selects a strategy from the strategy set, and generates action A_t; it then transfers with probability p_t to the next state S_{t+1}. Repeating this process yields the agent's global return.
Preferably: the state transition method in step 3 is as follows:
Step 3a), first derive the Euclidean distances of the agent to its neighboring agents to obtain the estimated position of agent j relative to agent i in i's local coordinate system, yielding the distance observation.
Step 3b), locate the neighbor node using the distance observation obtained in step 3a) and the information transmitted by the neighbor node.
Preferably: the MAS-Q-learning based task allocation method as claimed in claim 4, wherein: the multi-attribute decision phase method in the step 6 is as follows: and solving the Markov decision process problem under the condition that the transition probability model is unknown. Setting a state (S), an action (A), a reward function (r), a transition probability (p) with a Markov property of p (S)t+1|s0,a0,…,st,at)=p(st+1|st,at) Wherein s istIndicating the state at time t, atRepresenting behavior at time t; the optimization goal of the model is
Figure BDA0003116597340000031
at~π(·|st) T is 0, … T-1, pi denotes a constant, pi (· | s)t) Is shown in state stThe probability of the following. Using reinforcement learning method in p(s)t+1|st,at) Solving the Markov decision process problem under the unknown condition, and estimating an action-value function by adopting a time difference method;
preferably: the state of the agent satisfying the integrity condition includes all information required by the agent's decision.
Preferably: discrete or continuous action values are designed for the agent's actions according to the numerical characteristics of the applied control quantity.
Compared with the prior art, the invention has the following beneficial effects:
the invention establishes a multi-person model based on a single-person decision method. Aiming at the particularity of the crowd test environment, the invention designs a multi-attribute decision-making mechanism in the crowd test process. The invention selects Q value learning as a training algorithm and optimizes the design of an imperfect information sharing mechanism. Through different imperfect information sharing scenes and different gamma values and data sets, training results are analyzed, and the system designed by the method has good robustness and adaptability, and the method and the model provided by the invention have certain applicability. Has reference value for future research in the related field. The method has strong practicability and is suitable for all crowdsourcing systems.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention;
FIG. 2 is the crowdsourced-testing process used in the present invention;
FIG. 3 is a multi-agent collaborative behavior decision model research framework used in the present invention.
Detailed Description
The present invention is further illustrated below in conjunction with the accompanying drawings and specific embodiments. It should be understood that these examples are given solely for the purpose of illustration and are not intended to limit the scope of the invention; various equivalent modifications that occur to those skilled in the art after reading the present disclosure fall within the scope defined by the appended claims.
A task allocation method based on MAS-Q-learning, as shown in fig. 1-3, comprising the following steps:
step 1, data acquisition: user data in a real application scene is acquired, wherein the user data comprises data which are generated by a user and comprise a state set, an action function, a selection probability and a reward function, and the four types of data cannot have any deficiency.
Step 2, data preprocessing: the user data obtained in step 1 is modeled as a Markov decision process, the crowdsourcing workers' capability data for different task types are normalized, each worker is represented as an agent five-tuple, and the workers' global return is computed by a Q-value learning method.
The data preprocessing method in step 2 is as follows:
Step 2a), represent each crowdsourcing worker as an agent five-tuple ⟨S, A, P, γ, R⟩, where S is the state, A the action function, P the selection probability, γ ∈ (0, 1) the discount factor, and R the reward function.
Step 2b), at a time t the agent is in state S_t, selects a strategy from the strategy set, and generates action A_t; it then transfers with probability p_t to the next state S_{t+1}. Repeating this process yields the agent's global return.
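By way of illustration, the following is a minimal Python sketch of steps 2a) and 2b), assuming a tabular Q-value learner with an epsilon-greedy selection rule; the class name CrowdAgent and the hyperparameters alpha and epsilon are illustrative assumptions, not values prescribed by the patent.

```python
import numpy as np

class CrowdAgent:
    """Tabular Q-value learner over the agent five-tuple <S, A, P, gamma, R>.

    The learning rate `alpha` and exploration rate `epsilon` are
    illustrative hyperparameters; the patent does not fix their values.
    """

    def __init__(self, n_states, n_actions, gamma=0.9, alpha=0.1, epsilon=0.1):
        self.q = np.zeros((n_states, n_actions))  # action-value table Q(s, a)
        self.gamma = gamma      # discount factor, gamma in (0, 1)
        self.alpha = alpha      # step size of the value update
        self.epsilon = epsilon  # exploration probability

    def act(self, state, rng):
        """Epsilon-greedy selection from the strategy set (step 2b)."""
        if rng.random() < self.epsilon:
            return int(rng.integers(self.q.shape[1]))
        return int(np.argmax(self.q[state]))

    def update(self, s, a, r, s_next):
        """Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
        td_target = r + self.gamma * np.max(self.q[s_next])
        self.q[s, a] += self.alpha * (td_target - self.q[s, a])
```

The worker's global return is then the discounted sum of the rewards collected along the visited state sequence, which the Q table approximates state by state.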
Step 3, state transition: the states of neighboring agents and their successor states are located so that the neighbors' target estimated states assist the agent's own state transition. Neighbor nodes are located by computing with distance observations and the information the neighbor nodes transmit.
Step 3a), first derive the Euclidean distances of the agent to its neighboring agents to obtain the estimated position of agent j relative to agent i in i's local coordinate system, yielding the distance observation.
Step 3b), locate the neighbor node using the distance observation obtained in step 3a) and the information transmitted by the neighbor node.
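One way to realize this localization is linearized least-squares multilateration from the distance observations, as in the sketch below; the patent does not prescribe a specific estimator, so the function locate_neighbor and the anchor-based formulation are assumptions made for illustration.

```python
import numpy as np

def locate_neighbor(anchor_positions, distances):
    """Estimate neighbor j's position in agent i's local frame by
    linearized least-squares multilateration from distance observations.

    anchor_positions : (m, 2) coordinates, in i's frame, of agents whose
                       positions are already known (m >= 3)
    distances        : (m,) range observations from those agents to j
    """
    p = np.asarray(anchor_positions, dtype=float)
    d = np.asarray(distances, dtype=float)
    # Subtracting the first range equation removes the quadratic term ||x||^2:
    #   2 (p_k - p_0) . x = ||p_k||^2 - ||p_0||^2 - d_k^2 + d_0^2
    A = 2.0 * (p[1:] - p[0])
    b = np.sum(p[1:] ** 2, axis=1) - np.sum(p[0] ** 2) - d[1:] ** 2 + d[0] ** 2
    est, *_ = np.linalg.lstsq(A, b, rcond=None)
    return est  # estimated position of agent j relative to agent i
```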
Step 4, modeling the multi-agent system: the invention adopts a Laplacian matrix to describe the association relations among agent members, aiming to construct a mechanism for information interaction among member agents in the multi-agent system, together with a corresponding topological model, thereby reducing the difficulty of solving complex problems.
In step 4, the multi-agent system is modeled as follows:
Step 4a), the agent system comprises two or more agents; its topological structure is represented by a graph [formula image omitted in the source], and the dynamics equation and edge-state definition of a single agent are obtained by calculation.
Step 4b), the dynamics equation of a single agent is updated, the corresponding incidence matrix is computed, the Laplacian matrix is derived, and an information feedback model is established, thereby obtaining the agents' information-interaction feedback.
Step 4c), after the information feedback models among the agents in the multi-agent system are obtained, the multi-agent system is reduced in order, lowering the solving complexity based on a spanning-tree subgraph structure. A linear transformation of the spanning tree yields the spanning co-tree, which serves as the internal feedback term of the multi-agent system, finally giving the reduced-order multi-agent system model.
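As a concrete illustration of the Laplacian construction in step 4b), the sketch below builds L = D - A from an adjacency matrix; the four-agent ring topology in the example is hypothetical.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Laplacian L = D - A of the agent interaction topology, where entry
    (i, j) of `adjacency` is positive if agent i receives information
    from agent j."""
    A = np.asarray(adjacency, dtype=float)
    D = np.diag(A.sum(axis=1))  # degree matrix
    return D - A

# Example: four agents connected in a ring.
ring = np.array([[0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0]])
L = graph_laplacian(ring)
# The zero eigenvalue of L has multiplicity 1 exactly when the topology is
# connected, which is the property the spanning-tree reduction of step 4c)
# exploits.
```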
Step 5, the multi-attribute decision stage: first a decision matrix is given and it is judged whether the attribute weights are known; the weights are determined, and an aggregation operator for the attribute matrix is obtained from the decision matrix's attribute values. A suitable multi-attribute decision method is selected according to the form of the objective and of the decision matrix; the results are weighted and aggregated, and the decision is made according to the final scores of the candidate schemes.
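This stage can be instantiated, for example, with simple additive weighting plus entropy-derived weights when the attribute weights are unknown; the sketch below is one such instantiation, and the function rank_alternatives is an illustrative assumption, since the patent leaves the concrete aggregation operator open.

```python
import numpy as np

def rank_alternatives(decision_matrix, weights=None):
    """Score alternatives with a simple additive weighting operator.

    decision_matrix : (n_alternatives, n_attributes); larger values better
    weights         : known attribute weights, or None to derive entropy
                      weights from the matrix when they are unknown
    """
    X = np.asarray(decision_matrix, dtype=float)
    # Normalize every attribute column to [0, 1].
    X = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)
    if weights is None:
        # Entropy weighting: attributes with more dispersion weigh more.
        P = (X + 1e-12) / (X + 1e-12).sum(axis=0)
        e = -(P * np.log(P)).sum(axis=0) / np.log(X.shape[0])
        weights = (1.0 - e) / (1.0 - e).sum()
    scores = X @ np.asarray(weights, dtype=float)  # weighted aggregation
    return scores, np.argsort(-scores)  # scores and ranking, best first
```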
Step 6, the method-optimization stage: an action-value function is estimated by the temporal-difference method, and an agent state function satisfying the rationality and completeness conditions is provided.
The method-optimization stage in step 6 is as follows: solve the Markov decision process problem when the transition probability model is unknown. Set a state (S), an action (A), a reward function (r), and a transition probability (p) whose Markov property is p(s_{t+1} | s_0, a_0, …, s_t, a_t) = p(s_{t+1} | s_t, a_t), where s_t denotes the state at time t and a_t the behavior at time t. The optimization objective of the model is
max_π E[ Σ_{t=0}^{T-1} γ^t r(s_t, a_t) ]  s.t.  s_{t+1} ~ p(·|s_t, a_t),  a_t ~ π(·|s_t),  t = 0, …, T-1,
where π denotes the policy and π(·|s_t) is the action distribution in state s_t. A reinforcement learning method is used to solve the Markov decision process problem when p(s_{t+1}|s_t, a_t) is unknown, and the action-value function is estimated by the temporal-difference method. Under this research framework, the agent's state is designed so that conditions such as rationality and completeness are satisfied. The completeness requirement means the state contains all information needed for the agent's decision; for example, in an agent's trajectory-tracking problem, trend information of the target trajectory must be added, and if that information cannot be observed, the state must be extended to include historical observations. The agent's actions are designed as discrete or continuous action values according to the numerical characteristics of the applied control quantity.
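To make the temporal-difference estimation concrete, the following sketch evaluates an action-value function from sampled transitions, never querying p(s_{t+1}|s_t, a_t) directly; the Gym-style env.reset()/env.step() interface is an assumed convention for illustration, not part of the patent.

```python
import numpy as np

def td_action_value(env, policy, n_states, n_actions,
                    gamma=0.9, alpha=0.1, episodes=500, seed=0):
    """TD(0) estimate of the action-value function Q^pi from sampled
    transitions, usable when the transition model is unknown.

    `policy` is an (n_states, n_actions) array of probabilities, i.e.
    a_t ~ pi(.|s_t); `env.reset()` returns a state and `env.step(a)`
    returns (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = rng.choice(n_actions, p=policy[s])  # a_t ~ pi(.|s_t)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = rng.choice(n_actions, p=policy[s_next])
            # Bootstrap on the successor state-action pair.
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```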
In actual deployment the method is not one-size-fits-all; it must be adjusted according to each user's data, such as the decision set and the action set.
In summary, the invention designs a multi-attribute decision mechanism for the crowdsourced-testing process. Q-learning is chosen as the training algorithm, and the design of the imperfect information-sharing mechanism is optimized. Training results are analyzed across different imperfect information-sharing scenarios and different γ values and data sets. Experiments show that the method converges by about the 50th round, indicating advantages in convergence speed and stability; they also show that the designed system has good robustness and adaptability and that the proposed method and model have a certain applicability, providing reference value for future research in related fields.
The above description covers only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.

Claims (7)

1. A task allocation method based on MAS-Q-learning is characterized by comprising the following steps:
step 1, data acquisition: acquiring user data in a real application scenario, the user data comprising user-generated data with a state set, an action function, a selection probability, and a reward function;
step 2, data preprocessing: modeling the user data obtained in step 1 as a Markov decision process, normalizing the crowdsourcing workers' capability data for different task types, representing each worker as an agent five-tuple, and computing the workers' global return by a Q-value learning method;
step 3, state transition: locating the states of neighboring agents and their successor states so that the neighbors' target estimated states assist the agent's own state transition, the neighbor nodes being located by computing with distance observations and the information the neighbor nodes transmit;
step 4, modeling the multi-agent system: describing the association relations among agent members with a Laplacian matrix, the purpose being to construct a mechanism for information interaction among member agents in the multi-agent system, together with a corresponding topological model, thereby reducing the difficulty of solving complex problems;
step 5, the multi-attribute decision stage: first giving a decision matrix, judging whether the attribute weights are known and determining the weights, obtaining an aggregation operator for the attribute matrix from the decision matrix's attribute values, selecting a suitable multi-attribute decision method according to the form of the objective and of the decision matrix, weighting and aggregating the results, and making the decision according to the final scores of the candidate schemes;
step 6, the method-optimization stage: estimating an action-value function by the temporal-difference method, and providing an agent state function satisfying the rationality and completeness conditions.
2. The MAS-Q-learning based task allocation method as claimed in claim 1, wherein: the data preprocessing method in step 2 is as follows:
step 2a), represent each crowdsourcing worker as an agent five-tuple ⟨S, A, P, γ, R⟩, where S is the state, A the action function, P the selection probability, γ ∈ (0, 1) the discount factor, and R the reward function;
step 2b), at a time t the agent is in state S_t, selects a strategy from the strategy set, and generates action A_t; it then transfers with probability p_t to the next state S_{t+1}, and repeating this process yields the agent's global return.
3. The MAS-Q-learning based task allocation method as claimed in claim 2, wherein: the state transition method in step 3 is as follows:
step 3a), first derive the Euclidean distances of the agent to its neighboring agents to obtain the estimated position of agent j relative to agent i in i's local coordinate system, yielding the distance observation;
step 3b), locate the neighbor node using the distance observation obtained in step 3a) and the information transmitted by the neighbor node.
4. A MAS-Q-learning based task allocation method as claimed in claim 3, wherein in step 4 the multi-agent system is modeled as follows:
step 4a), the agent system comprises two or more agents; its topological structure is represented by a graph [formula image omitted in the source], and the dynamics equation and edge-state definition of a single agent are obtained by calculation;
step 4b), the dynamics equation of a single agent is updated, the corresponding incidence matrix is computed, the Laplacian matrix is derived, and an information feedback model is established, thereby obtaining the agents' information-interaction feedback;
step 4c), after the information feedback models among the agents in the multi-agent system are obtained, the multi-agent system is reduced in order, lowering the solving complexity based on a spanning-tree subgraph structure; a linear transformation of the spanning tree yields the spanning co-tree, which serves as the internal feedback term of the multi-agent system, finally giving the reduced-order multi-agent system model.
5. The MAS-Q-learning based task allocation method as claimed in claim 4, wherein: the method-optimization stage in step 6 is as follows: solve the Markov decision process problem when the transition probability model is unknown; set a state (S), an action (A), a reward function (r), and a transition probability (p) whose Markov property is p(s_{t+1} | s_0, a_0, …, s_t, a_t) = p(s_{t+1} | s_t, a_t), where s_t denotes the state at time t and a_t the behavior at time t; the optimization objective of the model is
max_π E[ Σ_{t=0}^{T-1} γ^t r(s_t, a_t) ]  s.t.  s_{t+1} ~ p(·|s_t, a_t),  a_t ~ π(·|s_t),  t = 0, …, T-1,
where π denotes the policy and π(·|s_t) is the action distribution in state s_t; a reinforcement learning method is used to solve the Markov decision process problem when p(s_{t+1}|s_t, a_t) is unknown, and the action-value function is estimated by the temporal-difference method.
6. The MAS-Q-learning based task allocation method as claimed in claim 5, wherein: an agent state satisfying the completeness condition contains all information required by the agent's decision.
7. The MAS-Q-learning based task allocation method as claimed in claim 6, wherein: discrete or continuous action values are designed for the agent's actions according to the numerical characteristics of the applied control quantity.
CN202110664158.9A 2021-06-16 2021-06-16 Task allocation method based on MAS-Q-learning Active CN113377655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110664158.9A CN113377655B (en) 2021-06-16 2021-06-16 Task allocation method based on MAS-Q-learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110664158.9A CN113377655B (en) 2021-06-16 2021-06-16 Task allocation method based on MAS-Q-learning

Publications (2)

Publication Number Publication Date
CN113377655A true CN113377655A (en) 2021-09-10
CN113377655B CN113377655B (en) 2023-06-20

Family

ID=77574510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110664158.9A Active CN113377655B (en) 2021-06-16 2021-06-16 Task allocation method based on MAS-Q-learning

Country Status (1)

Country Link
CN (1) CN113377655B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409739A (en) * 2018-10-19 2019-03-01 南京大学 A kind of crowdsourcing platform method for allocating tasks based on part Observable markov decision process
WO2020092437A1 (en) * 2018-10-29 2020-05-07 Google Llc Determining control policies by minimizing the impact of delusion
CN111770454A (en) * 2020-07-03 2020-10-13 南京工业大学 Game method for position privacy protection and platform task allocation in mobile crowd sensing
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning
CN111695690A (en) * 2020-07-30 2020-09-22 航天欧华信息技术有限公司 Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning
CN112121439A (en) * 2020-08-21 2020-12-25 林瑞杰 Cloud game engine intelligent optimization method and device based on reinforcement learning
CN112598137A (en) * 2020-12-21 2021-04-02 西北工业大学 Optimal decision method based on improved Q-learning
CN112884239A (en) * 2021-03-12 2021-06-01 重庆大学 Aerospace detonator production scheduling method based on deep reinforcement learning
CN112801430A (en) * 2021-04-13 2021-05-14 贝壳找房(北京)科技有限公司 Task issuing method and device, electronic equipment and readable storage medium

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
YONG SUN: "A trust-aware task allocation method using deep Q-learning for uncertain mobile crowdsourcing", Human-centric Computing and Information Sciences, pages 1-7, retrieved from the Internet *
ZHENG ZHAO: "Learning Task Allocation for Multiple Flows in Multi-Agent Systems", 2009 International Conference on Communication Software and Networks, pages 1-9 *
倪志伟: "Spatial crowdsourcing task allocation strategy based on deep reinforcement learning", Pattern Recognition and Artificial Intelligence, pages 191-205 *
张雷: "A reputation-based reconnection strategy in distributed task allocation", Journal of Guangxi University (Natural Science Edition), pages 645-648 *
杨萍; 毕义明; 刘卫东: "A decision model for maneuvering agents based on fuzzy Markov theory", Systems Engineering and Electronics, no. 03, pages 1-5 *
洋葱YCY: "Q-learning: understanding, implementation, and dynamic-allocation applications (part 1)", pages 1-5, retrieved from the Internet: https://blog.csdn.net/ycy0706/article/details/84655242 *
郑晓杰: "Cloudlet-based load distribution and resource allocation for mobile cloud platforms", China Masters' Theses Full-text Database (Information Science and Technology), pages 139-142 *

Also Published As

Publication number Publication date
CN113377655B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
Liu et al. Assessing optimal assignment under uncertainty: An interval-based algorithm
CN112131786B (en) Target detection and distribution method and device based on multi-agent reinforcement learning
CN104408518B (en) Based on the neural network learning optimization method of particle swarm optimization algorithm
CN109241674A (en) A kind of multi-time Delay method for analyzing stability of intelligent network connection platooning
CN110653824B (en) Method for characterizing and generalizing discrete trajectory of robot based on probability model
CN109657868A (en) A kind of probabilistic programming recognition methods of task sequential logic constraint
CN109940614A (en) A kind of quick motion planning method of the more scenes of mechanical arm merging memory mechanism
CN110826244A (en) Conjugate gradient cellular automata method for simulating influence of rail transit on urban growth
CN116361697A (en) Learner learning state prediction method based on heterogeneous graph neural network model
CN115599779A (en) Urban road traffic missing data interpolation method and related equipment
CN114819068A (en) Hybrid target track prediction method and system
CN109961129A (en) A kind of Ocean stationary targets search scheme generation method based on improvement population
CN111192158A (en) Transformer substation daily load curve similarity matching method based on deep learning
Zhang et al. A method for linguistic multiple attribute decision making based on TODIM
CN108153519B (en) Universal design framework for target intelligent tracking method
Syberfeldt et al. Multi-objective evolutionary simulation-optimisation of a real-world manufacturing problem
Li et al. Differentiable bootstrap particle filters for regime-switching models
Dwivedi et al. A comparison of particle swarm optimization (PSO) and genetic algorithm (GA) in second order design (SOD) of GPS networks
CN115691140B (en) Analysis and prediction method for space-time distribution of automobile charging demand
CN113377655A (en) MAS-Q-learning-based task allocation method
CN113379063B (en) Whole-flow task time sequence intelligent decision-making method based on online reinforcement learning model
Zhan et al. Dueling network architecture for multi-agent deep deterministic policy gradient
CN114372418A (en) Wind power space-time situation description model establishing method
CN112633591B (en) Space searching method and device based on deep reinforcement learning
Hossain et al. Efficient learning of voltage control strategies via model-based deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant