CN113377655B

CN113377655B - A Method of Task Assignment Based on MAS-Q-Learing

Info

Publication number: CN113377655B
Application number: CN202110664158.9A
Authority: CN
Inventors: 王崇骏; 张�杰; 乔羽; 曹亦康; 李宁
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-06-16
Filing date: 2021-06-16
Publication date: 2023-06-20
Anticipated expiration: 2041-06-16
Also published as: CN113377655A

Abstract

The invention discloses a task allocation method based on MAS-Q-learning, which is used for acquiring user data in a real application scene, modeling the user data by adopting a Markov decision, designing crowdsourcing personnel into an agent quintuple, and calculating global benefits of the crowdsourcing personnel by a Q value learning method; and positioning the state of the adjacent intelligent agent and the next state, describing the association relation among the intelligent agent members by using a Laplace matrix, calculating by using a multi-attribute decision method, and then carrying out weight distribution and aggregation on the calculation result. And estimating the action-value function by adopting a time difference method, and simultaneously providing an agent state function meeting the conditions of rationality and integrity. The invention has good robustness and adaptability.

Description

A Method of Task Assignment Based on MAS-Q-Learing

技术领域technical field

本发明涉及任务分配领域，主要应用在众包场景中，具体涉及到众包场景下复杂任务分配的成本优化问题。The invention relates to the field of task allocation, is mainly applied in crowdsourcing scenarios, and specifically relates to the cost optimization problem of complex task allocation in crowdsourcing scenarios.

背景技术Background technique

本发明的设计动力来源于当前众包中软件测试工作的新兴应用，一般的众测过程，在该众包过程中，任务分配不明确，众包工人众包工人无法得到个人收益最大化。The design motivation of the present invention comes from the emerging application of software testing work in current crowdsourcing. In the general crowdsourcing process, task assignment is not clear, and crowdsourcing workers cannot maximize their personal benefits.

发明内容Contents of the invention

发明目的：为了避免众包过程中任务分配不明确、众包工人无法得到个人收益最大化等问题，本发明提供一种基于MAS-Q-Learing的任务分配方法，本发明与传统离散数据结构的图不同，众包过程在时间维度上是连续的，因此需要可变和不确定的时间域来对智能体进行引导。使用了Q值学习方法并设计了知识共享机制，提高了模型的鲁棒性，通过允许各个智能体之间进行部分知识共享，其中大多数智能体彼此类似，并通过它们的集体状态相互影响，利用这种交互特性可以提高求解方案的可扩展性。其次，本发明针对小样本数据进行训练与求解，数据采用半监督的方式进行训练，对不确定性区域进行建模；并且我们的模型还能利用大型多智能体系统的对称性，将任务分配收敛成差分—凸函数规划问题，提高了算法的收敛性。最后，为了验证算法，在多智能体上开发的相关模拟器，将任务分配问题与爬山问题进行迁移学习，测试了不同规模的多智能体系统以及环境，表明本发明算法比传统的多智能体Q值学习效果更好。Purpose of the invention: In order to avoid problems such as unclear assignment of tasks in the process of crowdsourcing and the inability of crowdsourcing workers to maximize their personal income, the present invention provides a task assignment method based on MAS-Q-Learing, which is different from the traditional discrete data structure Unlike graphs, the crowdsourcing process is continuous in the time dimension, thus requiring a variable and uncertain time domain to guide the agent. A Q-value learning method is used and a knowledge sharing mechanism is designed to improve the robustness of the model by allowing partial knowledge sharing among individual agents, most of which are similar to each other and influence each other through their collective state, Exploiting this interactive property can improve the scalability of the solution scheme. Secondly, the present invention trains and solves for small-sample data. The data is trained in a semi-supervised manner to model uncertain regions; and our model can also use the symmetry of large-scale multi-agent systems to assign tasks Convergence is a difference-convex function programming problem, which improves the convergence of the algorithm. Finally, in order to verify the algorithm, the relevant simulators developed on the multi-agent, transfer the task allocation problem and the mountain climbing problem, and tested the multi-agent systems and environments of different scales, showing that the algorithm of the present invention is better than the traditional multi-agent The Q value learning effect is better.

技术方案：为实现上述目的，本发明采用的技术方案为：Technical scheme: in order to achieve the above object, the technical scheme adopted in the present invention is:

一种基于MAS-Q-Learing的任务分配方法，包括如下步骤：A task allocation method based on MAS-Q-Learing, comprising the following steps:

步骤1，数据采集：获取真实应用场景中的用户数据，用户数据包括用户产生的具有状态集、动作函数、选择概率和奖励函数的数据。Step 1, data collection: Acquire user data in real application scenarios. User data includes data generated by users with state sets, action functions, selection probabilities, and reward functions.

步骤2，数据预处理：采用马尔科夫决策对步骤1得到的用户数据进行建模，针对不同类型的任务对众包人员进行能力数据的归一化处理，将众包人员设计成智能体五元组，通过Q值学习方法计算他们的全局收益。Step 2, data preprocessing: use Markov decision-making to model the user data obtained in step 1, normalize the ability data of crowdsourcing personnel for different types of tasks, and design crowdsourcing personnel as intelligent agents. tuples, and their global payoffs are computed by the Q-value learning method.

步骤3，状态转移：对邻近智能体的状态以及下一状态进行定位，以便利用邻近智能体的目标估计状态来辅助自身状态转移。邻居节点进行定位利用距离观测和邻居节点传递的信息计算出。Step 3, state transition: locate the state and next state of the neighboring agent, so as to use the target estimated state of the neighboring agent to assist its own state transition. The location of neighbor nodes is calculated by distance observation and information passed by neighbor nodes.

步骤4，多智能体系统建模：采用拉普拉斯矩阵用于描述各个智能体成员之间的关联关系，目的是构建一个多智能体系统内部各成员智能体进行信息交互的机制以及对应的拓扑模型，以此降低复杂问题的求解难度。Step 4, multi-agent system modeling: Laplacian matrix is used to describe the relationship between the members of each agent, the purpose is to build a mechanism for information interaction among members of the multi-agent system and the corresponding Topological models to reduce the difficulty of solving complex problems.

所述步骤4中多智能体系统建模如下：The multi-agent system modeling in the step 4 is as follows:

步骤4a)，智能体系统包括两个以上的智能体，智能体系统的拓扑结构由

表示，计算得到单个智能体的动力学方程以及边状态定义。Step 4a), the agent system includes more than two agents, and the topology of the agent system is given by

Indicates that the dynamic equation and edge state definition of a single agent are calculated.

步骤4b)，更新单个智能体的动力学方程，然后计算得到对应的入度关联矩阵，由此推理得到拉普拉斯矩阵，建立信息反馈模型，进而获得智能体的信息交互反馈。In step 4b), the dynamic equation of a single agent is updated, and then the corresponding in-degree correlation matrix is calculated, and the Laplacian matrix is obtained by inference, and an information feedback model is established to obtain the information interaction feedback of the agent.

步骤4c)，获得多智能体系统中智能体之间的信息反馈模型后，接下来对多智能体系统进行模型降阶，基于生成树子图结构降低求解的复杂度。对生成树进行线性变换获得生成余树，作为多智能体系统的内反馈项，最终获得降阶后的多智能体系统模型。Step 4c), after obtaining the information feedback model between agents in the multi-agent system, then perform model reduction on the multi-agent system, and reduce the complexity of the solution based on the spanning tree subgraph structure. The spanning tree is linearly transformed to obtain the spanning residual tree, which is used as the internal feedback item of the multi-agent system, and finally the reduced-order multi-agent system model is obtained.

步骤5，多属性决策阶段：首先给出决策矩阵，判断权重是否已知并确定权重，根据决策矩阵的属性值得出属性矩阵的集结算子，同时根据求解目标和决策矩阵的形式，选择相应的多属性决策方法进行计算，其计算结果再经过权重分配和集结，并根据最后各方案得分情况进行决策。Step 5, multi-attribute decision-making stage: first give the decision matrix, judge whether the weights are known and determine the weights, obtain the aggregate operator of the attribute matrix according to the attribute values of the decision matrix, and select the corresponding The multi-attribute decision-making method is used for calculation, and the calculation results are then weighted and assembled, and decisions are made according to the scores of the final schemes.

步骤6，方法优化阶段：采用时间差分方法估计动作-值函数，同时给出了满足合理性、完整性条件的智能体状态函数。Step 6, method optimization stage: the time difference method is used to estimate the action-value function, and at the same time, the agent state function satisfying the rationality and integrity conditions is given.

优选的：步骤2中数据预处理方法如下：Preferably: the data preprocessing method in step 2 is as follows:

步骤2a)，将众包人员设计成智能体五元组：<S,A,P,γ,R>，其中，S为状态，A为动作函数，P为选择概率，γ为折扣因子，γ∈(0,1)，R为奖励函数。Step 2a), design the crowdsourcers as an agent quintuple: <S,A,P,γ,R>, where S is the state, A is the action function, P is the selection probability, γ is the discount factor, γ ∈(0,1), R is the reward function.

步骤2b)，当处于某一时刻t时，智能体处于状态S_t+1，从策略集中选取策略并生成动作函数A_t，此时按照概率p_t转移到下一状态S_t+1，依此类推，遍历状态后，得到该智能体的全局收益。Step 2b), when at a certain time t, the agent is in the state S _t+1 , selects a strategy from the strategy set and generates an action function A _t , at this time, it transfers to the next state S _t+1 according to the probability p _t , according to By analogy, after traversing the state, the global income of the agent is obtained.

优选的：所述步骤3中状态转移方法如下：Preferably: the state transfer method in the step 3 is as follows:

步骤3a)，首先对智能体相对临近智能体的欧式距离进行推导，得到智能体j在智能体i下局部坐标系的相对估计位置，得到距离观测。Step 3a), first deduce the Euclidean distance of the agent relative to the adjacent agent, obtain the relative estimated position of the agent j in the local coordinate system under the agent i, and obtain the distance observation.

步骤3b)，利用步骤3a)获得的距离观测和邻居节点传递的信息对邻居节点进行定位。Step 3b), using the distance observation obtained in step 3a) and the information transmitted by the neighbor nodes to locate the neighbor nodes.

优选的：根据权利要求4所述基于MAS-Q-Learing的任务分配方法，其特征在于：所述步骤6中多属性决策阶段方法如下：在转移概率模型未知的条件下求解马尔科夫决策过程问题。设定状态(S)，动作(A)，奖励函数(r)，转移概率(p)，其马尔科夫性为p(s_t+1|s₀,a₀,…,s_t,a_t)＝p(s_t+1|s_t,a_t)，其中s_t表示在t时间的状态，a_t表示在t时间的行为；模型的优化目标为

a_t～π(·|s_t),t＝0,…T-1，π表示常数，π(·|s_t)表示在状态s_t下的概率。利用强化学习方法在p(s_t+1|s_t,a_t)未知情况下求解马尔科夫决策过程问题，采用时间差分方法估计动作-值函数；Preferably: according to the task assignment method based on MAS-Q-Learing described in claim 4, it is characterized in that: the multi-attribute decision-making stage method in the described step 6 is as follows: solve the Markov decision process under the condition that the transition probability model is unknown question. Set the state (S), action (A), reward function (r), transition probability (p), and its Markov property is p(s _t+1 |s ₀ ,a ₀ ,…,s _t ,a _t )=p(s _t+1 |s _t , a _t ), where s _t represents the state at time t, and a _t represents the behavior at time t; the optimization objective of the model is

a _t ~π(·|s _t ), t=0,...T-1, π represents a constant, and π(·|s _t ) represents the probability in state s _t . Using reinforcement learning method to solve Markov decision process problem when p(s _t+1 |s _t ,a _t ) is unknown, using time difference method to estimate action-value function;

优选的：智能体状态满足完整性条件包括智能体决策需要的所有信息。Preferably: the state of the agent satisfies the completeness condition and includes all the information required by the agent for decision-making.

优选的：对于智能体的动作根据施加控制量的数值特点设计离散或连续的动作值。Preferably: for the action of the agent, discrete or continuous action values are designed according to the numerical characteristics of the applied control amount.

本发明相比现有技术，具有以下有益效果：Compared with the prior art, the present invention has the following beneficial effects:

本发明基于单人决策方法建立了多人模型。针对人群测试环境的特殊性，本发明设计了人群测试过程中的多属性决策机制。本发明选择Q值学习作为训练算法，并优化不完美信息共享机制的设计。通过不同的不完美信息共享场景，以及不同的伽玛值和数据集，本发明对训练结果进行了分析，证明本发明设计的系统具有良好的鲁棒性和适应性，本发明提出的方法和模型具有一定的适用性。对相关领域的未来研究具有参考价值。具有较强的实用性，适用于所有的众包系统系统中。The present invention establishes a multi-person model based on a single-person decision-making method. Aiming at the particularity of the crowd test environment, the present invention designs a multi-attribute decision-making mechanism in the crowd test process. The present invention selects Q value learning as the training algorithm, and optimizes the design of the imperfect information sharing mechanism. Through different imperfect information sharing scenarios, as well as different gamma values and data sets, the present invention analyzes the training results, and proves that the system designed by the present invention has good robustness and adaptability. The method and The model has certain applicability. It has reference value for future research in related fields. It has strong practicability and is suitable for all crowdsourcing systems.

附图说明Description of drawings

图1为本发明的方法整体流程图；Fig. 1 is the overall flowchart of the method of the present invention;

图2为本发明所用的众测过程。Fig. 2 is the crowd testing process used by the present invention.

图3为本发明所用多智能体协同行为决策模型研究框架。Fig. 3 is the research framework of the multi-agent cooperative behavior decision-making model used in the present invention.

具体实施方式Detailed ways

下面结合附图和具体实施例，进一步阐明本发明，应理解这些实例仅用于说明本发明而不用于限制本发明的范围，在阅读了本发明之后，本领域技术人员对本发明的各种等价形式的修改均落于本申请所附权利要求所限定的范围。Below in conjunction with accompanying drawing and specific embodiment, further illustrate the present invention, should be understood that these examples are only for illustrating the present invention and are not intended to limit the scope of the present invention, after having read the present invention, those skilled in the art will understand various aspects of the present invention Modifications of the valence form all fall within the scope defined by the appended claims of the present application.

一种基于MAS-Q-Learing的任务分配方法，如图1-3所示，包括如下步骤：A task assignment method based on MAS-Q-Learing, as shown in Figure 1-3, includes the following steps:

步骤1，数据采集：获取真实应用场景中的用户数据，用户数据包括用户产生的具有状态集、动作函数、选择概率和奖励函数的数据，这四类数据不能存在任何缺失。Step 1. Data collection: Acquire user data in real application scenarios. User data includes data generated by users with state sets, action functions, selection probabilities, and reward functions. These four types of data cannot be missing.

步骤2中数据预处理方法如下：The data preprocessing method in step 2 is as follows:

步骤4，多智能体系统建模：本发明采用拉普拉斯矩阵用于描述各个智能体成员之间的关联关系，目的是构建一个多智能体系统内部各成员智能体进行信息交互的机制以及对应的拓扑模型，以此降低复杂问题的求解难度。Step 4, multi-agent system modeling: the present invention adopts Laplacian matrix to describe the association relationship between each agent member, and the purpose is to build a mechanism for information interaction between each member agent in a multi-agent system and The corresponding topological model can reduce the difficulty of solving complex problems.

所述步骤6中多属性决策阶段方法如下：在转移概率模型未知的条件下求解马尔科夫决策过程问题。设定状态(S)，动作(A)，奖励函数(r)，转移概率(p)，其马尔科夫性为p(s_t+1|s₀,a₀,…,s_t,a_t)＝p(s_t+1|s_t,a_t)，其中s_t表示在t时间的状态，a_t表示在t时间的行为；模型的优化目标为

a_t～π(·|s_t),t＝0,...T-1。利用强化学习方法在p(s_t+1|s_t,a_t)未知情况下求解马尔科夫决策过程问题，采用时间差分方法估计动作-值函数。在该研究框架下，对于智能体状态进行了设计，满足合理性、完整性等条件。完整性要求状态包含了智能体决策需要的所有信息，比如智能体的轨迹追踪问题中，需要加入目标轨迹的趋势信息，但是如果这一信息无法观测，则需要扩充状态包含历史的观测值。对于智能体的动作进行了设计，根据施加控制量的数值特点设计离散或连续的动作值。The method in the multi-attribute decision-making stage in step 6 is as follows: solve the Markov decision process problem under the condition that the transition probability model is unknown. Set the state (S), action (A), reward function (r), transition probability (p), and its Markov property is p(s _t+1 |s ₀ ,a ₀ ,…,s _t ,a _t )=p(s _t+1 |s _t , a _t ), where s _t represents the state at time t, and a _t represents the behavior at time t; the optimization objective of the model is

a _t ～π(·|s _t ), t=0,...T-1. The reinforcement learning method is used to solve the Markov decision process problem when p(s _t+1 |s _t , a _t ) is unknown, and the time difference method is used to estimate the action-value function. Under this research framework, the state of the agent is designed to meet the conditions of rationality and completeness. Integrity requires that the state contains all the information needed for the agent's decision-making. For example, in the trajectory tracking problem of the agent, the trend information of the target trajectory needs to be added, but if this information cannot be observed, the state needs to be expanded to include historical observations. The action of the agent is designed, and discrete or continuous action values are designed according to the numerical characteristics of the applied control amount.

实际部署中，该方法的适用不能一劳永逸，需要根据用户的决策集，行动集等数据的不同而做相应调整。In actual deployment, the application of this method cannot be done once and for all, and it needs to be adjusted according to the user's decision set, action set and other data.

综上所述，本发明设计了人群测试过程中的多属性决策机制。我们选择Q值学习作为训练算法，并优化不完美信息共享机制的设计。通过不同的不完美信息共享场景，以及不同的伽玛值和数据集，我们对训练结果进行了分析，实验证明，我们的方法在运行到第50轮时可以收敛，这说明在收敛速度以及稳定性方面，本算法具有一定的优越性，效果较好，证明本发明设计的系统具有良好的鲁棒性和适应性，本发明提出的方法和模型具有一定的适用性。对相关领域的未来研究具有参考价值。To sum up, the present invention designs a multi-attribute decision-making mechanism in the crowd testing process. We choose Q-value learning as the training algorithm and optimize the design of the imperfect information sharing mechanism. Through different imperfect information sharing scenarios, as well as different gamma values and data sets, we have analyzed the training results. Experiments have proved that our method can converge when it runs to the 50th round, which shows that the convergence speed and stability In terms of performance, this algorithm has certain advantages, and the effect is better, which proves that the system designed by the present invention has good robustness and adaptability, and the method and model proposed by the present invention have certain applicability. It has reference value for future research in related fields.

以上所述仅是本发明的优选实施方式，应当指出：对于本技术领域的普通技术人员来说，在不脱离本发明原理的前提下，还可以做出若干改进和润饰，这些改进和润饰也应视为本发明的保护范围。The above is only a preferred embodiment of the present invention, it should be pointed out that for those of ordinary skill in the art, without departing from the principle of the present invention, some improvements and modifications can also be made, and these improvements and modifications are also possible. It should be regarded as the protection scope of the present invention.

Claims

1. The MAS-Q-learning-based task allocation method is characterized by comprising the following steps:

step 1, data acquisition: acquiring user data in a real application scene, wherein the user data comprises data with a state set, an action function, a selection probability and a reward function, which are generated by a user;

step 2, data preprocessing: modeling the user data obtained in the step 1 by adopting a Markov decision, carrying out normalization processing on capability data of crowdsourcing personnel aiming at different types of tasks, designing the crowdsourcing personnel into an agent quintuple, and calculating global benefits of the crowdsourcing personnel by a Q value learning method;

the data preprocessing method in the step 2 is as follows:

step 2 a), designing crowdsourcing personnel into an agent quintuple:

wherein->

For the status of->

For action function +.>

To select probability +.>

For discounts factor->

，/>

Is a reward function;

step 2 b), when at a certain moment

When the intelligent agent is in the state->

Selecting a policy from the set of policies and generating an action function +.>

At this time according to probability->

Transition to the next state->

And so on, after traversing the state, obtaining the global benefit of the intelligent agent;

step 3, state transition: positioning the state of the adjacent agent and the next state so as to assist the state transition of the adjacent agent by using the target estimated state of the adjacent agent; the neighbor node performs positioning and calculates by using distance observation and information transmitted by the neighbor node;

step 4, modeling a multi-agent system: the Laplace matrix is used for describing the association relation among the intelligent agent members, and the purpose is to construct a mechanism for information interaction of the intelligent agents of the members in the multi-intelligent system and a corresponding topological model, so that the solving difficulty of the complex problem is reduced;

the multi-intelligent system modeling method comprises the following steps:

step 4 a), the intelligent system comprises more than two intelligent agents, and the topological structure of the intelligent agent system is formed by

Representing, calculating to obtain a dynamic equation and an edge state definition of a single agent;

step 4 b), updating a dynamic equation of a single intelligent agent, and then calculating to obtain a corresponding incidence-degree associated matrix, thereby reasoning to obtain a Laplace matrix, establishing an information feedback model, and further obtaining information interaction feedback of the intelligent agent;

step 4 c), after an information feedback model among the intelligent agents in the multi-intelligent agent system is obtained, model reduction is carried out on the multi-intelligent agent system, and the complexity of solving is reduced based on a spanning tree sub-graph structure; performing linear transformation on the spanning tree to obtain a spanning residual tree which is used as an internal feedback item of the multi-intelligent system, and finally obtaining a reduced multi-intelligent system model;

step 5, multi-attribute decision stage: firstly, giving a decision matrix, judging whether the weight is known and determining the weight, obtaining an aggregation operator of the attribute matrix according to the attribute value of the decision matrix, selecting a corresponding multi-attribute decision method to calculate according to a solving target and the form of the decision matrix, distributing and aggregating the weight of a calculation result, and deciding according to the scoring condition of each scheme;

the multi-attribute decision stage method is as follows: solving the Markov decision process problem under the condition that the transition probability model is unknown, and setting the state

Action->

Reward function->

Transition probability->

Its Markov is

Wherein s is _t State at time t +.>

Representing behavior at time t; the optimization goal of the model is->

，/>

Representing a constant->

Is indicated in the state->

Probability under +.>

Solving a Markov decision process problem under the unknown condition, and estimating an action-value function by adopting a time difference method;

step 6, method optimization stage: and estimating the action-value function by adopting a time difference method, and simultaneously providing an agent state function meeting the conditions of rationality and integrity.

2. The MAS-Q-learning based task allocation method according to claim 1, wherein: the state transition method in the step 3 is as follows:

step 3 a), firstly deducing the Euclidean distance of the intelligent agent relative to the adjacent intelligent agent to obtain the intelligent agent

In the intelligent body

The relative estimated position of the lower local coordinate system is used for obtaining distance observation;

and 3 b) positioning the neighbor node by utilizing the distance observation obtained in the step 3 a) and the information transmitted by the neighbor node.

3. The MAS-Q-learning based task allocation method according to claim 2, wherein: the agent status meeting the integrity condition includes all information needed for agent decision making.

4. The MAS-Q-learning based task allocation method according to claim 3, wherein: discrete or continuous motion values are designed for the motion of the agent according to the numerical characteristics of the applied control quantity.