CN113377655B - A Method of Task Assignment Based on MAS-Q-Learing - Google Patents
- Publication number
- CN113377655B CN113377655B CN202110664158.9A CN202110664158A CN113377655B CN 113377655 B CN113377655 B CN 113377655B CN 202110664158 A CN202110664158 A CN 202110664158A CN 113377655 B CN113377655 B CN 113377655B
- Authority
- CN
- China
- Prior art keywords
- agent
- state
- intelligent
- decision
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 59
- 230000006870 function Effects 0.000 claims abstract description 29
- 230000009471 action Effects 0.000 claims abstract description 23
- 239000011159 matrix material Substances 0.000 claims abstract description 20
- 238000004364 calculation method Methods 0.000 claims abstract description 6
- 230000008901 benefit Effects 0.000 claims abstract description 5
- 230000002776 aggregation Effects 0.000 claims abstract 2
- 238000004220 aggregation Methods 0.000 claims abstract 2
- 230000007704 transition Effects 0.000 claims description 14
- 230000008569 process Effects 0.000 claims description 12
- 230000007246 mechanism Effects 0.000 claims description 8
- 238000005457 optimization Methods 0.000 claims description 7
- 230000003993 interaction Effects 0.000 claims description 6
- 238000007781 pre-processing Methods 0.000 claims description 6
- 238000005094 computer simulation Methods 0.000 claims description 5
- 230000006399 behavior Effects 0.000 claims description 3
- 230000009467 reduction Effects 0.000 claims description 3
- 230000004931 aggregating effect Effects 0.000 claims 1
- 238000010606 normalization Methods 0.000 claims 1
- 230000009466 transformation Effects 0.000 claims 1
- 238000013461 design Methods 0.000 description 9
- 238000004422 calculation algorithm Methods 0.000 description 6
- 238000011160 research Methods 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 238000012549 training Methods 0.000 description 4
- 238000012546 transfer Methods 0.000 description 4
- 238000012986 modification Methods 0.000 description 3
- 230000004048 modification Effects 0.000 description 3
- 238000013480 data collection Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 230000002787 reinforcement Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000009194 climbing Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 230000008450 motivation Effects 0.000 description 1
- 230000011273 social behavior Effects 0.000 description 1
- 238000013522 software testing Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
- G06F11/3684—Test management for test design, e.g. generating new test cases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Technical Field
The present invention relates to the field of task allocation, is mainly applied in crowdsourcing scenarios, and specifically addresses the cost-optimization problem of complex task allocation in crowdsourcing scenarios.
Background Art
The design motivation of the present invention comes from the emerging application of crowdsourced software testing. In a typical crowd-testing process, task assignment is unclear and crowdsourcing workers cannot maximize their individual benefits.
Summary of the Invention
Purpose of the invention: to avoid problems in the crowdsourcing process such as unclear task assignment and the inability of crowdsourcing workers to maximize their individual benefits, the present invention provides a task assignment method based on MAS-Q-Learing. Unlike graphs over traditional discrete data structures, the crowdsourcing process is continuous in the time dimension, so a variable and uncertain time domain is required to guide the agents. A Q-learning method is used and a knowledge-sharing mechanism is designed, which improves the robustness of the model: partial knowledge sharing is allowed among the individual agents, most of which resemble one another and influence each other through their collective state, and exploiting this interaction improves the scalability of the solution. Second, the present invention trains and solves on small-sample data; the data are trained in a semi-supervised manner to model regions of uncertainty. The model also exploits the symmetry of large multi-agent systems to reduce the task-assignment problem to a difference-of-convex-functions programming problem, which improves the convergence of the algorithm. Finally, to verify the algorithm, a simulator developed for multi-agent systems is used to transfer learning between the task-assignment problem and the hill-climbing problem, and multi-agent systems and environments of different scales are tested, showing that the algorithm of the present invention learns better than traditional multi-agent Q-learning.
Technical solution: to achieve the above object, the present invention adopts the following technical solution:
A task assignment method based on MAS-Q-Learing, comprising the following steps:
Step 1, data collection: acquire user data from real application scenarios; the user data includes data generated by users containing a state set, action functions, selection probabilities, and reward functions.
Step 2, data preprocessing: model the user data obtained in Step 1 as a Markov decision process, normalize the crowdworkers' capability data for the different task types, represent each crowdworker as an agent five-tuple, and compute their global payoff with the Q-learning method.
Step 3, state transition: locate the states and next states of neighboring agents, so that the estimated target states of the neighboring agents can assist the agent's own state transition. Neighbor nodes are located using distance observations and the information passed by the neighbor nodes.
Step 4, multi-agent system modeling: a Laplacian matrix is used to describe the relationships among the member agents, in order to construct a mechanism for information interaction among the member agents of the multi-agent system and the corresponding topological model, thereby reducing the difficulty of solving the complex problem.
The multi-agent system modeling in Step 4 is as follows:
Step 4a), the agent system comprises two or more agents; the topology of the agent system is described by a graph, from which the dynamic equation of a single agent and the edge-state definition are computed.
Step 4b), update the dynamic equation of the single agent, then compute the corresponding in-degree incidence matrix, derive the Laplacian matrix from it, and establish the information feedback model, thereby obtaining the agents' information-interaction feedback.
Step 4c), after obtaining the information feedback model among the agents of the multi-agent system, perform model-order reduction on the multi-agent system, using the spanning-tree subgraph structure to reduce the complexity of the solution. The spanning tree is linearly transformed to obtain the co-tree, which serves as the internal feedback term of the multi-agent system, finally yielding the reduced-order multi-agent system model.
Step 5, multi-attribute decision-making stage: first give the decision matrix, determine whether the weights are known and fix the weights, derive the aggregation operator of the attribute matrix from the attribute values of the decision matrix, and, according to the solution objective and the form of the decision matrix, select the corresponding multi-attribute decision-making method for the calculation; the results are then weighted and aggregated, and the decision is made according to the final score of each candidate scheme.
Step 6, method optimization stage: a temporal-difference method is used to estimate the action-value function, and an agent state function satisfying the rationality and completeness conditions is given.
Preferably, the data preprocessing in Step 2 is as follows:
Step 2a), represent each crowdworker as an agent five-tuple <S, A, P, γ, R>, where S is the state, A is the action function, P is the selection probability, γ is the discount factor with γ ∈ (0, 1), and R is the reward function.
Step 2b), at a given time t, the agent is in state S_t, selects a policy from the policy set and generates the action function A_t, then transitions to the next state S_{t+1} with probability p_t, and so on; after traversing the states, the global payoff of the agent is obtained.
Preferably, the state transition in Step 3 is as follows:
Step 3a), first derive the Euclidean distance of the agent relative to its neighboring agents, obtaining the relative estimated position of agent j in the local coordinate system of agent i, and thereby the distance observation.
Step 3b), locate the neighbor nodes using the distance observation obtained in Step 3a) and the information passed by the neighbor nodes.
Preferably, in the task assignment method based on MAS-Q-Learing according to claim 4, the multi-attribute decision-making stage in Step 6 is as follows: solve the Markov decision process problem under the condition that the transition probability model is unknown. Set the state (S), action (A), reward function (r), and transition probability (p); the Markov property is p(s_{t+1} | s_0, a_0, …, s_t, a_t) = p(s_{t+1} | s_t, a_t), where s_t denotes the state at time t and a_t the action at time t. The optimization objective of the model is a_t ~ π(·|s_t), t = 0, …, T-1, where π denotes the policy and π(·|s_t) the action distribution in state s_t. The reinforcement learning method solves the Markov decision process problem when p(s_{t+1} | s_t, a_t) is unknown, and the temporal-difference method is used to estimate the action-value function.
Preferably, an agent state that satisfies the completeness condition contains all the information the agent needs for decision-making.
Preferably, the agent's actions are designed as discrete or continuous action values according to the numerical characteristics of the applied control quantity.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention builds a multi-person model on the basis of a single-person decision-making method. Aiming at the particularity of the crowd-testing environment, the present invention designs a multi-attribute decision-making mechanism for the crowd-testing process. Q-learning is selected as the training algorithm, and the design of the imperfect-information-sharing mechanism is optimized. The training results are analyzed across different imperfect-information-sharing scenarios and different gamma values and data sets, demonstrating that the designed system has good robustness and adaptability and that the proposed method and model have a certain applicability. The work has reference value for future research in related fields, is highly practical, and is applicable to all crowdsourcing systems.
Description of the Drawings
Fig. 1 is the overall flowchart of the method of the present invention;
Fig. 2 shows the crowd-testing process used by the present invention;
Fig. 3 shows the research framework of the multi-agent cooperative behavior decision-making model used by the present invention.
Detailed Description
The present invention is further illustrated below in conjunction with the accompanying drawings and specific embodiments. It should be understood that these examples are only intended to illustrate the present invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of the present application.
A task assignment method based on MAS-Q-Learing, as shown in Figs. 1-3, comprises the following steps:
Step 1, data collection: acquire user data from real application scenarios; the user data includes data generated by users containing a state set, action functions, selection probabilities, and reward functions, and none of these four types of data may be missing.
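As a concrete illustration of Step 1, the following is a minimal sketch, in Python, of one possible user-data record together with the completeness check implied by the requirement that none of the four data types be missing; the field names and sample values are assumptions for illustration only and are not part of the patent.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class UserRecord:
    states: List[str]                 # state set S observed for this user
    actions: List[str]                # available action functions A
    selection_prob: Dict[str, float]  # selection probability P per action
    rewards: Dict[str, float]         # reward function R per action

def is_complete(record: UserRecord) -> bool:
    """Step 1 requires that none of the four data types is missing."""
    return all([record.states, record.actions,
                record.selection_prob, record.rewards])

example = UserRecord(states=["idle", "testing"],
                     actions=["claim_task", "skip"],
                     selection_prob={"claim_task": 0.7, "skip": 0.3},
                     rewards={"claim_task": 1.0, "skip": 0.0})
assert is_complete(example)
```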
Step 2, data preprocessing: model the user data obtained in Step 1 as a Markov decision process, normalize the crowdworkers' capability data for the different task types, represent each crowdworker as an agent five-tuple, and compute their global payoff with the Q-learning method.
The data preprocessing in Step 2 is as follows:
Step 2a), represent each crowdworker as an agent five-tuple <S, A, P, γ, R>, where S is the state, A is the action function, P is the selection probability, γ is the discount factor with γ ∈ (0, 1), and R is the reward function.
Step 2b), at a given time t, the agent is in state S_t, selects a policy from the policy set and generates the action function A_t, then transitions to the next state S_{t+1} with probability p_t, and so on; after traversing the states, the global payoff of the agent is obtained.
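A minimal sketch of Step 2 follows, assuming an illustrative three-state, two-action environment: the crowdworker is modeled as the five-tuple <S, A, P, γ, R>, and the global payoff is taken here to be the discounted return accumulated along one sampled trajectory. The transition and reward tables are invented for illustration and are not the patent's data.

```python
import random

S = ["s0", "s1", "s2"]                      # state set
A = ["accept", "decline"]                   # action functions
gamma = 0.9                                 # discount factor in (0, 1)
# P[s][a] -> list of (next_state, probability); R[s][a] -> reward
P = {s: {a: [("s1", 0.6), ("s2", 0.4)] for a in A} for s in S}
R = {s: {a: 1.0 if a == "accept" else 0.0 for a in A} for s in S}

def step(s, a):
    """Sample the next state according to the selection probability p_t."""
    nxt, probs = zip(*P[s][a])
    return random.choices(nxt, weights=probs)[0]

def global_payoff(policy, s="s0", horizon=10):
    """Discounted return G = sum_t gamma^t * R(s_t, a_t) along one trajectory."""
    g = 0.0
    for t in range(horizon):
        a = policy(s)
        g += (gamma ** t) * R[s][a]
        s = step(s, a)
    return g

print(global_payoff(lambda s: "accept"))
```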
Step 3, state transition: locate the states and next states of neighboring agents, so that the estimated target states of the neighboring agents can assist the agent's own state transition. Neighbor nodes are located using distance observations and the information passed by the neighbor nodes.
Step 3a), first derive the Euclidean distance of the agent relative to its neighboring agents, obtaining the relative estimated position of agent j in the local coordinate system of agent i, and thereby the distance observation.
Step 3b), locate the neighbor nodes using the distance observation obtained in Step 3a) and the information passed by the neighbor nodes.
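The sketch below illustrates one way Step 3 can combine a distance observation with the information passed by a neighbor: agent i keeps the direction of the neighbor's reported relative position but rescales it to the observed range. This correction rule is an assumption made for illustration, not the patent's exact localization formula.

```python
import numpy as np

def locate_neighbor(p_i, p_j_reported, d_obs):
    """Estimate neighbor j's position in agent i's local frame.

    p_i           -- agent i's own position estimate (2-vector)
    p_j_reported  -- position estimate communicated by neighbor j
    d_obs         -- measured Euclidean distance between i and j
    """
    rel = np.asarray(p_j_reported, float) - np.asarray(p_i, float)
    rng = np.linalg.norm(rel)
    if rng == 0.0:                       # degenerate case: no direction info
        return np.asarray(p_j_reported, float)
    rel_corrected = rel * (d_obs / rng)  # keep direction, trust the range sensor
    return np.asarray(p_i, float) + rel_corrected

est = locate_neighbor(p_i=[0.0, 0.0], p_j_reported=[3.0, 4.0], d_obs=4.5)
print(est)   # direction of the reported estimate, rescaled to length 4.5
```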
Step 4, multi-agent system modeling: the present invention uses a Laplacian matrix to describe the relationships among the member agents, in order to construct a mechanism for information interaction among the member agents of the multi-agent system and the corresponding topological model, thereby reducing the difficulty of solving the complex problem.
The multi-agent system modeling in Step 4 is as follows:
Step 4a), the agent system comprises two or more agents; the topology of the agent system is described by a graph, from which the dynamic equation of a single agent and the edge-state definition are computed.
Step 4b), update the dynamic equation of the single agent, then compute the corresponding in-degree incidence matrix, derive the Laplacian matrix from it, and establish the information feedback model, thereby obtaining the agents' information-interaction feedback.
Step 4c), after obtaining the information feedback model among the agents of the multi-agent system, perform model-order reduction on the multi-agent system, using the spanning-tree subgraph structure to reduce the complexity of the solution. The spanning tree is linearly transformed to obtain the co-tree, which serves as the internal feedback term of the multi-agent system, finally yielding the reduced-order multi-agent system model.
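As a small worked example of the graph machinery in Step 4, the sketch below builds the Laplacian matrix L = D_in − A of a three-agent directed ring from its adjacency matrix; the specific topology is an assumption used only for illustration, and the spanning-tree-based order reduction is not shown.

```python
import numpy as np

# adjacency A[i, j] = 1 if agent i receives information from agent j
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

D_in = np.diag(A.sum(axis=1))       # in-degree matrix
L = D_in - A                        # graph Laplacian used as the coupling term

# each row of L sums to zero, the standard consensus/feedback property
assert np.allclose(L.sum(axis=1), 0.0)
print(L)
```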
Step 5, multi-attribute decision-making stage: first give the decision matrix, determine whether the weights are known and fix the weights, derive the aggregation operator of the attribute matrix from the attribute values of the decision matrix, and, according to the solution objective and the form of the decision matrix, select the corresponding multi-attribute decision-making method for the calculation; the results are then weighted and aggregated, and the decision is made according to the final score of each candidate scheme.
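The following sketch illustrates Step 5 with the simplest aggregation operator, a weighted sum over a column-normalized decision matrix with known weights; the matrix entries, the weights, and the choice of simple additive weighting are assumptions for illustration, since the patent leaves the concrete multi-attribute method open.

```python
import numpy as np

# rows = candidate assignment schemes, columns = attributes (all benefit-type)
decision_matrix = np.array([[0.8, 120.0, 3.0],
                            [0.6, 150.0, 5.0],
                            [0.9,  90.0, 4.0]])
weights = np.array([0.5, 0.3, 0.2])          # assumed known, sum to 1

norm = decision_matrix / decision_matrix.max(axis=0)   # column-wise normalization
scores = norm @ weights                                # aggregation operator
best = int(np.argmax(scores))
print(scores, "-> choose scheme", best)
```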
Step 6, method optimization stage: a temporal-difference method is used to estimate the action-value function, and an agent state function satisfying the rationality and completeness conditions is given.
The multi-attribute decision-making stage in Step 6 is as follows: solve the Markov decision process problem under the condition that the transition probability model is unknown. Set the state (S), action (A), reward function (r), and transition probability (p); the Markov property is p(s_{t+1} | s_0, a_0, …, s_t, a_t) = p(s_{t+1} | s_t, a_t), where s_t denotes the state at time t and a_t the action at time t. The optimization objective of the model is a_t ~ π(·|s_t), t = 0, …, T-1. The reinforcement learning method solves the Markov decision process problem when p(s_{t+1} | s_t, a_t) is unknown, and the temporal-difference method is used to estimate the action-value function. Under this research framework, the agent state is designed to satisfy conditions such as rationality and completeness. Completeness requires that the state contain all the information the agent needs for decision-making; for example, in an agent's trajectory-tracking problem, trend information about the target trajectory needs to be included, and if this information cannot be observed, the state needs to be expanded to include historical observations. The agent's actions are designed as discrete or continuous action values according to the numerical characteristics of the applied control quantity.
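A minimal tabular sketch of the temporal-difference estimate used in Step 6 follows: the action-value function Q(s, a) is updated from sampled transitions only, without access to the transition probability p(s_{t+1}|s_t, a_t). The toy environment, the ε-greedy policy, and the hyper-parameters are assumptions for illustration.

```python
import random
from collections import defaultdict

actions = ["accept", "decline"]
Q = defaultdict(float)                        # Q[(state, action)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def env_step(s, a):
    """Unknown environment: only sampled here, never given as a model."""
    s_next = (s + 1) % 5
    reward = 1.0 if a == "accept" else 0.0
    return s_next, reward

def td_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]"""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

s = 0
for t in range(1000):
    a = random.choice(actions) if random.random() < epsilon \
        else max(actions, key=lambda b: Q[(s, b)])
    s_next, r = env_step(s, a)
    td_update(s, a, r, s_next)
    s = s_next

print(max(Q.items(), key=lambda kv: kv[1]))   # best learned state-action pair
```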
In actual deployment, applying this method is not a one-off exercise; it needs to be adjusted according to differences in the users' decision sets, action sets, and other data.
In summary, the present invention designs a multi-attribute decision-making mechanism for the crowd-testing process. Q-learning is chosen as the training algorithm, and the design of the imperfect-information-sharing mechanism is optimized. The training results are analyzed across different imperfect-information-sharing scenarios and different gamma values and data sets; experiments show that the method converges by about the 50th round, indicating that the algorithm has certain advantages in convergence speed and stability and performs well. This demonstrates that the system designed by the present invention has good robustness and adaptability, that the proposed method and model have a certain applicability, and that the work has reference value for future research in related fields.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110664158.9A CN113377655B (en) | 2021-06-16 | 2021-06-16 | A Method of Task Assignment Based on MAS-Q-Learing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110664158.9A CN113377655B (en) | 2021-06-16 | 2021-06-16 | A Method of Task Assignment Based on MAS-Q-Learing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113377655A CN113377655A (en) | 2021-09-10 |
CN113377655B true CN113377655B (en) | 2023-06-20 |
Family
ID=77574510
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110664158.9A Active CN113377655B (en) | 2021-06-16 | 2021-06-16 | A Method of Task Assignment Based on MAS-Q-Learing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113377655B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114154906B (en) * | 2021-12-10 | 2025-03-21 | 厦门兆翔智能科技有限公司 | Airport ground service intelligent scheduling method and system |
CN119273104B (en) * | 2024-12-09 | 2025-02-28 | 广东海洋大学 | Port multi-agent task allocation method and system based on reinforcement learning |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409739A (en) * | 2018-10-19 | 2019-03-01 | 南京大学 | A kind of crowdsourcing platform method for allocating tasks based on part Observable markov decision process |
WO2020092437A1 (en) * | 2018-10-29 | 2020-05-07 | Google Llc | Determining control policies by minimizing the impact of delusion |
CN111695690A (en) * | 2020-07-30 | 2020-09-22 | 航天欧华信息技术有限公司 | Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning |
CN111770454A (en) * | 2020-07-03 | 2020-10-13 | 南京工业大学 | A game method of location privacy protection and platform task assignment in mobile crowd-sensing |
CN111860649A (en) * | 2020-07-21 | 2020-10-30 | 赵佳 | Action set output method and system based on multi-agent reinforcement learning |
CN112121439A (en) * | 2020-08-21 | 2020-12-25 | 林瑞杰 | Cloud game engine intelligent optimization method and device based on reinforcement learning |
CN112598137A (en) * | 2020-12-21 | 2021-04-02 | 西北工业大学 | Optimal decision method based on improved Q-learning |
CN112801430A (en) * | 2021-04-13 | 2021-05-14 | 贝壳找房(北京)科技有限公司 | Task issuing method and device, electronic equipment and readable storage medium |
CN112884239A (en) * | 2021-03-12 | 2021-06-01 | 重庆大学 | Aerospace detonator production scheduling method based on deep reinforcement learning |
-
2021
- 2021-06-16 CN CN202110664158.9A patent/CN113377655B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409739A (en) * | 2018-10-19 | 2019-03-01 | 南京大学 | A kind of crowdsourcing platform method for allocating tasks based on part Observable markov decision process |
WO2020092437A1 (en) * | 2018-10-29 | 2020-05-07 | Google Llc | Determining control policies by minimizing the impact of delusion |
CN111770454A (en) * | 2020-07-03 | 2020-10-13 | 南京工业大学 | A game method of location privacy protection and platform task assignment in mobile crowd-sensing |
CN111860649A (en) * | 2020-07-21 | 2020-10-30 | 赵佳 | Action set output method and system based on multi-agent reinforcement learning |
CN111695690A (en) * | 2020-07-30 | 2020-09-22 | 航天欧华信息技术有限公司 | Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning |
CN112121439A (en) * | 2020-08-21 | 2020-12-25 | 林瑞杰 | Cloud game engine intelligent optimization method and device based on reinforcement learning |
CN112598137A (en) * | 2020-12-21 | 2021-04-02 | 西北工业大学 | Optimal decision method based on improved Q-learning |
CN112884239A (en) * | 2021-03-12 | 2021-06-01 | 重庆大学 | Aerospace detonator production scheduling method based on deep reinforcement learning |
CN112801430A (en) * | 2021-04-13 | 2021-05-14 | 贝壳找房(北京)科技有限公司 | Task issuing method and device, electronic equipment and readable storage medium |
Non-Patent Citations (5)
Title |
---|
Learning Task Allocation for Multiple Flows in Multi-Agent Systems; Zheng Zhao; 2009 International Conference on Communication Software and Networks; 1-9 *
A Reputation-Based Reconnection Strategy in Distributed Task Allocation; Zhang Lei; Journal of Guangxi University (Natural Science Edition); 645-648 *
Cloudlet-Based Load Distribution and Resource Allocation for Mobile Cloud Platforms; Zheng Xiaojie; China Master's Theses Full-Text Database, Information Science and Technology; I139-142 *
A Decision Model for Maneuvering Agents Based on Fuzzy Markov Theory; Yang Ping, Bi Yiming, Liu Weidong; Systems Engineering and Electronics (No. 3); 1-5 *
Task Allocation Strategy for Spatial Crowdsourcing Based on Deep Reinforcement Learning; Ni Zhiwei; Pattern Recognition and Artificial Intelligence; 191-205 *
Also Published As
Publication number | Publication date |
---|---|
CN113377655A (en) | 2021-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Learning scheduling policies for multi-robot coordination with graph attention networks | |
CN109657868B (en) | Probability planning and identifying method for task time sequence logic constraint | |
CN113377655B (en) | A Method of Task Assignment Based on MAS-Q-Learing | |
Hu et al. | A review of research on reinforcement learning algorithms for multi-agents | |
CN104408518A (en) | Method of learning and optimizing neural network based on particle swarm optimization algorithm | |
CN113141012A (en) | Power grid power flow regulation and control decision reasoning method based on deep deterministic strategy gradient network | |
CN118643995A (en) | Battlefield situation deduction method and system based on hierarchical neural network | |
Chhabra et al. | Optimizing design parameters of fuzzy model based cocomo using genetic algorithms | |
CN119494522A (en) | Employee job matching and deployment method and system based on artificial intelligence | |
Abed-Alguni | Cooperative reinforcement learning for independent learners | |
Qiao et al. | Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model | |
CN111737826A (en) | Rail transit automatic simulation modeling method and device based on reinforcement learning | |
Zhou et al. | Continuous patrolling in uncertain environment with the UAV swarm | |
Akhtar | Perceptual evolution for software project cost estimation using ant colony system | |
Shi et al. | A dynamic novel approach for bid/no-bid decision-making | |
Gao et al. | A survey of Markov model in reinforcement learning | |
CN118336824A (en) | A multi-agent partition control method based on state-behavior correlation characteristics | |
Dhiman et al. | A review of path planning and mapping technologies for autonomous mobile robot systems | |
Zhang et al. | Mobile robot localization based on gradient propagation particle filter network | |
CN116681157A (en) | Power load multi-step interval prediction method based on prediction interval neural network | |
CN116128028A (en) | An Efficient Deep Reinforcement Learning Algorithm for Combinatorial Optimization of Continuous Decision Spaces | |
Gan et al. | Heterogeneous agent cooperative planning based on q-learning | |
CN115686076A (en) | Unmanned aerial vehicle path planning method based on incremental development depth reinforcement learning | |
Zhang et al. | Multiexperience-Assisted Efficient Multiagent Reinforcement Learning | |
Aikhuele et al. | Dynamic decision-making method for design concept evaluation based on sustainability criteria |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |