CN113671827B - Dynamic bipartite graph allocation length decision method based on recurrent neural network and reinforcement learning

Dynamic bipartite graph allocation length decision method based on recurrent neural network and reinforcement learning

Info

Publication number
CN113671827B
CN113671827B (application CN202110820657.2A)
Authority
CN
China
Prior art keywords
worker
task
available
allocation
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110820657.2A
Other languages
Chinese (zh)
Other versions
CN113671827A (en)
Inventor
陈荣
刘岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University
Priority to CN202110820657.2A
Publication of CN113671827A
Application granted
Publication of CN113671827B
Legal status: Active

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04 Adaptive control systems involving the use of models or simulators
    • G05B13/042 Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a dynamic bipartite graph allocation length decision method based on a recurrent neural network and reinforcement learning, which comprises the following steps. S1: judging whether a worker or a task arrives at the current moment; S2: if a worker arrives, adding it to the available worker set, and if a task arrives, adding it to the available task set; S3: acquiring the currently available worker set and available task set according to the time information; S4: reading the parameter information of the worker set and the task set; S5: inputting the worker parameters and the task parameters into a reinforcement learning network; S6: if the network decides not to allocate, jumping directly to S8, and if it decides to allocate, performing S7; S7: performing task allocation on the current available worker set and available task set with the Hungarian algorithm, and recording the allocation reward; S8: removing expired workers and tasks from the available worker set W and the available task set T, and recording the expiration penalty; S9: training the reinforcement learning network according to the obtained rewards and penalties, entering the next moment, and returning to S1 to wait for a newly arriving worker or task.

Description

Dynamic bipartite graph allocation length decision method based on recurrent neural network and reinforcement learning
Technical Field
The invention relates to the technical field of dynamic task allocation, and in particular to a dynamic bipartite graph allocation length decision method based on a recurrent neural network and reinforcement learning.
Background
The reason the allocation length must be decided during dynamic bipartite graph matching is as follows. In the dynamic setting, workers and tasks arrive and leave at random; tasks differ in difficulty and workers differ in ability, and the task completion rate is computed from worker ability and task difficulty. Greedily allocating as soon as a worker or task arrives makes it hard to maximize the overall completion rate, while waiting until workers and tasks reach a fixed number before allocating lets large numbers of workers and tasks expire, which also degrades the result.
The closest prior art is the RQL (Restricted Q-learning) method proposed by Wang Y. et al. [1] in 2019, which uses Q-learning to make dynamic decisions on the allocation length for dynamic bipartite graphs. Specifically, the rewards obtained by selecting different actions in the current worker-task state are used to continually update Q values via the Bellman equation, and the Q values are stored in a Q table; the action with the largest Q value in a given state is the current optimal action. The states and actions are defined as follows. The state would naturally comprise the times, difficulties, and abilities of the workers and tasks, but this makes the state space too large for the Q table to record, so the state space is reduced: the current numbers of workers and tasks form a pair that serves as the current state. The action is the target allocation length, and allocation is performed when the number of current workers and tasks reaches the target length. The allocation reward is the total completion rate obtained by allocating with the Hungarian algorithm. The state transition adds to the current state the numbers of workers and tasks arriving at the next moment.
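For concreteness, the tabular update described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the hyperparameters and the epsilon-greedy policy are assumptions, and the state is the reduced (worker count, task count) pair used by RQL.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed hyperparameters

# Q table keyed by (state, action): the state is the reduced pair
# (number of workers, number of tasks); the action is a target
# allocation length.
Q = defaultdict(float)

def choose_length(state, lengths):
    """Epsilon-greedy choice of the target allocation length."""
    if random.random() < EPSILON:
        return random.choice(lengths)
    return max(lengths, key=lambda a: Q[(state, a)])

def bellman_update(state, action, reward, next_state, lengths):
    """One Bellman update: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in lengths)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```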
The allocation-length decision can be made with this prior method, but it has several drawbacks. Workers and tasks carry time constraints and expire once these are exceeded, yet few existing studies can be applied directly and effectively to dynamic bipartite graph matching. The reward calculation in the RQL method does not account for expired worker tasks; expirations should be penalized in the reward to prevent the allocation length from growing too large. Moreover, when only the numbers of workers and tasks are considered and there is no expiration penalty, the method tends to select longer allocation lengths within the range of selectable lengths, making dynamic decisions for different situations difficult.
In that method, the worker-ability and task-difficulty parameters are used only in the allocation stage, where the Hungarian algorithm runs after the decision to allocate has been made; in practice, however, worker ability and task difficulty strongly influence the allocation-length decision itself. Consider a case in which the workers and tasks are still far from expiring, worker ability is generally low, and task difficulty is generally high: the overall completion rate of an immediate allocation would be markedly lower, so the allocation length should be longer.
Disclosure of Invention
In view of the problems in the prior art, the invention discloses a dynamic bipartite graph allocation length decision method based on a recurrent neural network and reinforcement learning, which specifically comprises the following steps:
S1: judging whether a worker or a task arrives at the current moment; if not, ending directly; if so, entering S2;
S2: if the arriving entity is a worker, adding it to the available worker set W; if it is a task, adding it to the available task set T;
S3: acquiring the currently available worker set W = {w_1, w_2, ..., w_n} and the available task set T = {t_1, t_2, ..., t_m} according to the time information;
S4: reading parameter information of a worker set and a task set;
wherein the worker set parameters are P_W = {(a_w1, u_w1), (a_w2, u_w2), ..., (a_wn, u_wn)}, comprising every worker's ability and remaining-time urgency, the urgency being the ratio of the remaining time to the total duration, u_w = (e_w - c) / (e_w - s_w);
the task set parameters are P_T = {(d_t1, u_t1), (d_t2, u_t2), ..., (d_tm, u_tm)}, comprising every task's difficulty and remaining-time urgency, u_t = (e_t - c) / (e_t - s_t);
S5: inputting the worker parameters and the task parameters into a reinforcement learning network, outputting a distributed Q value and a non-distributed Q value by the reinforcement learning network, and selecting actions corresponding to the larger Q value to execute at the current moment;
S6: if the selected action is not to allocate, jumping directly to S8; if it is to allocate, performing S7;
S7: performing task allocation on the current available worker set W = {w_1, w_2, ..., w_n} and the available task set T = {t_1, t_2, ..., t_m} with the Hungarian algorithm, and recording the allocation reward;
S8: removing expired workers and tasks from the available worker set W and the available task set T, and recording the expiration penalty;
S9: training the reinforcement learning network according to the obtained rewards and penalties, entering the next moment, and returning to S1 to wait for a newly arriving worker or task.
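The S1-S9 loop can be summarized in code. The sketch below is illustrative only: `event_stream`, the `agent` object, and the helpers `encode_state` and `hungarian_assign` (both sketched later in this description) are assumed names, not the patent's implementation, and the allocation-cost term is omitted for brevity.

```python
def run_episode(event_stream, agent, horizon):
    """One pass over S1-S9. `event_stream` maps a moment c to the
    workers/tasks arriving then; `agent` is a DQN-style learner."""
    W, T = set(), set()
    for c in range(horizon):
        arrivals = event_stream.get(c, [])
        if not arrivals:                        # S1: nothing arrives, skip this moment
            continue
        for x in arrivals:                      # S2: route by entity kind
            (W if x.kind == "worker" else T).add(x)
        state = encode_state(W, T, c)           # S3-S4: read set parameters
        action = agent.act(state)               # S5: pick the larger of the two Q values
        reward = 0.0
        if action == 1:                         # S6/S7: allocate with the Hungarian algorithm
            pairs, reward = hungarian_assign(W, T, c)
            for w, t in pairs:
                W.discard(w)
                T.discard(t)
        expired = {x for x in W | T if c >= x.e}   # S8: remove and penalize expirations
        W -= expired
        T -= expired
        reward -= len(expired)
        agent.train(state, action, reward, encode_state(W, T, c + 1))  # S9
```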
Further, it is judged whether a new worker or a new crowdsourcing task arrives, and its information is read. A worker is described by a quadruple w = <l_w, s_w, e_w, a_w>, where l_w is the worker's position, s_w the worker's arrival time, e_w the worker's departure time, and a_w the worker's ability. A crowdsourcing task is described by a five-tuple t = <l_t, r_t, s_t, e_t, d_t>, where l_t is the task's position, s_t the task's start time, e_t the task's deadline, r_t the task's allocation range, and d_t the task's difficulty, which is also the reward for completing the task;
Let the current time be c. If s_w ≤ c < e_w, the worker is available at the current moment, and once c ≥ e_w the worker has expired and cannot be assigned. Similarly, if s_t ≤ c < e_t, the task is available at the current moment, and once c ≥ e_t the task has expired and cannot be assigned.
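A minimal Python rendering of the quadruple and five-tuple above, together with this availability test, might look as follows; the field names mirror the symbols, and the `kind` tag is an added convenience (an assumption used by the loop sketch above).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Worker:                # quadruple <l_w, s_w, e_w, a_w>
    l: tuple                 # position
    s: int                   # arrival time
    e: int                   # departure (expiry) time
    a: float                 # ability
    kind: str = "worker"

@dataclass(frozen=True)
class Task:                  # five-tuple <l_t, r_t, s_t, e_t, d_t>
    l: tuple                 # position
    r: float                 # allocation range
    s: int                   # start time
    e: int                   # deadline
    d: float                 # difficulty (also the reward for completion)
    kind: str = "task"

def available(x, c):
    """Available at moment c iff s <= c < e; expired once c >= e."""
    return x.s <= c < x.e
```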
At each moment it is judged whether a worker or a task arrives; if not, the process ends for that moment. If the arriving entity is a worker it is added to the available worker set, otherwise to the available task set. The current available worker set is denoted W = {w_1, w_2, ..., w_n} and the current available task set T = {t_1, t_2, ..., t_m};
The allocation length is decided at each moment: the reinforcement learning network decides, from the data of the available worker set and the available task set, whether to allocate at the current moment, the decision aiming to maximize the total completion rate while minimizing the number of expired worker tasks. If allocation is performed at the current moment, the Hungarian algorithm is used to allocate the current available worker and task sets so as to maximize the total completion rate;
A bipartite graph B = <T, W, E> is constructed from the worker and task sets, where E is the set of edges, each edge representing an assignable worker-task pair whose weight is the completion rate p_ij, computed from the worker's ability and the task's difficulty (the concrete formula appears in the source only as an image);
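The maximum-total-completion-rate matching on B can be computed with SciPy's Hungarian-algorithm implementation. Since the patent gives the p_ij formula only as an image, the completion rate below is a stand-in assumption, min(ability/difficulty, 1), used purely for illustration; the allocation-range constraint r_t could additionally mask infeasible pairs but is omitted here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_assign(W, T, c):
    """Match tasks to workers maximizing the summed completion rate.
    Returns the matched (worker, task) pairs and the total rate."""
    workers, tasks = list(W), list(T)
    if not workers or not tasks:
        return [], 0.0
    p = np.zeros((len(tasks), len(workers)))
    for i, t in enumerate(tasks):
        for j, w in enumerate(workers):
            p[i, j] = min(w.a / t.d, 1.0)      # assumed completion rate (d > 0)
    rows, cols = linear_sum_assignment(-p)      # negate to maximize
    pairs = [(workers[j], tasks[i]) for i, j in zip(rows, cols)]
    return pairs, float(p[rows, cols].sum())
```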
After the allocation is completed, the reward is recorded, and the expired worker tasks and the allocated worker tasks are removed before entering the next moment.
Further, the interactive environment of the reinforcement learning network is an MDP model comprising the environment state s ∈ S, the action a ∈ A selected by the network, the reward r ∈ R returned after an action is executed, and the transfer function T: S × A → S giving the state after an action is executed;
wherein the state s is defined as the currently available task set parameters P_T = {(d_t1, u_t1), ..., (d_tm, u_tm)} together with the available worker set parameters P_W = {(a_w1, u_w1), ..., (a_wn, u_wn)}, in which u_w and u_t denote the time urgency, calculated as the ratio of the remaining time to the total duration:
u_w = (e_w - c) / (e_w - s_w),  u_t = (e_t - c) / (e_t - s_t)
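A sketch of building this state from the two parameter sets, using the urgency formula just given (the helper name and record fields are the illustrative ones introduced earlier):

```python
def encode_state(W, T, c):
    """State = task-set parameters P_T plus worker-set parameters P_W:
    for each task the pair (difficulty, urgency) and for each worker
    the pair (ability, urgency), urgency being remaining time over
    total duration."""
    p_t = [(t.d, (t.e - c) / (t.e - t.s)) for t in T]
    p_w = [(w.a, (w.e - c) / (w.e - w.s)) for w in W]
    return p_t, p_w
```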
The action a is defined over 2 different actions: 1 indicates allocation and 0 indicates no allocation;
The reward r is calculated as
r = Σ_i Σ_j p_ij · x_ij - cost - Σ 1(expired)
i.e. the Hungarian allocation reward minus the allocation cost minus the expiration penalty, where x_ij is 1 if the Hungarian allocation assigns task i to worker j and 0 otherwise, cost is the allocation cost, and 1(expired) is the Boolean indicator of expiration, equal to 1 for each expired worker or task and 0 otherwise; if no allocation is performed, the Hungarian allocation reward and the allocation cost are both 0;
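The reward computation reads directly off this formula; a sketch under the same illustrative names, where `p` is the completion-rate matrix, `x` the 0/1 assignment matrix from the Hungarian step (or None when no allocation was made), and `n_expired` the count of just-expired workers and tasks:

```python
import numpy as np

def step_reward(p, x, cost, n_expired):
    """r = sum_ij p_ij * x_ij - cost - expiration penalty; with no
    allocation, the matching reward and the cost are both zero."""
    if x is None:
        return -float(n_expired)
    return float((np.asarray(p) * np.asarray(x)).sum()) - cost - n_expired
```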
The transfer function T gives the next state as the current worker and task sets minus the expired worker tasks, minus the allocated worker tasks, plus the workers or tasks arriving at the next moment; if no allocation is performed, the allocated worker task set is the empty set.
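As set arithmetic, the transfer function is simply the following (names illustrative; `assigned` is the empty set when the action was 'do not allocate'):

```python
def transition(W, T, assigned, expired, arriving):
    """Next state: current sets minus expired, minus assigned, plus
    the workers or tasks arriving at the next moment."""
    W2 = (W - expired - assigned) | {x for x in arriving if x.kind == "worker"}
    T2 = (T - expired - assigned) | {x for x in arriving if x.kind == "task"}
    return W2, T2
```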
With the above technical scheme, the dynamic bipartite graph allocation length decision method based on a recurrent neural network and reinforcement learning takes into account, when deciding the allocation length, the information related to the completion rate: the remaining time of workers and tasks, worker ability, and task difficulty. The decision therefore weighs all parameters jointly and, compared with prior methods that consider only the respective numbers of workers and tasks, obtains the best results under a variety of conditions. Second, a DQN-based reinforcement learning method is used with an LSTM preprocessing the data, which resolves the original problem that the state space was too large to train and store and therefore had to be reduced. Finally, the reward accounts for the expiration penalty: expired worker tasks are not only removed from the state but also penalized in the reward, so the method minimizes the number of expired tasks while maximizing the total completion rate. Previous methods reflected expired worker tasks only as a reduction in the number of workers and tasks, and therefore tended toward longer allocation lengths that are difficult to adapt to varied environments.
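The network of FIG. 2 is not reproduced here, but a plausible PyTorch sketch of the described architecture is given below: an LSTM preprocesses the variable-length worker and task parameter sequences, feeding a DQN head that outputs the two Q values. All layer sizes are assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn

class LstmDqn(nn.Module):
    """LSTM encoders summarize the worker and task parameter
    sequences; an MLP head emits Q(allocate) and Q(not allocate)."""
    def __init__(self, feat=2, hidden=64):
        super().__init__()
        self.worker_enc = nn.LSTM(feat, hidden, batch_first=True)
        self.task_enc = nn.LSTM(feat, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, workers, tasks):
        # workers: (batch, n, feat) of (ability, urgency) pairs;
        # tasks:   (batch, m, feat) of (difficulty, urgency) pairs.
        _, (hw, _) = self.worker_enc(workers)
        _, (ht, _) = self.task_enc(tasks)
        return self.head(torch.cat([hw[-1], ht[-1]], dim=-1))

# Example: Q values for 3 workers and 2 tasks (batch of 1).
net = LstmDqn()
q = net(torch.rand(1, 3, 2), torch.rand(1, 2, 2))   # shape (1, 2)
```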
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the network structure of the reinforcement learning module in the method of the present invention.
Detailed Description
In order to make the technical scheme and advantages of the present invention more clear, the technical scheme in the embodiment of the present invention is clearly and completely described below with reference to the accompanying drawings in the embodiment of the present invention:
the dynamic bipartite graph allocation length decision method based on the cyclic neural network and reinforcement learning shown in fig. 1 specifically comprises the following steps:
S1: judging whether a worker or a task arrives at the current moment; if not, ending directly; if so, entering S2;
S2: if the arriving entity is a worker, adding it to the available worker set W; if it is a task, adding it to the available task set T;
S3: acquiring the currently available worker set W = {w_1, w_2, ..., w_n} and the available task set T = {t_1, t_2, ..., t_m} according to the time information;
S4: reading parameter information of a worker set and a task set;
wherein the worker set parameters are P_W = {(a_w1, u_w1), ..., (a_wn, u_wn)}, comprising every worker's ability and remaining-time urgency;
the task set parameters are P_T = {(d_t1, u_t1), ..., (d_tm, u_tm)}, comprising every task's difficulty and remaining-time urgency;
S5: inputting the worker parameters and the task parameters into the reinforcement learning network, which outputs one Q value for allocating and one Q value for not allocating; the action corresponding to the larger Q value is executed at the current moment;
S6: if the selected action is not to allocate, jumping directly to S8; if it is to allocate, performing S7;
S7: performing task allocation on the current available worker set W = {w_1, w_2, ..., w_n} and the available task set T = {t_1, t_2, ..., t_m} with the Hungarian algorithm, and recording the allocation reward;
S8: removing expired workers and tasks from the available worker set W and the available task set T, and recording the expiration penalty;
S9: training the reinforcement learning network according to the obtained rewards and penalties, entering the next moment, and returning to S1 to wait for a newly arriving worker or task.
It is judged whether a new worker or a new crowdsourcing task arrives, and its information is read. A worker is described by a quadruple w = <l_w, s_w, e_w, a_w>, where l_w is the worker's position, s_w the worker's arrival time, e_w the worker's departure time, and a_w the worker's ability. A crowdsourcing task is described by a five-tuple t = <l_t, r_t, s_t, e_t, d_t>, where l_t is the task's position, s_t the task's start time, e_t the task's deadline, r_t the task's allocation range, and d_t the task's difficulty, which is also the reward for completing the task;
Let the current time be c. If s_w ≤ c < e_w, the worker is available at the current moment, and once c ≥ e_w the worker has expired and cannot be assigned; similarly, if s_t ≤ c < e_t, the task is available at the current moment, and once c ≥ e_t the task has expired and cannot be assigned;
At each moment it is judged whether a worker or a task arrives; if not, the process ends for that moment. If the arriving entity is a worker it is added to the available worker set, otherwise to the available task set. The current available worker set is denoted W = {w_1, w_2, ..., w_n} and the current available task set T = {t_1, t_2, ..., t_m};
The allocation length is decided at each moment: the reinforcement learning network decides, from the data of the available worker set and the available task set, whether to allocate at the current moment, the decision aiming to maximize the total completion rate while minimizing the number of expired worker tasks. If allocation is performed at the current moment, the Hungarian algorithm is used to allocate the current available worker and task sets so as to maximize the total completion rate;
A bipartite graph B = <T, W, E> is constructed from the worker and task sets, where E is the set of edges, each edge representing an assignable worker-task pair whose weight is the completion rate p_ij, computed from the worker's ability and the task's difficulty (the concrete formula appears in the source only as an image);
After the allocation is completed, the reward is recorded, and the expired worker tasks and the allocated worker tasks are removed before entering the next moment.
Further, the interactive environment of the reinforcement learning network is an MDP model comprising the environment state s ∈ S, the action a ∈ A selected by the network, the reward r ∈ R returned after an action is executed, and the transfer function T: S × A → S giving the state after an action is executed;
wherein the state s is defined as the currently available task set parameters P_T = {(d_t1, u_t1), ..., (d_tm, u_tm)} together with the available worker set parameters P_W = {(a_w1, u_w1), ..., (a_wn, u_wn)}, in which u_w and u_t denote the time urgency, calculated as the ratio of the remaining time to the total duration:
u_w = (e_w - c) / (e_w - s_w),  u_t = (e_t - c) / (e_t - s_t)
The action a is defined over 2 different actions: 1 indicates allocation and 0 indicates no allocation;
The reward r is calculated as
r = Σ_i Σ_j p_ij · x_ij - cost - Σ 1(expired)
i.e. the Hungarian allocation reward minus the allocation cost minus the expiration penalty, where x_ij is 1 if the Hungarian allocation assigns task i to worker j and 0 otherwise, cost is the allocation cost, and 1(expired) is the Boolean indicator of expiration, equal to 1 for each expired worker or task and 0 otherwise; if no allocation is performed, the Hungarian allocation reward and the allocation cost are both 0;
The transfer function T gives the next state as the current worker and task sets minus the expired worker tasks, minus the allocated worker tasks, plus the workers or tasks arriving at the next moment; if no allocation is performed, the allocated worker task set is the empty set.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or modification that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention, according to the technical scheme of the invention and its inventive concept, shall be covered by the scope of protection of the present invention.

Claims (3)

1. A dynamic bipartite graph allocation length decision method based on a recurrent neural network and reinforcement learning, characterized by comprising the following steps:
S1: judging whether a worker or a task arrives at the current moment; if not, ending directly; if so, entering S2;
S2: if the arriving entity is a worker, adding it to the available worker set W; if it is a task, adding it to the available task set T;
S3: acquiring the currently available worker set W = {w_1, w_2, ..., w_n} and the available task set T = {t_1, t_2, ..., t_m} according to the time information;
S4: reading parameter information of a worker set and a task set;
wherein the worker set parameters are P_W = {(a_w1, u_w1), ..., (a_wn, u_wn)}, comprising every worker's ability and remaining-time urgency;
the task set parameters are P_T = {(d_t1, u_t1), ..., (d_tm, u_tm)}, comprising every task's difficulty and remaining-time urgency;
S5: inputting the worker parameters and the task parameters into the reinforcement learning network, which outputs one Q value for allocating and one Q value for not allocating; the action corresponding to the larger Q value is executed at the current moment;
S6: if the selected action is not to allocate, jumping directly to S8; if it is to allocate, performing S7;
S7: performing task allocation on the current available worker set W = {w_1, w_2, ..., w_n} and the available task set T = {t_1, t_2, ..., t_m} with the Hungarian algorithm, and recording the allocation reward;
S8: removing expired workers and tasks from the available worker set W and the available task set T, and recording the expiration penalty;
S9: training the reinforcement learning network according to the obtained rewards and penalties, entering the next moment, and returning to S1 to wait for a newly arriving worker or task.
2. The dynamic bipartite graph allocation length decision method according to claim 1, wherein: it is judged whether a new worker or a new crowdsourcing task arrives, and its information is read; a worker is described by a quadruple w = <l_w, s_w, e_w, a_w>, where l_w is the worker's position, s_w the worker's arrival time, e_w the worker's departure time, and a_w the worker's ability; a crowdsourcing task is described by a five-tuple t = <l_t, r_t, s_t, e_t, d_t>, where l_t is the task's position, s_t the task's start time, e_t the task's deadline, r_t the task's allocation range, and d_t the task's difficulty, which is also the reward for completing the task;
Let the current time be c; if s_w ≤ c < e_w, the worker is available at the current moment, and once c ≥ e_w the worker has expired and cannot be assigned; similarly, if s_t ≤ c < e_t, the task is available at the current moment, and once c ≥ e_t the task has expired and cannot be assigned;
At each moment it is judged whether a worker or a task arrives; if not, the process ends for that moment; if the arriving entity is a worker it is added to the available worker set, otherwise to the available task set; the current available worker set is denoted W = {w_1, w_2, ..., w_n} and the current available task set T = {t_1, t_2, ..., t_m};
The allocation length is decided at each moment: the reinforcement learning network decides, from the data of the available worker set and the available task set, whether to allocate at the current moment, the decision aiming to maximize the total completion rate while minimizing the number of expired worker tasks; if allocation is performed at the current moment, the Hungarian algorithm is used to allocate the current available worker and task sets so as to maximize the total completion rate;
A bipartite graph B = <T, W, E> is constructed from the worker and task sets, where E is the set of edges, each edge representing an assignable worker-task pair whose weight is the completion rate p_ij, computed from the worker's ability and the task's difficulty (the concrete formula appears in the source only as an image);
After the allocation is completed, the reward is recorded, and the expired worker tasks and the allocated worker tasks are removed before entering the next moment.
3. The dynamic bipartite graph allocation length decision method according to claim 1, wherein: the interactive environment of the reinforcement learning network is an MDP model comprising the environment state s ∈ S, the action a ∈ A selected by the network, the reward r ∈ R returned after an action is executed, and the transfer function T: S × A → S giving the state after an action is executed;
wherein the state s is defined as the currently available task set parameters P_T = {(d_t1, u_t1), ..., (d_tm, u_tm)} together with the available worker set parameters P_W = {(a_w1, u_w1), ..., (a_wn, u_wn)}, in which u_w and u_t denote the time urgency, calculated as
u_w = (e_w - c) / (e_w - s_w),  u_t = (e_t - c) / (e_t - s_t)
meaning the ratio of the remaining time to the total duration;
action
Figure FDA0003171875150000033
Defined as 2 different actions, 1 indicating allocation and 0 indicating no allocation;
The reward r is calculated as
r = Σ_i Σ_j p_ij · x_ij - cost - Σ 1(expired)
i.e. the Hungarian allocation reward minus the allocation cost minus the expiration penalty, where x_ij is 1 if the Hungarian allocation assigns task i to worker j and 0 otherwise, cost is the allocation cost, and 1(expired) is the Boolean indicator of expiration, equal to 1 for each expired worker or task and 0 otherwise; if no allocation is performed, the Hungarian allocation reward and the allocation cost are both 0;
The transfer function T gives the next state as the current worker and task sets minus the expired worker tasks, minus the allocated worker tasks, plus the workers or tasks arriving at the next moment; if no allocation is performed, the allocated worker task set is the empty set.
CN202110820657.2A 2021-07-20 2021-07-20 Dynamic bipartite graph allocation length decision method based on recurrent neural network and reinforcement learning Active CN113671827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110820657.2A CN113671827B (en) 2021-07-20 2021-07-20 Dynamic bipartite graph allocation length decision method based on recurrent neural network and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110820657.2A CN113671827B (en) 2021-07-20 2021-07-20 Dynamic bipartite graph allocation length decision method based on recurrent neural network and reinforcement learning

Publications (2)

Publication Number Publication Date
CN113671827A CN113671827A (en) 2021-11-19
CN113671827B (en) 2023-06-27

Family

ID=78539669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110820657.2A Active CN113671827B (en) Dynamic bipartite graph allocation length decision method based on recurrent neural network and reinforcement learning

Country Status (1)

Country Link
CN (1) CN113671827B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110458429A (en) * 2019-07-29 2019-11-15 暨南大学 Intelligent task allocation and personnel scheduling method and system for geographical sites
EP3806007A1 (en) * 2019-10-09 2021-04-14 Bayerische Motoren Werke Aktiengesellschaft Methods, computer programs and systems for assigning vehicles to vehicular tasks and for providing a machine-learning model
CN113093727A (en) * 2021-03-08 2021-07-09 哈尔滨工业大学(深圳) Robot map-free navigation method based on deep security reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11640528B2 (en) * 2019-10-22 2023-05-02 Baidu Usa Llc Method, electronic device and computer readable medium for information processing for accelerating neural network training

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110458429A (en) * 2019-07-29 2019-11-15 暨南大学 Intelligent task allocation and personnel scheduling method and system for geographical sites
EP3806007A1 (en) * 2019-10-09 2021-04-14 Bayerische Motoren Werke Aktiengesellschaft Methods, computer programs and systems for assigning vehicles to vehicular tasks and for providing a machine-learning model
CN113093727A (en) * 2021-03-08 2021-07-09 哈尔滨工业大学(深圳) Robot map-free navigation method based on deep security reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Spatio-temporal crowdsourcing task allocation algorithm for global optimization; 聂茜婵; 张阳; 余敦辉; 张兴盛; Journal of Computer Applications (07); full text *

Also Published As

Publication number Publication date
CN113671827A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
CN109976909B (en) Learning-based low-delay task scheduling method in edge computing network
Han et al. Appointment scheduling and routing optimization of attended home delivery system with random customer behavior
CN114756358B (en) DAG task scheduling method, device, equipment and storage medium
CN112801430B (en) Task issuing method and device, electronic equipment and readable storage medium
Hildebrandt et al. Supervised learning for arrival time estimations in restaurant meal delivery
Gravin et al. Procrastination with variable present bias
Kolker Healthcare management engineering: what does this fancy term really mean?: The use of operations management methodology for quantitative decision-making in healthcare settings
CN115271130B (en) Dynamic scheduling method and system for maintenance order of ship main power equipment
CN111813524B (en) Task execution method and device, electronic equipment and storage medium
CN114546608A (en) Task scheduling method based on edge calculation
CN113671827B (en) Dynamic bipartite graph allocation length decision method based on recurrent neural network and reinforcement learning
CN110263136B (en) Method and device for pushing object to user based on reinforcement learning model
CN115562832A (en) Multi-resource service function chain scheduling method based on deep reinforcement learning
Jia et al. Dynamic container drayage with uncertain request arrival times and service time windows
CN117789945A (en) Depth reinforcement learning-based clinic service sequential scheduling decision method
WO2024011867A1 (en) Method and apparatus for optimizing scheduling table of a plurality of bus routes, and related device
CN116069473A (en) Deep reinforcement learning-based Yarn cluster workflow scheduling method
Ahn et al. Attitude driven team formation using multi-dimensional trust
CN112559144A (en) Intelligent sharing device scheduling method facing data and information rights and interests exchange
CN115409243A (en) Operation resource allocation and task scheduling method and terminal
CN117909348B (en) Associated data scheduling and calculating method and device
CN115907441B (en) Business process simulation method and system
CN117808386B (en) All-online AGV material distribution network order distribution method, equipment and medium
Xie Decentralized and Dynamic Home Health Care Resource Scheduling Using an Agent-Based Model
Kouvelis et al. Robust discrete optimization: past successes and future challenges

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant