CN116703047A - Task allocation method, system and medium based on deep reinforcement learning

Info

Publication number
CN116703047A
Authority
CN
China
Prior art keywords
subtasks
subtask
task
work
space
Prior art date
Legal status
Pending
Application number
CN202210176123.5A
Other languages
Chinese (zh)
Inventor
熊志华
陈昊
徐嘉路
徐斌
Current Assignee
BMW Brilliance Automotive Ltd
Original Assignee
BMW Brilliance Automotive Ltd
Priority date
Filing date
Publication date
Application filed by BMW Brilliance Automotive Ltd filed Critical BMW Brilliance Automotive Ltd
Priority to CN202210176123.5A
Publication of CN116703047A
Legal status: Pending

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
          • G06Q10/00 Administration; Management
            • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
            • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
              • G06Q10/063 Operations research, analysis or management
                • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
                  • G06Q10/06311 Scheduling, planning or task assignment for a person or group
                    • G06Q10/063112 Skill-based matching of a person or a group to a task
          • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
            • G06Q50/04 Manufacturing

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Marketing (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Manufacturing & Machinery (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • General Factory Administration (AREA)

Abstract

The invention discloses a task allocation method, system and medium based on deep reinforcement learning, for allocating tasks among at least two classes of work units. The method comprises the following steps: decomposing the task into a plurality of subtasks such that each subtask can be performed by a single work unit; determining the type of each subtask, wherein each class of work unit can execute at least one type of subtask; determining the execution duration of each subtask; determining the precedence relationships between the subtasks; generating a task pool according to the types of the subtasks, the execution durations of the subtasks and the precedence relationships between the subtasks; generating a work pool for recording information of the subtasks allocated to each work unit; constructing a state space, an action space and a reward space according to a Markov decision process; and performing deep reinforcement learning according to the state space, the action space and the reward space, and determining the task allocation scheme with the maximum total reward as the optimal allocation scheme.

Description

Task allocation method, system and medium based on deep reinforcement learning
Technical Field
The present disclosure relates to a task allocation method, system and medium based on deep reinforcement learning.
Background
In the field of industrial manufacturing, a series of tasks such as manufacturing or assembly needs to be performed by a plurality of workers and/or a plurality of robots. Furthermore, workers and robots may also be required to cooperate to perform tasks, i.e., human-machine cooperation. Therefore, it is necessary to distribute tasks among a plurality of workers and/or a plurality of robots.
Task allocation methods in the prior art are mainly aimed at simple task structures and task spaces (for example, simple task allocation between 1 worker and 1 robot), and task allocation is realized by manually optimizing the task sequence through mathematical methods. However, with the rapid development of intelligent manufacturing, tasks involve more and more participants, increasingly complex human-machine collaboration penetrates various industrial applications, and the task structure and task space become increasingly complex (for example, complex task allocation among multiple workers and multiple robots), so that conventional task allocation optimization methods are difficult to apply.
Accordingly, there is a need for an improved task allocation method to more rationally allocate work tasks, thereby achieving an improved work order strategy and increasing work efficiency.
Disclosure of Invention
In view of the above technical problems, the invention provides a task allocation method, a system and a medium based on deep reinforcement learning.
According to one aspect of the present disclosure, there is provided a task allocation method based on deep reinforcement learning for allocating tasks between at least two classes of work units, the method comprising: decomposing the task into a plurality of subtasks such that each subtask can be performed by a single work unit; determining the type of each subtask, wherein each class of work unit can execute at least one type of subtask; determining the execution duration of each subtask, the execution duration of a subtask representing the time taken from starting to execute the subtask to completing the subtask; determining the precedence relationships between the subtasks; generating a task pool according to the types of the subtasks, the execution durations of the subtasks and the precedence relationships between the subtasks; generating a work pool for recording information of the subtasks allocated to each work unit; constructing a state space, an action space and a reward space according to a Markov decision process, wherein the state space comprises the current state of the task pool and the current state of the work pool, the action space comprises actions of allocating subtasks to work units for execution, and the reward space comprises rewards obtained by performing the task allocation actions; and performing deep reinforcement learning according to the state space, the action space and the reward space, and determining the task allocation scheme with the maximum total reward as the optimal allocation scheme.
According to another aspect of the present disclosure, there is provided a task allocation system based on deep reinforcement learning, including: at least one processor; and at least one storage device storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method according to the invention.
According to a further aspect of the present disclosure, there is provided a computer program product comprising instructions which, when executed by a processor, cause the method according to the present invention to be performed.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium, characterized in that instructions are stored, which when executed by a processor, cause the execution of the method according to the present invention.
Other features of the present invention and its advantages will become more apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a flowchart of a task allocation method based on deep reinforcement learning according to an exemplary embodiment of the present invention.
FIG. 2 shows a schematic diagram of the types of subtasks according to an exemplary embodiment of the invention;
FIG. 3 illustrates a schematic diagram of subtask duration in accordance with an exemplary embodiment of the present invention;
FIG. 4 shows a schematic diagram of a subtask contact matrix in accordance with an exemplary embodiment of the present invention;
FIG. 5A shows a schematic diagram of a task pool according to an exemplary embodiment of the invention.
FIG. 5B shows a schematic diagram of a filled task pool, according to an exemplary embodiment of the invention.
FIG. 6 shows a schematic diagram of a working pool according to an exemplary embodiment of the invention;
FIG. 7 illustrates a graphical representation of a state space according to an exemplary embodiment of the present invention;
FIG. 8 illustrates a graphical representation of an action space according to an exemplary embodiment of the present invention;
FIG. 9 shows a schematic diagram of the output optimal allocation scheme according to an exemplary embodiment of the present invention.
FIG. 10 illustrates an exemplary configuration of a server and/or terminal device in which embodiments in accordance with the present invention may be implemented.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the embodiments and is provided in the context of a particular system and its requirements. Various modifications will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and systems without departing from the spirit or scope of the described embodiments. Thus, the embodiments are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.
Embodiments of the present disclosure relate to collaborative execution of one or more tasks by various classes of work units. In the field of industrial manufacturing (e.g., automotive production, household appliance production, furniture production, etc.), the task may be the assembly of a product or an action that needs to be performed during manufacturing. For example, for an assembly process, tasks that need to be performed may include handling of the workpiece, positioning of the workpiece, placement and fastening of fasteners (e.g., bolts and rivets), welding, cleaning, painting, and the like. For example, for a manufacturing process, tasks that need to be performed may include handling of materials, processing of materials and intermediate pieces, handling of intermediate pieces and end products, and the like. Those skilled in the art will recognize that tasks in the manufacturing and other fields that need to be distributed among and performed cooperatively by a plurality of work units are within the scope of the present application and will not be described in further detail herein.
According to the present disclosure, work units refer to individuals performing tasks, which may belong to different categories. For example, in the field of industrial manufacturing, the category of the work units may include, for example, workers, robots, workers operating robots, transportation devices, and the like. Different classes of work units may have different work capabilities, so that different types of tasks may be performed. For example, work units of the worker class are more adept at accomplishing fine tasks and/or complex tasks so that bolts and shims may be conveniently inserted into corresponding screw holes, for example, while work units of the robot class are more suited for high precision, long time, high repeatability and/or heavy tasks so that, for example, heavy mechanical wrenches may be operated to fasten a large number of bolts in place. Those skilled in the art will appreciate that some types of tasks may be accomplished only by worker-class work units and some types of tasks may be accomplished only by robot-class work units. In addition, because of the overlapping capabilities of the work units, some types of tasks may be performed by work units of the worker class or by work units of the robot class. In this case, the types of tasks that can be performed by the different classes of work units may overlap each other.
Furthermore, it will be appreciated by those skilled in the art that workers may be further divided into different categories based on the training they have received, the skills they possess, their physical capabilities, etc. For example, workers may be classified into workers capable of performing a painting task and workers incapable of performing a painting task according to whether they have mastered painting skills. Similarly, robots may be further divided into categories depending on their configuration and their working capabilities.
Thus, according to embodiments of the present disclosure, a category of individuals may perform at least one type of task.
According to embodiments of the present disclosure, the tasks that need to be allocated are tasks that are allocated among multiple different classes of work units. The task allocation between the same class of work units is often relatively simple because the same class of work units have similar capabilities and the same type of task can be performed, so it is often only necessary to consider whether each work unit is free or not when allocating tasks between the same class of work units. However, when tasks are distributed among a plurality of different types of work units, since the types of tasks that can be performed by the different types of work units differ and may also overlap each other, the types of the work units also need to be considered when tasks are distributed, and it is more difficult to optimize the task distribution and improve the work order policy.
In view of the above, the present application proposes a task allocation method based on deep reinforcement learning, which can allocate tasks between at least two classes of work units.
In the following embodiments of the present disclosure, for convenience of description, description will be given taking an example in which a work unit includes two categories (i.e., a worker category and a robot category). The letter "H" indicates a work unit classified as a worker, and the letter "R" indicates a work unit classified as a robot.
FIG. 1 illustrates a flowchart of a task allocation method based on deep reinforcement learning according to an exemplary embodiment of the present application.
First, in step S100, a task is decomposed into a plurality of subtasks. The tasks to be allocated may be a plurality of tasks associated with each other or a large-scale task, depending on the actual situation. By task decomposition, the task to be allocated is decomposed into a plurality of subtasks, so that each subtask can be executed by a single work unit, thereby facilitating allocation of the subtask to the respective work units for execution.
For example, for a task of bolting two workpieces, the task may be broken down into multiple sub-tasks of positioning the workpieces, placing bolts, fastening bolts, changing the direction of the workpieces, etc., such that each sub-task may be accomplished by a single worker or a single robot. The N sub-tasks obtained through decomposition are numbered from 1 to N in sequence, so that a decomposed sub-task list is obtained.
In the following specific embodiments of the present application, 40 subtasks obtained by decomposition will be described as an example.
Those skilled in the art will appreciate that in the case where the task to be allocated is already a plurality of tasks that can be executed by a single work unit, respectively, the decomposition of the task may have been completed in advance or the decomposition of the task may not be required. In this case, it can be considered that step S100 has been performed or need not be performed.
Thereafter, in step S110, the type of each subtask is determined. As discussed above, each class of work unit is capable of performing at least one type of subtask; therefore, by determining the type of each subtask, it can be determined by which class or classes of work units each subtask can be performed.
In the following specific embodiment of the present application, the 40 subtasks obtained by decomposition in step S100 are classified into three types, namely types "0", "1" and "2". A task of type "1" is suitable for work units of the worker class, a task of type "2" is suitable for work units of the robot class, and a task of type "0" is common to work units of the worker class and the robot class. That is, a task of type "0" is suitable for both work units of the worker class and work units of the robot class.
FIG. 2 shows a schematic diagram of the types of subtasks according to an exemplary embodiment of the invention. As shown in fig. 2, tasks 1, 6, 14, 18, etc. are tasks of type "1" and are tasks suitable for work units of the worker class, tasks 2, 7, 8, 11, etc. are tasks of type "2" and are tasks suitable for work units of the robot class, and tasks 3, 4, 5, 6, etc. are tasks of type "0" and are tasks suitable for work units of the worker class and the robot class.
Thereafter, in step S120, the execution duration of each sub-task, that is, the time taken from the start of executing the sub-task to the completion of the sub-task, is determined. In this step, the time taken for each sub-task to be independently executed by a single work unit to complete, i.e., the time each sub-task occupies the work unit, is determined according to the actual situation of each sub-task.
FIG. 3 shows a schematic diagram of subtask durations in accordance with an exemplary embodiment of the present invention. As shown in fig. 3, the execution duration of subtask 1 is 16 seconds, the execution duration of subtask 2 is 10 seconds, and the execution duration of subtask 3 is 16 seconds. In this embodiment, subtask 3 is a task of type "0" (i.e., a task suitable for both the worker class and the robot class). The execution duration of subtask 3 when executed by a work unit of the worker class and when executed by a work unit of the robot class may be the same or different. For convenience, the execution duration of subtask 3 is set to 16 seconds herein, whether it is executed by a work unit of the worker class or of the robot class. However, it will also be apparent to those skilled in the art that, for a subtask that can be performed by multiple classes of work units, the duration of executing the subtask by each class of work unit can be determined and recorded separately.
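For illustration, the subtask types and execution durations described above can be kept in simple lookup tables. The following Python sketch records only the subtasks explicitly named above (subtasks 1 to 3); the variable names are illustrative assumptions, and a full implementation would enumerate all 40 decomposed subtasks.

```python
# Sketch of the subtask metadata described above (illustrative names).
# Type "1" = worker-class only, type "2" = robot-class only, type "0" = either class.
subtask_type = {1: 1, 2: 2, 3: 0}          # subtask number -> type
subtask_duration = {1: 16, 2: 10, 3: 16}   # subtask number -> execution duration in seconds
```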
Thereafter, in step S130, a precedence relationship between the subtasks is determined.
Because the subtasks are interrelated, some subtasks have a precedence relationship between them. This precedence relationship is sometimes mandatory and cannot be altered. For example, the subtask of inserting a bolt and a washer into the corresponding screw hole must precede the subtask of tightening that bolt in place, and must follow the subtask of positioning the two workpieces relative to each other. In addition, some subtasks may have no precedence relationship between them. For example, the order between the subtask of inserting bolt No. 1 into screw hole No. 1 and the subtask of inserting bolt No. 2 into screw hole No. 2 may be arbitrary, i.e., whether or not one of them has been performed does not affect the execution of the other.
According to an embodiment of the present disclosure, determining the precedence relationships between subtasks includes generating a contact matrix for the subtasks. In an embodiment of the application, for N subtasks, a two-dimensional contact matrix of N+1 rows and N+1 columns is generated. The value at the first row and first column (1, 1) of the matrix is empty, and the remaining values of the first row and the first column are, in turn, the numbers 1 to N of the subtasks. In the contact matrix, the precedence relation between any two subtasks i and j is represented by the value at row i+1 and column j+1. For example, when task i is located before task j, the value at that position may be set to "1"; when task i is located after task j, the value may be set to "2"; when task i and task j have no precedence relationship with each other, the value may be set to "0". Further, since the value at row i+1 and column i+1 represents the relationship between subtask i and itself, it may be set to "0" to indicate that no precedence relationship exists. Alternatively, that value may remain empty.
The subtask contact matrix can be determined manually or by a computer: by working through the execution flow of the task, the precedence relationship between each pair of subtasks is determined, and the corresponding value is filled into each position of the subtask contact matrix.
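As a rough illustration of how such a contact matrix could be filled programmatically, the sketch below builds an (N+1) x (N+1) matrix from a list of "i precedes j" pairs; the function name and the example pair are assumptions for illustration only.

```python
import numpy as np

def build_contact_matrix(n_subtasks, before_pairs):
    """Build the (N+1) x (N+1) subtask contact matrix described above.

    before_pairs contains pairs (i, j) meaning subtask i precedes subtask j.
    The entry for (i, j) is set to 1 ("i before j") and the mirrored entry
    for (j, i) to 2 ("j after i"); all other entries stay 0 ("no relation").
    """
    m = np.zeros((n_subtasks + 1, n_subtasks + 1), dtype=int)
    m[0, 1:] = np.arange(1, n_subtasks + 1)  # first row: subtask numbers 1..N
    m[1:, 0] = np.arange(1, n_subtasks + 1)  # first column: subtask numbers 1..N
    for i, j in before_pairs:
        m[i, j] = 1  # subtask i precedes subtask j
        m[j, i] = 2  # subtask j follows subtask i
    return m

# Illustrative call: record that subtask 1 precedes subtask 5 (cf. Fig. 4 discussed below).
contact = build_contact_matrix(40, [(1, 5)])
```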
Fig. 4 shows a schematic diagram of a subtask contact matrix according to an exemplary embodiment of the invention. A contact matrix for 40 subtasks is shown in fig. 4. As shown in fig. 4, a value of "0" at row 2 (corresponding to subtask 1) and column 3 (corresponding to subtask 2) indicates that there is no precedence relationship between subtasks 1 and 2. Further, a value of "1" at row 2 (corresponding to subtask 1) and column 6 (corresponding to subtask 5) indicates that subtask 1 precedes subtask 5. Further, a value of "2" at row 10 (corresponding to subtask 9) and column 12 (corresponding to subtask 11) indicates that subtask 9 follows subtask 11. The values for each location in fig. 4 are not described in detail.
As shown in fig. 4, after determining the subtask contact matrix, the precedence relationship between any one subtask and another subtask may be read directly from the subtask contact matrix.
As shown in fig. 4, there are cases where the row of the contact matrix corresponding to one subtask is identical in content to the row corresponding to another subtask. For example, the rows of the contact matrix corresponding to subtasks 1, 2, 3 and 4 (rows 2, 3, 4 and 5 of the matrix) are identical to each other. Similarly, the rows of the contact matrix corresponding to subtasks 5, 6, 7, 8, 11 and 12 (rows 6, 7, 8, 9, 12 and 13 of the matrix) are identical to each other. Identical rows in the contact matrix indicate that the subtasks corresponding to these rows have the same precedence relationships with respect to the other subtasks and have no precedence relationship with each other.
In an embodiment of the present application, these sub-tasks, which have the same precedence relationship with respect to other sub-tasks, respectively, and which do not have precedence relationships with each other, may be divided into a group of sub-tasks. Thus, subtasks 1, 2, 3, 4 may be divided into a group. Similarly, subtasks 5, 6, 7, 8, 11, 12 may be divided into a set of subtasks. In the above division, the subtasks in fig. 4 may be divided into the following 8 groups of subtasks: group 1 subtasks (subtasks 1, 2, 3, 4), group 2 subtasks (subtasks 5, 6, 7, 8, 11, 12), group 3 subtasks (subtasks 9, 10, 13, 14, 15), group 4 subtasks (subtasks 16, 17, 18, 19, 24, 25), group 5 subtasks (subtasks 20, 21, 22, 23), group 6 subtasks (subtasks 26, 27, 34, 35, 36), group 7 subtasks (subtasks 28, 29, 30, 31, 32, 33), and group 8 subtasks (subtasks 37, 38, 39, 40).
As shown in fig. 4, there is a certain precedence relationship between the groups of subtasks. For example, in the embodiment shown in FIG. 4, the groups of subtasks are performed sequentially in order from the group 1 subtasks to the group 8 subtasks, i.e., after the group 1 subtasks are performed, the group 2 subtasks are performed, then the group 3 subtasks, and so on.
Those skilled in the art will appreciate that while a two-dimensional subtask contact matrix is used in the above embodiments to determine the precedence relationships between subtasks, other ways of determining the precedence relationships may be employed. For example, other types of data structures may be employed and corresponding data may be populated therein, as long as the populated data can represent the precedence relationships between the subtasks.
It will be appreciated by those skilled in the art that although in the above embodiments, a plurality of subtasks are included in each set of subtasks that do not have precedence relationships with each other, when one subtask has precedence relationships with all other subtasks, only one subtask may be included in the set of subtasks in which the subtask is located.
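The grouping described above can be derived mechanically by comparing rows of the contact matrix, since subtasks whose rows are identical belong to the same group. A minimal sketch, assuming the numpy contact matrix built in the previous sketch:

```python
from collections import defaultdict

def group_subtasks(contact):
    """Group subtasks whose contact-matrix rows are identical in content.

    Identical rows mean the subtasks have the same precedence relations with
    respect to all other subtasks and no precedence relation among themselves,
    so each set of identical rows forms one group of subtasks.
    """
    groups = defaultdict(list)
    n = contact.shape[0] - 1
    for i in range(1, n + 1):
        key = tuple(contact[i, 1:])  # relation values only (header column excluded)
        groups[key].append(i)
    # Note: the execution order between the groups still has to be read from the matrix.
    return list(groups.values())
```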
Thereafter, in step S140, a task pool is generated according to the type of the subtask, the execution duration of the subtask, and the contact matrix of the subtask. The task pool includes the type of each subtask, the execution duration, and the precedence relationship between the subtasks, and thus the task pool includes information required in assigning the subtasks.
In an embodiment of the present application, the task pool may be defined in the form of three mutually corresponding matrix layers. The first-layer matrix includes the number of each subtask obtained by decomposition in step S100, the second-layer matrix includes the type of each subtask determined in step S110, and the third-layer matrix includes the execution duration of each subtask determined in step S120. The numbers of the subtasks are filled into the first-layer matrix according to the precedence relationships determined in step S130, such that each row of the first-layer matrix contains only the numbers of one group of subtasks (i.e., subtasks that have the same precedence relationships with respect to the other subtasks and no precedence relationship with each other). The second-layer matrix includes the type of the respective subtask at the position corresponding to each subtask number in the first-layer matrix. The third-layer matrix includes the execution duration of the respective subtask at the position corresponding to each subtask number in the first-layer matrix.
In some embodiments of the present application, in the first layer matrix, each group of subtasks may be filled into rows of the matrix from top to bottom or from bottom to top according to a precedence relationship between each group of subtasks.
Alternatively, in other embodiments of the present application, the groups of subtasks may be filled into the rows of the matrix entirely without regard to the precedence relationships between the groups. For example, a group of subtasks may be filled into any row of the first-layer matrix without considering the precedence relationship between that group and the other groups of subtasks.
After the position of the serial number of each subtask in the first layer matrix is determined, the corresponding subtask types and execution time lengths are filled in the corresponding positions of the second layer matrix and the third layer matrix, so that the second layer matrix and the third layer matrix are obtained. Thereby yielding a task pool represented by a three-layer matrix.
FIG. 5A shows a schematic diagram of a task pool according to an exemplary embodiment of the application.
In the embodiment of the present application shown in fig. 5A, each group of subtasks is placed in the same row of the first layer matrix, respectively, row by row and down, starting from the first row, according to the numbering of the subtasks. As shown in fig. 5A, in the first layer matrix, the numbers of the 1 st group of subtasks (subtasks 1, 2, 3, 4) are put in the 1 st row, the numbers of the 2 nd group of subtasks (subtasks 5, 6, 7, 8, 11, 12) are put in the 2 nd row, the numbers of the 3 rd group of subtasks (subtasks 9, 10, 13, 14, 15) are put in the 3 rd row, the numbers of the 4 th group of subtasks (subtasks 16, 17, 18, 19, 24, 25) are put in the 4 th row, the numbers of the 5 th group of subtasks (subtasks 20, 21, 22, 23) are put in the 5 th row, the numbers of the 6 th group of subtasks (subtasks 26, 27, 34, 35, 36) are put in the 6 th row, the numbers of the 7 th group of subtasks (subtasks 28, 29, 30, 31, 32, 33) are put in the 7 th row, the 8 th group of subtasks (subtasks 37, 38, 39, 40) are put in the 8 th row.
Furthermore, in another embodiment of the present invention, each group of subtasks may be placed in the same row of the first layer matrix, respectively, from the last row, in the reverse order of the numbering of the subtasks, row by row upwards. The obtained first layer matrix is the same as in the above embodiment.
After determining the position of each subtask number in the first layer matrix, the corresponding positions of the second layer matrix and the third layer matrix are filled with the corresponding subtask type shown in fig. 2 and the corresponding subtask execution duration shown in fig. 3, so as to obtain the second layer matrix and the third layer matrix. Thus, a task pool represented by a three-layer matrix as shown in FIG. 5A is obtained.
In the three-layer matrix of the task pool shown in fig. 5A, since the number of subtasks in each row in the matrix is different, the length of each row of the matrix is different, so that there is a space in the matrix where there is no data. To facilitate data processing, each row in the tri-layer matrix may be filled to the same length with preset data according to the length of the row with the largest number of subtasks. The preset data indicates that the corresponding position in the matrix should be empty.
FIG. 5B shows a schematic diagram of a filled task pool, according to an exemplary embodiment of the invention. In embodiments of the present disclosure, the preset data may be filled with "0". After filling, as shown in fig. 5B, the three-layer matrix of the task pool no longer has empty space.
Those skilled in the art will appreciate that the filling step may not be performed in the case where the number of subtasks in each row of the three-layer matrix of the task pool is the same when no filling is performed.
It will be appreciated by those skilled in the art that while in the above embodiments a three-layer matrix form is used to generate the task pool, other ways may be employed to generate the task pool, provided that the generated task pool is capable of recording the type and execution duration of the subtasks and the precedence relationship between the subtasks.
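As one possible realization of this three-layer structure, the sketch below stacks the subtask numbers, types and durations of each group into a padded numpy array; the function name is an assumption, and the padding value 0 is the preset data mentioned above.

```python
import numpy as np

def build_task_pool(groups, subtask_type, subtask_duration):
    """Assemble the padded three-layer task-pool matrix described above.

    Layer 0 holds the subtask numbers, layer 1 the subtask types and layer 2
    the execution durations; each row corresponds to one group of subtasks,
    and shorter rows are padded with the preset value 0.
    """
    width = max(len(g) for g in groups)              # length of the largest group
    pool = np.zeros((3, len(groups), width), dtype=int)
    for r, group in enumerate(groups):
        for c, task in enumerate(group):
            pool[0, r, c] = task
            pool[1, r, c] = subtask_type[task]
            pool[2, r, c] = subtask_duration[task]
    return pool
```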
Thereafter, in step S150, a work pool is generated for recording information of the subtasks allocated to each work unit.
In an embodiment of the present application, the work pool may take the form of a plurality of slots, the number of which equals the number of work units, with a one-to-one correspondence between slots and work units. Each slot is marked by the category and number of the corresponding work unit and is used to record information of the subtasks assigned to that work unit. When a subtask is assigned to a work unit, the information of the subtask (including the subtask number, the subtask type and the execution duration) is placed into the slot of the work unit that executes it. During the task allocation process, by placing into each slot the information of the subtasks executed by the corresponding work unit (including the subtasks that have been completed and the subtask currently being executed), the process and result of the entire task allocation can be conveniently recorded.
Fig. 6 shows a schematic diagram of a work pool according to an exemplary embodiment of the invention. As shown in fig. 6, the work units in the present embodiment include two workers (H1 and H2) and two robots (R1 and R2), and a slot is provided for each worker and each robot, each slot being marked by the category and number of the corresponding work unit. Thus, the work pool in this embodiment has four slots. The slot of worker H1 is used for recording information of the subtasks allocated to worker H1; the slot of worker H2 is used for recording information of the subtasks allocated to worker H2; the slot of robot R1 is used for recording information of the subtasks allocated to robot R1; and the slot of robot R2 is used for recording information of the subtasks allocated to robot R2.
Those skilled in the art will appreciate that the working pool may have other forms as long as information of the subtasks assigned to each working unit can be recorded.
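A work pool of the kind described above can be represented very simply, for example as one record list ("slot") per work unit; the slot labels H1, H2, R1 and R2 below follow the embodiment of Fig. 6, and the helper function is illustrative only.

```python
# Sketch of the work pool: one slot per work unit, keyed by category and number.
work_pool = {"H1": [], "H2": [], "R1": [], "R2": []}

def record_assignment(work_pool, unit, task_no, task_type, duration):
    """Record an assigned subtask (number, type, execution duration) in the unit's slot."""
    work_pool[unit].append((task_no, task_type, duration))
```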
Thereafter, in step S160, a state space s, an action space a, and a reward space are constructed according to a Markov decision process (MDP).
Those skilled in the art will appreciate that a Markov decision process is a mathematical model of sequential decision making, used to simulate the policies and rewards achievable by an agent in an environment whose system state has the Markov property. An MDP is built around a set of interacting objects, namely the agent and the environment, with elements including states, actions, policies and rewards. In the simulation of an MDP, the agent perceives the current system state and acts on the environment according to a policy, thereby changing the state of the environment and obtaining a reward. Markov decision processes are known to those skilled in the art, and descriptions of other underlying concepts are omitted here.
The state space s includes the current state of the task pool and the current state of the work pool. Those skilled in the art will appreciate that the state space s represents the state of the task pool and the working pool before the task is executed. After starting execution of the task, the state space s varies with the execution of the task, and for any time during which the task is executed, the corresponding state space can be determined.
In accordance with an embodiment of the present disclosure, where the total number of subtasks is N, the state space s may be represented as:
s = [c_1(p, t_1), …, c_N(p, t_N), H_1, …, H_U, R_1, …, R_V, …]
where c_N(p, t_N) represents the current state of subtask N, p is the type of the subtask, t_N is the remaining duration of the subtask, H_U represents the task state assigned to the U-th work unit among the work units of the first category (workers in this example), and R_V represents the task state assigned to the V-th work unit among the work units of the second category (robots in this example). If there are three or more categories of work units, the task state of each additional work unit may be added to the state space s accordingly.
When execution of the task has not yet started, t_N in c_N(p, t_N) of each subtask indicates the total execution duration of the subtask. At a certain moment while the subtask is being executed, t_N indicates the remaining duration of the subtask at that moment. After the subtask has been executed, t_N indicates the remaining duration of the subtask, i.e., "0". H_U indicates the tasks to which the U-th worker has been assigned, including the tasks the work unit has completed and the task being performed. In one embodiment of the present disclosure, H_U may include the sum of the working durations of the subtasks that the work unit has completed and the time already spent on the subtask being performed.
Fig. 7 shows a graphical representation of a state space according to an exemplary embodiment of the present invention, i.e. fig. 7 is a graphical form of a mathematical expression for representing the above state space s. As shown in fig. 7, the state space s may be represented in the form of a three-layer matrix similar to the task pool shown in fig. 5B. In contrast to the three-layer matrix of the task pool shown in fig. 5B, the three-layer matrix of the state space s includes rows representing the states of the work units in addition to rows representing the states of the subtasks. The row in the state space s representing the state of the subtask corresponds to the row of the matrix of the task pool shown in fig. 5B, and the portion representing the current state of the work unit is attached below the row representing the state of the subtask in the present embodiment.
In the first layer matrix, the portion representing the current state of the work units includes the number of each work unit, and the numbers of the work units of the same category are placed in the same row of the first layer matrix. As shown in fig. 7, the penultimate row of the first layer matrix represents the numbers of workers (two workers in total, numbered 1 and 2, respectively), and the penultimate row of the first layer matrix represents the numbers of robots (two robots in total, numbered 1 and 2, respectively). In addition, in the second layer matrix, the portion representing the current state of the work unit includes a category of the work unit corresponding to the first layer matrix. As shown in fig. 7, similarly to the category of the subtask, the category of the worker is "1", and the category of the robot is "2". Thereafter, in the third layer matrix, the portion representing the current state of the work unit includes the work time length of the work unit corresponding to the first layer matrix, that is, the sum of the work time lengths of the respective sub-tasks that the work unit has completed and the time lengths that have been spent for the sub-tasks being executed. As shown in fig. 7, since the execution of the task has not yet started, in the third layer, the current operation time of each operation unit is "0".
In addition, the state space s shown in fig. 7 is also different from the task pool shown in fig. 5B in that, in a portion related to each subtask, the third layer matrix in fig. 7 represents the remaining time period of each subtask, and the third layer matrix in fig. 5B represents the time taken from the start of execution of the subtask to the completion of the subtask, i.e., the execution time period.
Those skilled in the art will appreciate that the graphical representation of the state space s shown in FIG. 7 is provided herein for ease of understanding, and that the state space s may take other graphical representations.
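One plausible flattened encoding of the state space s defined above is sketched below: each subtask contributes its type p and remaining duration t_N, and each work unit contributes its accumulated working duration. This is an illustrative encoding, not the only possible representation.

```python
import numpy as np

def encode_state(types, remaining, worker_time, robot_time):
    """Flatten s = [c_1(p, t_1), ..., c_N(p, t_N), H_1, ..., H_U, R_1, ..., R_V].

    types, remaining : per-subtask type p and remaining duration t_n
    worker_time      : accumulated working duration of each worker H_u
    robot_time       : accumulated working duration of each robot R_v
    """
    per_task = np.column_stack([types, remaining]).ravel()  # (p, t_n) pairs in order
    return np.concatenate([per_task, worker_time, robot_time]).astype(np.float32)
```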
The action space a includes actions to perform assigning subtasks to work units.
In an embodiment of the application, an action in the action space a corresponds to selecting a subtask and selecting a work unit to execute that subtask. In the case of, for example, two or more categories of work units, one subtask is selected at a time from the Q subtasks, and one work unit is selected from the U work units of the first category (workers in this example) and the V work units of the second category (robots in this example), so that there are Q × (U + V + …) possible actions. The action space a can thus be expressed as:
a = [a_H(1, 1), …, a_H(Q, U), a_R(1, 1), …, a_R(Q, V), …]
where a_H(q, u) represents the action of assigning subtask q to the u-th of the U work units of the first category, and a_R(q, v) represents the action of assigning subtask q to the v-th of the V work units of the second category.
In an embodiment of the present application, Q may be the number of subtasks in the subtask group containing the largest number of subtasks, where, as discussed above, subtasks having the same precedence relationships with respect to the other subtasks and no precedence relationship with each other form one subtask group. In embodiments of the present disclosure, Q may thus be the length of a row of the task-pool matrix.
Fig. 8 shows a graphical representation of an action space according to an exemplary embodiment of the present application, i.e., fig. 8 is a graphical form of the mathematical expression representing the above action space a. In the embodiment of the present application, Q is the length of one row of the task-pool matrix (i.e., 6) and there are 4 work units in total (i.e., 2 workers and 2 robots), so the action space a includes 24 actions in total.
As shown in fig. 8, the action space of an embodiment of the present application is a discrete action space, i.e., each action is discrete, discontinuous.
Those skilled in the art will appreciate that the graphical representation of the action space a shown in fig. 8 is provided herein for ease of understanding, and that the action space a may take other graphical representations.
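Because the action space is discrete with Q × (U + V) entries, a common way to realize it is to decode a flat action index into a pair (subtask position, work unit); a minimal sketch under that assumption, using the 6 × 4 = 24 actions of this embodiment:

```python
def decode_action(action_index, q=6, units=("H1", "H2", "R1", "R2")):
    """Decode a flat action index in [0, q * len(units)) into (column, unit).

    q     : number of subtask positions per row of the task pool (6 here)
    units : ordered work units (2 workers and 2 robots in this embodiment),
            giving 6 * 4 = 24 discrete actions in total.
    """
    column = action_index % q        # which subtask position in the current row
    unit = units[action_index // q]  # which work unit receives the subtask
    return column, unit
```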
The reward space includes the rewards (feedback) obtained from performing the task allocation actions.
In embodiments of the present application, the reward space may include a positive reward obtained for each correctly performed allocation action before all subtask allocations are completed. Correctly performing an allocation action refers to allocating a subtask to a work unit of a class that is capable of performing that type of subtask. For example, correctly performing an allocation action includes: allocating a subtask of type "1" to a work unit of the worker class, allocating a subtask of type "2" to a work unit of the robot class, and allocating a subtask of type "0" to a work unit of either the worker class or the robot class.
In embodiments of the present application, the reward space may also include a negative reward obtained for each erroneously performed allocation action before all subtask allocations are completed. Erroneously performed allocation actions include selecting a subtask position in the task pool that is filled with "0", i.e., a subtask that does not actually exist. Further, in contrast to correctly performing an allocation action, erroneously performing an allocation action includes allocating a subtask to a work unit of a class that is not capable of performing that type of subtask. For example, allocating a subtask of type "1" to a work unit of the robot class, or a subtask of type "2" to a work unit of the worker class, is an erroneous allocation action.
In an embodiment of the present application, the reward space may further include an instant reward obtained when all subtask allocations are completed, which consists of the following two parts: a reward of "1", indicating the reward obtained for correctly completing the allocation of all subtasks, plus an additional reward determined from maxT and aveT, where maxT is the overall duration of executing the task and aveT is the nominal average task duration, i.e., the ratio of the sum of the durations of all subtasks to the total number of work units.
For example, in the case of two or more categories of work units, with U work units of the first category and V work units of the second category, aveT = T / (U + V + …), where T is the sum of the durations of all subtasks.
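A hedged sketch of a reward function with this structure is given below. The description only fixes the sign of the step rewards and the two-part form of the terminal reward, so the concrete values +1/-1 and the form of the additional term are assumptions for illustration, not the patented formula.

```python
def step_reward(valid_assignment):
    """Per-step reward before all subtasks are allocated (assumed +1 / -1)."""
    return 1.0 if valid_assignment else -1.0

def final_reward(max_t, ave_t):
    """Terminal reward when all subtasks are allocated: a fixed part of 1 for
    completing all allocations correctly plus an additional part computed from
    aveT and maxT (the exact form of this additional term is assumed here)."""
    return 1.0 + ave_t / max_t
```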
Then, in step S170, deep reinforcement learning is performed based on the state space, the action space and the reward space, and the task allocation scheme with the maximum total reward value is determined as the optimal allocation scheme.
Based on the state space s, the action space a and the reward space constructed above, training is performed using a deep reinforcement learning algorithm (e.g., DQN (Deep Q-Network), DDQN (Double DQN), Dueling DQN, etc.). First, one subtask is selected for allocation from the first row of the state space representing the states of the subtasks, and the duration value of the selected subtask in the state space s is set to 0. The work unit to which the subtask is allocated starts to execute it, and the duration of executing the subtask is recorded as the working duration of that work unit in the state space. When a work unit is allocated several subtasks of the same row, it executes those subtasks in sequence, and the sum of the working durations of executing them is recorded in the state space. When the duration values of the first row are all 0, the allocation automatically moves from the first row to the second row representing the states of the subtasks, and the working durations of all work units in the state space are uniformly set to the maximum of the working durations of the work units when executing the subtasks of the first row. Thereafter, similarly, one subtask is selected for allocation from the second row representing the states of the subtasks, and the duration value of the selected subtask in the state space s is set to 0. When the duration values of the second row are all 0, the allocation automatically moves from the second row to the third row representing the states of the subtasks, and the working durations of all work units in the state space are all increased by the maximum of the working durations of the work units when executing the tasks of the second row. The above operations are repeated for each row representing the states of the subtasks in turn until the allocation of subtasks ends. The allocation of subtasks may end when the duration values of all subtasks in the state space are 0, i.e., all subtasks have been allocated. Alternatively, the allocation of subtasks may end when an allocation action is performed erroneously.
During execution of the subtask allocation, reward values are obtained according to the reward space described above. The total reward of each episode is the sum of the reward values of all experience trajectories in that episode. The maximum experience-trajectory length of each episode during training is the number of subtasks N.
In the deep reinforcement learning algorithm, the loss function is defined as Loss = (y(s, a) - q(s, a))^2, where y(s, a) is the action reward value obtained from the actual experience trajectory, and q(s, a) is the predicted value of the action reward calculated by the neural network. The DQN algorithm updates the parameter values in the loss function through the gradient descent algorithm, so that the loss function is continuously reduced and the accuracy of the algorithm's predicted action reward is improved; finally, the total reward value of an episode is maximized and an optimal human-machine collaboration task allocation scheme is obtained.
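For reference, the squared-error loss named above can be written directly against a Q-network. The following PyTorch-style sketch assumes a standard DQN target with discount factor gamma and a separate target network; it is an illustrative sketch, not the code of the present application.

```python
import torch

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Loss = (y(s, a) - q(s, a))^2, as defined above.

    q(s, a) is the Q-network's prediction for the chosen action; y(s, a) is the
    target built from the observed reward and the target network's estimate of
    the next state.  batch = (states, actions, rewards, next_states, dones).
    """
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_max = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * next_max
    return torch.mean((y - q_sa) ** 2)
```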
Based on the state space s, the action space a and the reward space constructed according to the embodiments of the present disclosure, those skilled in the art can train and optimize using known deep reinforcement learning algorithms, and therefore the details of deep reinforcement learning algorithms known to those skilled in the art are not described here. Any related-art means conceivable by those skilled in the art can be applied to implement the technical solution of the present application.
Fig. 9 shows a schematic diagram of the output optimal allocation scheme according to an exemplary embodiment of the present application. FIG. 9 is the work pool of FIG. 6 populated with the information of each subtask assigned according to the optimal subtask allocation scheme obtained by the optimization. As shown in fig. 9, in the present embodiment, after processing and optimization by the deep reinforcement learning algorithm, the information of the 40 subtasks in the embodiment of the present application is placed in the slots of the work units in the work pool that execute those subtasks.
As shown in fig. 9, subtasks 1, 2, 3 and 4 in the first row of the state space representing the states of the subtasks are assigned to work units H2, R1, R2 and H1, respectively, and the information of these subtasks is placed into the slots of the corresponding work units. For example, the information "4, 0, 17" of the subtask at the bottom of the slot of work unit H1 indicates that the subtask numbered 4 is assigned to work unit H1 for execution, that the subtask is of type "0", and that its execution duration is "17".
As shown in fig. 9, when the tasks of the first row are executed, the working durations of work units H1, H2, R1 and R2 are 17, 16, 10 and 16, respectively, of which the maximum is 17. After the subtasks in the first row of the state space have been executed, according to an embodiment of the present disclosure, the working durations of all work units are uniformly set to the maximum of the working durations of the work units when executing the tasks of the first row. That is, the working durations of work units H1, H2, R1 and R2 are all set to 17.
Next, subtasks 5, 6, 7, 8, 11 and 12 in the second row of the state space are assigned to work units H2, H1, R1, R1, R2 and R2, respectively. It can be seen that work unit R1 is assigned subtasks 7 and 8, and work unit R2 is assigned subtasks 11 and 12. Thus, during the allocation of the subtasks in the second row of the state space, the working duration of work unit R1 is the sum of the duration "8" for executing subtask 7 and the duration "10" for executing subtask 8, i.e., 18. Similarly, during the allocation of the subtasks in the second row of the state space, the working duration of work unit R2 is the sum of the duration "8" for executing subtask 11 and the duration "16" for executing subtask 12, i.e., 24.
Thus, when the tasks of the second row are executed, the working durations of work units H1, H2, R1 and R2 are 6, 5, 18 (8+10) and 24 (16+8), respectively, of which the maximum is 24. After the subtasks in the second row of the state space have been executed, according to embodiments of the present disclosure, the working durations of all work units in the state space are all increased by the maximum of the working durations of the work units when executing the tasks of the second row. That is, the working durations of work units H1, H2, R1 and R2 are all set to 17+24, i.e., 41, where 17 is the maximum of the working durations of the work units when executing the tasks of the first row and 24 is the maximum of the working durations of the work units when executing the tasks of the second row.
Next, subtasks 9, 10, 13, 14 and 15 in the third row of the state space are allocated, wherein subtasks 9 and 10 are both allocated to work unit R2. When the tasks of the third row are executed, the working durations of work units H1, H2, R1 and R2 are 8, 12, 9 and 11 (7+4), respectively, of which the maximum is 12. After the subtasks in the third row of the state space have been executed, the working durations of work units H1, H2, R1 and R2 are therefore set to 41 (17+24) + 12, i.e., 53.
Thereafter, the subtasks in the remaining rows of the state space are allocated to the work units in turn, and the working duration of each work unit is determined accordingly. During execution of the fourth through eighth rows of the state space, the maxima of the working durations of the work units are 21 (19+2), 16, 19, 14 and 19, respectively. Finally, according to the optimal allocation result, the total duration of executing the task is 142 seconds, i.e., 17+24+12+21+16+19+14+19 = 142.
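The 142-second total above is simply the sum of the per-row maxima of the work-unit durations; a one-line check in Python:

```python
# Per-row maxima of the work-unit durations listed above (rows 1 to 8).
row_maxima = [17, 24, 12, 21, 16, 19, 14, 19]
total_duration = sum(row_maxima)  # 142 seconds, matching the optimal allocation result
```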
As shown, during the entire task execution, work unit H1 actually works for 83 seconds and waits idle for 59 seconds; work unit H2 actually works for 78 seconds and waits idle for 64 seconds; work unit R1 actually works for 119 seconds and waits idle for 23 seconds; and work unit R2 actually works for 117 seconds and waits idle for 25 seconds.
According to the optimal allocation result, the total reward value obtained in the reinforcement learning training is calculated from the reward space described above, where the overall duration maxT of task execution is 142 and aveT is the nominal average task duration defined above; the optimal total reward value of one episode is calculated as 39.56927.
The embodiments of the application provide methods for generating a contact matrix, a task pool and a work pool, which decompose the task and construct a unified task environment, so that complex task allocation problems can be handled, and the approach has high generality.
In addition, the application provides a processing method for formalizing task allocation as a reinforcement learning problem, including the definition strategies for the state space, the action space and the reward space, so that general reinforcement learning algorithms can be used to optimize the task allocation sequence, which greatly enriches the effective ways of solving the task allocation problem.
According to the embodiments of the application, tasks are allocated for the collaboration of two or more categories of work units (such as human-machine collaboration), taking into account multiple objectives such as shortest overall working duration, correct task allocation and reasonable allocation order. For tasks with relatively complex task structures and task spaces, the technical solutions of the embodiments output the allocation scheme with the shortest global duration while ensuring that the allocation of each subtask meets its type attribute.
Although in the above example, task allocation is performed with workers and robots as two categories, it will be understood by those skilled in the art that in the case where workers and robots are further divided into a plurality of categories, respectively, tasks may be allocated among a plurality of worker categories and/or a plurality of robot categories, thereby performing human-computer cooperation and robot-computer cooperation.
Fig. 10 illustrates a general hardware environment 400 in which the present disclosure may be applied, according to an exemplary embodiment of the present disclosure.
With reference to fig. 10, a computing device 400 will now be described as an example of a hardware device applicable to aspects of the present disclosure. Computing device 400 may be any machine configured to perform processing and/or computing, and may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a television, a tablet computer, a personal digital assistant, a smart phone, a portable camera, or any combination thereof. The apparatus 100 described above may be implemented in whole or at least in part by a computing device 400 or similar device or system.
Computing device 400 may include elements capable of connecting with bus 402 or communicating with bus 402 via one or more interfaces. For example, computing device 400 may include a bus 402, one or more processors 404, one or more input devices 406, and one or more output devices 408. The one or more processors 404 may be any type of processor and may include, but are not limited to, one or more general purpose processors and/or one or more special purpose processors (such as special purpose processing chips). Input device 406 may be any type of device capable of inputting information to the computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone and/or a remote control. According to some embodiments of the present disclosure, input device 406 may also include a camera. Output device 408 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals and/or a printer. Computing device 400 may also include or be connected with a non-transitory storage device 410, which may be any storage device that is non-transitory and can implement a data store, and may include, but is not limited to, a disk drive, an optical storage device, a solid-state storage, a floppy disk, a flexible disk, a hard disk, a magnetic tape or any other magnetic medium, a compact disc or any other optical medium, a ROM (read only memory), a RAM (random access memory), a cache memory and/or any other memory chip or cartridge, and/or any other medium from which a computer may read data, instructions and/or code. The non-transitory storage device 410 may be detachable from an interface. The non-transitory storage device 410 may have data/instructions/code for implementing the methods and steps described above. Computing device 400 may also include a communication device 412. The communication device 412 may be any type of device or system capable of communicating with external apparatus and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication facility, etc.
Bus 402 can include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Computing device 400 may also include a working memory 414, where working memory 414 may be any type of working memory that may store instructions and/or data useful for the operation of processor 404, and may include, but is not limited to, random access memory and/or read-only memory devices.
The software elements may reside in a working memory 414 including, but not limited to, an operating system 416, one or more application programs 418, drivers, and/or other data and code. Instructions for performing the above-described methods and steps may be included in one or more application programs 418. Executable code or source code of instructions of the software elements may be stored in a non-transitory computer readable storage medium, such as the storage device(s) 410 described above, and may be read into working memory 414, possibly compiled and/or installed. Executable code or source code for the instructions of the software elements may also be downloaded from a remote location.
From the above embodiments, it is apparent to those skilled in the art that the present disclosure may be implemented by software and necessary hardware, or may be implemented by hardware, firmware, etc. Based on this understanding, embodiments of the present disclosure may be implemented, in part, in software. The computer software may be stored in a computer readable storage medium, such as a floppy disk, hard disk, optical disk, or flash memory. The computer software includes a series of instructions that cause a computer (e.g., a personal computer, a service station, or a network terminal) to perform a method according to various embodiments of the present disclosure, or a portion thereof.
Having thus described the present disclosure, it is clear that the present disclosure can be varied in a number of ways. Such variations are not to be regarded as a departure from the spirit and scope of the present disclosure, and all such modifications as would be obvious to one skilled in the art are intended to be included within the scope of the following claims.
While certain specific embodiments of the invention have been illustrated in detail by way of example, it will be appreciated by those skilled in the art that the foregoing examples are intended to be illustrative only and not to limit the scope of the invention. It should be appreciated that some of the steps in the foregoing methods are not necessarily performed in the order illustrated, but they may be performed simultaneously, in a different order, or in an overlapping manner. Furthermore, one skilled in the art may add some steps or omit some steps as desired. Some of the components in the foregoing systems are not necessarily arranged as shown, and one skilled in the art may add some components or omit some components as desired. It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (16)

1. A method of task allocation based on deep reinforcement learning for allocating tasks between at least two classes of work units, the method comprising:
decomposing the task into a plurality of subtasks such that each subtask can be performed by a single unit of work;
determining the type of each subtask, wherein each class of work unit can execute at least one type of subtask;
determining an execution duration of each subtask, the execution duration of a subtask representing a time taken from starting to execute the subtask to completing the subtask;
determining a sequencing relation between subtasks;
generating a task pool according to the type of the subtasks, the execution time of the subtasks and the sequence relation among the subtasks;
generating a working pool for recording information of subtasks allocated to each working unit;
constructing a state space, an action space and a reward space according to a Markov decision process, wherein the state space comprises the current state of a task pool and the current state of a working pool, the action space comprises an action for executing subtasks to be distributed to working units, and the reward space comprises rewards obtained by executing the task distribution action; and
and performing deep reinforcement learning according to the state space, the action space and the rewarding space, and determining a task allocation scheme with the maximum total rewards as an optimal allocation scheme.
2. The task allocation method according to claim 1, wherein the category of the work unit includes a person and a robot.
3. The task allocation method according to claim 2, wherein the types of subtasks include subtasks suitable for human execution, subtasks suitable for robot execution and subtasks suitable for human and robot execution.
4. The task allocation method according to claim 1, wherein the execution duration of each subtask includes an execution duration when the subtask is executed by each work unit, respectively.
5. The task allocation method according to any one of claims 1 to 4, wherein determining a precedence relationship between sub-tasks comprises generating a contact matrix of sub-tasks:
for N subtasks, a two-dimensional contact matrix of N+1 rows and N+1 columns is generated, wherein the value at the first row and first column (1, 1) of the matrix is null, the remaining values of the first row and of the first column are the numbers 1 to N of the subtasks in sequence, respectively, and, for any two subtasks i and j, the precedence relationship between them is represented by the value at the i+1st row and the j+1st column.
6. The task allocation method according to claim 5, wherein, when subtask i is located before subtask j, the value at the i+1st row and the j+1st column is set to "1"; when subtask i is located after subtask j, the value at that position is set to "2"; and when subtask i and subtask j have no precedence relationship with each other, the value at that position is set to "0".
7. The task allocation method according to any one of claims 1 to 4, wherein generating a task pool from a contact matrix of sub-tasks comprises:
generating three mutually corresponding layers of matrices, wherein the first-layer matrix comprises the number of each subtask, the second-layer matrix comprises, at the position corresponding to the number of each subtask in the first-layer matrix, the type of the corresponding subtask, and the third-layer matrix comprises, at the position corresponding to the number of each subtask in the first-layer matrix, the execution duration of the corresponding subtask, and each row of the first-layer matrix comprises only the numbers of the subtasks of one subtask group, the subtasks of which respectively have the same precedence relationship with the other subtasks and have no precedence relationship with each other.
8. The task allocation method according to claim 7, wherein each row in the three-layer matrix is filled to the same length with preset data according to a length of a row having the largest number of subtasks.
9. The task allocation method according to any one of claims 1 to 4, wherein generating a work pool includes:
a number of slots is provided, equal to the number of work units, each slot corresponding to a work unit and being marked by the class and number of the work unit, and each slot being for accommodating a subtask assigned to the respective work unit.
10. The task allocation method according to any one of claims 1 to 4, wherein, in the case where the total number of subtasks is N, the state space s is represented as:
s = [c_1(p, t_1), …, c_N(p, t_N), H_1, …, H_U, R_1, …, R_V, …]
wherein c_N(p, t_N) represents the current state of subtask N, p is the type of the subtask, t_N is the remaining duration of the subtask, H_U represents the task state assigned to the U-th work unit of the first class of work units, and R_V represents the task state assigned to the V-th work unit of the second class of work units.
11. The task allocation method according to any one of claims 1 to 4, wherein the action space a is represented as:
wherein ,representing the action of subtask Q assigned to one of the U units of work of the first category,/I>The actions of the subtask Q assigned to one of the V work units of the second class are represented, and Q is the number of subtasks in the subtask group including the largest number of subtasks, wherein the subtasks having the same precedence relationship with respect to the other subtasks and having no precedence relationship with each other respectively constitute one subtask group.
12. The task allocation method according to claim 11, wherein the action space a is a discrete action space.
13. The task allocation method according to any one of claims 1 to 4, wherein the bonus space includes:
before all the subtasks are allocated, a positive reward is obtained each time an allocation action is executed correctly, wherein correctly executing an allocation action means allocating a subtask to a work unit of a class capable of executing subtasks of that type;
before all the subtasks are allocated, a negative reward is obtained each time an allocation action is executed incorrectly, wherein incorrectly executing an allocation action comprises selecting a task filled with 0 in the task pool and allocating a subtask to a work unit of a class that cannot execute subtasks of that type; and
the instant rewards obtained when all subtasks are distributed are completed, comprising the following two parts:
where "1" indicates that the rewards obtained for all subtask assignments are completed correctly,representing the additional rewards, maxT is the overall length of time the task is executed, aveT is the nominal average task length, i.e., the ratio of the sum of the lengths of all subtasks to the total number of work units.
14. A deep reinforcement learning based task allocation system comprising:
at least one processor; and
at least one storage device storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1-13.
15. A computer program product comprising instructions which, when executed by a processor, cause the method of any of claims 1-13 to be performed.
16. A non-transitory computer-readable storage medium storing instructions which, when executed by a processor, cause performance of the method recited in any one of claims 1-13.
CN202210176123.5A 2022-02-25 2022-02-25 Task allocation method, system and medium based on deep reinforcement learning Pending CN116703047A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210176123.5A CN116703047A (en) 2022-02-25 2022-02-25 Task allocation method, system and medium based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210176123.5A CN116703047A (en) 2022-02-25 2022-02-25 Task allocation method, system and medium based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116703047A true CN116703047A (en) 2023-09-05

Family

ID=87829869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210176123.5A Pending CN116703047A (en) 2022-02-25 2022-02-25 Task allocation method, system and medium based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116703047A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117252381A (en) * 2023-10-09 2023-12-19 广州启盟信息科技有限公司 Intelligent lock-setting method for property service
CN117252381B (en) * 2023-10-09 2024-05-24 广州启盟信息科技有限公司 Intelligent lock-setting method for property service


Similar Documents

Publication Publication Date Title
Foumani et al. A cross-entropy method for optimising robotic automated storage and retrieval systems
CN106775977A (en) Method for scheduling task, apparatus and system
CN105378668B (en) The interruption of operating system management in multicomputer system guides
CN110799997B (en) Industrial data service, data modeling, and data application platform
CN111190712A (en) Task scheduling method, device, equipment and medium
Liu et al. Agent-based simulation and optimization of hybrid flow shop considering multi-skilled workers and fatigue factors
CN104750522A (en) Dynamic execution method and system for tasks or processes
Patton et al. SANLab-CM: A tool for incorporating stochastic operations into activity network modeling
WO2014138405A1 (en) Visual workflow display and management
Dong et al. Dynamic control of a closed two-stage queueing network for outfitting process in shipbuilding
CN115907683A (en) Realization system and method of workflow engine based on financial product management
CN112463334B (en) Training task queuing reason analysis method, system, equipment and medium
CN108874520A (en) Calculation method and device
CN113222253B (en) Scheduling optimization method, device, equipment and computer readable storage medium
Chetty et al. Genetic algorithms for studies on AS/RS integrated with machines
CN116703047A (en) Task allocation method, system and medium based on deep reinforcement learning
Baysan et al. Team based labour assignment methodology for new product development projects
CN103748554A (en) Data architecture and user interface for plasma processing related software applications
CN106611297A (en) Workflow abnormity handling method and system
Qu et al. A new method of cyclic hoist scheduling for multi-recipe and multi-stage material handling processes
KR100590764B1 (en) Method for mass data processing through scheduler in multi processor system
CN101639904A (en) Workflow system and method thereof for realizing tasks in flow operating period
CN114971922A (en) Damage assessment method, device and equipment for vehicle survey and storage medium
Teppan et al. Genetic algorithms for creating large job shop dispatching rules
Peters et al. Quantitative effects of advanced resource constructs in business process simulation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination