WO2021139726A1 - Task migration method and apparatus, and computer device and readable storage medium


Info

Publication number
WO2021139726A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
node
target
migratable
executed
Prior art date
Application number
PCT/CN2021/070663
Other languages
French (fr)
Chinese (zh)
Inventor
高燕强
柴庆龙
张鑫宇
徐远超
Original Assignee
中科寒武纪科技股份有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202010012242.8A external-priority patent/CN113157427B/en
Priority claimed from CN202010012302.6A external-priority patent/CN113157403A/en
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2021139726A1 publication Critical patent/WO2021139726A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • This application relates to the field of computer technology, and in particular to a method, device, computer equipment, and readable storage medium for task migration, and a method, device, computer equipment, and readable storage medium for job processing.
  • NUMA (Non-Uniform Memory Access) is a memory access architecture.
  • A chip based on the NUMA architecture usually includes a processor with multiple arithmetic units and multiple storage units. The arithmetic units are usually divided into a plurality of arithmetic unit groups, each arithmetic unit group is equipped with at least one storage unit, and an arithmetic unit group together with its corresponding storage unit constitutes a node. In this way, the data required by the arithmetic units in a node can be read and written through the storage unit in the same node.
  • A task or job to be executed needs to be assigned to a certain node for execution, but current task and job processing still has problems.
  • the present application provides a task migration method, device, computer equipment, and readable storage medium.
  • a method for task migration includes:
  • when it is detected that a migratable task meets a preset migration condition, a target node matching the migratable task is determined among the nodes according to a task attribute of the migratable task, where the task attribute includes the target number of arithmetic units required to execute the migratable task;
  • Migrating the migratable task to the target node so as to execute the migratable task through the target node.
  • a device for task migration includes:
  • the first determining module is configured to determine a target node matching the migratable task in each node according to the task attribute of the migratable task when it is detected that the migratable task meets the preset migration condition, where the task attribute includes the target number of arithmetic units required to execute the migratable task;
  • the migration module is configured to migrate the migratable task to the target node, so as to execute the migratable task through the target node.
  • a computer device including a memory and a processor
  • the memory stores a computer program that can run on the processor, and the processor implements the steps of any one of the methods described above when executing the computer program.
  • a computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, it realizes the steps of any one of the above-mentioned methods.
  • This application provides a method, device, computer equipment, and readable storage medium for task migration.
  • When it is detected that a migratable task meets the preset migration condition, the CPU determines the target node matching the migratable task among the nodes according to the task attribute of the migratable task.
  • The task attribute includes the target number of arithmetic units required to execute the migratable task. Then, the CPU migrates the migratable task to the target node so as to execute the migratable task through the target node.
  • In this way, the CPU can migrate the migratable task to the target node, thereby reducing the waiting time of the migratable task and improving its execution efficiency.
  • This application also provides a method, device, computer equipment and readable storage medium for job processing.
  • a method for job processing comprising:
  • when the preset processing condition is met, a first node matching the target job is determined among the nodes according to the job attribute of the target job contained in the target task, where the job attribute includes the target number of arithmetic units required to execute the target job;
  • the target job included in the target task is executed through the first node and the second node where the arithmetic unit that executes the target task is located.
  • the determining the first node matching the target job in each node according to the job attribute of the target job included in the target task includes:
  • for each of the nodes, if the number of free arithmetic units in the node is greater than or equal to the target number, the node is determined as the first node.
  • the method further includes:
  • before the first node matching the target job is determined among the nodes according to the job attribute of the target job contained in the target task when the preset processing condition is met, the method further includes:
  • adding the target task to the splittable task list.
  • before the target job included in the target task is executed through the first node and the second node where the arithmetic unit that executes the target task is located, the method further includes:
  • setting the position corresponding to the first node to 1 in the usage mask of the target task.
  • the method before the execution of the target job included in the target task through the first node and the second node where the arithmetic unit that executes the target task is located, the method further includes:
  • if the bits corresponding to the first node and the second node in the affinity mask and the usage mask of the target job are both 1, executing the step of executing, through the first node and the second node where the arithmetic unit that executes the target task is located, the target job included in the target task.
  • the affinity mask of the target job is the same as the affinity mask of the target task
  • the usage mask of the target job is the same as the usage mask of the target task.
  • a device for job processing comprising:
  • the first determining module is used to determine, when the preset processing condition is met, the first node matching the target job among the nodes according to the job attribute of the target job contained in the target task, where the job attribute includes the target number of arithmetic units required to execute the target job;
  • the execution module is configured to execute the target job included in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.
  • a computer device includes a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the steps of any one of the methods when the computer program is executed.
  • a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of any one of the methods.
  • the embodiments of the present application provide a method, device, computer equipment, and readable storage medium for job processing.
  • the CPU determines the first node matching the target job among the nodes according to the job attributes of the target job included in the target task.
  • the job attribute includes the target number of arithmetic units required to execute the target job. Then, the CPU executes the target job included in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.
  • In this way, when the target job would otherwise need to wait a long time, the CPU can jointly execute the target job through the first node and the second node, thereby reducing the waiting time of the target job and improving its execution efficiency.
  • Figure 1-1 is a schematic diagram of an intelligent processor provided by an embodiment of the application.
  • Figure 1-2 is a schematic flowchart of a task migration method provided by an embodiment of the application.
  • Figure 1-3 is a schematic structural diagram of a task migration device provided by an embodiment of this application.
  • Figure 1-4 is a schematic structural diagram of a computer device provided by an embodiment of this application.
  • Figure 2-1 is a schematic flowchart of a method for job splitting and affinity mask modification provided by an embodiment of the application.
  • Figure 2-3 is a schematic flowchart of a method for determining processing conditions provided by an embodiment of the application.
  • Figure 2-4 is a schematic structural diagram of a job processing device provided by an embodiment of the application.
  • the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
  • the phrase "if determined" or "if [the described condition or event] is detected" can be interpreted, depending on the context, as meaning "once determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
  • the task to be executed needs to be allocated to a node for execution.
  • The specific allocation process is: first determine the memory size required to execute the task, and then, according to the remaining memory space of the storage unit corresponding to each node, determine a target node whose remaining memory space meets the memory size. For example, the node with the largest remaining memory space may be used as the target node, or a node may be randomly selected as the target node from among the nodes whose remaining memory space is greater than the memory size. Then, based on the affinity binding principle, the task is assigned to the target node for execution.
  • the embodiment of the present application provides a task migration method.
  • the method can be applied to a chip.
  • the chip may include at least one processor.
  • the chip may have heterogeneous multiprocessors.
  • the chip may include an intelligent processor with a NUMA architecture and a general-purpose processor.
  • The general-purpose processor may be a CPU (Central Processing Unit), and the intelligent processor may be an accelerator, an IPU (Intelligent Processing Unit), or a GPU (Graphics Processing Unit), and may also be another type of intelligent processor, which is not limited in the embodiments of the present application.
  • The method can be applied to a chip, and the CPU in the chip can execute the task migration method described above to schedule multiple tasks to the intelligent processor for processing.
  • the intelligent processor of the chip can also execute the above-mentioned task migration method.
  • For the specific execution process of the task migration method in the embodiment of the present application, please refer to the following description.
  • The intelligent processor with the NUMA architecture includes multiple arithmetic units and multiple storage units.
  • Multiple arithmetic units are usually divided into multiple arithmetic unit groups, and each arithmetic unit group is equipped with at least one storage unit, and an arithmetic unit group and its corresponding storage unit constitute a node.
  • the reading and writing of data required by the arithmetic unit in a node can all be realized through the storage unit in the node, and the reading and writing of data between different nodes is realized through the communication interface.
  • Figure 1-1 is a schematic diagram of an intelligent processor with a NUMA architecture provided by an embodiment of the application.
  • the smart processor contains 16 arithmetic units and 4 storage units.
  • the smart processor is divided into 4 nodes, and each node contains 4 arithmetic units and 1 storage unit.
  • Figure 1-1 only shows an intelligent processor schematically.
  • In practice, each node may include more than four arithmetic units and one storage unit, and the storage unit may include multiple sub-storage units.
  • each node may include four sub-nodes, that is, each node may include 16 arithmetic units.
  • Each sub-node contains four arithmetic units and one sub-storage unit, and the four sub-nodes can be arranged in the same manner as the four nodes. Further, the task migration method described above can be executed among the sub-nodes of a single node, and the execution process of the method is detailed in the description of the task migration method below.
  • For a task to be executed, the processor can allocate, according to the number of arithmetic units required to execute the task, the arithmetic units expected by the task within the node to which the storage unit storing the task data of the task belongs, and add 1 to the waiting reference count (i.e., clu_wait_ref) of each arithmetic unit expected by the task. For example, as shown in Figure 1-1, if the number of arithmetic units required for the task is 2 and the storage unit storing the task data of the task is storage unit 1, the processor can determine arithmetic unit 1 and arithmetic unit 2 in node 1 as the arithmetic units expected by the task, and increment the waiting reference counts of arithmetic unit 1 and arithmetic unit 2 by one.
  • When the processor determines the arithmetic units that execute the task, the task is scheduled to the hardware queue, and the real reference count (i.e., clu_real_ref) of each arithmetic unit that executes the task is incremented by 1.
  • the processor can add 1 to the true reference counts of arithmetic unit 1 and arithmetic unit 2.
  • When the task is completed, the waiting reference count of each arithmetic unit expected by the task is decreased by 1, and at the same time, the real reference count of each arithmetic unit that executes the task is decreased by 1.
  • For example, the processor may decrement the waiting reference count and the real reference count of arithmetic unit 1 and arithmetic unit 2 by one. If the arithmetic units expected by the task are migrated, the waiting reference count of each source arithmetic unit expected by the task is decreased by 1, and the waiting reference count of each destination arithmetic unit expected by the task is increased by 1.
  • For example, if the arithmetic units expected by the task are migrated from arithmetic unit 1 and arithmetic unit 2 to arithmetic unit 3 and arithmetic unit 4, the waiting reference counts of arithmetic unit 1 and arithmetic unit 2 are decreased by 1, and the waiting reference counts of arithmetic unit 3 and arithmetic unit 4 are increased by 1.
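  • As an illustrative aid, the following C sketch shows this wait/real reference-count bookkeeping. Only the counter names clu_wait_ref and clu_real_ref come from the description above; the struct layout and helper function names are assumptions made for the example.

```c
/* Illustrative sketch of the wait/real reference-count bookkeeping described
 * above. Only the counter names clu_wait_ref and clu_real_ref come from the
 * text; the struct layout and helper names are assumptions. */
#include <stddef.h>

struct clu {                  /* one arithmetic unit */
    int clu_wait_ref;         /* tasks expecting to run on this unit */
    int clu_real_ref;         /* tasks scheduled on / executing on this unit */
};

/* A task now expects to run on these n units: bump their waiting counts. */
static void task_expect(struct clu **units, size_t n) {
    for (size_t i = 0; i < n; i++)
        units[i]->clu_wait_ref++;
}

/* The task is dispatched to the hardware queue on these n units. */
static void task_schedule(struct clu **units, size_t n) {
    for (size_t i = 0; i < n; i++)
        units[i]->clu_real_ref++;
}

/* The task finished: release both counts on the units that executed it. */
static void task_complete(struct clu **units, size_t n) {
    for (size_t i = 0; i < n; i++) {
        units[i]->clu_wait_ref--;
        units[i]->clu_real_ref--;
    }
}

/* The task's expected units are migrated from src to dst. */
static void task_migrate_expected(struct clu **src, struct clu **dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        src[i]->clu_wait_ref--;
        dst[i]->clu_wait_ref++;
    }
}
```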
  • Step 1 Obtain the target task to be executed, and determine the task type of the target task, the task execution duration, the minimum cross-node memory access delay of the node to which the third arithmetic unit expected by the target task belongs, and the number of tasks expected to be executed on the third arithmetic unit.
  • the processor needs to determine whether a certain task (that is, the target task) is a migratable task.
  • The processor can obtain the task type of the target task, the task execution duration, the minimum cross-node memory access delay of the node to which the third arithmetic unit expected by the target task belongs, the number of tasks expected to be executed on the third arithmetic unit (that is, the waiting reference count of the third arithmetic unit), and so on.
  • The task types can include memory-intensive (that is, the task contains many I/O (Input/Output) instructions and requires frequent reading and writing of data in the storage unit during execution) and compute-intensive (that is, the task contains many calculation instructions and requires a large amount of computing resources during execution), and may also include other task types, which are not limited in the embodiments of the present application.
  • The processor can determine whether the task type of the target task is compute-intensive, whether the task execution duration of the target task is greater than the minimum cross-node memory access delay, and whether the waiting reference count of the third arithmetic unit is greater than or equal to the third preset number threshold.
  • the third preset number threshold can be set by a technician based on experience.
  • Step 2 If the task type is compute-intensive, and/or the task execution duration is greater than the minimum cross-node memory access delay, and/or the number of tasks expected to be executed on the third arithmetic unit is greater than or equal to the third preset number threshold, the target task is determined to be a migratable task, and the affinity mask of the target task is modified according to the preset affinity mask modification rule.
  • the processor can determine that the target task is a migratable task. Then, the processor can modify the affinity mask of the target task according to the preset affinity mask modification rule.
  • The affinity mask (affinity) of the target task is used to indicate the nodes that can execute the target task among the nodes, and the number of bits in the affinity mask equals the total number of nodes contained in the intelligent processor.
  • Each bit uniquely corresponds to a node: if a bit is 1, it means that the node corresponding to the bit can execute the target task; if a bit is 0, it means that the node corresponding to the bit cannot execute the target task.
  • The affinity mask modification rule can be set by a technician according to the migration scope of the migratable task.
  • For example, if the affinity mask modification rule is that a migratable task can be migrated to all nodes, and the original affinity mask of the target task is 0001, then when the target task is a migratable task, the processor can modify the affinity mask of the target task to 1111 according to the affinity mask modification rule. For another example, if the affinity mask modification rule is that a migratable task can be migrated to node 3 and node 4, and the original affinity mask of the target task is 0001, then when the target task is a migratable task, the processor can modify the affinity mask of the target task to 1101 according to the affinity mask modification rule.
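  • A hedged sketch of this migratable-task check and affinity-mask update is given below in C. The thresholds, field names, and the bit layout (bit 0 for node 1 through bit 3 for node 4) are illustrative assumptions; only the decision criteria and the example mask values come from the description above.

```c
/* Hedged sketch of the migratable-task check and affinity-mask update.
 * Thresholds, field names, and the 4-bit mask layout (bit 0 = node 1,
 * bit 3 = node 4) are illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>

enum task_type { MEMORY_INTENSIVE, COMPUTE_INTENSIVE };

struct task {
    enum task_type type;
    uint64_t exec_time_ns;   /* task execution duration */
    uint8_t  affinity_mask;  /* one bit per node, 1 = node may execute the task */
};

/* Decide whether the target task is migratable and, if so, widen its
 * affinity mask according to a preset modification rule. */
static bool mark_migratable(struct task *t,
                            uint64_t min_cross_node_latency_ns,
                            int wait_ref_of_expected_unit,
                            int third_preset_threshold,
                            uint8_t rule_mask /* e.g. 0x0F: all nodes */)
{
    bool migratable =
        t->type == COMPUTE_INTENSIVE ||
        t->exec_time_ns > min_cross_node_latency_ns ||
        wait_ref_of_expected_unit >= third_preset_threshold;

    if (migratable)
        t->affinity_mask |= rule_mask;   /* e.g. 0001 -> 1111, or 0001 -> 1101 */
    return migratable;
}
```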
  • Step 1 If the task attribute of the task executed in the second arithmetic unit expected by the migratable task is different from the task attribute of the migratable task, it is determined that the migratable task meets the preset migration condition.
  • When a certain arithmetic unit is assigned to execute a certain task, the arithmetic unit can only execute tasks whose task attribute is the same as the task attribute of that task. Here, the task attribute is the number of arithmetic units required to execute the task.
  • the processor can obtain the task attributes of the migratable task and the task attributes of the tasks executed in the second computing unit expected by the migratable task. Then, the processor can determine whether the task attribute of the task executed in the second computing unit is the same as the task attribute of the migratable task.
  • If they are different, the processor can determine that the migratable task meets the preset migration condition. In this way, the processor can subsequently migrate the migratable task to another node that can execute the migratable task. If the task attribute of the task executed in the second arithmetic unit is the same as the task attribute of the migratable task, the processor executes step two.
  • Step 2 If the task attribute of the task executed in the second arithmetic unit is the same as the task attribute of the migratable task, it is determined whether the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold.
  • the processor may further determine whether the total number of tasks to be executed in the second arithmetic unit (that is, the true reference count of the second arithmetic unit) is greater than or equal to the first preset number threshold.
  • the first preset number threshold can be set by a technician based on experience.
  • If the total number of tasks to be executed in the second arithmetic unit is less than the first preset number threshold, the processor determines that the migratable task does not meet the preset migration condition. If the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold, the processor executes step three.
  • Step 3 If the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold, it is determined that the migratable task meets the preset migration condition.
  • In this case, the processor can determine that the migratable task satisfies the preset migration condition, so that the processor migrates the migratable task to another node, thereby reducing the waiting time of the migratable task and improving its execution efficiency. If the total number of tasks to be executed in the second arithmetic unit is less than the first preset number threshold, it means that the migratable task can be executed by the second arithmetic unit without waiting a long time. Correspondingly, the processor may determine that the migratable task does not meet the preset migration condition.
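  • The sketch below summarizes this migration-condition check in C; the struct fields and the threshold parameter are assumptions, while the two criteria (attribute mismatch, backlog at or above the first preset number threshold) follow the steps above.

```c
/* Sketch of the preset migration condition from steps 1-3 above. The struct
 * fields are assumptions; "task attribute" is the number of arithmetic units
 * a task requires, as defined in the text. */
#include <stdbool.h>

struct unit_state {
    int bound_task_attr;   /* attribute of the tasks bound to this unit */
    int clu_real_ref;      /* total number of tasks to be executed on this unit */
};

static bool meets_migration_condition(int migratable_task_attr,
                                      const struct unit_state *expected_unit,
                                      int first_preset_threshold)
{
    /* Step 1: a unit only runs tasks with the same attribute, so a mismatch
     * means the migratable task should be moved elsewhere. */
    if (expected_unit->bound_task_attr != migratable_task_attr)
        return true;

    /* Steps 2-3: same attribute, but the unit's backlog is too long. */
    return expected_unit->clu_real_ref >= first_preset_threshold;
}
```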
  • Step 201 When it is detected that the migratable task meets the preset migration condition, a target node that matches the migratable task is determined in each node according to the task attribute of the migratable task.
  • the task attribute includes the target number of arithmetic units required to execute the migratable task.
  • the processor can determine whether the task is a migratable task. If the task is a migratable task, the processor may further detect whether the migratable task meets the preset migration condition. When the processor detects that the migratable task meets the preset migration condition, it can determine the target node matching the migratable task among the nodes according to the task attributes of the migratable task. Wherein, the task attribute of the migratable task includes the target number of computing units required to execute the migratable task.
  • the target number of arithmetic units required by the target task may be represented by a task identifier, and the task identifier may be a Block task or a Union task, etc., which is not specifically limited here.
  • For example, if the task identifier is a Union task, it indicates that more than one arithmetic unit is required to run the target task; if the task identifier is a Block task, it indicates that one arithmetic unit is required to run the target task.
  • The specific processing process by which the processor determines the target node matching the migratable task in each node is as follows.
  • Step 1 If there are candidate nodes containing the target number of free computing units among the nodes, the candidate node with the smallest distance to the node to which the computing units expected by the migratable task belong is determined as the target node.
  • When the processor detects that the migratable task meets the preset migration condition, it can first determine whether there is a candidate node containing the target number of free computing units (that is, computing units whose real reference count equals 0) among the nodes. If there is such a candidate node, the processor can determine, among the candidate nodes, the candidate node with the smallest distance to the node to which the computing units expected by the migratable task belong as the target node. In this way, the processor can subsequently migrate the migratable task to the target node and execute it through the idle computing units in the target node, thereby reducing the waiting time of the migratable task and improving its execution efficiency.
  • For example, if the node to which the computing units expected by the migratable task belong is node 1, the candidate nodes are node 1 and node 2, and the distances from node 1 to node 1 and node 2 are 0 and 1 respectively, then the target node is node 1.
  • For another example, if the node to which the computing units expected by the migratable task belong is node 1, the candidate nodes are node 2 and node 4, and the distances from node 1 to node 2 and node 4 are 1 and 2 respectively, then the target node is node 2.
  • If there are multiple candidate nodes with the smallest distance among the candidate nodes, the processor can determine the target node in ascending or descending order of the node identifiers, or randomly select one of these candidate nodes as the target node, which is not specifically limited here.
  • Step 2 If there is no candidate node containing the target number of idle arithmetic units in each node, the node containing the first arithmetic unit with the smallest total number of tasks to be executed is determined as the target node.
  • Here, the first arithmetic unit is an arithmetic unit in which the task attribute of the executed task is the same as the task attribute of the migratable task.
  • As described above, when a certain arithmetic unit is assigned to execute a certain task, the arithmetic unit can only execute tasks with the same task attribute. Based on this principle, if there is no candidate node containing the target number of idle arithmetic units among the nodes, the processor can further determine, in each node, the first arithmetic units, that is, the arithmetic units in which the task attribute of the executed task is the same as that of the migratable task. Then, the processor may determine, among the first arithmetic units, the first arithmetic unit with the smallest total number of tasks to be executed (that is, the smallest real reference count), and use the node to which that first arithmetic unit belongs as the target node.
  • In this way, the processor can subsequently migrate the migratable task to the target node and execute the migratable task through the target computing unit in the target node, thereby reducing the waiting time of the migratable task and improving its execution efficiency.
  • For example, suppose the number of arithmetic units required to execute the migratable task is 3. In node 1, the arithmetic units bound to tasks requiring 3 arithmetic units are arithmetic units 1 to 3, and the total number of tasks to be executed on arithmetic units 1 to 3 is 10. In node 2, the corresponding arithmetic units are arithmetic units 6 to 8, with a total of 15 tasks to be executed. In node 4, the corresponding arithmetic units are arithmetic units 13 to 15, with a total of 5 tasks to be executed. Then the target node is node 4, to which arithmetic units 13 to 15 belong.
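  • A C sketch of this two-step target-node selection is shown below. The 4-node layout, the distance table, and the per-unit bookkeeping fields are assumptions for a processor like the one in Figure 1-1; the selection logic itself follows steps 1 and 2 above.

```c
/* Sketch of the two-step target-node selection above, assuming a 4-node
 * processor as in Figure 1-1. The distance table and field names are
 * illustrative assumptions. */
#include <limits.h>

#define NODES           4
#define UNITS_PER_NODE  4

struct node_state {
    int free_units;                     /* units whose real reference count is 0 */
    int unit_real_ref[UNITS_PER_NODE];  /* backlog of each unit */
    int unit_task_attr[UNITS_PER_NODE]; /* task attribute bound to each unit */
};

static int pick_target_node(const struct node_state nodes[NODES],
                            int dist[NODES][NODES],
                            int home_node,     /* node of the expected units */
                            int target_units,  /* units the migratable task needs */
                            int task_attr)
{
    /* Step 1: candidate nodes with enough idle units -> smallest distance. */
    int best = -1, best_dist = INT_MAX;
    for (int n = 0; n < NODES; n++) {
        if (nodes[n].free_units >= target_units && dist[home_node][n] < best_dist) {
            best = n;
            best_dist = dist[home_node][n];
        }
    }
    if (best >= 0)
        return best;

    /* Step 2: otherwise pick the node owning the first arithmetic unit
     * (same task attribute) with the smallest backlog of pending tasks. */
    int best_ref = INT_MAX;
    for (int n = 0; n < NODES; n++) {
        for (int u = 0; u < UNITS_PER_NODE; u++) {
            if (nodes[n].unit_task_attr[u] == task_attr &&
                nodes[n].unit_real_ref[u] < best_ref) {
                best_ref = nodes[n].unit_real_ref[u];
                best = n;
            }
        }
    }
    return best;   /* -1 if no suitable unit was found */
}
```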
  • Task migration may affect the execution of other tasks. Therefore, before determining, when it is detected that the migratable task meets the preset migration condition, the target node matching the migratable task among the nodes according to the task attribute of the migratable task, the processor can determine whether the arithmetic units are unevenly loaded. The specific processing process is as follows.
  • Step 1 Obtain the number of tasks expected to be executed in each arithmetic unit.
  • the processor can obtain the number of tasks expected to be executed in each arithmetic unit (that is, the waiting reference count of each arithmetic unit). Then, the processor can determine the maximum waiting reference count and the minimum waiting reference count among the waiting reference counts of each arithmetic unit. After that, the processor may calculate the difference between the maximum waiting reference count and the minimum waiting reference count (that is, the maximum difference), and determine whether the maximum difference is greater than or equal to the second preset number threshold.
  • The second preset number threshold can be set by a technician based on experience. If the maximum difference is less than the second preset number threshold, it means that the computing units in the intelligent processor are not unevenly loaded, and the processor does not need to perform task migration. If the maximum difference is greater than or equal to the second preset number threshold, it means that the computing units in the intelligent processor are unevenly loaded, and the processor executes step two.
  • Step 2 If the maximum difference between the number of tasks expected to be executed in each arithmetic unit is greater than or equal to the second preset number threshold, when it is detected that the migratable task meets the preset migrating condition, according to the task of the migratable task Properties, in each node, determine the target node that matches the migratable task.
  • In this case, when the processor detects that the migratable task meets the preset migration condition, it determines, according to the task attribute of the migratable task, the target node matching the migratable task among the nodes. The process of determining the target node here is similar to step 201 and is not repeated.
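  • The load-imbalance test can be sketched as follows in C; the array of waiting reference counts and the threshold parameter are assumptions, and the spread computation mirrors the maximum-difference check described above.

```c
/* Sketch of the load-imbalance test: migration is only considered when the
 * spread of waiting reference counts across the arithmetic units reaches the
 * second preset number threshold. Assumes n_units >= 1. */
#include <stdbool.h>
#include <stddef.h>

static bool load_is_uneven(const int wait_ref[], size_t n_units,
                           int second_preset_threshold)
{
    int lo = wait_ref[0], hi = wait_ref[0];
    for (size_t i = 1; i < n_units; i++) {
        if (wait_ref[i] < lo) lo = wait_ref[i];
        if (wait_ref[i] > hi) hi = wait_ref[i];
    }
    return (hi - lo) >= second_preset_threshold;
}
```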
  • Step 202 Migrate the migratable task to the target node, so as to execute the migratable task through the target node.
  • the processor determines the target node, it can migrate the migratable task to the target node, so as to execute the migratable task through the target node.
  • the processor may also modify the use mask of the migratable task.
  • The specific processing process is: if the target node is not the same as the node where the arithmetic units expected by the migratable task are located, then in the usage mask of the migratable task, the position corresponding to the target node is set to 1, and the position corresponding to the node where the arithmetic units expected by the migratable task are located is set to 0.
  • The usage mask (usage_mask) of the migratable task is used to indicate the nodes determined to execute the migratable task among the nodes.
  • The number of bits in the usage mask equals the total number of nodes contained in the chip, and each bit uniquely corresponds to a node: if a bit is 1, it means that the node corresponding to the bit is determined to execute the migratable task; if a bit is 0, it means that the node corresponding to the bit does not execute the migratable task.
  • the processor determines the target node of the migratable task, it can determine whether the target node is the same as the node where the computing unit expected by the migratable task is located.
  • If the target node is the same as the node where the computing units expected by the migratable task are located, the processor does not need to modify the usage mask of the migratable task. If the target node is not the same as the node where the computing units expected by the migratable task are located, the processor can set the position corresponding to the target node to 1 in the usage mask of the migratable task, and set the position corresponding to the node where the computing units expected by the migratable task are located to 0. For example, if the node where the computing units expected by the migratable task are located is node 1, the original usage mask of the migratable task is 0001. Assuming that the target node is node 2, the modified usage mask of the migratable task is 0010.
  • the processor can also determine whether the migratable task can be migrated to the target node according to the affinity mask and the usage mask of the migratable task.
  • the specific processing process is: if the bits corresponding to the target node in the affinity mask and the usage mask of the migratable task are both 1, then the migratable task is migrated to the target node.
  • the processor can determine whether the bits corresponding to the target node in the affinity mask and the usage mask of the migratable task are both 1. If the bits corresponding to the target node in the affinity mask and the usage mask of the migratable task are both 1, it means that the migratable task can be migrated to the target node. Correspondingly, the processor can migrate the migratable task to the target node. If the position corresponding to the target node in the affinity mask of the migratable task is 0, it means that the migratable task cannot be migrated to the target node.
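  • The usage-mask update and the affinity/usage check that gate the actual migration can be sketched in C as follows, using the same bit-per-node layout assumed earlier (node 1 on bit 0). The struct and function names are illustrative.

```c
/* Sketch of the usage-mask update and the affinity/usage check that gate the
 * actual migration. Node n is assumed to map to bit n-1, so the node indices
 * passed in are zero-based bit positions. */
#include <stdbool.h>
#include <stdint.h>

struct masked_task {
    uint8_t affinity_mask;  /* nodes allowed to execute the task */
    uint8_t usage_mask;     /* nodes determined to execute the task */
};

/* If the target node differs from the node of the expected units, set the
 * target node's bit and clear the old node's bit in the usage mask,
 * e.g. 0001 -> 0010 when moving from node 1 to node 2. */
static void update_usage_mask(struct masked_task *t, int target_bit, int old_bit)
{
    if (target_bit == old_bit)
        return;
    t->usage_mask |= (uint8_t)(1u << target_bit);
    t->usage_mask &= (uint8_t)~(1u << old_bit);
}

/* The task may only migrate when both masks have the target node's bit set. */
static bool may_migrate_to(const struct masked_task *t, int target_bit)
{
    uint8_t bit = (uint8_t)(1u << target_bit);
    return (t->affinity_mask & bit) && (t->usage_mask & bit);
}
```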
  • the target node that matches the migratable task is determined in each node according to the task attribute of the migratable task.
  • the task attribute includes the target number of arithmetic units required to execute the migratable task.
  • the processor migrates the migratable task to the target node to execute the migratable task through the target node.
  • In this way, the processor can migrate the migratable task to the target node, thereby reducing the waiting time of the migratable task and improving its execution efficiency.
  • the embodiment of the present application also provides a device for task migration. As shown in Figures 1-3, the device includes:
  • the first determining module 310 is configured to, when it is detected that the migratable task meets the preset migration condition, determine the target node matching the migratable task in each node according to the task attribute of the migratable task, where the task attribute includes the target number of arithmetic units required to execute the migratable task;
  • the migration module 320 is configured to migrate the migratable task to the target node, so as to execute the migratable task through the target node.
  • the first determining module 310 is specifically configured to:
  • if there are candidate nodes containing the target number of free computing units among the nodes, the candidate node with the smallest distance to the node to which the computing units expected by the migratable task belong is determined as the target node;
  • if there is no candidate node containing the target number of free computing units among the nodes, the node containing the first arithmetic unit with the smallest total number of tasks to be executed is determined as the target node, where the first arithmetic unit is an arithmetic unit in which the task attribute of the executed task is the same as the task attribute of the migratable task.
  • the device further includes:
  • the second determining module is configured to determine that the migratable task meets the preset migration condition if the task attribute of the task executed in the second computing unit expected by the migratable task is different from the task attribute of the migratable task;
  • a judging module, configured to judge, if the task attribute of the task executed in the second arithmetic unit is the same as the task attribute of the migratable task, whether the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold;
  • the third determining module is configured to determine that the migratable task meets the preset migration condition if the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold.
  • the device further includes:
  • the obtaining module is used to obtain the number of tasks expected to be executed in each arithmetic unit;
  • the fourth determining module is configured to, if the maximum difference between the numbers of tasks expected to be executed in the arithmetic units is greater than or equal to the second preset number threshold, trigger the first determining module 310 to execute the step of determining, when it is detected that the migratable task meets the preset migration condition, the target node matching the migratable task in each node according to the task attribute of the migratable task.
  • the device further includes:
  • the fifth determining module is used to obtain the target task to be executed, and determine the task type of the target task, the task execution duration, the minimum cross-node memory access delay of the node to which the third arithmetic unit expected by the target task belongs, and the number of tasks expected to be executed on the third arithmetic unit;
  • the modification module is used to determine the target task as a migratable task and modify the affinity mask of the target task according to the preset affinity mask modification rule, if the task type is compute-intensive, and/or the task execution duration is greater than the minimum cross-node memory access delay, and/or the number of tasks expected to be executed on the third arithmetic unit is greater than or equal to the third preset number threshold.
  • the device further includes:
  • the setting module is used to, if the target node is not the same as the node where the computing units expected by the migratable task are located, set the position corresponding to the target node to 1 in the usage mask of the migratable task, and set the position corresponding to the node where the computing units expected by the migratable task are located to 0.
  • the device further includes:
  • the sixth determining module is used to trigger the migration module 320 to execute the step of migrating the migratable task to the target node if the bits corresponding to the target node in the affinity mask and the usage mask of the migratable task are both 1.
  • When it is detected that a migratable task meets the preset migration condition, the CPU determines the target node matching the migratable task among the nodes according to the task attribute of the migratable task.
  • The task attribute includes the target number of arithmetic units required to execute the migratable task. Then, the CPU migrates the migratable task to the target node so as to execute the migratable task through the target node.
  • In this way, the CPU can migrate the migratable task to the target node, thereby reducing the waiting time of the migratable task and improving its execution efficiency.
  • The present application also provides a computer device, including a memory and a processor, where the memory stores a computer program that can run on the processor, and the processor implements the method steps of the above task migration when executing the computer program.
  • the implementation process of the method in which the processor executes the above-mentioned task migration can refer to FIG. 1-2 and the above description, which will not be repeated here.
  • a computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the steps of the above-mentioned task migration method are realized.
  • a certain task can be split into at least one subtask (hereinafter referred to as a job).
  • Different jobs are also allocated to a certain node for execution. Due to the affinity binding principle in the above allocation process, a job often needs to wait a long time, which seriously affects the execution efficiency of the job.
  • an embodiment of the present application also provides a job processing method, which can be applied to a chip, and the chip can include an intelligent processor with a NUMA architecture and a general-purpose processor.
  • The general-purpose processor can be a CPU (Central Processing Unit) or the like.
  • The intelligent processor using the NUMA architecture can be an accelerated processor, an IPU (Intelligent Processing Unit), a GPU (Graphics Processing Unit), or another type of processor, which is not limited in the embodiments of this application.
  • the method can be applied to the above-mentioned chip, and a general-purpose processor (CPU) in the above-mentioned chip can execute the above-mentioned job processing method to distribute multiple jobs to at least one arithmetic unit in the intelligent processor for execution.
  • For the specific execution process of the job processing method of this application, please refer to the following description.
  • The intelligent processor with the NUMA architecture includes multiple arithmetic units and multiple storage units.
  • Multiple arithmetic units are usually divided into multiple arithmetic unit groups, and each arithmetic unit group is equipped with at least one storage unit, and an arithmetic unit group and its corresponding storage unit constitute a node.
  • the reading and writing of data required by the arithmetic unit in a node can all be realized through the storage unit in the node, and the reading and writing of data between different nodes is realized through the communication interface.
  • Figure 1-1 is a schematic diagram of an intelligent processor with a NUMA architecture provided by an embodiment of the application. As shown in Figure 1-1, the smart processor contains 16 arithmetic units and 4 storage units.
  • each node contains 4 arithmetic units and 1 storage unit.
  • Figure 1-1 only shows an intelligent processor schematically.
  • In practice, each node may include more than four arithmetic units and one storage unit, and the storage unit may include multiple sub-storage units.
  • each node may include four sub-nodes, that is, each node may include 16 arithmetic units.
  • Each sub-node contains four arithmetic units and one sub-storage unit, and the four sub-nodes can be arranged in the same manner as the four nodes.
  • the above-mentioned job processing method may be executed between the sub-nodes of a single node, and the execution process can be referred to the description of the job processing method below.
  • the embodiment of this application first introduces the division of jobs and the modification of the affinity mask, as shown in Figure 2-1.
  • the specific processing process is as follows:
  • Step 201 Obtain the target task to be executed, and determine each dimensional information of the target task and the target number of arithmetic units required to execute the target task.
  • After obtaining the target task to be executed, the processor can determine the dimension information of the target task (that is, dimX, dimY, and dimZ) and the target number of arithmetic units required to execute the target task (i.e., kernel_class).
  • the processor can calculate the ratio of the product of each dimension information (that is, dimX*dimY*dimZ) to the target number, and determine whether the ratio is greater than one. If the ratio is greater than 1, it means that the target task can be split into multiple jobs, and the processor executes step 202. If the ratio is less than or equal to 1, it means that the target task cannot be split into multiple jobs.
  • Step 202 If the ratio of the product of each dimension information to the number of targets is greater than 1, the target task is added to the list of splittable tasks.
  • the processor can add the target task to the list of splittable tasks.
  • the splittable task list is used to store tasks that can be split into multiple jobs; the splittable task list may be a linked list or other types of lists, which is not limited in the embodiment of the present application.
  • the processor may delete the task from the splittable task list.
  • When it is determined that the target task can be split into multiple jobs, the target task can be sent to the scheduler, and the scheduler can split the target task into multiple jobs based on task attributes such as the dimension information of the target task and the target number of arithmetic units required by the target task.
  • the scheduler may be a hardware scheduler placed on a chip, and the hardware scheduler may include multiple circuit modules such as a task splitting unit.
  • the scheduler may also be a software scheduler, which is not specifically limited here.
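  • The splittable check in steps 201-202 can be sketched as follows in C; the struct and field grouping other than dimX, dimY, dimZ, and kernel_class are assumptions, and list handling is omitted. Under this sketch, a task for which is_splittable returns true would be appended to the splittable task list and later handed to the scheduler.

```c
/* Sketch of the splittable-task test in steps 201-202: a task can be split
 * into jobs when dimX*dimY*dimZ exceeds the target number of arithmetic
 * units (kernel_class). Struct and field grouping are assumptions. */
#include <stdbool.h>

struct target_task {
    int dimX, dimY, dimZ;   /* dimension information of the task */
    int kernel_class;       /* target number of arithmetic units required */
};

static bool is_splittable(const struct target_task *t)
{
    /* ratio > 1  <=>  dimX*dimY*dimZ > kernel_class (all values positive) */
    return (t->dimX * t->dimY * t->dimZ) > t->kernel_class;
}
```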
  • Step 203 Modify the affinity mask of the target task according to the preset affinity mask modification rule.
  • the processor can modify the affinity mask of the target task according to the preset affinity mask modification rule.
  • The affinity mask (affinity) of the target task is used to indicate the nodes that can execute the target task among the nodes, and the number of bits in the affinity mask equals the total number of nodes contained in the intelligent processor.
  • a bit uniquely corresponds to a node. If a bit is 1, it means that the node corresponding to the bit can perform the target task; if a bit is 0, it means that the node corresponding to the bit cannot perform the target task.
  • the affinity mask modification rule can be set by the technician according to the range of nodes processed by the job.
  • For example, if the affinity mask modification rule allows the target task to be processed by all nodes, the processor can modify the affinity mask of the target task to 1111 according to the affinity mask modification rule.
  • For another example, if the affinity mask modification rule is that the task can be migrated to node 3 and node 4, and the original affinity mask of the target task is 0001, the processor can modify the affinity mask of the target task to 1101 according to the affinity mask modification rule.
  • the affinity mask of the target job is the same as the affinity mask of the target task.
  • Step 301 When the preset processing conditions are met, the first node matching the target job is determined in each node according to the job attribute of the target job included in the target task.
  • the job attribute includes the target number of arithmetic units required to execute the target job.
  • The processor can determine whether the target task can be split into multiple target jobs (JOB). If the target task can be split into multiple target jobs, it can be determined that the target task is a task whose affinity can be relaxed, and the processor can then further determine whether the preset processing condition is satisfied. The processing procedure by which the processor determines whether the preset processing condition is met will be described in detail later. When the preset processing condition is met, the processor may determine the first node matching the target job among the nodes according to the job attribute of the target job. Here, the job attribute of the target job includes the target number of arithmetic units required to execute the target job.
  • The specific processing process by which the processor determines the first node matching the target job among the nodes is: for each of the nodes, if the number of idle computing units in the node is greater than or equal to the target number, the node is determined as the first node.
  • For each node, the processor can obtain the number of free arithmetic units in the node (that is, arithmetic units whose real reference count equals 0). Then, the processor can determine whether the number of free arithmetic units in the node is greater than or equal to the target number. If the number of free computing units in the node is greater than or equal to the target number, it means that the node can execute the target job included in the target task. Correspondingly, the processor may determine the node as a first node. If the number of free arithmetic units in the node is less than the target number, it means that the node cannot execute the target job included in the target task. Correspondingly, this node is not a first node.
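  • A C sketch of this first-node selection is given below; the node layout and field names are assumptions carried over from the earlier sketches.

```c
/* Sketch of the first-node selection in step 301: a node can serve as a first
 * node when its count of idle arithmetic units (real reference count == 0)
 * reaches the target number. Layout and names are assumptions. */
#define NODES          4
#define UNITS_PER_NODE 4

struct jp_node {
    int unit_real_ref[UNITS_PER_NODE];
};

/* Marks first[n] = 1 for every node able to take the target job and returns
 * how many such nodes were found. */
static int find_first_nodes(const struct jp_node nodes[NODES],
                            int target_units, int first[NODES])
{
    int found = 0;
    for (int n = 0; n < NODES; n++) {
        int idle = 0;
        for (int u = 0; u < UNITS_PER_NODE; u++)
            if (nodes[n].unit_real_ref[u] == 0)
                idle++;
        first[n] = (idle >= target_units);
        found += first[n];
    }
    return found;
}
```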
  • the processing procedure for the processor to determine whether the preset processing condition is satisfied is as follows:
  • Step 401 Obtain idle time lengths of idle computing units in each node.
  • For each node, the processor can obtain the idle duration of each idle arithmetic unit (that is, each arithmetic unit whose real reference count equals 0) in the node.
  • Step 402 If there is a task containing multiple jobs waiting to be executed in the splittable task list, and there is an idle duration greater than or equal to the preset duration threshold among the idle durations of the idle computing units, it is determined that the preset processing condition is satisfied.
  • After obtaining the idle durations, the processor can further determine whether there is a task containing multiple jobs waiting to be executed in the splittable task list, and determine whether there is an idle duration greater than or equal to the preset duration threshold among the idle durations of the idle computing units.
  • The preset duration threshold can be set by a technician based on experience. If there is a task containing multiple jobs waiting to be executed in the splittable task list, and there is an idle duration greater than or equal to the preset duration threshold among the idle durations of the idle computing units, it means that the processor can allocate the tasks waiting to be executed in the splittable task list to the idle arithmetic units in each node for execution.
  • In this case, the processor can determine that the preset processing condition is satisfied. If there is no task waiting to be executed in the splittable task list, or there is no idle duration greater than or equal to the preset duration threshold among the idle durations of the idle computing units, it means that the processor cannot allocate the tasks waiting to be executed in the splittable task list to the idle computing units in each node for execution.
  • the processor may determine that the preset processing condition is not satisfied.
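  • The processing-condition test of steps 401-402 can be sketched as follows in C; the parameter names and the use of nanoseconds for durations are assumptions.

```c
/* Sketch of the preset processing condition from steps 401-402: the splittable
 * task list must hold a waiting multi-job task and at least one idle unit must
 * have been idle for the preset duration. Names and units are assumptions. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool processing_condition_met(bool list_has_waiting_multi_job_task,
                                     const uint64_t idle_ns[], size_t n_idle_units,
                                     uint64_t preset_duration_ns)
{
    if (!list_has_waiting_multi_job_task)
        return false;
    for (size_t i = 0; i < n_idle_units; i++)
        if (idle_ns[i] >= preset_duration_ns)
            return true;
    return false;
}
```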
  • Step 302 Execute the target job included in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.
  • The processor can execute the target job included in the target task through the first node and the second node where the computing unit that executes the target task is located. In this way, when the target job would otherwise need to wait a long time before it can be executed by the arithmetic unit it is waiting for, the processor can execute the target job through the first node and the second node, thereby reducing the waiting time of the target job and improving its execution efficiency.
  • the processor may also modify the use mask of the target job according to the determined first node.
  • the process is: in the use mask of the target task, set the position corresponding to the first node to 1.
  • the usage_mask of the target task is used to indicate the node that determines the execution of the target task in each node.
  • The number of bits in the usage mask equals the total number of nodes contained in the intelligent processor, and each bit uniquely corresponds to a node: if a bit is 1, it means that the node corresponding to the bit is determined to execute the target task; if a bit is 0, it means that the node corresponding to the bit does not execute the target task.
  • After the first node is determined, the position corresponding to the first node can be set to 1 in the usage mask of the target task. For example, the original usage mask of the target task is 0001. Assuming that the first nodes are node 2 and node 4, the modified usage mask of the target task is 1011.
  • the usage mask of the target job is the same as the usage mask of the target task.
  • Optionally, the processor may also determine, according to the affinity mask and the usage mask of the target job, whether the first node and the second node can execute the target job. The specific processing process is: if the bits corresponding to the first node and the second node in the affinity mask and the usage mask of the target job are both 1, the step of executing, through the first node and the second node where the arithmetic unit that executes the target task is located, the target job included in the target task is executed.
  • For each determined first node, the processor can determine whether the bits corresponding to the first node in the affinity mask and the usage mask of the target job are both 1. If the bits corresponding to the first node are both 1, it means that the first node can execute the target job. In the same way, for the second node where the arithmetic unit that executes the target task is located, the processor can determine whether the bits corresponding to the second node in the affinity mask and the usage mask of the target job are both 1. If the bits corresponding to the second node are both 1, it means that the second node can execute the target job. Correspondingly, the processor can execute the target job included in the target task through the first node and the second node.
  • Otherwise, if the bit corresponding to the first node in the affinity mask or the usage mask is 0, the processor will execute the target job only through the second node.
  • For example, suppose the affinity mask of the target job is 1101, the usage mask is 1001, the first nodes are node 2 and node 4, and the second node is node 1. The bits corresponding to node 1 are both 1 and the bits corresponding to node 4 are both 1, but the bit corresponding to node 2 in the affinity mask is 0, so the processor can execute the target job through node 1 and node 4.
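  • The per-node check in this example can be expressed compactly in C as a bitwise AND over the two masks and the candidate nodes; the helper below is an illustrative sketch using the same bit-per-node layout (node 1 on bit 0).

```c
/* Sketch of the per-node check before joint execution: a node participates
 * only if its bit is 1 in both the job's affinity mask and its usage mask.
 * With affinity 1101, usage 1001 and candidates {node 1, node 2, node 4}
 * (node 1 on bit 0), this yields node 1 and node 4, as in the example above. */
#include <stdint.h>

static uint8_t nodes_allowed_to_run(uint8_t affinity_mask, uint8_t usage_mask,
                                    uint8_t candidate_nodes)
{
    return (uint8_t)(affinity_mask & usage_mask & candidate_nodes);
}
```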
  • the processor determines the first node matching the target job among the nodes according to the job attributes of the target job included in the target task.
  • the job attribute includes the target number of arithmetic units required to execute the target job. Then, the processor executes the target job included in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.
  • In this way, the processor can execute the target job through the first node and the second node, thereby reducing the waiting time of the target job and improving its execution efficiency.
  • An embodiment of the present application also provides a device for job processing. As shown in Figures 2-4, the device includes:
  • the first determining module 510 is used to determine, when the preset processing condition is met, the first node matching the target job in each node according to the job attribute of the target job contained in the target task, where the job attribute includes the target number of arithmetic units required to execute the target job;
  • the execution module 520 is configured to execute the target job included in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.
  • the first determining module 510 is specifically configured to:
  • for each of the nodes, if the number of free arithmetic units in the node is greater than or equal to the target number, the node is determined as the first node.
  • the device further includes:
  • the obtaining module is used to obtain the idle time of the idle computing unit in each node;
  • the second determining module is used to determine that the preset processing condition is met if there is a task containing multiple jobs waiting to be executed in the splittable task list and there is an idle duration greater than or equal to the preset duration threshold among the idle durations of the idle computing units.
  • the device further includes:
  • the third determining module is used to obtain the target task to be executed, and determine the dimensional information of the target task and the target number of computing units required to execute the target task;
  • the modification module is used to modify the affinity mask of the target task according to the preset affinity mask modification rule.
  • the device further includes:
  • the setting module is used to set the position corresponding to the first node to 1 in the use mask of the target task.
  • the device further includes:
  • the fourth determining module is used to trigger the execution module 520 to execute the step of executing, through the first node and the second node where the arithmetic unit that executes the target task is located, the target job included in the target task, if the bits corresponding to the first node and the second node in the affinity mask and the usage mask of the target job are both 1.
  • the affinity mask of the target job is the same as the affinity mask of the target task
  • the usage mask of the target job is the same as the usage mask of the target task
  • the CPU determines the first node matching the target job among the nodes according to the job attributes of the target job included in the target task.
  • the job attribute includes the target number of arithmetic units required to execute the target job. Then, the CPU executes the target job included in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.
  • In this way, the CPU can jointly execute the target job through the first node and the second node, thereby reducing the waiting time of the target job and improving its execution efficiency.
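  • The following C++ sketch is a minimal, hypothetical illustration of how the processing-condition check and the first-node selection performed by these modules might be expressed; the structure fields, threshold parameters, and function names are assumptions rather than the disclosed implementation:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-node view used by the job-processing decision.
struct NodeState {
    int id;
    uint32_t idleUnits;          // arithmetic units with no queued tasks
    uint64_t maxIdleDurationNs;  // longest idle duration among its idle units
};

// Preset processing condition: a splittable task is waiting and at least one
// idle unit has been idle for at least the preset duration threshold.
bool processingConditionMet(bool splittableTaskWaiting,
                            const std::vector<NodeState>& nodes,
                            uint64_t presetIdleThresholdNs) {
    if (!splittableTaskWaiting) return false;
    for (const auto& n : nodes)
        if (n.maxIdleDurationNs >= presetIdleThresholdNs) return true;
    return false;
}

// First-node selection: every node with enough idle units for the target job.
std::vector<int> firstNodesForJob(const std::vector<NodeState>& nodes,
                                  uint32_t targetUnitCount) {
    std::vector<int> firstNodes;
    for (const auto& n : nodes)
        if (n.idleUnits >= targetUnitCount) firstNodes.push_back(n.id);
    return firstNodes;
}
```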
  • A computer device provided by the present application includes a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the method steps of the above-mentioned job processing when executing the computer program.
  • the implementation process of the method in which the processor executes the above-mentioned job processing can refer to FIG. 2-1, FIG. 2-2, FIG. 2-3, and the above description, which will not be repeated here.
  • a computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the steps of the above-mentioned job processing method are realized.
  • the above device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of units/modules in the above-mentioned embodiments is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated unit/module can be realized in the form of hardware or software program module.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • the artificial intelligence processor may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory RRAM, dynamic random access memory DRAM, static random access memory SRAM, enhanced dynamic random access memory EDRAM, high-bandwidth memory HBM, hybrid memory cube HMC, and so on.
  • If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • The technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.
  • A method for task migration includes:
  • when it is detected that a migratable task meets the preset migration condition, a target node that matches the migratable task is determined in each node according to the task attribute of the migratable task, and the task attribute includes the target number of arithmetic units required to execute the migratable task;
  • Migrating the migratable task to the target node so as to execute the migratable task through the target node.
  • The method according to clause A1, wherein the determining the target node matching the migratable task in each node according to the task attribute of the migratable task includes:
  • if there is a candidate node containing the target number of idle arithmetic units among the nodes, the candidate node with the smallest distance from the node to which the arithmetic unit expected by the migratable task belongs is determined as the target node;
  • if there is no candidate node containing the target number of idle arithmetic units among the nodes, the node containing the first arithmetic unit with the smallest total number of tasks to be executed is determined as the target node, and the first arithmetic unit is an arithmetic unit whose executed task has the same task attribute as the migratable task.
  • Clause A4. When it is detected that the migratable task meets the preset migration condition, before the target node that matches the migratable task is determined in each node according to the task attribute of the migratable task, the method further includes:
  • if the maximum difference between the numbers of tasks expected to be executed in the arithmetic units is greater than or equal to the second preset number threshold, executing the step of determining, in each node, the target node that matches the migratable task according to the task attribute of the migratable task when it is detected that the migratable task meets the preset migration condition.
  • Clause A5. The method according to clause A1, the method further comprising:
  • acquiring the target task to be executed, and determining the task type of the target task, the task execution duration, the minimum cross-node memory access delay of the node to which the third arithmetic unit expected by the target task belongs, and the number of tasks expected to be executed in the third arithmetic unit;
  • if the task type is computation-intensive, and/or the task execution duration is greater than the minimum cross-node memory access delay, and/or the number of tasks expected to be executed in the third arithmetic unit is greater than or equal to the third preset number threshold, determining that the target task is a migratable task, and modifying the affinity mask of the target task according to the preset affinity mask modification rule.
  • Clause A6 The method according to clause A1, before the migrating the migratable task to the target node, the method further includes:
  • the position corresponding to the target node is set to 1, and the position corresponding to the node where the arithmetic unit expected by the migratable task is located is set to 0.
  • Clause A7 The method according to clause A1, before the migrating the migratable task to the target node, the method further includes:
  • the step of migrating the migratable task to the target node is performed.
  • a device for task migration which includes:
  • the first determining module is configured to determine a target node matching the migratable task in each node according to the task attribute of the migratable task when it is detected that the migratable task meets the preset migration condition, and the task The attribute includes the target number of arithmetic units required to execute the migratable task;
  • the migration module is configured to migrate the migratable task to the target node, so as to execute the migratable task through the target node.
  • A computer device, including a memory and a processor, the memory storing a computer program that can run on the processor, and the processor implementing the steps of the method of any one of clauses A1 to A7 when executing the computer program.
  • Clause A10 A computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the method described in any one of clauses A1 to A7.
  • when the preset processing condition is met, the first node matching the target job is determined in each node according to the job attributes of the target job contained in the target task, and the job attributes include the target number of arithmetic units required to execute the target job;
  • the target job included in the target task is executed through the first node and the second node where the arithmetic unit that executes the target task is located.
  • Clause B2 The method according to clause B1, wherein the determining a first node matching the target job in each node according to the job attribute of the target job contained in the target task includes:
  • for each of the nodes, if the number of idle arithmetic units in the node is greater than or equal to the target number, determining the node as the first node.
  • Clause B3. The method according to clause B1, the method further comprising:
  • acquiring the idle duration of the idle arithmetic units in each node;
  • if there is a task containing multiple jobs waiting to be executed in the splittable task list, and among the idle durations of the idle arithmetic units there is an idle duration greater than or equal to the preset duration threshold, determining that the preset processing condition is met.
  • Clause B4. The method according to clause B1, wherein, before the first node matching the target job is determined in each node according to the job attributes of the target job included in the target task when the preset processing conditions are met, the method further includes:
  • if the ratio of the product of the dimension information of the target task to the target number is greater than 1, the target task is added to the splittable task list;
  • Clause B5. The method according to clause B1, before executing the target job included in the target task through the first node and the second node where the computing unit that executes the target task is located, the method further includes :
  • in the usage mask of the target task, the position corresponding to the first node is set to 1.
  • Clause B6 The method according to clause B1, before executing the target job included in the target task through the first node and the second node where the computing unit that executes the target task is located, the method further includes :
  • if the bits corresponding to the first node and the second node are both 1 in the affinity mask and the usage mask of the target job, executing the step of executing the target job included in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.
  • a device for job processing comprising:
  • the first determining module is used to determine the first node that matches the target job in each node according to the job attributes of the target job contained in the target task when the preset processing conditions are met, and the job attributes include the target number of arithmetic units required to execute the target job;
  • the execution module is configured to execute the target job included in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.
  • Clause B9. A computer device, including a memory and a processor, the memory storing a computer program that can run on the processor, and the processor implementing the steps of the method of any one of clauses B1 to B7 when executing the computer program.
  • Clause B10 A computer-readable storage medium with a computer program stored thereon, which, when executed by a processor, implements the steps of the method described in any one of clauses B1 to B7.
  • The terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements not only includes those elements, but also includes other elements that are not explicitly listed, or also includes elements inherent to such a process, method, article, or device. Without further restrictions, an element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present application relates to a task migration method and apparatus, and a computer device and a readable storage medium. The method comprises: when it is detected that a migratable task satisfies a preset migration condition, determining a target node matching the migratable task from various nodes according to a task attribute of the migratable task, the task attribute comprising a target number of operation units required for executing the migratable task; and migrating the migratable task to the target node, so that the target node executes the migratable task. According to the present application, the waiting duration of the migratable task can be reduced, and the execution efficiency of the migratable task is improved. The present application relates to an operation processing method and apparatus, and a computer device and a readable storage medium.

Description

Method, device, computer equipment and readable storage medium for task migration
Related application
This application claims priority to the Chinese patent applications filed on January 7, 2020 with application number 202010012242.8, entitled "Task migration method, device, computer equipment and readable storage medium", and application number 202010012302.6, entitled "Job processing method, device, computer equipment and readable storage medium", both of which are hereby incorporated by reference in their entirety.
Technical field
This application relates to the field of computer technology, and in particular to a method, device, computer equipment, and readable storage medium for task migration, and a method, device, computer equipment, and readable storage medium for job processing.
Background technique
Currently, the NUMA (Non Uniform Memory Access Architecture) architecture is commonly used in chip designs for artificial intelligence applications. A chip based on the NUMA architecture usually includes a processor with multiple arithmetic units and multiple storage units. The multiple arithmetic units are usually divided into multiple arithmetic unit groups, each arithmetic unit group is allocated at least one storage unit, and an arithmetic unit group and its corresponding storage unit constitute a node. In this way, the reading and writing of data required by the arithmetic units in a node can be realized through the storage units in that node. During the operation of the chip, the task or job to be executed needs to be assigned to a certain node for execution, but task and job processing still has problems at present.
Summary of the invention
Based on this, in order to solve the above-mentioned problems, the present application provides a task migration method, device, computer equipment, and readable storage medium.
A method for task migration, the method including:
when it is detected that a migratable task meets a preset migration condition, determining, in each node, a target node matching the migratable task according to the task attribute of the migratable task, the task attribute including the target number of arithmetic units required to execute the migratable task;
migrating the migratable task to the target node, so as to execute the migratable task through the target node.
A device for task migration, the device including:
a first determining module, configured to determine, in each node, a target node matching the migratable task according to the task attribute of the migratable task when it is detected that the migratable task meets the preset migration condition, the task attribute including the target number of arithmetic units required to execute the migratable task;
a migration module, configured to migrate the migratable task to the target node, so as to execute the migratable task through the target node.
A computer device, including a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor implements the steps of any one of the above methods when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any one of the above methods.
This application provides a method, device, computer equipment, and readable storage medium for task migration. When the CPU detects that a migratable task meets the preset migration condition, it determines, in each node, the target node matching the migratable task according to the task attribute of the migratable task, where the task attribute includes the target number of arithmetic units required to execute the migratable task. Then, the CPU migrates the migratable task to the target node so that the migratable task is executed through the target node. In this way, when the arithmetic unit expected by the migratable task cannot execute the migratable task, or the migratable task would have to wait a long time before being executed by the expected arithmetic unit, the CPU can migrate the migratable task to the target node, thereby reducing the waiting time of the migratable task and improving its execution efficiency.
This application also provides a method, device, computer equipment, and readable storage medium for job processing.
A method for job processing, the method including:
when a preset processing condition is met, determining, in each node, a first node matching the target job according to job attributes of a target job contained in a target task, the job attributes including the target number of arithmetic units required to execute the target job;
executing the target job contained in the target task through the first node and a second node where the arithmetic unit executing the target task is located.
As an optional implementation manner, the determining, in each node, the first node matching the target job according to the job attributes of the target job contained in the target task includes:
for each of the nodes, if the number of idle arithmetic units in the node is greater than or equal to the target number, determining the node as a first node.
As an optional implementation manner, the method further includes:
acquiring the idle duration of the idle arithmetic units in each node;
if there is a task containing multiple jobs waiting to be executed in the splittable task list, and among the idle durations of the idle arithmetic units there is an idle duration greater than or equal to a preset duration threshold, determining that the preset processing condition is met.
As an optional implementation manner, before the determining, in each node, the first node matching the target job according to the job attributes of the target job contained in the target task when the preset processing condition is met, the method further includes:
acquiring the target task to be executed, and determining the dimension information of the target task and the target number of arithmetic units required to execute the target task;
if the ratio of the product of the dimension information to the target number is greater than 1, adding the target task to the splittable task list;
modifying the affinity mask of the target task according to a preset affinity mask modification rule.
As an optional implementation manner, before the executing the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located, the method further includes:
setting, in the usage mask of the target task, the bit corresponding to the first node to 1.
As an optional implementation manner, before the executing the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located, the method further includes:
if, in the affinity mask and usage mask of the target job, the bits corresponding to the first node and the second node are both 1, executing the step of executing the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located.
As an optional implementation manner, the affinity mask of the target job is the same as the affinity mask of the target task, and the usage mask of the target job is the same as the usage mask of the target task.
A device for job processing, the device including:
a first determining module, configured to determine, in each node, a first node matching the target job according to job attributes of a target job contained in a target task when a preset processing condition is met, the job attributes including the target number of arithmetic units required to execute the target job;
an execution module, configured to execute the target job contained in the target task through the first node and a second node where the arithmetic unit executing the target task is located.
A computer device, including a memory and a processor, the memory storing a computer program that can run on the processor, the processor implementing the steps of any one of the methods when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any one of the methods.
The embodiments of the present application provide a method, device, computer equipment, and readable storage medium for job processing. When the preset processing condition is met, the CPU determines, in each node, the first node matching the target job according to the job attributes of the target job contained in the target task, where the job attributes include the target number of arithmetic units required to execute the target job. Then, the CPU executes the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located. In this way, when the target job would have to wait a long time before being executed by the arithmetic unit it is waiting for, the CPU can execute the target job jointly through the first node and the second node, thereby reducing the waiting time of the target job and improving its execution efficiency.
Description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and for those of ordinary skill in the art, other drawings can be obtained from the disclosed drawings without creative work.
Figure 1-1 is a schematic diagram of an intelligent processor provided by an embodiment of the application;
Figure 1-2 is a schematic flowchart of a task migration method provided by an embodiment of the application;
Figure 1-3 is a schematic structural diagram of a task migration device provided by an embodiment of the application;
Figure 1-4 is a schematic structural diagram of a computer device provided by an embodiment of the application;
Figure 2-1 is a schematic flowchart of a method for job splitting and affinity mask modification provided by an embodiment of the application;
Figure 2-2 is a schematic flowchart of a job processing method provided by an embodiment of the application;
Figure 2-3 is a schematic flowchart of a method for determining processing conditions provided by an embodiment of the application;
Figure 2-4 is a schematic structural diagram of a job processing device provided by an embodiment of the application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, specification, and drawings of the present disclosure are used to distinguish different objects, rather than to describe a specific order. The terms "include" and "comprise" used in the specification and claims of this disclosure indicate the existence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this disclosure are only for the purpose of describing specific embodiments, and are not intended to limit the disclosure. As used in this disclosure and the claims, unless the context clearly indicates otherwise, the singular forms "a", "an", and "the" are intended to include the plural forms. It should be further understood that the term "and/or" used in this disclosure and the claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes these combinations.
As used in this specification and the claims, the term "if" can be interpreted as "when" or "once" or "in response to determining" or "in response to detecting" depending on the context. Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" can be interpreted as meaning "once determined" or "in response to determining" or "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]" depending on the context.
During the operation of the chip, the task to be executed needs to be allocated to a certain node for execution. The specific allocation process is as follows: first, the memory size required to execute the task is determined, and then, according to the storage units corresponding to the nodes, a target node whose remaining memory space meets the memory size is determined. For example, the node with the largest remaining memory space may be used as the target node, or one node may be randomly selected as the target node from among the nodes whose remaining memory space is greater than the memory size. Then, based on the affinity binding principle, the task is allocated to the target node for execution.
However, in the above allocation process, because of the affinity binding principle, a task often has to wait for a long time, which seriously affects the execution efficiency of the task.
In order to solve the above technical problem, an embodiment of the present application provides a task migration method. The method can be applied to a chip, and the chip may include at least one processor. Optionally, the chip may be a chip with heterogeneous multi-processors, and may include an intelligent processor adopting the NUMA architecture and a general-purpose processor. The general-purpose processor may be a CPU (central processing unit), and the intelligent processor may be an accelerator, an IPU (Intelligent Processing Unit), a GPU (Graphics Processing Unit), or another type of intelligent processor, which is not limited in the embodiments of the present application. Specifically, the method can be applied to a chip in which the CPU executes the above task migration method to schedule multiple tasks to the intelligent processor for processing. Of course, in other embodiments, the intelligent processor of the chip can also execute the above task migration method. For the specific execution process of the task migration method in the embodiments of the present application, refer to the following description.
Optionally, the intelligent processor of the NUMA architecture further includes a processor with multiple arithmetic units and multiple storage units. The multiple arithmetic units are usually divided into multiple arithmetic unit groups, each arithmetic unit group is allocated at least one storage unit, and an arithmetic unit group and its corresponding storage unit constitute a node. The reading and writing of data required by the arithmetic units in a node can be realized through the storage units in the node, and data is read and written between different nodes through a communication interface.
Figure 1-1 is a schematic diagram of an intelligent processor with a NUMA architecture provided by an embodiment of the application. As shown in Figure 1-1, the intelligent processor contains 16 arithmetic units and 4 storage units, and is divided into 4 nodes, each node containing 4 arithmetic units and 1 storage unit. Figure 1-1 only provides a schematic diagram of an intelligent processor. In other possible implementations, each node may also contain more than four arithmetic units and one storage unit, and the storage unit may include multiple sub-storage units. For example, each node may include four sub-nodes, that is, each node may include 16 arithmetic units. Each sub-node contains four arithmetic units and one sub-storage unit, and the four sub-nodes may be arranged in the same way as the four nodes. Further, the above task allocation method can be executed among the sub-nodes of a single node; for the execution process, refer to the description of the task allocation method below.
After a task is scheduled to the software queue, the processor can allocate, according to the number of arithmetic units required to execute the task, the arithmetic units expected by the task in the node to which the storage unit storing the task data of the task belongs, and add 1 to the waiting reference count (that is, clu_wait_ref) of each arithmetic unit expected by the task. For example, as shown in Figure 1-1, if the number of arithmetic units required by the task is 2 and the storage unit storing the task data of the task is storage unit 1, the processor can determine arithmetic unit 1 and arithmetic unit 2 in node 1 as the arithmetic units expected by the task, and add 1 to the waiting reference counts of arithmetic unit 1 and arithmetic unit 2.
After the processor determines the arithmetic units that execute the task, the task is scheduled to the hardware queue, and the real reference count (that is, clu_real_ref) of each arithmetic unit that executes the task is incremented by 1. For example, as shown in Figure 1-1, after the processor determines that the arithmetic units executing the task are arithmetic unit 1 and arithmetic unit 2, the processor can add 1 to the real reference counts of arithmetic unit 1 and arithmetic unit 2.
When the execution of the task is completed, the waiting reference count of each arithmetic unit expected by the task is decreased by 1, and at the same time, the real reference count of each arithmetic unit that executes the task is decreased by 1. For example, after arithmetic unit 1 and arithmetic unit 2 have executed the task, the processor can decrease the waiting reference counts and real reference counts of arithmetic unit 1 and arithmetic unit 2 by 1. If the arithmetic units expected by the task are migrated, the waiting reference count of each source arithmetic unit expected by the task is decreased by 1, and the waiting reference count of each destination arithmetic unit expected by the task is increased by 1. For example, as shown in Figure 1-1, if the arithmetic units expected by the task are migrated from arithmetic unit 1 and arithmetic unit 2 to arithmetic unit 3 and arithmetic unit 4, the processor decreases the waiting reference counts of arithmetic unit 1 and arithmetic unit 2 by 1, and increases the waiting reference counts of arithmetic unit 3 and arithmetic unit 4 by 1.
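The following C++ sketch illustrates, under assumed names and data structures, the reference-count bookkeeping described above (clu_wait_ref and clu_real_ref); it is a simplified illustration rather than the disclosed implementation:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-arithmetic-unit bookkeeping mirroring the description.
struct ArithmeticUnit {
    uint32_t clu_wait_ref = 0;  // tasks that expect to run on this unit
    uint32_t clu_real_ref = 0;  // tasks actually queued to run on this unit
};

// Task enters the software queue: its expected units gain a waiting reference.
void onScheduledToSoftwareQueue(const std::vector<ArithmeticUnit*>& expectedUnits) {
    for (auto* u : expectedUnits) ++u->clu_wait_ref;
}

// Executing units are chosen and the task enters the hardware queue.
void onScheduledToHardwareQueue(const std::vector<ArithmeticUnit*>& executingUnits) {
    for (auto* u : executingUnits) ++u->clu_real_ref;
}

// Task finishes: release both kinds of references.
void onTaskCompleted(const std::vector<ArithmeticUnit*>& expectedUnits,
                     const std::vector<ArithmeticUnit*>& executingUnits) {
    for (auto* u : expectedUnits)  --u->clu_wait_ref;
    for (auto* u : executingUnits) --u->clu_real_ref;
}

// Expected units are migrated: move the waiting references from the source
// units to the destination units.
void onExpectedUnitsMigrated(const std::vector<ArithmeticUnit*>& sourceUnits,
                             const std::vector<ArithmeticUnit*>& destUnits) {
    for (auto* u : sourceUnits) --u->clu_wait_ref;
    for (auto* u : destUnits)   ++u->clu_wait_ref;
}
```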
To facilitate understanding, a method for determining a migratable task provided by this application is first introduced. The specific processing process is as follows.
Step 1: Obtain the target task to be executed, and determine the task type of the target task, the task execution duration, the minimum cross-node memory access delay of the node to which the third arithmetic unit expected by the target task belongs, and the number of tasks expected to be executed in the third arithmetic unit.
In implementation, after a task is scheduled to the software queue, the processor needs to determine whether the task (that is, the target task) is a migratable task. Correspondingly, the processor can obtain the task type of the target task, the task execution duration, the minimum cross-node memory access delay of the node to which the third arithmetic unit expected by the target task belongs, and the number of tasks expected to be executed in the third arithmetic unit (that is, the waiting reference count of the third arithmetic unit), and so on. The task types can include memory-access-intensive (that is, tasks with many I/O (Input/Output) instructions that need to frequently read and write data in the storage unit during execution) and computation-intensive (that is, tasks with many computation instructions that occupy a large amount of computing resources during execution), and can also include other task types, which are not limited in the embodiments of the present application. Then, the processor can determine whether the task type of the target task is computation-intensive, whether the task execution duration of the target task is greater than the minimum cross-node memory access delay, and whether the waiting reference count of the third arithmetic unit is greater than or equal to a third preset number threshold. The third preset number threshold can be set by a technician based on experience.
Step 2: If the task type is computation-intensive, and/or the task execution duration is greater than the minimum cross-node memory access delay, and/or the number of tasks expected to be executed in the third arithmetic unit is greater than or equal to the third preset number threshold, determine that the target task is a migratable task, and modify the affinity mask of the target task according to the preset affinity mask modification rule.
In implementation, if the task type of the target task is computation-intensive, and/or the task execution duration of the target task is greater than the minimum cross-node memory access delay, and/or the waiting reference count of the third arithmetic unit is greater than or equal to the third preset number threshold, it means that migrating the target task will not affect its execution efficiency, and that the third arithmetic unit expected by the target task is relatively busy, which may affect the execution efficiency of the target task. Therefore, the processor can determine that the target task is a migratable task. Then, the processor can modify the affinity mask of the target task according to the preset affinity mask modification rule. The affinity mask of the target task is used to indicate which of the nodes can execute the target task. The affinity mask includes as many bits as the total number of nodes contained in the intelligent processor, and each bit uniquely corresponds to one node: if a bit is 1, the node corresponding to the bit can execute the target task; if a bit is 0, the node corresponding to the bit cannot execute the target task. The affinity mask modification rule can be set by a technician according to the migration range of migratable tasks.
For example, if the affinity mask modification rule is that a migratable task can be migrated to all nodes, and the original affinity mask of the target task is 0001, then if the target task is a migratable task, the processor can modify the affinity mask of the target task to 1111 according to the affinity mask modification rule. As another example, if the affinity mask modification rule is that a migratable task can be migrated to node 3 and node 4, and the original affinity mask of the target task is 0001, then if the target task is a migratable task, the processor can modify the affinity mask of the target task to 1101 according to the affinity mask modification rule.
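A minimal C++ sketch of the migratable-task check and affinity-mask modification described above is given below; it assumes the modification rule can be expressed as a bitwise OR with a preset mask covering the allowed migration range, and all field and function names are hypothetical:

```cpp
#include <cstdint>

enum class TaskType { MemoryIntensive, ComputeIntensive, Other };

// Hypothetical view of the fields the check needs.
struct TargetTask {
    TaskType type;
    uint64_t execDurationNs;         // estimated task execution duration
    uint64_t minCrossNodeLatencyNs;  // min cross-node access delay of the node
                                     // owning the expected (third) unit
    uint32_t expectedUnitWaitRef;    // clu_wait_ref of the expected unit
    uint32_t affinityMask;           // one bit per node, bit 0 = node 1
};

// Steps 1 and 2: decide whether the task is migratable and, if so, widen its
// affinity mask according to a preset modification rule.
bool markMigratableIfEligible(TargetTask& task,
                              uint32_t thirdPresetThreshold,
                              uint32_t migrationRangeMask /* preset rule */) {
    bool eligible = task.type == TaskType::ComputeIntensive
                 || task.execDurationNs > task.minCrossNodeLatencyNs
                 || task.expectedUnitWaitRef >= thirdPresetThreshold;
    if (eligible) {
        // e.g. 0b0001 -> 0b1111 (all nodes) or 0b0001 -> 0b1101 (nodes 3, 4 added)
        task.affinityMask |= migrationRangeMask;
    }
    return eligible;
}
```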
Next, to facilitate understanding, a method for judging the migration condition provided by this application is introduced. The specific processing process is as follows.
Step 1: If the task attribute of the tasks executed in the second arithmetic unit expected by the migratable task is different from the task attribute of the migratable task, determine that the migratable task meets the preset migration condition.
In implementation, after an arithmetic unit is assigned to execute a task, the arithmetic unit can only execute tasks whose task attribute is the same as that of the task, where the task attribute is the number of arithmetic units required to execute the task. Based on this principle, after the processor determines that a task is a migratable task, the processor can obtain the task attribute of the migratable task and the task attribute of the tasks executed in the second arithmetic unit expected by the migratable task. Then, the processor can determine whether the task attribute of the tasks executed in the second arithmetic unit is the same as the task attribute of the migratable task. If they are different, it means that the second arithmetic unit cannot execute the migratable task, and the processor can determine that the migratable task meets the preset migration condition. In this way, the processor can subsequently migrate the migratable task to another node that can execute it. If the task attributes are the same, the processor executes Step 2.
Step 2: If the task attribute of the tasks executed in the second arithmetic unit is the same as the task attribute of the migratable task, determine whether the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold.
In implementation, if the task attribute of the tasks executed in the second arithmetic unit is the same as the task attribute of the migratable task, it means that the second arithmetic unit can execute the migratable task. Then, the processor can further determine whether the total number of tasks to be executed in the second arithmetic unit (that is, the real reference count of the second arithmetic unit) is greater than or equal to the first preset number threshold, where the first preset number threshold can be set by a technician based on experience. If the total number of tasks to be executed in the second arithmetic unit is less than the first preset number threshold, it means that the migratable task can be executed by the second arithmetic unit without waiting a long time, and the processor can determine that the migratable task does not meet the preset migration condition. If the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold, the processor executes Step 3.
Step 3: If the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold, determine that the migratable task meets the preset migration condition.
In implementation, if the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold, it means that the migratable task would have to wait a long time before being executed by the second arithmetic unit. Correspondingly, the processor can determine that the migratable task meets the preset migration condition. In this way, the processor migrates the migratable task to another node, thereby reducing the waiting time of the migratable task and improving its execution efficiency. If the total number of tasks to be executed in the second arithmetic unit is less than the first preset number threshold, it means that the migratable task can be executed by the second arithmetic unit without waiting a long time. Correspondingly, the processor can determine that the migratable task does not meet the preset migration condition.
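The three-step migration-condition check above can be summarized roughly as follows; the structure and names are assumptions, not the disclosed implementation:

```cpp
#include <cstdint>

// Hypothetical state of the second (expected) arithmetic unit.
struct ExpectedUnitState {
    uint32_t runningTaskAttribute;  // units required by the tasks it runs
    uint32_t pendingTaskCount;      // total tasks to be executed (clu_real_ref)
};

bool meetsMigrationCondition(uint32_t migratableTaskAttribute,
                             const ExpectedUnitState& secondUnit,
                             uint32_t firstPresetThreshold) {
    // Step 1: attribute mismatch means the expected unit cannot run the task.
    if (secondUnit.runningTaskAttribute != migratableTaskAttribute)
        return true;
    // Steps 2 and 3: the unit could run it, but its backlog is too long.
    return secondUnit.pendingTaskCount >= firstPresetThreshold;
}
```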
The task migration method provided by this application is described in detail below with reference to specific embodiments. As shown in Figure 1-2, the specific steps are as follows.
Step 201: When it is detected that a migratable task meets the preset migration condition, determine, in each node, a target node matching the migratable task according to the task attribute of the migratable task, where the task attribute includes the target number of arithmetic units required to execute the migratable task.
In implementation, after a task is scheduled to the software queue in the chip, the processor can determine whether the task is a migratable task. If the task is a migratable task, the processor can further detect whether the migratable task meets the preset migration condition. When the processor detects that the migratable task meets the preset migration condition, it can determine, in each node, the target node matching the migratable task according to the task attribute of the migratable task, where the task attribute includes the target number of arithmetic units required to execute the migratable task. Optionally, the target number of arithmetic units required by the target task can be represented by a task identifier, which may indicate a Block task or a Union task, and is not specifically limited here. When the task identifier indicates a Union task, the system can determine the target number of arithmetic units according to the value of Union. For example, when Union=1, running the target task requires the four arithmetic units in one node; when Union=2, it requires the eight arithmetic units in two nodes; when Union=3, it requires the twelve arithmetic units in three nodes; and when Union=4, it requires the sixteen arithmetic units in four nodes. When the task identifier indicates a Block task, running the target task requires one arithmetic unit.
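Assuming four arithmetic units per node as in Figure 1-1, the mapping from task identifier to required arithmetic units described above might be sketched as follows (the function and parameter names are hypothetical):

```cpp
#include <stdexcept>

// Hypothetical mapping following the Union/Block convention described above,
// assuming 4 arithmetic units per node.
constexpr int kUnitsPerNode = 4;

int requiredUnits(bool isBlockTask, int unionValue /* 1..4, ignored for Block */) {
    if (isBlockTask) return 1;                 // a Block task needs one unit
    if (unionValue < 1 || unionValue > 4)
        throw std::invalid_argument("unsupported Union value");
    return unionValue * kUnitsPerNode;         // Union=N needs N full nodes
}
```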
Optionally, the specific process by which the processor determines, in each node, the target node matching the migratable task according to the task attribute of the migratable task is as follows.
Step 1: If there is a candidate node containing the target number of idle arithmetic units among the nodes, determine, among the candidate nodes, the candidate node with the smallest distance from the node to which the arithmetic unit expected by the migratable task belongs as the target node.
In implementation, when the processor detects that the migratable task meets the preset migration condition, it can first determine whether there is a candidate node containing the target number of idle arithmetic units (that is, arithmetic units whose real reference count equals 0) among the nodes. If there is a candidate node, the processor can determine, among the candidate nodes, the candidate node with the smallest distance from the node to which the arithmetic unit expected by the migratable task belongs as the target node. In this way, the processor can subsequently migrate the migratable task to the target node and execute it through the idle arithmetic units in the target node, thereby reducing the waiting time of the migratable task and improving its execution efficiency.
For example, if the node to which the arithmetic unit expected by the migratable task belongs is node 1, the candidate nodes are node 1 and node 2, and the distances from node 1 to node 1 and node 2 are 0 and 1 respectively, then the target node is node 1. As another example, if the node to which the arithmetic unit expected by the migratable task belongs is node 1, the candidate nodes are node 2 and node 4, and the distances from node 1 to node 2 and node 4 are 1 and 2 respectively, then the target node is node 2.
It should be noted that if there are multiple candidate nodes with the smallest distance, the processor can determine the target node among them in ascending or descending order of node identifiers. Alternatively, when there are multiple candidate nodes with the smallest distance, one of them can be randomly selected as the target node, which is not specifically limited here.
Step 2: If there is no candidate node containing the target number of idle arithmetic units among the nodes, determine the node containing the first arithmetic unit with the smallest total number of tasks to be executed as the target node, where the first arithmetic unit is an arithmetic unit whose executed task has the same task attribute as the migratable task.
In implementation, after an arithmetic unit is assigned to execute a task, the arithmetic unit can only execute tasks whose task attribute is the same as that of the task. Based on this principle, if there is no candidate node containing the target number of idle arithmetic units among the nodes, the processor can further determine, in each node, the first arithmetic units whose executed tasks have the same task attribute as the migratable task. Then, the processor can determine, among the first arithmetic units, the first arithmetic unit with the smallest total number of tasks to be executed (that is, the smallest real reference count), and use the node to which this first arithmetic unit belongs as the target node. In this way, the processor can subsequently migrate the migratable task to the target node and execute it through the target arithmetic units in the target node, thereby reducing the waiting time of the migratable task and improving its execution efficiency. For example, the number of arithmetic units required to execute the migratable task is 3; in node 1, the arithmetic units executing tasks that require 3 arithmetic units are arithmetic units 1 to 3, with a total of 10 tasks to be executed; in node 2, they are arithmetic units 6 to 8, with a total of 15 tasks to be executed; and in node 4, they are arithmetic units 13 to 15, with a total of 5 tasks to be executed. Then the target node is node 4, to which arithmetic units 13 to 15 belong.
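A rough C++ sketch of the two-step target-node selection is shown below; the node and unit structures, the placeholder distance function, and all names are assumptions made only for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <optional>
#include <vector>

struct Unit {
    uint32_t realRef;        // tasks queued on this unit (clu_real_ref)
    uint32_t taskAttribute;  // attribute (unit count) of the tasks it runs
};

struct Node {
    int id;
    std::vector<Unit> units;
};

// Placeholder NUMA distance; a real system would read this from topology info.
int nodeDistance(int fromNodeId, int toNodeId) { return std::abs(fromNodeId - toNodeId); }

std::optional<int> pickTargetNode(const std::vector<Node>& nodes,
                                  int expectedNodeId,
                                  std::size_t targetUnitCount,
                                  uint32_t taskAttribute) {
    // Step 1: prefer a candidate node with enough idle units (realRef == 0),
    // choosing the one closest to the node the task originally expected.
    std::optional<int> best;
    int bestDist = std::numeric_limits<int>::max();
    for (const auto& node : nodes) {
        auto idle = static_cast<std::size_t>(std::count_if(
            node.units.begin(), node.units.end(),
            [](const Unit& u) { return u.realRef == 0; }));
        int dist = nodeDistance(expectedNodeId, node.id);
        if (idle >= targetUnitCount && dist < bestDist) {
            bestDist = dist;
            best = node.id;
        }
    }
    if (best) return best;

    // Step 2: otherwise pick the node owning the least-loaded unit among units
    // already running tasks with the same task attribute.
    uint32_t minPending = std::numeric_limits<uint32_t>::max();
    for (const auto& node : nodes)
        for (const auto& u : node.units)
            if (u.taskAttribute == taskAttribute && u.realRef < minPending) {
                minPending = u.realRef;
                best = node.id;
            }
    return best;  // empty if no suitable unit exists
}
```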
As an optional implementation, since migrating a task affects the execution of other tasks, before the processor, upon detecting that the migratable task meets the preset migration condition, determines the target node matching the migratable task among the nodes according to the task attributes of the migratable task, it may first check whether the arithmetic units in the intelligent processor are unevenly loaded. The specific process is as follows.

Step 1: obtain the number of tasks expected to be executed by each arithmetic unit.

In implementation, the processor may obtain the number of tasks expected to be executed by each arithmetic unit (that is, the waiting reference count of each unit). The processor may then determine the maximum and minimum waiting reference counts among all units, compute their difference (the maximum difference), and check whether this difference is greater than or equal to a second preset number threshold, which can be set empirically by a technician. If the maximum difference is smaller than the second preset number threshold, the arithmetic units in the intelligent processor are not unevenly loaded and no task migration is needed. If the maximum difference is greater than or equal to the second preset number threshold, the arithmetic units are unevenly loaded and the processor performs step 2.

Step 2: if the maximum difference between the numbers of tasks expected to be executed by the arithmetic units is greater than or equal to the second preset number threshold, then, when it is detected that the migratable task meets the preset migration condition, the target node matching the migratable task is determined among the nodes according to the task attributes of the migratable task.

In implementation, if the maximum difference is greater than or equal to the second preset number threshold, the arithmetic units in the intelligent processor are unevenly loaded. When the processor detects that the migratable task meets the preset migration condition, it determines the target node matching the migratable task among the nodes according to the task attributes of the migratable task. This determination process is similar to step 201 and is not repeated here.
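A hedged sketch of the imbalance check in steps 1 and 2 might look as follows; the field names and the example threshold are assumptions of this illustration.

```python
# Illustrative check for load imbalance based on waiting reference counts;
# the threshold value and field names are assumptions for this sketch.
def is_unbalanced(units, second_threshold):
    waiting = [u.waiting_ref_count for u in units]
    return max(waiting) - min(waiting) >= second_threshold

# Migration is only considered when the imbalance check passes, e.g.:
# if is_unbalanced(all_units, second_threshold=4):
#     target = select_target_node(nodes, task, distance)
```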
Step 202: migrate the migratable task to the target node, so as to execute the migratable task through the target node.

In implementation, after determining the target node, the processor may migrate the migratable task to the target node so that the migratable task is executed through the target node.
As an optional implementation, before migrating the migratable task to the target node, the processor may also modify the usage mask of the migratable task. Specifically, if the target node differs from the node where the arithmetic unit expected by the migratable task is located, the bit corresponding to the target node in the usage mask of the migratable task is set to 1, and the bit corresponding to the node where the expected arithmetic unit is located is set to 0.

In implementation, the usage mask (usage_mask) of a migratable task indicates which nodes are determined to execute the task. The usage mask contains one bit for each node of the chip, and each bit uniquely corresponds to a node: a bit of 1 means the corresponding node is determined to execute the migratable task, and a bit of 0 means it does not. After determining the target node of the migratable task, the processor may check whether the target node is the same as the node where the arithmetic unit expected by the migratable task is located. If they are the same, the usage mask need not be modified. If they differ, the processor may set the bit corresponding to the target node in the usage mask to 1 and the bit corresponding to the node of the expected arithmetic unit to 0. For example, if the arithmetic unit expected by the migratable task is located in node 1, the original usage mask of the task is 0001; assuming the target node is node 2, the modified usage mask is 0010.
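The usage-mask update can be illustrated with a small sketch; the bit layout (node 1 at the least significant bit) is an assumption carried over from the example above.

```python
# Sketch of the usage-mask update; bit i of the mask stands for node i,
# with node 1 at the least significant bit (an assumption of this example).
def update_usage_mask(usage_mask, target_node_id, expected_node_id):
    if target_node_id == expected_node_id:
        return usage_mask                         # no change needed
    usage_mask |= 1 << (target_node_id - 1)       # set the target node's bit
    usage_mask &= ~(1 << (expected_node_id - 1))  # clear the expected node's bit
    return usage_mask

# Example from the text: original mask 0b0001 (node 1), target node 2 -> 0b0010.
assert update_usage_mask(0b0001, target_node_id=2, expected_node_id=1) == 0b0010
```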
As an optional implementation, before migrating the migratable task to the target node, the processor may also use the affinity mask and the usage mask of the migratable task to determine whether the task can be migrated to the target node. Specifically, if the bits corresponding to the target node in both the affinity mask and the usage mask of the migratable task are 1, the migratable task is migrated to the target node.

In implementation, before migrating the migratable task to the target node, the processor may check whether the bits corresponding to the target node in both the affinity mask and the usage mask of the migratable task are 1. If both bits are 1, the migratable task can be migrated to the target node, and the processor migrates it accordingly. If the bit corresponding to the target node in the affinity mask is 0, the migratable task cannot be migrated to the target node.
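For completeness, the pre-migration check described above might be sketched as follows, again assuming the same bit layout.

```python
# Sketch of the pre-migration check: the task may move only when the target
# node's bit is set in both masks. Bit numbering follows the previous sketch.
def may_migrate(affinity_mask, usage_mask, target_node_id):
    bit = 1 << (target_node_id - 1)
    return bool(affinity_mask & bit) and bool(usage_mask & bit)
```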
In the task migration method provided by the embodiments of the present application, when the processor detects that a migratable task meets the preset migration condition, it determines the target node matching the migratable task among the nodes according to the task attributes of the migratable task, where the task attributes include the target number of arithmetic units required to execute the migratable task. The processor then migrates the migratable task to the target node so that it is executed through the target node. In this way, when the arithmetic unit expected by the migratable task cannot execute it, or the task would have to wait a long time before being executed by that unit, the processor can migrate the task to the target node, thereby reducing the waiting time and improving the execution efficiency of the migratable task.
An embodiment of the present application further provides a task migration apparatus. As shown in Figure 1-3, the apparatus includes:

a first determining module 310, configured to determine, when it is detected that a migratable task meets a preset migration condition, a target node matching the migratable task among the nodes according to the task attributes of the migratable task, the task attributes including the target number of arithmetic units required to execute the migratable task;

a migration module 320, configured to migrate the migratable task to the target node, so as to execute the migratable task through the target node.
As an optional implementation, the first determining module 310 is specifically configured to:

if any node is a candidate node containing the target number of idle arithmetic units, determine, among the candidate nodes, the candidate node with the smallest distance to the node to which the arithmetic unit expected by the migratable task belongs as the target node;

if none of the nodes is a candidate node containing the target number of idle arithmetic units, determine the node containing the first arithmetic unit with the smallest total number of tasks to be executed as the target node, where a first arithmetic unit is an arithmetic unit whose executed task has the same task attributes as the migratable task.
As an optional implementation, the apparatus further includes:

a second determining module, configured to determine that the migratable task meets the preset migration condition if the task attributes of the task executed in the second arithmetic unit expected by the migratable task differ from the task attributes of the migratable task;

a judging module, configured to judge, if the task attributes of the task executed in the second arithmetic unit are the same as those of the migratable task, whether the total number of tasks to be executed in the second arithmetic unit is greater than or equal to a first preset number threshold;

a third determining module, configured to determine that the migratable task meets the preset migration condition if the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold.
As an optional implementation, the apparatus further includes:

an obtaining module, configured to obtain the number of tasks expected to be executed by each arithmetic unit;

a fourth determining module, configured to trigger the first determining module 310 to perform the step of determining, when it is detected that the migratable task meets the preset migration condition, the target node matching the migratable task among the nodes according to the task attributes of the migratable task, if the maximum difference between the numbers of tasks expected to be executed by the arithmetic units is greater than or equal to the second preset number threshold.
As an optional implementation, the apparatus further includes:

a fifth determining module, configured to obtain a target task to be executed, and determine the task type of the target task, the task execution duration, the minimum cross-node memory access latency of the node to which the third arithmetic unit expected by the target task belongs, and the number of tasks expected to be executed in the third arithmetic unit;

a modification module, configured to determine that the target task is a migratable task and modify the affinity mask of the target task according to a preset affinity mask modification rule, if the task type is compute-intensive, and/or the task execution duration is greater than the minimum cross-node memory access latency, and/or the number of tasks expected to be executed in the third arithmetic unit is greater than or equal to a third preset number threshold.
As an optional implementation, the apparatus further includes:

a setting module, configured to set, in the usage mask of the migratable task, the bit corresponding to the target node to 1 and the bit corresponding to the node where the arithmetic unit expected by the migratable task is located to 0, if the target node differs from the node where the expected arithmetic unit is located.

As an optional implementation, the apparatus further includes:

a sixth determining module, configured to trigger the migration module 320 to perform the step of migrating the migratable task to the target node if the bits corresponding to the target node in both the affinity mask and the usage mask of the migratable task are 1.
In the task migration apparatus provided by the embodiments of the present application, when the CPU detects that a migratable task meets the preset migration condition, it determines the target node matching the migratable task among the nodes according to the task attributes of the migratable task, where the task attributes include the target number of arithmetic units required to execute the migratable task. The CPU then migrates the migratable task to the target node so that it is executed through the target node. In this way, when the arithmetic unit expected by the migratable task cannot execute it, or the task would have to wait a long time before being executed by that unit, the CPU can migrate the task to the target node, thereby reducing the waiting time and improving the execution efficiency of the migratable task.
In one embodiment, as shown in Figure 1-4, the present application further provides a computer device including a memory and a processor. The memory stores a computer program executable on the processor, and when the processor executes the computer program, the steps of the task migration method described above are implemented. For the implementation process of the task migration method executed by the processor, reference may be made to Figure 1-2 and the description above, which will not be repeated here.

In one embodiment, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the task migration method described above.
In an embodiment of the present application, a task may be split into at least one subtask (hereinafter referred to as a job), in which case different jobs may likewise be allocated to a node for execution. Because of the affinity binding principle in the allocation process described above, a job often has to wait for a long time, which seriously affects its execution efficiency.

On this basis, an embodiment of the present application further provides a job processing method, which can be applied to a chip containing an intelligent processor based on the NUMA architecture and a general-purpose processor. The general-purpose processor may be a CPU (central processing unit) or the like. The intelligent processor based on the NUMA architecture may be an accelerated processor, an IPU (Intelligent Processing Unit), a GPU (Graphics Processing Unit), or another type of processor; the embodiments of the present application impose no limitation. Specifically, the method can be applied to the above chip, and the general-purpose processor (CPU) in the chip can execute the job processing method to distribute multiple jobs to at least one arithmetic unit in the intelligent processor for execution. The specific execution process of the job processing method of the present application is described below.

Optionally, the intelligent processor based on the NUMA architecture includes a processor with multiple arithmetic units and multiple storage units. The arithmetic units are usually divided into multiple groups, each group is allocated at least one storage unit, and a group of arithmetic units together with its corresponding storage unit constitutes a node. The reading and writing of data required by the arithmetic units in a node can be carried out through the storage unit of that node, while data is read and written between different nodes through a communication interface. Figure 1-1 is a schematic diagram of an intelligent processor based on the NUMA architecture provided by an embodiment of the application. As shown in Figure 1-1, the intelligent processor contains 16 arithmetic units and 4 storage units and is divided into 4 nodes, each containing 4 arithmetic units and 1 storage unit. Figure 1-1 is only a schematic illustration; in other possible implementations, each node may contain more than four arithmetic units and one storage unit, and the storage unit may include multiple sub-storage units. For example, each node may include four sub-nodes, that is, each node may include 16 arithmetic units, with each sub-node containing four arithmetic units and one sub-storage unit, the four sub-nodes being arranged in the same manner as the four nodes. Further, the job processing method described above may also be executed among the sub-nodes of a single node; its execution process is described below in relation to the job processing method.
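As an aid to the following description only, a minimal Python data model of the layout of Figure 1-1 could look as follows; the class names, field names, and per-unit reference counts are assumptions of this sketch rather than types defined by the application.

```python
# A minimal data model of the NUMA layout in Figure 1-1, assuming 4 nodes of
# 4 arithmetic units each; names and counts are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ArithmeticUnit:
    unit_id: int
    true_ref_count: int = 0      # tasks currently bound to this unit for execution
    waiting_ref_count: int = 0   # tasks expected to run on this unit

@dataclass
class Node:
    node_id: int
    units: List[ArithmeticUnit] = field(default_factory=list)

nodes = [Node(n + 1, [ArithmeticUnit(n * 4 + i + 1) for i in range(4)]) for n in range(4)]
```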
The embodiments of the present application first describe the division of jobs and the modification of the affinity mask. As shown in Figure 2-1, the specific process is as follows:

Step 201: obtain the target task to be executed, and determine the dimension information of the target task and the target number of arithmetic units required to execute it.

In implementation, when a task (the target task) has been scheduled to the software queue, the processor may determine the dimension information of the target task (that is, dimX, dimY, and dimZ) and the target number of arithmetic units required to execute it (that is, kernel_class). The processor may then compute the ratio of the product of the dimensions (dimX*dimY*dimZ) to the target number and check whether this ratio is greater than 1. If the ratio is greater than 1, the target task can be split into multiple jobs, and the processor performs step 202. If the ratio is less than or equal to 1, the target task cannot be split into multiple jobs.

Step 202: if the ratio of the product of the dimension information to the target number is greater than 1, add the target task to the splittable task list.

In implementation, if the ratio is greater than 1, the target task can be split into multiple jobs, and the processor may accordingly add it to the splittable task list. The splittable task list stores tasks that can be split into multiple jobs; it may be a linked list or another type of list, and the embodiments of the present application impose no limitation. In addition, when all jobs contained in a task in the splittable task list have been executed, the processor may delete that task from the list.
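The splittability test of steps 201 and 202 can be sketched as follows; the task object and its field names (dimX, dimY, dimZ, kernel_class) follow the terms used above but are otherwise hypothetical.

```python
# Sketch of the splittability test in steps 201-202: a task is splittable when
# dimX*dimY*dimZ exceeds kernel_class. The task object is a hypothetical stand-in.
def maybe_mark_splittable(task, splittable_tasks):
    ratio = (task.dimX * task.dimY * task.dimZ) / task.kernel_class
    if ratio > 1:
        splittable_tasks.append(task)   # the list may be a linked list in practice
        return True
    return False
```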
Optionally, when it is determined that the target task can be split into multiple jobs, the target task may be sent to a scheduler, which may split it into multiple jobs according to task attributes such as its dimension information and the target number of arithmetic units it requires. Further optionally, the scheduler may be a hardware scheduler placed on the chip, and the hardware scheduler may include multiple circuit modules such as a task splitting unit. The scheduler may of course also be a software scheduler, which is not specifically limited here.

Step 203: modify the affinity mask of the target task according to the preset affinity mask modification rule.

In implementation, after splitting the target task into multiple target jobs, the processor may modify the affinity mask of the target task according to the preset affinity mask modification rule. The affinity mask (affinity) of the target task indicates which nodes may execute the target task. The affinity mask contains one bit for each node of the intelligent processor, and each bit uniquely corresponds to a node: a bit of 1 means the corresponding node may execute the target task, and a bit of 0 means it may not. The affinity mask modification rule may be set by a technician according to the range of nodes over which jobs are processed. For example, if the rule is that the task may migrate to all nodes and the original affinity mask of the target task is 0001, the processor may modify the affinity mask of the target task to 1111 according to the rule. As another example, if the rule is that the task may migrate to node 3 and node 4 and the original affinity mask of the target task is 0001, the processor may modify it to 1101 according to the rule. Step 202 and step 203 may be performed in either order.

It should be noted that, since a target job is obtained by splitting the target task, the affinity mask of the target job is the same as the affinity mask of the target task.
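A small sketch of the affinity relaxation in step 203 follows, under the assumption that node 1 maps to the least significant bit and that the modification rule is expressed as a set of permitted node ids.

```python
# Sketch of step 203: widening the affinity mask according to a configured rule.
# Node numbering (node 1 = least significant bit) and rule format are assumptions.
def relax_affinity(affinity_mask, allowed_node_ids):
    for node_id in allowed_node_ids:
        affinity_mask |= 1 << (node_id - 1)
    return affinity_mask

# Examples from the text: 0b0001 relaxed to all four nodes gives 0b1111,
# and 0b0001 relaxed to nodes 3 and 4 gives 0b1101.
assert relax_affinity(0b0001, [1, 2, 3, 4]) == 0b1111
assert relax_affinity(0b0001, [3, 4]) == 0b1101
```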
A job processing method provided by the present application is described below with reference to specific embodiments. As shown in Figure 2-2, the specific process is as follows.

Step 301: when the preset processing condition is met, determine, among the nodes, the first node matching the target job according to the job attributes of the target job contained in the target task, where the job attributes include the target number of arithmetic units required to execute the target job.

In implementation, after the processor has scheduled a task (the target task) to the software queue, it may determine whether the target task can be split into multiple target jobs (JOBs). If it can, the target task is a task whose affinity can be relaxed, and the processor may further determine whether the preset processing condition is met; the process of making this determination is described in detail later. When the preset processing condition is met, the processor may determine, among the nodes, the first node matching the target job according to the job attributes of the target job, where the job attributes include the target number of arithmetic units required to execute the target job.

Optionally, the processor determines the first node matching the target job among the nodes according to the job attributes of the target job as follows: for each node, if the number of idle arithmetic units in that node is greater than or equal to the target number, the node is determined as a first node.

In implementation, when the preset processing condition is met, the processor may obtain, for each node, the number of idle arithmetic units (that is, units whose true reference count equals 0) in that node. The processor may then check whether this number is greater than or equal to the target number. If it is, the node can execute the target job contained in the target task, and the processor may determine it as a first node. If the number of idle arithmetic units in the node is smaller than the target number, the node cannot execute the target job and is not a first node.
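The per-node test described above admits a very short sketch; the true_ref_count field name is an assumption reused from the earlier sketches.

```python
# Sketch of the first-node test in step 301: a node qualifies when it has at
# least target_units idle arithmetic units (true reference count of 0).
def find_first_nodes(nodes, target_units):
    return [n for n in nodes
            if sum(u.true_ref_count == 0 for u in n.units) >= target_units]
```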
As an optional implementation, as shown in Figure 2-3, the process by which the processor determines whether the preset processing condition is met is as follows:

Step 401: obtain the idle duration of the idle arithmetic units in each node.

In implementation, for each node, the processor may obtain the idle duration of the idle arithmetic units (that is, units whose true reference count equals 0) in that node.

Step 402: if the splittable task list contains a task with multiple jobs waiting to be executed, and the idle durations of the idle arithmetic units include an idle duration greater than or equal to the preset duration threshold, determine that the preset processing condition is met.

In implementation, after obtaining the idle duration of each idle arithmetic unit, the processor may further check whether the splittable task list contains a task with multiple jobs waiting to be executed, and whether the idle durations of the idle arithmetic units include a duration greater than or equal to a preset duration threshold, which may be set empirically by a technician. If the splittable task list contains a task with multiple jobs waiting to be executed and some idle duration reaches the preset duration threshold, the processor can distribute the tasks waiting in the splittable task list to idle arithmetic units in the nodes for execution, and it may determine that the preset processing condition is met. If the splittable task list contains no task waiting to be executed, or no idle duration reaches the preset duration threshold, the processor cannot distribute the waiting tasks in the splittable task list to idle arithmetic units, and it may determine that the preset processing condition is not met.
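Steps 401 and 402 might be sketched as follows; the idle_time() accessor, the jobs attribute, and the threshold are assumptions of this illustration.

```python
# Sketch of steps 401-402: the preset processing condition holds when a
# multi-job task is waiting and some idle unit has been idle long enough.
# idle_time() and the threshold are assumptions of this example.
def processing_condition_met(splittable_tasks, nodes, duration_threshold):
    has_waiting_task = any(len(t.jobs) > 1 and not t.done for t in splittable_tasks)
    longest_idle = max((u.idle_time() for n in nodes for u in n.units
                        if u.true_ref_count == 0), default=0)
    return has_waiting_task and longest_idle >= duration_threshold
```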
Step 302: execute the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.

In implementation, after determining the first node, the processor may execute the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located. In this way, when the target job would otherwise have to wait a long time before being executed by the arithmetic unit it is waiting on, the processor can execute the target job jointly through the first node and the second node, thereby reducing the waiting time of the target job and improving its execution efficiency.
As an optional implementation, before executing the target job contained in the target task through the first node and the second node, the processor may also modify the usage mask of the target job according to the determined first node. Specifically, in the usage mask of the target task, the bit corresponding to the first node is set to 1.

In implementation, the usage mask (usage_mask) of the target task indicates which nodes are determined to execute the target task. The usage mask contains one bit for each node of the intelligent processor, and each bit uniquely corresponds to a node: a bit of 1 means the corresponding node is determined to execute the target task, and a bit of 0 means it does not. After the processor has determined the first node of the target job and before it schedules the target job to the hardware queue, it may set the bit corresponding to the first node to 1 in the usage mask of the target task. For example, if the original usage mask of the target task is 0001 and the first nodes are node 2 and node 4, the modified usage mask of the target task is 1011.

It should be noted that, since the target job is obtained by splitting the target task, the usage mask of the target job is the same as the usage mask of the target task.
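A sketch of this usage-mask update, reusing the bit layout assumed in the earlier examples:

```python
# Sketch of the usage-mask update before the job is pushed to the hardware
# queue: each determined first node has its bit set. Bit layout as before.
def mark_first_nodes(usage_mask, first_node_ids):
    for node_id in first_node_ids:
        usage_mask |= 1 << (node_id - 1)
    return usage_mask

# Example from the text: original mask 0b0001, first nodes 2 and 4 -> 0b1011.
assert mark_first_nodes(0b0001, [2, 4]) == 0b1011
```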
As an optional implementation, before executing the target job contained in the target task through the first node and the second node, the processor may also determine, according to the affinity mask and the usage mask of the target job, whether the first node and the second node can execute the target job. Specifically, if the bits corresponding to the first node and the second node in both the affinity mask and the usage mask of the target job are 1, the step of executing the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located is performed.

In implementation, after obtaining the affinity mask and the usage mask of the target job, the processor may check, for each first node, whether the bits corresponding to that node in both masks are 1. If both bits are 1, the first node can execute the target job. Likewise, for the second node where the arithmetic unit that executes the target task is located, the processor may check whether the corresponding bits in both masks are also 1; if so, the second node can execute the target job. Accordingly, the processor can execute the target job contained in the target task through the first node and the second node. If a bit corresponding to a first node is 0, that first node cannot execute the target job, and the processor will execute the target job through the second node only. For example, if the affinity mask of the target job is 1101 and the usage mask is 1001, the first nodes are node 2 and node 4, and the second node is node 1, then the bits corresponding to node 1 are both 1 and the bits corresponding to node 4 are both 1, while the bit corresponding to node 2 in the affinity mask is 0; the processor can therefore execute the target job through node 1 and node 4.
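The combined mask check can be sketched as follows; the example reproduces the numeric case given above.

```python
# Sketch of the final check: a node executes the job only when its bit is set
# in both the affinity mask and the usage mask of the job.
def nodes_allowed_to_execute(candidate_node_ids, affinity_mask, usage_mask):
    return [nid for nid in candidate_node_ids
            if affinity_mask & usage_mask & (1 << (nid - 1))]

# Example from the text: affinity 0b1101, usage 0b1001, first nodes {2, 4},
# second node 1 -> the job runs on nodes 1 and 4 (node 2 fails the affinity bit).
assert nodes_allowed_to_execute([1, 2, 4], 0b1101, 0b1001) == [1, 4]
```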
In the job processing method provided by the embodiments of the present application, when the preset processing condition is met, the processor determines, among the nodes, the first node matching the target job according to the job attributes of the target job contained in the target task, where the job attributes include the target number of arithmetic units required to execute the target job. The processor then executes the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located. In this way, when the target job would otherwise have to wait a long time before being executed by the arithmetic unit it is waiting on, the processor can execute the target job jointly through the first node and the second node, thereby reducing the waiting time of the target job and improving its execution efficiency.
An embodiment of the present application further provides a job processing apparatus. As shown in Figure 2-4, the apparatus includes:

a first determining module 510, configured to determine, when the preset processing condition is met, the first node matching the target job among the nodes according to the job attributes of the target job contained in the target task, the job attributes including the target number of arithmetic units required to execute the target job;

an execution module 520, configured to execute the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.

As an optional implementation, the first determining module 510 is specifically configured to:

for each node, if the number of idle arithmetic units in that node is greater than or equal to the target number, determine the node as a first node.
As an optional implementation, the apparatus further includes:

an obtaining module, configured to obtain the idle duration of the idle arithmetic units in each node;

a second determining module, configured to determine that the preset processing condition is met if the splittable task list contains a task with multiple jobs waiting to be executed and the idle durations of the idle arithmetic units include an idle duration greater than or equal to the preset duration threshold.

As an optional implementation, the apparatus further includes:

a third determining module, configured to obtain the target task to be executed, and determine the dimension information of the target task and the target number of arithmetic units required to execute it;

an adding module, configured to add the target task to the splittable task list if the ratio of the product of the dimension information to the target number is greater than 1;

a modification module, configured to modify the affinity mask of the target task according to the preset affinity mask modification rule.
As an optional implementation, the apparatus further includes:

a setting module, configured to set the bit corresponding to the first node to 1 in the usage mask of the target task.

As an optional implementation, the apparatus further includes:

a fourth determining module, configured to trigger the execution module 520 to perform the step of executing the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located, if the bits corresponding to the first node and the second node in both the affinity mask and the usage mask of the target job are 1.

As an optional implementation, the affinity mask of the target job is the same as the affinity mask of the target task, and the usage mask of the target job is the same as the usage mask of the target task.

In the job processing apparatus provided by the embodiments of the present application, when the preset processing condition is met, the CPU determines, among the nodes, the first node matching the target job according to the job attributes of the target job contained in the target task, where the job attributes include the target number of arithmetic units required to execute the target job. The CPU then executes the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located. In this way, when the target job would otherwise have to wait a long time before being executed by the arithmetic unit it is waiting on, the CPU can execute the target job jointly through the first node and the second node, thereby reducing the waiting time of the target job and improving its execution efficiency.
In one embodiment, as shown in Figure 1-4, a computer device provided by the present application includes a memory and a processor. The memory stores a computer program executable on the processor, and when the processor executes the computer program, the steps of the job processing method described above are implemented. For the implementation process of the job processing method executed by the processor, reference may be made to Figure 2-1, Figure 2-2, Figure 2-3, and the description above, which will not be repeated here.

In one embodiment, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the job processing method described above.
It should be noted that, for the sake of brevity, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because according to the present disclosure some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.

It should further be noted that, although the steps in the flowcharts of Figure 1-2, Figure 2-1, Figure 2-2, and Figure 2-3 are displayed in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

It should be understood that the above apparatus embodiments are only illustrative, and the apparatus of the present disclosure may also be implemented in other ways. For example, the division of units/modules in the above embodiments is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units, modules, or components may be combined or integrated into another system, or some features may be omitted or not implemented.

In addition, unless otherwise specified, the functional units/modules in the embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or in the form of software program modules.

If an integrated unit/module is implemented in the form of hardware, the hardware may be a digital circuit, an analog circuit, and so on. The physical implementation of the hardware structure includes but is not limited to transistors, memristors, and so on. Unless otherwise specified, the artificial intelligence processor may be any appropriate hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and so on. Unless otherwise specified, the storage unit may be any appropriate magnetic storage medium or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and so on.

If an integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disk.

In the above embodiments, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the above embodiments may be combined arbitrarily; for brevity, not all possible combinations of these technical features are described, but as long as a combination of these technical features contains no contradiction, it should be considered to be within the scope of this specification.
The foregoing may be better understood according to the following clauses:

Clause A1. A task migration method, the method comprising:

when it is detected that a migratable task meets a preset migration condition, determining, among the nodes, a target node matching the migratable task according to the task attributes of the migratable task, the task attributes including the target number of arithmetic units required to execute the migratable task;

migrating the migratable task to the target node, so as to execute the migratable task through the target node.

Clause A2. The method according to clause A1, wherein the determining, among the nodes, a target node matching the migratable task according to the task attributes of the migratable task comprises:

if any node is a candidate node containing the target number of idle arithmetic units, determining, among the candidate nodes, the candidate node with the smallest distance to the node to which the arithmetic unit expected by the migratable task belongs as the target node;

if none of the nodes is a candidate node containing the target number of idle arithmetic units, determining the node containing the first arithmetic unit with the smallest total number of tasks to be executed as the target node, the first arithmetic unit being an arithmetic unit whose executed task has the same task attributes as the migratable task.
Clause A3. The method according to clause A1, the method further comprising:

if the task attributes of the task executed in the second arithmetic unit expected by the migratable task differ from the task attributes of the migratable task, determining that the migratable task meets the preset migration condition;

if the task attributes of the task executed in the second arithmetic unit are the same as the task attributes of the migratable task, judging whether the total number of tasks to be executed in the second arithmetic unit is greater than or equal to a first preset number threshold;

if the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold, determining that the migratable task meets the preset migration condition.

Clause A4. The method according to clause A1, wherein before the determining, among the nodes, a target node matching the migratable task according to the task attributes of the migratable task when it is detected that a migratable task meets the preset migration condition, the method further comprises:

obtaining the number of tasks expected to be executed by each arithmetic unit;

if the maximum difference between the numbers of tasks expected to be executed by the arithmetic units is greater than or equal to a second preset number threshold, performing the step of determining, among the nodes, the target node matching the migratable task according to the task attributes of the migratable task when it is detected that the migratable task meets the preset migration condition.
条款A5、根据条款A1所述的方法,所述方法还包括:Clause A5. The method according to clause A1, the method further comprising:
获取待执行的目标任务,并确定所述目标任务的任务类型、任务执行时长、所述目标任务所期望的第三运算单元所属的节点的最小跨节点访存延时和所述第三运算单元中期望执行的任务的数目;Obtain the target task to be executed, and determine the task type of the target task, the task execution duration, the minimum cross-node memory access delay of the node to which the third arithmetic unit is expected for the target task, and the third arithmetic unit The number of tasks expected to be performed in the
如果所述任务类型为计算密集型,和/或所述任务执行时长大于所述最小跨节点访存延时,和/或所述第三运算单元中期望执行的任务的数目大于或等于第三预设数目阈值,则确定所述目标任务为可迁移任务,并根据预设的亲和性掩码修改规则,修改所述目标任务的亲和性掩码。If the task type is computationally intensive, and/or the task execution time is greater than the minimum cross-node memory access delay, and/or the number of tasks expected to be executed in the third arithmetic unit is greater than or equal to the third If the preset number threshold is set, the target task is determined to be a transferable task, and the affinity mask of the target task is modified according to the preset affinity mask modification rule.
条款A6、根据条款A1所述的方法,所述将所述可迁移任务迁移至所述目标节点之前,所述方法还包括:Clause A6. The method according to clause A1, before the migrating the migratable task to the target node, the method further includes:
如果所述目标节点与所述可迁移任务所期望的运算单元所在的节点不相同,则在所述可迁移任务的使用掩码中,将所述目标节点对应的位置为1,并将所述可迁移任务所期望的运算单元所在的节点对应的位置为0。If the target node is not the same as the node where the computing unit expected by the migratable task is located, then in the use mask of the migratable task, the position corresponding to the target node is set to 1, and the The position corresponding to the node where the computing unit expected by the migratable task is located is 0.
Clause A7. The method according to clause A1, wherein before migrating the migratable task to the target node, the method further includes:
if the bits corresponding to the target node in both the affinity mask and the usage mask of the migratable task are 1, performing the step of migrating the migratable task to the target node.
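Under the same assumed bitmap representation, the pre-migration check of clause A7 reduces to a bitwise test, as the following sketch shows.

```python
def may_migrate(affinity_mask: int, usage_mask: int, target_node: int) -> bool:
    """Migrate only if the target node's bit is 1 in both the affinity mask
    and the usage mask of the migratable task."""
    bit = 1 << target_node
    return bool(affinity_mask & bit) and bool(usage_mask & bit)
```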
Clause A8. A device for task migration, the device including:
a first determining module configured to, when it is detected that a migratable task meets a preset migration condition, determine, in each node according to the task attribute of the migratable task, a target node that matches the migratable task, the task attribute including the target number of arithmetic units required to execute the migratable task;
a migration module configured to migrate the migratable task to the target node, so as to execute the migratable task through the target node.
Clause A9. A computer device, including a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the method according to any one of clauses A1 to A7 when executing the computer program.
Clause A10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of clauses A1 to A7.
Clause B1. A method of job processing, the method including:
when a preset processing condition is met, determining, in each node according to the job attribute of a target job contained in a target task, a first node that matches the target job, the job attribute including the target number of arithmetic units required to execute the target job;
executing the target job contained in the target task through the first node and a second node where the arithmetic unit that executes the target task is located.
Clause B2. The method according to clause B1, wherein the determining, in each node according to the job attribute of the target job contained in the target task, the first node that matches the target job includes:
for each of the nodes, if the number of idle arithmetic units in the node is greater than or equal to the target number, determining the node as the first node.
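A minimal sketch of the first-node selection in clause B2, assuming the per-node idle-unit counts are available as a mapping; the helper name and the choice of returning the first qualifying node are illustrative.

```python
def find_first_node(idle_units_per_node, target_number):
    """A node qualifies as the first node when it holds at least the target
    number of idle arithmetic units; returns the first match, or None."""
    for node_id, idle_units in idle_units_per_node.items():
        if idle_units >= target_number:
            return node_id
    return None
```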
Clause B3. The method according to clause B1, wherein the method further includes:
obtaining the maximum idle duration of the idle arithmetic units in each node;
if a task containing multiple jobs is waiting to be executed in the splittable task list, and the idle durations of the idle arithmetic units include an idle duration greater than or equal to a preset duration threshold, determining that the preset processing condition is met.
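The preset processing condition of clause B3 can be pictured as the conjunction of two tests, as in the sketch below; the list-of-dicts task representation and the threshold value are assumptions of the example.

```python
PRESET_DURATION_THRESHOLD_US = 1_000  # assumed preset duration threshold (microseconds)

def processing_condition_met(splittable_tasks, idle_durations_us):
    """The condition holds when a task with multiple jobs is waiting in the
    splittable task list and at least one idle arithmetic unit has been idle
    for at least the preset duration threshold."""
    waiting_multi_job = any(len(task["jobs"]) > 1 for task in splittable_tasks)
    long_idle_unit = any(d >= PRESET_DURATION_THRESHOLD_US for d in idle_durations_us)
    return waiting_multi_job and long_idle_unit
```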
Clause B4. The method according to clause B1, wherein before determining, in each node according to the job attribute of the target job contained in the target task, the first node that matches the target job when the preset processing condition is met, the method further includes:
obtaining a target task to be executed, and determining the dimension information of the target task and the target number of arithmetic units required to execute the target task;
if the ratio of the product of the dimension information to the target number is greater than 1, adding the target task to the splittable task list;
modifying the affinity mask of the target task according to a preset affinity mask modification rule.
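The splittability test of clause B4 compares the product of the task's dimension sizes with the number of required arithmetic units; the sketch below is one possible rendering, with the plain-list task list as an assumption.

```python
from math import prod

def enqueue_if_splittable(dims, target_number, splittable_list, task):
    """Add the task to the splittable task list when the ratio of the product
    of its dimension sizes to the target number of arithmetic units exceeds 1,
    i.e. the task contains more work items than units requested."""
    if prod(dims) / target_number > 1:
        splittable_list.append(task)
        return True
    return False

# Example: a task with dimensions (4, 2) requesting 4 units has ratio 2 > 1,
# so it would be added to the list.
```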
Clause B5. The method according to clause B1, wherein before executing the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located, the method further includes:
setting, in the usage mask of the target task, the bit corresponding to the first node to 1.
Clause B6. The method according to clause B1, wherein before executing the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located, the method further includes:
if the bits corresponding to the first node and the second node in both the affinity mask and the usage mask of the target job are 1, performing the step of executing the target job contained in the target task through the first node and the second node where the arithmetic unit that executes the target task is located.
Clause B7. The method according to clause B6, wherein the affinity mask of the target job is the same as the affinity mask of the target task, and the usage mask of the target job is the same as the usage mask of the target task.
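Clauses B5 to B7 can be read together as a mask update followed by a bitwise check before the job is dispatched to the two nodes; the sketch below combines them under the assumed bitmap representation, with the job inheriting both masks from its task.

```python
def prepare_and_check_job(affinity_mask: int, usage_mask: int,
                          first_node: int, second_node: int):
    """Set the first node's bit in the usage mask (clause B5), then require the
    bits of both the first and the second node to be 1 in the affinity mask and
    the usage mask (clause B6).  Returns (updated usage mask, may_execute)."""
    usage_mask |= (1 << first_node)                       # clause B5
    required = (1 << first_node) | (1 << second_node)
    may_execute = ((affinity_mask & required) == required
                   and (usage_mask & required) == required)
    return usage_mask, may_execute
```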
Clause B8. A device for job processing, the device including:
a first determining module configured to, when a preset processing condition is met, determine, in each node according to the job attribute of a target job contained in a target task, a first node that matches the target job, the job attribute including the target number of arithmetic units required to execute the target job;
an execution module configured to execute the target job contained in the target task through the first node and a second node where the arithmetic unit that executes the target task is located.
Clause B9. A computer device, including a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the method according to any one of clauses B1 to B7 when executing the computer program.
Clause B10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of clauses B1 to B7.
Those skilled in the art can understand that the structure shown in the figures is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.

Finally, it should also be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements that are inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but shall conform to the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for task migration, characterized in that the method comprises:
when it is detected that a migratable task meets a preset migration condition, determining, in each node according to a task attribute of the migratable task, a target node that matches the migratable task, the task attribute including a target number of arithmetic units required to execute the migratable task;
migrating the migratable task to the target node, so as to execute the migratable task through the target node.
2. The method according to claim 1, characterized in that the determining, in each node according to the task attribute of the migratable task, the target node that matches the migratable task comprises:
if a candidate node containing the target number of idle arithmetic units exists among the nodes, determining, among the candidate nodes, the candidate node with the smallest distance to the node to which the arithmetic unit expected by the migratable task belongs as the target node;
if no candidate node containing the target number of idle arithmetic units exists among the nodes, determining the node containing the first arithmetic unit with the smallest total number of tasks to be executed as the target node, the first arithmetic unit being an arithmetic unit on which the executed tasks have the same task attribute as the migratable task.
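Purely as an illustration of the selection logic recited in this claim, the following Python sketch prefers a sufficiently idle candidate node that is closest to the node of the expected arithmetic unit and otherwise falls back to the least loaded matching node; the dictionary fields and the distance callable are assumptions of the example.

```python
def choose_target_node(nodes, target_number, expected_node_id, distance):
    """nodes: assumed list of dicts with 'id', 'idle_units' and
    'pending_on_matching_unit' (tasks queued on the unit whose executed tasks
    share the migratable task's attribute); distance(a, b) is an assumed
    inter-node distance function."""
    candidates = [n for n in nodes if n["idle_units"] >= target_number]
    if candidates:
        # Enough idle units somewhere: pick the candidate closest to the expected node.
        return min(candidates, key=lambda n: distance(n["id"], expected_node_id))["id"]
    # Otherwise: pick the node whose matching-attribute unit has the smallest backlog.
    return min(nodes, key=lambda n: n["pending_on_matching_unit"])["id"]
```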
3. The method according to claim 1, characterized in that the method further comprises:
if the task attribute of the tasks executed in the second arithmetic unit expected by the migratable task is different from the task attribute of the migratable task, determining that the migratable task meets the preset migration condition;
if the task attribute of the tasks executed in the second arithmetic unit is the same as the task attribute of the migratable task, determining whether the total number of tasks to be executed in the second arithmetic unit is greater than or equal to a first preset number threshold;
if the total number of tasks to be executed in the second arithmetic unit is greater than or equal to the first preset number threshold, determining that the migratable task meets the preset migration condition.
4. The method according to claim 1, characterized in that, before determining, in each node according to the task attribute of the migratable task, the target node that matches the migratable task when it is detected that the migratable task meets the preset migration condition, the method further comprises:
obtaining the number of tasks expected to be executed in each arithmetic unit;
if the maximum difference between the numbers of tasks expected to be executed in the respective arithmetic units is greater than or equal to a second preset number threshold, performing the step of determining, in each node according to the task attribute of the migratable task, the target node that matches the migratable task when it is detected that the migratable task meets the preset migration condition.
5. The method according to claim 1, characterized in that the method further comprises:
obtaining a target task to be executed, and determining the task type of the target task, the task execution duration, the minimum cross-node memory access latency of the node to which the third arithmetic unit expected by the target task belongs, and the number of tasks expected to be executed in the third arithmetic unit;
if the task type is compute-intensive, and/or the task execution duration is greater than the minimum cross-node memory access latency, and/or the number of tasks expected to be executed in the third arithmetic unit is greater than or equal to a third preset number threshold, determining that the target task is a migratable task, and modifying the affinity mask of the target task according to a preset affinity mask modification rule.
6. The method according to claim 1, characterized in that, before migrating the migratable task to the target node, the method further comprises:
if the target node is different from the node where the arithmetic unit expected by the migratable task is located, setting, in the usage mask of the migratable task, the bit corresponding to the target node to 1 and setting the bit corresponding to the node where the arithmetic unit expected by the migratable task is located to 0.
7. The method according to claim 1, characterized in that, before migrating the migratable task to the target node, the method further comprises:
if the bits corresponding to the target node in both the affinity mask and the usage mask of the migratable task are 1, performing the step of migrating the migratable task to the target node.
8. A device for task migration, characterized in that the device comprises:
a first determining module configured to, when it is detected that a migratable task meets a preset migration condition, determine, in each node according to the task attribute of the migratable task, a target node that matches the migratable task, the task attribute including a target number of arithmetic units required to execute the migratable task;
a migration module configured to migrate the migratable task to the target node, so as to execute the migratable task through the target node.
9. A computer device, comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2021/070663 2020-01-07 2021-01-07 Task migration method and apparatus, and computer device and readable storage medium WO2021139726A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202010012302.6 2020-01-07
CN202010012242.8 2020-01-07
CN202010012242.8A CN113157427B (en) 2020-01-07 2020-01-07 Method, device, computer equipment and readable storage medium for task migration
CN202010012302.6A CN113157403A (en) 2020-01-07 2020-01-07 Job processing method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
WO2021139726A1

Family

ID=76788446

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070663 WO2021139726A1 (en) 2020-01-07 2021-01-07 Task migration method and apparatus, and computer device and readable storage medium

Country Status (1)

Country Link
WO (1) WO2021139726A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101126992A (en) * 2006-08-15 2008-02-20 国际商业机器公司 Method and system for dispensing multiple tasks at multiple node of network
CN101889265A (en) * 2007-12-07 2010-11-17 微软公司 Kernel processor grouping
CN101458634B (en) * 2008-01-22 2011-03-16 中兴通讯股份有限公司 Load equilibration scheduling method and device
CN102473161A (en) * 2009-08-18 2012-05-23 国际商业机器公司 Decentralized load distribution to reduce power and/or cooling cost in event-driven system
CN105210038A (en) * 2013-05-15 2015-12-30 英派尔科技开发有限公司 Core affinity bitmask translation
EP3101548A1 (en) * 2015-06-03 2016-12-07 Fujitsu Limited Parallel computer, migration program and migration method
CN106844051A (en) * 2017-01-19 2017-06-13 河海大学 The loading commissions migration algorithm of optimised power consumption in a kind of edge calculations environment

Similar Documents

Publication Publication Date Title
Hashem et al. Honey bee based load balancing in cloud computing
WO2017016421A1 (en) Method of executing tasks in a cluster and device utilizing same
Farzanyar et al. Efficient mining of frequent itemsets in social network data based on MapReduce framework
Li et al. Map-Balance-Reduce: An improved parallel programming model for load balancing of MapReduce
CN108549583B (en) Big data processing method and device, server and readable storage medium
WO2017112077A1 (en) Optimizing skewed joins in big data
US20110265098A1 (en) Message Passing with Queues and Channels
WO2016177279A1 (en) Data processing method and system
CN106250233B (en) MapReduce performance optimization system and optimization method
CN109388486B (en) Data placement and migration method for heterogeneous memory and multi-type application mixed deployment scene
CN108427602B (en) Distributed computing task cooperative scheduling method and device
CN104182278A (en) Method and device for judging busy degree of computer hardware resource
Grosof et al. Optimal scheduling in the multiserver-job model under heavy traffic
Yu et al. Cloud task scheduling algorithm based on three queues and dynamic priority
Farzanyar et al. Accelerating frequent itemsets mining on the cloud: a MapReduce-based approach
US8543722B2 (en) Message passing with queues and channels
Slagter et al. SmartJoin: a network-aware multiway join for MapReduce
WO2021139726A1 (en) Task migration method and apparatus, and computer device and readable storage medium
CN113157427B (en) Method, device, computer equipment and readable storage medium for task migration
CN110175172A (en) Very big two points of groups parallel enumerating method based on sparse bipartite graph
Mao et al. A fine-grained and dynamic MapReduce task scheduling scheme for the heterogeneous cloud environment
CN113157403A (en) Job processing method and device, computer equipment and readable storage medium
US20210149746A1 (en) Method, System, Computer Readable Medium, and Device for Scheduling Computational Operation Based on Graph Data
Rad et al. Brain drain optimization: a novel approach for task scheduling in the cloud computing
Bengre et al. A learning-based scheduler for high volume processing in data warehouse using graph neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21738158

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21738158

Country of ref document: EP

Kind code of ref document: A1
