CN113157403A - Job processing method and device, computer equipment and readable storage medium - Google Patents

Job processing method and device, computer equipment and readable storage medium

Info

Publication number
CN113157403A
Authority
CN
China
Prior art keywords
target
node
job
task
target task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010012302.6A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202010012302.6A priority Critical patent/CN113157403A/en
Priority to PCT/CN2021/070663 priority patent/WO2021139726A1/en
Publication of CN113157403A publication Critical patent/CN113157403A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching

Abstract

The application relates to a job processing method and apparatus, a computer device, and a readable storage medium. The method comprises the following steps: when it is detected that a target job satisfies a preset processing condition, determining, among the nodes, a target node matched with the target job according to job attributes of the target job, wherein the job attributes comprise a target number of arithmetic units required to execute the target job; and dispatching the target job to the target node so that the target job is executed by the target node. With the method and apparatus of the application, the waiting time of the target job can be shortened and the execution efficiency of the target job improved.

Description

Job processing method and device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a job, a computer device, and a readable storage medium.
Background
At present, a Non-Uniform Memory Access (NUMA) architecture is generally adopted in chip designs for artificial intelligence applications. A chip based on a NUMA architecture typically includes a processor having a plurality of arithmetic units and a plurality of storage units. The arithmetic units are usually divided into a plurality of arithmetic unit groups, each arithmetic unit group is provided with at least one storage unit, and one arithmetic unit group together with its corresponding storage unit forms one node. Therefore, the data reads and writes required by an arithmetic unit in a node can be served by the storage unit in that node.
While the chip is running, a task to be executed needs to be allocated to a node for execution. The specific allocation process is as follows: first, the amount of memory required to execute the task is determined; then, according to the storage unit corresponding to each node, a target node whose remaining memory satisfies that amount is determined. For example, the node with the largest remaining memory may be used as the target node, or one node may be selected at random from the nodes whose remaining memory is larger than the required amount. The task is then allocated to the target node for execution based on the affinity binding principle.
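For illustration only, the memory-based selection in this background approach can be sketched as follows; only the "largest remaining memory" policy is shown, and the structure fields and function name are assumptions made for this example, not part of the background art:

    #include <stddef.h>

    #define NODE_COUNT 4

    struct node {
        size_t mem_free_bytes;   /* remaining space of the node's storage unit */
    };

    /* Pick the node with the largest remaining memory that still fits the task;
     * returns -1 when no node has enough remaining memory. */
    int pick_target_node(const struct node nodes[NODE_COUNT], size_t task_mem_bytes)
    {
        int best = -1;
        for (int i = 0; i < NODE_COUNT; i++) {
            if (nodes[i].mem_free_bytes < task_mem_bytes)
                continue;
            if (best < 0 || nodes[i].mem_free_bytes > nodes[best].mem_free_bytes)
                best = i;
        }
        return best;
    }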
However, because of the affinity binding principle, a job often has to wait for a long time under the above allocation process, which seriously affects the execution efficiency of the job.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device and a readable storage medium for job processing.
In a first aspect, a method for processing a job is provided, the method comprising:
when a preset processing condition is met, determining a first node matched with a target job in each node according to the job attribute of the target job contained in a target task, wherein the job attribute comprises the target number of arithmetic units required by executing the target job;
and executing the target job contained in the target task through the first node and a second node where the arithmetic unit executing the target task is located.
As an optional implementation manner, the determining, in each node, a first node matching a target job according to a job attribute of the target job included in the target task includes:
and for each node in the nodes, if the number of idle operation units in the node is greater than or equal to the target number, determining the node as a first node.
As an optional implementation, the method further comprises:
acquiring the idle time of an idle operation unit in each node;
and if the task which is to be executed and comprises a plurality of jobs exists in the detachable task list and the idle time length which is greater than or equal to the preset time length threshold exists in the idle time length of each idle operation unit, determining that the preset processing condition is met.
As an optional implementation manner, before determining, in each node, a first node matching a target job according to a job attribute of the target job included in a target task when a preset processing condition is satisfied, the method further includes:
acquiring a target task to be executed, and determining all dimension information of the target task and the target number of arithmetic units required for executing the target task;
if the ratio of the product of the dimension information to the target number is greater than 1, adding the target task into a detachable task list;
and modifying the affinity mask of the target task according to a preset affinity mask modification rule.
As an optional implementation manner, before the target job included in the target task is executed by the first node and the second node where the operation unit executing the target task is located, the method further includes:
and in the usage mask of the target task, setting the bit corresponding to the first node to 1.
As an optional implementation manner, before the target job included in the target task is executed by the first node and the second node where the operation unit executing the target task is located, the method further includes:
and if the bits corresponding to the first node and the second node in the affinity mask and the usage mask of the target task are all 1, executing the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located.
As an optional implementation, the affinity mask of the target job is the same as the affinity mask of the target task, and the usage mask of the target job is the same as the usage mask of the target task.
In a second aspect, there is provided an apparatus for job processing, the apparatus comprising:
the system comprises a first determining module, a first judging module and a second determining module, wherein the first determining module is used for determining a first node matched with a target job in each node according to the job attribute of the target job contained in a target task when a preset processing condition is met, and the job attribute comprises the target number of arithmetic units required by executing the target job;
and the execution module is used for executing the target job contained in the target task through the first node and a second node where the arithmetic unit for executing the target task is located.
In a third aspect, a computer device is provided, comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, the processor implementing the steps of the method of any one of the first aspect when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any one of the first aspects.
The embodiments of the application provide a job processing method and apparatus, a computer device, and a readable storage medium. When a preset processing condition is met, the CPU determines, among the nodes, a first node matched with a target job according to the job attribute of the target job contained in a target task. The job attribute includes the target number of arithmetic units required to execute the target job. Then, the CPU executes the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located. In this way, when the target job would otherwise have to wait a long time for the arithmetic unit it expects to become available, the CPU can execute the target job through the first node and the second node together, thereby reducing the waiting time of the target job and improving the execution efficiency of the target job.
Drawings
FIG. 1 is a schematic diagram of an intelligent processor provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for job splitting and affinity mask modification according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a method for processing jobs according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for determining processing conditions according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an apparatus for processing a job according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
The embodiments of the present application provide a job processing method, which may be applied to a chip. The chip may include an intelligent processor adopting a NUMA architecture and a general-purpose processor, and the general-purpose processor may be a Central Processing Unit (CPU) or the like. The intelligent processor adopting the NUMA architecture may be an acceleration processor, an Intelligent Processing Unit (IPU), a Graphics Processing Unit (GPU), or another type of processor, which the embodiments of the present application do not limit. Specifically, the general-purpose processor (CPU) in the chip may execute the job processing method to distribute a plurality of jobs to at least one arithmetic unit in the intelligent processor for execution. The specific execution procedure of the job processing method of the present application is described below.
Optionally, the intelligent processor of the NUMA architecture includes a plurality of arithmetic units and a plurality of storage units. The arithmetic units are usually divided into a plurality of arithmetic unit groups, each arithmetic unit group is provided with at least one storage unit, and one arithmetic unit group together with its corresponding storage unit forms a node. Data reads and writes required by an arithmetic unit in a node can be served by the storage unit in that node, while data exchange between different nodes is realized through a communication interface. Fig. 1 is a schematic diagram of an intelligent processor of a NUMA architecture according to an embodiment of the present application. As shown in fig. 1, the intelligent processor includes 16 arithmetic units and 4 storage units and is divided into 4 nodes, each of which includes 4 arithmetic units and 1 storage unit.
Fig. 1 is merely illustrative; in other possible implementations, each node may include more than four arithmetic units and one storage unit, and the storage unit may include a plurality of sub storage units. For example, each node may comprise four child nodes, i.e. each node may comprise 16 arithmetic units, where each child node comprises four arithmetic units and one child storage unit, and the four child nodes may be arranged in the same manner as the four nodes. Further, the job processing method described above may also be executed among the child nodes of a single node; for its execution process, reference may be made to the following description of the job processing method.
After a task is dispatched to the software queue, the processor may allocate, in the node to which the storage unit storing the task data of the task belongs, the arithmetic units desired by the task according to the number of arithmetic units required to execute the task, and add 1 to the waiting reference count (i.e., clu_wait_ref) of each arithmetic unit desired by the task. For example, as shown in fig. 1, if the number of arithmetic units required by the task is 2 and the storage unit storing the task data of the task is storage unit 1, the processor may determine arithmetic units 1 and 2 in node 1 as the arithmetic units desired by the task, and add 1 to the waiting reference counts of arithmetic units 1 and 2.
When the processor determines the arithmetic units to execute the task, the task is dispatched to the hardware queue and the true reference count (i.e., clu_real_ref) of the arithmetic units executing the task is incremented by 1. For example, as shown in fig. 1, after the processor determines that the arithmetic units performing the task are arithmetic unit 1 and arithmetic unit 2, the processor may add 1 to the true reference counts of arithmetic unit 1 and arithmetic unit 2.
When the task is completed, the waiting reference count of each arithmetic unit desired by the task is decremented by 1, and at the same time, the true reference count of each arithmetic unit executing the task is decremented by 1. For example, after the arithmetic units 1 and 2 execute the task, the processor may decrement the waiting reference count and the true reference count of the arithmetic units 1 and 2 by 1. If the arithmetic unit expected by the task is migrated, the waiting reference count of each source arithmetic unit expected by the task is decreased by 1, and the waiting reference count of each destination arithmetic unit expected by the task is increased by 1. For example, as shown in fig. 1, if the arithmetic unit expected by the task is migrated from the arithmetic units 1 and 2 to the arithmetic units 3 and 4, the processor decrements the waiting reference counts of the arithmetic units 1 and 2 by 1 and increments the waiting reference counts of the arithmetic units 3 and 4 by 1.
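The bookkeeping described in the three preceding paragraphs can be summarized with the following sketch. C is used here purely for illustration; apart from clu_wait_ref and clu_real_ref, the type names, field names and helper signatures are assumptions and not part of the disclosure:

    #include <stdint.h>

    /* One arithmetic unit with the two reference counts named above. */
    struct arith_unit {
        uint32_t clu_wait_ref;   /* waiting reference count */
        uint32_t clu_real_ref;   /* true reference count; 0 means the unit is idle */
    };

    /* The task is dispatched to the software queue: it now waits on the
     * arithmetic units it desires. */
    void on_task_enqueued(struct arith_unit **desired, int n)
    {
        for (int i = 0; i < n; i++)
            desired[i]->clu_wait_ref++;
    }

    /* The executing units are decided and the task goes to the hardware queue. */
    void on_task_dispatched(struct arith_unit **executing, int n)
    {
        for (int i = 0; i < n; i++)
            executing[i]->clu_real_ref++;
    }

    /* The task has completed: release both counts. */
    void on_task_completed(struct arith_unit **desired, struct arith_unit **executing, int n)
    {
        for (int i = 0; i < n; i++) {
            desired[i]->clu_wait_ref--;
            executing[i]->clu_real_ref--;
        }
    }

    /* The desired units are migrated from source units to destination units. */
    void on_task_migrated(struct arith_unit **src, struct arith_unit **dst, int n)
    {
        for (int i = 0; i < n; i++) {
            src[i]->clu_wait_ref--;
            dst[i]->clu_wait_ref++;
        }
    }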
The embodiments of the present application first introduce the splitting of a job and the modification of the affinity mask. As shown in fig. 2, the specific processing procedure is as follows:
step 201, obtaining a target task to be executed, and determining each dimension information of the target task and a target number of arithmetic units required for executing the target task.
In implementation, after a task (i.e., a target task) is scheduled to a software queue, the processor may determine the dimension information (i.e., dimX, dimY, and dimZ) of the target task and the target number of arithmetic units (i.e., kernel_class) required to execute the target task. The processor may then calculate the ratio of the product of the dimension information (i.e., dimX × dimY × dimZ) to the target number and determine whether the ratio is greater than 1. If the ratio is greater than 1, indicating that the target task can be split into multiple jobs, the processor performs step 202. If the ratio is less than or equal to 1, the target task cannot be split into a plurality of jobs.
Step 202, if the ratio of the product of the dimension information to the target number is greater than 1, adding the target task into the detachable task list.
In implementation, if the ratio is greater than 1, it indicates that the target task can be split into multiple jobs. Accordingly, the processor may add the target task to the detachable task list. The detachable task list is used for storing tasks that can be split into a plurality of jobs; it may be a linked list or another type of list, which the embodiments of the present application do not limit. In addition, when all the jobs included in a task in the detachable task list have been executed, the processor may delete the task from the detachable task list.
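A minimal sketch of the splittability test of step 201 and the list insertion of step 202 is given below; the task fields mirror dimX, dimY, dimZ and kernel_class, while the linked-list layout is an assumption chosen for this illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    struct task {
        uint32_t dimX, dimY, dimZ;     /* dimension information of the task */
        uint32_t kernel_class;         /* target number of arithmetic units required */
        struct task *next;             /* linkage in the detachable task list */
    };

    static struct task *detachable_list = NULL;   /* tasks that can be split into jobs */

    /* Returns true and records the task as detachable when the ratio
     * (dimX * dimY * dimZ) / kernel_class is greater than 1. */
    bool try_mark_detachable(struct task *t)
    {
        uint64_t job_count = (uint64_t)t->dimX * t->dimY * t->dimZ;
        if (job_count <= t->kernel_class)      /* ratio <= 1: cannot be split */
            return false;
        t->next = detachable_list;             /* ratio > 1: add to the detachable task list */
        detachable_list = t;
        return true;
    }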
Alternatively, when it is determined that the target task can be split into a plurality of jobs, the target task may be sent to a scheduler, and the scheduler may split the target task into a plurality of jobs according to task attributes such as each dimension information of the target task and a target number of arithmetic units required by the target task. Further alternatively, the scheduler may be a hardware scheduler disposed on a chip, and the hardware scheduler may include a plurality of circuit modules such as a task splitting unit. Of course, the scheduler may also be a software scheduler, and is not limited herein.
And step 203, modifying the affinity mask of the target task according to a preset affinity mask modification rule.
In implementation, after the processor splits the target task into a plurality of target jobs, the processor may modify the affinity mask of the target task according to a preset affinity mask modification rule. The affinity mask (affinity) of the target task indicates which of the nodes may execute the target task: the affinity mask contains as many bits as there are nodes in the intelligent processor, and each bit uniquely corresponds to one node. If a bit is 1, the node corresponding to the bit may execute the target task; if a bit is 0, the node corresponding to the bit may not execute the target task. The affinity mask modification rule may be set by a technician according to the range of nodes available for job processing. For example, if the affinity mask modification rule is that a task may be migrated to any node and the original affinity mask of the target task is 0001, the processor may modify the affinity mask of the target task to 1111 according to the rule. For another example, if the affinity mask modification rule is that the task may be migrated to node 3 and node 4, and the original affinity mask of the target task is 0001, the processor may modify the affinity mask of the target task to 1101 according to the rule. Steps 202 and 203 may be performed in either order.
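Using the 4-node examples above and assuming node 1 maps to the least-significant bit (which is how the 0001/1101 examples read), the modification of step 203 reduces to OR-ing in the bits allowed by the rule; this is a sketch under that assumption only:

    #include <stdint.h>

    #define NODE_BIT(n)  (1u << ((n) - 1))   /* node n corresponds to bit (n - 1) */

    /* Relax an affinity mask by the set of nodes the modification rule allows. */
    uint8_t relax_affinity(uint8_t affinity, uint8_t rule_allowed_nodes)
    {
        return (uint8_t)(affinity | rule_allowed_nodes);
    }

    /* Examples from the text:
     * relax_affinity(0x1, 0xF)                        -> 0xF  (0001 becomes 1111)
     * relax_affinity(0x1, NODE_BIT(3) | NODE_BIT(4))  -> 0xD  (0001 becomes 1101) */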
Since the target job is split from the target task, the affinity mask of the target job is the same as the affinity mask of the target task.
The job processing method provided by the present application will now be described with reference to specific embodiments. As shown in fig. 3, the specific processing procedure is as follows.
Step 301, when a preset processing condition is met, according to the job attribute of the target job included in the target task, determining a first node matched with the target job in each node. Wherein the job attribute includes a target number of arithmetic units required to execute the target job.
In implementation, after the processor schedules a task (i.e., a target task) to the software queue, the processor may determine whether the target task can be split into a plurality of target jobs (JOBs). If the target task can be split into a plurality of target jobs, it can be determined that the target task is a task whose affinity can be relaxed, and the processor can then further determine whether a preset processing condition is satisfied. The process by which the processor determines whether the preset processing condition is satisfied is described in detail later. When the preset processing condition is satisfied, the processor may determine, among the nodes, a first node that matches the target job according to the job attribute of the target job, wherein the job attribute of the target job includes a target number of arithmetic units required to execute the target job.
Optionally, the processor determines, among the nodes, the first node matched with the target job according to the job attribute of the target job as follows: for each of the nodes, if the number of idle arithmetic units in the node is greater than or equal to the target number, the node is determined to be a first node.
In implementation, when the preset processing condition is satisfied, for each node the processor may obtain the number of idle arithmetic units (i.e., those whose true reference count equals 0) in the node. The processor may then determine whether the number of idle arithmetic units in the node is greater than or equal to the target number. If the number of idle arithmetic units in the node is greater than or equal to the target number, the node can execute the target job contained in the target task; accordingly, the processor may identify the node as a first node. If the number of idle arithmetic units in the node is less than the target number, the node cannot execute the target job contained in the target task; accordingly, the node is not a first node.
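A sketch of this matching step, reusing the idle criterion "true reference count equal to 0"; the array sizes, helper names and the bitmask return value are assumptions made for illustration:

    #include <stdint.h>

    #define UNITS_PER_NODE 4
    #define NODE_COUNT     4

    struct arith_unit { uint32_t clu_real_ref; };                      /* 0 means idle */
    struct numa_node  { struct arith_unit units[UNITS_PER_NODE]; };

    static int idle_unit_count(const struct numa_node *node)
    {
        int idle = 0;
        for (int i = 0; i < UNITS_PER_NODE; i++)
            if (node->units[i].clu_real_ref == 0)
                idle++;
        return idle;
    }

    /* Return a bitmask of first nodes (node k at bit k - 1): every node whose
     * idle arithmetic units suffice for the target number of the job. */
    uint8_t find_first_nodes(const struct numa_node nodes[NODE_COUNT], uint32_t target_number)
    {
        uint8_t first_nodes = 0;
        for (int n = 0; n < NODE_COUNT; n++)
            if ((uint32_t)idle_unit_count(&nodes[n]) >= target_number)
                first_nodes |= (uint8_t)(1u << n);
        return first_nodes;
    }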
As an alternative embodiment, as shown in fig. 4, the process of the processor determining whether the preset process condition is satisfied is as follows:
step 401, obtaining the idle time length of the idle operation unit in each node.
In implementation, for each of the nodes, the processor may obtain the idle duration of each idle arithmetic unit (i.e., each unit whose true reference count equals 0) in the node.
Step 402, if a task to be executed including a plurality of jobs exists in the detachable task list and an idle time length greater than or equal to a preset time length threshold exists in the idle time lengths of the idle operation units, determining that a preset processing condition is met.
In implementation, after obtaining the idle duration of each idle arithmetic unit, the processor may further determine whether a task that is waiting to be executed and contains a plurality of jobs exists in the detachable task list, and whether an idle duration greater than or equal to a preset duration threshold exists among the idle durations of the idle arithmetic units. The preset duration threshold may be set by a technician based on experience. If such a task exists in the detachable task list and at least one idle duration is greater than or equal to the preset duration threshold, the processor can allocate the task waiting in the detachable task list to the idle arithmetic units in the nodes for execution; accordingly, the processor may determine that the preset processing condition is satisfied. If no such task exists in the detachable task list, or no idle duration is greater than or equal to the preset duration threshold, the processor cannot allocate a task from the detachable task list to the idle arithmetic units for execution; accordingly, the processor may determine that the preset processing condition is not satisfied.
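The two checks of steps 401 and 402 can be combined into a single predicate, sketched here; the parameter names and the idea of passing the idle durations as an array are assumptions made for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Preset processing condition: a splittable task is waiting, and at least
     * one idle arithmetic unit has been idle for at least the preset threshold. */
    bool processing_condition_met(bool detachable_task_waiting,
                                  const uint64_t idle_durations_ns[],
                                  size_t idle_unit_count,
                                  uint64_t threshold_ns)
    {
        if (!detachable_task_waiting)
            return false;
        for (size_t i = 0; i < idle_unit_count; i++)
            if (idle_durations_ns[i] >= threshold_ns)
                return true;
        return false;
    }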
Step 302, executing the target job contained in the target task through the first node and the second node where the operation unit executing the target task is located.
In implementation, after the processor determines the first node, the processor may execute the target job included in the target task through the first node and the second node where the arithmetic unit executing the target task is located. In this way, when the target job would otherwise have to wait a long time for the arithmetic unit it expects to become available, the processor can execute the target job through the first node and the second node together, so that the waiting time of the target job is reduced and the execution efficiency of the target job is improved.
As an optional implementation manner, before the processor executes the target job included in the target task through the first node and the second node, the processor may further modify the usage mask of the target job according to the determined first node, where the specific processing procedure is as follows: in the usage mask of the target task, the bit corresponding to the first node is set to 1.
In implementation, the usage mask (usage_mask) of the target task indicates which of the nodes have been determined to execute the target task: the usage mask contains as many bits as there are nodes in the intelligent processor, and each bit uniquely corresponds to one node. If a bit is 1, the node corresponding to the bit has been determined to execute the target task; if a bit is 0, the node corresponding to the bit does not execute the target task. After the processor determines the first node of the target job and before the processor schedules the target job to the hardware queue, the bit corresponding to the first node may be set to 1 in the usage mask of the target task. For example, suppose the original usage mask of the target task is 0001 and the first nodes are node 2 and node 4; the modified usage mask of the target task is then 1011.
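A one-line sketch of this update, using the same node-to-bit mapping assumed earlier (the function name is illustrative only):

    #include <stdint.h>

    #define NODE_BIT(n)  (1u << ((n) - 1))   /* node n corresponds to bit (n - 1) */

    /* Record the chosen first nodes in the usage mask before the job is
     * scheduled to the hardware queue. */
    uint8_t mark_first_nodes(uint8_t usage_mask, uint8_t first_nodes)
    {
        return (uint8_t)(usage_mask | first_nodes);
    }

    /* Example from the text: mark_first_nodes(0x1, NODE_BIT(2) | NODE_BIT(4))
     * -> 0xB, i.e. usage mask 0001 becomes 1011 for first nodes 2 and 4. */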
Since the target job is split from the target task, the usage mask of the target job is the same as that of the target task.
As an optional implementation manner, before the processor executes a target job included in the target task through the first node and the second node, the processor may further determine whether the first node and the second node can execute the target job according to the affinity mask and the usage mask of the target job, where the specific processing procedure is as follows: if the bits corresponding to the first node and the second node in the affinity mask and the usage mask of the target job are all 1, executing the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located.
In implementation, after the processor obtains the affinity mask and the usage mask of the target job, for each first node, the processor may determine whether the bits corresponding to that first node are both 1 in the affinity mask and the usage mask of the target job. If both bits corresponding to the first node are 1, it indicates that the first node can execute the target job. Similarly, for the second node where the arithmetic unit executing the target task is located, the processor may determine whether the bits corresponding to the second node are both 1 in the affinity mask and the usage mask of the target job. If both bits corresponding to the second node are 1, it indicates that the second node can execute the target job; accordingly, the processor can execute the target job contained in the target task through the first node and the second node. If either bit corresponding to a first node is 0, it indicates that that first node cannot execute the target job, and the processor will execute the target job only through the second node. For example, if the affinity mask of the target job is 1101, the usage mask is 1001, the first nodes are node 2 and node 4, and the second node is node 1, then the bits corresponding to node 1 are all 1, the bits corresponding to node 4 are all 1, and the bit of node 2 in the affinity mask is 0, so the processor may execute the target job through node 1 and node 4.
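The final check reduces to a bitwise AND over the candidate nodes, sketched here under the same bit-layout assumption; the function and parameter names are illustrative, not part of the disclosure:

    #include <stdint.h>

    #define NODE_BIT(n)  (1u << ((n) - 1))   /* node n corresponds to bit (n - 1) */

    /* A node runs the target job only when its bit is 1 in both the affinity
     * mask and the usage mask; candidates are the first nodes plus the second
     * node that already holds the arithmetic unit executing the target task. */
    uint8_t nodes_allowed_to_run(uint8_t affinity, uint8_t usage_mask,
                                 uint8_t first_nodes, uint8_t second_node)
    {
        uint8_t candidates = (uint8_t)(first_nodes | second_node);
        return (uint8_t)(candidates & affinity & usage_mask);
    }

    /* Example from the text: affinity 1101, usage mask 1001, first nodes {2, 4},
     * second node 1:
     * nodes_allowed_to_run(0xD, 0x9, NODE_BIT(2) | NODE_BIT(4), NODE_BIT(1)) == 0x9,
     * i.e. the target job is executed through node 1 and node 4. */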
The embodiments of the application provide a job processing method. When the preset processing condition is met, the processor determines, among the nodes, a first node matched with the target job according to the job attribute of the target job contained in the target task. The job attribute includes the target number of arithmetic units required to execute the target job. Then, the processor executes the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located. In this way, when the target job would otherwise have to wait a long time for the arithmetic unit it expects to become available, the processor can execute the target job through the first node and the second node together, so that the waiting time of the target job is reduced and the execution efficiency of the target job is improved.
An embodiment of the present application further provides an apparatus for processing a job, as shown in fig. 5, the apparatus includes:
a first determining module 510, configured to determine, according to a job attribute of a target job included in the target task, a first node matching the target job among the nodes when a preset processing condition is satisfied, where the job attribute includes a target number of arithmetic units required to execute the target job;
and the execution module 520 is configured to execute the target job included in the target task through the first node and the second node where the arithmetic unit for executing the target task is located.
As an optional implementation manner, the first determining module 510 is specifically configured to:
for each node in the nodes, if the number of idle operation units in the node is larger than or equal to the target number, the node is determined as the first node.
As an optional implementation, the apparatus further comprises:
the acquisition module is used for acquiring the idle time of the idle operation unit in each node;
and the second determining module is used for determining that the preset processing condition is met if the task which is waiting to be executed and contains a plurality of jobs exists in the detachable task list and the idle time length which is greater than or equal to the preset time length threshold exists in the idle time lengths of the idle operation units.
As an optional implementation, the apparatus further comprises:
the third determining module is used for acquiring a target task to be executed and determining all dimension information of the target task and the target number of the arithmetic units required by executing the target task;
the adding module is used for adding the target task into the detachable task list if the ratio of the product of the dimension information to the target number is greater than 1;
and the modification module is used for modifying the affinity mask of the target task according to a preset affinity mask modification rule.
As an optional implementation, the apparatus further comprises:
and the setting module is used for setting the bit corresponding to the first node to 1 in the usage mask of the target task.
As an optional implementation, the apparatus further comprises:
and a fourth determining module, configured to trigger the execution module 520 to execute the step of executing the target job included in the target task through the first node and the second node where the arithmetic unit executing the target task is located, if the bits corresponding to the first node and the second node in the affinity mask and the usage mask of the target job are all 1.
As an alternative embodiment, the affinity mask of the target job is the same as the affinity mask of the target task, and the usage mask of the target job is the same as the usage mask of the target task.
The embodiments of the application provide a job processing apparatus. When the preset processing condition is met, the CPU determines, among the nodes, a first node matched with the target job according to the job attribute of the target job contained in the target task. The job attribute includes the target number of arithmetic units required to execute the target job. Then, the CPU executes the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located. In this way, when the target job would otherwise have to wait a long time for the arithmetic unit it expects to become available, the CPU can execute the target job through the first node and the second node together, thereby reducing the waiting time of the target job and improving the execution efficiency of the target job.
In one embodiment, a computer device is provided, as shown in fig. 6, comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor; the processor, when executing the computer program, implements the steps of the job processing method described above.
In one embodiment, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method of job processing described above.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
It is further noted that, although the various steps in the flowcharts of figs. 2-4 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 2-4 may include multiple sub-steps or multiple stages that are not necessarily performed at the same moment but may be performed at different moments, and the order of performance of these sub-steps or stages is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
It should be understood that the above-described apparatus embodiments are merely exemplary, and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.
In addition, unless otherwise specified, each functional unit/module in the embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or software program modules.
If the integrated unit/module is implemented in hardware, the hardware may be a digital circuit, an analog circuit, or the like. Physical implementations of the hardware structure include, but are not limited to, transistors, memristors, and the like. The artificial intelligence processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, etc., unless otherwise specified. Unless otherwise specified, the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and so on.
The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The foregoing may be better understood in light of the following clauses:
clause a1, corresponding to right 1; clause a2, corresponding to right 2; clause a3, corresponding to right 3; clause a4, corresponding to right 4; clause a5, corresponding to right 5; clause a6, corresponding to right 6; clause a7, corresponding to claim 7; clause A8, corresponding to right 8; clause a9, corresponding to claim 9.
For example, clause a1, a method of job processing, the method comprising:
when a preset processing condition is met, determining a first node matched with a target job in each node according to the job attribute of the target job contained in a target task, wherein the job attribute comprises the target number of arithmetic units required by executing the target job;
and executing the target job contained in the target task through the first node and a second node where the arithmetic unit executing the target task is located.
Clause a2, the method of clause a1, the determining, among nodes, a first node matching a target job included in the target task according to a job attribute of the target job, comprising:
and for each node in the nodes, if the number of idle operation units in the node is greater than or equal to the target number, determining the node as a first node.
Clause A3, the method of clause a1, the method further comprising:
acquiring the maximum idle time of the idle operation unit in each node;
and if the task which is to be executed and comprises a plurality of jobs exists in the detachable task list and the idle time length which is greater than or equal to the preset time length threshold exists in the idle time length of each idle operation unit, determining that the preset processing condition is met.
Clause a4, the method according to clause a1, wherein when the preset processing condition is satisfied, before determining a first node matching a target job among nodes according to a job attribute of the target job included in the target task, the method further includes:
acquiring a target task to be executed, and determining all dimension information of the target task and the target number of arithmetic units required for executing the target task;
if the ratio of the product of the dimension information to the target number is greater than 1, adding the target task into a detachable task list;
and modifying the affinity mask of the target task according to a preset affinity mask modification rule.
Clause a5, the method of clause a1, wherein before the target job included in the target task is executed by the first node and the second node where the arithmetic unit executing the target task is located, the method further comprises:
and in the usage mask of the target task, setting the bit corresponding to the first node to 1.
Clause a6, the method of clause a1, wherein before the target job included in the target task is executed by the first node and the second node where the arithmetic unit executing the target task is located, the method further comprises:
and if the bits corresponding to the first node and the second node in the affinity mask and the usage mask of the target task are all 1, executing the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located.
Clause a7, the method of clause a6, the affinity mask of the target job being the same as the affinity mask of the target task and the usage mask of the target job being the same as the usage mask of the target task.
Clause A8, an apparatus for job processing, the apparatus comprising:
the system comprises a first determining module, a first judging module and a second determining module, wherein the first determining module is used for determining a first node matched with a target job in each node according to the job attribute of the target job contained in a target task when a preset processing condition is met, and the job attribute comprises the target number of arithmetic units required by executing the target job;
and the execution module is used for executing the target job contained in the target task through the first node and a second node where the arithmetic unit for executing the target task is located.
Clause a9, a computer device comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, when executing the computer program, implementing the steps of the method of any of clauses a 1-a 7.
Clause a10, a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of clauses a 1-a 7.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is intended to be exemplary only and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Meanwhile, a person skilled in the art may, according to the idea of the present disclosure, make changes to the specific embodiments and the application scope. In view of the above, this description should not be construed as limiting the present disclosure.

Claims (10)

1. A method of job processing, the method comprising:
when a preset processing condition is met, determining a first node matched with a target job in each node according to the job attribute of the target job contained in a target task, wherein the job attribute comprises the target number of arithmetic units required by executing the target job;
and executing the target job contained in the target task through the first node and a second node where the arithmetic unit executing the target task is located.
2. The method according to claim 1, wherein the determining, in each node, a first node matching the target job according to the job attribute of the target job included in the target task comprises:
and for each node in the nodes, if the number of idle operation units in the node is greater than or equal to the target number, determining the node as a first node.
3. The method of claim 1, further comprising:
acquiring the idle time of an idle operation unit in each node;
and if the task which is to be executed and comprises a plurality of jobs exists in the detachable task list and the idle time length which is greater than or equal to the preset time length threshold exists in the idle time length of each idle operation unit, determining that the preset processing condition is met.
4. The method according to claim 1, wherein when a preset processing condition is satisfied, before determining a first node matching a target job in each node according to a job attribute of the target job included in a target task, the method further comprises:
acquiring a target task to be executed, and determining all dimension information of the target task and the target number of arithmetic units required for executing the target task;
if the ratio of the product of the dimension information to the target number is greater than 1, adding the target task into a detachable task list;
and modifying the affinity mask of the target task according to a preset affinity mask modification rule.
5. The method according to claim 1, wherein before the target job included in the target task is executed by the first node and a second node where an arithmetic unit for executing the target task is located, the method further comprises:
and in the usage mask of the target task, setting the bit corresponding to the first node to 1.
6. The method according to claim 1, wherein before the target job included in the target task is executed by the first node and a second node where an arithmetic unit for executing the target task is located, the method further comprises:
and if the bits corresponding to the first node and the second node in the affinity mask and the usage mask of the target task are all 1, executing the target job contained in the target task through the first node and the second node where the arithmetic unit executing the target task is located.
7. The method of claim 6, wherein the affinity mask of the target job is the same as the affinity mask of the target task and the usage mask of the target job is the same as the usage mask of the target task.
8. An apparatus for job processing, the apparatus comprising:
the system comprises a first determining module, a first judging module and a second determining module, wherein the first determining module is used for determining a first node matched with a target job in each node according to the job attribute of the target job contained in a target task when a preset processing condition is met, and the job attribute comprises the target number of arithmetic units required by executing the target job;
and the execution module is used for executing the target job contained in the target task through the first node and a second node where the arithmetic unit for executing the target task is located.
9. A computer device comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, wherein the processor, when executing the computer program, performs the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010012302.6A 2020-01-07 2020-01-07 Job processing method and device, computer equipment and readable storage medium Pending CN113157403A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010012302.6A CN113157403A (en) 2020-01-07 2020-01-07 Job processing method and device, computer equipment and readable storage medium
PCT/CN2021/070663 WO2021139726A1 (en) 2020-01-07 2021-01-07 Task migration method and apparatus, and computer device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010012302.6A CN113157403A (en) 2020-01-07 2020-01-07 Job processing method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113157403A true CN113157403A (en) 2021-07-23

Family

ID=76881293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010012302.6A Pending CN113157403A (en) 2020-01-07 2020-01-07 Job processing method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113157403A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157427A (en) * 2020-01-07 2021-07-23 中科寒武纪科技股份有限公司 Task migration method and device, computer equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060048160A1 (en) * 2004-09-02 2006-03-02 International Business Machines Corporation Method, apparatus, and computer program product for providing a self-tunable parameter used for dynamically yielding an idle processor
CN102073546A (en) * 2010-12-13 2011-05-25 北京航空航天大学 Task-dynamic dispatching method under distributed computation mode in cloud computing environment
CN108491263A (en) * 2018-03-02 2018-09-04 珠海市魅族科技有限公司 Data processing method, data processing equipment, terminal and readable storage medium storing program for executing
CN109144710A (en) * 2017-06-16 2019-01-04 中国移动通信有限公司研究院 Resource regulating method, device and computer readable storage medium
US20190138354A1 (en) * 2017-11-09 2019-05-09 National Applied Research Laboratories Method for scheduling jobs with idle resources
CN109951558A (en) * 2019-03-27 2019-06-28 北京并行科技股份有限公司 A kind of cloud dispatching method of supercomputer resource, cloud control centre and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060048160A1 (en) * 2004-09-02 2006-03-02 International Business Machines Corporation Method, apparatus, and computer program product for providing a self-tunable parameter used for dynamically yielding an idle processor
CN102073546A (en) * 2010-12-13 2011-05-25 北京航空航天大学 Task-dynamic dispatching method under distributed computation mode in cloud computing environment
CN109144710A (en) * 2017-06-16 2019-01-04 中国移动通信有限公司研究院 Resource regulating method, device and computer readable storage medium
US20190138354A1 (en) * 2017-11-09 2019-05-09 National Applied Research Laboratories Method for scheduling jobs with idle resources
CN108491263A (en) * 2018-03-02 2018-09-04 珠海市魅族科技有限公司 Data processing method, data processing equipment, terminal and readable storage medium storing program for executing
CN109951558A (en) * 2019-03-27 2019-06-28 北京并行科技股份有限公司 A kind of cloud dispatching method of supercomputer resource, cloud control centre and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU WENHUA; LI WENXING: "Advanced Finite-Difference Time-Domain Methods: Parallelization, Optimization, Acceleration, Standards and Engineering Applications", vol. 1, Harbin Engineering University Press, pages: 32 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157427A (en) * 2020-01-07 2021-07-23 中科寒武纪科技股份有限公司 Task migration method and device, computer equipment and readable storage medium
CN113157427B (en) * 2020-01-07 2024-03-15 中科寒武纪科技股份有限公司 Method, device, computer equipment and readable storage medium for task migration

Similar Documents

Publication Publication Date Title
US20170212781A1 (en) Parallel execution of blockchain transactions
EP3565219A1 (en) Service execution method and device
JP2008059438A (en) Storage system, data rearranging method thereof and data rearrangement program
JP2014505959A (en) Managing buffer overflow conditions
US7681196B2 (en) Providing optimal number of threads to applications performing multi-tasking using threads
US20120323972A1 (en) Concurrently accessed hash table
JP6060094B2 (en) sort
CN112395293B (en) Database and table dividing method, database and table dividing device, database and table dividing equipment and storage medium
CN110297810B (en) Stream data processing method and device and electronic equipment
US20200201763A1 (en) Memory hierarchy-aware processing
CN112988066A (en) Data processing method and device
CN113157403A (en) Job processing method and device, computer equipment and readable storage medium
WO2017020762A1 (en) Apparatus, method, and computer program for utilizing secondary threads to assist primary threads in performing application tasks
JP2002140208A (en) Performance simulator, performance simulation method, and recording medium with performance simulation program recorded thereon
CN113157427B (en) Method, device, computer equipment and readable storage medium for task migration
US20080077868A1 (en) System and Method for Visually Representing Resource Usage in a Multi-Node Data Processing System
US9298505B2 (en) Time and space-deterministic task scheduling apparatus and method using multi-dimensional scheme
CN112800057B (en) Fingerprint table management method and device
JP2000148515A (en) Memory scheduling method and storage medium storing memory scheduling program
CN113032137A (en) Task allocation method and device, computer equipment and readable storage medium
CN112307272B (en) Method, device, computing equipment and storage medium for determining relation information between objects
WO2021139726A1 (en) Task migration method and apparatus, and computer device and readable storage medium
CN112988367A (en) Resource allocation method and device, computer equipment and readable storage medium
JP6333370B2 (en) Method for implementing dynamic array data structures in cache lines
CN107436918A (en) Database implementation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination