CN111026517B - Task decomposition device and task scheduler - Google Patents


Publication number
CN111026517B
CN111026517B (application CN201811179131.5A)
Authority
CN
China
Prior art keywords
task
state
information
scheduling
decomposition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811179131.5A
Other languages
Chinese (zh)
Other versions
CN111026517A (en)
Inventor
Name not published at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN201811179131.5A
Priority to PCT/CN2019/110273 (WO2020073938A1)
Publication of CN111026517A
Application granted
Publication of CN111026517B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present application relates to a task decomposition device and a task scheduler. The device decomposes a task to obtain its decomposition information, so that a processor can process the resulting jobs in parallel, thereby improving the data processing efficiency of the system.

Description

Task decomposition device and task scheduler
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a task decomposition device and a task scheduler.
Background
Deep neural networks currently underpin many artificial intelligence applications, including speech recognition, image processing, data analysis, advertisement recommendation systems, and autonomous driving, and have thus been deployed in many aspects of daily life.
However, the large computational workload of deep neural networks has limited their further development and wider application. To improve task processing efficiency, deep neural network tasks are generally processed in parallel, so obtaining tasks that can be executed in parallel has become a technical problem to be solved.
Disclosure of Invention
In view of the above, it is desirable to provide a task decomposition device and a task scheduler capable of acquiring tasks that can be performed in parallel.
A task decomposition device, comprising:
the comparator is used for acquiring dependency information in the configuration information of the task, judging whether the dependency of the task is met according to the dependency information, and acquiring a task identifier of the task after the dependency of the task is met; and
the data divider is connected with the comparator and is used for decomposing the task into a plurality of jobs after the comparator obtains the task identification of the task, so as to obtain the decomposition information of the task.
In one embodiment, the data divider is specifically configured to obtain the task decomposition number and the job size from the configuration information of the task, and to decompose the task into a plurality of jobs according to the task decomposition number and the job size, so as to obtain the decomposition information of the task.
In one embodiment, the task decomposition number is 2^n, where n is a positive integer, and the job size is an integer multiple of the processor word length.
In one embodiment, the device further includes a memory, where the memory is connected to the comparator, and is configured to obtain configuration information of a task, and store the configuration information of the task.
In one embodiment, the comparator comprises:
the analysis circuit is used for analyzing the configuration information of the task to obtain an analysis result, and the analysis result comprises the dependency relationship information of the task; and
and the comparison and judgment circuit is used for judging whether the task has a pre-task according to the dependency relationship information, and for judging that the dependency relationship of the task is satisfied if the task has no pre-task.
In one embodiment, the comparison and judgment circuit is also connected with a state monitoring device;
the comparison judging circuit is further used for sending a request for inquiring whether the front-end task of the task is executed or not to the state monitoring device if the front-end task exists in the task, and determining that the dependency relationship of the task is met if the comparison judging circuit receives the signal, transmitted by the state monitoring device, that the front-end task of the task is executed.
In one embodiment, the comparison and judgment circuit is further configured to send a task registration request to the state monitoring device to obtain a task identifier of a task when the dependency relationship of the task is satisfied.
In one embodiment, the device further includes a state controller, where the state controller is connected to the comparator, and is configured to update a task state of the task to a state to be scheduled after acquiring a task identifier of the task, and send a task scheduling request of the task in the state to be scheduled to a task scheduling device.
In one embodiment, the state controllers include a first state controller and a second state controller, the first state controller is connected with the second state controller, and the first state controller and the second state controller are respectively connected with the task scheduling device;
the first state controller is used for updating the state of the task to a state to be scheduled after acquiring the task identifier of the task;
the second state controller is used for sending the decomposition information of the task to the task scheduling device and correspondingly updating the state of the task into a scheduling state.
In one embodiment, the first state controller is further configured to receive scheduling feedback information returned by the task scheduling device, and update a state of a corresponding task to a state to be scheduled or a state to end scheduling according to the scheduling feedback information.
A task scheduler, comprising: task scheduling means and task decomposing means according to any one of claims 8-10, said task decomposing means being connected to the task scheduling means;
the task scheduling device is used for receiving task scheduling requests of the tasks sent by the task decomposing device and correspondingly acquiring decomposing information and all task information of the tasks according to the task scheduling requests of the tasks;
the task decomposition device is used for sending the decomposition information of the task to the task scheduling device.
The task decomposition device and the task scheduler judge through the comparator whether the dependency relationship of a task is satisfied, acquire the task identifier once it is, and then decompose the task into a plurality of jobs through the data divider to obtain the task's decomposition information. The jobs can therefore be scheduled in parallel according to the decomposition information, and the processor can process them in parallel, so that the task is processed quickly and the processing efficiency of the system is improved.
Drawings
FIG. 1 is an application environment diagram of a task decomposition device in one embodiment;
FIG. 2 is an application environment diagram of a task decomposition device in one embodiment;
FIG. 3 is an application environment diagram of a task decomposition device in one embodiment;
FIG. 4 is an application environment diagram of a task decomposition device in one embodiment;
FIG. 5 is a schematic diagram of a computing device according to one embodiment;
FIG. 6 is a block diagram of a computing device provided by another embodiment;
FIG. 7 is a block diagram of a main processing circuit provided by one embodiment;
FIG. 8 is a block diagram of a computing device provided by one embodiment;
FIG. 9 is a block diagram of another computing device provided by one embodiment;
FIG. 10 is a schematic diagram of a tree module according to one embodiment;
FIG. 11 is a block diagram of a computing device provided by one embodiment;
FIG. 12 is a block diagram of a computing device provided by one embodiment;
FIG. 13 is a block diagram of a computing device provided by one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
As shown in fig. 1 to 3, a task decomposition device 100 according to an embodiment of the present application includes: a comparator 110 and a data divider 120, the data divider 120 being connected to the comparator 110. The comparator 110 is configured to obtain dependency information in configuration information of a task, determine whether a dependency of the task is satisfied according to the dependency information, and obtain a task identifier of the task after the dependency of the task is satisfied, where the task identifier may be used to distinguish different tasks. The data divider 120 is configured to obtain a task identifier of the task, and then decompose the task into a plurality of jobs, so as to obtain decomposition information of the task.
The task decomposition device acquires the task identifier after the dependency relationship of the task is satisfied, and then decomposes the task into a plurality of jobs to obtain the decomposition information of the task. This allows the processor to process the jobs resulting from the decomposition in parallel, so that the task can be processed quickly and the processing efficiency of the system further improved.
Optionally, the task decomposition device 100 may be connected to the global memory through DMA (Direct Memory Access). Specifically, the comparator 110 of the task decomposition device 100 is connected to the global memory, from which it can obtain the configuration information of the task. The global memory may be DRAM (Dynamic Random Access Memory) or SRAM (Static Random-Access Memory), or the like. Optionally, the task decomposition device 100 may further include a memory, which can temporarily store the configuration information of the task acquired from the global memory and the result of the comparator's parsing of that configuration information. The comparator 110 and the data divider 120 can then read the configuration information or its parsing result directly from this memory, which reduces inter-chip data transmission and improves data processing efficiency.
The configuration information of a task describes the task and may include its identity, category, dependency relationship information, and the like. The identity of the task comprises information identifying it, such as a task name or a task serial number. The categories of tasks include block (blocking task), cluster (clustering task), and unit (joint task). The dependency relationship information describes the task's dependencies; a dependency relationship exists between tasks when the execution of one task depends on the execution result of another. The dependency information of a task may include its pre-task information and post-task information. A task on which a given task depends is called a pre-task; conversely, a task that depends on it is called a post-task.
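As a purely illustrative sketch (the patent describes hardware circuits, not software), the configuration-information fields listed above could be modeled as follows; all field names and default values are assumptions, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class TaskConfig:
    """Hypothetical model of a task's configuration information."""
    task_id: str                                      # identity: name or serial number
    category: str                                     # "block", "cluster", or "unit"
    pre_tasks: list = field(default_factory=list)     # tasks this task depends on
    post_tasks: list = field(default_factory=list)    # tasks that depend on this task
    decomposition_number: int = 1                     # number of jobs after decomposition
    job_size: int = 0                                 # data capacity of each job
```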
Further, the configuration information may also include a task decomposition number and a job size, where the task decomposition number is the number of jobs formed by decomposing the task and the job size is the data capacity of each job. The data divider 120 can obtain the configuration information of the task and decompose the corresponding task into a plurality of jobs according to the task decomposition number and the job size in that configuration information. Optionally, the task decomposition number is 2^n, where n is a positive integer. Still further, each job can be allocated to a corresponding processor for processing, so the size of each job can be an integer multiple of that processor's word length. The processor word length reflects the amount of data a processor can process in a single operation.
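The decomposition arithmetic described above can be sketched as follows. This is only an illustration of the stated constraints (2^n jobs, job size a multiple of the word length); the function name and the round-up policy are assumptions, not the patent's implementation:

```python
def decompose(task_size, n, word_length):
    """Split a task of `task_size` data units into 2**n jobs, with each
    job's size rounded up to an integer multiple of `word_length`."""
    num_jobs = 2 ** n                                  # decomposition number is 2^n
    raw = -(-task_size // num_jobs)                    # ceiling division: per-job share
    job_size = -(-raw // word_length) * word_length    # round up to word-length multiple
    return num_jobs, job_size

print(decompose(1000, 3, 8))  # 8 jobs, each sized 128 (a multiple of the word length)
```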
In one embodiment, as shown in FIG. 3, the task decomposition device 100 may be coupled to a state monitoring device 500. The comparator 110 may parse the configuration information, obtain the dependency relationship information from the parsing result, and judge according to that information whether the task's dependency relationship is satisfied, so as to decide whether to send a task registration request to the state monitoring device 500 and obtain the task identifier. Optionally, the state monitoring device 500 may assign a task identifier to the task according to the received registration request, completing the task's registration, and then transmit the task identifier to the task decomposition device 100. The state controller 130 then updates the task that has obtained its identifier to the to-be-scheduled state.
Further, when the comparator 110 judges whether a task's dependency relationship is satisfied according to its dependency relationship information, it first judges whether the task has a pre-task. If it does, the comparator sends a query request to the state monitoring device 500, which checks whether the pre-task has finished executing and, once it has, feeds back a completion message to the comparator 110; the comparator then determines that the dependency relationship is satisfied. If the task has no pre-task, the comparator 110 can directly determine that its dependency relationship is satisfied. Once the dependency relationship is satisfied, the comparator 110 sends a task registration request to the state monitoring device 500 to acquire the task identifier.
Optionally, the state monitoring device 500 may check whether the pre-task has finished executing through a preset check bit. Specifically, when the configuration information of each task is preset, a corresponding dependency check bit is also set; the value of a task's check bit indicates whether that task has finished executing. More specifically, to query whether a task's pre-task has finished, the state monitoring device 500 first locates the dependency check bit corresponding to the pre-task and then judges completion from the value of that bit.
Further, when the state monitoring device 500 finds on query that the pre-task has not yet finished, it may monitor whether the corresponding dependency check bit is updated, and determine that the pre-task has finished once that update is observed.
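A minimal sketch of the check-bit scheme, under the assumed semantics that 0 means "not finished" and 1 means "finished executing" (the dict and function names stand in for the hardware check bits and are illustrative only):

```python
check_bits = {}  # stands in for the preset per-task dependency check bits

def register_task(task_id):
    check_bits[task_id] = 0        # preset: task has not finished executing

def mark_finished(task_id):
    check_bits[task_id] = 1        # the update the monitor watches for

def pre_task_done(pre_task_id):
    # the monitor locates the pre-task's check bit and judges from its value
    return check_bits.get(pre_task_id, 0) == 1
```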
Optionally, referring to fig. 4, the comparator 110 of the task decomposition device 100 may include a parsing circuit 111 and a comparison and judgment circuit 112. The parsing circuit 111 is used for parsing the configuration information of the task to obtain a parsing result and acquiring the dependency relationship information of the task from that result. The comparison and judgment circuit 112 is configured to judge whether the task has a pre-task according to the dependency relationship information and, if it has none, to judge that the dependency relationship of the task is satisfied.
Further, the comparison and judgment circuit 112 is also configured to, if the task has a pre-task, send to the state monitoring device 500 a request querying whether the pre-task has finished executing, and to determine that the dependency relationship of the task is satisfied upon receiving a signal from the state monitoring device 500 that the pre-task has finished.
Further, the comparison and judgment circuit 112 is further configured to send a task registration request to the state monitoring device 500 to obtain a task identifier of a task when the dependency relationship of the task is satisfied.
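The judgment flow of the comparison and judgment circuit described above can be sketched as a single predicate. The `pre_tasks` list and the `is_finished` callback are assumptions standing in for the parsed dependency information and the state monitoring device's reply:

```python
def dependency_satisfied(pre_tasks, is_finished):
    """Illustrative sketch: a task with no pre-task is satisfied directly;
    otherwise every pre-task must be reported finished by the monitor."""
    if not pre_tasks:
        return True
    return all(is_finished(t) for t in pre_tasks)
```

A task registration request would then be issued only once this predicate returns True.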
As shown in fig. 2, the task decomposition device 100 may further include a state controller 130, which may control the scheduling of tasks by controlling the state of the tasks. Further, the state controller 130 may be connected to a task scheduler 200. The comparator of the task decomposition device 100 can obtain the configuration information of the task from the global memory, control the state of the task through the state controller 130 thereof, and send a task scheduling request of the task to the task scheduling device 200 when the state of the task is in a state to be scheduled, so as to perform task scheduling.
Further, the task scheduling device 200 may be connected to a first processor 300 and a second processor 400, task scheduling being the process of determining which task the first processor 300 or the second processor 400 processes. Optionally, the first processor 300 may be a general-purpose processor such as a CPU, and the second processor 400 may be a coprocessor of the first processor 300. Specifically, the second processor 400 may include a second processor body 410 and a control device 420 for controlling the operation of the second processor body; the second processor body 410 may be an IPU (Intelligence Processing Unit), an NPU (Neural-network Processing Unit), or the like. Further, there may be a plurality of second processor bodies, each connected to the control device of the second processor body. Optionally, a second processor body may also include a plurality of processor cores. The task scheduler 200 may be connected to the control device of the second processor body, which can transmit the processor state information of the second processor body to the task scheduler 200.
The state controller 130 may update the state of the task to the to-be-scheduled state after acquiring its task identifier, and send a task scheduling request for the task in that state to the task scheduling device 200. After receiving the request, the task scheduling device 200 may acquire the configuration information of the task to be scheduled (i.e., the full task information) and acquire the task's decomposition information from the task decomposition device. The state controller 130 transmits the decomposition information of the task to the task scheduling device and correspondingly updates the state of the task to the scheduling state.
Further, the state controller 130 may also receive the scheduling feedback information returned by the task scheduling device 200 and, according to it, update the state of the corresponding task to the to-be-scheduled state or the scheduling-end state. Specifically, when the feedback information contains task scheduling failure information, the state controller 130 updates the state of the corresponding task from the scheduling state back to the to-be-scheduled state and resends the task scheduling request to the task scheduling device 200; when the feedback information contains task scheduling success information, the state of the corresponding task is updated from the scheduling state to the scheduling-end state.
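The feedback-driven state transitions above amount to a small state machine. The sketch below is illustrative; the state names are paraphrases of the patent's states, not its terminology:

```python
TO_BE_SCHEDULED = "to_be_scheduled"
SCHEDULING = "scheduling"
SCHEDULING_ENDED = "scheduling_ended"

def apply_feedback(state, scheduled_ok):
    """From the scheduling state, success moves the task to scheduling-ended;
    failure returns it to to-be-scheduled (which triggers a resent request)."""
    if state != SCHEDULING:
        raise ValueError("scheduling feedback only applies to a task being scheduled")
    return SCHEDULING_ENDED if scheduled_ok else TO_BE_SCHEDULED
```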
In one embodiment, the state monitoring device 500 is further configured to receive job end information of a job, and determine whether an execution abnormality exists in a corresponding task according to the job end information of the job; and if the corresponding task has abnormal execution, generating a task destruction instruction. Alternatively, the job end information of the job includes result flag data, and the state monitoring apparatus 500 may determine whether the current task has an abnormal execution according to the result flag data included in the job end information of the job.
For example, if the current task is determined to have an execution abnormality, the control device of the second processor body may set the result flag data in the job end information of the current job to a non-zero value (e.g., abnormality flag data of 1), from which the state monitoring device 500 can determine that the current task has an execution abnormality. If the current task has no execution abnormality, the control device of the second processor body may set the result flag data to 0, from which the state monitoring device 500 can determine that no abnormality exists. Other tasks are handled similarly and are not described in detail here.
Further, the execution exception of a job may be a first exception condition or a second exception condition, and the task destruction instruction may correspondingly be a first task destruction instruction or a second task destruction instruction. Optionally, when the job is determined to be abnormal, the exception handling circuit may further determine whether the execution exception of the current task is a first or a second exception condition according to the abnormality flag data contained in the job end information. The first and second exception conditions may each be one or a combination of exceptions such as insufficient resources of the second processor or a failure of the second processor.
Optionally, the exception handling circuit is configured to obtain a first task destruction instruction when determining that the job has a first exception condition according to job end information of the job, and transmit the first task destruction instruction to the task destruction circuit, where the task destruction circuit destroys a task to which the job belongs according to the first task destruction instruction. Specifically, the task destruction circuit may be configured to terminate scheduling a job having an execution abnormality and all jobs after the job when receiving the first task destruction instruction, and obtain scheduling end information of a task to which the job belongs. Further, after the task destruction circuit completes the destruction operation of the task to which the job belongs, task scheduling end information of the task to which the job belongs may be transmitted to the state monitoring device.
The task scheduler further comprises a register file connected to the task decomposition means. If the exception handling circuit determines from the job end information that the job has a second exception condition, it can obtain a second task destruction instruction and transmit it to the task destruction circuit, notifying it to destroy the task to which the job belongs and all tasks after it. Optionally, upon receiving the second task destruction instruction from the exception handling circuit, the task destruction circuit may destroy all tasks in the task queue containing the task to which the job belongs. Specifically, the task assigning device first terminates, according to the second task destruction instruction, the scheduling of the task to which the job belongs and of the other tasks after it, and notifies the register connected to it to clear the task to which the job belongs. After that task is cleared from the register, its scheduling end information can be obtained.
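The two destruction scopes described above (first exception: only the task the failing job belongs to; second exception: that task plus every task after it in its queue) can be sketched as follows. The list-based queue model and function name are assumptions for illustration only:

```python
def tasks_to_destroy(task_queue, failed_task, second_exception):
    """Return which tasks the destruction circuit would destroy.
    First exception: just the task the abnormal job belongs to.
    Second exception: that task and all tasks after it in the queue."""
    i = task_queue.index(failed_task)
    return task_queue[i:] if second_exception else [failed_task]
```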
Meanwhile, after the task to which the job belongs is cleared from the register, the task assigning device may send task registration requests corresponding to other tasks after the task to which the job belongs to the state monitoring device, so as to obtain task identifiers corresponding to other tasks after the task to which the job belongs. The task registration circuit of the state monitoring device may assign a task identifier to other tasks after the task to which the job belongs, respectively. When the task destroying circuit receives the task identifier fed back by the task registration circuit of the state monitoring device, the task destroying circuit can obtain scheduling end information corresponding to other tasks after the task to which the job belongs according to the received task identifier so as to destroy all the tasks after the task to which the job belongs. Further, the task assigning device may further transmit scheduling end information of each task to be processed to the state monitoring device.
By setting the exception handling mechanism, the accuracy of the task execution result can be ensured. And when an abnormal condition exists, the state monitoring device can inform the task destroying circuit to destroy the corresponding task and/or all the tasks after the corresponding task, so that resource waste caused by the fact that the second processor continues to execute other tasks when the abnormal condition exists is avoided.
Optionally, the state control circuit is further configured to, upon receiving a task destruction instruction, obtain a first interrupt signal, transmit it to the first processor, and then execute the destruction operation. Specifically, when the task destruction circuit receives a task destruction instruction, it first terminates the scheduling of the task to which the job belongs, avoiding the consumption of unnecessary scheduling resources under an abnormal condition. Meanwhile, after receiving the task destruction instruction, the task destruction circuit can obtain a first interrupt signal and transmit it to the first processor. Further, after receiving the first interrupt signal, the first processor may acquire the status information of each second processor body and determine from it which second processor body has the exception.
The state control circuit is also used for obtaining a second interrupt signal after the destruction operation is completed and transmitting the second interrupt signal to the first processor. Specifically, the state monitoring device obtains exception-handling end information after receiving the scheduling end information of the current task, or of all tasks in the task queue to which the current task belongs, and transmits this information to the task dispatching device; the task destruction circuit is further used for obtaining a second interrupt signal according to the exception-handling end information and transmitting the second interrupt signal to the first processor.
Through the exception handling mechanism, accuracy of task execution results is guaranteed. And when the abnormal condition exists, the state monitoring device can destroy the corresponding task or all the tasks after the corresponding task through the task decomposition device, so that the resource waste caused by the fact that the second processor continues to execute other tasks when the abnormal condition exists is avoided.
Alternatively, as shown in fig. 4, the state controller 130 of the task decomposition device may include a first state controller 131 and a second state controller 132, wherein the first state controller 131 is connected with the second state controller 132. The first state controller 131 and the second state controller 132 are each connected to the task scheduler 200.
The first state controller 131 is configured to update a state of the task to a state to be scheduled after acquiring a task identifier of the task, and send a task scheduling request to the task scheduling device 200. Specifically, the first state controller 131 may update the state of the task to a state to be scheduled after acquiring the task identifier of the task, and send a task scheduling request of the task in the state to be scheduled to the task scheduling device 200.
The second state controller 132 is configured to send the decomposition information of the task to the task scheduling device, and correspondingly update the state of the task to a scheduling state. Specifically, after receiving the task scheduling request, the task scheduling device 200 obtains the configuration information of the task in the to-be-scheduled state, where the configuration information is the complete task information, and acquires the task decomposition information from the task decomposition device. The second state controller 132 transmits the decomposition information of the task to the task scheduling device and correspondingly updates the state of the task to the scheduling state.
Further, the first state controller 131 may receive the scheduling feedback information returned by the task scheduling device 200, and update the state of the corresponding task to the to-be-scheduled state or the scheduling end state according to the scheduling feedback information. Specifically, when the feedback information includes task scheduling failure information, the first state controller 131 updates the state of the corresponding task from the scheduling state to the state to be scheduled, and resends the task scheduling request; when the feedback information contains task scheduling success information, the state of the corresponding task is updated from a scheduling state to a scheduling end state.
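As a minimal sketch (with hypothetical names; the patent does not prescribe a software implementation), the state transitions driven by the first and second state controllers can be modeled as:

```python
from enum import Enum, auto

class TaskState(Enum):
    TO_BE_SCHEDULED = auto()
    SCHEDULING = auto()
    SCHEDULING_END = auto()

class TaskStateMachine:
    """Tracks one task's scheduling state, mirroring the transitions above."""
    def __init__(self):
        self.state = None

    def on_task_identifier_acquired(self):
        # First state controller: task registered, ready to be scheduled.
        self.state = TaskState.TO_BE_SCHEDULED

    def on_decomposition_info_sent(self):
        # Second state controller: decomposition info handed to the scheduler.
        self.state = TaskState.SCHEDULING

    def on_scheduling_feedback(self, success: bool):
        # First state controller: on failure, back to "to be scheduled"
        # (and the scheduling request is resent); on success, scheduling ends.
        self.state = TaskState.SCHEDULING_END if success else TaskState.TO_BE_SCHEDULED

sm = TaskStateMachine()
sm.on_task_identifier_acquired()
sm.on_decomposition_info_sent()
sm.on_scheduling_feedback(success=False)   # failure path: state returns to TO_BE_SCHEDULED
```

The retry-on-failure transition is what keeps scheduling orderly: a failed task re-enters the to-be-scheduled pool rather than being lost.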
According to the task decomposition device, the states of the tasks are controlled through the first state controller and the second state controller, so that the task scheduling process is controlled, and the task scheduling can be more orderly and efficient.
Further, the data divider 120 of the task decomposition device 100 may decompose the task into a plurality of jobs after the task identifier of the task is acquired, obtaining the decomposition information of the task. Specifically, the data divider 120 obtains the task type, the task decomposition number and the job size from the analysis result of the task's configuration information, and decomposes the task into a plurality of jobs according to the task type, task decomposition number and job size, thereby obtaining the decomposition information of the task. The task categories may include: block (blocking task), cluster (clustering task) and union (joint task).
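A toy sketch of such a decomposition (hypothetical function and parameter names; the patent only fixes the inputs: decomposition number, job size, and the word-length constraint stated in claim 1):

```python
def decompose_task(total_size: int, split_count: int, word_length: int):
    """Split a task of `total_size` work items into `split_count` jobs whose
    sizes are integer multiples of the processor word length (each job is
    padded up to the next word boundary when needed)."""
    base = total_size // split_count
    sizes = [base] * split_count
    sizes[-1] += total_size - base * split_count   # remainder goes to the last job
    # round each job size up to a multiple of the word length
    return [((s + word_length - 1) // word_length) * word_length for s in sizes]

jobs = decompose_task(total_size=100, split_count=4, word_length=8)  # [32, 32, 32, 32]
```

Aligning every job to the word length means each job can be dispatched to a processor core without sub-word fix-up.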
Further, the task scheduling device 200 may acquire the decomposition information of the task from the task decomposition device 100 and acquire the complete task information of the task from the global memory, and then transmit both to the target processor corresponding to the task. Optionally, when the target processor is the second processor 400, the decomposition information and the complete task information are sent to the control device 420 of the second processor; after receiving them, the control device 420 processes the complete task information according to the decomposition information to obtain a plurality of jobs, and then distributes the plurality of jobs to the corresponding second processor. Alternatively, the target processor may be one or more second processor bodies 410, and/or multiple processor cores of a single second processor body 410.
Optionally, the control device 420 of the second processor body may collect execution status information of each job on the second processor, that is, collect the end information of each job on the second processor body 410, and send the end information to the status monitoring device 500; the status monitoring device 500 may correspondingly update the status entries according to the received end information, so as to monitor the task execution status. Alternatively, the status monitoring device 500 may be connected to the global memory through DMA, so that it may write the job end information of each job into the global memory.
Optionally, the control device 320 of the second processor body may also collect processor state information of the second processor body 310 and send the processor state information to the task scheduling device 200.
In one embodiment, there is also provided a task scheduler including any of the task decomposition devices 100 and task scheduling devices 200 mentioned in the above embodiments, the task decomposition device 100 being connected to the task scheduling device 200. The task scheduling device 200 is used for receiving the task scheduling request sent by the task decomposition device, and correspondingly acquiring the decomposition information and the complete task information of the task according to the task scheduling request. The task decomposition device is used for sending the decomposition information of the task to the task scheduling device.
The working principle of the task scheduler according to the embodiment of the present application is illustrated below with reference to fig. 1 and 4:
when a task needs to be decomposed, whether the dependency relationship of the task is met is firstly judged. The task decomposition device 100 acquires dependency information in the configuration information of the task, determines whether the dependency of the task is satisfied according to the dependency information, and when the dependency of the task is satisfied, transmits a registration request of the task to the state monitoring device 500, and acquires a task identifier of the task. After the task identification is acquired, the task is decomposed into a plurality of jobs, and decomposition information of the task is obtained. After the task identifier is acquired, the state of the corresponding task is updated to be the state to be scheduled, and a task scheduling request of the task in the state to be scheduled is sent to the task scheduling device 200.
After receiving the task scheduling request of the task, the task scheduling device 200 may acquire the configuration information of the task in the to-be-scheduled state, where the configuration information is the complete task information, and acquire the task decomposition information from the task decomposition device. The task decomposition device 100 transmits the decomposition information of the task to the task scheduling device 200 and correspondingly updates the state of the task to the scheduling state.
The task scheduling device 200 obtains the decomposition information of the task from the task decomposition device 100, obtains the complete task information of the task from the global memory, and transmits both to the corresponding target processor. Alternatively, when the target processor is the second processor body 410, the decomposition information and the complete task information are first sent to the control device 420 of the second processor. Alternatively, the target processor may be one or more second processor bodies 410, and/or multiple processor cores of a single second processor body 410.
After receiving the decomposition information of the task and the complete task information, the control device 420 of the second processor body processes the complete task information of the task according to the decomposition information, that is, decomposes the task into a plurality of jobs, and finally distributes the plurality of jobs to the second processor body 410. The complete task information of the task includes the input data and the associated computation instructions, and the like.
The task scheduler provided by the embodiment can schedule the tasks to be executed by the processor efficiently and orderly, so that the system processing efficiency is improved.
In one embodiment, the second processor body 310 includes a computing device as shown in FIG. 5, comprising: a controller unit 11 and an arithmetic unit 12, wherein the controller unit 11 is connected to the arithmetic unit 12, and the arithmetic unit 12 includes: a master processing circuit and a plurality of slave processing circuits.
Specifically, the controller unit 11 may be used to obtain a job, which may include data, a machine learning model, and calculation instructions. In an alternative, the input data and the calculation instructions may be obtained through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
The above-described calculation instructions include, but are not limited to: forward operation instructions, reverse training instructions, or other neural network calculation instructions such as convolution calculation instructions; the present embodiments do not limit the specific form of the calculation instructions.
The controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
a master processing circuit 101 for performing preamble processing on the input data and transmitting data and operation instructions with the plurality of slave processing circuits;
a plurality of slave processing circuits 102, configured to execute intermediate operations in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
According to the technical scheme, the operation unit is set to a one-master multi-slave structure. For the calculation instruction of a forward operation, the data can be split according to that calculation instruction, so that the part with the larger calculation amount is operated on in parallel by the plurality of slave processing circuits, which improves the operation speed, saves operation time and thereby reduces power consumption.
Optionally, the machine learning calculation may specifically include: the artificial neural network operation, the input data may specifically include: neuron data and weight data are input. The calculation result may specifically be: and outputting the neuron data as a result of the artificial neural network operation.
The operation in the neural network can be one-layer operation in the neural network, and in the multi-layer neural network, the implementation process is that in the forward operation, after the execution of the artificial neural network of the upper layer is completed, the operation instruction of the lower layer can take the output neuron calculated in the operation unit as the input neuron of the lower layer to perform operation (or perform certain operations on the output neuron and then take the operation as the input neuron of the lower layer), and meanwhile, the weight is replaced by the weight of the lower layer; in the backward operation, when the backward operation of the artificial neural network of the previous layer is completed, the next-layer operation instruction performs an operation with the input neuron gradient calculated by the operation unit as the output neuron gradient of the next layer (or performs some operations on the input neuron gradient and then uses the operation as the output neuron gradient of the next layer), and simultaneously replaces the weight with the weight of the next layer.
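As a toy illustration (not the patent's hardware implementation, and ReLU is chosen here only as an example activation), the layer-chaining of the forward operation — each layer's output neurons serving as the next layer's input neurons, with the weights replaced layer by layer — can be sketched as:

```python
import numpy as np

def forward(layers, x):
    """Multi-layer forward pass: each layer's output neurons become the next
    layer's input neurons, and the weights are replaced layer by layer."""
    for w, b in layers:
        x = np.maximum(w @ x + b, 0.0)   # activation s(h), here ReLU
    return x

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),   # layer k:   3 -> 4 neurons
          (rng.standard_normal((2, 4)), np.zeros(2))]   # layer k+1: 4 -> 2 neurons
y = forward(layers, np.ones(3))
```

The intermediate 4-neuron vector is an output neuron vector for the first layer and simultaneously the input neuron vector for the second, matching the adjacent-layer definition above.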
The machine learning computation may also include support vector machine operations, k-nearest neighbor (k-nn) operations, k-means (k-means) operations, principal component analysis operations, and the like. For convenience of description, a specific scheme of machine learning calculation is described below by taking an artificial neural network operation as an example.
For the artificial neural network operation, if it has multiple layers of operation, the input neurons and output neurons of the multiple layers do not refer to the neurons in the input layer and the output layer of the whole network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer are the output neurons. Taking a convolutional neural network as an example, let the network have L layers; for the k-th and (k+1)-th layers, k = 1, 2, ..., L-1, the k-th layer is called the input layer, its neurons being the input neurons, and the (k+1)-th layer is called the output layer, its neurons being the output neurons. That is, each layer except the topmost layer can serve as an input layer, and the next layer is the corresponding output layer.
Optionally, the computing device may further include: the storage unit 10 and the direct memory access unit 50, the storage unit 10 may include: one or any combination of a register and a cache, specifically, the cache is used for storing the calculation instruction; the register is used for storing the input data and scalar; the cache is a scratch pad cache. The direct memory access unit 50 is used for reading or storing data from the storage unit 10.
Optionally, the controller unit includes: an instruction storage unit 110', an instruction processing unit 111, and a store queue unit 113;
an instruction storage unit 110' is configured to store calculation instructions associated with the artificial neural network operation.
The instruction processing unit 111 is configured to parse the calculation instruction to obtain a plurality of operation instructions.
A store queue unit 113 for storing an instruction queue, the instruction queue comprising: a plurality of arithmetic instructions or calculation instructions to be executed in the order of the queue.
For example, in an alternative embodiment, the main processing circuit may also include a controller unit, which may include a main instruction processing unit, specifically for decoding instructions into micro-instructions. In a further alternative of course, the slave processing circuit may also comprise a further controller unit comprising a slave instruction processing unit, in particular for receiving and processing microinstructions. The micro instruction may be the next instruction of the instruction, and may be obtained by splitting or decoding the instruction, and may be further decoded into control signals of each component, each unit, or each processing circuit.
In one alternative, the structure of the calculation instructions may be as shown in the following table.
| Operation code | Register or immediate | Register/immediate | ... |
The ellipses in the table above represent that multiple registers or immediate numbers may be included.
In another alternative, the computing instructions may include: one or more operation domains and an operation code. The computing instructions may include neural network computing instructions. Taking a neural network operation instruction as an example, as shown in table 1, a register number 0, a register number 1, a register number 2, a register number 3, and a register number 4 may be operation domains. Wherein each of register number 0, register number 1, register number 2, register number 3, register number 4 may be a number of one or more registers.
(Table 1: the neural network operation instruction format, with an operation code and operation domains register number 0 through register number 4 — image not reproduced.)
The register may be an off-chip memory, or may be an on-chip memory in practical applications, and may be used to store data, where the data may specifically be n-dimensional data, where n is an integer greater than or equal to 1, for example, n=1 is 1-dimensional data, i.e., a vector, where n=2 is 2-dimensional data, i.e., a matrix, where n=3 or more is a multidimensional tensor.
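A toy encoding of such an instruction (illustrative layout and names only; the patent does not specify field widths or opcode values):

```python
from dataclasses import dataclass, field

@dataclass
class Instruction:
    """One calculation instruction: an operation code plus operation domains
    (register numbers or immediates)."""
    opcode: str
    operands: list = field(default_factory=list)   # register numbers or immediates

def encode(instr: Instruction, opcode_map: dict, operand_bits: int = 8) -> int:
    """Pack the opcode and operands into a single integer word, one 8-bit
    field per operand, most significant field first."""
    word = opcode_map[instr.opcode]
    mask = (1 << operand_bits) - 1
    for op in instr.operands:
        word = (word << operand_bits) | (op & mask)
    return word

opcode_map = {"COMPUTE": 0x01, "IO": 0x02}   # hypothetical opcode assignments
w = encode(Instruction("COMPUTE", [0, 1, 2, 3, 4]), opcode_map)
```

The five operands correspond to register numbers 0 through 4 serving as the operation domains of a neural network operation instruction.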
Optionally, the controller unit may further include:
The dependency relationship processing unit 108 is configured to determine, when a plurality of operation instructions are provided, whether a first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction, if the first operation instruction has an association relationship with the zeroth operation instruction, cache the first operation instruction in the instruction storage unit, and extract the first operation instruction from the instruction storage unit and transmit the first operation instruction to the operation unit after the execution of the zeroth operation instruction is completed;
the determining whether the association relationship exists between the first operation instruction and the zeroth operation instruction before the first operation instruction includes:
extracting a first storage address interval of the data (for example, a matrix) required by the first operation instruction, and extracting a zeroth storage address interval of the matrix required by the zeroth operation instruction according to the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping area, determining that the first operation instruction and the zeroth operation instruction have an association relationship, and if they have no overlapping area, determining that the first operation instruction and the zeroth operation instruction have no association relationship.
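The overlap test described above is a standard half-open interval intersection check; a minimal sketch (hypothetical function name):

```python
def has_dependency(first_interval, zeroth_interval):
    """Two operation instructions have an association relationship when the
    storage address intervals [start, end) of their required data overlap."""
    a_start, a_end = first_interval
    b_start, b_end = zeroth_interval
    return a_start < b_end and b_start < a_end

# Overlapping intervals: the first instruction must wait for the zeroth.
dep = has_dependency((0x1000, 0x1400), (0x1200, 0x1800))
```

When `has_dependency` is true, the first instruction is cached in the instruction storage unit until the zeroth finishes, as the dependency relationship processing unit requires.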
In an alternative embodiment, the arithmetic unit 12 may comprise one master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 6. In one embodiment, as shown in fig. 6, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the other adjacent slave processing circuits, and the master processing circuit is connected with k of the slave processing circuits, where the k slave processing circuits shown in fig. 6 are: the n slave processing circuits in the 1st row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the 1st column. That is, the k slave processing circuits are the slave processing circuits directly connected to the master processing circuit among the plurality of slave processing circuits.
K slave processing circuits for forwarding data and instructions between the master processing circuit and the plurality of slave processing circuits.
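To make the topology concrete, a small sketch (hypothetical helper; corner circuits belong to both a border row and the first column, so they are counted once) enumerating which slave circuits in an m x n array are directly connected to the master:

```python
def directly_connected(m: int, n: int):
    """Positions (row, col), 1-indexed, of the slave processing circuits
    directly connected to the master: row 1, row m, and column 1."""
    cells = set()
    for col in range(1, n + 1):
        cells.add((1, col))      # n circuits in the 1st row
        cells.add((m, col))      # n circuits in the m-th row
    for row in range(1, m + 1):
        cells.add((row, 1))      # m circuits in the 1st column
    return cells

k = len(directly_connected(4, 5))   # for a 4 x 5 array: 5 + 5 + 4 - 2 corners = 12
```

Only these k border circuits exchange data and instructions with the master directly; the interior circuits are reached by forwarding through them.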
Optionally, as shown in fig. 7, the main processing circuit may further include: a conversion processing circuit 110", an activation processing circuit 111, an addition processing circuit 112, or any combination thereof;
Conversion processing circuitry 110" for performing an exchange between a first data structure and a second data structure (e.g., conversion between continuous data and discrete data) on the data block or intermediate result received by the main processing circuit; or for performing an exchange between a first data type and a second data type (e.g., conversion between a fixed point type and a floating point type) on the data block or intermediate result received by the main processing circuit.
The activation processing circuit 111 is used for executing the activation operation of the data in the main processing circuit.
The addition processing circuit 112 is used for executing addition operation or accumulation operation.
The main processing circuit is used for determining that the input neurons are broadcast data and the weights are distribution data, distributing the distribution data into a plurality of data blocks, and sending at least one of the plurality of data blocks and at least one of a plurality of operation instructions to the slave processing circuits;
the plurality of slave processing circuits are used for executing operations on the received data blocks according to the operation instruction to obtain intermediate results, and transmitting the intermediate results to the master processing circuit;
the main processing circuit is used for processing the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction, and sending that result to the controller unit.
The slave processing circuit includes: a multiplication processing circuit.
The multiplication processing circuit is used for executing product operation on the received data blocks to obtain a product result.
A forwarding processing circuit (optional) for forwarding the received data block or the product result.
And the accumulation processing circuit is used for executing accumulation operation on the product result to obtain the intermediate result.
In another embodiment, the operation instruction is a matrix-by-matrix instruction, an accumulate instruction, an activate instruction, or the like calculation instruction.
The specific calculation method of the calculation device shown in fig. 5 is described below through the neural network operation instruction. For a neural network operation instruction, the formula actually to be executed may be s = s(Σ w·x_i + b): the weight w is multiplied by the input data x_i, the products are summed, the bias b is added, and the activation operation s(h) is performed to obtain the final output result s.
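This formula can be checked numerically with a minimal sketch (illustrative values; ReLU stands in for the unspecified activation s):

```python
def neural_op(weights, inputs, bias, activation=lambda h: max(h, 0.0)):
    """s = s(sum_i w_i * x_i + b): multiply weights by inputs, sum the
    products, add the bias, then apply the activation s(h)."""
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(h)

# 0.5*2.0 + (-1.0)*1.0 + 2.0*0.5 = 1.0; plus bias 1.0 -> 2.0; ReLU -> 2.0
s = neural_op(weights=[0.5, -1.0, 2.0], inputs=[2.0, 1.0, 0.5], bias=1.0)
```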
In an alternative embodiment, as shown in fig. 8, the operation unit includes: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit, and the plurality of branch ports of the tree module are respectively connected with one of a plurality of auxiliary processing circuits;
The above tree module has transmitting and receiving functions; for example, fig. 8 shows the transmitting function and fig. 9 shows the receiving function.
The tree module is used for forwarding the data blocks, the weights and the operation instructions between the master processing circuit and the plurality of slave processing circuits.
Alternatively, the tree module is an optional component of the computing device; it may include at least one layer of nodes, each node being a line structure with a forwarding function that may itself have no computing function. If the tree module has zero layers of nodes, the tree module is not required.
Alternatively, the tree module may be an n-ary tree structure, for example the binary tree structure shown in fig. 10, or a ternary tree structure, where n may be an integer greater than or equal to 2. The specific embodiment of the present application does not limit the specific value of n; the number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example the nodes of the last layer shown in fig. 10.
Alternatively, the above-mentioned operation unit may carry a separate cache, as shown in fig. 11, and may include: a neuron buffering unit 63 which buffers the input neuron vector data and the output neuron value data of the slave processing circuit.
As shown in fig. 12, the operation unit may further include: the weight buffer unit 64 is used for buffering the weight data required by the slave processing circuit in the calculation process.
In an alternative embodiment, the arithmetic unit 12 may comprise a branch processing circuit 103 as shown in fig. 13; a specific connection structure thereof is shown in fig. 13, in which,
the master processing circuit 101 is connected to the branch processing circuit(s) 103, and the branch processing circuit 103 is connected to the one or more slave processing circuits 102;
branch processing circuitry 103 for executing data or instructions that are forwarded between the master processing circuitry 101 and the slave processing circuitry 102.
In an alternative embodiment, taking the fully connected operation in the neural network operation as an example, the process may be: y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias, and f is the activation function, which may specifically be a sigmoid, tanh, relu, or softmax function. Assuming here a binary tree structure with 8 slave processing circuits, the method implemented may be:
the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;
The main processing circuit determines the input neuron matrix x as broadcast data, determines the weight matrix w as distribution data, splits the weight matrix w into 8 sub-matrices, distributes the 8 sub-matrices to 8 slave processing circuits through a tree module, broadcasts the input neuron matrix x to the 8 slave processing circuits,
the slave processing circuit performs multiplication operation and accumulation operation of 8 submatrices and an input neuron matrix x in parallel to obtain 8 intermediate results, and the 8 intermediate results are sent to the master processing circuit;
the main processing circuit is used for sequencing the 8 intermediate results to obtain an operation result of wx, executing the operation of the bias b on the operation result, executing the activating operation to obtain a final result y, sending the final result y to the controller unit, and outputting or storing the final result y into the storage unit by the controller unit.
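The split-compute-assemble flow above can be sketched numerically (a software analogy, not the hardware itself; tanh stands in for the activation f, and a row-wise split of w is assumed):

```python
import numpy as np

def fully_connected(w, x, b):
    """y = f(wx + b): split w into 8 row-wise submatrices, let each 'slave'
    compute its product, reassemble the 8 intermediate results in order,
    add the bias b, then apply the activation f."""
    partials = [sub @ x for sub in np.array_split(w, 8, axis=0)]  # 8 intermediate results
    wx = np.concatenate(partials)   # master circuit orders and assembles them
    return np.tanh(wx + b)          # bias operation, then activation f

rng = np.random.default_rng(1)
w = rng.standard_normal((16, 8))   # weight matrix: distribution data
x = rng.standard_normal(8)         # input neuron matrix: broadcast data
b = rng.standard_normal(16)
y = fully_connected(w, x, b)
```

Because the submatrices partition the rows of w, reassembling the partial products in order reproduces exactly wx, so the parallel scheme matches the single-circuit computation.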
The method for executing the neural network forward operation instruction by the computing device shown in fig. 5 may specifically be:
the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the operation code to the operation unit.
The controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b does not need to be extracted), the weight w and the offset b are transmitted to the main processing circuit of the operation unit, the controller unit extracts the input data Xi from the storage unit, and the input data Xi is transmitted to the main processing circuit.
The main processing circuit determines multiplication operation according to the at least one operation code, determines that input data Xi are broadcast data, determines weight data are distribution data, and splits the weight w into n data blocks;
an instruction processing unit of the controller unit determines a multiplication instruction, a bias instruction and an accumulation instruction according to the at least one operation code, sends the multiplication instruction, the bias instruction and the accumulation instruction to a main processing circuit, and the main processing circuit sends the multiplication instruction and input data Xi to a plurality of slave processing circuits in a broadcast manner, and distributes the n data blocks to the plurality of slave processing circuits (for example, n slave processing circuits are provided, and each slave processing circuit sends one data block); and the main processing circuit is used for executing accumulation operation on the intermediate results sent by the plurality of slave processing circuits according to the accumulation instruction to obtain an accumulation result, executing addition offset b on the accumulation result according to the offset instruction to obtain a final result, and sending the final result to the controller unit.
In addition, the order of addition and multiplication may be reversed.
According to the technical scheme, the multiplication operation and the bias operation of the neural network are realized through one instruction, namely the neural network operation instruction, the intermediate results calculated by the neural network are not required to be stored or extracted, and the storage and extraction operations of intermediate data are reduced, so that the method has the advantages of reducing corresponding operation steps and improving the calculation effect of the neural network.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A task decomposition device, comprising:
a comparator, configured to acquire dependency relationship information from the configuration information of a task, judge, according to the dependency relationship information, whether the dependency relationship of the task is satisfied, and acquire a task identifier of the task after the dependency relationship of the task is satisfied; and
a data divider, connected to the comparator and configured to decompose the task into a plurality of jobs after the comparator acquires the task identifier of the task, so as to obtain decomposition information of the task;
wherein the data divider is specifically configured to acquire a task decomposition number and a job size from the configuration information of the task, and to decompose the task into the plurality of jobs according to the task decomposition number and the job size to obtain the decomposition information of the task; wherein each of the jobs can be allocated to a corresponding processor for processing, and the size of each of the jobs is an integer multiple of the word length of the corresponding processor.
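The decomposition step of claim 1 can be sketched as follows. This is an illustrative assumption, not the patent's implementation: the function name, the equal-size split, and the round-up-with-padding policy are all hypothetical; the patent only requires that each job size be an integer multiple of the processor word length.

```python
def decompose_task(total_size: int, n: int, word_length: int) -> list[int]:
    """Split a task of `total_size` units into 2**n jobs whose sizes are
    integer multiples of `word_length` (claim 9 fixes the job count at 2**n)."""
    num_jobs = 2 ** n
    per_job = -(-total_size // num_jobs)                 # ceil(total / num_jobs)
    # Round each job up to the nearest multiple of the word length;
    # the final job may carry padding as a result.
    job_size = -(-per_job // word_length) * word_length
    return [job_size] * num_jobs
```

For example, a 100-unit task with n = 2 on an 8-unit word length yields four jobs of 32 units each, every one word-aligned and together covering the whole task.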
2. The apparatus of claim 1, further comprising a memory connected to the comparator and configured to acquire the configuration information of the task and to store the configuration information of the task.
3. The apparatus of claim 1, wherein the comparator comprises:
an analysis circuit, configured to analyze the configuration information of the task to obtain an analysis result, the analysis result comprising the dependency relationship information of the task; and
a comparison and judgment circuit, configured to judge, according to the dependency relationship information, whether the task has a pre-task, and to determine that the dependency relationship of the task is satisfied if the task has no pre-task.
4. The apparatus of claim 3, wherein the comparison and judgment circuit is further connected to a state monitoring apparatus;
the comparison and judgment circuit is further configured to send, if the task has a pre-task, a request to the state monitoring apparatus inquiring whether the pre-task of the task has finished executing, and to determine that the dependency relationship of the task is satisfied upon receiving, from the state monitoring apparatus, a signal indicating that the pre-task of the task has finished executing.
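The dependency check of claims 3 and 4 can be sketched in software form. This is a hedged illustration only: the class `StateMonitor`, the method names, and the `pre_tasks` configuration field are assumptions standing in for the state monitoring apparatus and the parsed dependency relationship information.

```python
class StateMonitor:
    """Stand-in for the state monitoring apparatus: records finished tasks."""
    def __init__(self):
        self._finished = set()

    def mark_finished(self, task_id: str) -> None:
        self._finished.add(task_id)

    def is_finished(self, task_id: str) -> bool:
        return task_id in self._finished

def dependency_satisfied(config: dict, monitor: StateMonitor) -> bool:
    """True if the task has no pre-task, or every pre-task has finished
    executing according to the state monitor (claims 3-4)."""
    pre_tasks = config.get("pre_tasks", [])
    if not pre_tasks:
        return True  # no pre-task: the dependency relationship is satisfied
    return all(monitor.is_finished(t) for t in pre_tasks)
```

A task with an unfinished pre-task is held back; once the monitor reports the pre-task as executed, the same query succeeds and the task identifier can be acquired.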
5. The apparatus of claim 4, wherein the comparison and judgment circuit is further configured to send a task registration request to the state monitoring apparatus to acquire the task identifier of the task when the dependency relationship of the task is satisfied.
6. The apparatus according to any one of claims 1-5, further comprising a state controller, wherein the state controller is connected to the comparator and configured to update the task state of the task to a to-be-scheduled state after the task identifier of the task is acquired, and to send a task scheduling request of the task in the to-be-scheduled state to a task scheduling device.
7. The apparatus of claim 6, wherein the state controller comprises a first state controller and a second state controller, the first state controller being connected to the second state controller, and the first state controller and the second state controller each being connected to the task scheduling device;
the first state controller is configured to update the state of the task to the to-be-scheduled state after the task identifier of the task is acquired;
the second state controller is configured to send the decomposition information of the task to the task scheduling device and to correspondingly update the state of the task to a scheduling state.
8. The apparatus of claim 7, wherein the first state controller is further configured to receive scheduling feedback information returned by the task scheduling device, and to update the state of the corresponding task to the to-be-scheduled state or a scheduling end state according to the scheduling feedback information.
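The task states of claims 6-8 form a small state machine, sketched below. The state and event names are assumptions mirroring the claim language, not terms defined by the patent.

```python
# Allowed transitions: to-be-scheduled -> scheduling (decomposition info sent),
# scheduling -> to-be-scheduled or scheduling-end (per scheduling feedback).
TRANSITIONS = {
    ("TO_BE_SCHEDULED", "send_decomposition"): "SCHEDULING",
    ("SCHEDULING", "feedback_reschedule"): "TO_BE_SCHEDULED",
    ("SCHEDULING", "feedback_done"): "SCHEDULING_END",
}

def next_state(state: str, event: str) -> str:
    """Return the task's next state, rejecting undefined transitions."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"invalid transition: {state} on {event}")
    return TRANSITIONS[key]
```

In this reading, the second state controller drives the `send_decomposition` transition, while the first state controller applies the `feedback_*` transitions when scheduling feedback information arrives.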
9. The apparatus of any one of claims 1-5, wherein the task decomposition number is 2^n, wherein n is a positive integer.
10. A task scheduler, comprising: a task scheduling device and the task decomposition device according to any one of claims 1 to 9, the task decomposition device being connected to the task scheduling device;
the task scheduling device is configured to receive a task scheduling request of the task sent by the task decomposition device and to correspondingly acquire, according to the task scheduling request of the task, the decomposition information and all task information of the task;
the task decomposition device is configured to send the decomposition information of the task to the task scheduling device.
CN201811179131.5A 2018-10-10 2018-10-10 Task decomposition device and task scheduler Active CN111026517B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811179131.5A CN111026517B (en) 2018-10-10 2018-10-10 Task decomposition device and task scheduler
PCT/CN2019/110273 WO2020073938A1 (en) 2018-10-10 2019-10-10 Task scheduler, task processing system, and task processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811179131.5A CN111026517B (en) 2018-10-10 2018-10-10 Task decomposition device and task scheduler

Publications (2)

Publication Number Publication Date
CN111026517A CN111026517A (en) 2020-04-17
CN111026517B true CN111026517B (en) 2023-04-28

Family

ID=70191777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811179131.5A Active CN111026517B (en) 2018-10-10 2018-10-10 Task decomposition device and task scheduler

Country Status (1)

Country Link
CN (1) CN111026517B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880101B (en) * 2022-07-01 2022-09-30 成都登临科技有限公司 AI treater, electronic part and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8640137B1 (en) * 2010-08-30 2014-01-28 Adobe Systems Incorporated Methods and apparatus for resource management in cluster computing


Similar Documents

Publication Publication Date Title
CN109711539B (en) Operation method, device and related product
JP2021144750A (en) Accelerator for deep neural networks
CN112740236A (en) Exploiting activation sparsity in deep neural networks
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN111047045B (en) Distribution system and method for machine learning operation
KR20200053886A (en) Neural processing unit, neural processing system, and application system
CN111026540B (en) Task processing method, task scheduler and task processing device
CN111026521B (en) Task scheduler, task processing system and task processing method
CN111026518B (en) Task scheduling method
CN111338695A (en) Data processing method based on pipeline technology and related product
CN111026517B (en) Task decomposition device and task scheduler
CN117032807A (en) AI acceleration processor architecture based on RISC-V instruction set
CN111026523A (en) Task scheduling control method, task scheduler and task processing device
CN110825380A (en) Kernel function generation method, target code generation method and combined processing device
CN111930681A (en) Computing device and related product
US11082327B2 (en) System and method for computational transport network-on-chip (NoC)
James-Roxby et al. A single program multiple data parallel processing platform for fpgas
CN111078286A (en) Data communication method, computing system and storage medium
CN109976887B (en) Scheduling method and related device
Homann et al. Evaluation of conditional tasks in an artificial DNA system
CN109976809B (en) Scheduling method and related device
CN111767133A (en) System and method for reconfigurable systolic array
CN111694666A (en) Task distribution management method, device, equipment and medium
CN111026515B (en) State monitoring device, task scheduler and state monitoring method
WO2023073824A1 (en) Deep learning inference system and inference serving method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant