CN111026540A - Task processing method, task scheduler and task processing device - Google Patents

Task processing method, task scheduler and task processing device

Info

Publication number
CN111026540A
CN111026540A
Authority
CN
China
Prior art keywords
task
information
data
decomposition
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811179227.1A
Other languages
Chinese (zh)
Other versions
CN111026540B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201811179227.1A
Priority to PCT/CN2019/110273
Publication of CN111026540A
Application granted
Publication of CN111026540B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5017 Task decomposition

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The present application provides a task processing method, a task scheduler, and a task processing device. The method sends a task to a processor for processing only after acquiring the identifier of the task, thereby ensuring that tasks are executed in a reasonable order and that the program execution logic remains correct.

Description

Task processing method, task scheduler and task processing device
Technical Field
The present application relates to the field of computer application technologies, and in particular, to a task processing method, a task scheduler, and a task processing apparatus.
Background
Deep neural networks are the foundation of many current artificial intelligence applications, and have achieved breakthrough adoption in speech recognition, image processing, data analysis, advertisement recommendation systems, autonomous driving, and many other areas, finding their way into all aspects of life.
However, the computational cost of deep neural networks is enormous, which has long constrained their faster development and wider application. How to improve the task processing efficiency of deep neural networks has therefore become an urgent technical problem.
Disclosure of Invention
In view of the above, it is desirable to provide a task processing method, a task scheduler, and a task processing device that improve the processing efficiency of tasks.
A task processing method comprises the following steps: acquiring configuration information of a task; judging whether the data dependency relationship of the task is satisfied according to the configuration information of the task; if the data dependency relationship of the task is satisfied, acquiring a task identifier of the task; after the task obtains the task identifier, splitting the task into a plurality of jobs to obtain decomposition information of the task; and acquiring all information of the task, and processing the task according to all the information and the decomposition information of the task.
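To make the claimed flow concrete, the sketch below walks the five steps in order. It is a minimal illustration rather than the claimed implementation; all names (`dependencies_satisfied`, `decomposition_number`, `job_size`, and so on) are hypothetical stand-ins, not identifiers from this application.

```python
from itertools import count

_task_ids = count(1)  # stand-in for the scheduler's identifier allocation

def dependencies_satisfied(config):
    # Satisfied when the task has no pre-task, or its pre-task has finished.
    pre = config.get("pre_task")
    return pre is None or pre.get("finished", False)

def process_task(task):
    config = task["config"]                  # step 1: acquire configuration info
    if not dependencies_satisfied(config):   # step 2: check the data dependency
        return None                          # defer until the pre-task finishes
    task["id"] = next(_task_ids)             # step 3: acquire a task identifier
    jobs = [{"task_id": task["id"], "index": i, "size": config["job_size"]}
            for i in range(config["decomposition_number"])]  # step 4: split into jobs
    return task["id"], jobs                  # step 5: process with full info + jobs

# Example: a task with no pre-task is split into 4 jobs of 256 units each.
task = {"config": {"pre_task": None, "decomposition_number": 4, "job_size": 256}}
print(process_task(task))
```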
In one embodiment, the step of determining whether the data dependency relationship of the task is satisfied according to the configuration information of the task includes: analyzing the configuration information of the task to obtain dependency relationship information of the task; and judging, according to the dependency relationship information, whether the task has a pre-task, where if the task has no pre-task, the dependency relationship of the task is satisfied.
In one embodiment, the method further comprises: if the task has a pre-task, judging whether the pre-task has finished executing; if the pre-task has finished executing, the dependency relationship of the task is satisfied.
In one embodiment, before the step of obtaining the configuration information of the task, the method further includes: monitoring a task queue, and if a task in a to-be-transmitted state exists in the task queue, sending an information acquisition request to the task queue; and receiving the configuration information of the task returned by the task queue according to the information acquisition request.
In one embodiment, before the step of obtaining the configuration information of the task, the method further includes: monitoring a plurality of task queues, and determining a target queue according to whether a task in a to-be-transmitted state exists in each task queue; acquiring a queue identifier of the target queue; sending an information acquisition request to the target queue, using the queue identifier of the target queue as an index; and receiving the configuration information of the task returned by the target queue according to the information acquisition request.
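A minimal sketch of this queue-monitoring step follows; the queue layout and the state label `to_transmit` are assumptions for illustration, not names from the application.

```python
def select_target_queue(queues):
    # Return the identifier of the first queue holding a task in the
    # to-be-transmitted state, or None when no queue qualifies.
    for q in queues:
        if any(t["state"] == "to_transmit" for t in q["tasks"]):
            return q["queue_id"]
    return None

def fetch_config(queues, queue_id):
    # Use the queue identifier as an index and return the configuration
    # of the first to-be-transmitted task in that queue.
    q = next(q for q in queues if q["queue_id"] == queue_id)
    task = next(t for t in q["tasks"] if t["state"] == "to_transmit")
    return task["config"]

queues = [{"queue_id": 0, "tasks": []},
          {"queue_id": 1, "tasks": [{"state": "to_transmit", "config": {"name": "conv1"}}]}]
qid = select_target_queue(queues)        # -> 1
print(fetch_config(queues, qid))         # -> {'name': 'conv1'}
```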
In one embodiment, after the step of processing the task according to all the information and the decomposition information of the task, the method further includes: receiving end information of the plurality of jobs of the task, and writing the end information of the plurality of jobs into a cache.
In one embodiment, the step of writing the end information of the plurality of jobs into the cache comprises: determining the blocking interval where the task is located, and determining the previous blocking interval according to that blocking interval; and writing the end information of the plurality of jobs of the task into the cache only after the end information of all jobs in the previous blocking interval has been written into the cache.
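The ordering rule can be sketched as below, assuming blocking intervals are kept as ordered lists of jobs; the field names are illustrative only.

```python
def flush_end_info(cache, intervals, task):
    # intervals: blocking intervals in program order; each is a list of jobs,
    # and a job's end information counts as flushed once "written" is True.
    k = task["interval"]
    if k > 0 and not all(j["written"] for j in intervals[k - 1]):
        return False                      # previous interval not fully written yet
    for job in task["jobs"]:
        cache.append(job["end_info"])     # write this task's job end information
        job["written"] = True
    return True
```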
A task scheduler comprising: a task decomposition device and a task scheduling device,
the task decomposition device is used for acquiring configuration information of a task and judging whether the data dependency relationship of the task is met or not according to the configuration information of the task;
the task decomposition device is further configured to obtain a task identifier of the task when the dependency relationship of the task is satisfied, and split the task into a plurality of jobs after the task obtains the task identifier, so as to obtain decomposition information of the task;
the task scheduling device is connected with the task decomposition device, and is used for acquiring all information of the task and the decomposition information of the task and sending the acquired information to the processor.
A task processing device comprises the task scheduler and a processor,
and the processor is used for processing the task according to all the information of the task and the decomposition information.
In one embodiment, the processor is configured to perform machine learning calculations, and the computing device of the processor comprises an arithmetic unit and a controller unit, the arithmetic unit comprising a master processing circuit and a plurality of slave processing circuits. The controller unit is used for acquiring data, a machine learning model, and a calculation instruction; the controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the data to the main processing circuit. The main processing circuit is used for performing pre-processing on the data and transmitting data and operation instructions to and from the plurality of slave processing circuits; the plurality of slave processing circuits are used for executing intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit to obtain a plurality of intermediate results, and transmitting the plurality of intermediate results to the main processing circuit; and the main processing circuit is used for performing subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
In the task processing method, the task scheduler, and the task processing device provided by the present application, after the data dependency relationship of a task is satisfied, the task is first registered, then decomposed into a plurality of jobs to obtain decomposition information of the task, and finally the processor processes the task according to the decomposition information and all information of the task. With this task processing method, a task can be split according to the processing capability of the processor, and once the task is split into a plurality of jobs, it can also be processed by a plurality of processors, so that the task can be matched with a processor as early as possible, improving the processing efficiency of tasks.
Drawings
FIG. 1 is a diagram of an application environment of a task decomposition device in one embodiment;
FIG. 2 is a diagram of an application environment of a task decomposition device in one embodiment;
FIG. 3 is a diagram of an application environment of a task decomposition device in one embodiment;
FIG. 4 is a schematic diagram of a computing device, according to an embodiment;
FIG. 5 is a block diagram of a computing device provided in accordance with another embodiment;
FIG. 6 is a block diagram of a main processing circuit provided by one embodiment;
FIG. 7 is a block diagram of a computing device provided in one embodiment;
FIG. 8 is a block diagram of another computing device provided in one embodiment;
FIG. 9 is a schematic diagram of a tree module according to an embodiment;
FIG. 10 is a block diagram of a computing device provided by an embodiment;
FIG. 11 is a block diagram of a computing device provided in one embodiment;
FIG. 12 is a block diagram of a computing device provided by an embodiment;
FIG. 13 is a flowchart of the steps of a task processing method, provided by one embodiment;
FIG. 14 is a flowchart of steps provided in one embodiment for determining whether data dependencies of tasks are satisfied;
FIG. 15 is a flowchart providing steps for determining whether data dependencies of tasks are satisfied according to another embodiment;
FIG. 16 is a flowchart illustrating additional steps in a method for task processing, according to an embodiment;
FIG. 17 is a flowchart illustrating additional steps of a task processing method according to another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As shown in fig. 1 to 3, a task scheduler 100 according to an embodiment of the present application includes a task decomposition device 110 and a task scheduling device 120, and the task scheduler may be connected to a first processor 200 and a second processor 300. Alternatively, the first processor 200 may be a general-purpose processor such as a CPU, and the second processor 300 may be a coprocessor of the first processor 200. Specifically, the second processor 300 may include a second processor body 310 and a control device 320 for controlling the operation of the second processor body, and the second processor body 310 may be an artificial intelligence processor such as an IPU (Intelligent Processing Unit) or an NPU (Neural-network Processing Unit). Further, there may be a plurality of second processor bodies, all connected to the control device of the second processor body.
Further, the task scheduler 100 may be connected to the global memory through Direct Memory Access (DMA). Alternatively, the global memory may be a DRAM (Dynamic Random Access Memory), an SRAM (Static Random-Access Memory), or the like.
The task decomposition device 110 is configured to obtain configuration information of a task, determine whether a data dependency relationship of the task is satisfied according to the configuration information of the task, and obtain a task identifier of the task when the data dependency relationship of the task is satisfied; after the task obtains the task identifier, the task is divided into a plurality of jobs, and decomposition information of the task is obtained.
The configuration information of a task is used to describe the task, and may include a task identity, a task type, task dependency information, and the like. The task identity includes information representing the identity of the task, such as a task name or a task sequence number. The task type includes block (blocking task), cluster (clustering task), and union (ordinary task). The task dependency information includes the pre-tasks and post-tasks of the task: a task on which a given task depends is called a pre-task of that task; conversely, a task that depends on it is called a post-task. Further, the configuration information of the task may also include a weight of the task, which may reflect the importance of the task.
Further, the configuration information of the task may further include a task decomposition number and a job size, where the task decomposition number refers to the number of jobs formed by decomposing the task, and the job size refers to the data capacity of each job. The task decomposition device 110 can decompose the corresponding task into a plurality of jobs according to the task decomposition number and the job size in the configuration information of the task. Optionally, the task decomposition number is 2^n, where n is a positive integer. Further, each job can be assigned to a corresponding processor for processing, so the size of each job can be an integer multiple of the word size of the corresponding processor, where the processor word size reflects the amount of data the processor can process at one time.
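As a sketch of this decomposition rule, assuming the job size is obtained by rounding an even split up to a word-size multiple (the rounding policy is an assumption, not stated in the text):

```python
def decompose(total_size, n, word_size):
    # Split a task of total_size data units into 2**n jobs, rounding each
    # job's size up to an integer multiple of the processor word size.
    num_jobs = 2 ** n
    per_job = -(-total_size // num_jobs)             # ceiling division
    job_size = -(-per_job // word_size) * word_size  # round up to a word multiple
    return [{"index": i, "size": job_size} for i in range(num_jobs)]

print(decompose(1000, 3, 16))   # 8 jobs of 128 units each
```

Here `decompose(1000, 3, 16)` yields 8 jobs of 128 units, since 1000/8 = 125 rounds up to the next multiple of the 16-unit word size.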
Optionally, when the task is stored in a queue, the task decomposition device 110 may monitor the task queue, and if there is a task in a to-be-transmitted state in the task queue, send an information acquisition request to the task queue; and receiving the configuration information of the corresponding task returned by the task queue according to the information acquisition request.
Optionally, when a plurality of queues are used to store tasks, the task decomposition device 110 may monitor the task queues, and determine a target queue according to whether a task in a to-be-transmitted state exists in the task queues; acquiring a queue identifier of the target queue; sending an information acquisition request to the target queue according to the index of the queue identifier of the target queue; and receiving the configuration information of the corresponding task returned by the target queue according to the information acquisition request.
The task scheduling device 120 is configured to obtain the decomposition information of the task and all information of the task.
The first processor 200 or the second processor 300 is configured to process the task according to the decomposition information and all information of the task.
Further, the task scheduling device is used to schedule tasks, that is, to determine which job is processed by the first processor 200 or the second processor 300. Specifically, after acquiring the identifier of a task, the task decomposition device 110 updates the state of that task to the to-be-scheduled state, and sends a scheduling request for the task in the to-be-scheduled state to the task scheduling device. After receiving the scheduling request, the task scheduling device 120 acquires the configuration information of the task in the to-be-scheduled state, where this configuration information constitutes the all task information, and acquires the decomposition information of the task from the task decomposition device. The task scheduling device 120 then schedules the task and correspondingly updates its state to the scheduling state.
Further, the task decomposition device 110 may receive the scheduling feedback information sent by the task scheduling device 120, and update the state of the corresponding task according to the scheduling feedback information. Specifically, when the feedback information includes task scheduling failure information, the task decomposition device 110 updates the state of the corresponding task from the scheduling state to the state to be scheduled, and sends the scheduling request of the task to the task scheduling device 120 again; and when the feedback information contains task scheduling success information, updating the state of the corresponding task from the scheduling state to the scheduling ending state.
The task scheduling device 120 may also obtain decomposition information of the task, obtain all information of the target task from the global memory, and send the decomposition information of the task and all task information of the target task to the processor.
Further, after receiving the scheduling request of the task sent by the task decomposition device 110, the task scheduling device 120 correspondingly obtains the configuration information of the task (that is, the all task information), obtains the decomposition information of the task from the task decomposition device, and also obtains the state information of the processors. Based on the all task information and the decomposition information of the task, the task scheduling device 120 can determine the processor information (for example, the processor type) required by each job of the task and, from the size of each job, the processing capability required of the processor. The processor state information may include the type of the processor, its operating state (whether the processor is idle), and its processing capability. In this way, the task scheduling device 120 can match each job of the task with a processor based on the all task information, the decomposition information, and the processor state information. Further, if a job is successfully matched with a processor, the task scheduling device 120 may also obtain the processor identifier of the matched processor, where the processor identifier is used to identify that processor.
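A minimal matching sketch, under the assumed criteria just listed (type match, idle, sufficient capability); all field names are illustrative:

```python
def match_job(job, processors):
    # A processor is a match when its type equals the type the job requires,
    # it is idle, and its processing capability covers the job size.
    for p in processors:
        if (p["type"] == job["required_type"] and p["idle"]
                and p["capability"] >= job["size"]):
            return p["processor_id"]     # identifier of the matched processor
    return None                          # no match: the job stays pending
```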
The task scheduling device 120 is configured to select a target job from the set of jobs to be scheduled according to the target weight of each job in the set, and to obtain scheduling information. Specifically, the task scheduling device 120 may send the jobs in the set to the processors one by one for processing, determining the target job of the current scheduling round according to the target weight of each job. The target weight of each job in the set may be obtained by calculation or, of course, may be preset.
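For illustration, selecting the target job can be as simple as taking the maximum target weight (a sketch; tie-breaking and weight calculation are left unspecified, as in the text):

```python
def pick_target_job(job_set):
    # Schedule the job whose target weight is currently the largest.
    return max(job_set, key=lambda job: job["target_weight"])

jobs = [{"name": "a", "target_weight": 3}, {"name": "b", "target_weight": 7}]
print(pick_target_job(jobs))   # -> the job named "b"
```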
Finally, after determining the target job, the task scheduling device 120 correspondingly obtains the decomposition information of the task to which the target job belongs from the task decomposition device, obtains the corresponding all task information from the global memory, and then packages the acquired all task information, decomposition information, and scheduling information of the task and sends them to the target processor.
Further, when the target processor is the second processor body 310, the task scheduling device 120 first sends all the acquired task information of the target task and the decomposition information of the target task to the control device 320 of the second processor body, and then the control device of the second processor body processes all the task information of the target task according to the decomposition information, that is, the target task is decomposed into a plurality of jobs, and finally the control device 320 of the second processor body distributes the plurality of jobs to the second processor body 310 for processing.
In one embodiment, as shown in FIG. 3, the task decomposition device 110 may be coupled to a state monitoring device 400. The state monitoring device 400 may allocate a task identifier for the corresponding task based on a registration request from the task decomposition device 110; the task identifier can be used to distinguish different tasks. Specifically, the task decomposition device 110 may analyze the configuration information of the task to obtain the dependency relationship information of the task, determine whether the dependency relationship of the task is satisfied according to that information, and thereby determine whether to send a task registration request to the state monitoring device 400 to obtain a task identifier for the task. The state monitoring device 400 may allocate a task identifier to the task according to the received task registration request, completing the registration of the task. The state monitoring device 400 then transmits the task identifier to the task decomposition device 110, and the task decomposition device 110 updates the state of the task with the task identifier to the to-be-scheduled state.
Further, when determining whether the data dependency relationship of the task is satisfied according to the configuration information of the task, the task decomposition device 110 first determines whether the task has a pre-task. If it does, the task decomposition device sends an inquiry request to the state monitoring device 400; the state monitoring device 400 inquires, according to the request, whether the pre-task has finished executing, and once it has, feeds back a message to that effect to the task decomposition device 110, which then determines that the dependency relationship of the task is satisfied. If the task has no pre-task, the task decomposition device 110 may directly determine that the dependency relationship of the task is satisfied. After the dependency relationship of the task is satisfied, the task decomposition device 110 sends a task registration request to the state monitoring device 400 to obtain a task identifier for the task.
Alternatively, the state monitoring device 400 may check whether the pre-task of a task has finished executing through a preset check bit. Specifically, when the configuration information of each task is set in advance, a dependency check bit corresponding to the task is set correspondingly, and the dependency check bit of each task can indicate whether that task has finished executing. More specifically, when querying whether a pre-task has finished executing, the state monitoring device 400 first locates the dependency check bit corresponding to the pre-task, and then determines from the value of that check bit whether the pre-task has finished.
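A sketch of the check-bit query, assuming the encoding 1 = finished and 0 = still running (the text does not fix the encoding):

```python
def pre_task_finished(check_bits, pre_task_id):
    # check_bits maps a task identifier to its dependency check bit;
    # assumed encoding: 1 = finished executing, 0 = not yet finished.
    return check_bits.get(pre_task_id, 0) == 1

check_bits = {42: 1, 43: 0}
print(pre_task_finished(check_bits, 42))   # True: the pre-task has finished
print(pre_task_finished(check_bits, 43))   # False: keep monitoring the bit
```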
Further, when the query shows that the pre-task has not finished executing, the state monitoring device 400 may monitor whether the pre-task finishes, that is, monitor whether the corresponding dependency check bit is updated, and determine that the pre-task has finished executing once it observes that the check bit has been updated.
Alternatively, the control device 320 of the second processor body may collect the execution state information of each job on the second processor, that is, collect the end information of each job on the second processor body 310, and send the end information to the state monitoring device 400; the state monitoring device 400 may update its state table entries according to the received end information, so as to monitor the execution state of tasks. Alternatively, the state monitoring device 400 may be connected to the global memory through DMA, so that it can write the obtained end information of each job into the global memory.
Optionally, the control device 320 of the second processor body may also collect the processor status information of the second processor body 310 and send the processor status information to the task scheduling device 120.
In one embodiment, the state monitoring device 400 is further configured to receive the job end information of a job and determine, according to that end information, whether the corresponding task has an execution exception; if it does, a task destruction instruction is generated. Alternatively, the job end information includes result flag data, and the state monitoring device 400 may determine whether the current task has an execution exception according to the result flag data contained in the job end information.
For example, if the current task has an execution exception, the control device of the second processor body may set the result flag data in the job end information of the current job to a non-zero value (e.g., an exception flag of 1), and the state monitoring device 400 can then determine from the result flag data that the current task has an execution exception. If the current task has no execution exception, the control device of the second processor body may set the result flag data in the job end information of the current job to 0, and the state monitoring device 400 can then determine from the result flag data that the current task has no execution exception. Other tasks are handled in the same way and are not detailed here.
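Following the encoding of this example (0 = normal, non-zero = exception), the check reduces to a sketch like:

```python
def check_exception(job_end_info):
    # Assumed encoding from the example above: result flag 0 means the task
    # executed normally; any non-zero value marks an execution exception.
    return job_end_info.get("result_flag", 0) != 0

print(check_exception({"result_flag": 0}))   # False: normal completion
print(check_exception({"result_flag": 1}))   # True: generate a destruction instruction
```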
Further, the execution exception of a job may include a first exception condition and a second exception condition, and the task destruction instruction may correspondingly include a first task destruction instruction for the first exception condition and a second task destruction instruction for the second exception condition. Alternatively, when it is determined that the job has an exception, the exception handling circuit may further determine, from the exception flag data contained in the job end information, whether the execution exception of the current task is the first exception condition or the second exception condition. The first and second exception conditions may each be a combination of one or more exceptions, such as insufficient second processor resources or a second processor failure.
Optionally, the exception handling circuit is configured to, when it is determined that the job has the first exception condition according to the job end information of the job, obtain a first task destruction instruction, and transmit the first task destruction instruction to the task destruction circuit, where the task destruction circuit destroys the task to which the job belongs according to the first task destruction instruction. Specifically, when receiving the first task destruction instruction, the task destruction circuit may be configured to terminate scheduling of the job with the execution exception and all jobs after the job, and obtain scheduling end information of the task to which the job belongs. Further, after the task destroying circuit completes the operation of destroying the task to which the job belongs, the task scheduling end information of the task to which the job belongs may be transmitted to the state monitoring device.
The task scheduler further comprises a register file connected to the task decomposition device. If the exception handling circuit determines, according to the job end information of the job, that the job has the second exception condition, it can obtain a second task destruction instruction to destroy the task to which the job belongs and all tasks after it. Specifically, the exception handling circuit transmits the second task destruction instruction to the task destruction circuit to notify it to destroy the task to which the job belongs and all subsequent tasks. Optionally, after receiving the second task destruction instruction, the task destruction circuit may destroy all tasks in the task queue where the task to which the job belongs is located. Specifically, the task decomposition device terminates, according to the second task destruction instruction, the scheduling of the task to which the job belongs and of the other tasks after it, and notifies the register connected to the task decomposition device to clear the task to which the job belongs. After that task is cleared from the register, scheduling end information of the task can be obtained.
Meanwhile, after the task to which the job belongs is cleared from the register, the task decomposition device may send task registration requests for the other tasks after that task to the state monitoring device, so as to obtain the task identifiers corresponding to those tasks. The task registration circuit of the state monitoring device may allocate a task identifier to each of the other tasks after the task to which the job belongs. When the task destruction circuit receives the task identifiers fed back by the task registration circuit, it may obtain, according to the received identifiers, the scheduling end information corresponding to those tasks, so as to destroy all tasks after the task to which the job belongs. Further, the task decomposition device can also transmit the scheduling end information of each task to be processed to the state monitoring device.
This exception handling mechanism ensures the accuracy of task execution results. When an exception condition exists, the state monitoring device can notify the task destruction circuit to destroy the corresponding task and/or all tasks after it, avoiding the waste of resources that would result from the second processor continuing to execute other tasks under the exception condition.
Optionally, the state control circuit is further configured to obtain a first interrupt signal when a task destruction instruction is received, transmit the first interrupt signal to the first processor, and then perform the destruction operation. Specifically, when the task destruction circuit receives the task destruction instruction, it first terminates the scheduling of the task to which the job belongs, so that scheduling under an exception condition does not consume unnecessary resources. Meanwhile, after receiving the task destruction instruction, the task destruction circuit can obtain a first interrupt signal and transmit it to the first processor. Further, after receiving the first interrupt signal, the first processor may obtain the state information of each second processor body and determine, from that state information, which second processor body the exception occurred on.
The state control circuit is also used to obtain a second interrupt signal after the destruction operation is finished and transmit the second interrupt signal to the first processor. Specifically, after receiving the scheduling end information of the current task, or after receiving the scheduling end information of the current task and of all tasks in the task queue to which it belongs, the state monitoring device obtains exception handling end information and transmits it to the task decomposition device; the task destruction circuit is further used to obtain the second interrupt signal according to the exception handling end information and transmit it to the first processor.
Through this exception handling mechanism, the accuracy of task execution results is ensured. If an exception condition exists, the state monitoring device can destroy the corresponding task, or all tasks after it, through the task decomposition device, avoiding the waste of resources that would result from the second processor continuing to execute other tasks under the exception condition.
Alternatively, after the task is executed, the state monitoring device 400 receives the end information of the plurality of jobs of the task and writes that end information into the cache. Specifically, the state monitoring device 400 determines the blocking interval where the task is located and determines the previous blocking interval from it; the end information of the plurality of jobs of the task is written into the cache only after the end information of all jobs in the previous blocking interval has been written into the cache.
In one embodiment, the second processor body 310 includes a computing device as shown in fig. 4, the computing device including: a controller unit 11 and an arithmetic unit 12, wherein the controller unit 11 is connected with the arithmetic unit 12, and the arithmetic unit 12 comprises: a master processing circuit and a plurality of slave processing circuits.
Specifically, the controller unit 11 may be configured to obtain a job, which may include data, a machine learning model, and computational instructions. In an alternative, the input data and the calculation instruction may be obtained through a data input/output unit, and the data input/output unit may be one or more data I/O interfaces or I/O pins.
The above calculation instruction includes, but is not limited to, a convolution operation instruction, a forward training instruction, or another neural network operation instruction; the present application does not limit the specific expression of the calculation instruction.
The controller unit 11 is further configured to analyze the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
a master processing circuit 101 configured to perform pre-processing on the input data and to transmit data and operation instructions to and from the plurality of slave processing circuits;
a plurality of slave processing circuits 102 configured to perform an intermediate operation in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
and the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
In the technical solution provided by the present application, the arithmetic unit is arranged in a one-master multi-slave structure. For the calculation instruction of a forward operation, the data can be split according to that instruction, so that the computation-intensive part can be operated on in parallel by the plurality of slave processing circuits, thereby increasing operation speed, saving operation time, and in turn reducing power consumption.
Optionally, the machine learning calculation specifically includes an artificial neural network operation, where the input data specifically includes input neuron data and weight data, and the calculation result may specifically be the result of the artificial neural network operation, that is, output neuron data.
In the forward operation, after the execution of the artificial neural network of the previous layer is completed, the operation instruction of the next layer takes the output neuron calculated in the operation unit as the input neuron of the next layer to perform operation (or performs some operation on the output neuron and then takes the output neuron as the input neuron of the next layer), and at the same time, the weight value is replaced by the weight value of the next layer; in the reverse operation, after the reverse operation of the artificial neural network of the previous layer is completed, the operation instruction of the next layer takes the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer to perform operation (or performs some operation on the input neuron gradient and then takes the input neuron gradient as the output neuron gradient of the next layer), and at the same time, the weight value is replaced by the weight value of the next layer.
The above-described machine learning calculations may also include support vector machine operations, k-nearest neighbor (k-nn) operations, k-means (k-means) operations, principal component analysis operations, and the like. For convenience of description, the following takes artificial neural network operation as an example to illustrate a specific scheme of machine learning calculation.
For the artificial neural network operation, if the artificial neural network operation has multilayer operations, the input neurons and output neurons of the multilayer operations do not refer to the neurons in the input layer and output layer of the whole neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer are the output neurons. Taking a convolutional neural network as an example, let the network have L layers, K = 1, 2, ..., L−1. For the K-th layer and the (K+1)-th layer, the K-th layer is referred to as the input layer, in which the neurons are the input neurons, and the (K+1)-th layer is referred to as the output layer, in which the neurons are the output neurons. That is, every layer except the topmost layer can serve as an input layer, and the next layer is the corresponding output layer.
Optionally, the computing device may further include: the storage unit 10 and the direct memory access unit 50, the storage unit 10 may include: one or any combination of a register and a cache, specifically, the cache is used for storing the calculation instruction; the register is used for storing the input data and a scalar; the cache is a scratch pad cache. The direct memory access unit 50 is used to read or store data from the storage unit 10.
Optionally, the controller unit includes: an instruction cache unit 110', an instruction processing unit 111, and a store queue unit 113;
the instruction cache unit 110' is configured to store the calculation instruction associated with the artificial neural network operation.
The instruction processing unit 111 is configured to analyze the calculation instruction to obtain a plurality of operation instructions.
A store queue unit 113 for storing an instruction queue, the instruction queue comprising: and a plurality of operation instructions or calculation instructions to be executed according to the front and back sequence of the queue.
For example, in an alternative embodiment, the main processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, specifically configured to decode instructions into microinstructions. Of course in another alternative the slave processing circuit may also comprise a further controller unit comprising a slave instruction processing unit, in particular for receiving and processing microinstructions. The micro instruction may be a next-stage instruction of the instruction, and the micro instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals of each component, each unit, or each processing circuit.
In one alternative, the structure of the calculation instruction may be as shown in the following table.
Operation code | Register or immediate | Register/immediate | ...
The ellipsis in the above table indicates that multiple registers or immediates may be included.
In another alternative, the computing instructions may include: one or more operation domains and an opcode. The computation instructions may include neural network operation instructions. Taking the neural network operation instruction as an example, as shown in table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be a number of one or more registers.
(Table 1, showing the operation domains of a neural network operation instruction, appears as an image in the original publication and is not reproduced here.)
The above register may be an off-chip memory; in practical applications, it may also be an on-chip memory for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1: when n = 1 the data is 1-dimensional, that is, a vector; when n = 2 it is 2-dimensional, that is, a matrix; and when n ≥ 3 it is a multidimensional tensor.
Optionally, the controller unit may further include:
the dependency processing unit 108 is configured to determine whether a first operation instruction is associated with a zeroth operation instruction before the first operation instruction when there are multiple operation instructions, cache the first operation instruction in the instruction storage unit if the first operation instruction is associated with the zeroth operation instruction, and extract the first operation instruction from the instruction storage unit and transmit the first operation instruction to the operation unit after the zeroth operation instruction is executed;
the determining whether the first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction comprises:
extracting a first storage address interval of required data (such as a matrix) in the first operation instruction according to the first operation instruction, extracting a zeroth storage address interval of the required matrix in the zeroth operation instruction according to the zeroth operation instruction, if the first storage address interval and the zeroth storage address interval have an overlapped area, determining that the first operation instruction and the zeroth operation instruction have an association relation, and if the first storage address interval and the zeroth storage address interval do not have an overlapped area, determining that the first operation instruction and the zeroth operation instruction do not have an association relation.
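The overlap test can be sketched directly; intervals are assumed to be half-open (start, end) pairs, an illustrative convention not fixed by the text:

```python
def instructions_dependent(first_interval, zeroth_interval):
    # Each interval is a (start, end) pair of storage addresses, end exclusive.
    # The instructions are associated exactly when the intervals overlap.
    s1, e1 = first_interval
    s0, e0 = zeroth_interval
    return s1 < e0 and s0 < e1

print(instructions_dependent((0, 64), (32, 96)))   # True: overlapping region
print(instructions_dependent((0, 64), (64, 96)))   # False: disjoint, can issue
```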
In another alternative embodiment, the arithmetic unit 12 may include a master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 5. In one embodiment, as shown in FIG. 5, the plurality of slave processing circuits are distributed in an array: each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K of the plurality of slave processing circuits. As shown in fig. 5, the K slave processing circuits comprise the n slave processing circuits in the 1st row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the 1st column; that is, the K slave processing circuits are the slave processing circuits that are directly connected to the master processing circuit.
And the K slave processing circuits are used for forwarding data and instructions between the main processing circuit and the plurality of slave processing circuits.
Optionally, as shown in fig. 6, the main processing circuit may further include: one or any combination of the conversion processing circuit 110 ″, the activation processing circuit 111, and the addition processing circuit 112;
conversion processing circuitry 110 "for performing an interchange between the first data structure and the second data structure (e.g., conversion of continuous data to discrete data) on the data blocks or intermediate results received by the main processing circuitry; or to perform an interchange between the first data type and the second data type (e.g. a conversion of a fixed point type to a floating point type) on a data block or an intermediate result received by the main processing circuitry.
And an activation processing circuit 111 for executing an activation operation of data in the main processing circuit.
And an addition processing circuit 112 for performing addition operation or accumulation operation.
The master processing circuit is configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the slave processing circuit;
the plurality of slave processing circuits are used for executing operations on the received data blocks according to the operation instruction to obtain intermediate results, and transmitting the intermediate results to the main processing circuit;
and the main processing circuit is used for processing the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction and sending the result of the calculation instruction to the controller unit.
The slave processing circuit includes: a multiplication processing circuit.
The multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result.
Forwarding processing circuitry (optional) for forwarding the received data block or the product result.
And the accumulation processing circuit is used for performing accumulation operation on the product result to obtain the intermediate result.
In another embodiment, the operation instruction is a matrix by matrix instruction, an accumulation instruction, an activation instruction, or the like.
The following describes a specific calculation method of the computing device shown in fig. 4 through a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be s = s(∑(w·xᵢ) + b), that is, the weight w is multiplied by the input data xᵢ, the products are summed, the bias b is added, and the activation operation s(h) is performed to obtain the final output result s.
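A tiny numeric sketch of this formula, with sigmoid assumed as the activation s(h) (the text does not fix the activation function):

```python
import math

def activation(h):
    # The activation s(h) is not fixed by the text; sigmoid is assumed here.
    return 1.0 / (1.0 + math.exp(-h))

w = [0.5, -0.25, 0.1]   # illustrative weights
x = [1.0, 2.0, 3.0]     # illustrative input data x_i
b = 0.2                 # illustrative bias
s = activation(sum(wi * xi for wi, xi in zip(w, x)) + b)
print(s)                # sigmoid(0.5) ≈ 0.622
```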
In an alternative embodiment, as shown in fig. 7, the arithmetic unit comprises: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits;
the tree module has a transceiving function, for example, as shown in fig. 7, the tree module is a transmitting function, and as shown in fig. 8, the tree module is a receiving function.
And the tree module is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the plurality of slave processing circuits.
Optionally, the tree module is an optional component of the computing device and may include at least one layer of nodes. Each node is a line structure with a forwarding function, and the node itself may have no computing function. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example, a binary tree structure as shown in fig. 9, or a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment does not limit the specific value of n; the number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example, the nodes of the last layer shown in fig. 9.
Optionally, the arithmetic unit may carry a separate cache; as shown in fig. 10, it may include a neuron caching unit 63, which caches the input neuron vector data and the output neuron value data of the slave processing circuits.
As shown in fig. 11, the arithmetic unit may further include: and a weight buffer unit 64, configured to buffer weight data required by the slave processing circuit in the calculation process.
In an alternative embodiment, the arithmetic unit 12, as shown in fig. 12, may include a branch processing circuit 103; the specific connection structure is shown in fig. 12, wherein,
the main processing circuit 101 is connected to branch processing circuit(s) 103, the branch processing circuit 103 being connected to one or more slave processing circuits 102;
a branch processing circuit 103 for forwarding data or instructions between the main processing circuit 101 and the slave processing circuits 102.
In an alternative embodiment, taking the fully connected operation in the neural network operation as an example, the process may be y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias scalar, and f is the activation function, which may specifically be the sigmoid, tanh, relu, or softmax function. Here, a binary tree structure is assumed, with 8 slave processing circuits, and the implementation method may be:
the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;
the main processing circuit determines the input neuron matrix x as broadcast data, determines the weight matrix w as distribution data, divides the weight matrix w into 8 sub-matrixes, then distributes the 8 sub-matrixes to 8 slave processing circuits through a tree module, broadcasts the input neuron matrix x to the 8 slave processing circuits,
the slave processing circuit executes multiplication and accumulation operation of the 8 sub-matrixes and the input neuron matrix x in parallel to obtain 8 intermediate results, and the 8 intermediate results are sent to the master processing circuit;
and the main processing circuit orders the 8 intermediate results to obtain the operation result of wx, performs the bias b operation on that result, performs the activation operation to obtain the final result y, and sends y to the controller unit, which outputs it or stores it into the storage unit.
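The sketch below mirrors this 8-way fully connected flow with NumPy; the shapes and the choice of tanh for f are assumptions for illustration, and the tree-based data transfer is abstracted away.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 1))    # input neuron matrix (broadcast data)
w = rng.standard_normal((64, 16))   # weight matrix (distribution data)
b = 0.1                             # bias scalar

sub_ws = np.split(w, 8, axis=0)               # distribute 8 sub-matrices
partials = [sub_w @ x for sub_w in sub_ws]    # 8 slaves multiply-accumulate in parallel
wx = np.vstack(partials)                      # master orders the 8 intermediate results
y = np.tanh(wx + b)                           # bias b, then activation f

assert np.allclose(y, np.tanh(w @ x + b))     # matches the unsplit computation
```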
The method for executing the neural network forward operation instruction by the computing device shown in fig. 4 may specifically be:
the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the at least one operation code to the operation unit.
The controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b does not need to be extracted), transmits the weight w and the offset b to the main processing circuit of the arithmetic unit, extracts the input data Xi from the storage unit, and transmits the input data Xi to the main processing circuit.
The main processing circuit determines multiplication operation according to the at least one operation code, determines input data Xi as broadcast data, determines weight data as distribution data, and splits the weight w into n data blocks;
the instruction processing unit of the controller unit determines a multiplication instruction, an offset instruction and an accumulation instruction according to the at least one operation code, and sends the multiplication instruction, the offset instruction and the accumulation instruction to the master processing circuit, the master processing circuit sends the multiplication instruction and the input data Xi to a plurality of slave processing circuits in a broadcasting mode, and distributes the n data blocks to the plurality of slave processing circuits (for example, if the plurality of slave processing circuits are n, each slave processing circuit sends one data block); the plurality of slave processing circuits are used for executing multiplication operation on the input data Xi and the received data block according to the multiplication instruction to obtain an intermediate result, sending the intermediate result to the master processing circuit, executing accumulation operation on the intermediate result sent by the plurality of slave processing circuits according to the accumulation instruction by the master processing circuit to obtain an accumulation result, executing offset b on the accumulation result according to the offset instruction to obtain a final result, and sending the final result to the controller unit.
In addition, the order of addition and multiplication may be reversed.
According to this technical solution, the multiplication and bias operations of the neural network are achieved through a single instruction, the neural network operation instruction: the intermediate results of the neural network calculation need not be separately stored or fetched, which reduces the storage and retrieval of intermediate data. The solution therefore has the advantages of reducing the corresponding operation steps and improving the calculation performance of the neural network.
In one embodiment, a task processing device is further provided. The task processing device includes the task scheduler mentioned in any of the above embodiments and a processor; the task scheduler is connected with the processor, the task scheduler is used for decomposing and scheduling tasks, and the processor is used for processing the tasks.
The working principle of the task processing device according to the embodiment of the present application is illustrated below with reference to FIG. 1 and FIG. 13:
FIG. 13 is a flowchart illustrating steps of a task processing method according to one embodiment. The method comprises the following steps:
Step S100: acquiring configuration information of the task.
Specifically, the task scheduler 100 acquires the configuration information of the task. Optionally, the task scheduler 100 obtains the configuration information of the task from a global memory.
Step S200: judging whether the data dependency relationship of the task is satisfied according to the configuration information of the task. If the data dependency relationship of the task is satisfied, step S300 is executed to obtain a task identifier of the task. If the data dependency relationship of the task is not satisfied, step S400 is executed: after the dependency relationship of the task is satisfied, the task identifier of the task is obtained.
Specifically, the task decomposition device 110 of the task scheduler 100 determines whether the data dependency relationship of the task is satisfied according to the configuration information of the task. More specifically, the task decomposition device 110 may analyze the configuration information of the task to obtain the dependency relationship information of the task, determine whether the dependency relationship of the task is satisfied according to that information, and thereby determine whether to send a task registration request to the state monitoring device 400 to obtain a task identifier for the task. The state monitoring device 400 may allocate a task identifier to the task according to the received task registration request, completing the registration of the task. The state monitoring device 400 then transmits the task identifier to the task decomposition device 110, and the task holding the task identifier is updated to the to-be-scheduled state by the state controller 130.
Step S500: after the task obtains the task identifier, splitting the task into a plurality of jobs to obtain the decomposition information of the task.
Specifically, after the task obtains the task identifier, the task decomposition device 110 splits the task into a plurality of jobs to obtain the decomposition information of the task. Optionally, the task decomposition device 110 decomposes the task according to the task decomposition number and the job size in the configuration information of the task, where the task decomposition number and the job size can be used by the data divider to decompose the corresponding task into a plurality of jobs. Optionally, the task decomposition number is 2^n, where n is a positive integer, and the job size is an integral multiple of the processing capacity of the processor.
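As a toy illustration of this optional rule (not the patent's own algorithm), the sketch below splits a task into 2^n jobs and rounds each job size up to a multiple of an assumed per-pass processing capacity.

```python
def decompose(task_size, n, capacity):
    # Task decomposition number is 2^n, with n a positive integer.
    num_jobs = 2 ** n
    job_size = task_size // num_jobs
    # Round each job up to an integral multiple of the processing capacity.
    job_size = ((job_size + capacity - 1) // capacity) * capacity
    return [job_size] * num_jobs  # the decomposition information

print(decompose(task_size=1024, n=3, capacity=64))  # 8 jobs of size 128
```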
Step S600: acquiring all the task information of the task, and processing the task according to all the task information of the task and the decomposition information of the task.
Specifically, the task scheduling device of the task scheduler 100 acquires all the task information of the task and the decomposition information of the task, and then transmits them to the processor. After receiving all the task information of the task and the decomposition information of the task, the processor splits the task into a plurality of jobs according to the decomposition information, and then distributes the jobs for processing.
The task processing method provided by this embodiment includes: after the dependency relationship of the task is satisfied, the task is first registered; the task is then decomposed into a plurality of jobs to obtain the decomposition information of the task; finally, the processor processes the task according to the decomposition information of the task and all the task information. With this task processing method, the task can be split according to the processing capacity of the processor, and after the task is split into a plurality of jobs, it can also be processed by a plurality of processors, so that the task can be matched with a processor as soon as possible, which improves the processing efficiency of the task.
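The overall control flow of steps S100 to S600 can be condensed into a short sketch; the classes and method names below are illustrative stand-ins for the scheduler components, not the patent's API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    data: list
    num_jobs: int = 4      # task decomposition number from the configuration information
    pre_task: str = None   # dependency relationship information

class StateMonitor:
    # Stand-in for the state monitoring device 400.
    def __init__(self):
        self.finished, self.next_id = set(), 0

    def register(self, task):
        # S300: allocate a task identifier to complete the registration.
        self.next_id += 1
        return self.next_id

    def is_finished(self, name):
        return name in self.finished

def process(task, monitor):
    # S100: acquire configuration information (carried on the task object here).
    # S200/S400: check the data dependency relationship before registering.
    if task.pre_task is not None and not monitor.is_finished(task.pre_task):
        raise RuntimeError("pre-task not finished; wait before registering")
    task_id = monitor.register(task)  # S300
    # S500: split the task into a plurality of jobs (decomposition information).
    jobs = [task.data[i::task.num_jobs] for i in range(task.num_jobs)]
    # S600: all task information plus the jobs would be handed to the processor.
    return task_id, jobs

monitor = StateMonitor()
print(process(Task("conv1", list(range(16))), monitor))
```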
As an alternative implementation, as shown in fig. 14, the step S200 includes:
Step S210a: analyzing the configuration information of the task to obtain the dependency relationship information of the task.
Step S220a: judging whether the task has a pre-task according to the dependency relationship information; if the task does not have a pre-task, the dependency relationship of the task is satisfied.
Specifically, when determining whether the data dependency relationship of the task is satisfied according to the configuration information of the task, the task decomposition device 110 first determines whether the task has a pre-task; if not, the task decomposition device 110 may directly determine that the dependency relationship of the task is satisfied. After the dependency relationship of the task is satisfied, the task decomposition device 110 sends a task registration request to the state monitoring device 400 to obtain the task identifier of the task.
As an alternative implementation, as shown in fig. 15, the step S200 includes:
Step S210b: analyzing the configuration information of the task to obtain the dependency relationship information of the task.
Step S220b: judging whether the task has a pre-task according to the dependency relationship information; if the task has a pre-task, judging whether the pre-task has finished executing; if the pre-task has finished executing, the dependency relationship of the task is satisfied.
Specifically, when determining whether the data dependency relationship of the task is satisfied according to the configuration information of the task, the task decomposition device 110 first determines whether the task has a pre-task. If the task has a pre-task, the task decomposition device 110 sends a query request to the state monitoring device 400. The state monitoring device 400 queries whether the pre-task of the task has finished executing according to the query request and, once the pre-task has finished executing, feeds back a message to that effect to the task decomposition device 110. The task decomposition device 110 then determines from this message that the dependency relationship of the task is satisfied.
Further, the state monitoring device 400 may check whether the pre-task of the task has finished executing through a preset check bit. Specifically, when the configuration information of each task is set in advance, a dependency check bit corresponding to the task is set correspondingly, and the dependency check bit corresponding to each task can indicate whether that task has finished executing. More specifically, when querying whether the pre-task has finished executing, the state monitoring device 400 first determines the dependency check bit corresponding to the pre-task, and then determines whether the pre-task has finished executing according to the value of that check bit.
Further, when the query shows that the pre-task has not finished executing, the state monitoring device 400 may monitor whether the pre-task finishes executing, that is, monitor whether the corresponding dependency check bit is updated, and determine that the pre-task has finished executing after detecting that the corresponding dependency check bit has been updated.
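A minimal sketch of such a dependency check bit follows, assuming one flag per task that is set on completion and waited on by dependent tasks; the Event-based waiting is an illustrative assumption, not the patent's hardware mechanism.

```python
import threading

class DependencyCheckBits:
    def __init__(self, task_names):
        # One dependency check bit per task, preset with the configuration.
        self._bits = {name: threading.Event() for name in task_names}

    def mark_finished(self, name):
        # Update the check bit when the task finishes executing.
        self._bits[name].set()

    def is_finished(self, name):
        # Query: determine completion from the value of the check bit.
        return self._bits[name].is_set()

    def wait_finished(self, name):
        # Monitor the check bit until it is updated.
        self._bits[name].wait()

bits = DependencyCheckBits(["taskA", "taskB"])
bits.mark_finished("taskA")
assert bits.is_finished("taskA") and not bits.is_finished("taskB")
```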
In one embodiment, as shown in fig. 16, before step S100, the method further includes:
Step S700a: monitoring the task queue, and if there is a task to be transmitted in the task queue, sending an information acquisition request to the task queue.
Step S800a: receiving the configuration information of the task returned by the task queue according to the information acquisition request.
Specifically and optionally, when tasks are stored in a single queue, the task decomposition device 110 may monitor the task queue; if there is a task in the to-be-transmitted state in the task queue, it sends an information acquisition request to the task queue and receives the configuration information of the task returned by the task queue according to the information acquisition request.
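As a small illustration of the single-queue case, using Python's queue.Queue as an assumed stand-in for the task queue:

```python
from queue import Queue, Empty

def poll_task_queue(task_queue):
    try:
        task = task_queue.get_nowait()   # a task in the to-be-transmitted state
    except Empty:
        return None                      # nothing waiting to be transmitted
    # The queue returns the configuration information on the acquisition request.
    return task["config"]

q = Queue()
q.put({"name": "task0", "config": {"num_jobs": 8, "pre_task": None}})
print(poll_task_queue(q))
```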
In another alternative embodiment, as shown in fig. 17, before step S100, the method further includes:
Step S700b: monitoring the task queues, determining a target queue according to whether a task in the to-be-transmitted state exists in each task queue, and acquiring the queue identifier of the target queue.
Step S800b: sending an information acquisition request to the target queue according to the index of the queue identifier of the target queue.
Step S900b: receiving the configuration information of the task returned by the target queue according to the information acquisition request.
Specifically, when a plurality of queues are used to store tasks, the task decomposition device 110 may monitor the task queues and determine a target queue according to whether a task in the to-be-transmitted state exists in each task queue; acquire the queue identifier of the target queue; send an information acquisition request to the target queue according to the index of the queue identifier; and receive the configuration information of the task returned by the target queue according to the information acquisition request.
The method for acquiring the task and its configuration information provided by this embodiment can ensure that the correct task and configuration information are acquired in order when tasks stored in a plurality of queues are processed.
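The multi-queue case can be sketched the same way, assuming a mapping from queue identifiers to queues; the layout is an illustration, not the patent's data structure.

```python
from queue import Queue, Empty

def fetch_from_target(queues):
    # Scan the task queues and take the first one holding a task in the
    # to-be-transmitted state as the target queue.
    for queue_id, q in queues.items():
        try:
            task = q.get_nowait()
        except Empty:
            continue
        # The queue identifier indexes the target queue for the request.
        return queue_id, task["config"]
    return None

queues = {0: Queue(), 1: Queue()}
queues[1].put({"name": "task7", "config": {"num_jobs": 4, "pre_task": "task3"}})
print(fetch_from_target(queues))  # configuration returned from target queue 1
```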
As an optional implementation manner, after step S600, the method further includes: receiving end information of the plurality of jobs of the task, and writing the end information of the plurality of jobs into a cache.
Specifically, after the task has been executed, the state monitoring device 400 receives the end information of the plurality of jobs of the task and writes the end information of the plurality of jobs into the cache.
As an optional implementation manner, the step of writing the end information of the plurality of jobs into the cache includes: determining the blocking interval where the task is located, and determining the previous blocking interval according to that blocking interval; and, after the end information of all the jobs in the previous blocking interval has been written into the cache, writing the end information of the plurality of jobs of the task into the cache.
Specifically, the state monitoring device 400 determines the blocking interval where the task is located and determines the previous blocking interval according to it; after the end information of all the jobs in the previous blocking interval has been written into the cache, it writes the end information of the plurality of jobs of the task into the cache. More specifically, if the end information of all the jobs in the previous blocking interval has already been written into the cache, the end information of the current task may be written directly; if the end information of the jobs in the previous blocking interval has not been written, or has not been completely written, into the cache, the end information of the current task can only be written after the end information of all the jobs in the previous blocking interval has been completely written. Here, all tasks from one blocking task to the next constitute one blocking interval. A blocking task is a task that calls a blocking primitive to block itself and is awakened after the end information of its corresponding pre-task has been written into a specified position. It should be noted that if a task can be executed without depending on the processing results of other tasks, its end information can be written into the cache directly, without determining the blocking interval of the current task or performing the subsequent processing.
The method for writing the end information of the task into the cache provided by this embodiment can ensure that the data referenced by blocking tasks is accurate and that the execution result of the task is correct.
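A compact sketch of this ordered write follows; the interval bookkeeping is an illustrative assumption rather than the patent's cache layout.

```python
class EndInfoCache:
    def __init__(self, jobs_per_interval):
        self.jobs_per_interval = jobs_per_interval  # expected job count per blocking interval
        self.written = {}                           # interval -> end information records
        self.cache = []

    def write(self, interval, job_end_info):
        prev = interval - 1
        # Hold back the write until the previous blocking interval is complete.
        if prev >= 0 and len(self.written.get(prev, [])) < self.jobs_per_interval[prev]:
            raise RuntimeError(f"interval {prev} incomplete; buffer and retry later")
        self.written.setdefault(interval, []).append(job_end_info)
        self.cache.append((interval, job_end_info))

c = EndInfoCache({0: 1, 1: 2})
c.write(0, "job0 done")   # interval 0 is now completely written
c.write(1, "job1 done")   # allowed: the previous interval is complete
```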
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A task processing method, comprising the steps of:
acquiring configuration information of a task;
judging whether the data dependency relationship of the task is met or not according to the configuration information of the task;
if the data dependency relationship of the task is satisfied, acquiring a task identifier of the task;
after the task obtains the task identifier, splitting the task into a plurality of jobs to obtain the decomposition information of the task;
and acquiring all information of the task, and processing the task according to the all information of the task and the decomposition information of the task.
2. The method according to claim 1, wherein the step of determining whether the data dependency relationship of the task is satisfied according to the configuration information of the task comprises:
analyzing the configuration information of the task to obtain the dependency relationship information of the task;
and judging whether the task has a pre-task according to the dependency relationship information, wherein if the task does not have a pre-task, the dependency relationship of the task is satisfied.
3. The method of claim 2, further comprising:
if the task has a pre-task, judging whether the pre-task has finished executing, wherein if the pre-task has finished executing, the dependency relationship of the task is satisfied.
4. The method of claim 1, wherein prior to the step of obtaining configuration information for the task, the method further comprises:
monitoring a task queue, and if a task in a to-be-transmitted state exists in the task queue, sending an information acquisition request to the task queue;
and receiving the configuration information of the task returned by the task queue according to the information acquisition request.
5. The method of claim 1, wherein prior to the step of obtaining configuration information for the task, the method further comprises:
monitoring a task queue, and determining a target queue according to whether a task in a to-be-transmitted state exists in the task queue;
acquiring a queue identifier of the target queue;
sending an information acquisition request to the target queue according to the index of the queue identifier of the target queue;
and receiving the configuration information of the task returned by the target queue according to the information acquisition request.
6. The method of claim 1, wherein after the step of processing the task according to all the information of the task and the decomposition information of the task, the method further comprises:
and receiving end information of a plurality of jobs of the task, and writing the end information of the plurality of jobs into a cache.
7. The method of claim 6, wherein writing end information of the plurality of jobs to a cache comprises:
determining a blocking interval where the task is located, and determining a previous blocking interval according to the blocking interval;
and after the end information of all the operations in the last blocking interval is written into the cache, writing the end information of a plurality of operations of the task into the cache.
8. A task scheduler, comprising a task decomposition device and a task scheduling device, wherein:
The task decomposition device is used for acquiring configuration information of a task and judging whether the data dependency relationship of the task is met or not according to the configuration information of the task;
the task decomposition device is further configured to obtain a task identifier of the task when the dependency relationship of the task is satisfied, and split the task into a plurality of jobs after the task obtains the task identifier, so as to obtain decomposition information of the task;
the task scheduling device is connected with the task decomposition device and used for acquiring all information of the tasks and decomposition information of the tasks and sending the acquired all information and decomposition information of the tasks to the processor.
9. A task processing apparatus, comprising the task scheduler according to claim 8 and a processor,
and the processor is used for processing the task according to all the information of the task and the decomposition information.
10. The apparatus of claim 9, wherein the processor is configured to perform machine learning calculations, and wherein the processor comprises: an arithmetic unit and a controller unit, the arithmetic unit comprising: a master processing circuit and a plurality of slave processing circuits;
the controller unit is used for acquiring data, a machine learning model and a calculation instruction;
the controller unit is further configured to analyze the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the data to the main processing circuit;
the main processing circuit is used for performing preamble processing on the data and for transmitting data and operation instructions to and from the plurality of slave processing circuits;
the plurality of slave processing circuits are used for executing intermediate operation in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results and transmitting the plurality of intermediate results to the master processing circuit;
and the main processing circuit is used for executing subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
CN201811179227.1A 2018-10-10 2018-10-10 Task processing method, task scheduler and task processing device Active CN111026540B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811179227.1A CN111026540B (en) 2018-10-10 2018-10-10 Task processing method, task scheduler and task processing device
PCT/CN2019/110273 WO2020073938A1 (en) 2018-10-10 2019-10-10 Task scheduler, task processing system, and task processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811179227.1A CN111026540B (en) 2018-10-10 2018-10-10 Task processing method, task scheduler and task processing device

Publications (2)

Publication Number Publication Date
CN111026540A true CN111026540A (en) 2020-04-17
CN111026540B CN111026540B (en) 2022-12-02

Family

ID=70191847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811179227.1A Active CN111026540B (en) 2018-10-10 2018-10-10 Task processing method, task scheduler and task processing device

Country Status (1)

Country Link
CN (1) CN111026540B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160098292A1 (en) * 2014-10-03 2016-04-07 Microsoft Corporation Job scheduling using expected server performance information
CN107870948A (en) * 2016-09-28 2018-04-03 平安科技(深圳)有限公司 Method for scheduling task and device
CN108182111A (en) * 2018-01-23 2018-06-19 百度在线网络技术(北京)有限公司 Task scheduling system, method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李橙 (Li Cheng): "Research on Task Scheduling Management in Embedded MPSoC Systems", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114090466A (en) * 2021-11-29 2022-02-25 上海阵量智能科技有限公司 Instruction processing device and method, computer equipment and storage medium
WO2023093260A1 (en) * 2021-11-29 2023-06-01 上海商汤智能科技有限公司 Instruction processing apparatus and method, and computer device and storage medium
CN114880101A (en) * 2022-07-01 2022-08-09 成都登临科技有限公司 AI treater, electronic part and electronic equipment
CN115237582A (en) * 2022-09-22 2022-10-25 摩尔线程智能科技(北京)有限责任公司 Method for processing multiple tasks, processing equipment and heterogeneous computing system
CN115237582B (en) * 2022-09-22 2022-12-09 摩尔线程智能科技(北京)有限责任公司 Method for processing multiple tasks, processing equipment and heterogeneous computing system
TWI831729B (en) * 2022-09-22 2024-02-01 大陸商摩爾線程智能科技(北京)有限責任公司 Method for processing multiple tasks, processing device and heterogeneous computing system

Also Published As

Publication number Publication date
CN111026540B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN111026540B (en) Task processing method, task scheduler and task processing device
Huang et al. Locally weighted ensemble clustering
CN109711539B (en) Operation method, device and related product
US20200249998A1 (en) Scheduling computation graph heterogeneous computer system
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN111178494A (en) Neural processing unit, neural processing system and application system
CN111026521B (en) Task scheduler, task processing system and task processing method
US11755683B2 (en) Flexible accelerator for sparse tensors (FAST) in machine learning
WO2020063184A1 (en) Chip and chip-based data processing method
US11556756B2 (en) Computation graph mapping in heterogeneous computer system
CN111026518B (en) Task scheduling method
CN111026523A (en) Task scheduling control method, task scheduler and task processing device
US20240338176A1 (en) Processing core with data associative adaptive rounding
CN110825380A (en) Kernel function generation method, target code generation method and combined processing device
CN111026517B (en) Task decomposition device and task scheduler
Mishra et al. Artificial Intelligence and Hardware Accelerators
EP3719662A2 (en) Systems and methods for reconfigurable systolic arrays
CN111026520B (en) Task processing method, control device of processor and processor
CN111026516B (en) Exception handling method, task assigning apparatus, task handling system, and storage medium
CN111026515B (en) State monitoring device, task scheduler and state monitoring method
CN111026522A (en) Task scheduling device, task scheduler, and task processing device
CN111274023A (en) Data processing method, device, computer system and storage medium
CN111026513B (en) Task assigning device, task scheduler, and task processing method
WO2023073824A1 (en) Deep learning inference system and inference serving method
US20200320395A1 (en) Methods and devices for optimizing machine learning model compactness and accuracy through hardware latency hysteresis effect

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant