WO2022171002A1 - Task processing method and apparatus, many-core system, and computer-readable medium - Google Patents


Info

Publication number
WO2022171002A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
tasks
core
processing
graph
Application number
PCT/CN2022/074490
Other languages
French (fr)
Chinese (zh)
Inventor
施路平
张伟豪
林俊峰
祝夭龙
Original Assignee
北京灵汐科技有限公司
Priority claimed from CN202110184918.6A external-priority patent/CN112835718A/en
Priority claimed from CN202110184939.8A external-priority patent/CN112835719B/en
Application filed by 北京灵汐科技有限公司 filed Critical 北京灵汐科技有限公司
Publication of WO2022171002A1 publication Critical patent/WO2022171002A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • the present disclosure relates to the field of many-core technologies, and in particular, to a method and apparatus for task processing, a many-core system, and a computer-readable medium.
  • a many-core system includes multiple processing cores (also called cores, kernels, or processing engines), and the processing cores can exchange information with each other through routing.
  • a many-core system solves a problem to be processed through an electronic computing process, which is essentially a process in which multiple tasks corresponding to the problem to be solved are mapped (or allocated) to different processing cores, and each processing core processes its tasks separately.
  • some processing cores in a many-core system will inevitably enter an invalid state (for example, due to failure); it is therefore important that the system can still obtain a usable processing result, to a certain degree, when some of its processing cores are invalid.
  • Embodiments of the present disclosure provide a task processing method and apparatus, a many-core system, and a computer-readable medium.
  • an embodiment of the present disclosure provides a task processing method, including:
  • the computational graph includes multiple layers, each layer includes multiple tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and in at least some layers, at least some tasks are executed based on the results of tasks in the previous layer;
  • each task group includes tasks of at least one layer
  • a mapping relationship between the tasks of each task group and each processing core of the many-core system is determined; in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
  • an apparatus for task processing including:
  • an acquisition module configured to acquire a calculation graph of the problem to be processed;
  • the calculation graph includes multiple layers, each layer includes multiple tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layer;
  • a block module configured to divide the computational graph into a plurality of task groups; each task group includes tasks of at least one layer;
  • the mapping module is configured to determine the mapping relationship between the tasks of each task group and each processing core of the many-core system; in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
  • an embodiment of the present disclosure provides a many-core system, including:
  • an on-chip network configured to exchange data among the plurality of processing cores and with external devices;
  • one or more of the processing cores store one or more instructions, and the one or more instructions are executed by the one or more processing cores, so that the one or more processing cores can execute any one of the above task processing methods.
  • an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, wherein the computer program implements any one of the above task processing methods when executed by a processing core.
  • an embodiment of the present disclosure provides a computer program product, including a computer program, and when the computer program is executed by a processor, any one of the foregoing task processing methods is implemented.
  • in the embodiments of the present disclosure, the tasks in the calculation graph are divided into multiple task groups, each task group includes the tasks of at least one layer, the tasks of the same layer can be mapped to at least two different processing cores for processing, and tasks of different layers can be mapped to the same core group. Thus, when any processing core is invalid (for example, due to failure), at most part of a layer of the computational graph is "broken", and all tasks of a layer will never be invalid at once, so the computational graph as a whole can still obtain a usable processing result to a certain degree, which greatly improves the robustness of the computational graph.
  • FIG. 1 is a flowchart of a method for task processing provided by an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for task processing provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of another task processing method provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a process of processing a computation graph in an embodiment of the present disclosure
  • FIG. 5 is a flowchart of a method for task processing after step S205 in an embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram of a partition of a computation graph and a routing relationship of tasks therein in an embodiment of the present disclosure
  • FIG. 7 is a schematic diagram of a process of processing a computation graph in an embodiment of the present disclosure.
  • FIG. 8 is a block diagram of a composition of an apparatus for task processing provided by an embodiment of the present disclosure.
  • FIG. 9 is a block diagram of the composition of a many-core system according to an embodiment of the present disclosure.
  • computational graph or task graph, logic graph
  • tasks or nodes
  • each task includes a certain operation, and there is a certain order among different tasks. For example, if the operation of a task uses the operation results of other tasks, the task is said to be executed based on the results of those tasks; equivalently, the task is a subsequent task of those tasks, and those tasks are previous tasks of this task.
  • the computational graph can be divided into multiple "layers", each layer includes multiple tasks, the tasks in any layer are not executed based on tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layers. That is, if a task in a previous layer is not completed, a task in a later layer may not be executable, because the operation of the later task may use the operation result of the earlier task; conversely, if a task in a later layer cannot be executed, the tasks in previous layers are not affected, because their operations never use the results of later tasks. Tasks in the same layer have no dependency (subordination) relationship with each other, because if such a relationship existed, the corresponding tasks should belong to two different layers.
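  • The layer structure described above can be recovered from the task dependencies themselves. The following sketch is an illustration only (the representation of the graph as a task-to-predecessors mapping is an assumption, not part of the disclosure); it assigns each task a layer one greater than the deepest layer among its previous tasks:

```python
from collections import defaultdict

def assign_layers(predecessors):
    """Assign each task to a layer: tasks with no previous tasks form
    layer 0, and any other task sits one layer below (after) the deepest
    of its previous tasks."""
    layer = {}

    def depth(task):
        if task not in layer:
            preds = predecessors.get(task, ())
            layer[task] = 0 if not preds else 1 + max(depth(p) for p in preds)
        return layer[task]

    for t in predecessors:
        depth(t)
    layers = defaultdict(list)
    for t, l in layer.items():
        layers[l].append(t)
    return dict(layers)

# Hypothetical example: C and D use the results of A and B; E uses C and D.
g = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"A", "B"}, "E": {"C", "D"}}
```

By construction, no task's operation depends on a task in the same or a later layer, which matches the layer property stated above.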
  • a "neural network (NN)" is a form of computational graph.
  • the neural network is divided into multiple layers, each layer includes multiple nodes, a certain operation needs to be performed in each node, and the nodes of different layers are connected by certain relationships (for example, the output of one node is used as the input of nodes in the next layer);
  • each layer of the neural network can be regarded as a layer of the computational graph, and each node of the neural network can be regarded as a task of the computational graph.
  • the neural network in the embodiment of the present disclosure can be used for image processing, speech recognition, etc., which can specifically be in the form of a convolutional neural network (CNN), a spiking neural network (SNN), a recurrent neural network (RNN), and the like.
  • CNN convolutional neural network
  • SNN spiking neural network
  • RNN recurrent neural network
  • some problems may correspond to multiple different computational graphs. That is, the number of tasks in the calculation graph, the layers where the tasks are located, the relationships between tasks, and the specific operation of each task can all differ, yet these different calculation graphs can each solve the problem (though not necessarily equally well).
  • "trainable computational graph": for a computational graph that can solve a problem, the tasks in it can be adjusted through training, so that the trained computational graph solves the problem with a different effect.
  • a neural network is a form of trainable computational graph.
  • a neural network dealing with a problem such as image classification
  • a problem such as image classification
  • a neural network dealing with a problem is usually trained by adjusting its nodes (such as adjusting the weights of nodes) according to the effect of the current neural network on the problem (such as the accuracy of image classification), thereby changing the neural network (computational graph) and improving its performance on the problem (such as improving the accuracy of image classification).
  • in some related technologies, the tasks of each layer of the corresponding computation graph can be mapped (assigned) to one processing core, and the tasks of different layers can be mapped to different processing cores.
  • an embodiment of the present disclosure provides a task processing method.
  • the method is based on a many-core system and includes how to map the tasks of a computational graph to the processing cores of the many-core system.
  • the task processing method of the embodiment of the present disclosure includes:
  • S101 Acquire a computation graph of the problem to be processed; wherein the computation graph includes multiple layers, each layer includes multiple tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layer.
  • S102 Divide the computation graph into multiple task groups; each task group includes tasks of at least one layer.
  • S103 Determine the mapping relationship between the tasks of each task group and each processing core of the many-core system; in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
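  • Steps S102 and S103 above can be sketched as follows. This is a minimal illustration only: the policy of grouping consecutive layers and spreading tasks round-robin over a core group is an assumption for the example, not the disclosed method itself.

```python
def group_layers(layers, layers_per_group):
    """S102: divide the computation graph into task groups, each task
    group containing the tasks of at least one layer."""
    groups = []
    for i in range(0, len(layers), layers_per_group):
        # flatten the tasks of the layers assigned to this group
        groups.append([t for layer in layers[i:i + layers_per_group] for t in layer])
    return groups

def map_groups_to_core_groups(groups, core_groups):
    """S103: each task group corresponds to one core group; tasks are
    spread round-robin over the cores of that group, so one core group
    may hold tasks of several layers."""
    mapping = {}
    for group, cores in zip(groups, core_groups):
        for i, task in enumerate(group):
            mapping[task] = cores[i % len(cores)]
    return mapping
```

With this spread, tasks of one layer land on different cores of the group, and tasks of different layers share a core group, as described below.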
  • a problem to be processed such as image processing, speech recognition, etc.
  • its corresponding computational graph is obtained.
  • a pre-set calculation graph can be obtained; the calculation graph can also be generated from the specific problem to be processed according to a predetermined rule.
  • in some embodiments, the task group includes task blocks, and the task blocks are obtained by intra-layer partitioning of the computation graph; that is, each layer in the computation graph is divided into multiple task blocks, and each task block includes at least one task.
  • in some embodiments, the task group includes a task area, and the task area is obtained by inter-layer partitioning of the computation graph; that is, the tasks in the computation graph are divided into multiple task areas, and each task area includes tasks of at least two layers.
  • in the embodiments of the present disclosure, the tasks in the calculation graph are divided into multiple task groups, each task group includes the tasks of at least one layer, the tasks of the same layer can be mapped to at least two different processing cores for processing, and tasks of different layers can be mapped to the same core group. Thus, when any processing core is invalid (for example, due to failure), at most part of a layer of the computational graph is "broken", and all tasks of a layer will never be invalid at once, so the computational graph as a whole can still obtain a usable processing result to a certain degree, which greatly improves the robustness of the computational graph.
  • FIG. 2 is a flowchart of a task processing method provided by an embodiment of the present disclosure.
  • the task processing method according to the embodiment of the present disclosure includes:
  • S201 Acquire a computation graph of the problem to be processed; wherein the computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layer.
  • a problem to be processed such as image processing, speech recognition, etc.
  • its corresponding computational graph is obtained.
  • a pre-set calculation graph can be obtained; the calculation graph can also be generated from the specific problem to be processed according to a predetermined rule.
  • S203 Divide each layer of the computation graph into multiple task blocks; each task block includes at least one task.
  • That is, each layer of the computation graph is divided into multiple "groups", i.e., multiple "task blocks", and each task block includes one or more tasks of that layer.
  • the number of task blocks into which each layer is divided may be preset; for example, it may be preset that each layer is divided into a task blocks (a being a preset number).
  • blocking can also be performed according to a determined method, and the actual number of task blocks obtained shall prevail.
  • S205 Determine a first mapping relationship between each task block and multiple processing cores of the many-core system.
  • each task block is mapped to one processing core, each processing core may be mapped with multiple task blocks, and all the task blocks of any layer are mapped to at least two different processing cores.
  • that is, the mapping is performed in a "scattered" manner: all task blocks in the same layer should be mapped to different processing cores as separately as possible, at least ensuring that they are not all mapped to the same processing core. Since each task block only includes tasks of one layer, and the task blocks of the same layer are mapped to different processing cores, the above first mapping relationship ensures that the tasks of the same layer are mapped to at least two different processing cores.
  • in the embodiments of the present disclosure, the tasks of the same layer of the computation graph are mapped to at least two different processing cores for processing, so that when any processing core is invalid (for example, due to a fault), at most part of a layer of the computation graph is "broken", and all tasks of a layer will never be invalid at once; the computation graph as a whole can therefore still obtain a usable processing result to a certain degree, which greatly improves the robustness of the computation graph.
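  • One simple way to obtain such a "scattered" first mapping is to offset the core index by the layer index. This scheme is an assumption for illustration; any assignment that keeps the blocks of one layer on at least two cores satisfies the relationship described above.

```python
def scatter_map(blocks_per_layer, num_cores):
    """First mapping relationship (S205): every task block is mapped to
    exactly one core, and the task blocks of any single layer land on at
    least two different cores (given num_cores >= 2 and at least two
    blocks per layer)."""
    if num_cores < 2:
        raise ValueError("need at least two processing cores")
    mapping = {}
    for layer_idx, blocks in enumerate(blocks_per_layer):
        for block_idx in range(len(blocks)):
            # offsetting by the layer index interleaves consecutive layers
            mapping[(layer_idx, block_idx)] = (layer_idx + block_idx) % num_cores
    return mapping
```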
  • the computational graph is a trainable computational graph.
  • a trainable computational graph can still solve the same problem to be processed when at least some of its tasks are changed.
  • the computational graph is a neural network (NN).
  • in some embodiments, the above computation graph is a trainable computation graph, and further is a neural network, such as a convolutional neural network (CNN), a spiking neural network (SNN), or a recurrent neural network (RNN), so that the computational graph can be trained to further improve its redundancy performance.
  • CNN convolutional neural network
  • SNN spiking neural network
  • RNN recurrent neural network
  • the computation graph in the embodiments of the present disclosure is not limited to a trainable computation graph (e.g., a neural network); other computation graphs can also be used.
  • any two task blocks of any layer are respectively mapped to two different processing cores.
  • that is, a maximum scattered (or "mutually exclusive") mapping can be performed, which ensures that no two task blocks from the same layer are mapped to the same processing core.
  • the method further includes:
  • that is, the computational graph can also be trained (S202) before it is divided into blocks, to improve its redundancy performance.
  • training the computation graph (S202) includes at least one of the following:
  • for example, the Dropout method can be used to invalidate some tasks in the calculation graph (for example, setting the weights of some nodes of the neural network to 0) and adjust the other tasks, so that the calculation graph still produces a somewhat usable result when those tasks are invalid, which improves its robustness.
  • for example, the Dropblock method can be used to invalidate all tasks located in a region of the calculation graph (a region includes multiple tasks, and may cover part of one layer or corresponding parts of multiple layers) and adjust the other tasks, so that the computational graph still produces a somewhat usable result when the tasks in that region are invalid, improving its robustness.
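  • The two perturbations above can be sketched as follows. This is illustrative only: real Dropout/Dropblock training would re-adjust the surviving weights after each perturbation, which is omitted here, and the list-of-weights representation of tasks is an assumption.

```python
import random

def dropout_tasks(weights, p, rng=None):
    """Dropout-style step: invalidate each task independently with
    probability p by zeroing its weight."""
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < p else w for w in weights]

def dropblock_tasks(weights, start, length):
    """Dropblock-style step: invalidate a whole contiguous region of
    tasks at once."""
    out = list(weights)
    for i in range(start, min(start + length, len(out))):
        out[i] = 0.0
    return out
```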
  • the "adversarial sample defense" method can be used to train the computational graph.
  • all computation graphs to be trained must be trainable computation graphs (e.g., neural networks).
  • each type of training may be performed only once, or may be performed multiple times in a loop.
  • the specific methods used for each training session may be the same or different.
  • the training can be ended when a preset end criterion is reached, and the end criterion may include the convergence of the computation graph, reaching a predetermined number of training times, reaching a predetermined redundancy performance, and the like.
  • the training method of the computational graph is not limited in the embodiment of the present disclosure, and any method capable of realizing computational graph training can be used for training the computational graph.
  • in some embodiments, dividing each layer of the computational graph into multiple task blocks includes: expanding the computational graph, and dividing each layer of the expanded computational graph into multiple task blocks.
  • the expansion includes adding redundant tasks to at least some layers of the computation graph.
  • that is, the expanded computation graph is divided into blocks, so that at least some of the resulting task blocks include the above "redundant tasks", thereby improving redundancy performance.
  • the specific "expansion amount" of each layer may be preset.
  • for example, it may be preset that the computation amount of the tasks expanded into each layer is b times the computation amount of the original tasks of that layer; generally speaking, b can be greater than 0 and less than or equal to 1 (b greater than 1 is also feasible; for example, there can be "multiple copies" of task backups).
  • each layer may be expanded according to a certain method, and the actual task amount obtained by the expansion according to this method shall prevail.
  • redundant tasks include at least one of the following:
  • Backup tasks: each expanded task may be a copy of an original task in its layer, so that there are actually "multiple copies" of the corresponding task, which can serve as backups for each other. That is, when a task is not completed (e.g., its processing core fails), subsequent tasks can be executed using the operation result of its backup, improving robustness.
  • in this case, a task and its backups should usually be divided into different task blocks, and these task blocks should be mapped to different processing cores.
  • Invalid tasks: as one way of the embodiments of the present disclosure, the expansion can add tasks that need to be computed but whose results are not required by the operations of the original calculation graph, that is, invalid tasks. The invalid tasks may be generated randomly, or may be generated by other specific techniques.
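  • As an illustration of expansion by backup tasks (the fraction-b duplication policy below is an assumption for the example; the disclosure only requires that redundant tasks be added to at least some layers):

```python
def expand_layer(tasks, b):
    """Expand one layer with redundant backup tasks whose total amount
    is roughly b times that of the original tasks (0 < b <= 1 backs up
    a fraction of the layer; b > 1 would create multiple full backups)."""
    n_extra = int(round(b * len(tasks)))
    # cycle through the original tasks until n_extra backups are made
    backups = [(tasks[i % len(tasks)], "backup") for i in range(n_extra)]
    return list(tasks) + backups
```

A task and its backup would then be placed in different task blocks, as noted above.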
  • the expansion modes of different layers may be the same or different.
  • dividing each layer of the computation graph into a plurality of task blocks (S203) includes any one of the following:
  • for example, the tasks in each layer may be randomly "blocked"; that is, the number of tasks in each task block, and which tasks they are, are random.
  • for example, each layer may be evenly divided into multiple task blocks; that is, the numbers of tasks in the task blocks of the same layer are equal or substantially equal (for example, among all task blocks of the same layer, if the number of tasks in the block with the fewest tasks is taken as 100%, the number of tasks in the block with the most tasks does not exceed 110%).
  • for example, the subsequent "mapping" can be considered when dividing blocks; that is, each layer is first divided into multiple "pre-task blocks", and then, if multiple pre-task blocks will be mapped to the same processing core, they can be directly merged into one task block.
  • for example, when dividing each layer of the computation graph into multiple task blocks, the actual hardware resources of the processing cores, as well as the subsequent "mapping", may also be considered, so that which tasks are grouped into each task block is determined according to the hardware resources of the processing core to which that task block will be mapped.
  • the hardware resources of different processing cores may be the same or different.
  • block mode of different layers may be the same or different.
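  • The "even" blocking option above can be sketched as follows (an illustration only; block sizes differ by at most one task, which satisfies the "substantially equal" criterion):

```python
def split_evenly(tasks, num_blocks):
    """Divide one layer's tasks into num_blocks task blocks whose sizes
    differ by at most one task."""
    base, extra = divmod(len(tasks), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more
        blocks.append(tasks[start:start + size])
        start += size
    return blocks
```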
  • any two task blocks of any one layer are mapped to two different processing cores according to the mapping relationship.
  • in some embodiments, after dividing each layer of the computation graph into multiple task blocks (S203) and before determining the first mapping relationship between each task block and the multiple processing cores of the many-core system (S205), the method further includes:
  • that is, one or more (but not all) task blocks can be invalidated (all tasks in them are made invalid), and the tasks in the other task blocks are adjusted, so that the remaining task blocks can still produce a somewhat usable result (i.e., the task blocks are trained), improving the robustness of the computational graph.
  • invalidating partial task blocks to train each task block (S204) includes at least one of the following:
  • for example, some task blocks may be randomly invalidated to train the remaining task blocks in the computation graph.
  • for example, "key tasks" that play a key role may be determined according to the structural features of the computational graph, and the key task blocks where the key tasks are located are invalidated to train the task blocks.
  • the method further includes:
  • S206 Invalidate all task blocks mapped to some of the processing cores, so as to train each task block and improve the redundancy performance of the calculation graph.
  • that is, all tasks mapped to some of the processing cores can be invalidated (equivalent to invalidating those processing cores), and the other tasks adjusted, so that the remaining tasks can still produce a somewhat usable result (i.e., the task blocks are trained), improving robustness.
  • in some embodiments, invalidating all task blocks mapped to some of the processing cores to train each task block (S206) includes at least one of the following:
  • for example, all task blocks mapped to some randomly selected processing cores may be invalidated to train each task block.
  • for example, all tasks mapped to one or more (but not all) processing cores may be invalidated (i.e., one or more processing cores are invalidated) to train each task block.
  • for example, the processing cores may be invalidated in sequence; that is, only one processing core is invalidated at a time, but eventually every processing core has been invalidated once.
  • for example, the key tasks may be determined according to the structure of the calculation graph, the key task blocks where the key tasks are located are then determined, and the processing cores corresponding to the key task blocks are invalidated to train each task block.
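  • The "invalidate each processing core in sequence" option might be sketched as a fault-injection loop. The `evaluate` and `adjust` callbacks below are placeholders assumed for illustration (standing in for measuring result quality and for the actual training step):

```python
def train_with_core_faults(core_to_blocks, evaluate, adjust):
    """Invalidate one processing core at a time (S206): all task blocks
    mapped to that core are treated as failed, the degraded computation
    graph is evaluated, and `adjust` trains the surviving blocks."""
    scores = {}
    for core in core_to_blocks:
        alive = {c: b for c, b in core_to_blocks.items() if c != core}
        scores[core] = evaluate(alive)        # result quality with this core dead
        adjust(alive, core_to_blocks[core])   # train the remaining blocks
    return scores
```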
  • the method further includes:
  • S207 Each processing core processes the tasks in the task blocks mapped to it.
  • that is, each task block can be actually mapped (or allocated) to its processing core according to the first mapping relationship and processed by that processing core, so as to realize the actual function of the calculation graph and solve the above problem to be processed.
  • in some embodiments, step S206 may be performed before step S207; that is, the tasks are not yet actually mapped to the processing cores, and instead the tasks corresponding to a processing core are invalidated according to the first mapping relationship, and the training is performed.
  • of course, step S206 can also be performed after step S207; that is, the tasks are actually mapped to the processing cores, and processing cores are actually invalidated for training.
  • all tasks in the same layer of the computation graph are mapped to at least two different processing cores for processing, which can improve the robustness of the computation graph.
  • however, in the above method, tasks of different layers must be located in different processing cores; therefore, for every subsequent task executed based on the results of previous tasks, the corresponding operation results must be transmitted across cores, from the processing core where the previous task is located to the processing core where the subsequent task is located.
  • as a result, there must be a large number of inter-core routes between the processing cores, resulting in a complex inter-core routing structure and a large amount of data transmitted by inter-core routing, which leads to performance degradation of the many-core system.
  • an embodiment of the present disclosure provides a task processing method.
  • the method divides tasks of different layers into task groups according to inter-layer partitioning, so as to reduce the routing of the computation graph.
  • Methods of task processing include:
  • S501 Acquire a computation graph of the problem to be processed; wherein the computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layer.
  • the calculation graph can also be obtained by other means; for example, a preset calculation graph may be obtained, or the calculation graph may be generated from the specific problem to be processed according to a predetermined rule.
  • wherein each task area includes multiple tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, and the internal routing strength is determined according to the proportion of the routes between tasks inside the task area among all the routes involving the tasks of the task area.
  • that is, the computation graph is divided into multiple task areas; each task of the computation graph is assigned to a corresponding task area.
  • for example, the calculation graph is divided into three task areas (task area 0 to task area 2); the number of tasks in each layer is represented by the vertical size of the filled area corresponding to the layer, and blank boxes represent the different processing cores (processing core 0 to processing core 2) and core groups (core group 0 to core group 2).
  • that is, the tasks of each task area cannot all come from one layer of the computation graph, but must come from at least two different layers; at the same time, the tasks of one layer cannot all be located in one task area, but must be divided into at least two different task areas.
  • Each task area is equivalent to taking some tasks from multiple layers of the computational graph.
  • "Routes": there are many "routes (or data paths)" connecting tasks in different layers of the computation graph; the operation result of a previous task needs to be transmitted to a subsequent task through a route, so that the subsequent task can use that result.
  • each task is typically connected to one or more routes (including incoming routes and outgoing routes), and these routes are the routes "involved" by the task. Among "all the routes" involved by all tasks of a task area (represented by dashed boxes in FIG. 6 and FIG. 7), a part of the routes have both ends connected to tasks inside the task area.
  • therefore, the "internal routing strength" of a task area can be calculated, and it reflects the relative degree of "internal interaction" among the tasks of the task area.
  • in the embodiments of the present disclosure, each task area must be a "strong internal routing area"; that is, the tasks in each task area mainly perform "internal interactions" and have fewer "external interactions" with tasks of other task areas.
  • the computation graph is set in three layers (layer 0 to layer 2), and the computation graph is divided into three task areas.
  • this does not indicate a limitation on the number of layers and the number of tasks in the embodiment of the present disclosure, and the embodiment of the present disclosure does not limit the number of layers and the number of tasks in the calculation graph.
  • S505 Determine the second mapping relationship between the tasks of each task area and each processing core of the many-core system.
  • each task area corresponds to a core group
  • each core group includes one or more processing cores
  • the tasks of each task area are mapped to the processing cores of the corresponding core group.
  • that is, a corresponding core group is determined for each task area (each core group includes one or more processing cores of the many-core system), and all tasks of each task area are mapped to the processing cores of its corresponding core group; in other words, each task in a task area is mapped to one processing core of the corresponding core group, and each processing core of a core group is mapped with at least one task from the task area corresponding to that core group.
  • in the embodiments of the present disclosure, the "strong internal routing areas" with relatively close internal connections in the calculation graph are each divided into one task area, so the tasks in each task area have fewer "external interactions"; each task area is mapped to one core group, so less "cross-group" data transmission is needed between different core groups. Thus, less inter-core (or inter-group) routing is required, which can simplify the inter-core (or inter-group) routing structure and reduce the amount of data transmitted by inter-core (or inter-group) routing, improving the performance (such as efficiency) of the many-core system.
  • each task area includes tasks from multiple different layers, and the tasks of each layer are distributed over different core groups; therefore, when one processing core becomes invalid (for example, due to a fault), each layer of the computation graph is generally only "partially broken", and it will not happen that all the tasks of a layer become invalid at once. The computation graph as a whole can thus still produce processing results that are usable to a certain degree, which improves the robustness of the computation graph.
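The robustness argument above can be seen in a toy example: with a cross-layer mapping, disabling one core removes only part of each layer, never a whole layer. The layers, task ids, and core ids below are invented for illustration.

```python
# An illustrative toy (numbers invented) of the robustness argument above: when
# each core hosts tasks from several layers, losing one core removes only part
# of each layer, never a whole layer.

layers = {0: ["a0", "a1"], 1: ["b0", "b1"], 2: ["c0", "c1"]}
# Cross-layer mapping: each core hosts one task from every layer.
core_of = {"a0": 0, "b0": 0, "c0": 0, "a1": 1, "b1": 1, "c1": 1}

failed_core = 0
surviving = {layer: [t for t in tasks if core_of[t] != failed_core]
             for layer, tasks in layers.items()}
print(all(surviving[layer] for layer in layers))  # True: every layer survives partially
```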
  • the computation graph is a trainable computation graph; even when at least some of its tasks are changed, it can still solve the same problem to be processed.
  • the computational graph can be a neural network or other computational graph.
  • examples of neural networks include Convolutional Neural Networks (CNNs), Spiking Neural Networks (SNNs), Recurrent Neural Networks (RNNs), and the like.
  • the redundancy performance can be improved by training the computation graph.
  • At least some of the task areas include tasks from every layer of the computation graph.
  • the tasks in a task area may come from all layers, that is, a task area may take at least one task from each layer. Further, every task area may include tasks from every layer.
  • the internal routing strength includes: the internal routing volume proportion, where the internal routing volume proportion of each task area is the ratio of the amount of data transmitted by routes between tasks within the task area to the amount of data transmitted by all the routes involving tasks in the task area; and/or the internal route number proportion, where the internal route number proportion of each task area is the ratio of the number of routes between tasks within the task area to the total number of routes involving tasks in the task area.
  • the "internal routing volume proportion" and/or the "internal route number proportion" may be used as specific indicators of the above internal routing strength.
  • all the routes of each task area include internal routes and external routes, and the amount of data transmitted by each route during the operation of the computation graph is fixed and can be known in advance; thus, the ratio of the amount of data transmitted by the internal routes of each task area to the amount of data transmitted by all its routes can be used as the internal routing volume proportion, that is, one of the specific indicators of internal routing strength.
  • likewise, all the routes of each task area include internal routes and external routes, and the "number" of each kind of route is determined and known; thus, the ratio of the number of internal routes of each task area to the number of all its routes can be used as the internal route number proportion, that is, another specific indicator of internal routing strength.
  • the preset criteria include: the internal routing volume proportion is greater than a first threshold; and/or the internal route number proportion is greater than a second threshold.
  • the preset criteria that a task area should meet may be that one of the internal routing volume proportion and the internal route number proportion is greater than its corresponding threshold, or that both are greater than their corresponding thresholds at the same time.
  • when both indicators are required to exceed their thresholds, each threshold may differ from the threshold that would be required if only one of the internal routing volume proportion and the internal route number proportion had to be greater than its corresponding threshold.
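The two indicators and the threshold check can be sketched as follows. This is a minimal sketch, not the patent's implementation: the route model `(src, dst, volume)`, the threshold values, and the choice of requiring both indicators (the text allows "and/or") are all assumptions.

```python
# A minimal sketch (not from the patent) of the two internal-routing-strength
# indicators defined above. A route is modeled as (src_task, dst_task, volume);
# the thresholds and the "both indicators" choice are assumptions.

def internal_strength(area, routes):
    """area: set of task ids; routes: list of (src, dst, data volume).
    Returns (internal routing volume proportion, internal route number
    proportion). Assumes the area is involved in at least one route."""
    involved = [r for r in routes if r[0] in area or r[1] in area]
    internal = [r for r in involved if r[0] in area and r[1] in area]
    vol_ratio = sum(r[2] for r in internal) / sum(r[2] for r in involved)
    cnt_ratio = len(internal) / len(involved)
    return vol_ratio, cnt_ratio

def meets_criteria(area, routes, vol_threshold=0.8, cnt_threshold=0.8):
    vol, cnt = internal_strength(area, routes)
    return vol > vol_threshold and cnt > cnt_threshold

routes = [("t0", "t1", 10), ("t1", "t2", 10), ("t2", "x", 2)]
vol, cnt = internal_strength({"t0", "t1", "t2"}, routes)
print(round(vol, 3), round(cnt, 3))  # 0.909 0.667
```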
  • At least some of the core groups include only one processing core.
  • there may be a core group (e.g., core group 2) that includes only one processing core (processing core 2), so the tasks of its corresponding task area (task area 1) must all be mapped to that processing core. Therefore, the data transmission between tasks of different layers within that task area ("internal interaction" for the task area) is entirely intra-core data transmission, which can minimize inter-core routing.
  • alternatively, all core groups may each consist of only one processing core.
  • At least some of the core groups include multiple processing cores; in the second mapping relationship, each task area corresponding to a core group that includes multiple processing cores is divided into multiple second task blocks, the number of second task blocks in each such task area is the same as the number of processing cores included in the corresponding core group, and the second task blocks of each task area are respectively mapped to the processing cores of the corresponding core group.
  • there may be core groups (e.g., core group 0 and core group 1) that include multiple processing cores (processing core 0 and processing core 1), so the corresponding task areas (task area 0 and task area 2) need to be "divided into blocks" first; each second task block obtained by the division includes multiple tasks, and each second task block is then mapped to one processing core of the corresponding core group (so the number of second task blocks and the number of processing cores of the core group must be the same).
  • each processing core may not be able to process all the tasks of an entire task area; therefore, hardware resources, computing load balancing, and other factors can be considered comprehensively, and multiple processing cores can form a core group to jointly process one task area.
  • tasks from the same layer can be divided into the same second task block as much as possible, so that tasks of the same layer are subsequently located in one processing core as much as possible.
  • in this way, inter-core routing is mainly established between processing cores corresponding to different layers, rather than forming a very complex "grid-like" routing structure.
  • all core groups include multiple processing cores.
  • in at least some of the core groups that include multiple processing cores, the distance between any two processing cores is less than a preset distance.
  • for a core group including multiple processing cores (e.g., core group 0 or core group 1), the distance between any two of its processing cores may be smaller than the preset distance; since transmission efficiency is related to the distance between processing cores, grouping processing cores that are "closer" together into one core group can reduce the time consumed by data transmission within the core group.
  • the distance between processing cores may be defined in various ways; for example, it may be the straight-line physical distance between the processing cores, the total length of the inter-core routes connecting them, or the number of other processing cores between them (or the number of routing hops), and so on, which will not be described in detail here.
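One of the distance notions just listed, the routing hop count, can be illustrated for cores laid out on a 2D mesh. The mesh layout, Manhattan (X-Y) routing, and the preset distance value are assumptions for illustration, not details from the patent.

```python
# A hedged example of one distance notion mentioned above: the routing hop count
# between cores on a 2D mesh, assuming Manhattan (X-Y) routing. The mesh layout
# and the preset distance are illustrative assumptions.

def hop_distance(core_a, core_b):
    """Cores are (x, y) mesh coordinates; returns the Manhattan hop count."""
    return abs(core_a[0] - core_b[0]) + abs(core_a[1] - core_b[1])

def within_preset_distance(core_group, max_hops):
    """True if every pair of cores in the group is within max_hops hops."""
    cores = list(core_group)
    return all(hop_distance(a, b) <= max_hops
               for i, a in enumerate(cores) for b in cores[i + 1:])

print(within_preset_distance([(0, 0), (0, 1), (1, 1)], max_hops=2))  # True
```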
  • the processing cores included in different core groups may "overlap" (for example, core group 0 and core group 1 both include processing core 0 and processing core 1), so that tasks of different task areas (including second task blocks) may be mapped to the same processing core, to make fuller use of hardware resources and better achieve computing load balancing.
  • the processing cores included in different core groups may be exactly the same (it may also be considered that multiple core groups are "merged"), or the processing cores included in different core groups may be "partially overlapped", which will not be described in detail here.
  • the method further includes: expanding at least part of the task areas.
  • the expansion includes adding redundant tasks in the task area.
  • each task area can also be "expanded", that is, some tasks that were not originally in the task area (redundant tasks) are "added" to it, and the expanded task area is then mapped to the corresponding core group (including mapping after being divided into blocks), so that the above redundant tasks are also mapped into processing cores to improve redundancy performance.
  • the specific "expansion amount" of each layer may be preset.
  • the computation amount of an expanded task may be the same as that of an original task; generally speaking, the expansion ratio b can be greater than 0 and less than or equal to 1 (a b greater than 1 is also feasible; for example, there can be "multiple copies" of a task as backups).
  • each layer may be expanded according to a certain method, and the actual task amount obtained by the expansion according to this method shall prevail.
  • redundant tasks include at least one of the following:
  • each expanded task can be a copy of an original task in its corresponding task area, so that there are actually "multiple copies" of the corresponding task; these copies can serve as "backups" for each other, improving robustness.
  • Invalid tasks. As one mode of the embodiments of the present disclosure, some tasks that perform operations whose results are not required by the original computation graph can be added as expansion, that is, invalid tasks. The invalid tasks may be generated randomly, or may be generated by other specific techniques.
  • the expansion modes of different task areas may be the same or different.
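The expansion by redundant copies described above can be sketched as follows. The function name, the rounding choice (`ceil`), and the "_copy" naming are illustrative assumptions; the ratio b follows the earlier note that 0 < b <= 1 duplicates a fraction of the tasks, while b > 1 would allow multiple backup copies.

```python
# An illustrative sketch of "expanding" a task area with redundant copies of its
# own tasks, using an assumed per-area expansion ratio b. Rounding and naming
# are assumptions, not from the patent.

import math

def expand_area(tasks, b):
    """Return the task list plus ceil(b * len(tasks)) redundant copies."""
    n_extra = math.ceil(b * len(tasks))
    redundant = [f"{t}_copy" for t in (tasks * math.ceil(b))[:n_extra]]
    return tasks + redundant

print(expand_area(["t0", "t1", "t2", "t3"], b=0.5))
# → ['t0', 't1', 't2', 't3', 't0_copy', 't1_copy']
```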
  • the method further includes:
  • the above training includes at least one of the following:
  • the Dropout method can be used to invalidate some tasks in the computation graph (for example, setting the weights of some nodes of the neural network to 0) and adjust the other tasks, so that the computation graph can still produce results that are usable to a certain degree when those tasks are invalid, thereby improving robustness.
  • the invalidated tasks can be located in a contiguous region (not necessarily a task area) of the computation graph, that is, DropBlock training can specifically be used.
  • the "adversarial sample defense" method can be used to train the computational graph.
  • the method further includes:
  • one or more (but not all) task areas can also be invalidated (that is, all the tasks in them are made invalid), and the tasks in the other task areas can be adjusted so that the remaining task areas still yield results that are usable to a certain degree (i.e., the task areas are trained), improving robustness.
  • the above training includes at least one of the following:
  • some task areas may be randomly invalidated to train the remaining task areas.
  • alternatively, "critical tasks" that play a key role may be determined according to the structural features of the computation graph, and the critical task areas where the critical tasks are located may be invalidated to train the remaining task areas.
  • the method further includes:
  • all the tasks mapped into some of the processing cores are invalidated, so as to train each task area and improve the redundancy performance of the computation graph.
  • all the tasks mapped into some of the processing cores can be invalidated (equivalent to invalidating those processing cores), and the other tasks can be adjusted so that the remaining tasks can still produce results that are usable to a certain degree (i.e., the task areas are trained), improving robustness.
  • the above training includes at least one of the following:
  • all the tasks mapped into randomly selected processing cores are invalidated to train each task area.
  • all the tasks mapped into one or more (but not all) processing cores may be invalidated (i.e., one or more processing cores are disabled) to train each task area.
  • each processing core may be invalidated in sequence, that is, only one processing core is invalidated at a time, until all the processing cores have been invalidated once.
  • alternatively, the key tasks may be determined according to the structure of the computation graph, and the processing cores where the key tasks are located may then be invalidated, so as to train each task area.
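The "one core invalid at a time" schedule described above can be sketched as a loop over cores, where each round trains on the tasks of all surviving cores. `train_step` is a hypothetical placeholder for whatever training the graph uses; task and core ids are invented.

```python
# A sketch of the sequential core-invalidation schedule described above: each
# round disables a single core's tasks, until every core has been disabled once.
# `train_step` is a hypothetical placeholder, not the patent's training routine.

def train_with_core_faults(cores, task_of_core, train_step):
    for failed in cores:           # only one core is invalid per round...
        active = [t for c in cores if c != failed for t in task_of_core[c]]
        train_step(active)         # ...but every core gets invalidated in turn

rounds = []
train_with_core_faults(
    cores=[0, 1, 2],
    task_of_core={0: ["t0"], 1: ["t1"], 2: ["t2"]},
    train_step=lambda active: rounds.append(sorted(active)),
)
print(rounds)  # [['t1', 't2'], ['t0', 't2'], ['t0', 't1']]
```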
  • S507 Map the tasks of each task area to each processing core of the many-core system according to the second mapping relationship.
  • each processing core processes the tasks mapped to it.
  • the tasks in each task area can also be mapped (or allocated) to the processing cores of the corresponding core group according to the second mapping relationship, and processed by each processing core, so as to realize the computation graph.
  • mapping can be performed directly.
  • step S506 may be performed before step S507, that is, the tasks are not yet actually mapped to the processing cores; instead, the tasks corresponding to a processing core according to the second mapping relationship are invalidated, and the training is performed.
  • step S506 can also be performed after step S507, that is, the task can be actually mapped to the processing core, and the processing core can be actually invalidated for training.
  • the computation graphs to be trained must be trainable computation graphs (e.g., neural networks). Each kind of training can be performed only once, or repeated multiple times. When multiple training sessions are performed, the specific methods used in each session may be the same or different, and the training can be ended when a preset end criterion is reached; the end criterion may include convergence of the computation graph, reaching a predetermined number of training iterations, reaching a predetermined redundancy performance, and the like.
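The multi-round training loop with the end criteria listed above can be sketched as follows. All functions here are illustrative placeholders (the patent does not specify them); the loop simply stops at convergence, at the redundancy target, or at a maximum round count.

```python
# A hedged sketch of the training loop with the end criteria listed above
# (convergence, predetermined round count, or target redundancy performance).
# All callables are illustrative placeholders.

def train_until_done(train_once, converged, redundancy_ok, max_rounds=100):
    for round_no in range(1, max_rounds + 1):
        train_once()
        if converged() or redundancy_ok():
            return round_no
    return max_rounds

n = train_until_done(
    train_once=lambda: None,
    converged=lambda: False,
    redundancy_ok=lambda: True,   # pretend the redundancy target is met at once
)
print(n)  # 1
```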
  • an apparatus 800 for processing tasks including:
  • the obtaining module 801 is configured to obtain a computation graph of the problem to be processed; the computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, the tasks in any layer are not performed based on the results of the tasks in that layer or subsequent layers, and at least some of the tasks in at least some of the layers are performed based on the results of the tasks in the previous layers;
  • a block module 802 configured to divide each layer of the computation graph into a plurality of task blocks; each task block includes at least one task;
  • the mapping module 803 is configured to determine a first mapping relationship between each task block and the plurality of processing cores of the many-core system; in the first mapping relationship, each task block is mapped to one processing core, and the task blocks of any layer are mapped to at least two different processing cores.
  • the apparatus 800 for task processing further includes:
  • the training module 804 is configured to train the computation graph based on the first mapping relationship to obtain a trained computation graph; the computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, the tasks in any layer are not performed based on the results of the tasks in that layer or subsequent layers, and at least some of the tasks in at least some of the layers are performed based on the results of the tasks in the previous layers;
  • the partition module 805 is configured to divide the computation graph into a plurality of task areas; each task area includes a plurality of tasks from at least two different layers, the tasks of each layer are located in at least two task areas, and the internal routing strength of each task area exceeds a preset standard, where the internal routing strength of each task area is determined according to the proportion of the routes between tasks within the task area among all the routes involving tasks in the task area;
  • the mapping module 803 is further configured to determine a second mapping relationship between the tasks of each task area and the processing cores of the many-core system; in the second mapping relationship, each task area corresponds to a core group, each core group includes one or more processing cores, and the tasks of each task area are mapped to the processing cores of the corresponding core group.
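The first-mapping-relationship constraint handled by the mapping module (each task block on exactly one core, and the blocks of any layer spread over at least two different cores) can be sketched as a check. All identifiers below are invented for illustration; the dict encoding already guarantees one core per block.

```python
# A sketch of the first-mapping-relationship constraint stated above: the dict
# block_to_core maps every task block to exactly one core, and the task blocks
# of any single layer must land on at least two different cores. Names are
# illustrative, not from the patent.

def check_first_mapping(layer_blocks, block_to_core):
    """layer_blocks: dict layer -> list of task block ids;
    block_to_core: dict task block id -> core id."""
    return all(len({block_to_core[b] for b in blocks}) >= 2
               for blocks in layer_blocks.values())

layers = {0: ["b00", "b01"], 1: ["b10", "b11"]}
mapping = {"b00": 0, "b01": 1, "b10": 0, "b11": 1}
print(check_first_mapping(layers, mapping))  # True
```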
  • the apparatus 800 for task processing in this embodiment of the present disclosure may implement the above-mentioned task processing method.
  • the task processing apparatus 800 may also include other modules for implementing corresponding steps.
  • an embodiment of the present disclosure provides a many-core system 900, including:
  • a plurality of processing cores 901; and
  • an on-chip network 902 configured to exchange data among the plurality of processing cores 901 and external data;
  • one or more instructions are stored in one or more of the processing cores 901, and the one or more instructions are executed by the one or more processing cores 901, so that the one or more processing cores 901 can execute any of the above task processing methods.
  • the many-core system 900 in the embodiment of the present disclosure can implement the above-mentioned task processing method, including performing actual computation graphs and/or tasks in the computation graphs to obtain processing results of the problems to be processed.
  • embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the above-mentioned task processing method when executed by a processor/processing core.
  • Computer-readable storage media can be volatile or non-volatile computer-readable storage media.
  • embodiments of the present disclosure further provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above task processing method.
  • Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media, as is well known to those of ordinary skill in the art.


Abstract

Provided in the present disclosure is a task processing method. The method comprises: acquiring a computational graph of a problem to be processed, wherein the computational graph comprises a plurality of layers that are arranged in sequence, each layer comprises a plurality of tasks, the tasks in any layer are not performed on the basis of results of the tasks in the current layer or the subsequent layer, and at least some of tasks in at least some of the layers are performed on the basis of the results of the tasks in the previous layer; dividing the computational graph into a plurality of task groups, wherein each task group comprises tasks of at least one layer; and determining a mapping relationship between the tasks of each task group and processing cores of a many-core system, wherein in the mapping relationship, each task group corresponds to a core group, and each core group comprises at least one processing core. By means of the method, the robustness of a computational graph can be improved. Further provided are a task processing apparatus, a many-core system, and a computer-readable medium.

Description

Method and apparatus for task processing, many-core system, and computer-readable medium

Technical Field

The present disclosure relates to the field of many-core technologies, and in particular, to a method and apparatus for task processing, a many-core system, and a computer-readable medium.

Background

A many-core system includes a plurality of processing cores (kernels or processing engines), and the processing cores can exchange information with one another through routing. The process by which a many-core system solves a problem to be processed through electronic computation is essentially a process in which the multiple tasks corresponding to the problem are mapped (or allocated) to different processing cores, and each processing core processes its tasks separately.

However, the processing cores in a many-core system will inevitably enter an invalid state (e.g., due to a fault), so it is very important that processing results usable to a certain degree can still be obtained when some of the processing cores in the many-core system are in an invalid state.

Summary
Embodiments of the present disclosure provide a task processing method and apparatus, a many-core system, and a computer-readable medium.

In a first aspect, an embodiment of the present disclosure provides a task processing method, including:

obtaining a computation graph of a problem to be processed, where the computation graph includes a plurality of layers, each layer includes a plurality of tasks, the tasks in any layer are not executed based on the results of the tasks in that layer or subsequent layers, and at least some of the tasks in at least some of the layers are executed based on the results of the tasks in the previous layers;

dividing the computation graph into a plurality of task groups, where each task group includes the tasks of at least one layer; and

determining a mapping relationship between the tasks of each task group and the processing cores of a many-core system, where in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
In a second aspect, an embodiment of the present disclosure provides an apparatus for task processing, including:

an obtaining module configured to obtain a computation graph of a problem to be processed, where the computation graph includes a plurality of layers, each layer includes a plurality of tasks, the tasks in any layer are not executed based on the results of the tasks in that layer or subsequent layers, and at least some of the tasks in at least some of the layers are executed based on the results of the tasks in the previous layers;

a partitioning module configured to divide the computation graph into a plurality of task groups, where each task group includes the tasks of at least one layer; and

a mapping module configured to determine a mapping relationship between the tasks of each task group and the processing cores of a many-core system, where in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
In a third aspect, an embodiment of the present disclosure provides a many-core system, including:

a plurality of processing cores; and

an on-chip network configured to exchange data among the plurality of processing cores and external data;

where one or more of the processing cores store one or more instructions, and the one or more instructions are executed by the one or more processing cores, so that the one or more processing cores can execute any of the above task processing methods.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, where the computer program implements any of the above task processing methods when executed by a processing core.

In a fifth aspect, an embodiment of the present disclosure provides a computer program product, including a computer program that, when executed by a processor, implements any of the above task processing methods.
In the embodiments of the present disclosure, the tasks in the computation graph are divided into a plurality of task groups, each task group includes the tasks of at least one layer, all the tasks of the same layer can be mapped to two different processing cores for processing, and tasks of different layers can be mapped into the same core group. Thus, when any processing core becomes invalid (e.g., due to a fault), a layer of the computation graph is at most only "partially broken", and it will never happen that all the tasks of a layer become invalid, so that the computation graph as a whole can still produce processing results that are usable to a certain degree, which greatly improves the robustness of the computation graph.

It should be understood that the content described in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easily understood from the following description.
Brief Description of the Drawings

The accompanying drawings are provided for a further understanding of the present disclosure and constitute a part of the specification; together with the embodiments of the present disclosure, they serve to explain the present disclosure and do not limit it. The above and other features and advantages will become more apparent to those skilled in the art from the description of detailed example embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of a task processing method provided by an embodiment of the present disclosure;

FIG. 2 is a flowchart of a task processing method provided by an embodiment of the present disclosure;

FIG. 3 is a flowchart of another task processing method provided by an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a process of processing a computation graph in an embodiment of the present disclosure;

FIG. 5 is a flowchart of a task processing method after step S205 in an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of the partitioning of a computation graph and the routing relationships of the tasks therein in an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a process of processing a computation graph in an embodiment of the present disclosure;

FIG. 8 is a block diagram of an apparatus for task processing provided by an embodiment of the present disclosure;

FIG. 9 is a block diagram of a many-core system provided by an embodiment of the present disclosure.
Detailed Description
为使本领域的技术人员更好地理解本公开的技术方案,以下结合附图对本公开的示范性实施例做出说明,其中包括本公开实施例的各种细节以助于理解,应当将它们认为仅仅是示范性的。因此,本领域普通技术人员应当认识到,可以对这里描述的实施例做出各种改变和修改,而不会背离本公开的范围和精神。同样,为了清楚和简明,以下的描述中省略了对公知功能和结构的描述。In order for those skilled in the art to better understand the technical solutions of the present disclosure, exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of the present disclosure to facilitate understanding, and they should be considered to be exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
在不冲突的情况下,本公开各实施例及实施例中的各特征可相互组合。Various embodiments of the present disclosure and various features of the embodiments may be combined with each other without conflict.
如本文所使用的,术语“和/或”包括一个或多个相关列举条目的任何和所有组合。As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
本文所使用的术语仅用于描述特定实施例,且不意欲限制本公开。如本文所使用的,单数形式“一个”和“该”也意欲包括复数形式,除非上下文另外清楚指出。还将理解的是,当本说明书中使用术语“包括”和/或“由……制成”时,指定存在所述特征、整体、步骤、操作、元件和/或组件,但不排除存在或添加一个或多个其它特征、整体、步骤、 操作、元件、组件和/或其群组。“连接”或者“相连”等类似的词语并非限定于物理的或者机械的连接,而是可以包括电性的连接,不管是直接的还是间接的。The terminology used herein is used to describe particular embodiments only and is not intended to limit the present disclosure. As used herein, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that when the terms "comprising" and/or "made of" are used in this specification, the stated features, integers, steps, operations, elements and/or components are specified to be present, but not precluded or Add one or more other features, integers, steps, operations, elements, components and/or groups thereof. Words like "connected" or "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
除非另外限定,否则本文所用的所有术语(包括技术和科学术语)的含义与本领域普通技术人员通常理解的含义相同。还将理解,诸如那些在常用字典中限定的那些术语应当被解释为具有与其在相关技术以及本公开的背景下的含义一致的含义,且将不解释为具有理想化或过度形式上的含义,除非本文明确如此限定。Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be construed as having meanings consistent with their meanings in the context of the related art and this disclosure, and will not be construed as having idealized or over-formal meanings, unless expressly so limited herein.
很多问题(如图像处理、语音识别等)中要进行的实际工作可用"计算图(或者说任务图、逻辑图)"的形式表述。即，要解决该问题需进行的所有运算被分为多个"任务(或者说节点)"，每个任务包括一定的运算，且不同任务间存在一定的顺序。例如，某任务的运算若要用到其它任务的运算结果，则称该任务基于其它任务的结果进行；或者说，该任务是其它任务的在后任务，其它任务是该任务的在前任务。The actual work to be done in many problems (such as image processing, speech recognition, etc.) can be expressed in the form of a "computation graph (or task graph, logic graph)". That is, all operations to be performed to solve the problem are divided into multiple "tasks (or nodes)"; each task includes a certain operation, and a certain order exists among different tasks. For example, if the operation of a task uses the operation results of other tasks, the task is said to be performed based on the results of those tasks; in other words, the task is a subsequent task of those tasks, and those tasks are preceding tasks of this task.
由于任务间存在以上关系，计算图可被分为多个"层"，每个层包括多个任务，且任意一个层中的任务，都不是基于本层或后续层中的任务进行的，而至少部分层中有至少部分任务基于其前层中的任务的结果进行。即，若在前层中的任务没有完成，可能导致在后层中的任务不能进行，因为在后层的任务的运算过程中可能要用到在前层的任务的运算结果；但若在后层中的任务不能进行，不会影响在前层中的任务，因为在前层的任务的运算不会使用在后层的任务的运算结果；且同一层中的任务不存在基于关系（从属关系），因为若存在基于关系，则相应的任务就应属于两个不同层。Because of the above relationships among tasks, the computation graph can be divided into multiple "layers". Each layer includes multiple tasks; no task in any layer is performed based on tasks in the same layer or in subsequent layers, while at least some tasks in at least some layers are performed based on the results of tasks in their preceding layers. That is, if a task in a preceding layer is not completed, a task in a subsequent layer may be unable to proceed, because the operation of the subsequent task may use the operation result of the preceding task; conversely, if a task in a subsequent layer cannot proceed, the tasks in preceding layers are unaffected, because the operations of preceding tasks never use the results of subsequent tasks. Moreover, no dependency (subordination) relationship exists between tasks within the same layer, because if such a relationship existed, the corresponding tasks would belong to two different layers.
示例性的，“神经网络(NN)”是计算图的一种形式。神经网络分为多层，每个层包括多个节点，在每个节点中需进行一定的运算，而不同层的节点间以一定的关系相连(如一个节点输出作为下一层节点的输入)；从而，神经网络的每个层可视为计算图的一个层，而神经网络的每个节点可视为计算图的一个任务。Illustratively, a "neural network (NN)" is one form of computation graph. A neural network is divided into multiple layers, each layer includes multiple nodes, a certain operation is performed in each node, and nodes of different layers are connected in a certain relationship (for example, the output of one node serves as the input of a node in the next layer); thus, each layer of the neural network can be regarded as a layer of the computation graph, and each node of the neural network can be regarded as a task of the computation graph.
示例性的,本公开实施例中的神经网络可用于进行图像处理、语音识别等,其具体可为卷积神经网络(CNN)、脉冲神经网络(SNN)、循环神经网络(RNN)等形式。Exemplarily, the neural network in the embodiment of the present disclosure can be used for image processing, speech recognition, etc., which can specifically be in the form of a convolutional neural network (CNN), a spiking neural network (SNN), a recurrent neural network (RNN), and the like.
其中，部分问题可能对应多个不同的计算图。即，计算图中的任务数、任务所处的层、任务间的关系、每个任务的具体运算等可不同，但这些不同计算图均可解决该问题(但解决问题的效果不一定相同)。Some problems may correspond to multiple different computation graphs. That is, the number of tasks in the computation graph, the layer each task belongs to, the relationships among tasks, and the specific operation of each task may differ, yet all of these different computation graphs can solve the problem (although not necessarily equally well).
以上可能有多种形式的计算图称为“可训练计算图”。即，对能解决一个问题的计算图，可通过训练对其中的任务进行调整，从而使训练后的计算图解决问题的效果不同。A computation graph that can take multiple such forms is called a "trainable computation graph". That is, for a computation graph that can solve a problem, the tasks in it can be adjusted through training, so that the trained computation graph solves the problem with a different effect.
例如，神经网络是可训练计算图的一种形式。例如，处理一个问题(如图像分类)的神经网络通常是训练得到的，即根据当前神经网络解决问题的效果(如图像分类的准确度)调整其中的节点(如调整节点的权重)，从而改变神经网络(计算图)，并改善其处理问题的效果(如提高图像分类的准确度)。For example, a neural network is one form of trainable computation graph. A neural network for handling a problem (such as image classification) is usually obtained by training: its nodes are adjusted (for example, by adjusting node weights) according to how well the current network solves the problem (such as the accuracy of image classification), thereby changing the neural network (the computation graph) and improving its performance on the problem (such as raising the accuracy of image classification).
在一些相关技术中，当要用众核系统处理一个问题时，可将其对应的计算图的每层的任务映射(分配)至一个处理核中，而不同层的任务被映射至不同处理核中。In some related technologies, when a problem is to be processed with a many-core system, the tasks of each layer of its corresponding computation graph are mapped (assigned) to one processing core, and the tasks of different layers are mapped to different processing cores.
但根据以上方式，一旦众核系统的某一个处理核无效(如因故障)，则相当于计算图的一层的全部任务都无法被处理，从而该层后的所有任务实际上也都无法进行，因此，必然导致整个问题完全无法被解决(即不能得出任何处理结果)，系统的鲁棒性很差。However, with the above approach, once any processing core of the many-core system becomes invalid (for example, due to a failure), all tasks of the corresponding layer of the computation graph cannot be processed, so that all tasks after that layer cannot actually proceed either. The whole problem therefore cannot be solved at all (that is, no processing result can be obtained), and the robustness of the system is very poor.
第一方面，本公开实施例提供一种任务处理的方法。该方法基于众核系统，其包括如何将一个计算图的任务映射至众核系统的各处理核中。In a first aspect, an embodiment of the present disclosure provides a task processing method. The method is based on a many-core system and includes how to map the tasks of a computation graph to the processing cores of the many-core system.
参照图1,本公开实施例任务处理的方法包括:Referring to FIG. 1 , the task processing method of the embodiment of the present disclosure includes:
S101,获取待处理问题的计算图。S101, obtaining a calculation graph of the problem to be processed.
计算图包括多个层，每个层包括多个任务，任意层中的任务不基于本层或其后层中的任务的结果执行，至少部分层中有至少部分任务基于其前层中的任务的结果执行。The computation graph includes multiple layers, each layer includes multiple tasks, no task in any layer is executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in their preceding layers.
S102,将计算图分为多个任务组;每个任务组包括至少一个层的任务。S102: Divide the computation graph into multiple task groups; each task group includes tasks of at least one layer.
S103,确定各任务组的任务与众核系统的各处理核间的映射关系;在映射关系中,每个任务组对应一个核组,每个核组包括至少一个处理核。S103: Determine the mapping relationship between the tasks of each task group and each processing core of the many-core system; in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
当要用众核系统处理一个待处理问题(如图像处理、语音识别等)时，获取其对应计算图。其中，可以获取已预先设置的计算图；也可根据具体的待处理问题，按照预定规则生成计算图。When a problem to be processed (such as image processing or speech recognition) is to be handled by the many-core system, its corresponding computation graph is obtained. A preset computation graph may be obtained, or the computation graph may be generated according to predetermined rules based on the specific problem to be processed.
在一些实施例中，任务组包括任务块，任务块是基于计算图进行层内分块获得的，即，将计算图中每个层分为多个任务块，每个任务块包括至少一个任务。In some embodiments, the task group includes task blocks, and the task blocks are obtained by intra-layer partitioning of the computation graph; that is, each layer of the computation graph is divided into multiple task blocks, and each task block includes at least one task.
在一些实施例中，任务组包括任务区，任务区是基于计算图进行层间分区获得的，即，将计算图中的任务分为多个任务区，每个任务区至少包括两个层的任务。In some embodiments, the task group includes a task region, and the task region is obtained by inter-layer partitioning of the computation graph; that is, the tasks in the computation graph are divided into multiple task regions, and each task region includes tasks from at least two layers.
本公开实施例中，将计算图中的任务分为多个任务组，每个任务组包括至少一个层的任务，同一个层中的所有任务可以映射至两个不同的处理核中处理，也可以将不同层中的任务映射至同一核组内，从而在任意处理核无效(如因故障)时，计算图的一个层最多只“坏掉一部分”，而不会产生一个层的所有任务均无效的情况，从而使计算图整体仍可得出一定程度上可用的处理结果，极大的提高了计算图的鲁棒性。In the embodiment of the present disclosure, the tasks in the computation graph are divided into multiple task groups, each task group includes the tasks of at least one layer, all tasks in a same layer can be mapped to two different processing cores for processing, and tasks in different layers can be mapped into the same core group. Thus, when any processing core becomes invalid (for example, due to a failure), at most only part of a layer of the computation graph "breaks", and the situation in which all tasks of a layer are invalid does not arise; the computation graph as a whole can still produce a processing result that is usable to a certain degree, greatly improving the robustness of the computation graph.
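As an illustrative aid only, steps S101 to S103 can be sketched in Python. The helper names and the round-robin assignment of task groups to core groups below are hypothetical choices, not part of the embodiment:

```python
# Illustrative sketch of S101-S103 (all names are hypothetical).

def split_into_task_groups(layers, layers_per_group):
    """S102: each task group contains the tasks of at least one layer."""
    groups = []
    for i in range(0, len(layers), layers_per_group):
        group = [task for layer in layers[i:i + layers_per_group]
                 for task in layer]
        groups.append(group)
    return groups

def map_groups_to_core_groups(task_groups, core_groups):
    """S103: each task group corresponds to exactly one core group."""
    return {gid: core_groups[gid % len(core_groups)]
            for gid in range(len(task_groups))}

# S101: a toy computation graph with three layers of two tasks each.
layers = [["t00", "t01"], ["t10", "t11"], ["t20", "t21"]]
groups = split_into_task_groups(layers, layers_per_group=2)
mapping = map_groups_to_core_groups(groups, core_groups=[["core0"], ["core1"]])
```

Here the first task group spans two layers (matching the inter-layer partitioning variant), and each group is bound to one core group.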
图2为本公开实施例提供的一种任务处理的方法的流程图。参照图2,本公开实施例的任务处理的方法包括:FIG. 2 is a flowchart of a task processing method provided by an embodiment of the present disclosure. Referring to FIG. 2 , the task processing method according to the embodiment of the present disclosure includes:
S201,获取待处理问题的计算图。S201, obtaining a calculation graph of the problem to be processed.
其中，计算图包括多个依次设置的层，每个层包括多个任务，任意层中的任务不基于本层或其后层中的任务的结果进行，至少部分层中有至少部分任务基于其前层中的任务的结果进行。The computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, no task in any layer is performed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in their preceding layers.
当要用众核系统处理一个待处理问题(如图像处理、语音识别等)时，获取其对应计算图。其中，可以获取已预先设置的计算图；也可根据具体的待处理问题，按照预定规则生成计算图。When a problem to be processed (such as image processing or speech recognition) is to be handled by the many-core system, its corresponding computation graph is obtained. A preset computation graph may be obtained, or the computation graph may be generated according to predetermined rules based on the specific problem to be processed.
S203,将计算图的每个层分为多个任务块。S203: Divide each layer of the computation graph into multiple task blocks.
其中,每个任务块包括至少一个任务。Wherein, each task block includes at least one task.
参照图4，将计算图的每个层中的任务分为多“组”，即将每个层分为多个“任务块”，每个任务块包括该层中的一个或多个任务。其中，每层分出的任务块的数量，可以是预先设定的，如预先设定每层分为a个任务块。Referring to FIG. 4, the tasks in each layer of the computation graph are divided into multiple "groups"; that is, each layer is divided into multiple "task blocks", and each task block includes one or more tasks of that layer. The number of task blocks into which each layer is divided may be preset; for example, it may be preset that each layer is divided into a task blocks.
或者,也可以是根据确定的方式进行“分块”,以实际所得的任务块数量为准。Alternatively, "blocking" can also be performed according to a determined method, and the actual number of task blocks obtained shall prevail.
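Purely for illustration, blocking a layer into a preset number a of task blocks might look like the following sketch; the round-robin dealing of tasks is an assumption of this sketch, not a requirement of the embodiment:

```python
# Hypothetical sketch of S203: a layer's tasks are dealt into a preset
# number of task blocks; each returned block holds at least one task.

def split_layer(tasks, num_blocks):
    blocks = [[] for _ in range(num_blocks)]
    for i, task in enumerate(tasks):
        blocks[i % num_blocks].append(task)  # deal tasks out in turn
    return [b for b in blocks if b]  # keep only non-empty blocks

layer = ["n0", "n1", "n2", "n3", "n4"]
blocks = split_layer(layer, num_blocks=2)
```

Every task of the layer lands in exactly one block, so the blocks form a partition of the layer.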
S205,确定各任务块与众核系统的多个处理核间的第一映射关系。S205: Determine a first mapping relationship between each task block and multiple processing cores of the many-core system.
其中，在第一映射关系中，每个任务块映射到一个处理核中，每个处理核中映射多个任务块，任意一个层的所有任务块被映射到至少两个不同处理核中。In the first mapping relationship, each task block is mapped into one processing core, a plurality of task blocks are mapped into each processing core, and all the task blocks of any one layer are mapped into at least two different processing cores.
参照图4,在得到多个任务块后,则可确定应将每个任务块映射到众核系统的哪个处理核中,也就是第一映射关系(不一定进行实际映射)。Referring to FIG. 4 , after obtaining a plurality of task blocks, it can be determined which processing core of the many-core system should map each task block to, that is, the first mapping relationship (not necessarily the actual mapping).
根据本公开实施例的第一映射关系，映射是“分散”进行，即同一个层中的所有任务块应尽量“分开”映射到不同处理核中，至少保证它们不会被全部映射到同一个处理核中。由于每个任务块包括同一层中的任务，且同一层的各任务块被映射到不同处理核中，故以上第一映射关系保证了同一层的所有任务至少映射到两个不同的处理核中。According to the first mapping relationship of the embodiment of the present disclosure, the mapping is performed in a "scattered" manner; that is, the task blocks of a same layer should be mapped into different processing cores as separately as possible, at least ensuring that they are not all mapped into one and the same processing core. Since each task block includes tasks of one layer, and the task blocks of a same layer are mapped into different processing cores, the above first mapping relationship ensures that all tasks of a same layer are mapped into at least two different processing cores.
本公开实施例中，计算图的同一个层中的所有任务至少映射至两个不同的处理核中处理，从而在任意处理核无效(如因故障)时，计算图的一个层最多只“坏掉一部分”，而不会产生一个层的所有任务均无效的情况，从而使计算图整体仍可得出一定程度上可用的处理结果，极大的提高了计算图的鲁棒性。In the embodiment of the present disclosure, all tasks in a same layer of the computation graph are mapped to at least two different processing cores for processing, so that when any processing core becomes invalid (for example, due to a failure), at most only part of a layer of the computation graph "breaks", and the situation in which all tasks of a layer are invalid does not arise; the computation graph as a whole can still produce a processing result that is usable to a certain degree, greatly improving the robustness of the computation graph.
在一些实施例中,计算图为可训练计算图。其中,可训练的计算图能在其中至少部分任务有不同的情况下,解决相同的待处理问题。In some embodiments, the computational graph is a trainable computational graph. Among them, a trainable computational graph can solve the same problem to be solved when at least some of the tasks are different.
在一些实施例中，计算图为神经网络(NN)。作为本公开实施例的一种方式，以上计算图是可训练计算图，进一步是神经网络，如卷积神经网络(CNN)、脉冲神经网络(SNN)、循环神经网络(RNN)等；从而可通过对计算图进行训练，进一步提高其冗余性能。当然，本公开实施例计算图并不限用于可训练计算图(如神经网络)，而是也可用于其它的计算图。In some embodiments, the computation graph is a neural network (NN). As one mode of the embodiment of the present disclosure, the above computation graph is a trainable computation graph, and further a neural network, such as a convolutional neural network (CNN), a spiking neural network (SNN), or a recurrent neural network (RNN); the computation graph can thus be trained to further improve its redundancy performance. Of course, the embodiments of the present disclosure are not limited to trainable computation graphs (such as neural networks) and can also be applied to other computation graphs.
在一些实施例中,在第一映射关系中,任意一个层的任意两个任务块被分别映射到两个不同处理核中。In some embodiments, in the first mapping relationship, any two task blocks of any layer are respectively mapped to two different processing cores.
进一步的,参照图4,可进行最大限度的分散映射(或者说是“互斥”映射),即保证不会有来自同一层的任务块被映射到同一个处理核中。Further, referring to FIG. 4 , a maximum scattered mapping (or “mutually exclusive” mapping) can be performed, that is, it is ensured that no task blocks from the same layer are mapped to the same processing core.
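The "mutually exclusive" first mapping relationship can be sketched as follows; the per-layer offset used to spread blocks across cores is an illustrative choice (any assignment works as long as blocks of one layer land on pairwise distinct cores):

```python
# Sketch of the mutually exclusive first mapping (S205): task blocks of
# one layer occupy pairwise distinct cores, while one core may still hold
# blocks from different layers.

def build_first_mapping(blocks_per_layer, num_cores):
    mapping = {}  # (layer_id, block_id) -> processing core id
    for layer_id, n_blocks in enumerate(blocks_per_layer):
        if n_blocks > num_cores:
            raise ValueError("need at least one distinct core per block")
        for block_id in range(n_blocks):
            # offset by layer so different layers can share cores
            mapping[(layer_id, block_id)] = (layer_id + block_id) % num_cores
    return mapping

mapping = build_first_mapping(blocks_per_layer=[2, 2, 2], num_cores=3)
# Within every layer, the two blocks occupy two different cores.
per_layer_cores = [{mapping[(l, b)] for b in range(2)} for l in range(3)]
```

With this sketch a single failed core destroys at most one block per layer, which is exactly the robustness property argued for above.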
在一些实施例中,参照图3、图4,在获取待处理问题的计算图(S201)与将计算图的每个层分为多个任务块(S203)之间,还包括:In some embodiments, referring to FIG. 3 and FIG. 4 , between acquiring the computation graph of the problem to be processed (S201) and dividing each layer of the computation graph into a plurality of task blocks (S203), the method further includes:
S202,训练计算图,以提高计算图的冗余性能。S202 , training the computation graph to improve the redundancy performance of the computation graph.
在对计算图进行“分块”之前,还可对其进行训练,以提高其冗余性能。The computational graph can also be trained before it is "chunked" to improve its redundancy performance.
在一些实施例中,训练计算图(S202),包括以下至少一项:In some embodiments, the training computation graph (S202) includes at least one of the following:
(1)无效计算图中的部分任务,以训练计算图。(1) Invalidate some tasks in the computational graph to train the computational graph.
作为本公开实施例的一种方式，可采用Dropout方式，将计算图中的部分任务无效(如使神经网络的部分节点的权重为0)，并调整其它任务，使计算图可在这些任务无效的情况下产生一定程度上可用的结果，改善其鲁棒性。As one mode of the embodiment of the present disclosure, a Dropout approach may be used to invalidate some tasks in the computation graph (for example, by setting the weights of some nodes of the neural network to 0) and adjust the other tasks, so that the computation graph can still produce results that are usable to a certain degree when these tasks are invalid, improving its robustness.
(2)无效计算图的一个区域,以训练计算图。(2) A region of the invalid computational graph to train the computational graph.
其中,区域包括多个任务。Among them, the area includes multiple tasks.
作为本公开实施例的一种方式，可采用Dropblock的方式，使计算图中位于一个区域(可包括一层的一部分，或多个层的相应部分)内的任务全部无效，并调整其它任务，使计算图可在该区域的任务无效的情况下产生一定程度上可用的结果，改善其鲁棒性。As one mode of the embodiment of the present disclosure, a Dropblock approach may be used to invalidate all tasks located in one region of the computation graph (which may cover part of one layer, or corresponding parts of multiple layers) and adjust the other tasks, so that the computation graph can still produce results that are usable to a certain degree when the tasks in that region are invalid, improving its robustness.
以上方式的每个“区域”中,通常有较多任务最终被映射到一个处理核中,故其可更好的提高计算图的冗余性能。In each "area" of the above method, usually more tasks are finally mapped to one processing core, so it can better improve the redundancy performance of the computational graph.
(3)通过对抗样本防御方式训练计算图。(3) The computational graph is trained by adversarial sample defense.
作为本公开实施例的一种方式,可采用“对抗样本防御”方式训练计算图。As a method of the embodiment of the present disclosure, the "adversarial sample defense" method can be used to train the computational graph.
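Training methods (1) and (2) above can be sketched as follows; representing tasks by plain weight values and zeroing them is an assumption of this sketch, not the disclosed training procedure itself:

```python
# Sketch of (1) Dropout-style invalidation of individual tasks and
# (2) Dropblock-style invalidation of a whole contiguous region.

import random

def drop_tasks(weights, drop_rate, rng):
    """(1) Invalidate each task independently (set its weight to 0)."""
    return [0.0 if rng.random() < drop_rate else w for w in weights]

def drop_region(weights, start, end):
    """(2) Invalidate every task in one contiguous region [start, end)."""
    return [0.0 if start <= i < end else w for i, w in enumerate(weights)]

rng = random.Random(0)  # fixed seed for reproducibility
dropped = drop_tasks([0.5, 1.0, 1.5, 2.0], drop_rate=0.5, rng=rng)
region = drop_region([0.5, 1.0, 1.5, 2.0], start=1, end=3)
```

The remaining (non-zero) weights would then be adjusted by further training so the graph still yields usable results with the dropped tasks missing.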
本公开实施例中,所有要进行训练的计算图必然都是可训练计算图(如神经网络)。In the embodiment of the present disclosure, all computation graphs to be trained must be trainable computation graphs (eg, neural networks).
本公开实施例中,每种训练可仅进行一次,也可循环进行多次。当进行多次训练时,各次训练采用的具体方式可相同,也可不同。当进行多次训练时,可在达到预设的结束标准时结束训练,结束标准可包括计算图收敛、达到预定训练次数、达到预定的冗余性能等。In the embodiment of the present disclosure, each type of training may be performed only once, or may be performed multiple times in a loop. When performing multiple training sessions, the specific methods used for each training session may be the same or different. When multiple trainings are performed, the training can be ended when a preset end criterion is reached, and the end criterion may include the convergence of the computation graph, reaching a predetermined number of training times, reaching a predetermined redundancy performance, and the like.
本公开实施例中的所有训练都可符合以上要求,后续不再详细描述。All the trainings in the embodiments of the present disclosure can meet the above requirements, and will not be described in detail later.
本公开实施例对计算图的训练方式不作限定，能够实现计算图训练的任意一种方法均可用于对计算图进行训练。The embodiments of the present disclosure do not limit the manner of training the computation graph; any method capable of training a computation graph may be used to train it.
在一些实施例中，参照图3、图4，将计算图的每个层分为多个任务块(S203)，包括：扩展计算图，将扩展后的计算图的每个层分为多个任务块。In some embodiments, referring to FIG. 3 and FIG. 4, dividing each layer of the computation graph into multiple task blocks (S203) includes: expanding the computation graph, and dividing each layer of the expanded computation graph into multiple task blocks.
其中,扩展包括在计算图的至少部分层中添加冗余任务。Wherein, the expansion includes adding redundant tasks to at least some layers of the computation graph.
作为本公开实施例的一种方式，可先对其进行“扩展”，或者说“反压缩”，即在计算图各层中“增加”一些原本没有的任务(冗余任务)，之后再对经过反压缩的计算图分块，从而所得的至少部分任务块中包括以上“冗余任务”，以提高冗余性能。As one mode of the embodiment of the present disclosure, the computation graph may first be "expanded", or "decompressed"; that is, some tasks that were not originally present (redundant tasks) are "added" to the layers of the computation graph, and the decompressed computation graph is then divided into blocks, so that at least some of the resulting task blocks include the above "redundant tasks", improving redundancy performance.
其中，各层具体的“扩展量”可以是预先设定的。例如，可设定冗余系数b，表示扩展出的任务的运算量相对原任务的运算量的比例：若冗余系数b=0，则相当于未扩展；若b=1，则相当于扩展出的任务的运算量与原任务的运算量相同；通常而言，b可大于0，且小于或等于1(若b大于1也可行，如可有任务备份“多份”)。The specific "expansion amount" of each layer may be preset. For example, a redundancy coefficient b may be set, representing the ratio of the computation amount of the added tasks to that of the original tasks: b = 0 is equivalent to no expansion, and b = 1 means the computation amount of the added tasks equals that of the original tasks. Generally, b may be greater than 0 and less than or equal to 1 (b greater than 1 is also feasible, for example when a task is backed up in "multiple copies").
或者,也可以是根据确定的方式对各层进行扩展,以根据该方式扩展得到的实际任务量为准。Alternatively, each layer may be expanded according to a certain method, and the actual task amount obtained by the expansion according to this method shall prevail.
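A minimal sketch of expansion with a redundancy coefficient b follows; backing up the first ceil(b·n) tasks (rather than weighting by per-task computation amount) is a simplifying assumption of this sketch:

```python
# Sketch of layer expansion: the added (redundant) workload is roughly
# b times the original layer's workload.

import math

def expand_layer(tasks, b):
    n_redundant = math.ceil(b * len(tasks))
    backups = [t + "_backup" for t in tasks[:n_redundant]]
    return tasks + backups  # b = 0 adds nothing; b = 1 duplicates the layer

expanded = expand_layer(["n0", "n1", "n2", "n3"], b=0.5)
```

The backup tasks would subsequently be placed in different task blocks from their originals, as noted below for backup tasks in general.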
在一些实施例中,冗余任务包括以下至少一项:In some embodiments, redundant tasks include at least one of the following:
(1)备份任务。其中,备份任务与相应层中的任务相同。作为本公开实施例的一种方式,每个扩展出的任务可是其对应的层中原有的一个任务,从而,相应任务实际存在“多份”,故可相互作为“备份”。即,当某个任务未被完成时(如所在处理核故障),后续任务可利用其备份的运算结果进行,以提高鲁棒性。(1) Backup tasks. Among them, the backup task is the same as the task in the corresponding layer. As a mode of the embodiment of the present disclosure, each extended task may be an original task in its corresponding layer, so that there are actually "multiple copies" of the corresponding task, so they can be used as "backups" for each other. That is, when a certain task is not completed (eg, the processing core fails), subsequent tasks can be performed using its backed-up operation results to improve robustness.
当然，对于“备份任务”和“原任务”，后续通常应被分入不同的任务块，且这些任务块应被映射到不同的处理核中。Of course, a "backup task" and its "original task" should usually be divided into different task blocks subsequently, and these task blocks should be mapped into different processing cores.
(2)空任务。作为本公开实施例的一种方式,可扩展不进行实际运算(或者说进行空运算)的空任务。(2) Empty tasks. As a way of the embodiment of the present disclosure, empty tasks that do not perform actual operations (or perform empty operations) can be expanded.
(3)无效任务。作为本公开实施例的一种方式，可扩展出一些需要进行运算，但不是原计算图中所需的运算的任务，即无效任务。其中，无效任务可以是随机产生的，也可通过其它具体的反压缩技术产生。(3) Invalid tasks. As one mode of the embodiment of the present disclosure, the expansion may add tasks that perform computation but whose operations are not required by the original computation graph, i.e., invalid tasks. Invalid tasks may be generated randomly, or by other specific decompression techniques.
其中,不同层的扩展方式可以相同,也可不同。The expansion modes of different layers may be the same or different.
在一些实施例中,将计算图的每个层分为多个任务块(S203),包括以下任意一项:In some embodiments, each layer of the computation graph is divided into a plurality of task blocks (S203), including any one of the following:
(1)将计算图的每个层随机分为多个任务块。(1) Randomly divide each layer of the computational graph into multiple task blocks.
作为本公开实施例的一种方式，可以是对每层中的任务随机的进行“分块”，即每个任务块中的任务数量以及具体的任务，都是随机的。As one mode of the embodiment of the present disclosure, the tasks in each layer may be divided into blocks randomly; that is, both the number of tasks in each task block and the specific tasks it contains are random.
(2)将计算图的每个层均匀的分为多个任务块。(2) Divide each layer of the computation graph into multiple task blocks evenly.
作为本公开实施例的一种方式，可以是将每层均匀的分为多个任务块，即同一层的各任务块中的任务数量相等或基本相等(例如，同一层的所有任务块中，若以任务最少的任务块中的任务数量为100%，则任务最多的任务块中的任务数量不超过110%)。As one mode of the embodiment of the present disclosure, each layer may be divided evenly into multiple task blocks; that is, the numbers of tasks in the task blocks of a same layer are equal or substantially equal (for example, among all task blocks of a same layer, if the number of tasks in the smallest task block is taken as 100%, the number of tasks in the largest task block does not exceed 110%).
(3)将计算图的每个层分为多个预任务块,将根据第一映射关系应被映射到一个处理核的所有预任务块合并为一个任务块。(3) Divide each layer of the computation graph into multiple pre-task blocks, and combine all the pre-task blocks that should be mapped to one processing core according to the first mapping relationship into one task block.
作为本公开实施例的一种方式，可以在分块时考虑到后续的“映射”，即先将各层分为多个“块(预任务块)”，而后续若有多个预任务块会被映射到一个处理核中，则可将它们直接合并，作为一个任务块。As one mode of the embodiment of the present disclosure, the subsequent "mapping" may be taken into account during blocking; that is, each layer is first divided into multiple "blocks (pre-task blocks)", and if multiple pre-task blocks would subsequently be mapped into one processing core, they can be merged directly into a single task block.
(4)至少根据各处理核的硬件资源,将计算图的每个层分为多个任务块。(4) At least according to the hardware resources of each processing core, each layer of the computation graph is divided into a plurality of task blocks.
作为本公开实施例的一种方式，在进行“分块”时，还可考虑处理核的实际硬件资源(如缓存等)，以及后续的“映射”，从而根据每个任务块被映射到的处理核的硬件资源，决定应将哪些任务分入该任务块中。As one mode of the embodiment of the present disclosure, the actual hardware resources of the processing cores (such as caches) and the subsequent "mapping" may also be considered during blocking, so that which tasks should be grouped into a task block is decided according to the hardware resources of the processing core that the task block will be mapped into.
其中,不同处理核的硬件资源可相同,也可不同。The hardware resources of different processing cores may be the same or different.
其中,不同层的分块方式可以相同,也可不同。Wherein, the block mode of different layers may be the same or different.
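Blocking strategy (2) above, uniform blocking, can be sketched as follows; dealing tasks so block sizes differ by at most one is an illustrative way to satisfy the bound, and for large layers it keeps the largest block well within 110% of the smallest:

```python
# Sketch of uniform blocking: the first `extra` blocks receive one task
# more than the rest, so sizes differ by at most one.

def uniform_blocks(tasks, num_blocks):
    base, extra = divmod(len(tasks), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)
        blocks.append(tasks[start:start + size])
        start += size
    return blocks

sizes = [len(b) for b in uniform_blocks(list(range(100)), num_blocks=3)]
```

With 100 tasks and 3 blocks the sizes are 34/33/33, i.e. the largest block is about 103% of the smallest, inside the 110% bound stated above.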
在一些实施例中,根据映射关系,任意一个层的任意两个任务块被映射到两个不同处理核中。In some embodiments, any two task blocks of any one layer are mapped to two different processing cores according to the mapping relationship.
在一些实施例中，参照图3、图4，在将计算图的每个层分为多个任务块(S203)与确定各任务块与众核系统的多个处理核间的第一映射关系(S205)之间，还包括：In some embodiments, referring to FIG. 3 and FIG. 4, between dividing each layer of the computation graph into multiple task blocks (S203) and determining the first mapping relationship between the task blocks and the multiple processing cores of the many-core system (S205), the method further includes:
S204,无效部分任务块,以训练各任务块,提高计算图的冗余性能。S204, invalidating some task blocks to train each task block and improve the redundancy performance of the computation graph.
在进行“分块”后，还可将一个或多个(但不能是全部)任务块无效(即将其中所有的任务无效)，并调整其它的任务块中的任务，以使剩余的任务块仍可产生一定程度上可用的结果(即训练任务块)，改善计算图的鲁棒性。After "blocking", one or more (but not all) task blocks may also be invalidated (that is, all tasks in them are invalidated), and the tasks in the other task blocks are adjusted so that the remaining task blocks can still produce results that are usable to a certain degree (that is, the task blocks are trained), improving the robustness of the computation graph.
当然,以上训练本质上也是对“计算图”的训练。Of course, the above training is essentially training on "computational graphs".
以上训练是在分块后(也是在扩展或反压缩后)进行的,相当于提高“块级别”的鲁棒性。The above training is performed after chunking (also after expansion or decompression), which is equivalent to improving the robustness at the "chunk level".
在一些实施例中,无效部分任务块,以训练各任务块(S204)包括以下至少一项:In some embodiments, invalidating partial task blocks to train each task block (S204) includes at least one of the following:
(1)随机无效部分任务块,以训练计算图中其它的任务块。(1) Random invalid partial task blocks to train other task blocks in the computation graph.
作为本公开实施例的一种方式，可以是随机的无效部分任务块，以训练计算图中剩余的任务块。As one mode of the embodiment of the present disclosure, some task blocks may be invalidated at random to train the remaining task blocks in the computation graph.
(2)确定包括关键任务的关键任务块，无效关键任务块，以训练计算图中各任务块。(2) Determining key task blocks that include key tasks, and invalidating the key task blocks, so as to train the task blocks in the computation graph.
作为本公开实施例的一种方式，可以是根据计算图的结构特征确定出其中起到关键作用的“关键任务”，并将关键任务所在的关键任务块无效，以训练任务块。As one mode of the embodiment of the present disclosure, "key tasks" that play a key role may be determined according to the structural features of the computation graph, and the key task blocks where the key tasks are located are invalidated, so as to train the task blocks.
在一些实施例中,参照图3、图4,在确定各任务块与众核系统的多个处理核间的第一映射关系(S205)之后,还包括:In some embodiments, referring to FIG. 3 and FIG. 4 , after determining the first mapping relationship between each task block and the multiple processing cores of the many-core system (S205), the method further includes:
S206，无效部分处理核中映射的全部任务块，以训练各任务块，提高计算图的冗余性能。S206, invalidating all task blocks mapped into some of the processing cores, so as to train the task blocks and improve the redundancy performance of the computation graph.
在确定了第一映射关系后(包括进行实际的映射后)，还可将部分处理核中映射的全部任务无效(相当于将部分处理核无效)，并调整其它任务，以使剩余的任务仍可产生一定程度上可用的结果(即训练任务块)，改善鲁棒性。After the first mapping relationship is determined (including after the actual mapping is performed), all tasks mapped into some of the processing cores may also be invalidated (equivalent to invalidating those processing cores), and the other tasks are adjusted so that the remaining tasks can still produce results that are usable to a certain degree (that is, the task blocks are trained), improving robustness.
当然,以上训练本质上也是对“计算图”的训练。Of course, the above training is essentially training on "computational graphs".
以上训练相当于模拟部分处理核因故障等无效的情况,故可从最终实际应用的角度,改善“核级别”的鲁棒性。The above training is equivalent to simulating some cases where the core is invalid due to failure, so the robustness of the "core level" can be improved from the perspective of the final practical application.
在一些实施例中，无效部分处理核中映射的全部任务块，以训练各任务块(S206)，包括以下至少一项：In some embodiments, invalidating all task blocks mapped into some of the processing cores to train the task blocks (S206) includes at least one of the following:
(1)随机无效部分处理核中映射的全部任务块，以训练各任务块。(1) Randomly invalidating all task blocks mapped into some of the processing cores, so as to train the task blocks.
作为本公开实施例的一种方式，可以是将映射至一个或多个(但不能是全部)处理核的全部任务无效(即将一个或多个处理核无效)，以训练各任务块。As one mode of the embodiment of the present disclosure, all tasks mapped into one or more (but not all) processing cores may be invalidated (that is, one or more processing cores are invalidated), so as to train the task blocks.
(2)依次分别无效每个处理核中映射的全部任务块,以训练各任务块。(2) Invalidate all the task blocks mapped in each processing core in turn, so as to train each task block.
作为本公开实施例的一种方式,可以是将各处理核依次无效,即每次仅无效一个处理核,但所有处理核均被无效过。As a method of the embodiment of the present disclosure, each processing core may be invalidated in sequence, that is, only one processing core is invalidated at a time, but all processing cores have been invalidated.
(3)确定包括关键任务的关键任务块，无效关键任务块被映射到的处理核中映射的全部任务块，以训练各任务块。(3) Determining key task blocks that include key tasks, and invalidating all task blocks mapped into the processing cores to which the key task blocks are mapped, so as to train the task blocks.
作为本公开实施例的一种方式，可以是根据计算图的结构确定关键任务，进而确定关键任务所在的关键任务块，并将关键任务块对应的处理核无效，以训练各任务块。As one mode of the embodiment of the present disclosure, key tasks may be determined according to the structure of the computation graph, the key task blocks where the key tasks are located are then determined, and the processing cores corresponding to the key task blocks are invalidated, so as to train the task blocks.
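Strategy (2) of S206, invalidating each processing core in turn, can be sketched as follows; the `evaluate` callback standing in for the train-and-adjust step is a hypothetical name, not a disclosed API:

```python
# Sketch of S206 strategy (2): invalidate one processing core at a time
# (all task blocks mapped into it become invalid) and evaluate/train the
# surviving blocks under each single-core failure.

def train_under_single_core_failures(core_to_blocks, evaluate):
    results = {}
    for failed_core in core_to_blocks:
        surviving = {core: blocks
                     for core, blocks in core_to_blocks.items()
                     if core != failed_core}
        results[failed_core] = evaluate(surviving)
    return results

core_to_blocks = {"c0": ["b0", "b2"], "c1": ["b1", "b3"], "c2": ["b4"]}
surviving_counts = train_under_single_core_failures(
    core_to_blocks,
    evaluate=lambda alive: sum(len(b) for b in alive.values()),
)
```

Each core is invalidated exactly once, matching "only one processing core at a time, but all processing cores are invalidated" in the text above.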
在一些实施例中,参照图3,在确定各任务块与众核系统的多个处理核间的第一映射关系(S205)之后,还包括:In some embodiments, referring to FIG. 3 , after determining the first mapping relationship between each task block and the multiple processing cores of the many-core system (S205), the method further includes:
S207,按照第一映射关系,将各任务块映射到多个处理核中。S207, map each task block to a plurality of processing cores according to the first mapping relationship.
S208,每个处理核处理被映射到其中的任务块中的任务。S208, each processing core processes the tasks in the task blocks mapped therein.
在确定第一映射关系后，还可根据第一映射关系将各任务块映射(或者说分配)到处理核中，并由各处理核进行处理，以实现计算图的实际功能，解决以上待处理问题。After the first mapping relationship is determined, the task blocks may be mapped (or assigned) to the processing cores according to the first mapping relationship and processed by the processing cores, so as to realize the actual function of the computation graph and solve the above problem to be processed.
当然,以上确定第一映射关系(S205)和按照第一映射关系进行映射(S207)的步骤实际可以是一体的,即可直接进行映射。Of course, the above steps of determining the first mapping relationship (S205) and performing the mapping according to the first mapping relationship (S207) may actually be integrated, that is, the mapping can be performed directly.
当然,当要进行以上S206步骤的训练时,参照图3,S206步骤可以是在S207步骤前进行的,即可不将任务实际映射至处理核中,而只根据第一映射关系,将处理核对应的任务无效,进行训练。Of course, when the training of the above step S206 is to be performed, referring to FIG. 3 , the step S206 may be performed before the step S207, that is, the task is not actually mapped to the processing core, but only the processing core is mapped according to the first mapping relationship. The task is invalid, and the training is performed.
或者,S206步骤也可在S207步骤之后进行,即可将任务实际映射至处理核中,并实际将处理核无效,以进行训练。Alternatively, step S206 can also be performed after step S207, that is, the task can be actually mapped to the processing core, and the processing core can be actually invalidated for training.
In the embodiments of the present disclosure, all tasks in one layer of the computation graph are mapped to at least two different processing cores for processing, which improves the robustness of the computation graph. However, tasks of different layers are necessarily located in different processing cores, so for every subsequent task performed based on the result of a preceding task, that result must be transmitted across cores, from the processing core where the preceding task is located to the processing core where the subsequent task is located. As a result, a large number of inter-core routes necessarily exist between the processing cores, leading to a complex inter-core routing structure and a large amount of data transmitted by inter-core routing; and since the transmission efficiency of inter-core routing is far lower than that of intra-core routing, the complex inter-core routing degrades the performance of the many-core system.
Referring to FIG. 5, an embodiment of the present disclosure provides a task processing method in which tasks of different layers are grouped into one task area by inter-layer partitioning, so as to reduce the routing of the computation graph. The task processing method includes:

Step S501: acquiring a computation graph to be processed, where the computation graph includes multiple layers arranged in sequence, each layer includes multiple tasks, no task in any layer is performed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in preceding layers.
It should be noted that the computation graph may also be obtained in other ways, for example, as a preset computation graph; it may also be generated according to predetermined rules for the specific problem to be processed.

S503: dividing the computation graph into multiple task areas.

Each task area includes multiple tasks from at least two different layers, and the tasks of each layer are located in at least two task areas. The internal routing strength of each task area exceeds a preset standard, where the internal routing strength of a task area is determined according to the proportion of the routes between tasks within that task area to all the routes involving the tasks of that task area.
In some embodiments, dividing the computation graph into multiple task areas means assigning each task of the computation graph to a corresponding task area. Referring to FIG. 6 and FIG. 7, the computation graph is divided into three task areas (task area 0 to task area 2); the number of tasks in a layer is represented by the vertical size of the corresponding filled region, and the processing cores (processing core 0 to processing core 2) and core groups (core group 0 to core group 2) are represented by blank boxes.

The tasks in each task area cannot all come from one layer of the computation graph, but must come from at least two different layers; likewise, the tasks of one layer cannot all be located in one task area, but must be divided into at least two task areas. Thus, each task area effectively takes some tasks from each of multiple layers of the computation graph.
Referring to FIG. 6, many "routes" (or data paths) connect tasks in different layers of the computation graph: the operation result of a preceding task is transmitted to a subsequent task through a route, so that the subsequent task can use that result. Thus, many tasks are connected to one or more routes (both incoming and outgoing routes), and these are the routes "involved" in that task. Among "all routes" involved in the tasks of a task area (shown as dashed boxes in FIG. 6 and FIG. 7), some routes connect tasks within that task area at both ends; such routes are called the "internal routes" of the task area (shown as solid arrows in FIG. 6), i.e., the routes between tasks within the task area. Other routes connect a task in the task area at one end and a task outside the task area (necessarily a task in another task area) at the other end; such routes are called the "external routes" of the task area (shown as dashed arrows in FIG. 6).

The "internal routing strength" of a task area can be calculated according to the proportion of its "internal routes" to its "all routes"; thus, the internal routing strength reflects the relative degree of "internal interaction" among the tasks of a task area.

In the embodiments of the present disclosure, the internal routing strength of each task area obtained by the division exceeds the preset standard. Therefore, each task area is necessarily a "strong internal routing region", that is, the tasks in each task area mainly perform "internal interactions" and have fewer "external interactions" with tasks in other task areas.
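The internal/external distinction above can be sketched in code. The following is an illustrative sketch only, not part of the disclosed embodiments; the function name `classify_routes` and the data layout are hypothetical assumptions. Given the routes of the computation graph as directed task-to-task edges and an assignment of tasks to task areas, each route involving a task area is classified as internal or external:

```python
# Hypothetical sketch: classify each route involving a task area as internal
# or external, given a task -> area assignment. All names are illustrative.
def classify_routes(routes, area_of):
    """routes: list of (src_task, dst_task); area_of: dict task -> area id.
    Returns {area: {"internal": [...], "external": [...]}}."""
    result = {}
    for src, dst in routes:
        # a route "involves" the area of each of its endpoints
        for area in {area_of[src], area_of[dst]}:
            bucket = result.setdefault(area, {"internal": [], "external": []})
            if area_of[src] == area_of[dst]:
                bucket["internal"].append((src, dst))
            else:
                bucket["external"].append((src, dst))
    return result

# Tasks t0 and t1 are in task area 0; t2 is in task area 1.
areas = {"t0": 0, "t1": 0, "t2": 1}
routes = [("t0", "t1"), ("t1", "t2")]
stats = classify_routes(routes, areas)
# Area 0 has one internal route (t0 -> t1) and one external route (t1 -> t2).
```

For each area, the two lists produced here together form the "all routes" whose internal share defines the internal routing strength.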
It should be noted that, for clarity of illustration, the computation graph is shown with three layers (layer 0 to layer 2) and is divided into three task areas. This does not limit the embodiments of the present disclosure, which place no restriction on the number of layers or the number of tasks of the computation graph.
S505: determining a second mapping relationship between the tasks of the task areas and the processing cores of the many-core system.

In the second mapping relationship, each task area corresponds to one core group, each core group includes one or more processing cores, and the tasks of each task area are mapped to the processing cores of the corresponding core group.

Referring to FIG. 7, after the task areas are determined, a corresponding core group is determined for each task area (each core group including one or more processing cores of the many-core system), and all tasks in each task area are mapped to the processing cores of the corresponding core group. That is, every task in a task area is mapped to some processing core of the corresponding core group, and every processing core of a core group has at least one task from the corresponding task area mapped to it.
In the embodiments of the present disclosure, a "strong internal routing region" of the computation graph, whose internal connections are relatively close, is divided into one task area, so the tasks of each task area perform fewer "external interactions"; and since each task area is mapped to one core group, less "cross-group" data transmission is needed between different core groups. Thus, the embodiments of the present disclosure require less inter-core (or inter-group) routing, which simplifies the inter-core (or inter-group) routing structure and reduces the amount of data transmitted by inter-core (or inter-group) routing, thereby improving the performance (e.g., efficiency) of the many-core system.

In addition, in the embodiments of the present disclosure, each task area includes tasks from multiple different layers, that is, the tasks of different layers are assigned to different core groups. Therefore, when one processing core is invalid (e.g., due to a fault), only part of a layer of the computation graph is typically "broken", and the situation in which all tasks of a layer are invalid does not arise; the computation graph as a whole can thus still produce a processing result that is usable to some extent, which improves the robustness of the computation graph.
It should be noted that the computation graph is a trainable computation graph, which can solve the same problem to be processed even when at least some of its tasks differ. The computation graph may be a neural network or another computation graph; the neural network may be, for example, a convolutional neural network (CNN), a spiking neural network (SNN), or a recurrent neural network (RNN). The redundancy performance of the computation graph can be improved by training it.

In some embodiments, at least some of the task areas include tasks from every layer of the computation graph. Referring to FIG. 6 and FIG. 7, the tasks of a task area may come from all layers, that is, a task area may take at least one task from every layer. Further, every task area may include tasks from every layer.
In some embodiments, the internal routing strength includes: an internal routing volume ratio, where the internal routing volume ratio of each task area is the proportion of the amount of data transmitted by the routes between tasks within that task area to the amount of data transmitted by all the routes involving the tasks of that task area; and/or an internal routing count ratio, where the internal routing count ratio of each task area is the proportion of the number of routes between tasks within that task area to the number of all the routes involving the tasks of that task area.

In some embodiments, the "internal routing volume ratio" and/or the "internal routing count ratio" may be used as the specific indicators of the above internal routing strength. All the routes of each task area consist of internal routes and external routes, and the amount of data transmitted by each route during the operation of the computation graph is fixed and can be known in advance; thus, the proportion of the data volume transmitted by the internal routes of a task area to the data volume transmitted by all its routes can serve as the internal routing volume ratio, one specific indicator of internal routing strength. Likewise, the "number" of each kind of route is determined and known; thus, the proportion of the number of internal routes of a task area to the number of all its routes can serve as the internal routing count ratio, another specific indicator of internal routing strength.

In some embodiments, the preset standard includes: the internal routing volume ratio being greater than a first threshold; and/or the internal routing count ratio being greater than a second threshold. That is, the preset standard that a task area should meet may be that one of the internal routing volume ratio and the internal routing count ratio is greater than the corresponding threshold, or that both are greater than the corresponding thresholds at the same time.

It should be understood that when both the internal routing volume ratio and the internal routing count ratio are required to be greater than the corresponding thresholds, each of those thresholds may differ from the threshold used when only one of the two ratios is required to exceed it.
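As a hedged illustration of the two indicators and the preset standard (all names are hypothetical, and the threshold values and the choice of requiring both criteria are assumptions, not the disclosed method):

```python
# Hypothetical sketch of the two internal-routing-strength indicators.
# data_of[route] is the (known, fixed) data volume transmitted by a route.
def routing_strength(internal, external, data_of):
    all_routes = internal + external
    count_ratio = len(internal) / len(all_routes)
    volume_ratio = (sum(data_of[r] for r in internal)
                    / sum(data_of[r] for r in all_routes))
    return count_ratio, volume_ratio

def meets_preset(count_ratio, volume_ratio, first_threshold, second_threshold):
    # Here both criteria are required; an implementation may require only one.
    return volume_ratio > first_threshold and count_ratio > second_threshold

internal = ["r0", "r1"]            # both ends inside the task area
external = ["r2"]                  # one end outside the task area
data = {"r0": 40, "r1": 40, "r2": 20}
cr, vr = routing_strength(internal, external, data)
# cr = 2/3 (count ratio), vr = 80/100 = 0.8 (volume ratio)
```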
In some embodiments, at least some of the core groups include only one processing core. Referring to FIG. 7, a core group (core group 2) may include only one processing core (processing core 2), so that all the tasks of its corresponding task area (task area 1) are necessarily mapped to that processing core. The data transmission between tasks of different layers within the task area (internal interaction, from the task area's perspective) is then entirely intra-core data transmission, which minimizes inter-core routing. Of course, it is also feasible for all core groups to include only one processing core each.
In some embodiments, at least some of the core groups include multiple processing cores. In the second mapping relationship, each task area corresponding to a core group that includes multiple processing cores is divided into multiple second task blocks, the number of second task blocks of a task area is equal to the number of processing cores included in the corresponding core group, and the second task blocks of each task area are respectively mapped to the processing cores of the corresponding core group.

As one approach of the embodiments of the present disclosure, referring to FIG. 7, a core group (core group 0, core group 1) may also include multiple processing cores (processing core 0, processing core 1). The corresponding task area (task area 0, task area 2) then needs to be "partitioned into blocks" first, each resulting second task block including multiple tasks, and each second task block is then assigned to one processing core of the corresponding core group (the number of second task blocks is therefore necessarily equal to the number of processing cores of the core group).

Obviously, the hardware resources (e.g., caches) of each processing core are fixed, so not every processing core is necessarily able to process all tasks of an entire task area. Therefore, multiple processing cores may be combined into one core group to jointly process one task area, based on comprehensive consideration of hardware resources, computing load balancing, and the like.

Referring to FIG. 7, in order to simplify the routing structure between different processing cores of the same core group, tasks from the same layer may be placed in the same second task block as far as possible, so that tasks of the same layer subsequently reside in one processing core as far as possible. In this way, among the multiple processing cores of a core group, inter-core routes are mainly established between processing cores corresponding to different layers, rather than forming a very complex "grid-like" routing. Of course, it is also feasible for all core groups to include multiple processing cores each.
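A minimal sketch of this kind of block partitioning, assuming the tasks of a task area are already grouped by layer. The round-robin assignment of whole layers and all names are hypothetical assumptions, not an optimized partitioner:

```python
# Hypothetical sketch: split a task area into as many second task blocks as
# the core group has processing cores, keeping tasks of the same layer
# together (each whole layer lands in exactly one block).
def split_into_blocks(tasks_by_layer, num_cores):
    blocks = [[] for _ in range(num_cores)]
    for i, (layer, tasks) in enumerate(sorted(tasks_by_layer.items())):
        blocks[i % num_cores].extend(tasks)   # whole layer goes to one block
    return blocks

area = {0: ["a0", "a1"], 1: ["b0"], 2: ["c0", "c1"]}   # layer -> tasks
blocks = split_into_blocks(area, 2)
# blocks[0] holds layers 0 and 2; blocks[1] holds layer 1.
```

With same-layer tasks kept together, inter-core routes inside the core group arise mainly between blocks holding different layers, as described above.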
In some embodiments, in at least some of the core groups that include multiple processing cores, the distance between any two processing cores is less than a preset distance. For example, in a core group including multiple processing cores (e.g., core group 0 or core group 1), the distance between the corresponding processing cores (e.g., processing core 0 and processing core 1) may be less than the preset distance. Since the transmission efficiency of inter-core routing is also related to the distance between processing cores, grouping processing cores that are "relatively close" into one core group can reduce the time consumed by data transmission within the core group.

The specific form of the above "distance between processing cores" may vary: for example, it may be the straight-line physical distance between the processing cores, the total length of the inter-core routes connecting the processing cores, or the number of other processing cores between them (i.e., the number of routing hops), which will not be described in detail here.
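As a hedged sketch of the hop-count variant of this distance, assuming cores are arranged on a 2D mesh and addressed by (x, y) coordinates; the mesh layout, the strict-inequality check, and all names are illustrative assumptions:

```python
# Hypothetical sketch: "distance between processing cores" as routing hop
# count on a 2D mesh (Manhattan distance between core coordinates).
def hop_distance(core_a, core_b):
    (xa, ya), (xb, yb) = core_a, core_b
    return abs(xa - xb) + abs(ya - yb)

def valid_core_group(cores, preset_distance):
    # every pair of cores in the group must be closer than the preset distance
    return all(hop_distance(a, b) < preset_distance
               for i, a in enumerate(cores) for b in cores[i + 1:])

group = [(0, 0), (0, 1), (1, 0)]   # three neighboring cores on the mesh
# the largest pairwise hop distance in this group is 2
```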
In some embodiments, at least some of the processing cores belong to multiple core groups at the same time. Referring to FIG. 7, the processing cores included in different core groups may "overlap" (e.g., core group 0 and core group 1 both include processing core 0 and processing core 1), so that tasks of different task areas (including second task blocks) may be assigned to the same processing core, making fuller use of hardware resources and better balancing the computing load.

The processing cores included in different core groups may be exactly the same (which may also be regarded as multiple core groups being "merged"), or the processing cores included in different core groups may "partially overlap", which will not be described in detail here.
In some embodiments, between dividing the computation graph into multiple task areas (S503) and determining the second mapping relationship between the tasks of the task areas and the processing cores of the many-core system (S505), the method further includes: expanding at least some of the task areas, where the expansion includes adding redundant tasks to a task area.

Referring to FIG. 5 and FIG. 7, after the multiple task areas are obtained by division, each task area may also be "expanded", or "decompressed", that is, some tasks that were not originally present (redundant tasks) are "added" to the task area; the decompressed task area is then mapped to the corresponding core group (including mapping after block partitioning), so that the processing cores also include the above "redundant tasks", thereby improving the redundancy performance.
The specific "expansion amount" of each layer may be preset. For example, a redundancy coefficient b may be set to represent the ratio of the operation amount of the expanded tasks to the operation amount of the original tasks: b = 0 is equivalent to no expansion, and b = 1 means the operation amount of the expanded tasks equals that of the original tasks. Generally, b may be greater than 0 and less than or equal to 1 (b greater than 1 is also feasible, e.g., a task may be backed up in "multiple copies").

Alternatively, each layer may be expanded according to a determined method, in which case the actual task amount obtained by expanding according to that method prevails.
In some embodiments, the redundant tasks include at least one of the following:

(1) Backup tasks, which are identical to tasks in the corresponding task area. As one approach of the embodiments of the present disclosure, each expanded task may be an original task of its corresponding task area, so that the corresponding task actually exists in "multiple copies" that can serve as "backups" for one another, improving robustness.

(2) Empty tasks. As one approach of the embodiments of the present disclosure, empty tasks that perform no actual operation (or perform a null operation) may be added.

(3) Invalid tasks. As one approach of the embodiments of the present disclosure, tasks may be added that require operations, but not operations needed by the original computation graph, i.e., invalid tasks. The invalid tasks may be generated randomly, or may be generated by other specific decompression techniques.

The expansion modes of different task areas may be the same or different.
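A minimal sketch of the backup-task variant under the redundancy coefficient b described above. The greedy duplication order, the per-task cost model, and all names are hypothetical assumptions, not the disclosed method:

```python
# Hypothetical sketch: expand a task area with backup (duplicate) tasks so
# that the added operation amount is at most b times the original operation
# amount (redundancy coefficient b).
def expand_with_backups(tasks, cost_of, b):
    budget = b * sum(cost_of[t] for t in tasks)
    backups, used = [], 0
    for t in tasks:                   # duplicate tasks until the budget is spent
        if used + cost_of[t] > budget:
            break
        backups.append(("backup", t))
        used += cost_of[t]
    return tasks + backups

area = ["t0", "t1", "t2"]
cost = {"t0": 1, "t1": 1, "t2": 1}
expanded = expand_with_backups(area, cost, 1.0)   # b = 1: full duplication
# expanded now holds the 3 original tasks plus 3 backup tasks
```

With b = 0 the function returns the area unchanged, matching the "no expansion" case above.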
In some embodiments, between acquiring the computation graph (S501) and dividing the computation graph into multiple task areas (S503), the method further includes:

S502: training the computation graph to improve its redundancy performance.

Referring to FIG. 5 and FIG. 7, before the computation graph is "partitioned", it may also be trained to improve its redundancy performance.
In some embodiments, the above training includes at least one of the following:

(1) Invalidating some tasks in the computation graph, so as to train the computation graph.

As one approach of the embodiments of the present disclosure, a Dropout method may be used to invalidate some tasks in the computation graph (e.g., setting the weights of some nodes of the neural network to 0) and to adjust the other tasks, so that the computation graph can still produce a result usable to some extent when those tasks are invalid, improving its robustness.

The invalidated tasks may be located in one contiguous region of the computation graph (not necessarily a task area); that is, the training may specifically be Dropblock training.
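A hedged sketch of the two invalidation patterns (random Dropout versus contiguous Dropblock) on a flat list of per-task weights. All names are hypothetical, and a real training step would also adjust the remaining tasks afterwards:

```python
import random

# Hypothetical sketch: invalidate tasks by zeroing their weights, either at
# random (Dropout-style) or as one contiguous block (Dropblock-style).
def drop_tasks(weights, drop_ratio, rng, block=False):
    n = len(weights)
    k = int(n * drop_ratio)
    if block:                         # contiguous region of the graph
        start = rng.randrange(n - k + 1)
        dropped = set(range(start, start + k))
    else:                             # independently chosen tasks
        dropped = set(rng.sample(range(n), k))
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

rng = random.Random(0)
w = [0.5, 0.2, 0.9, 0.1, 0.7]
w_block = drop_tasks(w, 0.4, rng, block=True)   # zeroes 2 adjacent weights
```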
(2) Training the computation graph by an adversarial-example defense method.

As one approach of the embodiments of the present disclosure, the computation graph may be trained by the "adversarial-example defense" method.
In some embodiments, between dividing the computation graph into multiple task areas (S503) and determining the second mapping relationship between the tasks of the task areas and the processing cores of the many-core system (S505), the method further includes:

S504: invalidating some of the task areas, so as to train the task areas and improve the redundancy performance of the computation graph.

Referring to FIG. 5 and FIG. 7, after the "partitioning", one or more (but not all) of the task areas may also be invalidated (that is, all tasks in them invalidated), and the tasks in the other task areas adjusted, so that the remaining task areas can still produce a result usable to some extent (that is, the task areas are trained), improving robustness.

Of course, the above training is also, in essence, training of the "computation graph".

The above training is performed after the "partitioning" (specifically, either after or before the expansion), which amounts to improving robustness at the "area level".
In some embodiments, the above training includes at least one of the following:

(1) Randomly invalidating some of the task areas, so as to train the task areas.

As one approach of the embodiments of the present disclosure, some task areas may be invalidated at random, so as to train the remaining task areas.

(2) Determining a key task area that includes a key task, and invalidating the key task area, so as to train the task areas.

As one approach of the embodiments of the present disclosure, the "key task" that plays a key role may be determined according to the structural features of the computation graph, and the key task area where the key task is located may be invalidated, so as to train the task areas.
In some embodiments, after determining the second mapping relationship between the tasks of the task areas and the processing cores of the many-core system (S505), the method further includes:

S506: invalidating all tasks mapped to some of the processing cores, so as to train the task areas and improve the redundancy performance of the computation graph.

Referring to FIG. 5 and FIG. 7, after the second mapping relationship is determined (including after the actual mapping has been performed), all tasks mapped to some of the processing cores may also be invalidated (equivalent to invalidating those processing cores), and the other tasks adjusted, so that the remaining tasks can still produce a result usable to some extent (that is, the task areas are trained), improving robustness.

Of course, the above training is also, in essence, training of the "computation graph".

The above training amounts to simulating the situation in which some processing cores are invalid due to faults or the like, so it can improve robustness at the "core level" from the perspective of the final practical application.
In some embodiments, the above training includes at least one of the following:

(1) Randomly invalidating all tasks mapped to some of the processing cores, so as to train the task areas.

As one approach of the embodiments of the present disclosure, all tasks mapped to one or more (but not all) of the processing cores may be invalidated (that is, one or more processing cores are invalidated), so as to train the task areas.

(2) Invalidating, in sequence, all tasks mapped to each processing core, so as to train the task areas.

As one approach of the embodiments of the present disclosure, the processing cores may be invalidated in sequence, that is, only one processing core is invalidated at a time, until every processing core has been invalidated once.

(3) Determining a key task, and invalidating all tasks mapped to the processing core to which the key task is mapped, so as to train the task areas.

As one approach of the embodiments of the present disclosure, the key task may be determined according to the structure of the computation graph, and the processing core where the key task is located may then be invalidated, so as to train the task areas.
In some embodiments, referring to FIG. 5, after determining the second mapping relationship between the tasks of the task areas and the processing cores of the many-core system (S505), the method further includes:

S507: mapping the tasks of the task areas to the processing cores of the many-core system according to the second mapping relationship.

S508: processing, by each processing core, the tasks mapped to it.
After the second mapping relationship is determined, the tasks in each task area may further be mapped (that is, allocated) to the processing cores of the corresponding core group according to the second mapping relationship and processed by those processing cores, so as to realize the actual function of the computation graph and solve the above problem to be processed.

Of course, the step of determining the second mapping relationship (S505) and the step of performing the mapping (S507) may in practice be combined, that is, the mapping may be performed directly.

Of course, when the training of step S506 is to be performed, referring to FIG. 5, step S506 may be performed before step S507; in this case, the tasks are not actually mapped to the processing cores, and only the tasks corresponding to a processing core are invalidated according to the second mapping relationship, so as to perform the training.

Alternatively, step S506 may be performed after step S507; in this case, the tasks are actually mapped to the processing cores, and a processing core is actually invalidated, so as to perform the training.
In the embodiments of the present disclosure, all computation graphs to be trained are necessarily trainable computation graphs (e.g., neural networks). Each kind of training may be performed only once, or repeated multiple times. When training is performed multiple times, the specific manner adopted in each round may be the same or different, and the training may end when a preset end criterion is reached; the end criterion may include convergence of the computation graph, reaching a predetermined number of training rounds, reaching a predetermined redundancy performance, and the like.
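The repeated-training control flow described above can be sketched as follows. The sketch is illustrative only: the convergence check, round limit, and redundancy target are modeled as caller-supplied callables, and none of the names come from the disclosure.

```python
def train_until_done(train_step, max_rounds=100, converged=None, redundancy_ok=None):
    """Run training rounds until any preset end criterion is met:
    convergence of the computation graph, a predetermined number of rounds,
    or a predetermined redundancy performance (all criteria are optional)."""
    for round_no in range(1, max_rounds + 1):
        loss = train_step(round_no)  # one training pass; each round may differ
        if converged is not None and converged(loss):
            return round_no, "converged"
        if redundancy_ok is not None and redundancy_ok():
            return round_no, "redundancy reached"
    return max_rounds, "round limit"
```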
In the embodiments of the present disclosure, a "strong internal routing region" of the computation graph, i.e., a region whose internal connections are relatively close, is divided into one task area, so the tasks in each task area require little "external interaction"; and since each task area is mapped to one core group, little "cross-group" data transmission is needed between different core groups. Therefore, less inter-core (or inter-group) routing is required in the embodiments of the present disclosure, which can simplify the inter-core (or inter-group) routing structure and reduce the amount of data transmitted over inter-core (or inter-group) routes, thereby improving the performance (e.g., efficiency) of the many-core system.
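The "internal routing strength" that qualifies a region as a strong internal routing region can be computed as the share of a region's routes, by count and/or by data volume, that stay inside the region, as defined later in claim 9. A minimal sketch, assuming routes are modeled as (source task, destination task, data volume) triples and with arbitrary example thresholds:

```python
def internal_routing_strength(region_tasks, routes):
    """routes: iterable of (src_task, dst_task, data_volume) triples.
    Returns (count_ratio, volume_ratio) over all routes involving the region."""
    region = set(region_tasks)
    involved = [r for r in routes if r[0] in region or r[1] in region]
    if not involved:
        return 0.0, 0.0
    # Internal routes have both endpoints inside the task area.
    internal = [r for r in involved if r[0] in region and r[1] in region]
    count_ratio = len(internal) / len(involved)
    volume_ratio = sum(r[2] for r in internal) / sum(r[2] for r in involved)
    return count_ratio, volume_ratio

def is_strong_internal_region(region_tasks, routes, count_thresh=0.5, vol_thresh=0.5):
    # Preset standard: either ratio exceeding its threshold (threshold values
    # are illustrative; the disclosure leaves them unspecified).
    c, v = internal_routing_strength(region_tasks, routes)
    return c > count_thresh or v > vol_thresh
```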
In addition, in the embodiments of the present disclosure, each task area includes tasks from multiple different layers, i.e., the tasks of each layer are distributed across different core groups. Therefore, when one processing core fails (e.g., due to a fault), a layer of the computation graph generally only "breaks in part", rather than all tasks of that layer becoming invalid, so that the computation graph as a whole can still produce a processing result that is usable to some extent, which improves the robustness of the computation graph.
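The robustness argument above can be checked with a toy simulation: when each layer's tasks are spread over at least two cores, invalidating one core removes only part of every layer rather than an entire layer. The layer/core layout below is illustrative, not from the disclosure.

```python
def surviving_fraction_per_layer(layer_to_cores, failed_core):
    """layer_to_cores: {layer: {core: task_count}}. Returns the fraction of
    each layer's tasks that survive when a single core is invalidated."""
    out = {}
    for layer, cores in layer_to_cores.items():
        total = sum(cores.values())
        lost = cores.get(failed_core, 0)  # only this core's share is lost
        out[layer] = (total - lost) / total
    return out
```

With every layer mapped to at least two cores, no layer's surviving fraction drops to zero after a single-core fault, matching the partial-degradation behavior described above.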
In a second aspect, referring to FIG. 8, an embodiment of the present disclosure provides a task processing apparatus 800, including:
an obtaining module 801, configured to obtain a computation graph of a problem to be processed, the computation graph including a plurality of sequentially arranged layers, each layer including a plurality of tasks, where a task in any layer is not performed based on the results of tasks in its own layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in preceding layers;
a block-dividing module 802, configured to divide each layer of the computation graph into a plurality of task blocks, each task block including at least one task; and
a mapping module 803, configured to determine a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system, where, according to the first mapping relationship, each task block is mapped to one processing core, a plurality of task blocks are mapped to each processing core, and all the task blocks of any one layer are mapped to at least two different processing cores.
In some embodiments, the task processing apparatus 800 further includes:
a training module 804, configured to train the computation graph based on the first mapping relationship to obtain a trained computation graph, where the computation graph includes a plurality of sequentially arranged layers, each layer includes a plurality of tasks, a task in any layer is not performed based on the results of tasks in its own layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in preceding layers; and
a partitioning module 805, configured to divide the computation graph into a plurality of task areas, where each task area includes a plurality of tasks from at least two different layers, the tasks of each layer are located in at least two task areas, and the internal routing strength of each task area exceeds a preset standard, the internal routing strength of each task area being determined according to the proportion of the routes between tasks in the task area to all the routes involving tasks in the task area.
The mapping module 803 is further configured to determine a second mapping relationship between the tasks of each task area and the processing cores of the many-core system, where, in the second mapping relationship, each task area corresponds to one core group, each core group includes one or more processing cores, and the tasks of each task area are mapped to the processing cores of its corresponding core group.
The task processing apparatus 800 of the embodiments of the present disclosure can implement the task processing methods described above.
It should be understood that, when the task processing method further includes other steps, the task processing apparatus 800 may also include other modules for implementing the corresponding steps.
In a third aspect, referring to FIG. 9, an embodiment of the present disclosure provides a many-core system 900, including:
a plurality of processing cores 901; and
a network-on-chip 902, configured to exchange data among the plurality of processing cores 901 and to exchange external data;
where one or more instructions are stored in one or more of the processing cores 901, and the one or more instructions are executed by the one or more processing cores 901, so that the one or more processing cores 901 can perform any one of the task processing methods described above.
The many-core system 900 of the embodiments of the present disclosure can implement the task processing methods described above, including actually processing the computation graph and/or the tasks in the computation graph, so as to obtain a processing result of the problem to be processed.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor/processing core, implements the task processing method described above. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.
In a fifth aspect, an embodiment of the present disclosure further provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, where, when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device performs the task processing method described above.
Those of ordinary skill in the art will appreciate that all or some of the steps in the methods disclosed above, and the functional modules/units in the systems and apparatuses, may be implemented as software, firmware, hardware, or appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only, and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless otherwise expressly indicated, features, characteristics, and/or elements described in connection with a particular embodiment may be used alone, or in combination with features, characteristics, and/or elements described in connection with other embodiments. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Claims (18)

  1. A task processing method, comprising:
    obtaining a computation graph of a problem to be processed, wherein the computation graph comprises a plurality of layers, each layer comprises a plurality of tasks, a task in any layer is not executed based on the results of tasks in its own layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in preceding layers;
    dividing the computation graph into a plurality of task groups, wherein each task group comprises the tasks of at least one layer; and
    determining a mapping relationship between the tasks of each task group and processing cores of a many-core system, wherein, in the mapping relationship, each task group corresponds to one core group, and each core group comprises at least one processing core.
  2. The task processing method according to claim 1, wherein the task groups comprise task blocks, and each layer of the computation graph is divided into a plurality of task blocks; in the mapping relationship between the task blocks and the processing cores of the many-core system, each task block is mapped to one processing core, a plurality of task blocks are mapped to each processing core, and all the task blocks of any one layer are mapped to at least two different processing cores;
    the dividing the computation graph into a plurality of task groups comprises: dividing each layer of the computation graph into a plurality of task blocks; and
    the determining a mapping relationship between the tasks of each task group and processing cores of a many-core system comprises: determining a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system.
  3. The task processing method according to claim 2, further comprising at least one of the following steps:
    before the dividing each layer of the computation graph into a plurality of task blocks, training the computation graph to improve the redundancy performance of the computation graph;
    before the determining a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system, invalidating some of the task blocks to train the computation graph and improve the redundancy performance of the computation graph; and
    after the determining a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system, invalidating all the task blocks mapped to some of the processing cores to train the computation graph and improve the redundancy performance of the computation graph.
  4. The task processing method according to claim 2, wherein the dividing each layer of the computation graph into a plurality of task blocks comprises:
    expanding the computation graph, and dividing each layer of the expanded computation graph into a plurality of task blocks, wherein the expanding comprises adding redundant tasks to at least some layers of the computation graph.
  5. The task processing method according to claim 2, wherein, in the first mapping relationship, any two task blocks of any one layer are mapped to two different processing cores.
  6. The task processing method according to any one of claims 1 to 5, wherein the computation graph is a trainable computation graph, the trainable computation graph being capable of solving the same problem to be processed even when at least some of its tasks differ.
  7. The task processing method according to claim 1, wherein the task groups comprise task areas, and the computation graph is divided into a plurality of task areas; each task area comprises a plurality of tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, the internal routing strength of each task area being determined according to the proportion of the routes between tasks in the task area to all the routes involving tasks in the task area;
    a second mapping relationship between the tasks of each task area and the processing cores of the many-core system is determined; in the second mapping relationship, each task area corresponds to one core group, each core group comprises one or more processing cores, and the tasks of each task area are mapped to the processing cores of its corresponding core group;
    the dividing the computation graph into a plurality of task groups comprises: dividing the computation graph into a plurality of task areas; and
    the determining a mapping relationship between the tasks of each task group and processing cores of a many-core system comprises: determining the second mapping relationship between the tasks of each task area and the processing cores of the many-core system.
  8. The task processing method according to claim 7, wherein at least some of the task areas comprise tasks from every layer of the computation graph.
  9. The task processing method according to claim 7, wherein the internal routing strength comprises an internal route volume ratio and/or an internal route count ratio; the internal route volume ratio of each task area is the proportion of the amount of data transmitted over the routes between tasks in the task area to the amount of data transmitted over all the routes involving tasks in the task area; the internal route count ratio of each task area is the proportion of the number of routes between tasks in the task area to the number of all the routes involving tasks in the task area; and
    the preset standard comprises: the internal route volume ratio being greater than a first threshold, and/or the internal route count ratio being greater than a second threshold.
  10. The task processing method according to claim 7, wherein at least some of the core groups comprise only one processing core, and/or at least some of the core groups comprise a plurality of processing cores; in the second mapping relationship, each task area corresponding to a core group comprising a plurality of processing cores is divided into a plurality of second task blocks, the number of second task blocks in each task area is equal to the number of processing cores comprised in the core group corresponding to that task area, and the second task blocks of each task area are respectively mapped to the processing cores of the core group corresponding to that task area.
  11. The task processing method according to claim 7, wherein at least some of the processing cores belong to a plurality of core groups at the same time.
  12. The task processing method according to claim 7, further comprising at least one of the following steps:
    before the dividing the computation graph into a plurality of task areas, training the computation graph to improve the redundancy performance of the computation graph;
    before the determining the second mapping relationship between the tasks of each task area and the processing cores of the many-core system, invalidating some of the task areas to train the task areas and improve the redundancy performance of the computation graph; and
    after the determining the second mapping relationship between the tasks of each task area and the processing cores of the many-core system, invalidating all the tasks mapped to some of the processing cores to train the task areas and improve the redundancy performance of the computation graph.
  13. The task processing method according to claim 1, wherein the computation graph is a neural network.
  14. The task processing method according to any one of claims 2 to 5, further comprising, after the determining a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system:
    mapping the task blocks to the plurality of processing cores according to the first mapping relationship; and
    each processing core processing the tasks in the task blocks mapped to it.
  15. A task processing apparatus, comprising:
    an obtaining module, configured to obtain a computation graph of a problem to be processed, wherein the computation graph comprises a plurality of layers, each layer comprises a plurality of tasks, a task in any layer is not executed based on the results of tasks in its own layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in preceding layers;
    a block-dividing module, configured to divide the computation graph into a plurality of task groups, wherein each task group comprises the tasks of at least one layer; and
    a mapping module, configured to determine a mapping relationship between the tasks of each task group and processing cores of a many-core system, wherein, in the mapping relationship, each task group corresponds to one core group, and each core group comprises at least one processing core.
  16. A many-core system, comprising:
    a plurality of processing cores; and
    a network-on-chip, configured to exchange data among the plurality of processing cores and to exchange external data, wherein one or more instructions are stored in one or more of the processing cores, and the one or more instructions are executed by the one or more processing cores, so that the one or more processing cores can perform the task processing method according to any one of claims 1 to 14.
  17. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processing core, implements the task processing method according to any one of claims 1 to 14.
  18. A computer program product, comprising a computer program which, when executed by a processor, implements the task processing method according to any one of claims 1 to 14.
PCT/CN2022/074490 2021-02-10 2022-01-28 Task processing method and apparatus, many-core system, and computer-readable medium WO2022171002A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110184918.6 2021-02-10
CN202110184918.6A CN112835718A (en) 2021-02-10 2021-02-10 Method and device for processing task, many-core system and computer readable medium
CN202110184939.8 2021-02-10
CN202110184939.8A CN112835719B (en) 2021-02-10 2021-02-10 Method and device for task processing, many-core system and computer readable medium

Publications (1)

Publication Number Publication Date
WO2022171002A1 true WO2022171002A1 (en) 2022-08-18

Family

ID=82838259

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074490 WO2022171002A1 (en) 2021-02-10 2022-01-28 Task processing method and apparatus, many-core system, and computer-readable medium

Country Status (1)

Country Link
WO (1) WO2022171002A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929725A (en) * 2012-11-12 2013-02-13 中国人民解放军海军工程大学 Dynamic reconfiguration method of signal processing parallel computing software
CN110097611A (en) * 2019-04-28 2019-08-06 上海联影智能医疗科技有限公司 Image rebuilding method, device, equipment and storage medium
US20200042856A1 (en) * 2018-07-31 2020-02-06 International Business Machines Corporation Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit
US20200160182A1 (en) * 2018-05-31 2020-05-21 Neuralmagic Inc. System and method of executing neural networks
CN111723900A (en) * 2019-03-18 2020-09-29 北京灵汐科技有限公司 Mapping method of neural network based on many-core processor and computing device
CN111950699A (en) * 2020-07-03 2020-11-17 清华大学深圳国际研究生院 Neural network regularization method based on characteristic space correlation
CN112084866A (en) * 2020-08-07 2020-12-15 浙江工业大学 Target detection method based on improved YOLO v4 algorithm
CN112348828A (en) * 2020-10-27 2021-02-09 浙江大华技术股份有限公司 Example segmentation method and device based on neural network and storage medium
CN112835719A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium
CN112835718A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22752163

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22752163

Country of ref document: EP

Kind code of ref document: A1