WO2022171002A1 - Task processing method and apparatus, many-core system, and computer-readable medium - Google Patents


Info

Publication number
WO2022171002A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
tasks
core
processing
graph
Application number
PCT/CN2022/074490
Other languages
French (fr)
Chinese (zh)
Inventor
施路平
张伟豪
林俊峰
祝夭龙
Original Assignee
北京灵汐科技有限公司
Priority claimed from CN202110184918.6A external-priority patent/CN112835718A/en
Priority claimed from CN202110184939.8A external-priority patent/CN112835719B/en
Application filed by 北京灵汐科技有限公司 filed Critical 北京灵汐科技有限公司
Publication of WO2022171002A1 publication Critical patent/WO2022171002A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • the present disclosure relates to the field of many-core technologies, and in particular, to a method and apparatus for task processing, a many-core system, and a computer-readable medium.
  • a many-core system includes multiple processing cores (also called cores, kernels, or processing engines), and the processing cores can exchange information with each other through routing.
  • a many-core system solves a problem to be processed through an electronic computing process, which is essentially a process in which multiple tasks corresponding to the problem to be solved are mapped (or allocated) to different processing cores, and each processing core processes its tasks separately.
  • some processing cores in a many-core system will inevitably enter an invalid state (for example, due to failure); it is therefore important that the system can still obtain a usable processing result, to a certain degree, when some of its processing cores are invalid.
  • Embodiments of the present disclosure provide a task processing method and apparatus, a many-core system, and a computer-readable medium.
  • an embodiment of the present disclosure provides a task processing method, including:
  • the computational graph includes multiple layers, each layer includes multiple tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and in at least some layers, at least some tasks are executed based on the results of tasks in the previous layer;
  • each task group includes tasks of at least one layer
  • a mapping relationship between the tasks of each task group and each processing core of the many-core system is determined; in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
  • an apparatus for task processing including:
  • an acquisition module configured to acquire a calculation graph of the problem to be processed;
  • the calculation graph includes multiple layers, each layer includes multiple tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layer;
  • a block module configured to divide the computational graph into a plurality of task groups; each task group includes tasks of at least one layer;
  • the mapping module is configured to determine the mapping relationship between the tasks of each task group and each processing core of the many-core system; in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
  • an embodiment of the present disclosure provides a many-core system, including:
  • an on-chip network configured to exchange data among the plurality of processing cores and with external devices;
  • one or more of the processing cores store one or more instructions, and the one or more instructions are executed by the one or more processing cores, so that the one or more processing cores can execute any one of the above task processing methods.
  • an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, wherein the computer program implements any one of the above task processing methods when executed by a processing core.
  • an embodiment of the present disclosure provides a computer program product, including a computer program, and when the computer program is executed by a processor, any one of the foregoing task processing methods is implemented.
  • in the embodiments of the present disclosure, the tasks in the calculation graph are divided into multiple task groups, each task group includes the tasks of at least one layer, the tasks of the same layer can be mapped to at least two different processing cores for processing, and tasks of different layers can be mapped to the same core group. Thus, when any processing core is invalid (for example, due to failure), at most part of a layer of the computational graph is "broken", and all tasks of a layer will never be invalid at once, so the computational graph as a whole can still obtain a usable processing result to a certain degree, which greatly improves the robustness of the computational graph.
  • FIG. 1 is a flowchart of a method for task processing provided by an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for task processing provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of another task processing method provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a process of processing a computation graph in an embodiment of the present disclosure
  • FIG. 5 is a flowchart of a method for task processing after step S205 in an embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram of a partition of a computation graph and a routing relationship of tasks therein in an embodiment of the present disclosure
  • FIG. 7 is a schematic diagram of a process of processing a computation graph in an embodiment of the present disclosure.
  • FIG. 8 is a block diagram of a composition of an apparatus for task processing provided by an embodiment of the present disclosure.
  • FIG. 9 is a block diagram of the composition of a many-core system according to an embodiment of the present disclosure.
  • computational graph or task graph, logic graph
  • tasks or nodes
  • each task includes a certain operation, and there is a certain order among different tasks. For example, if the operation of a task uses the operation results of other tasks, the task is said to be executed based on the results of those tasks; equivalently, the task is a subsequent task of those tasks, and those tasks are previous tasks of this task.
  • the computational graph can be divided into multiple "layers", each layer includes multiple tasks, the tasks in any layer are not executed based on tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layers. That is, if a task in a previous layer is not completed, a task in a later layer may not be executable, because the operation of the later task may use the operation result of the earlier task; conversely, if a task in a later layer cannot be executed, the tasks in previous layers are not affected, because their operations never use the results of later tasks. Tasks in the same layer have no dependency (subordination) relationship with each other, because if such a relationship existed, the corresponding tasks should belong to two different layers.
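  • The layer structure described above can be recovered from the task dependencies themselves. The following sketch is an illustration only (the representation of the graph as a task-to-predecessors mapping is an assumption, not part of the disclosure); it assigns each task a layer one greater than the deepest layer among its previous tasks:

```python
from collections import defaultdict

def assign_layers(predecessors):
    """Assign each task to a layer: tasks with no previous tasks form
    layer 0, and any other task sits one layer below (after) the deepest
    of its previous tasks."""
    layer = {}

    def depth(task):
        if task not in layer:
            preds = predecessors.get(task, ())
            layer[task] = 0 if not preds else 1 + max(depth(p) for p in preds)
        return layer[task]

    for t in predecessors:
        depth(t)
    layers = defaultdict(list)
    for t, l in layer.items():
        layers[l].append(t)
    return dict(layers)

# Hypothetical example: C and D use the results of A and B; E uses C and D.
g = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"A", "B"}, "E": {"C", "D"}}
```

By construction, no task's operation depends on a task in the same or a later layer, which matches the layer property stated above.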
  • a "neural network (NN)" is a form of computational graph.
  • the neural network is divided into multiple layers, each layer includes multiple nodes, a certain operation needs to be performed in each node, and the nodes of different layers are connected by certain relationships (for example, the output of one node is used as the input of nodes in the next layer);
  • each layer of the neural network can be regarded as a layer of the computational graph, and each node of the neural network can be regarded as a task of the computational graph.
  • the neural network in the embodiment of the present disclosure can be used for image processing, speech recognition, etc., which can specifically be in the form of a convolutional neural network (CNN), a spiking neural network (SNN), a recurrent neural network (RNN), and the like.
  • CNN convolutional neural network
  • SNN spiking neural network
  • RNN recurrent neural network
  • some problems may correspond to multiple different computational graphs. That is, the number of tasks in the calculation graph, the layers where the tasks are located, the relationships between tasks, and the specific operation of each task can all differ, yet these different calculation graphs can each solve the problem (though not necessarily equally well).
  • "trainable computational graph": for a computational graph that can solve a problem, the tasks in it can be adjusted through training, so that the trained computational graph solves the problem with a different effect.
  • a neural network is a form of trainable computational graph.
  • a neural network dealing with a problem such as image classification
  • a problem such as image classification
  • a neural network dealing with a problem is usually trained by adjusting its nodes (such as adjusting the weights of nodes) according to the effect of the current neural network on the problem (such as the accuracy of image classification), thereby changing the neural network (computational graph) and improving its performance on the problem (such as improving the accuracy of image classification).
  • in some related technologies, the tasks of each layer of the corresponding computation graph can be mapped (assigned) to one processing core, and the tasks of different layers can be mapped to different processing cores.
  • an embodiment of the present disclosure provides a task processing method.
  • the method is based on a many-core system and includes how to map the tasks of a computational graph to the processing cores of the many-core system.
  • the task processing method of the embodiment of the present disclosure includes:
  • S101 Acquire a computation graph of the problem to be processed; wherein the computation graph includes multiple layers, each layer includes multiple tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layer.
  • S102 Divide the computation graph into multiple task groups; each task group includes tasks of at least one layer.
  • S103 Determine the mapping relationship between the tasks of each task group and each processing core of the many-core system; in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
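  • Steps S102 and S103 above can be sketched as follows. This is a minimal illustration only: the policy of grouping consecutive layers and spreading tasks round-robin over a core group is an assumption for the example, not the disclosed method itself.

```python
def group_layers(layers, layers_per_group):
    """S102: divide the computation graph into task groups, each task
    group containing the tasks of at least one layer."""
    groups = []
    for i in range(0, len(layers), layers_per_group):
        # flatten the tasks of the layers assigned to this group
        groups.append([t for layer in layers[i:i + layers_per_group] for t in layer])
    return groups

def map_groups_to_core_groups(groups, core_groups):
    """S103: each task group corresponds to one core group; tasks are
    spread round-robin over the cores of that group, so one core group
    may hold tasks of several layers."""
    mapping = {}
    for group, cores in zip(groups, core_groups):
        for i, task in enumerate(group):
            mapping[task] = cores[i % len(cores)]
    return mapping
```

With this spread, tasks of one layer land on different cores of the group, and tasks of different layers share a core group, as described below.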
  • a problem to be processed such as image processing, speech recognition, etc.
  • its corresponding computational graph is obtained.
  • a pre-set calculation graph can be obtained; the calculation graph can also be generated from the specific problem to be processed according to a predetermined rule.
  • in some embodiments, the task group includes task blocks, and the task blocks are obtained by intra-layer partitioning of the computation graph; that is, each layer in the computation graph is divided into multiple task blocks, and each task block includes at least one task.
  • in some embodiments, the task group includes a task area, and the task area is obtained by inter-layer partitioning of the computation graph; that is, the tasks in the computation graph are divided into multiple task areas, and each task area includes tasks of at least two layers.
  • in the embodiments of the present disclosure, the tasks in the calculation graph are divided into multiple task groups, each task group includes the tasks of at least one layer, the tasks of the same layer can be mapped to at least two different processing cores for processing, and tasks of different layers can be mapped to the same core group. Thus, when any processing core is invalid (for example, due to failure), at most part of a layer of the computational graph is "broken", and all tasks of a layer will never be invalid at once, so the computational graph as a whole can still obtain a usable processing result to a certain degree, which greatly improves the robustness of the computational graph.
  • FIG. 2 is a flowchart of a task processing method provided by an embodiment of the present disclosure.
  • the task processing method according to the embodiment of the present disclosure includes:
  • S201 Acquire a computation graph of the problem to be processed; wherein the computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layer.
  • a problem to be processed such as image processing, speech recognition, etc.
  • its corresponding computational graph is obtained.
  • a pre-set calculation graph can be obtained; the calculation graph can also be generated from the specific problem to be processed according to a predetermined rule.
  • S203 Divide each layer of the computation graph into multiple task blocks; each task block includes at least one task.
  • That is, each layer of the computation graph is divided into multiple "groups", i.e., multiple "task blocks", and each task block includes one or more tasks of that layer.
  • the number of task blocks into which each layer is divided may be preset; for example, it may be preset that each layer is divided into a task blocks (a being a preset number).
  • blocking can also be performed according to a determined method, and the actual number of task blocks obtained shall prevail.
  • S205 Determine a first mapping relationship between each task block and multiple processing cores of the many-core system.
  • each task block is mapped to one processing core, each processing core may be mapped with multiple task blocks, and all the task blocks of any layer are mapped to at least two different processing cores.
  • that is, the mapping is performed in a "scattered" manner: all task blocks in the same layer should be mapped to different processing cores as separately as possible, at least ensuring that they are not all mapped to the same processing core. Since each task block only includes tasks of one layer, and the task blocks of the same layer are mapped to different processing cores, the above first mapping relationship ensures that the tasks of the same layer are mapped to at least two different processing cores.
  • in the embodiments of the present disclosure, the tasks of the same layer of the computation graph are mapped to at least two different processing cores for processing, so that when any processing core is invalid (for example, due to a fault), at most part of a layer of the computation graph is "broken", and all tasks of a layer will never be invalid at once; the computation graph as a whole can therefore still obtain a usable processing result to a certain degree, which greatly improves the robustness of the computation graph.
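  • One simple way to obtain such a "scattered" first mapping is to offset the core index by the layer index. This scheme is an assumption for illustration; any assignment that keeps the blocks of one layer on at least two cores satisfies the relationship described above.

```python
def scatter_map(blocks_per_layer, num_cores):
    """First mapping relationship (S205): every task block is mapped to
    exactly one core, and the task blocks of any single layer land on at
    least two different cores (given num_cores >= 2 and at least two
    blocks per layer)."""
    if num_cores < 2:
        raise ValueError("need at least two processing cores")
    mapping = {}
    for layer_idx, blocks in enumerate(blocks_per_layer):
        for block_idx in range(len(blocks)):
            # offsetting by the layer index interleaves consecutive layers
            mapping[(layer_idx, block_idx)] = (layer_idx + block_idx) % num_cores
    return mapping
```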
  • the computational graph is a trainable computational graph.
  • a trainable computational graph can still solve the same problem to be processed when at least some of its tasks are changed.
  • the computational graph is a neural network (NN).
  • in some embodiments, the above computation graph is a trainable computation graph, and further is a neural network, such as a convolutional neural network (CNN), a spiking neural network (SNN), or a recurrent neural network (RNN), so that the computational graph can be trained to further improve its redundancy performance.
  • CNN convolutional neural network
  • SNN spiking neural network
  • RNN recurrent neural network
  • the computation graph in the embodiments of the present disclosure is not limited to a trainable computation graph (e.g., a neural network); other computation graphs can also be used.
  • any two task blocks of any layer are respectively mapped to two different processing cores.
  • that is, a maximum scattered (or "mutually exclusive") mapping can be performed, which ensures that no two task blocks from the same layer are mapped to the same processing core.
  • the method further includes:
  • that is, the computational graph can also be trained (S202) before it is divided into blocks, to improve its redundancy performance.
  • training the computation graph (S202) includes at least one of the following:
  • for example, the Dropout method can be used to invalidate some tasks in the calculation graph (for example, setting the weights of some nodes of the neural network to 0) and adjust the other tasks, so that the calculation graph still produces a somewhat usable result when those tasks are invalid, which improves its robustness.
  • for example, the Dropblock method can be used to invalidate all tasks located in a region of the calculation graph (a region includes multiple tasks, and may cover part of one layer or corresponding parts of multiple layers) and adjust the other tasks, so that the computational graph still produces a somewhat usable result when the tasks in that region are invalid, improving its robustness.
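  • The two perturbations above can be sketched as follows. This is illustrative only: real Dropout/Dropblock training would re-adjust the surviving weights after each perturbation, which is omitted here, and the list-of-weights representation of tasks is an assumption.

```python
import random

def dropout_tasks(weights, p, rng=None):
    """Dropout-style step: invalidate each task independently with
    probability p by zeroing its weight."""
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < p else w for w in weights]

def dropblock_tasks(weights, start, length):
    """Dropblock-style step: invalidate a whole contiguous region of
    tasks at once."""
    out = list(weights)
    for i in range(start, min(start + length, len(out))):
        out[i] = 0.0
    return out
```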
  • the "adversarial sample defense" method can be used to train the computational graph.
  • all computation graphs to be trained must be trainable computation graphs (e.g., neural networks).
  • each type of training may be performed only once, or may be performed multiple times in a loop.
  • the specific methods used for each training session may be the same or different.
  • the training can be ended when a preset end criterion is reached, and the end criterion may include the convergence of the computation graph, reaching a predetermined number of training times, reaching a predetermined redundancy performance, and the like.
  • the training method of the computational graph is not limited in the embodiment of the present disclosure, and any method capable of realizing computational graph training can be used for training the computational graph.
  • in some embodiments, dividing each layer of the computational graph into multiple task blocks includes: expanding the computational graph, and dividing each layer of the expanded computational graph into multiple task blocks.
  • the expansion includes adding redundant tasks to at least some layers of the computation graph.
  • that is, the expanded computation graph is divided into blocks, so that at least some of the resulting task blocks include the above "redundant tasks", thereby improving redundancy performance.
  • the specific "expansion amount" of each layer may be preset.
  • for example, it may be preset that the computation amount of the tasks expanded into each layer is b times the computation amount of the original tasks of that layer; generally speaking, b can be greater than 0 and less than or equal to 1 (b greater than 1 is also feasible; for example, there can be "multiple copies" of task backups).
  • each layer may be expanded according to a certain method, and the actual task amount obtained by the expansion according to this method shall prevail.
  • redundant tasks include at least one of the following:
  • Backup tasks: each expanded task may be a copy of an original task in its layer, so that there are actually "multiple copies" of the corresponding task, which can serve as backups for each other. That is, when a task is not completed (e.g., its processing core fails), subsequent tasks can be executed using the operation result of its backup, improving robustness.
  • in this case, a task and its backups should usually be divided into different task blocks, and these task blocks should be mapped to different processing cores.
  • Invalid tasks: as one way of the embodiments of the present disclosure, the expansion can add tasks that need to be computed but whose results are not required by the operations of the original calculation graph, that is, invalid tasks. The invalid tasks may be generated randomly, or may be generated by other specific techniques.
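  • As an illustration of expansion by backup tasks (the fraction-b duplication policy below is an assumption for the example; the disclosure only requires that redundant tasks be added to at least some layers):

```python
def expand_layer(tasks, b):
    """Expand one layer with redundant backup tasks whose total amount
    is roughly b times that of the original tasks (0 < b <= 1 backs up
    a fraction of the layer; b > 1 would create multiple full backups)."""
    n_extra = int(round(b * len(tasks)))
    # cycle through the original tasks until n_extra backups are made
    backups = [(tasks[i % len(tasks)], "backup") for i in range(n_extra)]
    return list(tasks) + backups
```

A task and its backup would then be placed in different task blocks, as noted above.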
  • the expansion modes of different layers may be the same or different.
  • dividing each layer of the computation graph into a plurality of task blocks (S203) includes any one of the following:
  • for example, the tasks in each layer may be randomly "blocked"; that is, the number of tasks in each task block, and which tasks they are, are random.
  • for example, each layer may be evenly divided into multiple task blocks; that is, the numbers of tasks in the task blocks of the same layer are equal or substantially equal (for example, among all task blocks of the same layer, if the number of tasks in the block with the fewest tasks is taken as 100%, the number of tasks in the block with the most tasks does not exceed 110%).
  • for example, the subsequent "mapping" can be considered when dividing blocks; that is, each layer is first divided into multiple "pre-task blocks", and then, if multiple pre-task blocks will be mapped to the same processing core, they can be directly merged into one task block.
  • for example, when dividing each layer of the computation graph into multiple task blocks, the actual hardware resources of the processing cores, as well as the subsequent "mapping", may also be considered, so that which tasks are grouped into each task block is determined according to the hardware resources of the processing core to which that task block will be mapped.
  • the hardware resources of different processing cores may be the same or different.
  • block mode of different layers may be the same or different.
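  • The "even" blocking option above can be sketched as follows (an illustration only; block sizes differ by at most one task, which satisfies the "substantially equal" criterion):

```python
def split_evenly(tasks, num_blocks):
    """Divide one layer's tasks into num_blocks task blocks whose sizes
    differ by at most one task."""
    base, extra = divmod(len(tasks), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more
        blocks.append(tasks[start:start + size])
        start += size
    return blocks
```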
  • any two task blocks of any one layer are mapped to two different processing cores according to the mapping relationship.
  • in some embodiments, after dividing each layer of the computation graph into multiple task blocks (S203) and before determining the first mapping relationship between each task block and the multiple processing cores of the many-core system (S205), the method further includes:
  • that is, one or more (but not all) task blocks can be invalidated (all tasks in them are made invalid), and the tasks in the other task blocks are adjusted, so that the remaining task blocks can still produce a somewhat usable result (i.e., the task blocks are trained), improving the robustness of the computational graph.
  • invalidating partial task blocks to train each task block (S204) includes at least one of the following:
  • for example, some task blocks may be randomly invalidated to train the remaining task blocks in the computation graph.
  • for example, "key tasks" that play a key role may be determined according to the structural features of the computational graph, and the key task blocks where the key tasks are located are invalidated to train the task blocks.
  • the method further includes:
  • S206 Invalidate all task blocks mapped to some of the processing cores, so as to train each task block and improve the redundancy performance of the calculation graph.
  • that is, all tasks mapped to some of the processing cores can be invalidated (equivalent to invalidating those processing cores), and the other tasks adjusted, so that the remaining tasks can still produce a somewhat usable result (i.e., the task blocks are trained), improving robustness.
  • in some embodiments, invalidating all task blocks mapped to some of the processing cores to train each task block (S206) includes at least one of the following:
  • for example, all task blocks mapped to some randomly selected processing cores may be invalidated to train each task block.
  • for example, all tasks mapped to one or more (but not all) processing cores may be invalidated (i.e., one or more processing cores are invalidated) to train each task block.
  • for example, the processing cores may be invalidated in sequence; that is, only one processing core is invalidated at a time, but eventually every processing core has been invalidated once.
  • for example, the key tasks may be determined according to the structure of the calculation graph, the key task blocks where the key tasks are located are then determined, and the processing cores corresponding to the key task blocks are invalidated to train each task block.
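  • The "invalidate each processing core in sequence" option might be sketched as a fault-injection loop. The `evaluate` and `adjust` callbacks below are placeholders assumed for illustration (standing in for measuring result quality and for the actual training step):

```python
def train_with_core_faults(core_to_blocks, evaluate, adjust):
    """Invalidate one processing core at a time (S206): all task blocks
    mapped to that core are treated as failed, the degraded computation
    graph is evaluated, and `adjust` trains the surviving blocks."""
    scores = {}
    for core in core_to_blocks:
        alive = {c: b for c, b in core_to_blocks.items() if c != core}
        scores[core] = evaluate(alive)        # result quality with this core dead
        adjust(alive, core_to_blocks[core])   # train the remaining blocks
    return scores
```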
  • the method further includes:
  • S207 Each processing core processes the tasks in the task blocks mapped to it.
  • that is, each task block can be actually mapped (or allocated) to its processing core according to the first mapping relationship and processed by that processing core, so as to realize the actual function of the calculation graph and solve the above problem to be processed.
  • in some embodiments, step S206 may be performed before step S207; that is, the tasks are not yet actually mapped to the processing cores, and instead the tasks corresponding to a processing core are invalidated according to the first mapping relationship, and the training is performed.
  • of course, step S206 can also be performed after step S207; that is, the tasks are actually mapped to the processing cores, and processing cores are actually invalidated for training.
  • all tasks in the same layer of the computation graph are mapped to at least two different processing cores for processing, which can improve the robustness of the computation graph.
  • however, in the above method, tasks of different layers must be located in different processing cores; therefore, for every subsequent task executed based on the results of previous tasks, the corresponding operation results must be transmitted across cores, from the processing core where the previous task is located to the processing core where the subsequent task is located.
  • as a result, there must be a large number of inter-core routes between the processing cores, resulting in a complex inter-core routing structure and a large amount of data transmitted by inter-core routing, which leads to performance degradation of the many-core system.
  • an embodiment of the present disclosure provides a task processing method.
  • the method divides tasks of different layers into task groups according to inter-layer partitioning, so as to reduce the routing of the computation graph.
  • Methods of task processing include:
  • S501 Acquire a computation graph of the problem to be processed; wherein the computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, the tasks in any layer are not executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in the previous layer.
  • the calculation graph can also be obtained by other means; for example, a preset calculation graph may be obtained, or the calculation graph may be generated from the specific problem to be processed according to a predetermined rule.
  • wherein each task area includes multiple tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, and the internal routing strength is determined according to the proportion of the routes between tasks inside the task area among all the routes involving the tasks of the task area.
  • that is, the computation graph is divided into multiple task areas; each task of the computation graph is assigned to a corresponding task area.
  • for example, the calculation graph is divided into three task areas (task area 0 to task area 2); the number of tasks in each layer is represented by the vertical size of the filled area corresponding to the layer, and blank boxes represent the different processing cores (processing core 0 to processing core 2) and core groups (core group 0 to core group 2).
  • that is, the tasks of each task area cannot all come from one layer of the computation graph, but must come from at least two different layers; at the same time, the tasks of one layer cannot all be located in one task area, but must be divided into at least two different task areas.
  • Each task area is equivalent to taking some tasks from multiple layers of the computational graph.
  • "Routes": there are many "routes (or data paths)" connecting tasks in different layers of the computation graph; the operation result of a previous task needs to be transmitted to a subsequent task through a route, so that the subsequent task can use that result.
  • each task is typically connected to one or more routes (including incoming routes and outgoing routes), and these routes are the routes "involved" by the task. Among "all the routes" involved by all tasks of a task area (represented by dashed boxes in FIG. 6 and FIG. 7), a part of the routes have both ends connected to tasks inside the task area.
  • therefore, the "internal routing strength" of a task area can be calculated, and it reflects the relative degree of "internal interaction" among the tasks of the task area.
  • in the embodiments of the present disclosure, each task area must be a "strong internal routing area"; that is, the tasks in each task area mainly perform "internal interactions" and have fewer "external interactions" with tasks of other task areas.
  • the computation graph is set in three layers (layer 0 to layer 2), and the computation graph is divided into three task areas.
  • this does not indicate a limitation on the number of layers and the number of tasks in the embodiment of the present disclosure, and the embodiment of the present disclosure does not limit the number of layers and the number of tasks in the calculation graph.
  • S505 Determine the second mapping relationship between the tasks of each task area and each processing core of the many-core system.
  • each task area corresponds to a core group
  • each core group includes one or more processing cores
  • the tasks of each task area are mapped to the processing cores of the corresponding core group.
  • that is, a corresponding core group is determined for each task area (each core group includes one or more processing cores of the many-core system), and all tasks of each task area are mapped to the processing cores of its corresponding core group; in other words, each task in a task area is mapped to one processing core of the corresponding core group, and each processing core of a core group is mapped with at least one task from the task area corresponding to that core group.
  • in the embodiments of the present disclosure, the "strong internal routing areas" with relatively close internal connections in the calculation graph are each divided into one task area, so the tasks in each task area have fewer "external interactions"; each task area is mapped to one core group, so less "cross-group" data transmission is needed between different core groups. Thus, less inter-core (or inter-group) routing is required, which can simplify the inter-core (or inter-group) routing structure and reduce the amount of data transmitted by inter-core (or inter-group) routing, improving the performance (such as efficiency) of the many-core system.
  • each task area includes tasks from multiple different layers, and the tasks of each layer are distributed over different core groups; therefore, when one processing core becomes invalid (for example, due to a fault), each layer of the computation graph is generally only "partially broken", and it will not happen that all the tasks of a layer become invalid at once. The computation graph as a whole can thus still produce processing results that are usable to a certain degree, which improves the robustness of the computation graph.
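The robustness argument above can be seen in a toy example: with a cross-layer mapping, disabling one core removes only part of each layer, never a whole layer. The layers, task ids, and core ids below are invented for illustration.

```python
# An illustrative toy (numbers invented) of the robustness argument above: when
# each core hosts tasks from several layers, losing one core removes only part
# of each layer, never a whole layer.

layers = {0: ["a0", "a1"], 1: ["b0", "b1"], 2: ["c0", "c1"]}
# Cross-layer mapping: each core hosts one task from every layer.
core_of = {"a0": 0, "b0": 0, "c0": 0, "a1": 1, "b1": 1, "c1": 1}

failed_core = 0
surviving = {layer: [t for t in tasks if core_of[t] != failed_core]
             for layer, tasks in layers.items()}
print(all(surviving[layer] for layer in layers))  # True: every layer survives partially
```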
  • the computation graph is a trainable computation graph; even when at least some of its tasks are changed, it can still solve the same problem to be processed.
  • the computational graph can be a neural network or other computational graph.
  • examples of neural networks include Convolutional Neural Networks (CNNs), Spiking Neural Networks (SNNs), Recurrent Neural Networks (RNNs), and the like.
  • the redundancy performance can be improved by training the computation graph.
  • At least some of the task areas include tasks from every layer of the computation graph.
  • the tasks in a task area may come from all layers, that is, a task area may take at least one task from each layer. Further, every task area may include tasks from every layer.
  • the internal routing strength includes: the internal routing volume proportion, where the internal routing volume proportion of each task area is the ratio of the amount of data transmitted by routes between tasks within the task area to the amount of data transmitted by all the routes involving tasks in the task area; and/or the internal route number proportion, where the internal route number proportion of each task area is the ratio of the number of routes between tasks within the task area to the total number of routes involving tasks in the task area.
  • the "internal routing volume proportion" and/or the "internal route number proportion" may be used as specific indicators of the above internal routing strength.
  • all the routes of each task area include internal routes and external routes, and the amount of data transmitted by each route during the operation of the computation graph is fixed and can be known in advance; thus, the ratio of the amount of data transmitted by the internal routes of each task area to the amount of data transmitted by all its routes can be used as the internal routing volume proportion, that is, one of the specific indicators of internal routing strength.
  • likewise, all the routes of each task area include internal routes and external routes, and the "number" of each kind of route is determined and known; thus, the ratio of the number of internal routes of each task area to the number of all its routes can be used as the internal route number proportion, that is, another specific indicator of internal routing strength.
  • the preset criteria include: the internal routing volume proportion is greater than a first threshold; and/or the internal route number proportion is greater than a second threshold.
  • the preset criteria that a task area should meet may be that one of the internal routing volume proportion and the internal route number proportion is greater than its corresponding threshold, or that both are greater than their corresponding thresholds at the same time.
  • when both indicators are required to exceed their thresholds, each threshold may differ from the threshold that would be required if only one of the internal routing volume proportion and the internal route number proportion had to be greater than its corresponding threshold.
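The two indicators and the threshold check can be sketched as follows. This is a minimal sketch, not the patent's implementation: the route model `(src, dst, volume)`, the threshold values, and the choice of requiring both indicators (the text allows "and/or") are all assumptions.

```python
# A minimal sketch (not from the patent) of the two internal-routing-strength
# indicators defined above. A route is modeled as (src_task, dst_task, volume);
# the thresholds and the "both indicators" choice are assumptions.

def internal_strength(area, routes):
    """area: set of task ids; routes: list of (src, dst, data volume).
    Returns (internal routing volume proportion, internal route number
    proportion). Assumes the area is involved in at least one route."""
    involved = [r for r in routes if r[0] in area or r[1] in area]
    internal = [r for r in involved if r[0] in area and r[1] in area]
    vol_ratio = sum(r[2] for r in internal) / sum(r[2] for r in involved)
    cnt_ratio = len(internal) / len(involved)
    return vol_ratio, cnt_ratio

def meets_criteria(area, routes, vol_threshold=0.8, cnt_threshold=0.8):
    vol, cnt = internal_strength(area, routes)
    return vol > vol_threshold and cnt > cnt_threshold

routes = [("t0", "t1", 10), ("t1", "t2", 10), ("t2", "x", 2)]
vol, cnt = internal_strength({"t0", "t1", "t2"}, routes)
print(round(vol, 3), round(cnt, 3))  # 0.909 0.667
```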
  • At least some of the core groups include only one processing core.
  • there may be a core group (e.g., core group 2) that includes only one processing core (processing core 2), so the tasks of its corresponding task area (task area 1) must all be mapped to that processing core. Therefore, the data transmission between tasks of different layers within that task area ("internal interaction" for the task area) is entirely intra-core data transmission, which can minimize inter-core routing.
  • alternatively, all core groups may each consist of only one processing core.
  • At least some of the core groups include multiple processing cores; in the second mapping relationship, each task area corresponding to a core group that includes multiple processing cores is divided into multiple second task blocks, the number of second task blocks in each such task area is the same as the number of processing cores included in the corresponding core group, and the second task blocks of each task area are respectively mapped to the processing cores of the corresponding core group.
  • there may be core groups (e.g., core group 0 and core group 1) that include multiple processing cores (processing core 0 and processing core 1), so the corresponding task areas (task area 0 and task area 2) need to be "divided into blocks" first; each second task block obtained by the division includes multiple tasks, and each second task block is then mapped to one processing core of the corresponding core group (so the number of second task blocks and the number of processing cores of the core group must be the same).
  • each processing core may not be able to process all the tasks of an entire task area; therefore, hardware resources, computing load balancing, and other factors can be considered comprehensively, and multiple processing cores can form a core group to jointly process one task area.
  • tasks from the same layer can be divided into the same second task block as much as possible, so that tasks of the same layer are subsequently located in one processing core as much as possible.
  • in this way, inter-core routing is mainly established between processing cores corresponding to different layers, rather than forming a very complex "grid-like" routing structure.
  • all core groups include multiple processing cores.
  • in at least some of the core groups that include multiple processing cores, the distance between any two processing cores is less than a preset distance.
  • for a core group including multiple processing cores (e.g., core group 0 or core group 1), the distance between any two of its processing cores may be smaller than the preset distance; since transmission efficiency is related to the distance between processing cores, grouping processing cores that are "closer" together into one core group can reduce the time consumed by data transmission within the core group.
  • the distance between processing cores may be defined in various ways; for example, it may be the straight-line physical distance between the processing cores, the total length of the inter-core routes connecting them, or the number of other processing cores between them (or the number of routing hops), and so on, which will not be described in detail here.
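One of the distance notions just listed, the routing hop count, can be illustrated for cores laid out on a 2D mesh. The mesh layout, Manhattan (X-Y) routing, and the preset distance value are assumptions for illustration, not details from the patent.

```python
# A hedged example of one distance notion mentioned above: the routing hop count
# between cores on a 2D mesh, assuming Manhattan (X-Y) routing. The mesh layout
# and the preset distance are illustrative assumptions.

def hop_distance(core_a, core_b):
    """Cores are (x, y) mesh coordinates; returns the Manhattan hop count."""
    return abs(core_a[0] - core_b[0]) + abs(core_a[1] - core_b[1])

def within_preset_distance(core_group, max_hops):
    """True if every pair of cores in the group is within max_hops hops."""
    cores = list(core_group)
    return all(hop_distance(a, b) <= max_hops
               for i, a in enumerate(cores) for b in cores[i + 1:])

print(within_preset_distance([(0, 0), (0, 1), (1, 1)], max_hops=2))  # True
```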
  • the processing cores included in different core groups may "overlap" (for example, core group 0 and core group 1 both include processing core 0 and processing core 1), so that tasks of different task areas (including second task blocks) may be mapped to the same processing core, to make fuller use of hardware resources and better achieve computing load balancing.
  • the processing cores included in different core groups may be exactly the same (it may also be considered that multiple core groups are "merged"), or the processing cores included in different core groups may be "partially overlapped", which will not be described in detail here.
  • the method further includes: expanding at least part of the task areas.
  • the expansion includes adding redundant tasks in the task area.
  • each task area can also be "expanded", that is, some tasks that were not originally in the task area (redundant tasks) are "added" to it, and the expanded task area is then mapped to the corresponding core group (including mapping after being divided into blocks), so that the above redundant tasks are also mapped into processing cores to improve redundancy performance.
  • the specific "expansion amount" of each layer may be preset.
  • the computation amount of an expanded task may be the same as that of an original task; generally speaking, the expansion ratio b can be greater than 0 and less than or equal to 1 (a b greater than 1 is also feasible; for example, there can be "multiple copies" of a task as backups).
  • each layer may be expanded according to a certain method, and the actual task amount obtained by the expansion according to this method shall prevail.
  • redundant tasks include at least one of the following:
  • each expanded task can be a copy of an original task in its corresponding task area, so that there are actually "multiple copies" of the corresponding task; these copies can serve as "backups" for each other, improving robustness.
  • Invalid tasks. As one mode of the embodiments of the present disclosure, some tasks that perform operations whose results are not required by the original computation graph can be added as expansion, that is, invalid tasks. The invalid tasks may be generated randomly, or may be generated by other specific techniques.
  • the expansion modes of different task areas may be the same or different.
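The expansion by redundant copies described above can be sketched as follows. The function name, the rounding choice (`ceil`), and the "_copy" naming are illustrative assumptions; the ratio b follows the earlier note that 0 < b <= 1 duplicates a fraction of the tasks, while b > 1 would allow multiple backup copies.

```python
# An illustrative sketch of "expanding" a task area with redundant copies of its
# own tasks, using an assumed per-area expansion ratio b. Rounding and naming
# are assumptions, not from the patent.

import math

def expand_area(tasks, b):
    """Return the task list plus ceil(b * len(tasks)) redundant copies."""
    n_extra = math.ceil(b * len(tasks))
    redundant = [f"{t}_copy" for t in (tasks * math.ceil(b))[:n_extra]]
    return tasks + redundant

print(expand_area(["t0", "t1", "t2", "t3"], b=0.5))
# → ['t0', 't1', 't2', 't3', 't0_copy', 't1_copy']
```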
  • the method further includes:
  • the above training includes at least one of the following:
  • the Dropout method can be used to invalidate some tasks in the computation graph (for example, setting the weights of some nodes of the neural network to 0) and adjust the other tasks, so that the computation graph can still produce results that are usable to a certain degree when those tasks are invalid, thereby improving robustness.
  • the invalidated tasks can be located in a contiguous region (not necessarily a task area) of the computation graph, that is, DropBlock training can specifically be used.
  • the "adversarial sample defense" method can be used to train the computational graph.
  • the method further includes:
  • one or more (but not all) task areas can also be invalidated (that is, all the tasks in them are made invalid), and the tasks in the other task areas can be adjusted so that the remaining task areas still yield results that are usable to a certain degree (i.e., the task areas are trained), improving robustness.
  • the above training includes at least one of the following:
  • some task areas may be randomly invalidated to train the remaining task areas.
  • alternatively, "critical tasks" that play a key role may be determined according to the structural features of the computation graph, and the critical task areas where the critical tasks are located may be invalidated to train the remaining task areas.
  • the method further includes:
  • all the tasks mapped into some of the processing cores are invalidated, so as to train each task area and improve the redundancy performance of the computation graph.
  • all the tasks mapped into some of the processing cores can be invalidated (equivalent to invalidating those processing cores), and the other tasks can be adjusted so that the remaining tasks can still produce results that are usable to a certain degree (i.e., the task areas are trained), improving robustness.
  • the above training includes at least one of the following:
  • all the tasks mapped into randomly selected processing cores are invalidated to train each task area.
  • all the tasks mapped into one or more (but not all) processing cores may be invalidated (i.e., one or more processing cores are disabled) to train each task area.
  • each processing core may be invalidated in sequence, that is, only one processing core is invalidated at a time, until all the processing cores have been invalidated once.
  • alternatively, the key tasks may be determined according to the structure of the computation graph, and the processing cores where the key tasks are located may then be invalidated, so as to train each task area.
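The "one core invalid at a time" schedule described above can be sketched as a loop over cores, where each round trains on the tasks of all surviving cores. `train_step` is a hypothetical placeholder for whatever training the graph uses; task and core ids are invented.

```python
# A sketch of the sequential core-invalidation schedule described above: each
# round disables a single core's tasks, until every core has been disabled once.
# `train_step` is a hypothetical placeholder, not the patent's training routine.

def train_with_core_faults(cores, task_of_core, train_step):
    for failed in cores:           # only one core is invalid per round...
        active = [t for c in cores if c != failed for t in task_of_core[c]]
        train_step(active)         # ...but every core gets invalidated in turn

rounds = []
train_with_core_faults(
    cores=[0, 1, 2],
    task_of_core={0: ["t0"], 1: ["t1"], 2: ["t2"]},
    train_step=lambda active: rounds.append(sorted(active)),
)
print(rounds)  # [['t1', 't2'], ['t0', 't2'], ['t0', 't1']]
```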
  • S507 Map the tasks of each task area to each processing core of the many-core system according to the second mapping relationship.
  • each processing core processes the tasks mapped to it.
  • the tasks in each task area can also be mapped (or allocated) to the processing cores of the corresponding core group according to the second mapping relationship, and processed by each processing core, so as to realize the computation graph.
  • mapping can be performed directly.
  • step S506 may be performed before step S507, that is, the tasks are not yet actually mapped to the processing cores; instead, the tasks corresponding to a processing core according to the second mapping relationship are invalidated, and the training is performed.
  • step S506 can also be performed after step S507, that is, the task can be actually mapped to the processing core, and the processing core can be actually invalidated for training.
  • the computation graphs to be trained must be trainable computation graphs (e.g., neural networks). Each kind of training can be performed only once, or repeated multiple times. When multiple training sessions are performed, the specific methods used in each session may be the same or different, and the training can be ended when a preset end criterion is reached; the end criterion may include convergence of the computation graph, reaching a predetermined number of training iterations, reaching a predetermined redundancy performance, and the like.
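The multi-round training loop with the end criteria listed above can be sketched as follows. All functions here are illustrative placeholders (the patent does not specify them); the loop simply stops at convergence, at the redundancy target, or at a maximum round count.

```python
# A hedged sketch of the training loop with the end criteria listed above
# (convergence, predetermined round count, or target redundancy performance).
# All callables are illustrative placeholders.

def train_until_done(train_once, converged, redundancy_ok, max_rounds=100):
    for round_no in range(1, max_rounds + 1):
        train_once()
        if converged() or redundancy_ok():
            return round_no
    return max_rounds

n = train_until_done(
    train_once=lambda: None,
    converged=lambda: False,
    redundancy_ok=lambda: True,   # pretend the redundancy target is met at once
)
print(n)  # 1
```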
  • an apparatus 800 for processing tasks including:
  • the obtaining module 801 is configured to obtain a computation graph of the problem to be processed; the computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, the tasks in any layer are not performed based on the results of the tasks in that layer or subsequent layers, and at least some of the tasks in at least some of the layers are performed based on the results of the tasks in the previous layers;
  • a block module 802 configured to divide each layer of the computation graph into a plurality of task blocks; each task block includes at least one task;
  • the mapping module 803 is configured to determine a first mapping relationship between each task block and the plurality of processing cores of the many-core system; in the first mapping relationship, each task block is mapped to one processing core, and the task blocks of any layer are mapped to at least two different processing cores.
  • the apparatus 800 for task processing further includes:
  • the training module 804 is configured to train the computation graph based on the first mapping relationship to obtain a trained computation graph; the computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, the tasks in any layer are not performed based on the results of the tasks in that layer or subsequent layers, and at least some of the tasks in at least some of the layers are performed based on the results of the tasks in the previous layers;
  • the partition module 805 is configured to divide the computation graph into a plurality of task areas; each task area includes a plurality of tasks from at least two different layers, the tasks of each layer are located in at least two task areas, and the internal routing strength of each task area exceeds a preset standard, where the internal routing strength of each task area is determined according to the proportion of the routes between tasks within the task area among all the routes involving tasks in the task area;
  • the mapping module 803 is further configured to determine a second mapping relationship between the tasks of each task area and the processing cores of the many-core system; in the second mapping relationship, each task area corresponds to a core group, each core group includes one or more processing cores, and the tasks of each task area are mapped to the processing cores of the corresponding core group.
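The first-mapping-relationship constraint handled by the mapping module (each task block on exactly one core, and the blocks of any layer spread over at least two different cores) can be sketched as a check. All identifiers below are invented for illustration; the dict encoding already guarantees one core per block.

```python
# A sketch of the first-mapping-relationship constraint stated above: the dict
# block_to_core maps every task block to exactly one core, and the task blocks
# of any single layer must land on at least two different cores. Names are
# illustrative, not from the patent.

def check_first_mapping(layer_blocks, block_to_core):
    """layer_blocks: dict layer -> list of task block ids;
    block_to_core: dict task block id -> core id."""
    return all(len({block_to_core[b] for b in blocks}) >= 2
               for blocks in layer_blocks.values())

layers = {0: ["b00", "b01"], 1: ["b10", "b11"]}
mapping = {"b00": 0, "b01": 1, "b10": 0, "b11": 1}
print(check_first_mapping(layers, mapping))  # True
```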
  • the apparatus 800 for task processing in this embodiment of the present disclosure may implement the above-mentioned task processing method.
  • the task processing apparatus 800 may also include other modules for implementing corresponding steps.
  • an embodiment of the present disclosure provides a many-core system 900, including:
  • a plurality of processing cores 901; and
  • an on-chip network 902 configured to exchange data among the plurality of processing cores 901 and external data;
  • one or more instructions are stored in one or more of the processing cores 901, and the one or more instructions are executed by the one or more processing cores 901, so that the one or more processing cores 901 can execute any of the above task processing methods.
  • the many-core system 900 in the embodiment of the present disclosure can implement the above-mentioned task processing method, including performing actual computation graphs and/or tasks in the computation graphs to obtain processing results of the problems to be processed.
  • embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the above-mentioned task processing method when executed by a processor/processing core.
  • Computer-readable storage media can be volatile or non-volatile computer-readable storage media.
  • embodiments of the present disclosure further provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above task processing method.
  • Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media, as is well known to those of ordinary skill in the art.


Abstract

Provided in the present disclosure is a task processing method. The method comprises: acquiring a computational graph of a problem to be processed, wherein the computational graph comprises a plurality of layers that are arranged in sequence, each layer comprises a plurality of tasks, the tasks in any layer are not performed on the basis of results of the tasks in the current layer or the subsequent layer, and at least some of tasks in at least some of the layers are performed on the basis of the results of the tasks in the previous layer; dividing the computational graph into a plurality of task groups, wherein each task group comprises tasks of at least one layer; and determining a mapping relationship between the tasks of each task group and processing cores of a many-core system, wherein in the mapping relationship, each task group corresponds to a core group, and each core group comprises at least one processing core. By means of the method, the robustness of a computational graph can be improved. Further provided are a task processing apparatus, a many-core system, and a computer-readable medium.

Description

Method and apparatus for task processing, many-core system, and computer-readable medium

Technical Field

The present disclosure relates to the field of many-core technologies, and in particular, to a method and apparatus for task processing, a many-core system, and a computer-readable medium.

Background

A many-core system includes a plurality of processing cores (kernels or processing engines), and the processing cores can exchange information with one another through routing. The process by which a many-core system solves a problem to be processed through electronic computation is essentially a process in which the multiple tasks corresponding to the problem are mapped (or allocated) to different processing cores, and each processing core processes its tasks separately.

However, the processing cores in a many-core system will inevitably enter an invalid state (e.g., due to a fault), so it is very important that processing results usable to a certain degree can still be obtained when some of the processing cores in the many-core system are in an invalid state.

Summary
Embodiments of the present disclosure provide a task processing method and apparatus, a many-core system, and a computer-readable medium.

In a first aspect, an embodiment of the present disclosure provides a task processing method, including:

obtaining a computation graph of a problem to be processed, where the computation graph includes a plurality of layers, each layer includes a plurality of tasks, the tasks in any layer are not executed based on the results of the tasks in that layer or subsequent layers, and at least some of the tasks in at least some of the layers are executed based on the results of the tasks in the previous layers;

dividing the computation graph into a plurality of task groups, where each task group includes the tasks of at least one layer; and

determining a mapping relationship between the tasks of each task group and the processing cores of a many-core system, where in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
In a second aspect, an embodiment of the present disclosure provides an apparatus for task processing, including:

an obtaining module configured to obtain a computation graph of a problem to be processed, where the computation graph includes a plurality of layers, each layer includes a plurality of tasks, the tasks in any layer are not executed based on the results of the tasks in that layer or subsequent layers, and at least some of the tasks in at least some of the layers are executed based on the results of the tasks in the previous layers;

a partitioning module configured to divide the computation graph into a plurality of task groups, where each task group includes the tasks of at least one layer; and

a mapping module configured to determine a mapping relationship between the tasks of each task group and the processing cores of a many-core system, where in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
In a third aspect, an embodiment of the present disclosure provides a many-core system, including:

a plurality of processing cores; and

an on-chip network configured to exchange data among the plurality of processing cores and external data;

where one or more of the processing cores store one or more instructions, and the one or more instructions are executed by the one or more processing cores, so that the one or more processing cores can execute any of the above task processing methods.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, where the computer program implements any of the above task processing methods when executed by a processing core.

In a fifth aspect, an embodiment of the present disclosure provides a computer program product, including a computer program that, when executed by a processor, implements any of the above task processing methods.
In the embodiments of the present disclosure, the tasks in the computation graph are divided into a plurality of task groups, each task group includes the tasks of at least one layer, all the tasks of the same layer can be mapped to two different processing cores for processing, and tasks of different layers can be mapped into the same core group. Thus, when any processing core becomes invalid (e.g., due to a fault), a layer of the computation graph is at most only "partially broken", and it will never happen that all the tasks of a layer become invalid, so that the computation graph as a whole can still produce processing results that are usable to a certain degree, which greatly improves the robustness of the computation graph.

It should be understood that the content described in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easily understood from the following description.
Brief Description of the Drawings

The accompanying drawings are provided for a further understanding of the present disclosure and constitute a part of the specification; together with the embodiments of the present disclosure, they serve to explain the present disclosure and do not limit it. The above and other features and advantages will become more apparent to those skilled in the art from the description of detailed example embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of a task processing method provided by an embodiment of the present disclosure;

FIG. 2 is a flowchart of a task processing method provided by an embodiment of the present disclosure;

FIG. 3 is a flowchart of another task processing method provided by an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a process of processing a computation graph in an embodiment of the present disclosure;

FIG. 5 is a flowchart of a task processing method after step S205 in an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of the partitioning of a computation graph and the routing relationships of the tasks therein in an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a process of processing a computation graph in an embodiment of the present disclosure;

FIG. 8 is a block diagram of an apparatus for task processing provided by an embodiment of the present disclosure;

FIG. 9 is a block diagram of a many-core system provided by an embodiment of the present disclosure.
Detailed Description
为使本领域的技术人员更好地理解本公开的技术方案,以下结合附图对本公开的示范性实施例做出说明,其中包括本公开实施例的各种细节以助于理解,应当将它们认为仅仅是示范性的。因此,本领域普通技术人员应当认识到,可以对这里描述的实施例做出各种改变和修改,而不会背离本公开的范围和精神。同样,为了清楚和简明,以下的描述中省略了对公知功能和结构的描述。In order for those skilled in the art to better understand the technical solutions of the present disclosure, exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of the present disclosure to facilitate understanding, and they should be considered to be exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
在不冲突的情况下,本公开各实施例及实施例中的各特征可相互组合。Various embodiments of the present disclosure and various features of the embodiments may be combined with each other without conflict.
如本文所使用的,术语“和/或”包括一个或多个相关列举条目的任何和所有组合。As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
本文所使用的术语仅用于描述特定实施例,且不意欲限制本公开。如本文所使用的,单数形式“一个”和“该”也意欲包括复数形式,除非上下文另外清楚指出。还将理解的是,当本说明书中使用术语“包括”和/或“由……制成”时,指定存在所述特征、整体、步骤、操作、元件和/或组件,但不排除存在或添加一个或多个其它特征、整体、步骤、 操作、元件、组件和/或其群组。“连接”或者“相连”等类似的词语并非限定于物理的或者机械的连接,而是可以包括电性的连接,不管是直接的还是间接的。The terminology used herein is used to describe particular embodiments only and is not intended to limit the present disclosure. As used herein, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that when the terms "comprising" and/or "made of" are used in this specification, the stated features, integers, steps, operations, elements and/or components are specified to be present, but not precluded or Add one or more other features, integers, steps, operations, elements, components and/or groups thereof. Words like "connected" or "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
除非另外限定,否则本文所用的所有术语(包括技术和科学术语)的含义与本领域普通技术人员通常理解的含义相同。还将理解,诸如那些在常用字典中限定的那些术语应当被解释为具有与其在相关技术以及本公开的背景下的含义一致的含义,且将不解释为具有理想化或过度形式上的含义,除非本文明确如此限定。Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be construed as having meanings consistent with their meanings in the context of the related art and this disclosure, and will not be construed as having idealized or over-formal meanings, unless expressly so limited herein.
很多问题(如图像处理、语音识别等)中要进行的实际工作可用"计算图(或者说任务图、逻辑图)"的形式表述。即，要解决该问题需进行的所有运算被分为多个"任务(或者说节点)"，每个任务包括一定的运算，且不同任务间存在一定的顺序。例如，某任务的运算若要用到其它任务的运算结果，则称该任务基于其它任务的结果进行；或者说，该任务是其它任务的在后任务，其它任务是该任务的在前任务。The actual work to be done in many problems (such as image processing, speech recognition, etc.) can be expressed in the form of a "computation graph (or task graph, logic graph)". That is, all operations to be performed to solve the problem are divided into multiple "tasks (or nodes)"; each task includes a certain operation, and a certain order exists among different tasks. For example, if the operation of a task uses the operation results of other tasks, the task is said to be performed based on the results of those tasks; in other words, the task is a subsequent task of those tasks, and those tasks are preceding tasks of this task.
由于任务间存在以上关系，计算图可被分为多个"层"，每个层包括多个任务，且任意一个层中的任务，都不是基于本层或后续层中的任务进行的，而至少部分层中有至少部分任务基于其前层中的任务的结果进行。即，若在前层中的任务没有完成，可能导致在后层中的任务不能进行，因为在后层的任务的运算过程中可能要用到在前层的任务的运算结果；但若在后层中的任务不能进行，不会影响在前层中的任务，因为在前层的任务的运算不会使用在后层的任务的运算结果；且同一层中的任务不存在基于关系（从属关系），因为若存在基于关系，则相应的任务就应属于两个不同层。Because of the above relationships among tasks, the computation graph can be divided into multiple "layers". Each layer includes multiple tasks; no task in any layer is performed based on tasks in the same layer or in subsequent layers, while at least some tasks in at least some layers are performed based on the results of tasks in their preceding layers. That is, if a task in a preceding layer is not completed, a task in a subsequent layer may be unable to proceed, because the operation of the subsequent task may use the operation result of the preceding task; conversely, if a task in a subsequent layer cannot proceed, the tasks in preceding layers are unaffected, because the operations of preceding tasks never use the results of subsequent tasks. Moreover, no dependency (subordination) relationship exists between tasks within the same layer, because if such a relationship existed, the corresponding tasks would belong to two different layers.
示例性的，“神经网络(NN)”是计算图的一种形式。神经网络分为多层，每个层包括多个节点，在每个节点中需进行一定的运算，而不同层的节点间以一定的关系相连(如一个节点输出作为下一层节点的输入)；从而，神经网络的每个层可视为计算图的一个层，而神经网络的每个节点可视为计算图的一个任务。Illustratively, a "neural network (NN)" is one form of computation graph. A neural network is divided into multiple layers, each layer includes multiple nodes, a certain operation is performed in each node, and nodes of different layers are connected in a certain relationship (for example, the output of one node serves as the input of a node in the next layer); thus, each layer of the neural network can be regarded as a layer of the computation graph, and each node of the neural network can be regarded as a task of the computation graph.
示例性的,本公开实施例中的神经网络可用于进行图像处理、语音识别等,其具体可为卷积神经网络(CNN)、脉冲神经网络(SNN)、循环神经网络(RNN)等形式。Exemplarily, the neural network in the embodiment of the present disclosure can be used for image processing, speech recognition, etc., which can specifically be in the form of a convolutional neural network (CNN), a spiking neural network (SNN), a recurrent neural network (RNN), and the like.
其中，部分问题可能对应多个不同的计算图。即，计算图中的任务数、任务所处的层、任务间的关系、每个任务的具体运算等可不同，但这些不同计算图均可解决该问题(但解决问题的效果不一定相同)。Some problems may correspond to multiple different computation graphs. That is, the number of tasks in the computation graph, the layer each task belongs to, the relationships among tasks, and the specific operation of each task may differ, yet all of these different computation graphs can solve the problem (although not necessarily equally well).
以上可能有多种形式的计算图称为“可训练计算图”。即，对能解决一个问题的计算图，可通过训练对其中的任务进行调整，从而使训练后的计算图解决问题的效果不同。A computation graph that can take multiple such forms is called a "trainable computation graph". That is, for a computation graph that can solve a problem, the tasks in it can be adjusted through training, so that the trained computation graph solves the problem with a different effect.
例如，神经网络是可训练计算图的一种形式。例如，处理一个问题(如图像分类)的神经网络通常是训练得到的，即根据当前神经网络解决问题的效果(如图像分类的准确度)调整其中的节点(如调整节点的权重)，从而改变神经网络(计算图)，并改善其处理问题的效果(如提高图像分类的准确度)。For example, a neural network is one form of trainable computation graph. A neural network for handling a problem (such as image classification) is usually obtained by training: its nodes are adjusted (for example, by adjusting node weights) according to how well the current network solves the problem (such as the accuracy of image classification), thereby changing the neural network (the computation graph) and improving its performance on the problem (such as raising the accuracy of image classification).
在一些相关技术中，当要用众核系统处理一个问题时，可将其对应的计算图的每层的任务映射(分配)至一个处理核中，而不同层的任务被映射至不同处理核中。In some related technologies, when a problem is to be processed with a many-core system, the tasks of each layer of its corresponding computation graph are mapped (assigned) to one processing core, and the tasks of different layers are mapped to different processing cores.
但根据以上方式，一旦众核系统的某一个处理核无效(如因故障)，则相当于计算图的一层的全部任务都无法被处理，从而该层后的所有任务实际上也都无法进行，因此，必然导致整个问题完全无法被解决(即不能得出任何处理结果)，系统的鲁棒性很差。However, with the above approach, once any processing core of the many-core system becomes invalid (for example, due to a failure), all tasks of the corresponding layer of the computation graph cannot be processed, so that all tasks after that layer cannot actually proceed either. The whole problem therefore cannot be solved at all (that is, no processing result can be obtained), and the robustness of the system is very poor.
第一方面，本公开实施例提供一种任务处理的方法。该方法基于众核系统，其包括如何将一个计算图的任务映射至众核系统的各处理核中。In a first aspect, an embodiment of the present disclosure provides a task processing method. The method is based on a many-core system and includes how to map the tasks of a computation graph to the processing cores of the many-core system.
参照图1,本公开实施例任务处理的方法包括:Referring to FIG. 1 , the task processing method of the embodiment of the present disclosure includes:
S101,获取待处理问题的计算图。S101, obtaining a calculation graph of the problem to be processed.
计算图包括多个层，每个层包括多个任务，任意层中的任务不基于本层或其后层中的任务的结果执行，至少部分层中有至少部分任务基于其前层中的任务的结果执行。The computation graph includes multiple layers, each layer includes multiple tasks, no task in any layer is executed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in their preceding layers.
S102,将计算图分为多个任务组;每个任务组包括至少一个层的任务。S102: Divide the computation graph into multiple task groups; each task group includes tasks of at least one layer.
S103,确定各任务组的任务与众核系统的各处理核间的映射关系;在映射关系中,每个任务组对应一个核组,每个核组包括至少一个处理核。S103: Determine the mapping relationship between the tasks of each task group and each processing core of the many-core system; in the mapping relationship, each task group corresponds to a core group, and each core group includes at least one processing core.
当要用众核系统处理一个待处理问题(如图像处理、语音识别等)时，获取其对应计算图。其中，可以获取已预先设置的计算图；也可根据具体的待处理问题，按照预定规则生成计算图。When a problem to be processed (such as image processing or speech recognition) is to be handled by the many-core system, its corresponding computation graph is obtained. A preset computation graph may be obtained, or the computation graph may be generated according to predetermined rules based on the specific problem to be processed.
在一些实施例中，任务组包括任务块，任务块是基于计算图进行层内分块获得的，即，将计算图中每个层分为多个任务块，每个任务块包括至少一个任务。In some embodiments, the task group includes task blocks, and the task blocks are obtained by intra-layer partitioning of the computation graph; that is, each layer of the computation graph is divided into multiple task blocks, and each task block includes at least one task.
在一些实施例中，任务组包括任务区，任务区是基于计算图进行层间分区获得的，即，将计算图中的任务分为多个任务区，每个任务区至少包括两个层的任务。In some embodiments, the task group includes a task region, and the task region is obtained by inter-layer partitioning of the computation graph; that is, the tasks in the computation graph are divided into multiple task regions, and each task region includes tasks from at least two layers.
本公开实施例中，将计算图中的任务分为多个任务组，每个任务组包括至少一个层的任务，同一个层中的所有任务可以映射至两个不同的处理核中处理，也可以将不同层中的任务映射至同一核组内，从而在任意处理核无效(如因故障)时，计算图的一个层最多只“坏掉一部分”，而不会产生一个层的所有任务均无效的情况，从而使计算图整体仍可得出一定程度上可用的处理结果，极大的提高了计算图的鲁棒性。In the embodiment of the present disclosure, the tasks in the computation graph are divided into multiple task groups, each task group includes the tasks of at least one layer, all tasks in a same layer can be mapped to two different processing cores for processing, and tasks in different layers can be mapped into the same core group. Thus, when any processing core becomes invalid (for example, due to a failure), at most only part of a layer of the computation graph "breaks", and the situation in which all tasks of a layer are invalid does not arise; the computation graph as a whole can still produce a processing result that is usable to a certain degree, greatly improving the robustness of the computation graph.
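As an illustrative aid only, steps S101 to S103 can be sketched in Python. The helper names and the round-robin assignment of task groups to core groups below are hypothetical choices, not part of the embodiment:

```python
# Illustrative sketch of S101-S103 (all names are hypothetical).

def split_into_task_groups(layers, layers_per_group):
    """S102: each task group contains the tasks of at least one layer."""
    groups = []
    for i in range(0, len(layers), layers_per_group):
        group = [task for layer in layers[i:i + layers_per_group]
                 for task in layer]
        groups.append(group)
    return groups

def map_groups_to_core_groups(task_groups, core_groups):
    """S103: each task group corresponds to exactly one core group."""
    return {gid: core_groups[gid % len(core_groups)]
            for gid in range(len(task_groups))}

# S101: a toy computation graph with three layers of two tasks each.
layers = [["t00", "t01"], ["t10", "t11"], ["t20", "t21"]]
groups = split_into_task_groups(layers, layers_per_group=2)
mapping = map_groups_to_core_groups(groups, core_groups=[["core0"], ["core1"]])
```

Here the first task group spans two layers (matching the inter-layer partitioning variant), and each group is bound to one core group.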
图2为本公开实施例提供的一种任务处理的方法的流程图。参照图2,本公开实施例的任务处理的方法包括:FIG. 2 is a flowchart of a task processing method provided by an embodiment of the present disclosure. Referring to FIG. 2 , the task processing method according to the embodiment of the present disclosure includes:
S201,获取待处理问题的计算图。S201, obtaining a calculation graph of the problem to be processed.
其中，计算图包括多个依次设置的层，每个层包括多个任务，任意层中的任务不基于本层或其后层中的任务的结果进行，至少部分层中有至少部分任务基于其前层中的任务的结果进行。The computation graph includes a plurality of layers arranged in sequence, each layer includes a plurality of tasks, no task in any layer is performed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in their preceding layers.
当要用众核系统处理一个待处理问题(如图像处理、语音识别等)时，获取其对应计算图。其中，可以获取已预先设置的计算图；也可根据具体的待处理问题，按照预定规则生成计算图。When a problem to be processed (such as image processing or speech recognition) is to be handled by the many-core system, its corresponding computation graph is obtained. A preset computation graph may be obtained, or the computation graph may be generated according to predetermined rules based on the specific problem to be processed.
S203,将计算图的每个层分为多个任务块。S203: Divide each layer of the computation graph into multiple task blocks.
其中,每个任务块包括至少一个任务。Wherein, each task block includes at least one task.
参照图4，将计算图的每个层中的任务分为多“组”，即将每个层分为多个“任务块”，每个任务块包括该层中的一个或多个任务。其中，每层分出的任务块的数量，可以是预先设定的，如预先设定每层分为a个任务块。Referring to FIG. 4, the tasks in each layer of the computation graph are divided into multiple "groups"; that is, each layer is divided into multiple "task blocks", and each task block includes one or more tasks of that layer. The number of task blocks into which each layer is divided may be preset; for example, it may be preset that each layer is divided into a task blocks.
或者,也可以是根据确定的方式进行“分块”,以实际所得的任务块数量为准。Alternatively, "blocking" can also be performed according to a determined method, and the actual number of task blocks obtained shall prevail.
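Purely for illustration, blocking a layer into a preset number a of task blocks might look like the following sketch; the round-robin dealing of tasks is an assumption of this sketch, not a requirement of the embodiment:

```python
# Hypothetical sketch of S203: a layer's tasks are dealt into a preset
# number of task blocks; each returned block holds at least one task.

def split_layer(tasks, num_blocks):
    blocks = [[] for _ in range(num_blocks)]
    for i, task in enumerate(tasks):
        blocks[i % num_blocks].append(task)  # deal tasks out in turn
    return [b for b in blocks if b]  # keep only non-empty blocks

layer = ["n0", "n1", "n2", "n3", "n4"]
blocks = split_layer(layer, num_blocks=2)
```

Every task of the layer lands in exactly one block, so the blocks form a partition of the layer.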
S205,确定各任务块与众核系统的多个处理核间的第一映射关系。S205: Determine a first mapping relationship between each task block and multiple processing cores of the many-core system.
其中，在第一映射关系中，每个任务块映射到一个处理核中，每个处理核中映射多个任务块，任意一个层的所有任务块被映射到至少两个不同处理核中。In the first mapping relationship, each task block is mapped into one processing core, a plurality of task blocks are mapped into each processing core, and all the task blocks of any one layer are mapped into at least two different processing cores.
参照图4,在得到多个任务块后,则可确定应将每个任务块映射到众核系统的哪个处理核中,也就是第一映射关系(不一定进行实际映射)。Referring to FIG. 4 , after obtaining a plurality of task blocks, it can be determined which processing core of the many-core system should map each task block to, that is, the first mapping relationship (not necessarily the actual mapping).
根据本公开实施例的第一映射关系，映射是“分散”进行，即同一个层中的所有任务块应尽量“分开”映射到不同处理核中，至少保证它们不会被全部映射到同一个处理核中。由于每个任务块包括同一层中的任务，且同一层的各任务块被映射到不同处理核中，故以上第一映射关系保证了同一层的所有任务至少映射到两个不同的处理核中。According to the first mapping relationship of the embodiment of the present disclosure, the mapping is performed in a "scattered" manner; that is, the task blocks of a same layer should be mapped into different processing cores as separately as possible, at least ensuring that they are not all mapped into one and the same processing core. Since each task block includes tasks of one layer, and the task blocks of a same layer are mapped into different processing cores, the above first mapping relationship ensures that all tasks of a same layer are mapped into at least two different processing cores.
本公开实施例中，计算图的同一个层中的所有任务至少映射至两个不同的处理核中处理，从而在任意处理核无效(如因故障)时，计算图的一个层最多只“坏掉一部分”，而不会产生一个层的所有任务均无效的情况，从而使计算图整体仍可得出一定程度上可用的处理结果，极大的提高了计算图的鲁棒性。In the embodiment of the present disclosure, all tasks in a same layer of the computation graph are mapped to at least two different processing cores for processing, so that when any processing core becomes invalid (for example, due to a failure), at most only part of a layer of the computation graph "breaks", and the situation in which all tasks of a layer are invalid does not arise; the computation graph as a whole can still produce a processing result that is usable to a certain degree, greatly improving the robustness of the computation graph.
在一些实施例中,计算图为可训练计算图。其中,可训练的计算图能在其中至少部分任务有不同的情况下,解决相同的待处理问题。In some embodiments, the computational graph is a trainable computational graph. Among them, a trainable computational graph can solve the same problem to be solved when at least some of the tasks are different.
在一些实施例中，计算图为神经网络(NN)。作为本公开实施例的一种方式，以上计算图是可训练计算图，进一步是神经网络，如卷积神经网络(CNN)、脉冲神经网络(SNN)、循环神经网络(RNN)等；从而可通过对计算图进行训练，进一步提高其冗余性能。当然，本公开实施例计算图并不限用于可训练计算图(如神经网络)，而是也可用于其它的计算图。In some embodiments, the computation graph is a neural network (NN). As one mode of the embodiment of the present disclosure, the above computation graph is a trainable computation graph, and further a neural network, such as a convolutional neural network (CNN), a spiking neural network (SNN), or a recurrent neural network (RNN); the computation graph can thus be trained to further improve its redundancy performance. Of course, the embodiments of the present disclosure are not limited to trainable computation graphs (such as neural networks) and can also be applied to other computation graphs.
在一些实施例中,在第一映射关系中,任意一个层的任意两个任务块被分别映射到两个不同处理核中。In some embodiments, in the first mapping relationship, any two task blocks of any layer are respectively mapped to two different processing cores.
进一步的,参照图4,可进行最大限度的分散映射(或者说是“互斥”映射),即保证不会有来自同一层的任务块被映射到同一个处理核中。Further, referring to FIG. 4 , a maximum scattered mapping (or “mutually exclusive” mapping) can be performed, that is, it is ensured that no task blocks from the same layer are mapped to the same processing core.
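The "mutually exclusive" first mapping relationship can be sketched as follows; the per-layer offset used to spread blocks across cores is an illustrative choice (any assignment works as long as blocks of one layer land on pairwise distinct cores):

```python
# Sketch of the mutually exclusive first mapping (S205): task blocks of
# one layer occupy pairwise distinct cores, while one core may still hold
# blocks from different layers.

def build_first_mapping(blocks_per_layer, num_cores):
    mapping = {}  # (layer_id, block_id) -> processing core id
    for layer_id, n_blocks in enumerate(blocks_per_layer):
        if n_blocks > num_cores:
            raise ValueError("need at least one distinct core per block")
        for block_id in range(n_blocks):
            # offset by layer so different layers can share cores
            mapping[(layer_id, block_id)] = (layer_id + block_id) % num_cores
    return mapping

mapping = build_first_mapping(blocks_per_layer=[2, 2, 2], num_cores=3)
# Within every layer, the two blocks occupy two different cores.
per_layer_cores = [{mapping[(l, b)] for b in range(2)} for l in range(3)]
```

With this sketch a single failed core destroys at most one block per layer, which is exactly the robustness property argued for above.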
在一些实施例中,参照图3、图4,在获取待处理问题的计算图(S201)与将计算图的每个层分为多个任务块(S203)之间,还包括:In some embodiments, referring to FIG. 3 and FIG. 4 , between acquiring the computation graph of the problem to be processed (S201) and dividing each layer of the computation graph into a plurality of task blocks (S203), the method further includes:
S202,训练计算图,以提高计算图的冗余性能。S202 , training the computation graph to improve the redundancy performance of the computation graph.
在对计算图进行“分块”之前,还可对其进行训练,以提高其冗余性能。The computational graph can also be trained before it is "chunked" to improve its redundancy performance.
在一些实施例中,训练计算图(S202),包括以下至少一项:In some embodiments, the training computation graph (S202) includes at least one of the following:
(1)无效计算图中的部分任务,以训练计算图。(1) Invalidate some tasks in the computational graph to train the computational graph.
作为本公开实施例的一种方式，可采用Dropout方式，将计算图中的部分任务无效(如使神经网络的部分节点的权重为0)，并调整其它任务，使计算图可在这些任务无效的情况下产生一定程度上可用的结果，改善其鲁棒性。As one mode of the embodiment of the present disclosure, a Dropout approach may be used to invalidate some tasks in the computation graph (for example, by setting the weights of some nodes of the neural network to 0) and adjust the other tasks, so that the computation graph can still produce results that are usable to a certain degree when these tasks are invalid, improving its robustness.
(2)无效计算图的一个区域,以训练计算图。(2) A region of the invalid computational graph to train the computational graph.
其中,区域包括多个任务。Among them, the area includes multiple tasks.
作为本公开实施例的一种方式，可采用Dropblock的方式，使计算图中位于一个区域(可包括一层的一部分，或多个层的相应部分)内的任务全部无效，并调整其它任务，使计算图可在该区域的任务无效的情况下产生一定程度上可用的结果，改善其鲁棒性。As one mode of the embodiment of the present disclosure, a Dropblock approach may be used to invalidate all tasks located in one region of the computation graph (which may cover part of one layer, or corresponding parts of multiple layers) and adjust the other tasks, so that the computation graph can still produce results that are usable to a certain degree when the tasks in that region are invalid, improving its robustness.
以上方式的每个“区域”中,通常有较多任务最终被映射到一个处理核中,故其可更好的提高计算图的冗余性能。In each "area" of the above method, usually more tasks are finally mapped to one processing core, so it can better improve the redundancy performance of the computational graph.
(3)通过对抗样本防御方式训练计算图。(3) The computational graph is trained by adversarial sample defense.
作为本公开实施例的一种方式,可采用“对抗样本防御”方式训练计算图。As a method of the embodiment of the present disclosure, the "adversarial sample defense" method can be used to train the computational graph.
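Training methods (1) and (2) above can be sketched as follows; representing tasks by plain weight values and zeroing them is an assumption of this sketch, not the disclosed training procedure itself:

```python
# Sketch of (1) Dropout-style invalidation of individual tasks and
# (2) Dropblock-style invalidation of a whole contiguous region.

import random

def drop_tasks(weights, drop_rate, rng):
    """(1) Invalidate each task independently (set its weight to 0)."""
    return [0.0 if rng.random() < drop_rate else w for w in weights]

def drop_region(weights, start, end):
    """(2) Invalidate every task in one contiguous region [start, end)."""
    return [0.0 if start <= i < end else w for i, w in enumerate(weights)]

rng = random.Random(0)  # fixed seed for reproducibility
dropped = drop_tasks([0.5, 1.0, 1.5, 2.0], drop_rate=0.5, rng=rng)
region = drop_region([0.5, 1.0, 1.5, 2.0], start=1, end=3)
```

The remaining (non-zero) weights would then be adjusted by further training so the graph still yields usable results with the dropped tasks missing.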
本公开实施例中,所有要进行训练的计算图必然都是可训练计算图(如神经网络)。In the embodiment of the present disclosure, all computation graphs to be trained must be trainable computation graphs (eg, neural networks).
本公开实施例中,每种训练可仅进行一次,也可循环进行多次。当进行多次训练时,各次训练采用的具体方式可相同,也可不同。当进行多次训练时,可在达到预设的结束标准时结束训练,结束标准可包括计算图收敛、达到预定训练次数、达到预定的冗余性能等。In the embodiment of the present disclosure, each type of training may be performed only once, or may be performed multiple times in a loop. When performing multiple training sessions, the specific methods used for each training session may be the same or different. When multiple trainings are performed, the training can be ended when a preset end criterion is reached, and the end criterion may include the convergence of the computation graph, reaching a predetermined number of training times, reaching a predetermined redundancy performance, and the like.
本公开实施例中的所有训练都可符合以上要求,后续不再详细描述。All the trainings in the embodiments of the present disclosure can meet the above requirements, and will not be described in detail later.
本公开实施例对计算图的训练方式不作限定，能够实现计算图训练的任意一种方法均可用于对计算图进行训练。The embodiments of the present disclosure do not limit the manner of training the computation graph; any method capable of training a computation graph may be used to train it.
在一些实施例中，参照图3、图4，将计算图的每个层分为多个任务块(S203)，包括：扩展计算图，将扩展后的计算图的每个层分为多个任务块。In some embodiments, referring to FIG. 3 and FIG. 4, dividing each layer of the computation graph into multiple task blocks (S203) includes: expanding the computation graph, and dividing each layer of the expanded computation graph into multiple task blocks.
其中,扩展包括在计算图的至少部分层中添加冗余任务。Wherein, the expansion includes adding redundant tasks to at least some layers of the computation graph.
作为本公开实施例的一种方式，可先对其进行“扩展”，或者说“反压缩”，即在计算图各层中“增加”一些原本没有的任务(冗余任务)，之后再对经过反压缩的计算图分块，从而所得的至少部分任务块中包括以上“冗余任务”，以提高冗余性能。As one mode of the embodiment of the present disclosure, the computation graph may first be "expanded", or "decompressed"; that is, some tasks that were not originally present (redundant tasks) are "added" to the layers of the computation graph, and the decompressed computation graph is then divided into blocks, so that at least some of the resulting task blocks include the above "redundant tasks", improving redundancy performance.
其中，各层具体的“扩展量”可以是预先设定的。例如，可设定冗余系数b，表示扩展出的任务的运算量相对原任务的运算量的比例：若冗余系数b=0，则相当于未扩展；若b=1，则相当于扩展出的任务的运算量与原任务的运算量相同；通常而言，b可大于0，且小于或等于1(若b大于1也可行，如可有任务备份“多份”)。The specific "expansion amount" of each layer may be preset. For example, a redundancy coefficient b may be set, representing the ratio of the computation amount of the added tasks to that of the original tasks: b = 0 is equivalent to no expansion, and b = 1 means the computation amount of the added tasks equals that of the original tasks. Generally, b may be greater than 0 and less than or equal to 1 (b greater than 1 is also feasible, for example when a task is backed up in "multiple copies").
或者,也可以是根据确定的方式对各层进行扩展,以根据该方式扩展得到的实际任务量为准。Alternatively, each layer may be expanded according to a certain method, and the actual task amount obtained by the expansion according to this method shall prevail.
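A minimal sketch of expansion with a redundancy coefficient b follows; backing up the first ceil(b·n) tasks (rather than weighting by per-task computation amount) is a simplifying assumption of this sketch:

```python
# Sketch of layer expansion: the added (redundant) workload is roughly
# b times the original layer's workload.

import math

def expand_layer(tasks, b):
    n_redundant = math.ceil(b * len(tasks))
    backups = [t + "_backup" for t in tasks[:n_redundant]]
    return tasks + backups  # b = 0 adds nothing; b = 1 duplicates the layer

expanded = expand_layer(["n0", "n1", "n2", "n3"], b=0.5)
```

The backup tasks would subsequently be placed in different task blocks from their originals, as noted below for backup tasks in general.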
在一些实施例中,冗余任务包括以下至少一项:In some embodiments, redundant tasks include at least one of the following:
(1)备份任务。其中,备份任务与相应层中的任务相同。作为本公开实施例的一种方式,每个扩展出的任务可是其对应的层中原有的一个任务,从而,相应任务实际存在“多份”,故可相互作为“备份”。即,当某个任务未被完成时(如所在处理核故障),后续任务可利用其备份的运算结果进行,以提高鲁棒性。(1) Backup tasks. Among them, the backup task is the same as the task in the corresponding layer. As a mode of the embodiment of the present disclosure, each extended task may be an original task in its corresponding layer, so that there are actually "multiple copies" of the corresponding task, so they can be used as "backups" for each other. That is, when a certain task is not completed (eg, the processing core fails), subsequent tasks can be performed using its backed-up operation results to improve robustness.
当然，对于“备份任务”和“原任务”，后续通常应被分入不同的任务块，且这些任务块应被映射到不同的处理核中。Of course, a "backup task" and its "original task" should usually be divided into different task blocks subsequently, and these task blocks should be mapped into different processing cores.
(2)空任务。作为本公开实施例的一种方式,可扩展不进行实际运算(或者说进行空运算)的空任务。(2) Empty tasks. As a way of the embodiment of the present disclosure, empty tasks that do not perform actual operations (or perform empty operations) can be expanded.
(3)无效任务。作为本公开实施例的一种方式，可扩展出一些需要进行运算，但不是原计算图中所需的运算的任务，即无效任务。其中，无效任务可以是随机产生的，也可通过其它具体的反压缩技术产生。(3) Invalid tasks. As one mode of the embodiment of the present disclosure, the expansion may add tasks that perform computation but whose operations are not required by the original computation graph, i.e., invalid tasks. Invalid tasks may be generated randomly, or by other specific decompression techniques.
其中,不同层的扩展方式可以相同,也可不同。The expansion modes of different layers may be the same or different.
在一些实施例中,将计算图的每个层分为多个任务块(S203),包括以下任意一项:In some embodiments, each layer of the computation graph is divided into a plurality of task blocks (S203), including any one of the following:
(1)将计算图的每个层随机分为多个任务块。(1) Randomly divide each layer of the computational graph into multiple task blocks.
作为本公开实施例的一种方式，可以是对每层中的任务随机的进行“分块”，即每个任务块中的任务数量以及具体的任务，都是随机的。As one mode of the embodiment of the present disclosure, the tasks in each layer may be divided into blocks randomly; that is, both the number of tasks in each task block and the specific tasks it contains are random.
(2)将计算图的每个层均匀的分为多个任务块。(2) Divide each layer of the computation graph into multiple task blocks evenly.
作为本公开实施例的一种方式，可以是将每层均匀的分为多个任务块，即同一层的各任务块中的任务数量相等或基本相等(例如，同一层的所有任务块中，若以任务最少的任务块中的任务数量为100%，则任务最多的任务块中的任务数量不超过110%)。As one mode of the embodiment of the present disclosure, each layer may be divided evenly into multiple task blocks; that is, the numbers of tasks in the task blocks of a same layer are equal or substantially equal (for example, among all task blocks of a same layer, if the number of tasks in the smallest task block is taken as 100%, the number of tasks in the largest task block does not exceed 110%).
(3)将计算图的每个层分为多个预任务块,将根据第一映射关系应被映射到一个处理核的所有预任务块合并为一个任务块。(3) Divide each layer of the computation graph into multiple pre-task blocks, and combine all the pre-task blocks that should be mapped to one processing core according to the first mapping relationship into one task block.
作为本公开实施例的一种方式，可以在分块时考虑到后续的“映射”，即先将各层分为多个“块(预任务块)”，而后续若有多个预任务块会被映射到一个处理核中，则可将它们直接合并，作为一个任务块。As one mode of the embodiment of the present disclosure, the subsequent "mapping" may be taken into account during blocking; that is, each layer is first divided into multiple "blocks (pre-task blocks)", and if multiple pre-task blocks would subsequently be mapped into one processing core, they can be merged directly into a single task block.
(4)至少根据各处理核的硬件资源,将计算图的每个层分为多个任务块。(4) At least according to the hardware resources of each processing core, each layer of the computation graph is divided into a plurality of task blocks.
作为本公开实施例的一种方式，在进行“分块”时，还可考虑处理核的实际硬件资源(如缓存等)，以及后续的“映射”，从而根据每个任务块被映射到的处理核的硬件资源，决定应将哪些任务分入该任务块中。As one mode of the embodiment of the present disclosure, the actual hardware resources of the processing cores (such as caches) and the subsequent "mapping" may also be considered during blocking, so that which tasks should be grouped into a task block is decided according to the hardware resources of the processing core that the task block will be mapped into.
其中,不同处理核的硬件资源可相同,也可不同。The hardware resources of different processing cores may be the same or different.
其中,不同层的分块方式可以相同,也可不同。Wherein, the block mode of different layers may be the same or different.
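Blocking strategy (2) above, uniform blocking, can be sketched as follows; dealing tasks so block sizes differ by at most one is an illustrative way to satisfy the bound, and for large layers it keeps the largest block well within 110% of the smallest:

```python
# Sketch of uniform blocking: the first `extra` blocks receive one task
# more than the rest, so sizes differ by at most one.

def uniform_blocks(tasks, num_blocks):
    base, extra = divmod(len(tasks), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)
        blocks.append(tasks[start:start + size])
        start += size
    return blocks

sizes = [len(b) for b in uniform_blocks(list(range(100)), num_blocks=3)]
```

With 100 tasks and 3 blocks the sizes are 34/33/33, i.e. the largest block is about 103% of the smallest, inside the 110% bound stated above.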
在一些实施例中,根据映射关系,任意一个层的任意两个任务块被映射到两个不同处理核中。In some embodiments, any two task blocks of any one layer are mapped to two different processing cores according to the mapping relationship.
在一些实施例中，参照图3、图4，在将计算图的每个层分为多个任务块(S203)与确定各任务块与众核系统的多个处理核间的第一映射关系(S205)之间，还包括：In some embodiments, referring to FIG. 3 and FIG. 4, between dividing each layer of the computation graph into multiple task blocks (S203) and determining the first mapping relationship between the task blocks and the multiple processing cores of the many-core system (S205), the method further includes:
S204,无效部分任务块,以训练各任务块,提高计算图的冗余性能。S204, invalidating some task blocks to train each task block and improve the redundancy performance of the computation graph.
在进行“分块”后，还可将一个或多个(但不能是全部)任务块无效(即将其中所有的任务无效)，并调整其它的任务块中的任务，以使剩余的任务块仍可产生一定程度上可用的结果(即训练任务块)，改善计算图的鲁棒性。After "blocking", one or more (but not all) task blocks may also be invalidated (that is, all tasks in them are invalidated), and the tasks in the other task blocks are adjusted so that the remaining task blocks can still produce results that are usable to a certain degree (that is, the task blocks are trained), improving the robustness of the computation graph.
当然,以上训练本质上也是对“计算图”的训练。Of course, the above training is essentially training on "computational graphs".
以上训练是在分块后(也是在扩展或反压缩后)进行的,相当于提高“块级别”的鲁棒性。The above training is performed after chunking (also after expansion or decompression), which is equivalent to improving the robustness at the "chunk level".
在一些实施例中,无效部分任务块,以训练各任务块(S204)包括以下至少一项:In some embodiments, invalidating partial task blocks to train each task block (S204) includes at least one of the following:
(1)随机无效部分任务块,以训练计算图中其它的任务块。(1) Random invalid partial task blocks to train other task blocks in the computation graph.
作为本公开实施例的一种方式，可以是随机的无效部分任务块，以训练计算图中剩余的任务块。As one mode of the embodiment of the present disclosure, some task blocks may be invalidated at random to train the remaining task blocks in the computation graph.
(2)确定包括关键任务的关键任务块，无效关键任务块，以训练计算图中各任务块。(2) Determining key task blocks that include key tasks, and invalidating the key task blocks, so as to train the task blocks in the computation graph.
作为本公开实施例的一种方式，可以是根据计算图的结构特征确定出其中起到关键作用的“关键任务”，并将关键任务所在的关键任务块无效，以训练任务块。As one mode of the embodiment of the present disclosure, "key tasks" that play a key role may be determined according to the structural features of the computation graph, and the key task blocks where the key tasks are located are invalidated, so as to train the task blocks.
在一些实施例中,参照图3、图4,在确定各任务块与众核系统的多个处理核间的第一映射关系(S205)之后,还包括:In some embodiments, referring to FIG. 3 and FIG. 4 , after determining the first mapping relationship between each task block and the multiple processing cores of the many-core system (S205), the method further includes:
S206，无效部分处理核中映射的全部任务块，以训练各任务块，提高计算图的冗余性能。S206, invalidating all task blocks mapped into some of the processing cores, so as to train the task blocks and improve the redundancy performance of the computation graph.
在确定了第一映射关系后(包括进行实际的映射后)，还可将部分处理核中映射的全部任务无效(相当于将部分处理核无效)，并调整其它任务，以使剩余的任务仍可产生一定程度上可用的结果(即训练任务块)，改善鲁棒性。After the first mapping relationship is determined (including after the actual mapping is performed), all tasks mapped into some of the processing cores may also be invalidated (equivalent to invalidating those processing cores), and the other tasks are adjusted so that the remaining tasks can still produce results that are usable to a certain degree (that is, the task blocks are trained), improving robustness.
当然,以上训练本质上也是对“计算图”的训练。Of course, the above training is essentially training on "computational graphs".
以上训练相当于模拟部分处理核因故障等无效的情况,故可从最终实际应用的角度,改善“核级别”的鲁棒性。The above training is equivalent to simulating some cases where the core is invalid due to failure, so the robustness of the "core level" can be improved from the perspective of the final practical application.
在一些实施例中，无效部分处理核中映射的全部任务块，以训练各任务块(S206)，包括以下至少一项：In some embodiments, invalidating all task blocks mapped into some of the processing cores to train the task blocks (S206) includes at least one of the following:
(1)随机无效部分处理核中映射的全部任务块，以训练各任务块。(1) Randomly invalidating all task blocks mapped into some of the processing cores, so as to train the task blocks.
作为本公开实施例的一种方式，可以是将映射至一个或多个(但不能是全部)处理核的全部任务无效(即将一个或多个处理核无效)，以训练各任务块。As one mode of the embodiment of the present disclosure, all tasks mapped into one or more (but not all) processing cores may be invalidated (that is, one or more processing cores are invalidated), so as to train the task blocks.
(2)依次分别无效每个处理核中映射的全部任务块,以训练各任务块。(2) Invalidate all the task blocks mapped in each processing core in turn, so as to train each task block.
作为本公开实施例的一种方式,可以是将各处理核依次无效,即每次仅无效一个处理核,但所有处理核均被无效过。As a method of the embodiment of the present disclosure, each processing core may be invalidated in sequence, that is, only one processing core is invalidated at a time, but all processing cores have been invalidated.
(3)确定包括关键任务的关键任务块，无效关键任务块被映射到的处理核中映射的全部任务块，以训练各任务块。(3) Determining key task blocks that include key tasks, and invalidating all task blocks mapped into the processing cores to which the key task blocks are mapped, so as to train the task blocks.
作为本公开实施例的一种方式，可以是根据计算图的结构确定关键任务，进而确定关键任务所在的关键任务块，并将关键任务块对应的处理核无效，以训练各任务块。As one mode of the embodiment of the present disclosure, key tasks may be determined according to the structure of the computation graph, the key task blocks where the key tasks are located are then determined, and the processing cores corresponding to the key task blocks are invalidated, so as to train the task blocks.
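Strategy (2) of S206, invalidating each processing core in turn, can be sketched as follows; the `evaluate` callback standing in for the train-and-adjust step is a hypothetical name, not a disclosed API:

```python
# Sketch of S206 strategy (2): invalidate one processing core at a time
# (all task blocks mapped into it become invalid) and evaluate/train the
# surviving blocks under each single-core failure.

def train_under_single_core_failures(core_to_blocks, evaluate):
    results = {}
    for failed_core in core_to_blocks:
        surviving = {core: blocks
                     for core, blocks in core_to_blocks.items()
                     if core != failed_core}
        results[failed_core] = evaluate(surviving)
    return results

core_to_blocks = {"c0": ["b0", "b2"], "c1": ["b1", "b3"], "c2": ["b4"]}
surviving_counts = train_under_single_core_failures(
    core_to_blocks,
    evaluate=lambda alive: sum(len(b) for b in alive.values()),
)
```

Each core is invalidated exactly once, matching "only one processing core at a time, but all processing cores are invalidated" in the text above.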
在一些实施例中,参照图3,在确定各任务块与众核系统的多个处理核间的第一映射关系(S205)之后,还包括:In some embodiments, referring to FIG. 3 , after determining the first mapping relationship between each task block and the multiple processing cores of the many-core system (S205), the method further includes:
S207,按照第一映射关系,将各任务块映射到多个处理核中。S207, map each task block to a plurality of processing cores according to the first mapping relationship.
S208,每个处理核处理被映射到其中的任务块中的任务。S208, each processing core processes the tasks in the task blocks mapped therein.
在确定第一映射关系后，还可根据第一映射关系将各任务块映射(或者说分配)到处理核中，并由各处理核进行处理，以实现计算图的实际功能，解决以上待处理问题。After the first mapping relationship is determined, the task blocks may be mapped (or assigned) to the processing cores according to the first mapping relationship and processed by the processing cores, so as to realize the actual function of the computation graph and solve the above problem to be processed.
当然,以上确定第一映射关系(S205)和按照第一映射关系进行映射(S207)的步骤实际可以是一体的,即可直接进行映射。Of course, the above steps of determining the first mapping relationship (S205) and performing the mapping according to the first mapping relationship (S207) may actually be integrated, that is, the mapping can be performed directly.
当然,当要进行以上S206步骤的训练时,参照图3,S206步骤可以是在S207步骤前进行的,即可不将任务实际映射至处理核中,而只根据第一映射关系,将处理核对应的任务无效,进行训练。Of course, when the training of the above step S206 is to be performed, referring to FIG. 3 , the step S206 may be performed before the step S207, that is, the task is not actually mapped to the processing core, but only the processing core is mapped according to the first mapping relationship. The task is invalid, and the training is performed.
或者,S206步骤也可在S207步骤之后进行,即可将任务实际映射至处理核中,并实际将处理核无效,以进行训练。Alternatively, step S206 can also be performed after step S207, that is, the task can be actually mapped to the processing core, and the processing core can be actually invalidated for training.
In the embodiments of the present disclosure, all tasks in one layer of the computation graph are mapped to at least two different processing cores for processing, which improves the robustness of the computation graph. However, tasks of different layers are necessarily located in different processing cores, so for every subsequent task performed based on the result of a preceding task, that result must be transmitted across cores, from the processing core where the preceding task is located to the processing core where the subsequent task is located. As a result, a large number of inter-core routes necessarily exist between the processing cores, leading to a complex inter-core routing structure and a large amount of data transmitted by inter-core routing; and since the transmission efficiency of inter-core routing is far lower than that of intra-core routing, the complex inter-core routing degrades the performance of the many-core system.
Referring to FIG. 5, an embodiment of the present disclosure provides a task processing method in which tasks of different layers are grouped into one task area by inter-layer partitioning, so as to reduce the routing of the computation graph. The task processing method includes:

Step S501: acquiring a computation graph to be processed, where the computation graph includes multiple layers arranged in sequence, each layer includes multiple tasks, no task in any layer is performed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in preceding layers.
It should be noted that the computation graph may also be obtained in other ways, for example, as a preset computation graph; it may also be generated according to predetermined rules for the specific problem to be processed.

S503: dividing the computation graph into multiple task areas.

Each task area includes multiple tasks from at least two different layers, and the tasks of each layer are located in at least two task areas. The internal routing strength of each task area exceeds a preset standard, where the internal routing strength of a task area is determined according to the proportion of the routes between tasks within that task area to all the routes involving the tasks of that task area.
In some embodiments, dividing the computation graph into multiple task areas means assigning each task of the computation graph to a corresponding task area. Referring to FIG. 6 and FIG. 7, the computation graph is divided into three task areas (task area 0 to task area 2); the number of tasks in a layer is represented by the vertical size of the corresponding filled region, and the processing cores (processing core 0 to processing core 2) and core groups (core group 0 to core group 2) are represented by blank boxes.

The tasks in each task area cannot all come from one layer of the computation graph, but must come from at least two different layers; likewise, the tasks of one layer cannot all be located in one task area, but must be divided into at least two task areas. Thus, each task area effectively takes some tasks from each of multiple layers of the computation graph.
Referring to FIG. 6, many "routes" (or data paths) connect tasks in different layers of the computation graph: the operation result of a preceding task is transmitted to a subsequent task through a route, so that the subsequent task can use that result. Thus, many tasks are connected to one or more routes (both incoming and outgoing routes), and these are the routes "involved" in that task. Among "all routes" involved in the tasks of a task area (shown as dashed boxes in FIG. 6 and FIG. 7), some routes connect tasks within that task area at both ends; such routes are called the "internal routes" of the task area (shown as solid arrows in FIG. 6), i.e., the routes between tasks within the task area. Other routes connect a task in the task area at one end and a task outside the task area (necessarily a task in another task area) at the other end; such routes are called the "external routes" of the task area (shown as dashed arrows in FIG. 6).

The "internal routing strength" of a task area can be calculated according to the proportion of its "internal routes" to its "all routes"; thus, the internal routing strength reflects the relative degree of "internal interaction" among the tasks of a task area.

In the embodiments of the present disclosure, the internal routing strength of each task area obtained by the division exceeds the preset standard. Therefore, each task area is necessarily a "strong internal routing region", that is, the tasks in each task area mainly perform "internal interactions" and have fewer "external interactions" with tasks in other task areas.
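The internal/external distinction above can be sketched in code. The following is an illustrative sketch only, not part of the disclosed embodiments; the function name `classify_routes` and the data layout are hypothetical assumptions. Given the routes of the computation graph as directed task-to-task edges and an assignment of tasks to task areas, each route involving a task area is classified as internal or external:

```python
# Hypothetical sketch: classify each route involving a task area as internal
# or external, given a task -> area assignment. All names are illustrative.
def classify_routes(routes, area_of):
    """routes: list of (src_task, dst_task); area_of: dict task -> area id.
    Returns {area: {"internal": [...], "external": [...]}}."""
    result = {}
    for src, dst in routes:
        # a route "involves" the area of each of its endpoints
        for area in {area_of[src], area_of[dst]}:
            bucket = result.setdefault(area, {"internal": [], "external": []})
            if area_of[src] == area_of[dst]:
                bucket["internal"].append((src, dst))
            else:
                bucket["external"].append((src, dst))
    return result

# Tasks t0 and t1 are in task area 0; t2 is in task area 1.
areas = {"t0": 0, "t1": 0, "t2": 1}
routes = [("t0", "t1"), ("t1", "t2")]
stats = classify_routes(routes, areas)
# Area 0 has one internal route (t0 -> t1) and one external route (t1 -> t2).
```

For each area, the two lists produced here together form the "all routes" whose internal share defines the internal routing strength.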
It should be noted that, for clarity of illustration, the computation graph is shown with three layers (layer 0 to layer 2) and is divided into three task areas. This does not limit the embodiments of the present disclosure, which place no restriction on the number of layers or the number of tasks of the computation graph.
S505: determining a second mapping relationship between the tasks of the task areas and the processing cores of the many-core system.

In the second mapping relationship, each task area corresponds to one core group, each core group includes one or more processing cores, and the tasks of each task area are mapped to the processing cores of the corresponding core group.

Referring to FIG. 7, after the task areas are determined, a corresponding core group is determined for each task area (each core group including one or more processing cores of the many-core system), and all tasks in each task area are mapped to the processing cores of the corresponding core group. That is, every task in a task area is mapped to some processing core of the corresponding core group, and every processing core of a core group has at least one task from the corresponding task area mapped to it.
In the embodiments of the present disclosure, a "strong internal routing region" of the computation graph, whose internal connections are relatively close, is divided into one task area, so the tasks of each task area perform fewer "external interactions"; and since each task area is mapped to one core group, less "cross-group" data transmission is needed between different core groups. Thus, the embodiments of the present disclosure require less inter-core (or inter-group) routing, which simplifies the inter-core (or inter-group) routing structure and reduces the amount of data transmitted by inter-core (or inter-group) routing, thereby improving the performance (e.g., efficiency) of the many-core system.

In addition, in the embodiments of the present disclosure, each task area includes tasks from multiple different layers, that is, the tasks of different layers are assigned to different core groups. Therefore, when one processing core is invalid (e.g., due to a fault), only part of a layer of the computation graph is typically "broken", and the situation in which all tasks of a layer are invalid does not arise; the computation graph as a whole can thus still produce a processing result that is usable to some extent, which improves the robustness of the computation graph.
It should be noted that the computation graph is a trainable computation graph, which can solve the same problem to be processed even when at least some of its tasks differ. The computation graph may be a neural network or another computation graph; the neural network may be, for example, a convolutional neural network (CNN), a spiking neural network (SNN), or a recurrent neural network (RNN). The redundancy performance of the computation graph can be improved by training it.

In some embodiments, at least some of the task areas include tasks from every layer of the computation graph. Referring to FIG. 6 and FIG. 7, the tasks of a task area may come from all layers, that is, a task area may take at least one task from every layer. Further, every task area may include tasks from every layer.
In some embodiments, the internal routing strength includes: an internal routing volume ratio, where the internal routing volume ratio of each task area is the proportion of the amount of data transmitted by the routes between tasks within that task area to the amount of data transmitted by all the routes involving the tasks of that task area; and/or an internal routing count ratio, where the internal routing count ratio of each task area is the proportion of the number of routes between tasks within that task area to the number of all the routes involving the tasks of that task area.

In some embodiments, the "internal routing volume ratio" and/or the "internal routing count ratio" may be used as the specific indicators of the above internal routing strength. All the routes of each task area consist of internal routes and external routes, and the amount of data transmitted by each route during the operation of the computation graph is fixed and can be known in advance; thus, the proportion of the data volume transmitted by the internal routes of a task area to the data volume transmitted by all its routes can serve as the internal routing volume ratio, one specific indicator of internal routing strength. Likewise, the "number" of each kind of route is determined and known; thus, the proportion of the number of internal routes of a task area to the number of all its routes can serve as the internal routing count ratio, another specific indicator of internal routing strength.

In some embodiments, the preset standard includes: the internal routing volume ratio being greater than a first threshold; and/or the internal routing count ratio being greater than a second threshold. That is, the preset standard that a task area should meet may be that one of the internal routing volume ratio and the internal routing count ratio is greater than the corresponding threshold, or that both are greater than the corresponding thresholds at the same time.

It should be understood that when both the internal routing volume ratio and the internal routing count ratio are required to be greater than the corresponding thresholds, each of those thresholds may differ from the threshold used when only one of the two ratios is required to exceed it.
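As a hedged illustration of the two indicators and the preset standard (all names are hypothetical, and the threshold values and the choice of requiring both criteria are assumptions, not the disclosed method):

```python
# Hypothetical sketch of the two internal-routing-strength indicators.
# data_of[route] is the (known, fixed) data volume transmitted by a route.
def routing_strength(internal, external, data_of):
    all_routes = internal + external
    count_ratio = len(internal) / len(all_routes)
    volume_ratio = (sum(data_of[r] for r in internal)
                    / sum(data_of[r] for r in all_routes))
    return count_ratio, volume_ratio

def meets_preset(count_ratio, volume_ratio, first_threshold, second_threshold):
    # Here both criteria are required; an implementation may require only one.
    return volume_ratio > first_threshold and count_ratio > second_threshold

internal = ["r0", "r1"]            # both ends inside the task area
external = ["r2"]                  # one end outside the task area
data = {"r0": 40, "r1": 40, "r2": 20}
cr, vr = routing_strength(internal, external, data)
# cr = 2/3 (count ratio), vr = 80/100 = 0.8 (volume ratio)
```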
In some embodiments, at least some of the core groups include only one processing core. Referring to FIG. 7, a core group (core group 2) may include only one processing core (processing core 2), so that all the tasks of its corresponding task area (task area 1) are necessarily mapped to that processing core. The data transmission between tasks of different layers within the task area (internal interaction, from the task area's perspective) is then entirely intra-core data transmission, which minimizes inter-core routing. Of course, it is also feasible for all core groups to include only one processing core each.
In some embodiments, at least some of the core groups include multiple processing cores. In the second mapping relationship, each task area corresponding to a core group that includes multiple processing cores is divided into multiple second task blocks, the number of second task blocks of a task area is equal to the number of processing cores included in the corresponding core group, and the second task blocks of each task area are respectively mapped to the processing cores of the corresponding core group.

As one approach of the embodiments of the present disclosure, referring to FIG. 7, a core group (core group 0, core group 1) may also include multiple processing cores (processing core 0, processing core 1). The corresponding task area (task area 0, task area 2) then needs to be "partitioned into blocks" first, each resulting second task block including multiple tasks, and each second task block is then assigned to one processing core of the corresponding core group (the number of second task blocks is therefore necessarily equal to the number of processing cores of the core group).

Obviously, the hardware resources (e.g., caches) of each processing core are fixed, so not every processing core is necessarily able to process all tasks of an entire task area. Therefore, multiple processing cores may be combined into one core group to jointly process one task area, based on comprehensive consideration of hardware resources, computing load balancing, and the like.

Referring to FIG. 7, in order to simplify the routing structure between different processing cores of the same core group, tasks from the same layer may be placed in the same second task block as far as possible, so that tasks of the same layer subsequently reside in one processing core as far as possible. In this way, among the multiple processing cores of a core group, inter-core routes are mainly established between processing cores corresponding to different layers, rather than forming a very complex "grid-like" routing. Of course, it is also feasible for all core groups to include multiple processing cores each.
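A minimal sketch of this kind of block partitioning, assuming the tasks of a task area are already grouped by layer. The round-robin assignment of whole layers and all names are hypothetical assumptions, not an optimized partitioner:

```python
# Hypothetical sketch: split a task area into as many second task blocks as
# the core group has processing cores, keeping tasks of the same layer
# together (each whole layer lands in exactly one block).
def split_into_blocks(tasks_by_layer, num_cores):
    blocks = [[] for _ in range(num_cores)]
    for i, (layer, tasks) in enumerate(sorted(tasks_by_layer.items())):
        blocks[i % num_cores].extend(tasks)   # whole layer goes to one block
    return blocks

area = {0: ["a0", "a1"], 1: ["b0"], 2: ["c0", "c1"]}   # layer -> tasks
blocks = split_into_blocks(area, 2)
# blocks[0] holds layers 0 and 2; blocks[1] holds layer 1.
```

With same-layer tasks kept together, inter-core routes inside the core group arise mainly between blocks holding different layers, as described above.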
In some embodiments, in at least some of the core groups that include multiple processing cores, the distance between any two processing cores is less than a preset distance. For example, in a core group including multiple processing cores (e.g., core group 0 or core group 1), the distance between the corresponding processing cores (e.g., processing core 0 and processing core 1) may be less than the preset distance. Since the transmission efficiency of inter-core routing is also related to the distance between processing cores, grouping processing cores that are "relatively close" into one core group can reduce the time consumed by data transmission within the core group.

The specific form of the above "distance between processing cores" may vary: for example, it may be the straight-line physical distance between the processing cores, the total length of the inter-core routes connecting the processing cores, or the number of other processing cores between them (i.e., the number of routing hops), which will not be described in detail here.
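As a hedged sketch of the hop-count variant of this distance, assuming cores are arranged on a 2D mesh and addressed by (x, y) coordinates; the mesh layout, the strict-inequality check, and all names are illustrative assumptions:

```python
# Hypothetical sketch: "distance between processing cores" as routing hop
# count on a 2D mesh (Manhattan distance between core coordinates).
def hop_distance(core_a, core_b):
    (xa, ya), (xb, yb) = core_a, core_b
    return abs(xa - xb) + abs(ya - yb)

def valid_core_group(cores, preset_distance):
    # every pair of cores in the group must be closer than the preset distance
    return all(hop_distance(a, b) < preset_distance
               for i, a in enumerate(cores) for b in cores[i + 1:])

group = [(0, 0), (0, 1), (1, 0)]   # three neighboring cores on the mesh
# the largest pairwise hop distance in this group is 2
```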
In some embodiments, at least some of the processing cores belong to multiple core groups at the same time. Referring to FIG. 7, the processing cores included in different core groups may "overlap" (e.g., core group 0 and core group 1 both include processing core 0 and processing core 1), so that tasks of different task areas (including second task blocks) may be assigned to the same processing core, making fuller use of hardware resources and better balancing the computing load.

The processing cores included in different core groups may be exactly the same (which may also be regarded as multiple core groups being "merged"), or the processing cores included in different core groups may "partially overlap", which will not be described in detail here.
In some embodiments, between dividing the computation graph into multiple task areas (S503) and determining the second mapping relationship between the tasks of the task areas and the processing cores of the many-core system (S505), the method further includes: expanding at least some of the task areas, where the expansion includes adding redundant tasks to a task area.

Referring to FIG. 5 and FIG. 7, after the multiple task areas are obtained by division, each task area may also be "expanded", or "decompressed", that is, some tasks that were not originally present (redundant tasks) are "added" to the task area; the decompressed task area is then mapped to the corresponding core group (including mapping after block partitioning), so that the processing cores also include the above "redundant tasks", thereby improving the redundancy performance.
The specific "expansion amount" of each layer may be preset. For example, a redundancy coefficient b may be set to represent the ratio of the operation amount of the expanded tasks to the operation amount of the original tasks: b = 0 is equivalent to no expansion, and b = 1 means the operation amount of the expanded tasks equals that of the original tasks. Generally, b may be greater than 0 and less than or equal to 1 (b greater than 1 is also feasible, e.g., a task may be backed up in "multiple copies").

Alternatively, each layer may be expanded according to a determined method, in which case the actual task amount obtained by expanding according to that method prevails.
In some embodiments, the redundant tasks include at least one of the following:

(1) Backup tasks, which are identical to tasks in the corresponding task area. As one approach of the embodiments of the present disclosure, each expanded task may be an original task of its corresponding task area, so that the corresponding task actually exists in "multiple copies" that can serve as "backups" for one another, improving robustness.

(2) Empty tasks. As one approach of the embodiments of the present disclosure, empty tasks that perform no actual operation (or perform a null operation) may be added.

(3) Invalid tasks. As one approach of the embodiments of the present disclosure, tasks may be added that require operations, but not operations needed by the original computation graph, i.e., invalid tasks. The invalid tasks may be generated randomly, or may be generated by other specific decompression techniques.

The expansion modes of different task areas may be the same or different.
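A minimal sketch of the backup-task variant under the redundancy coefficient b described above. The greedy duplication order, the per-task cost model, and all names are hypothetical assumptions, not the disclosed method:

```python
# Hypothetical sketch: expand a task area with backup (duplicate) tasks so
# that the added operation amount is at most b times the original operation
# amount (redundancy coefficient b).
def expand_with_backups(tasks, cost_of, b):
    budget = b * sum(cost_of[t] for t in tasks)
    backups, used = [], 0
    for t in tasks:                   # duplicate tasks until the budget is spent
        if used + cost_of[t] > budget:
            break
        backups.append(("backup", t))
        used += cost_of[t]
    return tasks + backups

area = ["t0", "t1", "t2"]
cost = {"t0": 1, "t1": 1, "t2": 1}
expanded = expand_with_backups(area, cost, 1.0)   # b = 1: full duplication
# expanded now holds the 3 original tasks plus 3 backup tasks
```

With b = 0 the function returns the area unchanged, matching the "no expansion" case above.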
In some embodiments, between acquiring the computation graph (S501) and dividing the computation graph into multiple task areas (S503), the method further includes:

S502: training the computation graph to improve its redundancy performance.

Referring to FIG. 5 and FIG. 7, before the computation graph is "partitioned", it may also be trained to improve its redundancy performance.
In some embodiments, the above training includes at least one of the following:

(1) Invalidating some tasks in the computation graph, so as to train the computation graph.

As one approach of the embodiments of the present disclosure, a Dropout method may be used to invalidate some tasks in the computation graph (e.g., setting the weights of some nodes of the neural network to 0) and to adjust the other tasks, so that the computation graph can still produce a result usable to some extent when those tasks are invalid, improving its robustness.

The invalidated tasks may be located in one contiguous region of the computation graph (not necessarily a task area); that is, the training may specifically be Dropblock training.
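A hedged sketch of the two invalidation patterns (random Dropout versus contiguous Dropblock) on a flat list of per-task weights. All names are hypothetical, and a real training step would also adjust the remaining tasks afterwards:

```python
import random

# Hypothetical sketch: invalidate tasks by zeroing their weights, either at
# random (Dropout-style) or as one contiguous block (Dropblock-style).
def drop_tasks(weights, drop_ratio, rng, block=False):
    n = len(weights)
    k = int(n * drop_ratio)
    if block:                         # contiguous region of the graph
        start = rng.randrange(n - k + 1)
        dropped = set(range(start, start + k))
    else:                             # independently chosen tasks
        dropped = set(rng.sample(range(n), k))
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

rng = random.Random(0)
w = [0.5, 0.2, 0.9, 0.1, 0.7]
w_block = drop_tasks(w, 0.4, rng, block=True)   # zeroes 2 adjacent weights
```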
(2) Training the computation graph by an adversarial-example defense method.

As one approach of the embodiments of the present disclosure, the computation graph may be trained by the "adversarial-example defense" method.
In some embodiments, between dividing the computation graph into multiple task areas (S503) and determining the second mapping relationship between the tasks of the task areas and the processing cores of the many-core system (S505), the method further includes:

S504: invalidating some of the task areas, so as to train the task areas and improve the redundancy performance of the computation graph.

Referring to FIG. 5 and FIG. 7, after the "partitioning", one or more (but not all) of the task areas may also be invalidated (that is, all tasks in them invalidated), and the tasks in the other task areas adjusted, so that the remaining task areas can still produce a result usable to some extent (that is, the task areas are trained), improving robustness.

Of course, the above training is also, in essence, training of the "computation graph".

The above training is performed after the "partitioning" (specifically, either after or before the expansion), which amounts to improving robustness at the "area level".
In some embodiments, the above training includes at least one of the following:

(1) Randomly invalidating some of the task areas, so as to train the task areas.

As one approach of the embodiments of the present disclosure, some task areas may be invalidated at random, so as to train the remaining task areas.

(2) Determining a key task area that includes a key task, and invalidating the key task area, so as to train the task areas.

As one approach of the embodiments of the present disclosure, the "key task" that plays a key role may be determined according to the structural features of the computation graph, and the key task area where the key task is located may be invalidated, so as to train the task areas.
In some embodiments, after determining the second mapping relationship between the tasks of the task areas and the processing cores of the many-core system (S505), the method further includes:

S506: invalidating all tasks mapped to some of the processing cores, so as to train the task areas and improve the redundancy performance of the computation graph.

Referring to FIG. 5 and FIG. 7, after the second mapping relationship is determined (including after the actual mapping has been performed), all tasks mapped to some of the processing cores may also be invalidated (equivalent to invalidating those processing cores), and the other tasks adjusted, so that the remaining tasks can still produce a result usable to some extent (that is, the task areas are trained), improving robustness.

Of course, the above training is also, in essence, training of the "computation graph".

The above training amounts to simulating the situation in which some processing cores are invalid due to faults or the like, so it can improve robustness at the "core level" from the perspective of the final practical application.
In some embodiments, the above training includes at least one of the following:

(1) Randomly invalidating all tasks mapped to some of the processing cores, so as to train the task areas.

As one approach of the embodiments of the present disclosure, all tasks mapped to one or more (but not all) of the processing cores may be invalidated (that is, one or more processing cores are invalidated), so as to train the task areas.

(2) Invalidating, in sequence, all tasks mapped to each processing core, so as to train the task areas.

As one approach of the embodiments of the present disclosure, the processing cores may be invalidated in sequence, that is, only one processing core is invalidated at a time, until every processing core has been invalidated once.

(3) Determining a key task, and invalidating all tasks mapped to the processing core to which the key task is mapped, so as to train the task areas.

As one approach of the embodiments of the present disclosure, the key task may be determined according to the structure of the computation graph, and the processing core where the key task is located may then be invalidated, so as to train the task areas.
In some embodiments, referring to FIG. 5, after determining the second mapping relationship between the tasks of the task areas and the processing cores of the many-core system (S505), the method further includes:

S507: mapping the tasks of the task areas to the processing cores of the many-core system according to the second mapping relationship.

S508: processing, by each processing core, the tasks mapped to it.
After the second mapping relationship is determined, the tasks in each task area may further be mapped (that is, allocated) to the processing cores of the corresponding core group according to the second mapping relationship and processed by those processing cores, so as to realize the actual function of the computation graph and solve the above problem to be processed.

Of course, the step of determining the second mapping relationship (S505) and the step of performing the mapping (S507) may in practice be combined, that is, the mapping may be performed directly.

Of course, when the training of step S506 is to be performed, referring to FIG. 5, step S506 may be performed before step S507; in this case, the tasks are not actually mapped to the processing cores, and only the tasks corresponding to a processing core are invalidated according to the second mapping relationship, so as to perform the training.

Alternatively, step S506 may be performed after step S507; in this case, the tasks are actually mapped to the processing cores, and a processing core is actually invalidated, so as to perform the training.
In the embodiments of the present disclosure, all computation graphs to be trained are necessarily trainable computation graphs (e.g., neural networks). Each kind of training may be performed only once, or repeated multiple times. When training is performed multiple times, the specific manner adopted in each round may be the same or different, and the training may end when a preset end criterion is reached; the end criterion may include convergence of the computation graph, reaching a predetermined number of training rounds, reaching a predetermined redundancy performance, and the like.
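The repeated-training control flow described above can be sketched as follows. The sketch is illustrative only: the convergence check, round limit, and redundancy target are modeled as caller-supplied callables, and none of the names come from the disclosure.

```python
def train_until_done(train_step, max_rounds=100, converged=None, redundancy_ok=None):
    """Run training rounds until any preset end criterion is met:
    convergence of the computation graph, a predetermined number of rounds,
    or a predetermined redundancy performance (all criteria are optional)."""
    for round_no in range(1, max_rounds + 1):
        loss = train_step(round_no)  # one training pass; each round may differ
        if converged is not None and converged(loss):
            return round_no, "converged"
        if redundancy_ok is not None and redundancy_ok():
            return round_no, "redundancy reached"
    return max_rounds, "round limit"
```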
In the embodiments of the present disclosure, a "strong internal routing region" of the computation graph, i.e., a region whose internal connections are relatively close, is divided into one task area, so the tasks in each task area require little "external interaction"; and since each task area is mapped to one core group, little "cross-group" data transmission is needed between different core groups. Therefore, less inter-core (or inter-group) routing is required in the embodiments of the present disclosure, which can simplify the inter-core (or inter-group) routing structure and reduce the amount of data transmitted over inter-core (or inter-group) routes, thereby improving the performance (e.g., efficiency) of the many-core system.
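The "internal routing strength" that qualifies a region as a strong internal routing region can be computed as the share of a region's routes, by count and/or by data volume, that stay inside the region, as defined later in claim 9. A minimal sketch, assuming routes are modeled as (source task, destination task, data volume) triples and with arbitrary example thresholds:

```python
def internal_routing_strength(region_tasks, routes):
    """routes: iterable of (src_task, dst_task, data_volume) triples.
    Returns (count_ratio, volume_ratio) over all routes involving the region."""
    region = set(region_tasks)
    involved = [r for r in routes if r[0] in region or r[1] in region]
    if not involved:
        return 0.0, 0.0
    # Internal routes have both endpoints inside the task area.
    internal = [r for r in involved if r[0] in region and r[1] in region]
    count_ratio = len(internal) / len(involved)
    volume_ratio = sum(r[2] for r in internal) / sum(r[2] for r in involved)
    return count_ratio, volume_ratio

def is_strong_internal_region(region_tasks, routes, count_thresh=0.5, vol_thresh=0.5):
    # Preset standard: either ratio exceeding its threshold (threshold values
    # are illustrative; the disclosure leaves them unspecified).
    c, v = internal_routing_strength(region_tasks, routes)
    return c > count_thresh or v > vol_thresh
```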
In addition, in the embodiments of the present disclosure, each task area includes tasks from multiple different layers, i.e., the tasks of each layer are distributed across different core groups. Therefore, when one processing core fails (e.g., due to a fault), a layer of the computation graph generally only "breaks in part", rather than all tasks of that layer becoming invalid, so that the computation graph as a whole can still produce a processing result that is usable to some extent, which improves the robustness of the computation graph.
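The robustness argument above can be checked with a toy simulation: when each layer's tasks are spread over at least two cores, invalidating one core removes only part of every layer rather than an entire layer. The layer/core layout below is illustrative, not from the disclosure.

```python
def surviving_fraction_per_layer(layer_to_cores, failed_core):
    """layer_to_cores: {layer: {core: task_count}}. Returns the fraction of
    each layer's tasks that survive when a single core is invalidated."""
    out = {}
    for layer, cores in layer_to_cores.items():
        total = sum(cores.values())
        lost = cores.get(failed_core, 0)  # only this core's share is lost
        out[layer] = (total - lost) / total
    return out
```

With every layer mapped to at least two cores, no layer's surviving fraction drops to zero after a single-core fault, matching the partial-degradation behavior described above.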
In a second aspect, referring to FIG. 8, an embodiment of the present disclosure provides a task processing apparatus 800, including:
an obtaining module 801, configured to obtain a computation graph of a problem to be processed, the computation graph including a plurality of sequentially arranged layers, each layer including a plurality of tasks, where a task in any layer is not performed based on the results of tasks in its own layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in preceding layers;
a block-dividing module 802, configured to divide each layer of the computation graph into a plurality of task blocks, each task block including at least one task; and
a mapping module 803, configured to determine a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system, where, according to the first mapping relationship, each task block is mapped to one processing core, a plurality of task blocks are mapped to each processing core, and all the task blocks of any one layer are mapped to at least two different processing cores.
In some embodiments, the task processing apparatus 800 further includes:
a training module 804, configured to train the computation graph based on the first mapping relationship to obtain a trained computation graph, where the computation graph includes a plurality of sequentially arranged layers, each layer includes a plurality of tasks, a task in any layer is not performed based on the results of tasks in its own layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in preceding layers; and
a partitioning module 805, configured to divide the computation graph into a plurality of task areas, where each task area includes a plurality of tasks from at least two different layers, the tasks of each layer are located in at least two task areas, and the internal routing strength of each task area exceeds a preset standard, the internal routing strength of each task area being determined according to the proportion of the routes between tasks in the task area to all the routes involving tasks in the task area.
The mapping module 803 is further configured to determine a second mapping relationship between the tasks of each task area and the processing cores of the many-core system, where, in the second mapping relationship, each task area corresponds to one core group, each core group includes one or more processing cores, and the tasks of each task area are mapped to the processing cores of its corresponding core group.
The task processing apparatus 800 of the embodiments of the present disclosure can implement the task processing methods described above.
It should be understood that, when the task processing method further includes other steps, the task processing apparatus 800 may also include other modules for implementing the corresponding steps.
In a third aspect, referring to FIG. 9, an embodiment of the present disclosure provides a many-core system 900, including:
a plurality of processing cores 901; and
a network-on-chip 902, configured to exchange data among the plurality of processing cores 901 and to exchange external data;
where one or more instructions are stored in one or more of the processing cores 901, and the one or more instructions are executed by the one or more processing cores 901, so that the one or more processing cores 901 can perform any one of the task processing methods described above.
The many-core system 900 of the embodiments of the present disclosure can implement the task processing methods described above, including actually processing the computation graph and/or the tasks in the computation graph, so as to obtain a processing result of the problem to be processed.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor/processing core, implements the task processing method described above. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.
In a fifth aspect, an embodiment of the present disclosure further provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, where, when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device performs the task processing method described above.
Those of ordinary skill in the art will appreciate that all or some of the steps in the methods disclosed above, and the functional modules/units in the systems and apparatuses, may be implemented as software, firmware, hardware, or appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only, and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless otherwise expressly indicated, features, characteristics, and/or elements described in connection with a particular embodiment may be used alone, or in combination with features, characteristics, and/or elements described in connection with other embodiments. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Claims (18)

  1. A task processing method, comprising:
    obtaining a computation graph of a problem to be processed, wherein the computation graph comprises a plurality of layers, each layer comprises a plurality of tasks, a task in any layer is not executed based on the results of tasks in its own layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in preceding layers;
    dividing the computation graph into a plurality of task groups, wherein each task group comprises the tasks of at least one layer; and
    determining a mapping relationship between the tasks of each task group and processing cores of a many-core system, wherein, in the mapping relationship, each task group corresponds to one core group, and each core group comprises at least one processing core.
  2. The task processing method according to claim 1, wherein the task groups comprise task blocks, and each layer of the computation graph is divided into a plurality of task blocks; in the mapping relationship between the task blocks and the processing cores of the many-core system, each task block is mapped to one processing core, a plurality of task blocks are mapped to each processing core, and all the task blocks of any one layer are mapped to at least two different processing cores;
    the dividing the computation graph into a plurality of task groups comprises: dividing each layer of the computation graph into a plurality of task blocks; and
    the determining a mapping relationship between the tasks of each task group and processing cores of a many-core system comprises: determining a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system.
  3. The task processing method according to claim 2, further comprising at least one of the following steps:
    before the dividing each layer of the computation graph into a plurality of task blocks, training the computation graph to improve the redundancy performance of the computation graph;
    before the determining a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system, invalidating some of the task blocks to train the computation graph and improve the redundancy performance of the computation graph; and
    after the determining a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system, invalidating all the task blocks mapped to some of the processing cores to train the computation graph and improve the redundancy performance of the computation graph.
  4. The task processing method according to claim 2, wherein the dividing each layer of the computation graph into a plurality of task blocks comprises:
    expanding the computation graph, and dividing each layer of the expanded computation graph into a plurality of task blocks, wherein the expanding comprises adding redundant tasks to at least some layers of the computation graph.
  5. The task processing method according to claim 2, wherein, in the first mapping relationship, any two task blocks of any one layer are mapped to two different processing cores.
  6. The task processing method according to any one of claims 1 to 5, wherein the computation graph is a trainable computation graph, the trainable computation graph being capable of solving the same problem to be processed even when at least some of its tasks differ.
  7. The task processing method according to claim 1, wherein the task groups comprise task areas, and the computation graph is divided into a plurality of task areas; each task area comprises a plurality of tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, the internal routing strength of each task area being determined according to the proportion of the routes between tasks in the task area to all the routes involving tasks in the task area;
    a second mapping relationship between the tasks of each task area and the processing cores of the many-core system is determined; in the second mapping relationship, each task area corresponds to one core group, each core group comprises one or more processing cores, and the tasks of each task area are mapped to the processing cores of its corresponding core group;
    the dividing the computation graph into a plurality of task groups comprises: dividing the computation graph into a plurality of task areas; and
    the determining a mapping relationship between the tasks of each task group and processing cores of a many-core system comprises: determining the second mapping relationship between the tasks of each task area and the processing cores of the many-core system.
  8. The task processing method according to claim 7, wherein at least some of the task areas comprise tasks from every layer of the computation graph.
  9. The task processing method according to claim 7, wherein the internal routing strength comprises an internal route volume ratio and/or an internal route count ratio; the internal route volume ratio of each task area is the proportion of the amount of data transmitted over the routes between tasks in the task area to the amount of data transmitted over all the routes involving tasks in the task area; the internal route count ratio of each task area is the proportion of the number of routes between tasks in the task area to the number of all the routes involving tasks in the task area; and
    the preset standard comprises: the internal route volume ratio being greater than a first threshold, and/or the internal route count ratio being greater than a second threshold.
  10. The task processing method according to claim 7, wherein at least some of the core groups comprise only one processing core, and/or at least some of the core groups comprise a plurality of processing cores; in the second mapping relationship, each task area corresponding to a core group comprising a plurality of processing cores is divided into a plurality of second task blocks, the number of second task blocks in each task area is equal to the number of processing cores comprised in the core group corresponding to that task area, and the second task blocks of each task area are respectively mapped to the processing cores of the core group corresponding to that task area.
  11. The task processing method according to claim 7, wherein at least some of the processing cores belong to a plurality of core groups at the same time.
  12. The task processing method according to claim 7, further comprising at least one of the following steps:
    before the dividing the computation graph into a plurality of task areas, training the computation graph to improve the redundancy performance of the computation graph;
    before the determining the second mapping relationship between the tasks of each task area and the processing cores of the many-core system, invalidating some of the task areas to train the task areas and improve the redundancy performance of the computation graph; and
    after the determining the second mapping relationship between the tasks of each task area and the processing cores of the many-core system, invalidating all the tasks mapped to some of the processing cores to train the task areas and improve the redundancy performance of the computation graph.
  13. The task processing method according to claim 1, wherein the computation graph is a neural network.
  14. The task processing method according to any one of claims 2 to 5, further comprising, after the determining a first mapping relationship between the task blocks and a plurality of processing cores of the many-core system:
    mapping the task blocks to the plurality of processing cores according to the first mapping relationship; and
    each processing core processing the tasks in the task blocks mapped to it.
  15. A task processing apparatus, comprising:
    an obtaining module, configured to obtain a computation graph of a problem to be processed, wherein the computation graph comprises a plurality of layers, each layer comprises a plurality of tasks, a task in any layer is not executed based on the results of tasks in its own layer or in subsequent layers, and at least some tasks in at least some layers are executed based on the results of tasks in preceding layers;
    a block-dividing module, configured to divide the computation graph into a plurality of task groups, wherein each task group comprises the tasks of at least one layer; and
    a mapping module, configured to determine a mapping relationship between the tasks of each task group and processing cores of a many-core system, wherein, in the mapping relationship, each task group corresponds to one core group, and each core group comprises at least one processing core.
  16. A many-core system, comprising:
    a plurality of processing cores; and
    a network-on-chip, configured to exchange data among the plurality of processing cores and to exchange external data, wherein one or more instructions are stored in one or more of the processing cores, and the one or more instructions are executed by the one or more processing cores, so that the one or more processing cores can perform the task processing method according to any one of claims 1 to 14.
  17. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processing core, implements the task processing method according to any one of claims 1 to 14.
  18. A computer program product, comprising a computer program which, when executed by a processor, implements the task processing method according to any one of claims 1 to 14.
PCT/CN2022/074490 2021-02-10 2022-01-28 Task processing method and apparatus, many-core system, and computer-readable medium WO2022171002A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110184918.6 2021-02-10
CN202110184918.6A CN112835718A (en) 2021-02-10 2021-02-10 Method and device for processing task, many-core system and computer readable medium
CN202110184939.8 2021-02-10
CN202110184939.8A CN112835719B (en) 2021-02-10 2021-02-10 Method and device for task processing, many-core system and computer readable medium

Publications (1)

Publication Number Publication Date
WO2022171002A1 true WO2022171002A1 (en) 2022-08-18

Family

ID=82838259

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074490 WO2022171002A1 (en) 2021-02-10 2022-01-28 Task processing method and apparatus, many-core system, and computer-readable medium

Country Status (1)

Country Link
WO (1) WO2022171002A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929725A (en) * 2012-11-12 2013-02-13 中国人民解放军海军工程大学 Dynamic reconfiguration method of signal processing parallel computing software
CN110097611A (en) * 2019-04-28 2019-08-06 上海联影智能医疗科技有限公司 Image rebuilding method, device, equipment and storage medium
US20200042856A1 (en) * 2018-07-31 2020-02-06 International Business Machines Corporation Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit
US20200160182A1 (en) * 2018-05-31 2020-05-21 Neuralmagic Inc. System and method of executing neural networks
CN111723900A (en) * 2019-03-18 2020-09-29 北京灵汐科技有限公司 Mapping method of neural network based on many-core processor and computing device
CN111950699A (en) * 2020-07-03 2020-11-17 清华大学深圳国际研究生院 Neural network regularization method based on characteristic space correlation
CN112084866A (en) * 2020-08-07 2020-12-15 浙江工业大学 Target detection method based on improved YOLO v4 algorithm
CN112348828A (en) * 2020-10-27 2021-02-09 浙江大华技术股份有限公司 Example segmentation method and device based on neural network and storage medium
CN112835719A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium
CN112835718A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22752163

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22752163

Country of ref document: EP

Kind code of ref document: A1