CN112835719B - Method and device for task processing, many-core system and computer readable medium - Google Patents

Info

Publication number
CN112835719B
Authority
CN
China
Prior art keywords
task
tasks
processing
core
task area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110184939.8A
Other languages
Chinese (zh)
Other versions
CN112835719A (en)
Inventor
Luping Shi (施路平)
Weihao Zhang (张伟豪)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN202110184939.8A priority Critical patent/CN112835719B/en
Publication of CN112835719A publication Critical patent/CN112835719A/en
Priority to PCT/CN2022/074490 priority patent/WO2022171002A1/en
Application granted granted Critical
Publication of CN112835719B publication Critical patent/CN112835719B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Multi Processors (AREA)

Abstract

The present disclosure provides a method of task processing, comprising: acquiring a computational graph of a problem to be processed; dividing the computational graph into a plurality of task areas, where each task area comprises a plurality of tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, where the internal routing strength of a task area is determined by the proportion of routes between tasks within the task area relative to all routes involving tasks in the task area; and determining the mapping relationship between the tasks of each task area and the processing cores of the many-core system, where, according to the mapping relationship, each task area corresponds to one core group, each core group comprises one or more processing cores, and the tasks of each task area are mapped into the processing cores of the corresponding core group. The disclosure also provides a task processing apparatus, an electronic device, and a computer readable medium.

Description

Method and device for task processing, many-core system and computer readable medium
Technical Field
The present disclosure relates to the field of many-core technology, and in particular, to a method and apparatus for task processing, a many-core system, and a computer readable medium.
Background
Solving a problem by electronic computation essentially means processing a plurality of tasks (or operations) corresponding to that problem. Some of these tasks must be performed based on the results of other tasks, i.e., some operations consume the results of other operations, so the tasks have a "sequential" relationship.
Such problems can be handled by a many-core system. A many-core system comprises a plurality of processing cores (also called cores or processing engines) that can interact; the plurality of tasks corresponding to the problem to be processed can be mapped (allocated) to different processing cores and processed by them respectively.
However, when a subsequent task and its preceding task (i.e., the task on whose result it is based) are mapped into different processing cores, the result of the preceding task must be transferred across cores, from the processing core holding the preceding task to the processing core holding the subsequent task. A large number of inter-core routes are therefore required between the processing cores, so the inter-core routing structure is complex and the volume of data transmitted over inter-core routes is large; since the transmission efficiency of inter-core routes is far lower than that of data transfer within a core, the complex inter-core routing degrades the performance of the many-core system.
Disclosure of Invention
Embodiments of the present disclosure provide a method and apparatus for task processing, a many-core system, and a computer readable medium.
In a first aspect, an embodiment of the present disclosure provides a method for task processing, including:
acquiring a computational graph of a problem to be processed; the computational graph comprises a plurality of layers arranged in sequence, each layer comprises a plurality of tasks, the tasks in any layer are not performed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in preceding layers;
dividing the computational graph into a plurality of task areas; each task area comprises a plurality of tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, and the internal routing strength of a task area is determined by the proportion of routes between tasks within the task area relative to all routes involving tasks in the task area;
determining the mapping relationship between the tasks of each task area and the processing cores of the many-core system; according to the mapping relationship, each task area corresponds to one core group, each core group comprises one or more processing cores, and the tasks of each task area are mapped into the processing cores of the corresponding core group.
In some embodiments, at least some of the task areas include tasks from every layer.
In some embodiments, the internal routing strength comprises:
an internal routing amount ratio, where the internal routing amount ratio of each task area is the proportion of the data volume transmitted by routes between tasks within the task area relative to the data volume transmitted by all routes involving tasks in the task area;
and/or,
an internal route number ratio, where the internal route number ratio of each task area is the proportion of the number of routes between tasks within the task area relative to the number of all routes involving tasks in the task area.
In some embodiments, the preset criteria include:
the internal routing amount ratio is greater than a first threshold;
and/or,
the internal route number ratio is greater than a second threshold.
In some embodiments, at least some of the core groups include only one processing core.
In some embodiments, at least some of the core groups include a plurality of processing cores;
according to the mapping relationship, each task area corresponding to a core group comprising a plurality of processing cores is divided into a plurality of task blocks; the number of task blocks of each such task area equals the number of processing cores in the corresponding core group, and each task block of the task area is mapped to a respective processing core of that core group.
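The block-splitting arrangement above can be sketched in a few lines; the round-robin split and all task names below are illustrative assumptions, since the disclosure only requires that the number of blocks equal the number of cores in the core group:

```python
# Hypothetical sketch of splitting one task area into task blocks.
# The round-robin policy is an invented example, not the patent's
# prescribed method.
def split_into_blocks(task_names, n_cores):
    """Partition a task area's tasks into n_cores task blocks (round-robin)."""
    blocks = [[] for _ in range(n_cores)]
    for i, name in enumerate(task_names):
        blocks[i % n_cores].append(name)
    return blocks

# Task block i is then mapped to processing core i of the corresponding core group.
blocks = split_into_blocks(["t0", "t1", "t2", "t3", "t4"], 2)
assert blocks == [["t0", "t2", "t4"], ["t1", "t3"]]
```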
In some embodiments, in at least some of the core groups comprising a plurality of processing cores, the distance between any two processing cores is less than a preset distance.
In some embodiments, at least some of the processing cores belong to multiple core groups simultaneously.
In some embodiments, between dividing the computational graph into a plurality of task areas and determining the mapping relationship between the tasks of each task area and the processing cores of the many-core system, the method further comprises:
expanding at least part of the task areas, where the expanding includes adding redundant tasks to the task areas.
In some embodiments, between acquiring the computational graph of the problem to be processed and dividing the computational graph into a plurality of task areas, the method further comprises:
training the computational graph to improve its redundancy performance.
In some embodiments, between dividing the computational graph into a plurality of task areas and determining the mapping relationship between the tasks of each task area and the processing cores of the many-core system, the method further comprises:
invalidating part of the task areas, so as to train each task area and improve the redundancy performance of the computational graph.
In some embodiments, after determining the mapping relationship between the tasks of each task area and the processing cores of the many-core system, the method further comprises:
invalidating part of all the tasks mapped into the processing cores, so as to train each task area and improve the redundancy performance of the computational graph.
In some embodiments, the computational graph is a trainable computational graph; a trainable computational graph can solve the same problem to be processed even when at least some of its tasks are different.
In some embodiments, the computational graph is a neural network.
In some embodiments, after determining the mapping relationship between the tasks of each task area and the processing cores of the many-core system, the method further includes:
according to the mapping relationship, mapping the tasks of each task area into the processing cores of the many-core system;
each processing core processing the tasks mapped into it.
In a second aspect, an embodiment of the present disclosure provides an apparatus for task processing, including:
an acquisition module configured to acquire a computational graph of a problem to be processed; the computational graph comprises a plurality of layers arranged in sequence, each layer comprises a plurality of tasks, the tasks in any layer are not performed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in preceding layers;
a partitioning module configured to divide the computational graph into a plurality of task areas; each task area comprises a plurality of tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, and the internal routing strength of a task area is determined by the proportion of routes between tasks within the task area relative to all routes involving tasks in the task area;
a mapping module configured to determine the mapping relationship between the tasks of each task area and the processing cores of the many-core system; according to the mapping relationship, each task area corresponds to one core group, each core group comprises one or more processing cores, and the tasks of each task area are mapped into the processing cores of the corresponding core group.
In a third aspect, embodiments of the present disclosure provide a many-core system comprising:
a plurality of processing cores; and
a network on chip configured to exchange data among the plurality of processing cores and with the outside;
one or more of the processing cores store one or more instructions, which are executed by one or more of the processing cores to enable one or more of the processing cores to perform any of the task processing methods described above.
In a fourth aspect, embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, where the computer program, when executed by a processing core, implements any of the task processing methods described above.
In the embodiments of the present disclosure, each "strong-internal-routing region" of the computational graph, i.e., a region whose internal connections are relatively dense, is divided into one task area, so that the tasks of each task area require little "external interaction"; and because each task area is mapped into one core group, little "cross-group" data transmission is needed between different core groups. Thus, fewer inter-core (inter-group) routes are required in the embodiments of the present disclosure, which can simplify the inter-core (inter-group) routing structure, reduce the amount of data transferred by inter-core (inter-group) routes, and improve the performance (e.g., efficiency) of the many-core system.
In addition, in the embodiments of the present disclosure, each task area includes tasks from multiple different layers, i.e., the tasks of each layer are distributed over different core groups. Therefore, when one processing core fails (e.g., due to a fault), a layer of the computational graph generally only loses part of its tasks rather than all of them, so the computational graph as a whole can still produce a processing result that is usable to a certain extent, improving the robustness of the computational graph.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. The above and other features and advantages will become more readily apparent to those skilled in the art by describing in detail exemplary embodiments with reference to the attached drawings, in which:
FIG. 1 is a flow chart of a method of task processing provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of another method of task processing provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a partition of a computation graph and a routing relationship of tasks therein in a method for task processing according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a process in which a computational graph is processed in a method for task processing according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an apparatus for task processing provided by an embodiment of the present disclosure;
FIG. 6 is a block diagram of a many-core system provided by an embodiment of the present disclosure;
fig. 7 is a block diagram of a computer readable medium according to an embodiment of the present disclosure.
Detailed Description
For a better understanding of the technical solutions of the present disclosure, exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding, and they should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiments of the disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The actual work required by many problems, such as image processing and speech recognition, can be expressed in the form of a "computational graph" (also called a task graph or logic graph). That is, all the operations to be performed to solve the problem are divided into a plurality of "tasks" (or nodes); each task comprises certain operations, and a certain order exists between different tasks. For example, if the operation of a task needs the operation results of other tasks, the task is said to be performed based on the results of those tasks; equivalently, the task is a subsequent task of those other tasks, and those other tasks are preceding tasks of it.
Because of the above relationships between tasks, referring to FIG. 3, a computational graph may be divided into multiple "layers". Each layer comprises multiple tasks; the tasks in any one layer are not performed based on the results of tasks in the same layer or in subsequent layers, while at least some tasks in at least some layers are performed based on the results of tasks in preceding layers. That is, if a task in a preceding layer is not completed, a task in a subsequent layer that depends on its result cannot be performed, because the result of the preceding task may be used in the operation of the subsequent task; conversely, whether tasks in subsequent layers can be performed does not affect the preceding layers, since the results of later tasks are never used by earlier ones. Tasks in the same layer have no based-on relationship: if such a relationship existed, the two tasks would belong to two different layers.
For example, in FIG. 3 and FIG. 4, the different layers (layer 0 to layer 2) are indicated by boxes with different fills, and the different task areas (task area 0 to task area 2) are indicated by dashed boxes. In FIG. 3, the circles inside the boxes represent the tasks of each layer; an arrow pointing from one circle to another indicates that the task of the latter circle (the subsequent task) is performed based on the result of the task of the former circle (the preceding task); equivalently, the result of the preceding task must be transmitted to the subsequent task, so a route exists between the two tasks.
Among computational graphs, the "neural network (NN)" is one form. A neural network is divided into multiple layers; each layer comprises a plurality of nodes; each node performs certain operations; and nodes of different layers are connected according to certain relationships (for example, the output of a node serves as an input of a node in the next layer). Thus, each layer of the neural network may be regarded as a layer of the computational graph, each node of the neural network may be regarded as a task of the computational graph, and connected nodes represent tasks with a based-on relationship.
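As an illustration only, a layered computational graph of the kind described above can be represented as a small data structure. The task names, layer indices, and data volumes below are invented for the example and are not part of the disclosure:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One node of the computational graph: an operation in a given layer."""
    name: str
    layer: int

@dataclass
class ComputationalGraph:
    tasks: list = field(default_factory=list)
    # A route (u, v, volume) means task u's result is sent to task v,
    # carrying `volume` units of data per run.
    routes: list = field(default_factory=list)

    def check_layering(self):
        """Verify that every route goes from an earlier layer to a later one."""
        by_name = {t.name: t for t in self.tasks}
        return all(by_name[u].layer < by_name[v].layer for u, v, _ in self.routes)

# A toy 3-layer graph in the spirit of FIG. 3 (names and volumes invented).
g = ComputationalGraph()
g.tasks = [Task("a0", 0), Task("a1", 0), Task("b0", 1), Task("b1", 1), Task("c0", 2)]
g.routes = [("a0", "b0", 4), ("a1", "b0", 2), ("a1", "b1", 3),
            ("b0", "c0", 5), ("b1", "c0", 1)]
assert g.check_layering()  # no task depends on its own or a later layer
```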
By way of example, the neural network in the embodiments of the present disclosure may be used for image processing, speech recognition, and the like, and may specifically take the form of a convolutional neural network (CNN), a spiking neural network (SNN), a recurrent neural network (RNN), and so on.
For example, some problems may correspond to a plurality of different computational graphs. That is, the number of tasks, the layers the tasks are in, the relationships between tasks, and the specific operations of each task may differ, yet these different computational graphs can all solve the same problem (though not necessarily equally well).
Computational graphs of this kind are referred to as "trainable computational graphs". That is, the tasks in a computational graph that solves a problem can be adjusted by training, so that the trained computational graph solves the problem with a different effect.
For example, neural networks are one form of trainable computational graph. A neural network that handles a problem (e.g., image classification) is typically trained by adjusting its nodes (e.g., adjusting their weights) according to how well the current network solves the problem (e.g., image classification accuracy), thereby changing the neural network (the computational graph) and improving its effect on the problem (e.g., improving classification accuracy).
In some related art, when a problem is to be handled with a many-core system, the tasks of each layer of its corresponding computational graph are mapped (allocated) into one processing core, with the tasks of different layers mapped into different processing cores.
In this manner, the tasks of different layers are necessarily located in different processing cores, so for every subsequent task performed based on the result of a preceding task, that result must be transmitted from the processing core holding the preceding task to the processing core holding the subsequent task. A large number of inter-core routes are therefore required between the processing cores, making the inter-core routing structure complex and the volume of data transmitted over inter-core routes large; since the transmission efficiency of inter-core routes is far lower than that of data transfer within a core, the complex inter-core routing degrades the performance of the many-core system.
In addition, in this manner, once a processing core of the many-core system fails (e.g., due to a fault), none of the tasks of the corresponding layer of the computational graph can be processed, so none of the tasks in subsequent layers can actually be performed. The whole problem then cannot be solved at all (no processing result can be obtained), and the robustness of the system is poor.
In a first aspect, referring to fig. 2, an embodiment of the present disclosure provides a method for task processing, including:
s101, acquiring a calculation map of the problem to be processed.
The computational graph comprises a plurality of layers arranged in sequence; each layer comprises a plurality of tasks; the tasks in any layer are not performed based on the results of tasks in the same layer or in subsequent layers, and at least some tasks in at least some layers are performed based on the results of tasks in preceding layers.
When a problem (such as image processing or speech recognition) is to be handled by the many-core system, the corresponding computational graph is acquired. A preset computational graph may be obtained, or the computational graph may be generated from the specific problem to be processed according to predetermined rules.
S103, dividing the computational graph into a plurality of task areas.
Each task area includes a plurality of tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, and the internal routing strength of a task area is determined by the proportion of routes between tasks within the task area relative to all routes involving tasks in the task area.
Referring to FIG. 3 and FIG. 4, the computational graph is divided into a plurality of task areas; that is, each task of the computational graph is assigned to a corresponding task area.
In FIG. 3 and FIG. 4, the number of tasks a layer contributes is represented by the vertical size of the filled area corresponding to that layer, and the different processing cores (processing core 0 to processing core 2) and core groups (core group 0 to core group 2) are represented by blank boxes.
The tasks in each task area cannot all come from one layer of the computational graph; they must come from at least two different layers. Meanwhile, all the tasks of one layer cannot be located in one task area; they must be divided among at least two task areas. Thus, each task area contains a number of tasks taken from each of multiple layers of the computational graph.
Referring to FIG. 3, a plurality of "routes" (or data paths) connect tasks in different layers of the computational graph; the result of a preceding task is transmitted through a route to a subsequent task so that the subsequent task can use it. Each task is thus connected to one or more routes (both incoming and outgoing), and these are the routes the task "involves". Among "all the routes" involved by the tasks of a task area (task areas are indicated by dashed boxes in FIG. 3 and FIG. 4), some routes have tasks of the task area at both ends; such routes are the "internal routes" of the task area (solid arrows in FIG. 3), i.e., the routes between tasks within the task area. Other routes connect a task inside the task area at one end and a task outside it (necessarily in another task area) at the other end; such routes are the "external routes" of the task area (dashed arrows in FIG. 3).
According to the proportion of internal routes relative to all routes of each task area, the internal routing strength of the task area can be calculated; the internal routing strength thus represents the relative degree to which the tasks of a task area interact internally.
In the embodiments of the present disclosure, the internal routing strength of each task area obtained by the division exceeds the preset standard, so each task area is necessarily a "strong-internal-routing region": the tasks in each task area mainly perform "internal interaction" and have relatively little "external interaction" with tasks of other task areas.
S105, determining the mapping relationship between the tasks of each task area and the processing cores of the many-core system.
According to the mapping relationship, each task area corresponds to one core group, each core group comprises one or more processing cores, and the tasks of each task area are mapped into the processing cores of the corresponding core group.
Referring to FIG. 4, after the task areas are determined, a corresponding core group (each comprising one or more processing cores of the many-core system) is determined for each task area, and all the tasks of each task area are mapped into the processing cores of its corresponding core group. That is, each task of a task area must be mapped to a processing core of the corresponding core group, and each processing core of a core group must be mapped with at least one task from the task area corresponding to that core group.
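A minimal sketch of this mapping step, assuming the task-area partition and the core groups are already given; the identifiers and the round-robin placement within a group are invented for illustration:

```python
# Hypothetical sketch: map each task area's tasks onto its own core group.
# `areas` maps area id -> task names; `core_groups` maps area id -> core ids.
def map_areas_to_cores(areas, core_groups):
    """Return {task_name: core_id}; tasks of one area stay inside one group."""
    mapping = {}
    for area_id, task_names in areas.items():
        cores = core_groups[area_id]               # the group assigned to this area
        for i, name in enumerate(task_names):
            mapping[name] = cores[i % len(cores)]  # simple round-robin placement
    return mapping

areas = {0: ["a0", "b0"], 1: ["a1", "b1", "c0"]}
core_groups = {0: [0], 1: [1, 2]}                  # group 1 holds two processing cores
m = map_areas_to_cores(areas, core_groups)
# Every task lands on a core of its own area's group:
assert all(m[t] in core_groups[a] for a, ts in areas.items() for t in ts)
```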
In the embodiments of the present disclosure, each "strong-internal-routing region" of the computational graph, i.e., a region whose internal connections are relatively dense, is divided into one task area, so that the tasks of each task area require little "external interaction"; and because each task area is mapped into one core group, little "cross-group" data transmission is needed between different core groups. Thus, fewer inter-core (inter-group) routes are required in the embodiments of the present disclosure, which can simplify the inter-core (inter-group) routing structure, reduce the amount of data transferred by inter-core (inter-group) routes, and improve the performance (e.g., efficiency) of the many-core system.
In addition, in the embodiments of the present disclosure, each task area includes tasks from multiple different layers, i.e., the tasks of each layer are distributed over different core groups. Therefore, when one processing core fails (e.g., due to a fault), a layer of the computational graph generally only loses part of its tasks rather than all of them, so the computational graph as a whole can still produce a processing result that is usable to a certain extent, improving the robustness of the computational graph.
In some embodiments, the computational graph is a trainable computational graph.
Wherein the trainable computational graph is capable of solving the same problem to be processed in situations where at least some of the tasks are different.
In some embodiments, the computational graph is a Neural Network (NN).
As one implementation of the embodiments of the present disclosure, the above computational graph is a trainable computational graph, and further a neural network, such as a convolutional neural network (CNN), a spiking neural network (SNN), or a recurrent neural network (RNN); thus, the redundancy performance of the computational graph can be further improved by training it.
Of course, embodiments of the present disclosure are not limited to use with trainable computational graphs (e.g., neural networks), but may be used with other computational graphs as well.
In some embodiments, at least some of the task areas include tasks from every layer.
Referring to FIG. 3 and FIG. 4, in some embodiments the tasks of a task area may come from all layers; that is, the task area takes at least one task from each layer.
Further, every task area may include tasks from every layer.
In some embodiments, the internal routing strength includes:
an internal routing amount ratio, wherein the internal routing amount ratio of each task area is the proportion of the data amount transmitted by routes between tasks within the task area to the data amount transmitted by all routes related to the tasks in the task area;
and/or,
an internal route number ratio, wherein the internal route number ratio of each task area is the proportion of the number of routes between tasks within the task area to the number of all routes related to the tasks in the task area.
Specifically, the "internal routing amount ratio" and/or the "internal route number ratio" may be used as specific indicators of the above internal routing strength.
As described above, all routes of each task area comprise internal routes and external routes, and the data amount transmitted by each route is fixed during operation of the computation graph and can be known in advance; thus, the proportion of the data amount transmitted by the internal routes of each task area relative to the data amount transmitted by all its routes can be used as one specific indicator of the internal routing strength, namely the internal routing amount ratio.
Likewise, the total routes of each task area include internal routes and external routes, and the number of each kind of route is determined and known; thus, the proportion of the number of internal routes of each task area to the number of all its routes can be used as another specific indicator of the internal routing strength, namely the internal route number ratio.
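As an illustration of the two indicators (not the patented implementation), both ratios can be computed from a list of routes, where each route carries a known data amount; the function name and the route representation below are assumptions:

```python
def internal_route_ratios(routes, area):
    """Compute the internal-routing-amount ratio and the
    internal-route-number ratio for one task area.

    routes: list of (src_task, dst_task, data_amount) tuples; the data
            amount of every route is fixed and known in advance.
    area:   set of task ids belonging to the task area.
    """
    # Routes related to the task area: at least one endpoint inside it.
    related = [r for r in routes if r[0] in area or r[1] in area]
    # Internal routes: both endpoints inside the task area.
    internal = [r for r in related if r[0] in area and r[1] in area]
    if not related:
        return 0.0, 0.0
    amount_ratio = (sum(v for _, _, v in internal)
                    / sum(v for _, _, v in related))
    number_ratio = len(internal) / len(related)
    return amount_ratio, number_ratio
```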
In some embodiments, the preset criteria include:
the internal routing amount ratio is greater than a first threshold;
and/or,
the internal route number ratio is greater than a second threshold.
Specifically, the preset standard that a task area should meet may be that one of the internal routing amount ratio and the internal route number ratio is greater than the corresponding threshold, or that both are greater than their corresponding thresholds.
It should be appreciated that when the internal routing amount ratio and the internal route number ratio are both required to be greater than their respective thresholds, those thresholds may differ from the thresholds used when only one of the two ratios is required to exceed its threshold.
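The preset standard could then be checked as follows; the 0.5 default thresholds and the choice between the "or" and "and" combinations are illustrative, not fixed by the method:

```python
def meets_preset_standard(amount_ratio, number_ratio,
                          first_threshold=0.5, second_threshold=0.5,
                          require_both=False):
    """Decide whether a task area's internal routing strength exceeds
    the preset standard.  With require_both=False, exceeding either
    threshold suffices; with require_both=True, both must be exceeded
    (possibly against different threshold values)."""
    amount_ok = amount_ratio > first_threshold
    number_ok = number_ratio > second_threshold
    return (amount_ok and number_ok) if require_both else (amount_ok or number_ok)
```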
In some embodiments, at least a portion of the core group includes only one processing core.
Referring to fig. 4, a core group (core group 2) may include only one processing core (processing core 2), so the tasks of its corresponding task area (task area 1) must all be mapped into that processing core. Therefore, the data transmission between tasks of different layers within the task area (the internal interaction of the task area) is necessarily all intra-core data transmission, which reduces inter-core routing to the greatest extent.
Of course, it is also possible for all core groups to include only one processing core each.
In some embodiments, at least a portion of the core group includes a plurality of processing cores;
according to the mapping relationship, each task area corresponding to a core group comprising a plurality of processing cores is divided into a plurality of task blocks; the number of task blocks of each such task area is the same as the number of processing cores in the core group corresponding to that task area, and each task block of each task area is mapped into a respective processing core of the corresponding core group.
As one mode of the embodiments of the present disclosure, referring to fig. 4, there may be core groups (core group 0, core group 1) each including a plurality of processing cores (processing core 0, processing core 1), so the corresponding task areas (task area 0, task area 2) need to be "blocked" first: each resulting task block includes a plurality of tasks, and each task block is then mapped into one processing core of the corresponding core group (so the number of task blocks is necessarily the same as the number of processing cores in the core group).
Obviously, the hardware resources (such as cache) of each processing core are limited, so not every processing core can process all tasks of an entire task area by itself; therefore, a plurality of processing cores may be combined into a core group to jointly process one task area, with comprehensive consideration of hardware resources, computation load balancing, and the like.
It should be understood that, referring to fig. 4, in order to simplify the routing structure between different processing cores in the same core group, tasks from the same layer may be divided into the same task block as far as possible, so that tasks located in the same layer end up in one processing core as far as possible. In this way, the inter-core routes established among the processing cores of one core group mainly connect processing cores corresponding to different layers, instead of forming a very complex "mesh" of routes.
Of course, it is also possible if all core groups comprise a plurality of processing cores.
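A minimal sketch of the blocking step, keeping same-layer tasks together in contiguous blocks as suggested above; the function name, the task representation, and the chunking rule are illustrative assumptions:

```python
def split_into_task_blocks(tasks, num_cores):
    """Divide one task area into num_cores task blocks (one per
    processing core of its core group).  Sorting by layer first keeps
    tasks of the same layer in the same contiguous block where possible,
    as suggested above.  tasks: non-empty list of (task_id, layer) pairs.
    """
    ordered = sorted(tasks, key=lambda t: t[1])  # group tasks by layer
    blocks = [[] for _ in range(num_cores)]
    for i, task in enumerate(ordered):
        # assign contiguous chunks of the layer-sorted list to each block
        blocks[i * num_cores // len(ordered)].append(task)
    return blocks
```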
In some embodiments, in at least part of the core groups comprising a plurality of processing cores, the distance between any two processing cores is less than a preset distance.
In a core group including a plurality of processing cores (e.g., core group 0, core group 1), the distance between the processing cores in it (e.g., processing core 0, processing core 1) may be less than a preset distance. Because the transmission efficiency of inter-core routing is related to the distance between processing cores, dividing processing cores that are "closer" to each other into one core group can reduce the time consumed by data transmission within the core group, among other benefits.
The specific form of the above "distance between processing cores" may vary; for example, it may be the straight-line physical distance between processing cores, the total length of the inter-core route connecting them, or the number of other processing cores (or route hops) between them, which will not be described in detail herein.
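For illustration only, assuming the processing cores sit on a 2-D mesh with known grid coordinates (an assumption, not stated by the disclosure), the distance forms mentioned above might be computed as:

```python
def core_distance(core_a, core_b, metric="hops"):
    """Distance between two processing cores on an assumed 2-D mesh.
    core_a, core_b: (x, y) grid coordinates."""
    dx = abs(core_a[0] - core_b[0])
    dy = abs(core_a[1] - core_b[1])
    if metric == "hops":          # routing hops on a mesh (Manhattan)
        return dx + dy
    if metric == "euclidean":     # straight-line physical distance
        return (dx * dx + dy * dy) ** 0.5
    raise ValueError(metric)

def group_is_close(cores, preset_distance, metric="hops"):
    """True if every pair of cores in the group is closer than the
    preset distance, i.e., the group satisfies the constraint above."""
    return all(core_distance(a, b, metric) < preset_distance
               for i, a in enumerate(cores) for b in cores[i + 1:])
```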
In some embodiments, at least some of the processing cores belong to multiple core groups simultaneously.
Referring to fig. 4, the processing cores included in different core groups may "overlap" (e.g., core group 0 and core group 1 each include processing core 0 and processing core 1), so tasks (including task blocks) of different task areas may be placed into the same processing core, in order to make fuller use of hardware resources and better achieve computation load balancing.
The processing cores included in different core groups may be completely identical (in which case the core groups may be considered "merged"), or the processing cores included in different core groups may "partially overlap", which will not be described in detail herein.
In some embodiments, between dividing the computational graph into a plurality of task areas (S103) and determining a mapping relationship between tasks of each task area and each processing core of the many-core system (S105), further comprising:
S1031, expanding at least part of the task areas.
Wherein expanding includes adding redundant tasks in the task area.
Referring to fig. 2 and fig. 4, after the plurality of task areas are obtained by division, each task area may be further "expanded", that is, some tasks not originally in the task area (redundant tasks) are added to it; the expanded task area is then mapped to the corresponding core group (including mapping after blocking), so that the processing cores also carry the above redundant tasks, improving redundancy performance.
The specific degree of "expansion" of each layer may be preset. For example, a redundancy coefficient b may be set to indicate the ratio of the computation amount of the expanded tasks to that of the original tasks: if the redundancy coefficient b = 0, this corresponds to no expansion; if b = 1, the computation amount of the expanded tasks is the same as that of the original tasks. In general, b may be greater than 0 and less than or equal to 1 (b greater than 1 is also possible, e.g., "multiple" backups of a task).
Alternatively, each layer may be expanded according to a specific manner, in which case the actual task amount obtained by expanding in that manner may be used.
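Assuming for illustration that all tasks have the same computation amount (a simplification not stated by the disclosure), the number of redundant tasks implied by the redundancy coefficient b could be derived as:

```python
def redundant_task_count(original_count, b):
    """Number of redundant tasks to add to a task area, given redundancy
    coefficient b (ratio of expanded to original computation amount).
    b = 0 means no expansion; b = 1 duplicates the original amount;
    b > 1 allows multiple backups.  Assumes uniform task cost."""
    if b < 0:
        raise ValueError("redundancy coefficient must be non-negative")
    return round(original_count * b)
```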
In some embodiments, the redundant tasks include at least one of:
(1) Backup tasks.
Wherein the backup task is the same as the task in the corresponding task area.
As one mode of the embodiments of the present disclosure, each expanded task may be a copy of an original task in the corresponding task area, so that the corresponding tasks actually exist in "multiple copies" and can serve as mutual "backups", improving robustness.
(2) Empty tasks.
As one mode of the embodiments of the present disclosure, null tasks that perform no actual operation (or perform a null operation) may be added as the expansion.
(3) Invalidating tasks.
As one mode of the embodiments of the present disclosure, some tasks that perform operations, but not operations required in the original computation graph (i.e., invalid tasks), may be added as the expansion.
The invalid tasks may be generated randomly or by another specific technique.
The expansion modes of different task areas can be the same or different.
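A sketch of the three expansion modes above, with an illustrative dictionary representation of tasks (none of these names come from the disclosure, and the actual task representation would differ):

```python
import copy
import random

def expand_task_area(area_tasks, kind, count, rng=None):
    """Add `count` redundant tasks of the given kind to a task area.

    area_tasks: non-empty list of task dicts with an 'op' field.
    kind: 'backup' (copies of existing tasks), 'empty' (tasks performing
    no real operation), or 'invalid' (tasks computing something not
    required by the original computation graph).
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    if kind == "backup":
        extra = [copy.deepcopy(rng.choice(area_tasks)) for _ in range(count)]
    elif kind == "empty":
        extra = [{"op": None} for _ in range(count)]
    elif kind == "invalid":
        extra = [{"op": "noise", "seed": rng.random()} for _ in range(count)]
    else:
        raise ValueError(kind)
    return area_tasks + extra
```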
In some embodiments, between acquiring the calculation map of the problem to be processed (S101) and dividing the calculation map into a plurality of task areas (S103), further comprising:
s102, training the calculation graph to improve redundancy performance of the calculation graph.
Referring to fig. 2 and 4, the computational graph may also be trained to improve its redundancy performance prior to being "partitioned".
In some embodiments, the above training comprises at least one of:
(1) The partial tasks in the computational graph are invalidated to train the computational graph.
As one mode of the embodiments of the present disclosure, a Dropout approach may be adopted: part of the tasks in the computation graph are invalidated (for example, the weights of some nodes of the neural network are set to 0) and the other tasks are adjusted, so that the computation graph can still generate a result that is usable to a certain extent when tasks fail, thereby improving its robustness.
The invalidated tasks may be located in one contiguous region of the computation graph (not necessarily a task area); that is, the training may specifically be Dropblock training.
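The Dropout-style invalidation could be sketched as follows; this covers only the invalidation step (weights set to 0), not the full training loop, and the weight representation is an assumption:

```python
import random

def dropout_invalidate(weights, p, rng=None):
    """Invalidate a random fraction p of tasks by setting their weights
    to 0 (the Dropout-style step described above; the adjustment of the
    remaining tasks is omitted).  weights: {task_id: weight}."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    return {t: (0.0 if rng.random() < p else w) for t, w in weights.items()}
```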
(2) The computation graph is trained by an adversarial example defense approach.
As one mode of the embodiments of the present disclosure, the computation graph may be trained in an "adversarial example defense" manner.
In some embodiments, between dividing the computational graph into a plurality of task areas (S103) and determining a mapping relationship between tasks of each task area and each processing core of the many-core system (S105), further comprising:
S104, invalidating part of the task areas to train the task areas and improve the redundancy performance of the computation graph.
Referring to fig. 2 and 4, after the "partitioning", one or more (but not all) of the task areas may be invalidated (i.e., all tasks therein are invalidated) and the tasks in the other task areas adjusted, so that the remaining task areas can still produce a somewhat usable result (i.e., the task areas are trained), improving robustness.
Of course, the above training is also training on "computational graphs" in nature.
Performing this training after "partitioning" (specifically, after expansion or before expansion) is equivalent to improving robustness at the "region level".
In some embodiments, the above training comprises at least one of:
(1) Randomly invalidating part of the task areas to train the task areas.
As one mode of the embodiments of the present disclosure, part of the task areas may be invalidated at random to train the remaining task areas.
(2) Determining a critical task area including critical tasks, and invalidating the critical task area to train the task areas.
As one mode of the embodiments of the present disclosure, "critical tasks" that play a key role may be determined according to the structural features of the computation graph, and the critical task area where the critical tasks are located may be invalidated to train the task areas.
In some embodiments, after determining the mapping relationship between the tasks of the task areas and the processing cores of the many-core system (S105), further includes:
S106, invalidating all tasks mapped in part of the processing cores, so as to train the task areas and improve the redundancy performance of the computation graph.
Referring to fig. 2 and fig. 4, after the mapping relationship is determined (including after the actual mapping is performed), all tasks mapped in part of the processing cores may be invalidated (equivalent to invalidating part of the processing cores) and the other tasks adjusted, so that the remaining tasks can still generate a result that is usable to a certain extent (i.e., the task areas are trained), thereby improving robustness.
Of course, the above training is also training on "computational graphs" in nature.
The training is equivalent to simulating the situation that the processing core is invalid due to faults and the like, so that the robustness of the core level can be improved from the point of view of final practical application.
In some embodiments, the above training comprises at least one of:
(1) All tasks mapped in a randomly selected part of the processing cores are invalidated to train each task area.
As a way of an embodiment of the present disclosure, all tasks mapped to one or more (but not all) processing cores may be invalidated (i.e., one or more processing cores are invalidated) to train the task areas.
(2) All tasks mapped in each processing core are invalidated in turn to train each task area.
As one mode of the embodiments of the present disclosure, the processing cores may be invalidated sequentially, i.e., only one processing core is invalidated at a time, but every processing core is invalidated once.
(3) The critical tasks are determined, and all tasks mapped in the processing cores to which the critical tasks are mapped are invalidated to train the task areas.
As a way of an embodiment of the disclosure, a critical task may be determined according to a structure of a computational graph, and then a processing core where the critical task is located is invalidated, so as to train each task area.
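The core-level invalidation schemes above can be sketched on top of a task-to-core mapping; the mapping representation is an illustrative assumption, and the actual adjustment (training) of the surviving tasks is omitted:

```python
def invalidate_core(mapping, core_id):
    """Simulate a failed processing core: invalidate every task mapped
    to it.  mapping: {task_id: core_id}.  Returns the set of invalidated
    tasks, which a subsequent training pass would work around."""
    return {task for task, core in mapping.items() if core == core_id}

def sweep_all_cores(mapping):
    """Invalidate each core in turn (one at a time, every core once),
    yielding the task set lost in each round, as in scheme (2) above."""
    for core in sorted(set(mapping.values())):
        yield core, invalidate_core(mapping, core)
```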
In some embodiments, referring to fig. 2, after determining a mapping relationship between tasks of each task area and each processing core of the many-core system (S105), further includes:
S107, mapping the tasks of each task area to each processing core of the many-core system according to the mapping relationship.
S108, each processing core processes the task mapped thereto.
After the mapping relationship is determined, the tasks in each task area can be mapped (or distributed) into the processing cores of the corresponding core group according to the mapping relationship and processed by those processing cores, thereby realizing the actual function of the computation graph and solving the problem to be processed.
Of course, the above steps of determining the mapping relation (S105) and performing the mapping (S107) may be actually integrated, i.e., the mapping may be performed directly.
Of course, when the training of step S106 is to be performed, referring to fig. 2, step S106 may be performed before step S107; in that case the tasks corresponding to a processing core are invalidated according to the mapping relationship only, without actually mapping the tasks to the processing cores.
Alternatively, the step S106 may be performed after the step S107, that is, the task may be actually mapped to the processing core and the processing core may be actually disabled for training.
It should be understood that in the disclosed embodiments, all of the computational graphs to be trained are necessarily trainable computational graphs (e.g., neural networks).
In the disclosed embodiments, each training may be performed only once, or may be performed in cycles.
In the embodiment of the disclosure, when multiple training is performed, the specific modes adopted by each training may be the same or different.
In the embodiments of the present disclosure, when multiple rounds of training are performed, the training may be ended when a preset ending criterion is reached; the ending criterion may include convergence of the computation graph, reaching a predetermined number of training rounds, reaching a predetermined redundancy performance, and so on.
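A minimal sketch of the multi-round training loop with the ending criteria mentioned above (a round limit and a target redundancy performance); `train_step` and the 0-to-1 score scale are assumptions, not part of the disclosure:

```python
def train_until_done(train_step, max_rounds, target_redundancy):
    """Run training rounds until a preset ending criterion is met:
    either a predetermined number of rounds is reached or a
    predetermined redundancy performance is achieved.  Assumes
    max_rounds >= 1 and that train_step(round_index) returns the
    current redundancy score in [0, 1]."""
    for rounds in range(1, max_rounds + 1):
        score = train_step(rounds)
        if score >= target_redundancy:
            break  # predetermined redundancy performance reached
    return rounds, score
```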
In a second aspect, referring to fig. 5, an embodiment of the present disclosure provides an apparatus 600 for task processing, including:
an obtaining module 601 configured to obtain a calculation map of a problem to be processed; the calculation graph comprises a plurality of layers which are arranged in sequence, each layer comprises a plurality of tasks, the tasks in any layer are not performed based on the results of the tasks in the layer or the later layers, and at least part of the tasks in at least part of the layers are performed based on the results of the tasks in the preceding layer;
a partitioning module 602 configured to partition the computational graph into a plurality of task areas; each task area comprises a plurality of tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, and the internal routing strength of each task area is determined according to the proportion of routes among tasks in the task area to all routes related to the tasks in the task area;
The mapping module 603 is configured to determine a mapping relationship between the tasks of each task area and each processing core of the many-core system; according to the mapping relation, each task area corresponds to a core group, each core group comprises one or more processing cores, and the task of each task area is mapped into the processing core of the corresponding core group.
The task processing device 600 of the embodiment of the present disclosure may implement the task processing method described above.
It should be understood that, when other steps are included in the above-described method for task processing, other modules for implementing the corresponding steps may also be included in the apparatus 600 for task processing.
In a third aspect, referring to fig. 6, an embodiment of the present disclosure provides a many-core system, comprising:
a plurality of processing cores 701; and
a network on chip 702 configured to exchange data among the plurality of processing cores 701 and with external devices;
one or more processing cores 701 have stored therein one or more instructions that are executed by the one or more processing cores 701 to enable the one or more processing cores 701 to perform a method of performing any one of the task processing described above.
The many-core system of the embodiment of the disclosure can realize the above method for task processing, including performing actual task in the computational graph to obtain a processing result of a problem to be processed.
In a fourth aspect, referring to fig. 7, an embodiment of the present disclosure provides a computer readable medium 800 having a computer program stored thereon, wherein the computer program, when executed by a processing core, implements a method of any of the task processing described above.
In the embodiments of the present disclosure, when the computer program stored in the computer readable medium 800 is executed by a processing core, the method of task processing described above is implemented.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. 
Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (18)

1. A method of task processing, comprising:
acquiring a calculation graph of a problem to be processed; the calculation graph comprises a plurality of layers which are sequentially arranged, each layer comprises a plurality of tasks, the tasks in any layer are not performed based on the results of the tasks in the layer or the later layers, and at least part of the tasks in at least some layers are performed based on the results of the tasks in the previous layers;
dividing the computational graph into a plurality of task areas; each task area comprises a plurality of tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, and the internal routing strength of each task area is determined according to the proportion of routes among tasks in the task area to all routes related to the tasks in the task area;
Determining the mapping relation between the tasks of each task area and each processing core of the many-core system; according to the mapping relation, each task area corresponds to a core group, each core group comprises one or more processing cores, and the task of each task area is mapped into the processing core of the corresponding core group.
2. A method of task processing as claimed in claim 1, wherein,
at least a portion of the task area includes tasks from each layer.
3. The method of task processing according to claim 1, wherein the internal routing strength comprises:
an internal routing amount ratio, wherein the internal routing amount ratio of each task area is the proportion of the data amount transmitted by routes between tasks within the task area to the data amount transmitted by all routes related to the tasks in the task area;
and/or,
an internal route number ratio, wherein the internal route number ratio of each task area is the proportion of the number of routes between tasks within the task area to the number of all routes related to the tasks in the task area.
4. A method of task processing according to claim 3, wherein the preset criteria comprises:
the internal routing amount ratio is greater than a first threshold;
and/or,
the internal route number ratio is greater than a second threshold.
5. A method of task processing as claimed in claim 1, wherein,
at least part of the core group includes only one processing core.
6. A method of task processing as claimed in claim 1, wherein,
at least a portion of the core groups include a plurality of processing cores;
according to the mapping relation, each task area corresponding to the core group comprising a plurality of processing cores is divided into a plurality of task blocks, the number of the task blocks of each task area is the same as the number of the processing cores comprising the core group corresponding to the task area, and each task block of each task area is mapped to each processing core of the core group corresponding to the task area.
7. A method of task processing as defined in claim 6, wherein,
in at least part of the core groups comprising a plurality of processing cores, a distance between any two processing cores is smaller than a preset distance.
8. A method of task processing as claimed in claim 1, wherein,
at least part of the processing cores belong to a plurality of core groups simultaneously.
9. The method of task processing according to claim 1, wherein between the dividing the computation graph into a plurality of task areas and the determining a mapping relationship between the tasks of each task area and each processing core of the many-core system, further comprising:
Expanding at least part of the task area; the expanding includes adding redundant tasks in the task area.
10. The method of task processing according to claim 1, wherein between the obtaining a computational graph of a problem to be processed and the dividing the computational graph into a plurality of task areas, further comprising:
the computational graph is trained to improve redundancy performance of the computational graph.
11. The method of task processing according to claim 1, wherein between the dividing the computation graph into a plurality of task areas and the determining a mapping relationship between the tasks of each task area and each processing core of the many-core system, further comprising:
and invalidating part of task areas to train each task area and improve the redundancy performance of the calculation graph.
12. The method for task processing according to claim 1, further comprising, after the determining a mapping relationship between the tasks of each task area and each processing core of the many-core system:
and invalidating all tasks mapped in part of the processing cores to train each task area and improve the redundancy performance of the computational graph.
13. A method of task processing according to any one of claims 1 to 12, wherein,
the calculation graph is a trainable calculation graph; the trainable computational graph can solve the same problem to be processed in the case that at least part of tasks are different.
14. A method of task processing according to any one of claims 1 to 12, wherein,
the computational graph is a neural network.
15. A method of task processing according to any one of claims 1 to 12, wherein after said determining a mapping relationship between tasks of each task area and each processing core of the many-core system, further comprising:
according to the mapping relation, mapping the tasks of each task area into each processing core of the many-core system;
each processing core processes tasks mapped into it.
16. An apparatus for task processing, comprising:
the acquisition module is configured to acquire a calculation graph of a problem to be processed; the calculation graph comprises a plurality of layers which are sequentially arranged, each layer comprises a plurality of tasks, the tasks in any layer are not performed based on the results of the tasks in the layer or the later layers, and at least part of the tasks in at least some layers are performed based on the results of the tasks in the previous layers;
a partitioning module configured to partition the computational graph into a plurality of task areas; each task area comprises a plurality of tasks from at least two different layers, and the tasks of each layer are located in at least two task areas; the internal routing strength of each task area exceeds a preset standard, and the internal routing strength of each task area is determined according to the proportion of routes among tasks in the task area to all routes related to the tasks in the task area;
The mapping module is configured to determine the mapping relation between the tasks of each task area and each processing core of the many-core system; according to the mapping relation, each task area corresponds to a core group, each core group comprises one or more processing cores, and the task of each task area is mapped into the processing core of the corresponding core group.
17. A many-core system, comprising:
a plurality of processing cores; and
a network on chip configured to exchange data among the plurality of processing cores and with external devices;
one or more of the processing cores having stored therein one or more instructions for execution by one or more of the processing cores to enable one or more of the processing cores to perform the method of task processing of any one of claims 1 to 15.
18. A computer readable medium having stored thereon a computer program, wherein the computer program when executed by a processing core implements the method of task processing according to any of claims 1 to 15.
CN202110184939.8A 2021-02-10 2021-02-10 Method and device for task processing, many-core system and computer readable medium Active CN112835719B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110184939.8A CN112835719B (en) 2021-02-10 2021-02-10 Method and device for task processing, many-core system and computer readable medium
PCT/CN2022/074490 WO2022171002A1 (en) 2021-02-10 2022-01-28 Task processing method and apparatus, many-core system, and computer-readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110184939.8A CN112835719B (en) 2021-02-10 2021-02-10 Method and device for task processing, many-core system and computer readable medium

Publications (2)

Publication Number Publication Date
CN112835719A CN112835719A (en) 2021-05-25
CN112835719B true CN112835719B (en) 2023-10-31

Family

ID=75933602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110184939.8A Active CN112835719B (en) 2021-02-10 2021-02-10 Method and device for task processing, many-core system and computer readable medium

Country Status (1)

Country Link
CN (1) CN112835719B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022171002A1 (en) * 2021-02-10 2022-08-18 北京灵汐科技有限公司 Task processing method and apparatus, many-core system, and computer-readable medium
CN116048740A (en) * 2021-10-28 2023-05-02 北京灵汐科技有限公司 Task scheduling method and system based on many-core system, electronic equipment and medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN103902646A (en) * 2013-12-27 2014-07-02 北京天融信软件有限公司 Distributed task managing system and method
CN111124626A (en) * 2018-11-01 2020-05-08 北京灵汐科技有限公司 Many-core system and data processing method and processing device thereof
CN112114942A (en) * 2019-06-21 2020-12-22 北京灵汐科技有限公司 Streaming data processing method based on many-core processor and computing device

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20020049841A1 (en) * 2000-03-03 2002-04-25 Johnson Scott C Systems and methods for providing differentiated service in information management environments
US20020120741A1 (en) 2000-03-03 2002-08-29 Webb Theodore S. Systems and methods for using distributed interconnects in information management environments

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN103902646A (en) * 2013-12-27 2014-07-02 北京天融信软件有限公司 Distributed task managing system and method
CN111124626A (en) * 2018-11-01 2020-05-08 北京灵汐科技有限公司 Many-core system and data processing method and processing device thereof
CN112114942A (en) * 2019-06-21 2020-12-22 北京灵汐科技有限公司 Streaming data processing method based on many-core processor and computing device

Non-Patent Citations (2)

Title
P2P spatial data service scheduling based on a balanced interest tree; Hou Zhewei; Wang Qingshan; Ding Lin; Ren Liucheng; Application Research of Computers (09); full text *
Research on distributed outlier data detection in wireless sensor networks; Tang Qi; Liu Xuejun; Chinese Journal of Sensors and Actuators (06); full text *

Also Published As

Publication number Publication date
CN112835719A (en) 2021-05-25

Similar Documents

Publication Publication Date Title
CN108416327B (en) Target detection method and device, computer equipment and readable storage medium
CN112835719B (en) Method and device for task processing, many-core system and computer readable medium
US11651224B2 (en) Method for formatting a weight matrix, accelerator using the formatted weight matrix, and system including the accelerator
CN109919313B (en) Gradient transmission method and distributed training system
Gabler Minimax solutions in sampling from finite populations
EP3800590A1 (en) Accelerated embedding layer computations
WO2021110147A1 (en) Methods and apparatuses for image processing, image training and channel shuffling
WO2022012576A1 (en) Path planning method and apparatus, path planning device, and storage medium
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN112835718A (en) Method and device for processing task, many-core system and computer readable medium
CN112596872A (en) Task scheduling method, task preprocessing method, task processing device, task processing unit and task processing medium
CN105528183B (en) A kind of method and storage equipment of storing data
CN111935005B (en) Data transmission method, device, processing equipment and medium
CN110135428A (en) Image segmentation processing method and device
US11467973B1 (en) Fine-grained access memory controller
CN115361332B (en) Fault-tolerant route processing method and device, processor and electronic equipment
US20220027714A1 (en) Convolution block array for implementing neural network application and method using the same, and convolution block circuit
WO2022171002A1 (en) Task processing method and apparatus, many-core system, and computer-readable medium
CN114970838A (en) Neuron output data calculation method and device, many-core system and medium
CN114881221A (en) Mapping scheme optimization method and device, electronic equipment and readable storage medium
CN112446463A (en) Neural network full-connection layer operation method and device and related products
CN112132583A (en) Transaction processing method and device of block chain, electronic equipment and readable storage medium
CN112955906B (en) Neural network layer grouping method, device, equipment, storage medium and program product
CN113722554A (en) Data classification method and device and computing equipment
CN111767204A (en) Overflow risk detection method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant