CN115658274B - Modularized scheduling method, device and computing equipment for neural network reasoning in core particle


Info

Publication number: CN115658274B
Application number: CN202211425389.5A
Authority: CN (China)
Prior art keywords: module, modules, data, many, depth
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN115658274A
Inventors: 毛旷, 许慧卿, 潘秋红, 汤昭荣, 杨弢, 杨佳宁, 叶茂伟, 王颖
Current Assignee: Zhejiang Lab
Original Assignee: Zhejiang Lab
Application filed by Zhejiang Lab; priority to CN202211425389.5A
Publication of CN115658274A; application granted; publication of CN115658274B

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a modularized scheduling method, device, and computing equipment for neural network reasoning in a core particle. The method comprises the following steps: acquiring a scheduling strategy search space for neural network reasoning in the core particle; obtaining a computational graph of the neural network, generating operator depths from the computational graph, and dividing the operators into serial groups according to the computational graph; dividing the computational graph according to the data dependency relationships among operators, the operator depths, and the serial groups to obtain data dependency modules and parallel data dependency modules; calculating the data dependency complexity of each data dependency module, and calculating the maximum available resource allocation quantity of the operators according to the data dependency complexity, the parallel data dependency modules, and the total number of core particle resources, the maximum available resource allocation quantity serving as an initial constraint for the iterative search of the scheduling strategy; and iteratively searching, according to the scheduling strategy search space and the initial constraint, for the scheduling strategy that minimizes the sum of the computation overhead, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core particle multi-level routing.

Description

Modularized scheduling method, device and computing equipment for neural network reasoning in core particle
Technical Field
The invention belongs to the technical field at the intersection of deep learning, neural network accelerators, and high-performance computing, and particularly relates to a modularized scheduling method, device, and computing equipment for neural network reasoning in a core particle.
Background
Neural networks are used to solve challenging classification, recognition, and prediction problems in practical applications. However, to address increasingly complex problems and the demand for higher application precision, large-scale neural networks have become a research hotspot and trend, and the amount of data they process is growing rapidly, which poses great challenges to the computation and storage performance of neural network accelerators.
With the development of core particle (chiplet) technology, neural network accelerators are being upgraded from single chips to core particles. A core particle integrates and packages small chiplet units into a modular chip that fulfills specific functions; compared with an ordinary monolithic chip, it offers higher flexibility and performance at lower cost, which greatly benefits the development of domain-specific architectures for neural network accelerators.
Therefore, for a highly integrated and more complex structure such as a core particle, it has become a significant challenge to design a modularized scheduling method for neural network reasoning in the core particle that fully considers the data dependency relationships between operators in the neural network, so as to search for a high-performance core particle scheduling strategy with the smallest possible reasoning overhead and thereby accelerate neural network reasoning on the core particle.
Existing neural network reasoning scheduling solutions for core particle structures fail to balance the multiple important factors that influence reasoning overhead, and also ignore the influence that the scheduling strategy adopted for local subgraphs with complex data dependency relationships has on the overall reasoning overhead, so the scheduling technology of neural network accelerators still has considerable room for improvement.
Patent document CN112149816A discloses a heterogeneous memory fusion system supporting neural network reasoning acceleration, which comprises a host processor, a nonvolatile memory module, a 3D stacked memory module, a network module, a configuration circuit, and a voltage generator. It aims to optimize neural network computation and memory access by comprehensively considering the requirements of the neural network and the advantages and disadvantages of 3D stacked memory and memristors, and supports large-scale neural network computation acceleration, but it does not solve the problem of reasoning resource scheduling optimization for a neural network accelerator.
Patent document CN114168321A discloses a resource allocation method in which the current resource pool is divided into resource sub-pools whose number matches the number of functional modules in a network chip that need to use resources, each resource sub-pool containing available resources; target resource sub-pools meeting the resource screening conditions are screened from each resource sub-pool group; and the available resources in the target resource sub-pools are allocated to the functional modules that need them in such a way that the resources available to each functional module allow it to reach its highest operating frequency. However, this resource allocation method is not suitable for allocating resources to operators of a neural network in a core particle.
Disclosure of Invention
In view of the above, the present invention aims to provide a modularized scheduling method for neural network reasoning in a core particle, which determines an optimal resource scheduling policy by considering factors such as the operator division mode, the resource allocation mode, and the placement position of resources in the core particle, on the basis of the data dependency relationships.
In order to achieve the above object, an embodiment of the present invention provides a modular scheduling method for neural network reasoning in a core, including the following steps:
acquiring a scheduling policy search space for neural network reasoning in the core particle, comprising: obtaining a division mode of operators in the neural network, a resource allocation mode containing the total number of core particle resources, and a placement position of the obtained resources in the core particles;
obtaining a computational graph of the neural network, generating a depth of the neural network operator according to the computational graph, and dividing the neural network operator into serial groups according to the computational graph;
dividing the computational graph according to the data dependency relationship among operators presented by the computational graph, the operator depth and the serial group to obtain a data dependency module and a parallel data dependency module;
determining the data dependence complexity of a data dependence module, and calculating the maximum available resource allocation quantity of operators according to the data dependence complexity, the parallel data dependence module and the total number of core particle resources, wherein the maximum available resource allocation quantity is used as an initial constraint of iterative search of a scheduling strategy;
And iteratively searching, according to the scheduling strategy search space and the initial constraint, for the optimal modularized scheduling strategy that minimizes the sum of the computation overhead, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core particle multi-level routing.
Preferably, obtaining the division modes of each type of operator in the neural network comprises: representing each type of operator in the neural network as V_c, where c denotes the operator type and V_c contains all operators belonging to class c; the division modes of each operator class V_c form a division mode set T_c, the division mode set T_c contains a plurality of possible division modes t_i for dividing class-c operators, i denotes the index of a division mode, and each division mode t_i describes the partitionable dimensions of the class-c operator and the mapping scheme of each dimension onto the core particle;
Obtaining the resource allocation modes of each type of operator in the neural network comprises: obtaining the total number R* of core particle resources that can be allocated to operators in the neural network, where R* is expressed as M×N, and M and N respectively denote the total number of rows and the total number of columns of embedded neural network processors across all core particles; and obtaining the sub-core-particle arrays r_j of all possible array sizes that can be allocated to an operator, where j is the index of a sub-core-particle array, each sub-core-particle array represents a resource allocation quantity assigned to an operator, the resource allocation quantity is expressed as m×n, and m and n respectively denote the number of rows and the number of columns of embedded neural network processors in the sub-core-particle array r_j;
Obtaining the placement positions of the sub-core-particle arrays in the core particles comprises: for each sub-core-particle array r that can be assigned to a neural network operator, acquiring the set P_r of all possible placement positions p_k in the core particles, where k denotes the index of a placement position; a placement position p_k is expressed by coordinates (x, y), where (x, y) denotes the coordinate position, within the core particles, of the embedded neural network processor located at row 1 and column 1 of the sub-core-particle array r, x is required to satisfy that x+m is not greater than the total number of rows M of embedded neural network processors in the core particles, and y is required to satisfy that y+n is not greater than the total number of columns N of embedded neural network processors in the core particles.
Preferably, the acquired computational graph of the neural network is represented by nodes and edges, wherein the nodes of the computational graph represent operators of the neural network and the edges represent data dependencies between the operators of the neural network;
Dividing the neural network operators into serial groups according to the computational graph comprises: dividing serial groups according to the out-degree and in-degree of nodes in the computational graph, where each serial group consists of a single node whose out-degree and in-degree are both 1, or of a plurality of consecutive nodes whose out-degree and in-degree are both 1; two nodes are consecutive when one node is a predecessor or successor of the other.
Preferably, dividing the computational graph into data-dependent modules includes:
dividing the computational graph into a pair of multi-modules, comprising: dividing a node with the input degree equal to 1 and the output degree greater than 1 in the calculation graph and the subsequent nodes of all adjacent depths into the same one pair of multiple modules, taking the minimum depth of the node in the one pair of multiple modules as the depth of the one pair of multiple modules, and forming a set DD by all the one pair of multiple modules 0
Dividing the computational graph into many-to-one modules comprises: dividing a node in the computational graph whose in-degree is greater than 1 and whose out-degree equals 1, together with its predecessor nodes of all adjacent depths, into the same many-to-one module; taking the minimum depth of the nodes in the many-to-one module as the depth of the many-to-one module; all many-to-one modules form the set DD_1;
Dividing the computational graph into many-to-many modules comprises: dividing a node in the computational graph whose in-degree is greater than 1 and whose out-degree is greater than 1, together with all its predecessor and successor nodes, into the same many-to-many module; taking the minimum depth of the nodes in the many-to-many module as the depth of the many-to-many module; all many-to-many modules form the set DD_2;
Dividing the computational graph into parallel one-to-many modules comprises: combining the one-to-many modules in the set DD_0 that have the same depth into a parallel one-to-many module, taking the minimum depth of the nodes in the parallel one-to-many module as the depth of the parallel one-to-many module, and meanwhile deleting from DD_0 the one-to-many modules that participate in the combination; all parallel one-to-many modules form the set DD_3;
Dividing the computational graph into one-to-one modules comprises: traversing the serial groups and creating a one-to-one module for each serial group, adding all operators in the serial group into the one-to-one module, and taking the minimum depth of the nodes in the one-to-one module as the depth of the one-to-one module; if the predecessor of the minimum-depth node in the serial group has in-degree greater than 1 and out-degree equal to 1, adding that predecessor into the one-to-one module corresponding to the serial group, and if the successor of the maximum-depth node in the serial group has in-degree equal to 1 and out-degree greater than 1, adding that successor into the one-to-one module corresponding to the serial group; traversing the set DD_1, and if there is a many-to-one module with 3 nodes whose minimum-depth node has in-degree greater than 1 and out-degree equal to 1, creating two one-to-one modules, adding the minimum-depth node of the many-to-one module and its successor into one of them, adding the maximum-depth node of the many-to-one module and its predecessor into the other, taking the minimum depth of the nodes in each one-to-one module as its depth, and meanwhile deleting the many-to-one module from DD_1; all one-to-one modules form the set DD_4;
The one-to-many modules, many-to-one modules, many-to-many modules, parallel one-to-many modules, and one-to-one modules are collectively called data dependency modules, and merging the sets DD_0, DD_1, DD_2, DD_3, and DD_4 generates the data dependency module set DD.
Preferably, dividing the computational graph into parallel data-dependent modules includes:
traversing the computational graph of the neural network, and for a data dependent module dd with depth d where a current traversing operator is located, executing the following steps:
if the parallel data dependency module with the depth d exists or other data dependency modules with the depth d do not exist, continuing to traverse the next data dependency module;
if the parallel data dependency module with the depth d does not exist, and if other data dependency modules with the depth d and the data dependency module dd do not intersect, dividing a plurality of data dependency modules which do not contain the same operator and are at the depth d into the same parallel data dependency module;
and after the traversal is finished, outputting a parallel data dependency module set PDD formed by the parallel data dependency modules.
Preferably, determining the data dependency complexity of the data dependency module comprises:
The data dependency complexity ordering of the data dependency modules comprises: first, performing a one-stage ordering according to rule 1, where rule 1 is that the data dependency complexity of one-to-many modules, many-to-one modules, many-to-many modules, and parallel one-to-many modules is higher than that of one-to-one modules; then, performing a two-stage ordering according to rule 2, where rule 2 is that among all data dependency modules other than one-to-one modules, those with a larger number of operators have higher data dependency complexity; next, performing a three-stage ordering according to rule 3, where rule 3 is that among one-to-one modules, those with a larger number of operators have higher data dependency complexity; and finally, performing a four-stage ordering according to rule 4 to obtain the ordered data dependency module set DD*, where rule 4 is that for data dependency modules with the same number of operators, the module traversed earlier by a breadth-first search of the computational graph has higher complexity;
The data dependency complexity is tagged for each data dependency module by traversing the data dependency module set DD*.
Preferably, calculating the maximum available resource allocation number of the operator in terms of the data dependent complexity, the parallel data dependent module and the total number of core resources comprises:
Traversing the data dependency module set DD* marked with the data dependency complexity order; for the currently traversed data dependency module dd, if the maximum available resource allocation quantity has already been determined for all operators in dd, continuing to traverse the next data dependency module, and if there are operators in dd whose maximum available resource allocation quantity is undetermined, executing the following steps:
firstly, a parallel data dependency module pdd where a data dependency module dd is located is obtained, and a first operator proportion of the number of operators of the data dependency module dd to the total number of operators of the parallel data dependency module pdd is calculated;
Then, multiplying the total number of core particle resources R* by the first operator proportion, and taking the result of the multiplication as the maximum available resource allocation quantity of the operators in the data dependency module dd whose maximum available resource allocation quantity is undetermined.
Preferably, searching for an optimal modular scheduling policy that minimizes the sum of computation overhead, intra-operator and inter-operator data transmission overhead, congestion overhead generated by core multi-level routing, according to a scheduling policy search space and an initial constraint iterative search, comprises:
Firstly, performing an iterative search according to the initial constraint to determine the optimal scheduling strategy of each data dependency module, comprising: traversing the data dependency module set DD* marked with the data dependency complexity order, and for the currently traversed data dependency module dd, iteratively performing the following steps:
In each round of traversal, if the data dependency module dd is in a parallel data dependency module pdd and there is a data dependency module in pdd for which the scheduling strategies of all operators have been determined, obtaining the sum r* of the operator resource allocation quantities of the data dependency modules whose scheduling strategies have been fully determined; calculating a second operator proportion, namely the proportion of the number of operators with undetermined scheduling strategies in the data dependency modules of pdd whose scheduling strategies are undetermined to the total number of remaining operators of pdd; multiplying the difference between the total number of core particle resources R* and r* by the second operator proportion; and dynamically adjusting, according to the multiplication result, the maximum available resource quantity m_0×n_0 of the operators with undetermined scheduling strategies in the data dependency modules of pdd whose scheduling strategies are undetermined, where m_0 denotes the number of rows of embedded neural network processors and n_0 denotes the number of columns of embedded neural network processors;
In each round of traversal, searching all possible scheduling strategies of each operator in the data dependency module dd from the scheduling strategy search space, where a scheduling strategy is expressed as {t, r, p}, t denotes the division mode of operator v, r denotes the resource allocation quantity obtained by operator v, and p denotes the placement position, within the core particles, of the resources obtained by operator v; meanwhile, the resource allocation quantity allocated to operator v is constrained such that the total resource allocation quantity m×n does not exceed the maximum available resource quantity m_0×n_0 of the operator, m does not exceed m_0, and n does not exceed n_0;
In each search of each round of traversal, generating a scheduling strategy combination corresponding to the data dependency module dd from all possible scheduling strategies of each operator in dd, and calculating the total reasoning overhead of dd under that scheduling strategy combination, where the total reasoning overhead is the sum of the computation overhead generated by dd during reasoning, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core particle multi-level routing;
After the iteration ends, selecting the scheduling strategy combination with the minimum total reasoning overhead for the data dependency module dd as its optimal scheduling strategy;
And then, integrating the optimal scheduling strategies of each data dependency module to output a scheduling strategy set of the computational graph, wherein the scheduling strategy set records the optimal scheduling strategy adopted by each operator.
To achieve the above object, an embodiment further provides a modular scheduling apparatus for neural network reasoning in a core, including:
An acquisition unit for acquiring a scheduling policy search space for neural network reasoning in the core particle, comprising: obtaining the division modes of operators in the neural network, the resource allocation modes containing the total number of core particle resources, and the placement positions of the obtained resources in the core particles;
the serial group dividing unit is used for acquiring a calculation graph of the neural network, generating the depth of the neural network operator according to the calculation graph, and dividing the neural network operator into serial groups according to the calculation graph;
the data dependency module dividing unit is used for dividing the computational graph into a data dependency module and a parallel data dependency module according to the data dependency relationship, the operator depth and the serial group among operators presented by the computational graph;
the data dependence complexity determining unit is used for determining the data dependence complexity of the data dependence module, and calculating the maximum available resource allocation quantity of operators according to the data dependence complexity, the parallel data dependence module and the total core resource number, wherein the maximum available resource allocation quantity is used as an initial constraint of iterative search of a scheduling strategy;
The iterative search unit is used for iteratively searching, according to the scheduling strategy search space and the initial constraint, for the optimal modularized scheduling strategy that minimizes the sum of the computation overhead, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core particle multi-level routing.
To achieve the above object, an embodiment further provides a computing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-mentioned modular scheduling method of neural network reasoning in a core particle when executing the computer program.
Compared with the prior art, the invention has at least the following beneficial effects:
the method not only can fully utilize the core particle resources, but also can fully consider the data dependency relationship among the neural network operators to obtain a high-performance core particle resource scheduling strategy and accelerate the reasoning of the neural network in the core particle;
The computational graph is divided according to the data dependency relationships among operators, the scheduling strategies of the data dependency modules with high data dependency complexity are explored first, and the maximum available resource allocation quantity usable by data dependency modules that can execute in parallel is dynamically adjusted. The scheduling method therefore selects the optimal scheduling strategy first for the data dependency modules whose data dependency complexity matters most to global reasoning, and then for those with lower complexity. At the same time, the complex interactions that closely related operators within a data dependency module exert on scheduling decisions are fully considered, yielding a scheduling strategy with stronger overall performance and lower reasoning delay.
The three-dimensional scheduling strategy search space consisting of the operator division mode, the resource allocation mode, and the placement position of resources in the core particles is considered comprehensively, and the intra-operator and inter-operator data transmission overhead is added to the total reasoning overhead used to determine the scheduling strategy, so the dimensions covered by the scheduling method are wider. Combined with the scheduling granularity, multiple important factors that affect the reasoning delay of a neural network on the core particles, such as the computation overhead and the congestion overhead generated by core particle multi-level routing, are considered comprehensively, so the scheduling method performs better and the finally determined scheduling strategy has lower reasoning delay.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a modular scheduling method for neural network reasoning in a core particle provided by an embodiment;
FIG. 2 is a schematic diagram of core particle resources configured for neural network reasoning, provided by an embodiment;
FIG. 3 is an array of embedded neural network processors in a core particle provided by an embodiment;
FIG. 4 is a simplified version of an exemplary ResNeXt principal constituent structure provided by an embodiment;
FIG. 5 is a schematic view of operator depth in a computational graph provided by an embodiment;
FIGS. 6 and 7 are schematic diagrams of data dependency modules provided by embodiments;
fig. 8 is a schematic structural diagram of a modular scheduling apparatus for neural network reasoning in a core particle provided by an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
In order to improve the neural network reasoning performance in the core grain and reduce the reasoning total cost, the embodiment provides a modularized scheduling method for the neural network reasoning in the core grain, as shown in fig. 1, which comprises the following steps:
step 1, a scheduling strategy search space for neural network reasoning in the core particle is obtained.
In an embodiment, obtaining the scheduling policy search space for neural network reasoning in the core particle mainly includes: acquiring the division modes and resource allocation modes of the operators in the neural network and the placement positions of the obtained resources in the core particles; the division modes and resource allocation modes of the operators and the placement positions of the obtained resources in the core particles together form the scheduling strategy search space.
Obtaining the division modes of each type of operator in the neural network comprises the following steps: representing each type of operator in the neural network as V_c, where c denotes the operator type and V_c contains all operators belonging to class c; the division modes of each operator class V_c form a division mode set T_c, the division mode set T_c contains a plurality of possible division modes t_i for dividing class-c operators, i denotes the index of a division mode, and each division mode t_i describes the partitionable dimensions of the class-c operator and the mapping scheme of each dimension onto the core particle. It should be noted that each division mode is customized in advance, and each division mode serves as a possible component of the subsequent operator scheduling policy.
In an embodiment, the core particle resources configured for neural network reasoning are shown in FIG. 2; the core particles are communicatively connected with other core particles through routers, and the embedded neural network processors (NPUs) arranged in an array in each core particle are shown in FIG. 3. Based on these core particle resources, obtaining the resource allocation modes of each operator in the neural network includes: obtaining the total number R* of core particle resources that can be allocated to operators in the neural network, where R* is expressed as M×N, and M and N respectively denote the total number of rows and the total number of columns of embedded neural network processors across all core particles; and obtaining the sub-core-particle arrays r_j of all possible array sizes that can be allocated to an operator, where j is the index of a sub-core-particle array, each sub-core-particle array represents a resource allocation quantity assigned to an operator, the resource allocation quantity is expressed as m×n, and m and n respectively denote the number of rows and the number of columns of embedded neural network processors in the sub-core-particle array r_j.
Note that a sub-core-particle array refers to a portion of the core particles, as shown by the box containing 4×2 NPUs in FIG. 3, which is allocated to a single operator and indicates the resource allocation quantity configured for that operator. It should be noted that each possible size of sub-core-particle array is predefined, and the resource allocation quantity of each size is one possible component of the subsequent operator scheduling policy.
Obtaining the placement positions of the sub-core-particle arrays in the core particles comprises: for each sub-core-particle array r that can be assigned to a neural network operator, acquiring the set P_r of all possible placement positions p_k in the core particles, where k denotes the index of a placement position; a placement position p_k is expressed by coordinates (x, y), where (x, y) denotes the coordinate position, within the core particles, of the embedded neural network processor located at row 1 and column 1 of the sub-core-particle array r, x is required to satisfy that x+m is not greater than the total number of rows M of embedded neural network processors in the core particles, and y is required to satisfy that y+n is not greater than the total number of columns N of embedded neural network processors in the core particles. It should be noted that the placement positions of the resources obtained by each operator class in the core particles are also predefined, and a placement position likewise serves as a possible component of the subsequent operator scheduling policy. Thus, different division modes, resource allocation quantities, and placement positions constitute the possible scheduling strategies of an operator.
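For illustration only, the sketch below shows one way such a {t, r, p} search space could be enumerated in Python. It is not part of the patent: the class names, the two example division modes, and the 4×8 NPU grid are hypothetical, and only the constraints x+m ≤ M and y+n ≤ N from the description above are encoded (with 0-based coordinates).

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class SubArray:
    m: int  # rows of NPUs in the sub-core-particle array
    n: int  # columns of NPUs in the sub-core-particle array

@dataclass(frozen=True)
class Placement:
    x: int  # row of the array's first NPU within the full M x N NPU grid (0-based)
    y: int  # column of the array's first NPU within the full M x N NPU grid (0-based)

def enumerate_placements(M: int, N: int, sub: SubArray) -> List[Placement]:
    """All placements p_k = (x, y) with x + m <= M and y + n <= N."""
    return [Placement(x, y)
            for x in range(M - sub.m + 1)
            for y in range(N - sub.n + 1)]

def enumerate_policies(partition_modes: List[str],
                       sub_arrays: List[SubArray],
                       M: int, N: int) -> List[Tuple[str, SubArray, Placement]]:
    """A scheduling policy for one operator is a triple {t, r, p}."""
    policies = []
    for t in partition_modes:          # division mode t_i of the operator class
        for r in sub_arrays:           # resource allocation quantity m x n
            for p in enumerate_placements(M, N, r):
                policies.append((t, r, p))
    return policies

# Example: a 4 x 8 NPU grid (R* = 32) and two candidate sub-array sizes.
if __name__ == "__main__":
    space = enumerate_policies(["split_rows", "split_cols"],
                               [SubArray(4, 2), SubArray(2, 2)], M=4, N=8)
    print(len(space), "candidate {t, r, p} policies for one operator")
```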
And step 2, acquiring a computational graph of the neural network, and generating operator depth and dividing serial groups according to the computational graph.
In the embodiment, each neural network corresponds to a unique computational graph G for reasoning. The computational graph is a directed acyclic graph represented by a node set V and an edge set E, where the nodes represent operators of the neural network and the edges between nodes represent data dependency relationships between the operators. Illustratively, taking the neural network ResNeXt as an example, the computational graph obtained for a simplified version of the main ResNeXt structure is shown in FIG. 5, where v1-v28 are operators and the directed arrows between operators represent data dependency relationships.
In an embodiment, when generating the operator depths of the neural network from the computational graph, the computational graph may be traversed by a breadth-first search algorithm. During the traversal, for the currently traversed node, it is checked whether all of its predecessor nodes have been marked with a depth: if there is no predecessor with an unmarked depth, or the node has no predecessors, the depth d of the current node v is marked; if there is a predecessor with an unmarked depth, the depth of the current node is not marked and the traversal continues with the next node. In this way the operator depth of every operator in the neural network is determined; as shown in FIG. 5, R1-R10 represent the operator depths, the operator depth of operator v1 is R1, and the operator depths of operators v11-v14 are R5.
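A minimal Python sketch of this depth-marking pass follows. The graph representation and function name are illustrative assumptions, and the rule that a node's depth is one plus the maximum depth of its predecessors is inferred from the levels R1-R10 in FIG. 5 rather than stated explicitly in the text.

```python
from collections import deque
from typing import Dict, List

def mark_operator_depths(preds: Dict[str, List[str]],
                         succs: Dict[str, List[str]]) -> Dict[str, int]:
    """Mark a depth for every operator of a DAG computational graph.

    A node is assigned a depth only after all of its predecessors have one;
    nodes without predecessors get depth 1 (level R1 in FIG. 5).
    """
    depth: Dict[str, int] = {}
    queue = deque(v for v in preds if not preds[v])   # input operators
    while queue:
        v = queue.popleft()
        if preds[v] and any(u not in depth for u in preds[v]):
            continue                                   # revisited later via another edge
        depth[v] = 1 + max((depth[u] for u in preds[v]), default=0)
        for w in succs.get(v, []):
            if w not in depth:
                queue.append(w)
    return depth

# Tiny example: v1 -> v2 -> {v3, v4}, both into v5.
preds = {"v1": [], "v2": ["v1"], "v3": ["v2"], "v4": ["v2"], "v5": ["v3", "v4"]}
succs = {"v1": ["v2"], "v2": ["v3", "v4"], "v3": ["v5"], "v4": ["v5"], "v5": []}
print(mark_operator_depths(preds, succs))  # {'v1': 1, 'v2': 2, 'v3': 3, 'v4': 3, 'v5': 4}
```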
In an embodiment, dividing the neural network operators into serial groups according to the computational graph includes: dividing serial groups according to the out-degree and in-degree of nodes in the computational graph, where each serial group consists of a single node whose out-degree and in-degree are both 1, or of a plurality of consecutive nodes whose out-degree and in-degree are both 1; two nodes are consecutive when one node is a predecessor or successor of the other. Specifically, all nodes in the computational graph can be traversed by a depth-first search algorithm. During the traversal, for the currently traversed node, if it does not satisfy the condition that both its out-degree and in-degree are 1, the traversal continues with the next node; if it does satisfy the condition, it is further judged whether its predecessor node also satisfies the condition. If the predecessor satisfies the condition, the current node is added to the serial group containing the predecessor; if the predecessor does not satisfy the condition, a new serial group is created and the current node is added to the newly created serial group.
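The following Python sketch mirrors this grouping step under the same illustrative graph representation as above; the function name and the small example graph are assumptions, not part of the patent.

```python
from typing import Dict, List

def divide_serial_groups(order: List[str],
                         preds: Dict[str, List[str]],
                         succs: Dict[str, List[str]]) -> List[List[str]]:
    """Group consecutive nodes whose in-degree and out-degree are both 1.

    `order` is a traversal order of the computational graph (e.g. produced by a
    depth-first search); a node joins its predecessor's serial group when that
    predecessor also has in-degree and out-degree equal to 1.
    """
    group_of: Dict[str, int] = {}
    groups: List[List[str]] = []
    for v in order:
        if len(preds[v]) != 1 or len(succs[v]) != 1:
            continue                                  # not a serial node
        p = preds[v][0]
        if len(preds.get(p, [])) == 1 and len(succs.get(p, [])) == 1 and p in group_of:
            g = group_of[p]                           # extend the predecessor's group
        else:
            g = len(groups)                           # create a new serial group
            groups.append([])
        groups[g].append(v)
        group_of[v] = g
    return groups

# Chain v1 -> v2 -> v3 -> v4 with an extra edge v1 -> v4:
preds = {"v1": [], "v2": ["v1"], "v3": ["v2"], "v4": ["v3", "v1"]}
succs = {"v1": ["v2", "v4"], "v2": ["v3"], "v3": ["v4"], "v4": []}
print(divide_serial_groups(["v1", "v2", "v3", "v4"], preds, succs))  # [['v2', 'v3']]
```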
And step 3, dividing the computational graph according to the data dependency relationship, the operator depth and the serial group to obtain a data dependency module and a parallel data dependency module.
In an embodiment, dividing a computational graph to obtain a data dependency module according to a data dependency relationship between operators presented by the computational graph, operator depth and a serial group includes:
(a) Dividing the computational graph into one-to-many modules comprises: dividing a node in the computational graph whose in-degree equals 1 and whose out-degree is greater than 1, together with its successor nodes of all adjacent depths, into the same one-to-many module; taking the minimum depth of the operators corresponding to the nodes in the one-to-many module as the depth of the one-to-many module; all one-to-many modules form the set DD_0. Here, adjacent depth is understood as a depth differing from that of the current node by 1 in the computational graph; as shown in FIG. 5, the adjacent-depth successor nodes of the node corresponding to operator v2 are v3, v4, v5, and v6. As shown in FIG. 6, #2 (v2, v3, v4, v5, v6) is a one-to-many module consisting of v2, v3, v4, v5, and v6.
(b) Dividing the computational graph into many-to-one modules comprises: dividing a node in the computational graph whose in-degree is greater than 1 and whose out-degree equals 1, together with its predecessor nodes of all adjacent depths, into the same many-to-one module; taking the minimum depth of the nodes in the many-to-one module as the depth of the many-to-one module; all many-to-one modules form the set DD_1. As shown in FIG. 6, #12 (v28, v24, v25, v26, v27) is a many-to-one module consisting of v28, v24, v25, v26, and v27.
(c) Dividing the computational graph into many-to-many modules comprises: dividing a node in the computational graph whose in-degree is greater than 1 and whose out-degree is greater than 1, together with all its predecessor and successor nodes, into the same many-to-many module; taking the minimum depth of the nodes in the many-to-many module as the depth of the many-to-many module; all many-to-many modules form the set DD_2. As shown in FIG. 6, #7 (v15, v11, v12, v13, v14, v16, v17, v18, v19) is a many-to-many module consisting of v15, v11, v12, v13, v14, v16, v17, v18, and v19.
(d) Dividing the computational graph into parallel one-to-many modules comprises: combining the one-to-many modules in the set DD_0 that have the same depth into a parallel one-to-many module, taking the minimum depth of the nodes in the parallel one-to-many module as the depth of the parallel one-to-many module, and meanwhile deleting from DD_0 the one-to-many modules that participate in the combination; all parallel one-to-many modules form the set DD_3. As shown in FIG. 7, #1 (v1, v2, v3, v4) is a parallel one-to-many module composed of v1, v2, v3, and v4, where v1, v2, v3 form one one-to-many module and v2, v3, v4 form another one-to-many module.
(e) Dividing the computational graph into one-to-one modules comprises: traversing the serial groups and creating a one-to-one module for each serial group, adding all operators in the serial group into the one-to-one module, and taking the minimum depth of the nodes in the one-to-one module as the depth of the one-to-one module; if the predecessor of the minimum-depth node in the serial group has in-degree greater than 1 and out-degree equal to 1, adding that predecessor into the one-to-one module corresponding to the serial group, and if the successor of the maximum-depth node in the serial group has in-degree equal to 1 and out-degree greater than 1, adding that successor into the one-to-one module corresponding to the serial group; traversing the set DD_1, and if there is a many-to-one module with 3 nodes whose minimum-depth node has in-degree greater than 1 and out-degree equal to 1, creating two one-to-one modules, adding the minimum-depth node of the many-to-one module and its successor into one of them, adding the maximum-depth node of the many-to-one module and its predecessor into the other, taking the minimum depth of the nodes in each one-to-one module as its depth, and meanwhile deleting the many-to-one module from DD_1; all one-to-one modules form the set DD_4. As shown in FIG. 6, #1 (v1, v2) is a one-to-one module consisting of v1 and v2.
(f) The one-to-many modules, many-to-one modules, many-to-many modules, parallel one-to-many modules, and one-to-one modules are collectively called data dependency modules, and merging the sets DD_0, DD_1, DD_2, DD_3, and DD_4 generates the data dependency module set DD.
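As a rough illustration of this division, the sketch below classifies nodes into the three basic module kinds from their in-degree and out-degree; the parallel one-to-many and one-to-one cases follow the same pattern and are omitted for brevity. The data structures and names are assumptions, not the patent's implementation.

```python
from typing import Dict, List, Set, Tuple

Module = Tuple[str, Set[str], int]   # (kind, member nodes, module depth)

def divide_basic_modules(preds: Dict[str, List[str]],
                         succs: Dict[str, List[str]],
                         depth: Dict[str, int]) -> List[Module]:
    """One-to-many, many-to-one and many-to-many module division.

    A node with in-degree 1 and out-degree > 1 forms a one-to-many module with
    its adjacent-depth successors; in-degree > 1 and out-degree 1 forms a
    many-to-one module with its adjacent-depth predecessors; in-degree > 1 and
    out-degree > 1 forms a many-to-many module with its predecessors and
    successors.  The module depth is the minimum depth of its nodes.
    """
    modules: List[Module] = []
    for v in depth:
        din, dout = len(preds[v]), len(succs[v])
        if din == 1 and dout > 1:
            kind = "one-to-many"
            members = {v} | {w for w in succs[v] if depth[w] == depth[v] + 1}
        elif din > 1 and dout == 1:
            kind = "many-to-one"
            members = {v} | {u for u in preds[v] if depth[u] == depth[v] - 1}
        elif din > 1 and dout > 1:
            kind = "many-to-many"
            members = {v} | set(preds[v]) | set(succs[v])
        else:
            continue
        modules.append((kind, members, min(depth[m] for m in members)))
    return modules
```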
In an embodiment, dividing the computational graph into parallel data-dependent modules according to the data-dependent modules and the computational graph includes:
traversing the computational graph of the neural network, and for a data dependent module dd with depth d where a current traversing operator is located, executing the following steps:
if the parallel data dependency module with the depth d exists or other data dependency modules with the depth d do not exist, continuing to traverse the next data dependency module;
if the parallel data dependency module with the depth d does not exist, and if other data dependency modules with the depth d and the data dependency module dd do not intersect, dividing a plurality of data dependency modules which do not contain the same operator and are at the depth d into the same parallel data dependency module;
and after the traversal is finished, outputting a parallel data dependency module set PDD formed by the parallel data dependency modules.
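The grouping of same-depth, operator-disjoint modules into parallel data dependency modules could be sketched as follows; this is a simplified illustration under assumed data structures rather than the patent's exact procedure.

```python
from typing import Dict, List, Set

def divide_parallel_modules(modules: List[Set[str]],
                            module_depth: Dict[int, int]) -> List[List[int]]:
    """Group data dependency modules of equal depth that share no operator.

    `modules[i]` is the operator set of module i and `module_depth[i]` its
    depth; modules of the same depth whose operator sets are pairwise disjoint
    are placed in the same parallel data dependency module (PDD).
    """
    pdd_by_depth: Dict[int, List[int]] = {}
    for i, ops in enumerate(modules):
        d = module_depth[i]
        group = pdd_by_depth.setdefault(d, [])
        if all(ops.isdisjoint(modules[j]) for j in group):
            group.append(i)
    return [g for g in pdd_by_depth.values() if len(g) > 1]

# Two disjoint depth-2 modules and one overlapping module:
mods = [{"v2", "v3"}, {"v4", "v5"}, {"v3", "v6"}]
print(divide_parallel_modules(mods, {0: 2, 1: 2, 2: 2}))  # [[0, 1]]
```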
And 4, determining the data dependency complexity of the data dependency module, and calculating the initial constraint of iterative search of the scheduling strategy.
In an embodiment, edges between nodes in the data dependency module represent data dependency relationships between nodes, and based on the data dependency relationships, the data dependency complexity of the data dependency module is determined by ordering according to the number of edges between the nodes and the connection mode of the edges. The embodiment defines 4 rules for data dependent complexity ordering, including: rule 1 is: the data dependence complexity of one-to-many modules, many-to-one modules, many-to-many modules and parallel one-to-many modules is higher than that of one-to-one modules; rule 2 is: in all data dependency modules except one-to-one modules, the data dependency complexity with a large number of operators is high; rule 3 is: for a one-to-one module, the data dependence complexity of a large number of operators is high; rule 4 is: for the data dependent modules with the same operator number, the complexity of the data dependent module traversed by adopting the breadth-first search algorithm in the computational graph is higher.
According to the 4 rules, the data dependency complexity ordering of the data dependency module includes:
(a) The data dependency complexity ordering of the data dependency modules includes: first performing a one-stage ordering according to rule 1; then performing a two-stage ordering according to rule 2; next performing a three-stage ordering according to rule 3; and finally performing a four-stage ordering according to rule 4 to obtain the ordered data dependency module set DD*. For the computational graph shown in FIG. 6, the order determined according to data dependency complexity is #7→ #2→ #12→ #3→ #4 … → #12→ #1.
(b) By traversing the ordered data dependency module set DD* and tagging each data dependency module therein with its data dependency complexity, the data dependency module set DD* marked with the data dependency complexity order is obtained.
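A compact way to express the four rules is a multi-key sort, sketched below. The tie-breaking key for rule 4 (a module's discovery rank in a breadth-first traversal) is one reading of that rule and, like all the names used, is an assumption.

```python
from typing import Dict, List, Set

def order_by_complexity(modules: List[Set[str]],
                        kind: Dict[int, str],
                        bfs_rank: Dict[int, int]) -> List[int]:
    """Order data dependency modules from highest to lowest complexity.

    Rule 1: non one-to-one modules rank above one-to-one modules.
    Rules 2/3: within each class, more operators means higher complexity.
    Rule 4: ties are broken by earlier discovery in a breadth-first traversal
    of the computational graph (`bfs_rank`, smaller is earlier).
    """
    def key(i: int):
        is_one_to_one = 1 if kind[i] == "one-to-one" else 0
        return (is_one_to_one, -len(modules[i]), bfs_rank[i])
    return sorted(range(len(modules)), key=key)

mods = [{"v1", "v2"}, {"v15", "v11", "v12", "v13", "v14"}, {"v2", "v3", "v4"}]
kinds = {0: "one-to-one", 1: "many-to-many", 2: "one-to-many"}
print(order_by_complexity(mods, kinds, {0: 0, 1: 6, 2: 1}))  # [1, 2, 0]
```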
On the basis of obtaining the data dependency complexity, calculating the maximum available resource allocation quantity of operators according to the data dependency complexity, the parallel data dependency module and the total core resource number, wherein the maximum available resource allocation quantity is used as an initial constraint of iterative search of a scheduling strategy. Wherein calculating the maximum available resource allocation number of the operator comprises:
Traversing the data dependency module set DD* marked with the data dependency complexity order; for the currently traversed data dependency module dd, if the maximum available resource allocation quantity has already been determined for all operators in dd, continuing to traverse the next data dependency module, and if there are operators in dd whose maximum available resource allocation quantity is undetermined, executing the following steps:
firstly, a parallel data dependency module pdd where a data dependency module dd is located is obtained, and a first operator proportion of the number of operators of the data dependency module dd to the total number of operators of the parallel data dependency module pdd is calculated;
Then, multiplying the total number of core particle resources R* by the first operator proportion, and taking the product as the maximum available resource allocation quantity of the operators in the data dependency module dd whose maximum available resource allocation quantity is undetermined. The maximum available resource allocation quantity serves as the initial constraint for the iterative search of the scheduling strategy and is used to constrain the scheduling strategies of the operators.
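The sketch below restates this budgeting step in Python; it assumes every data dependency module is mapped to some parallel data dependency module (a module not merged with others can be treated as a singleton), and all identifiers are illustrative.

```python
from typing import Dict, List, Set

def initial_resource_constraints(ordered_modules: List[int],
                                 module_ops: Dict[int, Set[str]],
                                 pdd_of_module: Dict[int, int],
                                 pdd_ops: Dict[int, Set[str]],
                                 total_resources: int) -> Dict[str, int]:
    """Initial constraint: maximum available resource allocation per operator.

    Each module's undetermined operators receive R* scaled by the module's
    share of the operators in its parallel data dependency module (the first
    operator proportion).  Modules are visited in decreasing data dependency
    complexity order.
    """
    max_alloc: Dict[str, int] = {}
    for m in ordered_modules:
        undecided = [v for v in module_ops[m] if v not in max_alloc]
        if not undecided:
            continue
        share = len(module_ops[m]) / len(pdd_ops[pdd_of_module[m]])
        budget = int(total_resources * share)      # first operator proportion * R*
        for v in undecided:
            max_alloc[v] = budget
    return max_alloc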
And 5, searching the optimal modularized scheduling strategy according to the scheduling strategy searching space and the initial constraint iteration.
In an embodiment, the data dependency module scheduling strategy that minimizes the sum of the computation overhead, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core particle multi-level routing, found by iterative search over the scheduling strategy search space under the initial constraint, is taken as the optimal modularized scheduling strategy. The search comprises:
(a) Taking the maximum available resource allocation quantity calculated in step 4 as the initial constraint, and carrying out an iterative search according to the initial constraint to determine the optimal scheduling strategy of each data dependency module.
During the iterative search, the initial constraint used to bound the operator scheduling strategies is dynamically updated. The iterative search includes: traversing the data dependency module set DD* marked with the data dependency complexity order, and for the currently traversed data dependency module dd, iteratively performing the following steps:
In each round of traversal, if the data dependency module dd is in a parallel data dependency module pdd and there is a data dependency module in pdd for which the scheduling strategies of all operators have been determined, obtaining the sum r* of the operator resource allocation quantities of the data dependency modules whose scheduling strategies have been fully determined; calculating a second operator proportion, namely the proportion of the number of operators with undetermined scheduling strategies in the data dependency modules of pdd whose scheduling strategies are undetermined to the total number of remaining operators of pdd; multiplying the difference between the total number of core particle resources R* and r* by the second operator proportion; and dynamically adjusting, according to the multiplication result, the maximum available resource quantity m_0×n_0 of the operators with undetermined scheduling strategies in the data dependency modules of pdd whose scheduling strategies are undetermined, where m_0 denotes the number of rows of embedded neural network processors and n_0 denotes the number of columns of embedded neural network processors;
In each round of traversal, searching all possible scheduling strategies of each operator in the data dependency module dd from the scheduling strategy search space, where a scheduling strategy is expressed as {t, r, p}, t denotes the division mode of operator v, r denotes the resource allocation quantity obtained by operator v, and p denotes the placement position, within the core particles, of the resources obtained by operator v; meanwhile, the resource allocation quantity allocated to operator v is constrained such that the total resource allocation quantity m×n does not exceed the maximum available resource quantity m_0×n_0 of the operator, m does not exceed m_0, and n does not exceed n_0;
In each search of each round of traversal, generating a scheduling strategy combination corresponding to the data dependency module dd from all possible scheduling strategies of each operator in dd, and calculating the total reasoning overhead of dd under that scheduling strategy combination, where the total reasoning overhead is the sum of the computation overhead generated by dd during reasoning, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core particle multi-level routing;
in the embodiment, the calculation cost of a single operator is calculated by adopting the formula (1), the calculation cost of all operators in the data dependence module dd is added up to obtain the calculation cost generated in the reasoning process of the module,
[Formula (1), the computation overhead d_v of a single operator, appears only as an image in the original publication and is not reproduced here.]
where d_v denotes the computation overhead of the single operator v; assuming the data sizes involved in the computation are p×k and k×q respectively, corresponding portions of the computation are allocated to each NPU, and the size of the systolic array on each NPU is w×w.
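Because formula (1) is only available as an image, the sketch below substitutes a generic systolic-array tiling estimate so that the quantities p, k, q, and w above have something concrete attached to them; it is an illustrative stand-in, not the patent's formula.

```python
import math

def systolic_matmul_cycles(p: int, k: int, q: int, w: int) -> int:
    """Stand-in estimate for one NPU computing a (p x k) x (k x q) product on
    a w x w systolic array: each w x w output tile streams k accumulation
    steps plus a fill/drain of roughly 2w cycles.  This is NOT formula (1)
    from the patent; it is a generic tiling estimate used only to illustrate
    how the variables interact.
    """
    tiles = math.ceil(p / w) * math.ceil(q / w)
    return tiles * (k + 2 * w)

# Example: a per-NPU block of 64 x 256 times 256 x 32 on a 16 x 16 array.
print(systolic_matmul_cycles(64, 256, 32, 16))  # 8 tiles * (256 + 32) = 2304
```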
In an embodiment, the data transmission overhead is divided into a data transmission overhead of a single operator and a data transmission overhead between operator pairs (v, u) having a data dependency relationship.
For the data transmission overhead of a single operator v, the inputs are the one-hot encoding s_v of the partitioning strategy of operator v, the intra-operator communication overhead matrix R_v, the intra-core-particle bandwidth b_1, the inter-core-particle bandwidth b_2, and the mapping strategy m_v; the partitioning strategy and the mapping strategy are determined by the division modes of operators in the scheduling strategy search space and are continuously updated during the iteration. The intra-operator data transmission overhead C_v is calculated and output from these inputs according to formula (2):
[Formula (2), the intra-operator data transmission overhead C_v, appears only as an image in the original publication and is not reproduced here.]
For the data transmission overhead between an operator pair (v, u) with a data dependency, the inputs are the one-hot encoding s_v of the partitioning strategy of operator v, the communication delay matrix R_(v,u) between the operator pair (v, u), and the one-hot encoding s_u of the partitioning strategy of operator u; the inter-operator data transmission overhead C_(v,u) is then calculated and output according to formula (3):
[Formula (3), the inter-operator data transmission overhead C_(v,u), appears only as an image in the original publication and is not reproduced here.]
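Since formulas (2) and (3) are images, the sketch below only illustrates one natural reading of formula (3): with one-hot encodings s_v and s_u, selecting the entry of the communication delay matrix R_(v,u) that corresponds to the chosen partitions amounts to the bilinear form s_v^T R_(v,u) s_u. This reading is an assumption about the formula's structure, not a reproduction of it.

```python
import numpy as np

def inter_operator_transfer_cost(s_v: np.ndarray,
                                 R_vu: np.ndarray,
                                 s_u: np.ndarray) -> float:
    """Assumed reading of formula (3): the one-hot encodings pick out one
    entry of the communication delay matrix, i.e. s_v^T R_(v,u) s_u."""
    return float(s_v @ R_vu @ s_u)

# Operator v uses partition 1 of 3 candidates, operator u uses partition 0 of 2.
s_v = np.array([0.0, 1.0, 0.0])
s_u = np.array([1.0, 0.0])
R_vu = np.array([[10.0, 40.0], [25.0, 15.0], [30.0, 5.0]])
print(inter_operator_transfer_cost(s_v, R_vu, s_u))  # 25.0
```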
In an embodiment, the congestion overhead includes the congestion overhead of a single operator and the congestion overhead between operator pairs (v, u) having data dependencies.
For the single-operator congestion overhead, according to the one-hot encoding s_v of the partitioning strategy and the mapping strategy m_v, the product of the intra-core-particle route forwarding average hop count h_1 and the route forwarding cycle count r_1 and the product of the inter-core-particle route forwarding average hop count h_2 and the route forwarding cycle count r_2 are calculated respectively and then summed to obtain the route forwarding delay within operator v. A specific calculation example follows:
Specifically, taking the one-hot encoding s_v3 = (0, 1) of the partitioning strategy as an example, the division mode it represents requires that the result of each NPU participating in the computation be reduced to the storage unit corresponding to the NPU with the largest coordinate. From the mapping strategy m_v3 = (submesh_0(0, 1, 0), (0, 1, 0, 3)) it is known that operator v3 is computed only on the (0, 0), (0, 1), (0, 2), and (0, 3) NPUs of core particle (0, 1), and the computation result on each NPU needs to be carried to the storage unit corresponding to (0, 3). Then, on core particle (0, 1), the route forwarding hop count for the (0, 0) NPU to transmit its computation result to the storage unit corresponding to the (0, 3) NPU is 3, the hop count for the (0, 1) NPU is 2, and the hop count for the (0, 2) NPU is 1. Assuming 100 data transfers in total, the intra-core-particle route forwarding average hop count can be calculated as h_1 = 100×(3+2+1)/3 = 200. The intra-core-particle route forwarding cycle count is determined by the routing components employed in the core particle; for example, with r_1 = 1, the product of the intra-core-particle route forwarding average hop count h_1 and the route forwarding cycle count r_1 is 200. Under the current strategy there is no data transmission between core particles, so the inter-core-particle route forwarding average hop count is h_2 = 0, and the product of h_2 and the route forwarding cycle count r_2 is therefore 0. Summing the two products gives a route forwarding delay within operator v3 of 200. This step can calculate the single-operator congestion overhead under a given scheduling strategy.
For the congestion overhead between an operator pair (v, u) with a data dependency, according to the one-hot encodings s_v and s_u of the partitioning strategies and the mapping strategies m_v and m_u, the product of the intra-core-particle route forwarding average hop count h_1 and the route forwarding cycle count r_1 and the product of the inter-core-particle route forwarding average hop count h_2 and the route forwarding cycle count r_2 are calculated respectively and then summed to obtain the route forwarding delay between the operator pair (v, u). A specific calculation example follows:
Specifically, taking the one-hot encodings s_v1 and s_v3 of the partitioning strategies and the mapping strategies m_v1 and m_v3 as an example, the data transmission between the operators is as follows: the computation result of NPU (0, 0, 0, 3) needs to be carried to the storage unit corresponding to NPU (0, 1, 0, 0), with an inter-core-particle route forwarding hop count of 1 and an intra-core-particle route forwarding hop count of 2; to the storage unit corresponding to NPU (0, 1, 0, 1), with an inter-core-particle hop count of 1 and an intra-core-particle hop count of 3; to the storage unit corresponding to NPU (0, 1, 0, 2), with an inter-core-particle hop count of 1 and an intra-core-particle hop count of 4; and to the storage unit corresponding to NPU (0, 1, 0, 3), with an inter-core-particle hop count of 1 and an intra-core-particle hop count of 5. Assuming 100 data transfers in total, the intra-core-particle route forwarding average hop count is h_1 = 100×(2+3+4+5)/4 = 350, and the inter-core-particle route forwarding average hop count is h_2 = 100×(1+1+1+1)/4 = 100. The intra-core-particle route forwarding cycle count is determined by the routing components employed within a core particle, and the inter-core-particle route forwarding cycle count by the routing components employed between core particles; for example, with r_1 = 1 and r_2 = 5, the product of h_1 and r_1 is 350 and the product of h_2 and r_2 is 500. Summing the two products gives a route forwarding delay between the operator pair (v1, v3) of 850. This step can calculate the congestion overhead between an operator pair (v, u) under a given scheduling strategy.
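The arithmetic of both examples reduces to the same small computation, sketched here for reference; the function and parameter names are illustrative.

```python
def route_forwarding_delay(intra_hops, inter_hops, transfers,
                           intra_cycles_per_hop, inter_cycles_per_hop):
    """Route forwarding delay = h_1 * r_1 + h_2 * r_2, where h_1 and h_2 are
    the transfer-weighted average intra- and inter-core-particle hop counts.
    Mirrors the worked examples above."""
    h1 = transfers * sum(intra_hops) / len(intra_hops)
    h2 = transfers * sum(inter_hops) / len(inter_hops)
    return h1 * intra_cycles_per_hop + h2 * inter_cycles_per_hop

# Single-operator example (v3): hops 3, 2, 1 inside core particle (0, 1).
print(route_forwarding_delay([3, 2, 1], [0, 0, 0], 100, 1, 0))        # 200.0
# Operator-pair example (v1, v3): intra hops 2..5, one inter-core hop each.
print(route_forwarding_delay([2, 3, 4, 5], [1, 1, 1, 1], 100, 1, 5))  # 850.0
```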
After the iteration ends, the scheduling strategy combination with the minimum total inference overhead is selected for the data dependency module dd as its optimal scheduling strategy.
(b) The optimal scheduling strategies of all data dependency modules are integrated to output the scheduling strategy set of the computational graph, which records the optimal scheduling strategy adopted by each operator.
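As a sketch of how this selection might be realized (assumed interfaces; the cost function stands in for the compute, data-transfer and congestion overhead model described above):

# Editor's illustrative sketch: enumerate scheduling strategy combinations of one
# data dependency module and keep the combination with minimum total overhead.
from itertools import product

def best_combination(per_operator_strategies, total_cost):
    """per_operator_strategies: list (one entry per operator) of candidate {t, r, p}
    strategies; total_cost: callable returning the module's total inference overhead
    for a full combination (assumed interface, not the patent's API)."""
    best, best_cost = None, float("inf")
    for combo in product(*per_operator_strategies):
        cost = total_cost(combo)
        if cost < best_cost:
            best, best_cost = combo, cost
    return best, best_cost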
Based on the same inventive concept, an embodiment further provides a modular scheduling apparatus for neural network reasoning in a core particle, as shown in fig. 8, including:
the acquisition unit is used for acquiring a scheduling strategy search space for neural network reasoning in the core particle;
the serial group dividing unit is used for acquiring the computational graph of the neural network, generating operator depths according to the computational graph, and dividing the operators into serial groups;
the data dependency module dividing unit is used for dividing the computational graph into data dependency modules and parallel data dependency modules according to the data dependency relationships, the operator depths and the serial groups;
the data dependency complexity determination unit is used for determining the data dependency complexity of the data dependency modules and calculating the initial constraint of the iterative search of the scheduling strategy;
the iterative search unit is used for iteratively searching the optimal modularized scheduling strategy according to the scheduling strategy search space and the initial constraint.
It should be noted that, when the modularized scheduling apparatus for neural network reasoning in a core particle provided in the above embodiment performs modularized scheduling of neural network reasoning in a core particle, the division into the functional units described above is merely illustrative; in practical applications, the above functions may be allocated to different functional units as required, that is, the internal structure of the terminal or server may be divided into different functional units to complete all or part of the functions described above. In addition, the modularized scheduling apparatus for neural network reasoning in a core particle provided in the above embodiment and the embodiment of the modularized scheduling method for neural network reasoning in a core particle belong to the same concept; the specific implementation process of the apparatus is detailed in the method embodiment and is not repeated here.
This embodiment also provides a computing device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing, when executing the computer program, the steps of the modularized scheduling method for neural network reasoning in a core particle provided in the above embodiment.
In practical applications, the memory may be a near-end volatile memory such as a RAM, a non-volatile memory such as a ROM, a FLASH, a floppy disk or a mechanical hard disk, or a remote storage cloud. The processor may be a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP) or a field programmable gate array (FPGA); that is, the steps of the modularized scheduling method for neural network reasoning in a core particle can be implemented by these processors.
The foregoing detailed description of preferred embodiments and the advantages described are merely illustrative of presently preferred embodiments of the invention and are not intended to limit it; any changes, additions, substitutions and equivalents made within the spirit and scope of the invention shall be included within its scope of protection.

Claims (7)

1. A modularized scheduling method for neural network reasoning in a core particle, characterized by comprising the following steps:
Acquiring a scheduling policy search space for neural network reasoning in the core particle, comprising: obtaining a division mode of operators in the neural network, a resource allocation mode containing the total number of core particle resources, and a placement position of the obtained resources in the core particles;
acquiring a computational graph of the neural network, wherein the computational graph is represented by nodes and edges, the nodes represent operators of the neural network, and the edges represent data dependency relations among the operators of the neural network;
generating neural network operator depths according to the computational graph;
dividing the neural network operators into serial groups according to the computational graph, comprising: dividing the serial groups according to the out-degree and in-degree of the nodes in the computational graph, wherein each serial group consists of one node whose out-degree and in-degree are both 1, or of a plurality of consecutive nodes whose out-degree and in-degree are both 1, two nodes being consecutive when one node is a preceding node or a succeeding node of the other;
dividing the computational graph into data dependency modules according to the data dependency relationships among operators presented by the computational graph, the operator depths and the serial groups, comprising:
dividing the computational graph into one-to-many modules, comprising: dividing a node whose in-degree equals 1 and whose out-degree is greater than 1 in the computational graph, together with its succeeding nodes of all adjacent depths, into the same one-to-many module, taking the minimum depth of the nodes in the one-to-many module as the depth of the one-to-many module, all one-to-many modules forming a set DD_0;
dividing the computational graph into many-to-one modules, comprising: dividing a node whose in-degree is greater than 1 and whose out-degree equals 1 in the computational graph, together with its preceding nodes of all adjacent depths, into the same many-to-one module, taking the minimum depth of the nodes in the many-to-one module as the depth of the many-to-one module, all many-to-one modules forming a set DD_1;
dividing the computational graph into many-to-many modules, comprising: dividing a node whose in-degree is greater than 1 and whose out-degree is greater than 1 in the computational graph, together with all of its preceding and succeeding nodes, into the same many-to-many module, taking the minimum depth of the nodes in the many-to-many module as the depth of the many-to-many module, all many-to-many modules forming a set DD_2;
dividing the computational graph into parallel one-to-many modules, comprising: merging the one-to-many modules of the same depth in the set DD_0 into a parallel one-to-many module, taking the minimum depth of the nodes in the parallel one-to-many module as the depth of the parallel one-to-many module, and at the same time deleting from the set DD_0 the one-to-many modules participating in the merging, all parallel one-to-many modules forming a set DD_3;
dividing the computational graph into one-to-one modules, comprising: traversing the serial groups, creating a one-to-one module for each serial group, adding all operators in the serial group into the one-to-one module, and taking the minimum depth of the nodes in the one-to-one module as the depth of the one-to-one module; if the preceding node of the minimum-depth node in the serial group has an in-degree greater than 1 and an out-degree equal to 1, adding the preceding node into the one-to-one module corresponding to the serial group; if the succeeding node of the maximum-depth node in the serial group has an in-degree equal to 1 and an out-degree greater than 1, adding the succeeding node into the one-to-one module corresponding to the serial group; traversing the set DD_1, and if there is a many-to-one module with 3 nodes whose minimum-depth node has an in-degree greater than 1 and an out-degree equal to 1, creating two one-to-one modules, adding the minimum-depth node of the many-to-one module and its succeeding node into one one-to-one module, adding the maximum-depth node of the many-to-one module and its preceding node into the other one-to-one module, taking the minimum depth of the nodes in each one-to-one module as the depth of that one-to-one module, and at the same time deleting the many-to-one module from the set DD_1; all one-to-one modules form a set DD_4;
the one-to-many modules, many-to-one modules, many-to-many modules, parallel one-to-many modules and one-to-one modules are collectively called data dependency modules, and the sets DD_0, DD_1, DD_2, DD_3 and DD_4 are merged to generate a data dependency module set DD;
dividing the computational graph into parallel data dependency modules according to the data dependency modules and the computational graph, comprising: traversing the computational graph of the neural network, and for the data dependency module dd of depth d in which the currently traversed operator is located, executing the following steps: if a parallel data dependency module of depth d already exists, or if no other data dependency module of depth d exists, continuing to traverse the next data dependency module; if no parallel data dependency module of depth d exists and other data dependency modules of depth d do not intersect the data dependency module dd, dividing the several data dependency modules of depth d that contain no common operator into the same parallel data dependency module; after the traversal ends, outputting the parallel data dependency module set PDD formed by the parallel data dependency modules;
Determining the data dependence complexity of a data dependence module, and calculating the maximum available resource allocation quantity of operators according to the data dependence complexity, the parallel data dependence module and the total number of core particle resources, wherein the maximum available resource allocation quantity is used as an initial constraint of iterative search of a scheduling strategy;
and iteratively searching for the optimal modularized scheduling strategy according to the scheduling strategy search space and the initial constraint, so as to minimize the sum of the computation overhead, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core-particle multi-level routing.
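Purely as an editorial illustration of the degree-based division described in claim 1 (not part of the claims; the graph representation and helper name are assumptions), the seed classification could be sketched as:

# Editor's illustrative sketch: classify computational-graph nodes into seeds of
# one-to-many, many-to-one and many-to-many data dependency modules by degree.
def classify_seeds(in_deg, out_deg):
    """in_deg / out_deg: dicts mapping node -> in-degree / out-degree."""
    one_to_many, many_to_one, many_to_many = [], [], []
    for v in in_deg:
        if in_deg[v] == 1 and out_deg[v] > 1:
            one_to_many.append(v)        # seed of a one-to-many module (set DD_0)
        elif in_deg[v] > 1 and out_deg[v] == 1:
            many_to_one.append(v)        # seed of a many-to-one module (set DD_1)
        elif in_deg[v] > 1 and out_deg[v] > 1:
            many_to_many.append(v)       # seed of a many-to-many module (set DD_2)
    return one_to_many, many_to_one, many_to_many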
2. The modularized scheduling method for neural network reasoning in a core particle according to claim 1, characterized in that obtaining the division modes of each type of operator in the neural network comprises: representing each type of operator in the neural network as V_c, where c denotes the operator type and V_c contains all operators belonging to class c; the division modes of each operator class V_c form a division mode set T_c, the division mode set T_c containing a plurality of possible division modes t_i for dividing class-c operators, where i denotes the index of the division mode and the division mode t_i describes the divisible dimensions of the class-c operator and the mapping scheme of each dimension onto the core particle;
obtaining the resource allocation modes for each type of operator in the neural network comprises: obtaining the total number of core particle resources R* allocated to operators in the neural network, R* being expressed as M x N, where M and N respectively represent the total number of rows and the total number of columns of embedded neural network processors in all core particles; and obtaining sub-core arrays r_j of all possible array sizes that can be allocated to an operator, where j is the index of the sub-core array, each sub-core array representing a resource allocation quantity for an operator, the resource allocation quantity being expressed as m x n, with m and n respectively representing the number of rows and the number of columns of embedded neural network processors in the sub-core array r_j;
obtaining the placement positions of the sub-core arrays in the core particles comprises: for each sub-core array r that can be assigned to a neural network operator, acquiring the set P_r of all possible placement positions p_k in the core particles, where k denotes the index of the placement position; p_k is expressed by coordinates (x, y), where (x, y) represents the coordinate position in the core particle of the embedded neural network processor located at row 1, column 1 of the sub-core array r, x being required to satisfy that x + m is not greater than the total number of rows M of embedded neural network processors in the core particle, and y being required to satisfy that y + n is not greater than the total number of columns N of embedded neural network processors in the core particle.
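As an aside (editor's sketch, not claim text), the placement constraint of claim 2 simply enumerates every top-left coordinate that keeps an m x n sub-core array inside the M x N mesh:

# Editor's illustrative sketch: legal placement positions of an m x n sub-core array
# in an M x N mesh of embedded neural network processors (x + m <= M, y + n <= N).
def placement_positions(M, N, m, n):
    return [(x, y) for x in range(M - m + 1) for y in range(N - n + 1)]

# Example: a 2 x 2 sub-array in a 4 x 4 mesh has 9 legal positions
print(len(placement_positions(4, 4, 2, 2)))  # -> 9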
3. The modularized scheduling method for neural network reasoning in a core particle according to claim 1, wherein determining the data dependency complexity of the data dependency modules comprises:
ordering the data dependency modules by data dependency complexity, comprising: first, performing a one-stage ordering according to rule 1, rule 1 being: the data dependency complexity of one-to-many modules, many-to-one modules, many-to-many modules and parallel one-to-many modules is higher than that of one-to-one modules; then, performing a two-stage ordering according to rule 2, rule 2 being: among all data dependency modules other than one-to-one modules, a module with more operators has higher data dependency complexity; next, performing a three-stage ordering according to rule 3, rule 3 being: among one-to-one modules, a module with more operators has higher data dependency complexity; and finally, performing a four-stage ordering according to rule 4 to obtain an ordered data dependency module set DD*, rule 4 being: among data dependency modules with the same number of operators, the data dependency module traversed earlier by the breadth-first search algorithm in the computational graph has higher complexity;
marking the data dependency complexity order for each data dependency module by traversing the ordered data dependency module set DD*.
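An editorial sketch of the four-rule ordering in claim 3 (the module fields are assumed names, not the patent's data structures):

# Editor's illustrative sketch: sort data dependency modules by the four rules of
# claim 3; higher complexity sorts first.
def complexity_key(module):
    """module: assumed dict with 'kind' (e.g. 'one_to_one'), 'num_ops' and
    'bfs_rank' (position in a breadth-first traversal of the computational graph)."""
    is_one_to_one = 1 if module["kind"] == "one_to_one" else 0      # rule 1
    return (is_one_to_one, -module["num_ops"], module["bfs_rank"])  # rules 2-4

def order_by_complexity(modules):
    return sorted(modules, key=complexity_key)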
4. The modularized scheduling method for neural network reasoning in a core particle according to claim 1, wherein calculating the maximum available resource allocation quantity of operators based on the data dependency complexity, the parallel data dependency modules and the total number of core particle resources comprises:
traversing the set DD* of data dependency modules marked with the data dependency complexity order; for the currently traversed data dependency module dd, if all operators in the data dependency module dd have determined their maximum available resource allocation quantity, continuing to traverse the next data dependency module; and if there are operators in the data dependency module dd whose maximum available resource allocation quantity is undetermined, executing the following steps:
firstly, a parallel data dependency module pdd where a data dependency module dd is located is obtained, and a first operator proportion of the number of operators of the data dependency module dd to the total number of operators of the parallel data dependency module pdd is calculated;
then, calculating the product of the total number of core particle resources R* and the first operator proportion, and taking the product result as the maximum available resource allocation quantity of the operators whose maximum available resource allocation quantity is undetermined in the data dependency module dd.
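An editorial sketch of the initial budget of claim 4 (variable names and example numbers are assumptions):

# Editor's illustrative sketch: maximum available resource allocation quantity for
# the undetermined operators of a data dependency module dd (claim 4).
def max_available_resources(total_resources, ops_in_dd, total_ops_in_pdd):
    """total_resources: R* = M x N NPUs; the first operator proportion is the share
    of dd's operators among all operators of its parallel data dependency module pdd."""
    first_ratio = ops_in_dd / total_ops_in_pdd
    return int(total_resources * first_ratio)

# Example with assumed numbers: 64 NPUs, dd holds 4 of the 16 operators in pdd -> 16
print(max_available_resources(64, 4, 16))  # -> 16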
5. The modularized scheduling method for neural network reasoning in a core particle according to claim 1, wherein iteratively searching for the optimal modularized scheduling strategy according to the scheduling strategy search space and the initial constraint, so as to minimize the sum of the computation overhead, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core-particle multi-level routing, comprises:
firstly, performing an iterative search according to the initial constraint to determine the optimal scheduling strategy of each data dependency module, comprising: traversing the set DD* of data dependency modules marked with the data dependency complexity order, and for the currently traversed data dependency module dd, iteratively performing the following steps:
in each round of traversal, if the data dependency module dd is in a parallel data dependency module pdd and there is a data dependency module in pdd that has determined the scheduling strategies of all of its operators, obtaining the sum r* of the operator resource allocation quantities of the data dependency modules whose scheduling strategies are fully determined, calculating a second operator proportion of the number of operators with undetermined scheduling strategies in the data dependency modules with undetermined scheduling strategies in pdd to the total number of remaining operators of pdd, multiplying the difference between the total number of core particle resources R* and r* by the second operator proportion, and dynamically adjusting, according to the multiplication result, the maximum available resource quantity m_0 x n_0 of the operators with undetermined scheduling strategies in the data dependency modules with undetermined scheduling strategies in pdd, where m_0 represents the number of rows of embedded neural network processors and n_0 represents the number of columns of embedded neural network processors;
in each round of traversal, searching all possible scheduling strategies of each operator in the data dependency module dd from the scheduling strategy search space, a scheduling strategy being expressed as {t, r, p}, where t represents the division mode of operator v, r represents the resource allocation quantity obtained by operator v, and p represents the placement position in the core particle of the resources obtained by operator v; the resource allocation quantity allocated to operator v is limited such that the total resource allocation quantity m x n is not more than the operator's maximum available quantity m_0 x n_0, m is not more than m_0, and n is not more than n_0;
in each search of each round of traversal, generating a scheduling strategy combination corresponding to the data dependency module dd according to all possible scheduling strategies of each operator in the data dependency module dd, and calculating the total inference overhead of the data dependency module dd under the scheduling strategy combination, the total inference overhead being the sum of the computation overhead generated by the data dependency module dd in the inference process, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core-particle multi-level routing;
after iteration is finished, selecting a scheduling strategy combination with minimum total reasoning cost for the data dependent module dd as an optimal scheduling strategy;
and then, integrating the optimal scheduling strategies of each data dependency module to output a scheduling strategy set of the computational graph, wherein the scheduling strategy set records the optimal scheduling strategy adopted by each operator.
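Finally, an editorial sketch of claim 5's dynamic re-budgeting during the iterative search (names and example numbers are assumptions):

# Editor's illustrative sketch: dynamically adjusted resource budget for operators
# whose scheduling strategies are still undetermined inside pdd (claim 5).
def adjusted_budget(total_resources, already_allocated, undecided_ops_in_dd,
                    remaining_ops_in_pdd):
    """total_resources: R*; already_allocated: r*, NPUs taken by modules of pdd whose
    strategies are fully determined; the second operator proportion splits the rest."""
    second_ratio = undecided_ops_in_dd / remaining_ops_in_pdd
    return int((total_resources - already_allocated) * second_ratio)

# Example with assumed numbers: R* = 64, r* = 16, 6 of 12 remaining operators -> 24
print(adjusted_budget(64, 16, 6, 12))  # -> 24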
6. A modular scheduling apparatus for neural network reasoning in a core particle, comprising:
an acquisition unit, for acquiring a scheduling policy search space for neural network reasoning in the core particle, comprising: obtaining the division modes of operators in the neural network, a resource allocation mode containing the total number of core particle resources, and the placement positions of the obtained resources in the core particles;
the serial group dividing unit is used for obtaining the computational graph of the neural network, wherein the computational graph is represented by nodes and edges, the nodes represent operators of the neural network and the edges represent data dependency relationships among the operators of the neural network; and for generating neural network operator depths from the computational graph and dividing the neural network operators into serial groups according to the computational graph, comprising: dividing the serial groups according to the out-degree and in-degree of the nodes in the computational graph, wherein each serial group consists of one node whose out-degree and in-degree are both 1, or of a plurality of consecutive nodes whose out-degree and in-degree are both 1, two nodes being consecutive when one node is a preceding node or a succeeding node of the other;
the data dependency module dividing unit is configured to divide the computational graph into data dependency modules according to the data dependency relationships among operators presented by the computational graph, the operator depths and the serial groups, comprising: dividing the computational graph into one-to-many modules, comprising: dividing a node whose in-degree equals 1 and whose out-degree is greater than 1 in the computational graph, together with its succeeding nodes of all adjacent depths, into the same one-to-many module, taking the minimum depth of the nodes in the one-to-many module as the depth of the one-to-many module, all one-to-many modules forming a set DD_0;
dividing the computational graph into many-to-one modules, comprising: dividing a node whose in-degree is greater than 1 and whose out-degree equals 1 in the computational graph, together with its preceding nodes of all adjacent depths, into the same many-to-one module, taking the minimum depth of the nodes in the many-to-one module as the depth of the many-to-one module, all many-to-one modules forming a set DD_1;
dividing the computational graph into many-to-many modules, comprising: dividing a node whose in-degree is greater than 1 and whose out-degree is greater than 1 in the computational graph, together with all of its preceding and succeeding nodes, into the same many-to-many module, taking the minimum depth of the nodes in the many-to-many module as the depth of the many-to-many module, all many-to-many modules forming a set DD_2;
dividing the computational graph into parallel one-to-many modules, comprising: merging the one-to-many modules of the same depth in the set DD_0 into a parallel one-to-many module, taking the minimum depth of the nodes in the parallel one-to-many module as the depth of the parallel one-to-many module, and at the same time deleting from the set DD_0 the one-to-many modules participating in the merging, all parallel one-to-many modules forming a set DD_3;
dividing the computational graph into one-to-one modules, comprising: traversing the serial groups, creating a one-to-one module for each serial group, adding all operators in the serial group into the one-to-one module, and taking the minimum depth of the nodes in the one-to-one module as the depth of the one-to-one module; if the preceding node of the minimum-depth node in the serial group has an in-degree greater than 1 and an out-degree equal to 1, adding the preceding node into the one-to-one module corresponding to the serial group; if the succeeding node of the maximum-depth node in the serial group has an in-degree equal to 1 and an out-degree greater than 1, adding the succeeding node into the one-to-one module corresponding to the serial group; traversing the set DD_1, and if there is a many-to-one module with 3 nodes whose minimum-depth node has an in-degree greater than 1 and an out-degree equal to 1, creating two one-to-one modules, adding the minimum-depth node of the many-to-one module and its succeeding node into one one-to-one module, adding the maximum-depth node of the many-to-one module and its preceding node into the other one-to-one module, taking the minimum depth of the nodes in each one-to-one module as the depth of that one-to-one module, and at the same time deleting the many-to-one module from the set DD_1; all one-to-one modules form a set DD_4;
the one-to-many modules, many-to-one modules, many-to-many modules, parallel one-to-many modules and one-to-one modules are collectively called data dependency modules, and the sets DD_0, DD_1, DD_2, DD_3 and DD_4 are merged to generate a data dependency module set DD;
the method is also used for dividing the computational graph into parallel data dependence modules according to the data dependence modules and the computational graph, and comprises the following steps: traversing the computational graph of the neural network, and for a data dependent module dd with depth d where a current traversing operator is located, executing the following steps: if the parallel data dependency module with the depth d exists or other data dependency modules with the depth d do not exist, continuing to traverse the next data dependency module; if the parallel data dependency module with the depth d does not exist, and if other data dependency modules with the depth d and the data dependency module dd do not intersect, dividing a plurality of data dependency modules which do not contain the same operator and are at the depth d into the same parallel data dependency module, and outputting a parallel data dependency module set PDD formed by the parallel data dependency modules after traversing is finished;
The data dependence complexity determining unit is used for determining the data dependence complexity of the data dependence module, and calculating the maximum available resource allocation quantity of operators according to the data dependence complexity, the parallel data dependence module and the total core resource number, wherein the maximum available resource allocation quantity is used as an initial constraint of iterative search of a scheduling strategy;
the iterative search unit is used for iteratively searching the optimal modularized scheduling strategy according to the scheduling strategy search space and the initial constraint, so as to minimize the sum of the computation overhead, the intra-operator and inter-operator data transmission overhead, and the congestion overhead generated by core-particle multi-level routing.
7. A computing device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the modularized scheduling method for neural network reasoning in a core particle of any of claims 1-5.
CN202211425389.5A 2022-11-14 2022-11-14 Modularized scheduling method, device and computing equipment for neural network reasoning in core particle Active CN115658274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211425389.5A CN115658274B (en) 2022-11-14 2022-11-14 Modularized scheduling method, device and computing equipment for neural network reasoning in core particle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211425389.5A CN115658274B (en) 2022-11-14 2022-11-14 Modularized scheduling method, device and computing equipment for neural network reasoning in core particle

Publications (2)

Publication Number Publication Date
CN115658274A CN115658274A (en) 2023-01-31
CN115658274B true CN115658274B (en) 2023-06-06

Family

ID=85021920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211425389.5A Active CN115658274B (en) 2022-11-14 2022-11-14 Modularized scheduling method, device and computing equipment for neural network reasoning in core particle

Country Status (1)

Country Link
CN (1) CN115658274B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115829017B (en) * 2023-02-20 2023-05-23 之江实验室 Method, device, medium and equipment for processing data based on core particles
CN115860081B (en) * 2023-03-01 2023-05-26 之江实验室 Core algorithm scheduling method, system, electronic equipment and storage medium
CN116523045B (en) * 2023-03-13 2023-11-07 之江实验室 Deep learning reasoning simulator oriented to multi-core chip
CN116151315B (en) * 2023-04-04 2023-08-15 之江实验室 Attention network scheduling optimization method and device for on-chip system
CN117155792B (en) * 2023-10-30 2024-01-12 中诚华隆计算机技术有限公司 Inter-core communication dynamic bandwidth adjustment method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506496A (en) * 2020-11-21 2021-03-16 中国人民解放军战略支援部队信息工程大学 Method and system for building system-on-chip development environment
CN113487029A (en) * 2021-08-05 2021-10-08 杭州电子科技大学 Transplantable neural network distributed parallel strategy searching method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377035A (en) * 2012-04-12 2013-10-30 浙江大学 Pipeline parallelization method for coarse-grained streaming application
CN112149369B (en) * 2020-09-21 2024-04-05 交叉信息核心技术研究院(西安)有限公司 Multi-core packaging level system based on core particle architecture and core particle-oriented task mapping method thereof
US20220206875A1 (en) * 2020-12-24 2022-06-30 Intel Corporation Software visible and controllable lock-stepping with configurable logical processor granularities
CN114780227B (en) * 2022-06-20 2022-09-23 中国人民解放军国防科技大学 Task scheduling mapping method and system based on core granular network processor architecture
CN115186821B (en) * 2022-09-13 2023-01-06 之江实验室 Core particle-oriented neural network inference overhead estimation method and device and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506496A (en) * 2020-11-21 2021-03-16 中国人民解放军战略支援部队信息工程大学 Method and system for building system-on-chip development environment
CN113487029A (en) * 2021-08-05 2021-10-08 杭州电子科技大学 Transplantable neural network distributed parallel strategy searching method

Also Published As

Publication number Publication date
CN115658274A (en) 2023-01-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant