CN116842992A - Operator fusion method and system for an image-recognition deep neural network

Operator fusion method and system for an image-recognition deep neural network

Info

Publication number: CN116842992A
Application number: CN202310524971.5A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: fusion, operator, candidate, operators, scheme
Legal status: Pending
Priority and filing date: 2023-05-10
Inventors: 时洋, 王家男, 陈照云, 邓灿, 赵宵磊, 文梅, 王耀华, 扈啸
Current Assignee: National University of Defense Technology
Original Assignee: National University of Defense Technology
Application filed by National University of Defense Technology


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06N3/086 Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses an operator fusion method and system for image-recognition deep neural networks. The method partitions the computation graph of an input deep neural network model into computation subgraphs; for each subgraph it randomly constructs multiple candidate fusion schemes, tests them on hardware, and selects the optimal one. Within a candidate scheme, two consecutive complex operators are fused by loop reorganization: the loops of the latter operator are restructured so that the outer loops of both operators follow the dimensions and shape of the intermediate result tensor, after which the outer loops are merged directly. The optimal fusion schemes of all computation subgraphs are then combined into an optimal fusion execution scheme for the whole model. The method and system fuse the complex operators that existing frameworks cannot fuse, improve the execution efficiency of image recognition tasks through complex-operator fusion and whole-network fusion, and relieve the shortage of cache resources when image recognition tasks are deployed on hardware.

Description

Operator fusion method and system for an image-recognition deep neural network
Technical Field
The invention relates to program code optimization in the field of image recognition, and in particular to an operator fusion method and system for image-recognition deep neural networks.
Background
With the continuous development of machine learning, network models represented by deep neural networks (DNNs) have made significant progress in image recognition applications. To meet ever higher accuracy requirements, image-recognition networks keep growing deeper, and some models now reach thousands of layers. This growing scale places higher computing-power and storage demands on existing hardware. Deep learning compilers, a compilation optimization approach developed in recent years, offer a new optimization direction for deploying and executing image-recognition deep neural network models. Compared with traditional manual optimization, a deep learning compiler takes a unified computation graph representation as input, contains different kinds of computation graph optimizations with automated optimization mechanisms, and can connect to different backends for deployment on hardware devices. For a deep learning compiler, operator fusion improves the efficiency of executing image-recognition neural networks on the target hardware while easing the pressure on hardware caches. Operator fusion is a principal optimization in computation graph optimization; its main purpose is to combine several small operators into one large operator. With hardware resources unchanged, operator fusion relieves the storage pressure of a deep neural network model during execution and improves its efficiency.
Current research on operator fusion mainly covers two aspects: the operator fusion mechanism and the model-level fusion strategy. The fusion mechanism decides whether two adjacent operators in the neural network can be fused and which fusion method to use; the model-level fusion strategy decides which operators in the whole model are fused and how. For most image-recognition neural networks, the operators in the network fall into two classes: simple operators and complex operators. A simple operator (element-wise operator) is an operator whose input and output tensors have a one-to-one element mapping. A complex operator (non-element-wise operator) is an operator with dependencies between elements of its input and output tensors; the mapping is not one-to-one but many-to-one, many-to-many, or one-to-many, as shown in Table 1.
Table 1: examples of simple operators and complex operators.
The basic idea of current fusion methods is the same as traditional loop fusion in an optimizing compiler: combine multiple loop computations into one loop computation. Existing neural network compilation and execution frameworks support only two fusion patterns: fusion of multiple consecutive simple operators, and fusion of a simple operator with a complex operator. Fusion between two complex operators is considered algorithmically complicated and unprofitable, and is therefore not supported. Yet existing image-recognition neural networks contain a large number of adjacent complex operators, which greatly reduces the fusion opportunities in the whole network. As a result, most mainstream image-recognition DNN inference frameworks support only fusing simple operators with other operators, while fusion between complex operators remains unsupported. On the mechanism side, existing frameworks impose many restrictions and support only specific fusion patterns and specific operator types. These limitations leave a large number of adjacent complex operators unfused, and the loss of fusion opportunities limits how far the execution of image recognition tasks can be optimized.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, the invention provides an operator fusion method and system for image-recognition deep neural networks that can fuse the complex operators which existing frameworks cannot fuse, improve the execution efficiency of image recognition tasks through complex-operator fusion and whole-network fusion, and relieve the shortage of cache resources when image recognition tasks are deployed on hardware.
In order to solve the technical problems, the invention adopts the following technical scheme:
An operator fusion method for an image-recognition deep neural network comprises the following steps:
S101, partitioning the computation graph of an input deep neural network model into computation subgraphs;
S102, randomly constructing a plurality of candidate fusion schemes for each computation subgraph, testing the candidate fusion schemes on hardware to obtain their execution times, and selecting the candidate fusion scheme with the shortest execution time as the optimal fusion scheme; the operator fusion types in a candidate fusion scheme include fusion of two consecutive simple operators, fusion of two consecutive complex operators, and fusion of a consecutive simple operator and complex operator, wherein the fusion of two consecutive complex operators is realized by loop-reorganizing the latter complex operator so that the outer loops of the two complex operators are unified to the dimensions and shape of the intermediate result tensor, and then directly merging the outer loops;
S103, combining the optimal fusion schemes of the computation subgraphs to obtain the optimal fusion execution scheme of the deep neural network model.
Optionally, loop-reorganizing the latter complex operator so that the outer loops of the two complex operators are unified to the dimensions and shape of the intermediate result tensor, and then directly merging the outer loops, comprises:
S201, for two consecutive complex operators to be fused, keeping the computation strategy of the former complex operator unchanged and, by loop reorganization, modifying the computation strategy of the latter complex operator into an input-tensor-based computation strategy, in which the internal mapping is the mapping from a single input scalar to the whole output tensor and the outer loops follow the operator's input tensor; in this way the outer loops of the computation strategies of the two complex operators are unified to the dimensions and shape of the intermediate result tensor between them;
S202, after the outer loops of the two complex operators' computation strategies have been unified to the dimensions and shape of their intermediate result tensor, directly merging the outer loops to obtain the fused code.
Optionally, step S202 further comprises a correctness check of the fused code: for the same input tensors of the former complex operator, the computation result of the two unfused complex operators is taken as the reference value and compared with the computation result of the fused code; if the results are identical, the correctness test is judged to have passed, otherwise it is judged to have failed, in which case the operator fusion pattern in the candidate fusion scheme is readjusted and the correctness test repeated until every fusion of two consecutive complex operators in the candidate fusion scheme has passed.
Optionally, when the candidate fusion scheme is tested on hardware, fusion code optimization is further performed on the basic code of the candidate fusion scheme before the test; the fusion code optimization includes some or all of parallelization, loop unrolling, and vectorization.
Optionally, when the candidate fusion scheme is tested on hardware, sparse optimization of the complex operators in the candidate fusion scheme is further performed before the test: the activation function located between two complex operators in the candidate fusion scheme is rewritten as a condition that controls whether computation is executed, so that the latter complex operator performs its function mapping on an intermediate data scalar only when the intermediate data scalar produced by the former complex operator satisfies the activation condition of the activation function; otherwise the latter complex operator skips the function mapping for that intermediate data scalar.
Optionally, when the candidate fusion scheme is tested on hardware, sub-operator fusion optimization of the complex operators in the candidate fusion scheme is further performed before the test: complex operators that contain sub-operators are detected in the candidate fusion scheme, and for each sub-operator of such a complex operator, if the sub-operator can be fused with a sub-operator or simple operator of an adjacent operator, the two are fused.
Optionally, in step S102, when a plurality of candidate fusion schemes are randomly constructed for each computation subgraph, the candidate fusion schemes are tested on hardware to obtain execution times, and the candidate fusion scheme with the shortest execution time is selected as the optimal fusion scheme, the processing for each computation subgraph comprises:
S301, randomly generating M initial candidate fusion schemes for the computation subgraph based on a predefined base population size M to obtain an initial candidate fusion scheme test set, and testing each candidate fusion scheme in the initial test set on hardware to obtain its execution time;
S302, updating the candidate fusion schemes in the candidate fusion scheme test set;
S303, testing the updated candidate fusion schemes in the candidate fusion scheme test set on hardware to obtain their execution times;
S304, judging whether the several candidate fusion schemes with the shortest execution times in the updated test set remain unchanged; if so, the candidate fusion scheme test set is judged to be stable, the candidate fusion scheme with the shortest execution time is selected from the stable test set as the optimal fusion scheme of the computation subgraph, and processing moves to the next computation subgraph or exits; otherwise, the method jumps back to step S302.
Optionally, updating the candidate fusion schemes in the candidate fusion scheme test set in step S302 comprises: sorting the candidate fusion schemes in the test set by execution time and deleting the part with longer execution times; and, for the remaining candidate fusion schemes, generating new candidate fusion schemes with a genetic algorithm according to a predefined mutation probability P_mutation, the new candidate fusion schemes and the remaining candidate fusion schemes together forming the updated candidate fusion scheme test set.
In addition, the invention provides an operator fusion system for an image-recognition deep neural network, comprising a microprocessor and a memory connected to each other, the microprocessor being programmed or configured to execute the above operator fusion method for an image-recognition deep neural network.
Furthermore, the invention provides a computer-readable storage medium storing a computer program, the computer program being used to program or configure a microprocessor to execute the above operator fusion method for an image-recognition deep neural network.
Compared with the prior art, the invention has the following advantages:
1. For the fusion of two consecutive complex operators, the invention uses loop reorganization of the latter complex operator to unify the outer loops of the two operators to the dimensions and shape of the intermediate result tensor, and then merges the outer loops directly. This solves the problem that complex operators in image-recognition deep neural networks cannot be fused, improves the execution efficiency of image recognition tasks through complex-operator fusion, and relieves the shortage of cache resources when image recognition tasks are deployed on hardware.
2. The invention randomly constructs multiple candidate fusion schemes for each computation subgraph, tests them on hardware to obtain their execution times, and selects the scheme with the shortest execution time as the optimal fusion scheme. Because the operator fusion types in a candidate scheme include fusion of two consecutive simple operators, fusion of two consecutive complex operators, and fusion of consecutive simple and complex operators, all-round operator fusion can be generated automatically for a deep neural network model and the optimal fusion scheme for the network can be found. This improves the execution efficiency of the image-recognition neural network, reduces the burden on developers, further improves the execution efficiency of the image recognition task, and relieves the shortage of cache resources when image recognition tasks are deployed on hardware.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the invention.
Fig. 2 is a schematic diagram of complex operator fusion and subsequent optimization and test flow of a candidate fusion scheme in the first embodiment of the invention.
FIG. 3 is a detailed flow chart of a method according to an embodiment of the invention.
Fig. 4 is a flowchart of a search mechanism for generating an optimal fusion scheme of multiple candidate fusion schemes in the second embodiment of the present invention.
Detailed Description
Embodiment one:
As shown in Fig. 1 and Fig. 3, the operator fusion method for an image-recognition deep neural network in this embodiment comprises:
S101, partitioning the computation graph of an input deep neural network model into computation subgraphs;
S102, randomly constructing a plurality of candidate fusion schemes for each computation subgraph, testing the candidate fusion schemes on hardware to obtain their execution times, and selecting the candidate fusion scheme with the shortest execution time as the optimal fusion scheme; the operator fusion types in a candidate fusion scheme include fusion of two consecutive simple operators, fusion of two consecutive complex operators, and fusion of a consecutive simple operator and complex operator, wherein the fusion of two consecutive complex operators is realized by loop-reorganizing the latter complex operator so that the outer loops of the two complex operators are unified to the dimensions and shape of the intermediate result tensor, and then directly merging the outer loops;
S103, combining the optimal fusion schemes of the computation subgraphs to obtain the optimal fusion execution scheme of the deep neural network model.
Referring to steps S101 to S103, the operator fusion framework adopted by the operator fusion method for an image-recognition deep neural network in this embodiment has two parts: a complex operator fusion mechanism based on loop reorganization, and a search mechanism that generates the optimal fusion scheme from multiple candidate fusion schemes.
A: complex operator fusion mechanism based on loop reorganization.
During neural network execution, because the data are multidimensional and the corresponding algorithms are loop-based, each operator is executed as a multi-level loop nest. The computation strategy of an operator is usually determined by the dimensions of its output data, i.e. the outer loops of the multi-level loop nest follow the dimensions of the operator's output. In a simple operator there are no dependencies between data elements, i.e. the mapping from input data to output data is one-to-one, so the input and output of a simple operator have the same dimensions. Because simple operators have no cross-element data dependencies, adjacent simple operators can be loop-merged directly: their identical outer loops are combined into a new operator, which does not affect correctness but reduces data movement and improves computation efficiency. Complex operators, by contrast, have different input and output data dimensions because of their special computation properties, and there are dependencies between data elements; a complex operator can therefore only be fused with an adjacent simple operator, while adjacent complex operators cannot be fused directly because their computation strategies, i.e. their outer loops, are inconsistent. The fundamental obstacle to complex operator fusion is the inconsistent multi-level loops produced by output-centric computation strategies. Current image-recognition neural network compilation and execution frameworks treat each complex operator as a separate producer-consumer loop nest with its own independent computation strategy. On this basis, the present embodiment reconsiders the computation strategies as a whole to solve the problem that two complex operators cannot be fused and to improve execution efficiency after fusion. Because of the internal mapping and memory access pattern of a complex operator, the same input data are read repeatedly when consecutive output scalars are computed; one input scalar therefore has a mapping relationship with multiple output scalars. If, during computation, a mapping function is found that maps a single input scalar to all the output data associated with it, repeated reads of the input data can be avoided. In such a computation strategy for a complex operator, each scalar of the input data is mapped, as a function, onto the whole output tensor, and the outer loops follow the data dimensions of the input tensor. This can be described as an input-centric computation strategy: the internal computation function is the mapping function from each individual input scalar to the whole output tensor, and the outer loops follow the shape and dimensions of the input tensor.
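As an illustration only (a minimal Python/NumPy sketch written for this description, not code from the patent), the two strategies can be contrasted on a single matrix multiplication D = M × C, where M plays the role of the intermediate tensor produced by a preceding operator: the output-centric loop nest iterates over the output D, while the input-centric loop nest iterates over the input M and scatters each input scalar M[m][n] to every output element it contributes to.

    import numpy as np

    def gemm_output_centric(M, C):
        # Output-centric strategy: the outer loops follow the output D (m, l);
        # each output scalar gathers all the input scalars it depends on.
        m_dim, n_dim = M.shape
        _, l_dim = C.shape
        D = np.zeros((m_dim, l_dim))
        for m in range(m_dim):
            for l in range(l_dim):
                for n in range(n_dim):
                    D[m][l] += M[m][n] * C[n][l]
        return D

    def gemm_input_centric(M, C):
        # Input-centric strategy: the outer loops follow the input M (m, n);
        # each input scalar M[m][n] is read once and mapped to all output
        # scalars D[m][*] that depend on it.
        m_dim, n_dim = M.shape
        _, l_dim = C.shape
        D = np.zeros((m_dim, l_dim))
        for m in range(m_dim):
            for n in range(n_dim):
                for l in range(l_dim):
                    D[m][l] += M[m][n] * C[n][l]
        return D

Both loop nests compute the same D; only the tensor that drives the outer loops differs, and it is exactly this change that lets the second operator share its outer loops with the operator that produces M.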
For two adjacent complex operators, the computation strategy of the former complex operator (the output-centric strategy) is kept, while the computation strategy of the latter complex operator is modified into the input-centric strategy. The computation strategies of the two adjacent complex operators are then both based on the dimensions of the intermediate data between them, and their loops can be merged directly. Specifically, in this embodiment, loop-reorganizing the latter complex operator so that the outer loops of the two complex operators are unified to the dimensions and shape of the intermediate result tensor, and then directly merging the outer loops, comprises:
S201, for two consecutive complex operators to be fused, keeping the computation strategy of the former complex operator unchanged and, by loop reorganization, modifying the computation strategy of the latter complex operator into an input-tensor-based computation strategy, in which the internal mapping is the mapping from a single input scalar to the whole output tensor and the outer loops follow the operator's input tensor; in this way the outer loops of the computation strategies of the two complex operators are unified to the dimensions and shape of the intermediate result tensor between them;
S202, after the outer loops of the two complex operators' computation strategies have been unified to the dimensions and shape of their intermediate result tensor, directly merging the outer loops to obtain the fused code.
Example code 1 gives the execution code of the GEMM-ReLU-GEMM operator group after fusion.
In example code 1 the numbers on the left are line numbers. The inputs are tensor A (dimensions M×K), tensor B (dimensions K×N) and tensor C (dimensions N×L). In the fused code, the output-centric computation strategy of the first GEMM computes one intermediate scalar M[m][n] at a time, the ReLU activation is applied to it (implemented as a max comparison with 0), and the input-centric strategy of the second GEMM then maps M[m][n] to the associated output scalars D[m][l]; the loops finally produce the output tensor D (dimensions M×L).
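The original listing of example code 1 is not reproduced in this text; the following Python sketch (illustrative only, with names chosen for this description) shows the fused loop structure being described for D = ReLU(A × B) × C: the first GEMM keeps its output-centric strategy and produces one intermediate scalar at a time, ReLU is applied to that scalar, and the second GEMM, rewritten to the input-centric strategy, immediately scatters the activated scalar to all related output elements, so the intermediate tensor is never materialized.

    import numpy as np

    def fused_gemm_relu_gemm(A, B, C):
        # A: (M, K), B: (K, N), C: (N, L)  ->  D: (M, L)
        m_dim, k_dim = A.shape
        _, n_dim = B.shape
        _, l_dim = C.shape
        D = np.zeros((m_dim, l_dim))
        for m in range(m_dim):          # unified outer loops: they follow the
            for n in range(n_dim):      # intermediate tensor of shape (M, N)
                # First GEMM, output-centric: compute one intermediate scalar.
                acc = 0.0
                for k in range(k_dim):
                    acc += A[m][k] * B[k][n]
                # ReLU applied to the intermediate scalar.
                acc = max(acc, 0.0)
                # Second GEMM, input-centric: scatter the scalar to all D[m][l].
                for l in range(l_dim):
                    D[m][l] += acc * C[n][l]
        return D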
As described in step S201, in the loop-reorganization-based complex operator fusion mechanism of this embodiment, each intermediate result scalar is produced by the first complex operator (output-centric computation strategy) and is then, through the second complex operator (input-centric computation strategy), mapped to all the related final output elements without the intermediate result being read repeatedly, so redundant computation is removed. On the other hand, when the computation strategy of the second complex operator is modified, its outer loops follow the dimensions of its input tensor (the intermediate data of the two operators). In this way the two consecutive complex operators share unified outer loops and can be loop-merged directly. The loop-reorganization-based operator fusion method unifies the outer loop axes of two complex operators by modifying the computation strategy of the latter operator: the output-tensor-centric computation strategy is converted into an input-tensor-centric computation strategy so that two consecutive complex operators can be fused. The method is convenient for users and has the advantage of generality.
B: search mechanism that generates the optimal fusion scheme from multiple candidate fusion schemes.
The other important part of the operator fusion framework adopted in this embodiment is the operator fusion strategy for the whole network, i.e. deciding which operators in the whole image-recognition network are fused and how. Because the fusion support of existing fusion methods is limited, their search space is limited. In contrast, the search mechanism of this embodiment that generates the optimal fusion scheme from multiple candidate fusion schemes has three parts: first, the computation graph of the input deep neural network model is partitioned into computation subgraphs; then multiple candidate fusion schemes are constructed randomly for each computation subgraph; finally, the candidate fusion schemes are tested on hardware to obtain their execution times, and the candidate fusion scheme with the shortest execution time is selected as the optimal fusion scheme. Note that when the candidate fusion scheme with the shortest execution time is being selected, the search space of candidate fusion schemes can be expanded from the initial candidates by a genetic algorithm or another optimization search method, improving the quality of the final optimal fusion scheme.
Step S101 partitions the computation graph into computation subgraphs, mainly to reduce the time needed to search for the best scheme. The computation graph of the deep neural network model may be provided directly, or converted from an image-recognition task execution framework such as TVM, ONNX or MNN. The deep neural network model in this embodiment is a deep learning neural network used to classify an input image and obtain its classification result, for example VGG, AlexNet, ResNet or MobileNet. The method of this embodiment clearly does not depend on a specific deep neural network model structure.
As an optional implementation, to guarantee correctness after operator fusion optimization, this embodiment verifies the correctness of the optimization by comparing against the computation result of the unfused consecutive complex operators as a reference. Specifically, step S202 further comprises a correctness check of the fused code: for the same input tensors of the former complex operator, the computation result of the two unfused complex operators is taken as the reference value and compared with the result of the fused code; if they are identical the correctness test passes, otherwise it fails, the operator fusion pattern in the candidate fusion scheme is readjusted, and the correctness test is repeated until every fusion of two consecutive complex operators in the candidate fusion scheme has passed.
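A minimal check of this kind can be sketched as follows (illustrative code only; fused_gemm_relu_gemm stands for the fused routine from the earlier sketch): the unfused GEMM-ReLU-GEMM computation on the same inputs serves as the reference, and the fused result must match it.

    import numpy as np

    def check_fused_correctness(fused_fn, m=8, k=16, n=12, l=10, tol=1e-6):
        # Same input tensors for both the reference and the fused code.
        A = np.random.rand(m, k)
        B = np.random.rand(k, n)
        C = np.random.rand(n, l)
        # Reference value: the two complex operators executed unfused.
        reference = np.maximum(A @ B, 0.0) @ C
        fused = fused_fn(A, B, C)
        # The fusion passes the correctness test only if the results agree.
        return np.allclose(fused, reference, atol=tol)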
Referring to Fig. 2, when a candidate fusion scheme is tested on hardware in this embodiment, fusion code optimization is also applied to the basic code of the candidate fusion scheme before the test. The fusion code optimization includes some or all of parallelization, loop unrolling and vectorization, which makes the code execute more efficiently.
Referring to Fig. 2, when a candidate fusion scheme is tested on hardware in this embodiment, sparse optimization of the complex operators in the candidate fusion scheme is also applied before the test: the activation function located between two complex operators is rewritten as a condition that controls whether computation is executed, so that the latter complex operator performs its function mapping on an intermediate data scalar only when that scalar, produced by the former complex operator, satisfies the activation condition; otherwise the mapping is skipped. Sparse optimization means reducing the redundant computation caused by the sparsity of intermediate data. The sparsity of intermediate data in a neural network is caused by activation functions such as ReLU, which zero out data that do not satisfy the activation condition. Existing neural network compilation and execution frameworks treat activation functions such as ReLU only as ordinary simple operators for fusion and do not exploit their characteristics. Taking these characteristics into account, the activation function between two complex operators is turned into a condition that controls execution: if an intermediate data scalar produced by the former operator satisfies the ReLU activation condition, the latter operator performs the function mapping on it; if it does not, the latter operator skips the mapping. This reduces the redundant computation caused by the sparsity of intermediate data and improves the execution efficiency of the fused function.
Example code 2 gives the sparsity-optimized pseudocode of the GEMM-ReLU-GEMM operator group after fusion.
The inputs in example code 2 are tensor A (dimensions M×K), tensor B (dimensions K×N) and tensor C (dimensions N×L). In the fused code, the output-centric computation strategy computes one intermediate scalar M[m][n] at a time and applies the ReLU activation (a max comparison with 0). If the activated value M[m][n] is greater than 0, the input-centric strategy maps M[m][n] to the associated output scalars D[m][l], and the loops produce the final output tensor D (dimensions M×L); if M[m][n] is not greater than 0, the function mapping is skipped and execution moves on to the next intermediate scalar, since its contribution to the corresponding D[m][l] elements would be zero.
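The patent's example code 2 is likewise not reproduced here; the sketch below (illustrative only) adds the sparse optimization to the fused loop nest shown earlier: the ReLU becomes a branch, and the input-centric mapping of the second GEMM runs only when the intermediate scalar is positive.

    import numpy as np

    def fused_gemm_relu_gemm_sparse(A, B, C):
        # A: (M, K), B: (K, N), C: (N, L)  ->  D: (M, L)
        m_dim, k_dim = A.shape
        _, n_dim = B.shape
        _, l_dim = C.shape
        D = np.zeros((m_dim, l_dim))
        for m in range(m_dim):
            for n in range(n_dim):
                # First GEMM, output-centric: one intermediate scalar.
                acc = 0.0
                for k in range(k_dim):
                    acc += A[m][k] * B[k][n]
                # ReLU rewritten as a control condition: when the scalar would
                # be zeroed, the whole input-centric mapping below is skipped.
                if acc > 0.0:
                    for l in range(l_dim):
                        D[m][l] += acc * C[n][l]
        return D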
Referring to Fig. 2, when a candidate fusion scheme is tested on hardware in this embodiment, sub-operator fusion optimization of the complex operators in the candidate fusion scheme is also applied before the test: complex operators that contain sub-operators are detected in the candidate fusion scheme, and for each sub-operator of such a complex operator, if it can be fused with a sub-operator or simple operator of an adjacent operator, the two are fused. Sub-operator fusion optimization targets complex operators with multiple computation steps, such as the Softmax operator and the BatchNorm operator. Such operators can be split into several small operators; using this property, a complex operator with multiple computation steps is split into several sub-operators, and those sub-operators are then fused with adjacent operators where doing so is cost-effective, improving the execution efficiency of the fused operators. The Softmax operator, for example, can be split into three sub-operators: an exponent sub-operator, a sum sub-operator, and a division sub-operator. In a DNN computation graph where the Softmax operator follows a GEMM operator, the exponent and sum sub-operators of Softmax can be fused with the GEMM operator, which improves the overall execution efficiency of the fused GEMM-Softmax operator group.
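For a GEMM followed by a row-wise Softmax, the idea can be sketched as follows (illustrative code only; the numerically stabilizing max-subtraction of Softmax is omitted for brevity): the exponent and sum sub-operators are fused into the output loop of the GEMM, and only the division sub-operator remains as a separate pass.

    import numpy as np

    def fused_gemm_softmax(A, B):
        # A: (M, K), B: (K, N)  ->  row-wise softmax of A x B, shape (M, N)
        m_dim, k_dim = A.shape
        _, n_dim = B.shape
        E = np.zeros((m_dim, n_dim))
        row_sum = np.zeros(m_dim)
        for m in range(m_dim):
            for n in range(n_dim):
                # The GEMM produces one output scalar ...
                acc = 0.0
                for k in range(k_dim):
                    acc += A[m][k] * B[k][n]
                # ... and the exponent and sum sub-operators of Softmax are
                # fused into the same loop body.
                E[m][n] = np.exp(acc)
                row_sum[m] += E[m][n]
        # The division sub-operator stays as a separate, cheap pass.
        return E / row_sum[:, None]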
The fusion code optimization, sparse optimization and sub-operator fusion optimization described above fall into two categories. One is generic code optimization, including the parallelization, loop unrolling and vectorization mentioned above; the other exploits the properties of individual operators and of specific operator pairs. The method of this embodiment provides several operator-specific and operator-pair optimizations, including sparse optimization, sub-operator optimization and operator-order optimization. A programmer can choose the appropriate optimization according to the characteristics of two consecutive operators to improve the efficiency of the fused execution. Operator-order optimization means adjusting the execution order of operators to reduce redundant computation and improve execution efficiency. It applies, for example, to the GEMM-Dropout-GEMM fusion pattern: before the operators execute, the intermediate data that will be zeroed out can be determined in advance, and the computation of those data can be eliminated from the first GEMM operator.
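For the GEMM-Dropout-GEMM pattern, this can be sketched as follows (illustrative code only; the dropout mask is assumed to be known before execution and the usual 1/(1-p) rescaling is omitted): intermediate scalars that the mask zeroes out are never computed by the first GEMM and never mapped by the second.

    import numpy as np

    def fused_gemm_dropout_gemm(A, B, C, keep_mask):
        # A: (M, K), B: (K, N), C: (N, L), keep_mask: (M, N) of 0/1 -> D: (M, L)
        m_dim, k_dim = A.shape
        _, n_dim = B.shape
        _, l_dim = C.shape
        D = np.zeros((m_dim, l_dim))
        for m in range(m_dim):
            for n in range(n_dim):
                # The dropout decision is known in advance, so dropped
                # intermediate scalars are neither computed nor mapped.
                if keep_mask[m][n] == 0:
                    continue
                acc = 0.0
                for k in range(k_dim):
                    acc += A[m][k] * B[k][n]
                for l in range(l_dim):
                    D[m][l] += acc * C[n][l]
        return D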
Referring to Fig. 2, after the fusion code optimization, sparse optimization and sub-operator fusion optimization described above are completed, the code of a candidate fusion scheme can be deployed through the required software backend (LLVM, CUDA, OpenCL, etc.) onto the required hardware (CPU, GPU, DSP, etc.) and tested there, yielding the execution time of the candidate fusion scheme.
The search mechanism that generates the optimal fusion scheme from multiple candidate fusion schemes may use whatever search method is required. As an optional implementation, this embodiment uses a genetic algorithm to drive the search so as to obtain better fusion performance for the whole network.
Specifically, in step S102, when multiple candidate fusion schemes are randomly constructed for each computation subgraph, tested on hardware to obtain their execution times, and the candidate fusion scheme with the shortest execution time is selected as the optimal fusion scheme, the processing for each computation subgraph comprises:
S301, randomly generating M initial candidate fusion schemes for the computation subgraph based on a predefined base population size M to obtain an initial candidate fusion scheme test set, and testing each candidate fusion scheme in the initial test set on hardware to obtain its execution time;
S302, updating the candidate fusion schemes in the candidate fusion scheme test set;
S303, testing the updated candidate fusion schemes in the candidate fusion scheme test set on hardware to obtain their execution times;
S304, judging whether the several candidate fusion schemes with the shortest execution times in the updated test set remain unchanged; if so, the candidate fusion scheme test set is judged to be stable, the candidate fusion scheme with the shortest execution time is selected from the stable test set as the optimal fusion scheme of the computation subgraph, and processing moves to the next computation subgraph or exits; otherwise, the method jumps back to step S302.
In this embodiment, updating the candidate fusion schemes in the candidate fusion scheme test set in step S302 comprises: sorting the candidate fusion schemes in the test set by execution time and deleting the part with longer execution times; then, for the remaining candidate fusion schemes, generating new candidate fusion schemes with a genetic algorithm according to a predefined mutation probability P_mutation, the new and remaining candidate fusion schemes together forming the updated test set. Generating new candidate fusion schemes according to a predefined mutation probability P_mutation is a standard genetic-algorithm technique, so its implementation is not described in detail here. In this embodiment, the test results (the running times of the subgraph) of all fusion schemes of a neural network subgraph are sorted from short to long; the candidates in the slower half of the ordering are removed, new candidates are generated based on the mutation probability P_mutation and added to the faster half to form a new population, and the updated population of size M is sent to the test module, which records the execution of every candidate on the actual hardware. The stability condition on the candidate fusion scheme test set determines when the optimal fusion scheme of the subgraph has been found, after which the next partitioned subgraph is processed from the first step. When all subgraphs of the neural network have been iterated over, the optimal fusion scheme of the whole deep neural network model is obtained.
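The search loop can be sketched as follows (illustrative Python only; random_scheme, mutate and measure_on_hardware are hypothetical stand-ins for the framework's own routines): candidate fusion schemes are measured on hardware, the slower half of the population is discarded, mutated copies of the survivors refill the population, and the loop stops once the best execution time has stayed unchanged for several rounds.

    import random

    def search_optimal_scheme(subgraph, population_size, mutation_prob,
                              random_scheme, mutate, measure_on_hardware,
                              stable_rounds=3):
        # Initial population: randomly constructed candidate fusion schemes,
        # each one measured on the target hardware.
        population = [(measure_on_hardware(s), s)
                      for s in (random_scheme(subgraph)
                                for _ in range(population_size))]
        best_history = []
        while True:
            # Sort by measured execution time and keep only the faster half.
            population.sort(key=lambda ts: ts[0])
            parents = population[: population_size // 2]
            # Refill the population with mutated copies of surviving schemes.
            children = []
            while len(parents) + len(children) < population_size:
                _, parent = random.choice(parents)
                child = mutate(parent, mutation_prob)
                children.append((measure_on_hardware(child), child))
            population = parents + children
            # Stop once the best time has been stable for several rounds and
            # return the fastest candidate fusion scheme.
            best_time, best_scheme = min(population, key=lambda ts: ts[0])
            best_history.append(best_time)
            if (len(best_history) >= stable_rounds
                    and len(set(best_history[-stable_rounds:])) == 1):
                return best_scheme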
In summary, the operator fusion method for image-recognition deep neural networks in this embodiment achieves the following technical effects. First, the neural network operator fusion framework adopted by the method takes the computation graph of the input neural network (converted through ONNX or TVM), outputs the fused computation graph, and deploys it to different hardware backends; a programmer inputs the computation graph to be fusion-optimized, obtains the optimized graph by specifying the optimization passes and the parameters of the fusion scheme search, and can connect different backends to deploy on the relevant hardware devices. Second, the loop-reorganization-based complex operator fusion method and the related optimizations solve the problem that two consecutive complex operators cannot be fused in the compilation and execution of image-recognition neural networks; fusing adjacent complex operators improves operator execution efficiency and creates more fusion opportunities for the whole network, the enlarged search space of fusion schemes helps the search algorithm find the optimal fusion scheme, and the combination of complex-operator fusion and whole-network fusion search improves the execution efficiency of image recognition tasks. Third, the genetic-algorithm-based fusion scheme search strategy can effectively find the optimal fusion scheme in the network fusion search space, and a programmer can set different parameters to adapt the search algorithm to different requirements.
In addition, this embodiment also provides an operator fusion system for an image-recognition deep neural network, comprising a microprocessor and a memory connected to each other, the microprocessor being programmed or configured to execute the above operator fusion method for an image-recognition deep neural network. This embodiment also provides a computer-readable storage medium storing a computer program that is used to program or configure a microprocessor to execute the above operator fusion method for an image-recognition deep neural network.
Embodiment two:
This embodiment is substantially the same as the first embodiment; the main difference is the iteration termination condition of the genetic algorithm used in the search mechanism that generates the optimal fusion scheme from multiple candidate fusion schemes. As shown in Fig. 2, step S304 in this embodiment may instead control the iteration by an iteration count, for example: the iteration count is increased by 1; if it equals the preset number N+1, the candidate fusion scheme with the shortest execution time is selected from the candidate fusion scheme test set as the optimal fusion scheme of the computation subgraph, and processing moves to the next computation subgraph or exits; otherwise the method jumps back to step S302. The termination condition of N iterations finds the optimal fusion scheme of the neural network subgraph, after which the next partitioned subgraph is processed from the first step; when all subgraphs of the neural network have been iterated over, the optimal fusion scheme of the whole deep neural network model is obtained.
In addition, this embodiment also provides an operator fusion system for an image-recognition deep neural network, comprising a microprocessor and a memory connected to each other, the microprocessor being programmed or configured to execute the above operator fusion method for an image-recognition deep neural network. This embodiment also provides a computer-readable storage medium storing a computer program that is used to program or configure a microprocessor to execute the above operator fusion method for an image-recognition deep neural network.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (10)

1. An operator fusion method for an image-recognition deep neural network, characterized by comprising the following steps:
S101, partitioning the computation graph of an input deep neural network model into computation subgraphs;
S102, randomly constructing a plurality of candidate fusion schemes for each computation subgraph, testing the candidate fusion schemes on hardware to obtain their execution times, and selecting the candidate fusion scheme with the shortest execution time as the optimal fusion scheme; the operator fusion types in a candidate fusion scheme include fusion of two consecutive simple operators, fusion of two consecutive complex operators, and fusion of a consecutive simple operator and complex operator, wherein the fusion of two consecutive complex operators is realized by loop-reorganizing the latter complex operator so that the outer loops of the two complex operators are unified to the dimensions and shape of the intermediate result tensor, and then directly merging the outer loops;
S103, combining the optimal fusion schemes of the computation subgraphs to obtain the optimal fusion execution scheme of the deep neural network model.
2. The operator fusion method for an image-recognition deep neural network according to claim 1, wherein loop-reorganizing the latter complex operator so that the outer loops of the two complex operators are unified to the dimensions and shape of the intermediate result tensor, and then directly merging the outer loops, comprises the following steps:
S201, for two consecutive complex operators to be fused, keeping the computation strategy of the former complex operator unchanged and, by loop reorganization, modifying the computation strategy of the latter complex operator into an input-tensor-based computation strategy, in which the internal mapping is the mapping from a single input scalar to the whole output tensor and the outer loops follow the operator's input tensor, so that the outer loops of the computation strategies of the two complex operators are unified to the dimensions and shape of the intermediate result tensor between them;
S202, after the outer loops of the two complex operators' computation strategies have been unified to the dimensions and shape of their intermediate result tensor, directly merging the outer loops to obtain the fused code.
3. The operator fusion method for an image-recognition deep neural network according to claim 2, wherein step S202 further comprises a correctness check of the fused code: for the same input tensors of the former complex operator, the computation result of the two unfused complex operators is taken as the reference value and compared with the computation result of the fused code; if the results are identical, the correctness test is judged to have passed, otherwise it is judged to have failed, in which case the operator fusion pattern in the candidate fusion scheme is readjusted and the correctness test repeated until every fusion of two consecutive complex operators in the candidate fusion scheme has passed.
4. The operator fusion method for an image-recognition deep neural network according to claim 2, wherein when the candidate fusion scheme is tested on hardware, fusion code optimization is further performed on the basic code of the candidate fusion scheme before the test, the fusion code optimization including some or all of parallelization, loop unrolling, and vectorization.
5. The operator fusion method for an image-recognition deep neural network according to claim 2, wherein when the candidate fusion scheme is tested on hardware, sparse optimization of the complex operators in the candidate fusion scheme is further performed before the test: the activation function located between two complex operators in the candidate fusion scheme is rewritten as a condition that controls whether computation is executed, so that the latter complex operator performs its function mapping on an intermediate data scalar only when the intermediate data scalar produced by the former complex operator satisfies the activation condition of the activation function, and otherwise skips the function mapping for that intermediate data scalar.
6. The operator fusion method for an image-recognition deep neural network according to claim 2, wherein when the candidate fusion scheme is tested on hardware, sub-operator fusion optimization of the complex operators in the candidate fusion scheme is further performed before the test: complex operators that contain sub-operators are detected in the candidate fusion scheme, and for each sub-operator of such a complex operator, if the sub-operator can be fused with a sub-operator or simple operator of an adjacent operator, the two are fused.
7. The operator fusion method for an image-recognition deep neural network according to claim 2, wherein in step S102, when a plurality of candidate fusion schemes are randomly constructed for each computation subgraph, the candidate fusion schemes are tested on hardware to obtain execution times, and the candidate fusion scheme with the shortest execution time is selected as the optimal fusion scheme, the processing for each computation subgraph comprises:
S301, randomly generating M initial candidate fusion schemes for the computation subgraph based on a predefined base population size M to obtain an initial candidate fusion scheme test set, and testing each candidate fusion scheme in the initial test set on hardware to obtain its execution time;
S302, updating the candidate fusion schemes in the candidate fusion scheme test set;
S303, testing the updated candidate fusion schemes in the candidate fusion scheme test set on hardware to obtain their execution times;
S304, judging whether the several candidate fusion schemes with the shortest execution times in the updated test set remain unchanged; if so, the candidate fusion scheme test set is judged to be stable, the candidate fusion scheme with the shortest execution time is selected from the stable test set as the optimal fusion scheme of the computation subgraph, and processing moves to the next computation subgraph or exits; otherwise, jumping back to step S302.
8. The operator fusion method for an image-recognition deep neural network according to claim 7, wherein updating the candidate fusion schemes in the candidate fusion scheme test set in step S302 comprises: sorting the candidate fusion schemes in the test set by execution time and deleting the part with longer execution times; and, for the remaining candidate fusion schemes, generating new candidate fusion schemes with a genetic algorithm according to a predefined mutation probability P_mutation, the new candidate fusion schemes and the remaining candidate fusion schemes together forming the updated candidate fusion scheme test set.
9. An operator fusion system for an image-recognition deep neural network, comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the operator fusion method for an image-recognition deep neural network according to any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored therein, wherein the computer program is used to program or configure a microprocessor to perform the operator fusion method for an image-recognition deep neural network according to any one of claims 1 to 8.

Priority Applications (1)

Application Number: CN202310524971.5A (publication CN116842992A) | Priority Date: 2023-05-10 | Filing Date: 2023-05-10 | Title: Operator fusion method and system for an image-recognition deep neural network

Applications Claiming Priority (1)

Application Number: CN202310524971.5A (publication CN116842992A) | Priority Date: 2023-05-10 | Filing Date: 2023-05-10 | Title: Operator fusion method and system for an image-recognition deep neural network

Publications (1)

Publication Number: CN116842992A (en) | Publication Date: 2023-10-03

Family

ID=88158788

Family Applications (1)

Application Number: CN202310524971.5A (status: Pending) | Publication: CN116842992A (en) | Title: Operator fusion method and system for an image-recognition deep neural network

Country Status (1)

Country Link
CN (1) CN116842992A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication Number: CN117289948A (en) * | Priority Date: 2023-11-24 | Publication Date: 2023-12-26 | Assignee: 北京壁仞科技开发有限公司 | Title: Operator elimination method, device, system, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination