CN117492766A - Compiling method, compiler, neural network accelerator, chip and electronic equipment - Google Patents

Compiling method, compiler, neural network accelerator, chip and electronic equipment

Info

Publication number
CN117492766A
CN117492766A (application number CN202311811831.2A)
Authority
CN
China
Prior art keywords
operator
fusion
calculation
neural network
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311811831.2A
Other languages
Chinese (zh)
Inventor
邓奇光
刘洪杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Jiutian Ruixin Technology Co ltd
Original Assignee
Shenzhen Jiutian Ruixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Jiutian Ruixin Technology Co ltd filed Critical Shenzhen Jiutian Ruixin Technology Co ltd
Priority to CN202311811831.2A
Publication of CN117492766A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a compiling method, a compiler, a neural network accelerator, a chip and an electronic device, relates to the technical field of neural networks, and addresses the technical problems of low computing efficiency, high chip cost and high power consumption in deep learning devices. The compiling method comprises the following steps: selecting at least two of a plurality of operators and performing operator fusion based on their operation sequence to obtain a fusion operator set; reading a deep learning model file and converting it into an original calculation graph; matching the operator types and order in the original calculation graph against the fusion operator set to obtain matched fusion operators; converting the original calculation graph into an optimized calculation graph based on the matched fusion operators; and converting the optimized calculation graph into binary instructions. By performing operator fusion on the operators in the original calculation graph, the invention improves compiling flexibility and compiling efficiency and reduces the number and volume of memory data transfers, so that higher computing efficiency can be achieved and the area and power consumption of the neural network chip can be further reduced.

Description

Compiling method, compiler, neural network accelerator, chip and electronic equipment
Technical Field
The present invention relates to the field of neural networks, and in particular, to a compiling method, a compiler, a neural network accelerator, a chip, and an electronic device.
Background
Deep neural networks (Deep Neural Network, DNN) are a machine learning method based on the artificial neural network architecture; an artificial neural network (Artificial Neural Networks, ANN) uses layers of interconnected nodes, called neurons, to process and learn from input data. A deep neural network is an artificial neural network with multiple layers between the input layer and the output layer. Neural networks are built from the same basic components: neurons, synapses, weights, biases and functions, which in practical applications are commonly referred to as operators. Common operators include convolution, pooling, up/down sampling, activation functions and element operations (element addition, subtraction, multiplication and division). Deep learning uses multiple layers to represent different levels of abstraction of the data, thereby improving the accuracy and generalization ability of the model, and has been widely applied in computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design and medical image analysis, producing results comparable to or even exceeding the level of human experts. As data volumes continue to accumulate, neural-network-based artificial intelligence techniques are becoming increasingly widespread. Although neural networks have been proven to solve practical problems such as automatic driving and face recognition, they are difficult to deploy efficiently on traditional hardware because of the limited operational performance of traditional hardware platforms. There is therefore a need to design custom hardware platforms specifically for neural network algorithms; such a hardware platform is referred to as a neural network accelerator, and its core is typically a set of application-specific integrated circuit chips, referred to as neural network accelerator chips.
To alleviate the memory wall problem, which is exacerbated in deep learning, a neural network accelerator chip is usually designed with on-chip storage placed close to the computing units; the latency of accessing on-chip storage is far lower than that of accessing off-chip storage, so making good use of on-chip storage is key to realizing the performance of a neural network accelerator. A typical deep learning device also has off-chip storage, i.e., memory integrated outside the chip (for example on the motherboard), which is slower than on-chip storage but has a larger capacity.
A typical deep learning model requires more storage space than the on-chip storage can hold, so it cannot be loaded directly into on-chip storage and must first be loaded into off-chip storage. If a calculation graph generated by a generic compilation flow stores its results to off-chip storage during computation, every intermediate result must be written off-chip, which produces high power consumption and increases the area cost of the neural network accelerator chip; moreover, because transferring results to off-chip storage takes a long time, the computation of the neural network accelerator stalls and its computing power cannot be fully exploited. One approach to this problem is to minimize the transfer of results to off-chip storage and keep the results on-chip as far as possible. At the same time, the combination of operator parameters in deep learning, or a change in the capacity of on-chip storage, affects the optimal splitting strategy for data and operations, so the generated binary code differs, which in turn affects the execution efficiency of the algorithm.
For the above reasons, owing to the particularities of deep learning algorithms and of neural network accelerator structures, program optimization in the deep learning field has two important characteristics: it is extremely sensitive to variations in the algorithm and the hardware, and its operations and data are highly coupled. The problem to be solved is therefore how to provide a compiling method for deep learning algorithms based on these characteristics, so as to improve compiling flexibility and compiling efficiency, alleviate the memory wall problem, improve computing efficiency, and reduce the area and power consumption of the neural network accelerator chip.
Disclosure of Invention
The invention aims to provide a compiling method, a compiler, a neural network accelerator, a chip and an electronic device, so as to at least solve the technical problems described in the background above. Preferred implementations of the technical solutions provided by the present invention can produce the technical effects described below.
In order to achieve the above purpose, the present invention provides the following technical solutions:
the invention provides a compiling method, which comprises the following steps: S100: selecting at least two operators of different types from a plurality of operators, and performing operator fusion based on their operation sequence to obtain a fusion operator set comprising at least one fusion operator; S200: reading a deep learning model file, and converting the deep learning model file into an original calculation graph; S300: matching the operator types and order of appearance in the original calculation graph against the fusion operators in the fusion operator set, to obtain at least one fusion operator matched with the original calculation graph; S400: converting the original calculation graph into an optimized calculation graph based on the at least one matched fusion operator; S500: converting the optimized calculation graph into binary instructions.
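For illustration only, the following Python sketch walks through steps S100 to S400 on a toy example; every name and data structure in it (FusedOp, FUSION_OP_SET, the greedy longest-match strategy) is an assumption of ours, not something specified by the patent:

from dataclasses import dataclass

@dataclass(frozen=True)
class FusedOp:
    name: str
    pattern: tuple  # operator types in their required operation sequence

# S100: a fusion operator set built from ordered operator-type sequences.
FUSION_OP_SET = [
    FusedOp("conv_relu_pool_add", ("conv", "relu", "pool", "add")),
    FusedOp("conv_relu", ("conv", "relu")),
]

# S200 would parse a model file; here one execution path of the original
# calculation graph is reduced to its ordered operator types.
original_graph = ["conv", "relu", "pool", "add", "conv", "relu"]

def match_and_rewrite(graph, op_set):
    """S300/S400: greedily match fused operators by type and order."""
    optimized, i = [], 0
    patterns = sorted(op_set, key=lambda f: -len(f.pattern))  # longest first
    while i < len(graph):
        for f in patterns:
            if tuple(graph[i:i + len(f.pattern)]) == f.pattern:
                optimized.append(f.name)
                i += len(f.pattern)
                break
        else:
            optimized.append(graph[i])  # no match: keep the single operator
            i += 1
    return optimized

print(match_and_rewrite(original_graph, FUSION_OP_SET))
# ['conv_relu_pool_add', 'conv_relu'] — S500 would lower these to binary.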
Preferably, step S300 specifically includes: S310: traversing backwards from the output nodes of the original calculation graph until the input nodes are found, obtaining all initial execution paths from the input nodes to the output nodes; S320: finding and marking the repeated parts in all the initial execution paths, then merging all the initial execution paths into one result execution path; S330: traversing all nodes on the result execution path, obtaining all operator types and their order of appearance from these nodes, and finding the nodes that satisfy the operator types and operation sequences of the fusion operator set, thereby obtaining at least one group of fusion nodes; S340: obtaining at least one fusion operator matched with the original calculation graph based on the operator types and operation sequence of the fusion nodes.
Preferably, the operator includes at least one of convolution operation, pooling operation, up-down sampling operation, activation function operation, element operation, block normalization operation, and layer normalization operation.
Preferably, the activation function operation includes at least one of a relu activation function operation, a leaky_relu activation function operation, a hard swish activation function operation, a sigmoid activation function operation, a silu activation function operation, a prelu activation function operation, a tanh activation function operation, an elu activation function operation, a softmax activation function operation, a swish activation function operation, a maxout activation function operation and a softplus activation function operation; the element operation includes at least one of an addition operation add, a subtraction operation sub, a multiplication operation mul, a division operation div, a remainder operation mod, a negation operation neg, an increment operation inc, a decrement operation dec, a maximum operation max, a minimum operation min and an absolute value operation abs.
Preferably, the image processing fusion operator is obtained based on the one-way operation instruction sequence of the convolution operation, the activation function operation, the pooling operation and the element operation.
Preferably, the image processing fusion operator further includes an up-down sampling operation after the element operation is performed.
Preferably, the image processing fusion operator further includes a block normalization operation between the activation function operation and the pooling operation.
Preferably, a comprehensive fusion operator is obtained based on a one-way operation instruction sequence of a convolution operation, a first operator, a second operator, a third operator and a fourth operator; the first operator, the second operator, the third operator and the fourth operator are any one of pooling operation, up-down sampling operation, activation function operation, element operation, block normalization operation and layer normalization operation and are different from each other.
Preferably, in step S200, the original calculation graph is a directed graph, and the following topological order is obtained by performing topological conversion on the directed graph:

y_1 = op_1(x),
y_2 = op_2(y_1),
…,
y_n = op_n(y_{n-1}),

where y_1, y_2, …, y_{n-1} represent intermediate calculation results, y_n represents the output result, and op_1, op_2, …, op_n represent operators of the same or different types.
Preferably, in step S500, the optimized calculation graph is converted into binary instructions output using a convolution kernel generator, based on a heuristic search strategy driven by a cost model.
A compiler for running the compiling method described above, comprising: a data receiving module, a compiling module and an instruction acquisition module. The data receiving module is used for receiving model data of deep learning model files in different formats and parsing the model data to obtain the original calculation graph; the compiling module performs operator fusion on the original calculation graph through a calculation graph engine, converts the original calculation graph into the optimized calculation graph and compiles the optimized calculation graph into operator instructions; the instruction acquisition module is used for receiving the operator instructions and performing adjustment, encapsulation and format conversion on them to obtain conversion instructions; the compiling module also converts the conversion instructions into binary instructions.
Preferably, the calculation graph engine is operated by creating operations and calling APIs of the NNAPI programming library.
A neural network accelerator comprising a compiler according to any one of the preceding claims.
A chip comprising a compiler as described above or comprising a neural network accelerator as described above.
An electronic device comprising a compiler as described above or comprising a neural network accelerator as described above.
By implementing one of the above technical solutions, the invention has the following advantages or beneficial effects:
according to the invention, the operator fusion is carried out on operators in the original calculation graph, so that the memory data carrying times and data carrying quantity between single operators can be reduced, the execution time of the whole operators in the deep learning model file is finally reduced, the problem of a storage wall in the neural network operation is relieved, and therefore, the higher calculation efficiency can be realized in the deep learning, and the area and the power consumption of the neural network chip are further reduced. Furthermore, a fusion operator can be used for representing operator combinations commonly used in the deep learning algorithm, even the whole model, so that the compiling flexibility and the compiling efficiency are improved, and the execution efficiency of the deep learning is further improved.
Drawings
For a clearer description of the technical solutions of embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art, in which:
FIG. 1 is a flow chart of a compiling method according to a first embodiment of the invention;
FIG. 2 is a flowchart illustrating the step S300 of FIG. 1 in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram I of an image processing fusion operator in accordance with a first embodiment of the present invention;
FIG. 4 is a schematic diagram II of an image processing fusion operator in a first embodiment of the present invention;
FIG. 5 is a third schematic diagram of an image processing fusion operator in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of a comprehensive fusion operator in accordance with a first embodiment of the present invention;
FIG. 7 is a schematic diagram of an NNAPI programming library interface in accordance with a second embodiment of the present invention;
fig. 8 is a schematic diagram of a neural network accelerator in a third embodiment of the invention.
Detailed Description
For a better understanding of the objects, technical solutions and advantages of the present invention, reference should be made to the various exemplary embodiments described hereinafter with reference to the accompanying drawings, which form a part hereof, and in which are described various exemplary embodiments which may be employed in practicing the present invention. The same reference numbers in different drawings identify the same or similar elements unless expressly stated otherwise. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. It is to be understood that they are merely examples of processes, methods, apparatuses, etc. that are consistent with certain aspects of the present disclosure as detailed in the appended claims, other embodiments may be utilized, or structural and functional modifications may be made to the embodiments set forth herein without departing from the scope and spirit of the present disclosure.
In the description of the present invention, it should be understood that the terms "center," "longitudinal," "transverse," and the like are used in an orientation or positional relationship based on that shown in the drawings, and are merely for convenience in describing the present invention and to simplify the description, rather than to indicate or imply that the elements referred to must have a particular orientation, be constructed and operate in a particular orientation. The terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. The term "plurality" means two or more. The terms "connected," "coupled" and "connected" are to be construed broadly and may be, for example, fixedly connected, detachably connected, integrally connected, mechanically connected, electrically connected, communicatively connected, directly connected, indirectly connected via intermediaries, or may be in communication with each other between two elements or in an interaction relationship between the two elements. The term "and/or" includes any and all combinations of one or more of the associated listed items. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
To illustrate the technical solutions of the present invention, specific embodiments are described below; only the portions relevant to the embodiments of the present invention are shown.
Embodiment one: as shown in fig. 1, the present invention provides a compiling method comprising the following steps. S100: selecting at least two operators of different types from a plurality of operators, where the specific operator types and number are selected or set according to the application field of the neural network (such as images, text, speech or natural language); operator fusion is performed based on the operation sequence to obtain a fusion operator set comprising at least one fusion operator, which can be expressed as Op_fused = Fuse(op_1, op_2, …, op_n), where Op_fused represents the fusion operator, op_1, op_2, …, op_n represent at least two operators of different types fused according to the order in which the different operators appear, and Fuse(op_1, op_2, …, op_n) represents performing the operator fusion operation on the at least two operators according to their operation sequence. The fusion operator set can include a variety of common or uncommon fusion operators, giving it wide applicability. S200: reading a deep learning model file and converting it into an original calculation graph, where the original calculation graph is composed of the basic data structure Tensor and basic operation-unit operators. A calculation graph has two main elements: nodes (Node) and edges (Edge), where nodes represent data such as vectors, matrices and tensors, and edges represent operations such as addition, subtraction, multiplication, division and convolution. For example, the expression y = (x + w) ∗ (w + 1) is represented as a calculation graph in which x and w are tensors and "+" and "∗" are operators. S300: matching the operator types and order of appearance in the original calculation graph against the fusion operators in the fusion operator set; that is, fusion operator matching is based on two elements, the types of the operators and the order in which they appear, and matching and fusion can proceed only when both are satisfied simultaneously. Obviously, the fusion operators in the fusion operator set need to be matched one by one to avoid omissions. S400: converting the original calculation graph into an optimized calculation graph based on the at least one matched fusion operator, so that subsequent operations are performed on the optimized calculation graph; the operators in the optimized calculation graph can include one or more fusion operators, and the optimized calculation graph reduces the transfer of data between different operators in memory. S500: converting the optimized calculation graph into binary instructions, which can be executed directly by the deep learning arithmetic device. The binary instructions are hardware instructions that direct hardware devices to execute the deep learning algorithm; exactly which hardware devices are directed, and the specific content of the hardware instructions, are not limited by the invention and can be chosen flexibly according to the actual situation.
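As a concrete illustration of the node-and-edge representation above, here is a minimal Python sketch of the calculation graph for y = (x + w) ∗ (w + 1); the Node class and its encoding are our own assumptions, not the patent's:

# Leaf nodes hold tensors/scalars; each interior node applies one
# operator ("+" as "add", "∗" as "mul") to its input edges.
class Node:
    def __init__(self, op=None, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def eval(self):
        if self.op is None:                 # leaf: a tensor such as x or w
            return self.value
        a, b = (n.eval() for n in self.inputs)
        return {"add": a + b, "mul": a * b}[self.op]

x, w, one = Node(value=2.0), Node(value=3.0), Node(value=1.0)
y1 = Node("add", (x, w))      # intermediate result x + w
y2 = Node("add", (w, one))    # intermediate result w + 1
y = Node("mul", (y1, y2))     # output node
print(y.eval())               # (2 + 3) * (3 + 1) = 20.0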
In the invention, performing operator fusion on the operators in the original calculation graph reduces the number of memory data transfers and the amount of data transferred between individual operators (in the prior art, for example, the on-chip storage of a certain neural network accelerator is only 128 KB, while a 1080p RGB image represented with 16 bits requires 12 MB of space; the input image therefore needs to be partitioned at least 96 times, and this data splitting forces the operations to be split along with it, generating a large volume of, and many repeated, memory data transfers), finally reducing the execution time of the overall operators in the deep learning model file and alleviating the memory wall problem in neural network operation, so that deep learning can achieve higher computing efficiency and the area and power consumption of the neural network chip are further reduced. Furthermore, a fusion operator can be used to represent operator combinations commonly used in deep learning algorithms (such as the residual modules in ResNet), or even a whole model, which improves compiling flexibility and compiling efficiency and further improves the execution efficiency of deep learning.
As an alternative embodiment, as shown in fig. 2, step S300 specifically includes the following. S310: traversing backwards from the output nodes of the original calculation graph until the input nodes are found, i.e., performing a global depth-first search on the original calculation graph to obtain all initial execution paths from the input nodes to the output nodes, so that all operator information in the original calculation graph is obtained. S320: finding the repeated parts in all initial execution paths, i.e., places where two or more adjacent operators are of the same type; operator fusion cannot be performed among such operators, so they must be found and marked individually, and repeated operators are not put into the same fusion operator. In other words, repeated operators form a "dividing line" between different fusion operators, and there can be more than one repeated part; marking them avoids matching errors inside fusion operators, which would cause the matching to fail. All the initial execution paths are then merged into one result execution path, i.e., the operators in the initial execution paths are merged according to the temporal-logical operation order, so that the calculation result of an earlier operator can serve as the input data required by a later operator (and not the other way round), and a result is finally output. The result execution path carries information on the types and number of the required operators and on the order in which the operators appear. S330: traversing all nodes on the result execution path and obtaining all operator types and their order of appearance from these nodes, i.e., completing the information conversion between the result execution path and the operator types and order; then finding the nodes that satisfy the operator types and operation sequences of the fusion operator set, to obtain at least one group of fusion nodes, where one group of fusion nodes includes at least two types of operators. The operator types and order appearing on the result execution path are matched against the operator types and order in the fusion operators, and during matching each matched object on the result execution path must have exactly one input and one output, so the result execution path can be split into one, two or more fusion operators. S340: obtaining at least one fusion operator matched with the original calculation graph based on the operator types and operation sequence of the fusion nodes, i.e., completing the fusion operator matching. Through the above operations the original calculation graph is matched with fusion operators, so that an optimized calculation graph containing the fusion operator information is conveniently obtained for subsequent operations.
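The following Python sketch illustrates S310 and a simplified version of the S320 merge; the predecessor-dictionary graph encoding and all function names are assumptions made for the example:

graph_preds = {                    # node: nodes feeding into it
    "out": ["add"],
    "add": ["relu_a", "relu_b"],   # a join: two branches meet here
    "relu_a": ["conv_a"],
    "relu_b": ["conv_b"],
    "conv_a": ["in"],
    "conv_b": ["in"],
    "in": [],
}

def initial_paths(node, suffix=()):
    """S310: walk backwards from the output until input nodes are found."""
    path = (node,) + suffix
    preds = graph_preds[node]
    if not preds:                         # reached an input node
        return [list(path)]
    return [p for pred in preds for p in initial_paths(pred, path)]

def merge_paths(paths):
    """S320 (simplified): merge paths, keeping first-seen order.
    A real merge would preserve topological order; this only deduplicates."""
    seen, merged = set(), []
    for path in paths:
        for node in path:
            if node not in seen:
                seen.add(node)
                merged.append(node)
    return merged

paths = initial_paths("out")
print(paths)               # two input-to-output execution paths
print(merge_paths(paths))  # one result execution path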
As an alternative implementation, the operator includes at least one of the following: a convolution operation (conv), mainly used to extract features and widely applied in image processing, natural language processing and audio processing (in image processing, for example, a small window, also called a convolution kernel or filter, slides over the image and a computation is performed on the pixels at each position to obtain a new output image); a pooling operation (pool), which takes the maximum value or the average value over a target region; an up/down sampling operation (resize), supporting both up-sampling and down-sampling, where up-sampling is used to enlarge a raw image so that it can be displayed on a higher-resolution display device, and down-sampling is used to make an image fit the display region or to generate a corresponding thumbnail; an activation function operation, where the activation function runs on the neurons of the artificial neural network and is responsible for mapping a neuron's input to its output, providing the nonlinear modeling capability of the neural network, and which includes at least one of the relu, leaky_relu, hard swish, sigmoid, silu, prelu, tanh, elu, softmax, swish, maxout and softplus activation function operations; an element operation, which acts on tensor elements one by one, the computations on different elements being mutually independent so that they can be executed in parallel (for example on a GPU) and finally produce independent results, improving operation efficiency, and which includes at least one of addition add, subtraction sub, multiplication mul, division div, remainder mod, negation neg, increment inc, decrement dec, maximum max, minimum min and absolute value abs, satisfying a wide variety of element operation requirements; a block normalization operation (batch normalization), which makes the distribution of the training data of each batch the same (the essence of the neural network learning process is to learn the data distribution), ensuring the convergence of the neural network and accelerating the training of deep neural networks; and a layer normalization operation (layer normalization), which, because block normalization can be affected by the sample count of small batches, normalizes each sample as an independent unit, for example normalizing all feature channels of each sample at the current layer as a whole, and can be used directly in the neural network. Based on these operator types and their sequential combinations, the operation requirements of most neural networks can be met, so the invention can satisfy the operation scenarios of most neural networks, especially those for computer vision, natural language and the like; the operator fusion effectively alleviates the "memory wall" problem in neural networks and thereby improves the execution efficiency of deep learning.
As an alternative embodiment, an image processing fusion operator is obtained based on a unidirectional operation instruction sequence of convolution operation, activation function operation, pooling operation and element operation; as shown in fig. 3, the convolution operation, activation function operation, pooling operation and element operation correspond respectively to a convolution layer, an activation function layer, a pooling layer and an element operation layer. When a neural network is applied to image processing, convolution, activation function, pooling and element operations are the most common operations, corresponding respectively to image feature extraction, nonlinear expression of image features, taking the maximum or average value of a target region, and parallel computation across elements of different target regions, so the image processing fusion operator can be used in most image processing and its general use in image processing is ensured. Preferably, the image processing fusion operator further includes an up/down sampling operation after the element operation, yielding a unidirectional operation instruction sequence of convolution, activation function, pooling, element operation and up/down sampling; the image can be enlarged or reduced by the up/down sampling operation, giving the fusion operator pertinence for subdivided application fields. As shown in fig. 4, these operations correspond respectively to the convolution layer, activation function layer, pooling layer, element operation layer and up/down sampling layer. On this basis, preferably, a block normalization operation is further included between the activation function operation and the pooling operation, yielding a unidirectional operation instruction sequence of convolution, activation function, block normalization, pooling, element operation and up/down sampling; the block normalization operation accelerates the convergence of the neural network in image processing, so this fusion operator accelerates model training and improves efficiency. As shown in fig. 5, these operations correspond respectively to the convolution layer, activation function layer, block normalization layer, pooling layer, element operation layer and up/down sampling layer. In the image processing fusion operator, the element operation layer receives front-layer data input (mainly result data generated by other operator operations before the current calculation of the deep learning model); the activation function layer, the pooling layer and the element operation layer can also perform branch back-layer outputs as required, and the fusion operator finally performs the trunk back-layer output, the trunk layer output being the calculation result output of the fusion operator.
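To make the benefit of fusion concrete, here is a hedged NumPy sketch of the conv, relu, pool, element-add chain of fig. 3 executed as one fused kernel, with intermediates held in local variables rather than written back between operators; the shapes, the 3x3 kernel and the 2x2 max pool are simplifying assumptions for the example:

import numpy as np

def fused_conv_relu_pool_add(x, kernel, residual):
    h, w = x.shape[0] - 2, x.shape[1] - 2          # valid 3x3 convolution
    conv = np.array([[np.sum(x[i:i + 3, j:j + 3] * kernel)
                      for j in range(w)] for i in range(h)])
    act = np.maximum(conv, 0.0)                    # relu activation
    pooled = act[:h - h % 2, :w - w % 2]           # crop to even size
    pooled = pooled.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))  # 2x2 pool
    return pooled + residual                       # element operation

x = np.arange(36, dtype=float).reshape(6, 6)
out = fused_conv_relu_pool_add(x, np.ones((3, 3)), residual=np.ones((2, 2)))
print(out.shape)  # (2, 2): one memory write instead of one per operator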
Of course, at least two different operators can be selected from convolution operation, pooling operation, up-down sampling operation, activation function operation, element operation, block normalization operation and layer normalization operation according to actual operation requirements, so that other types of fusion operators can be obtained, and the description is omitted here.
As an alternative embodiment, a comprehensive fusion operator is obtained based on a unidirectional operation instruction sequence of a convolution operation, a first operator, a second operator, a third operator and a fourth operator; that is, the comprehensive fusion operator includes five operators, with the convolution operation leading. As shown in fig. 6, the convolution operation, first operator, second operator, third operator and fourth operator correspond respectively to a convolution layer, a first operator layer, a second operator layer, a third operator layer and a fourth operator layer. The first, second, third and fourth operators are each any one of the pooling operation, up/down sampling operation, activation function operation, element operation, block normalization operation and layer normalization operation (i.e., operators other than the convolution operation) and differ from one another, which ensures that operator fusion does not fail because of repeated operators. Through the comprehensive fusion operator, combinations of more operators can be realized, so the method suits more complex neural network operation scenarios and operator fusion can be applied in more fields, further improving the application range and flexibility of the invention, alleviating the frequently occurring "memory wall" problem in neural networks, and reducing the area and power consumption of the neural network accelerator.
In an optional embodiment, in step S200, the original calculation graph is a directed graph, and the following topological order is obtained by performing topological conversion on the directed graph:

y_1 = op_1(x),
y_2 = op_2(y_1),
…,
y_n = op_n(y_{n-1}),

where y_1, y_2, …, y_{n-1} represent intermediate calculation results, y_n represents the output result, and op_1, op_2, …, op_n represent operators of the same or different types. Thus y_1, y_2, …, y_{n-1}, y_n form a calculation sequence in which the calculation result of one operator is the data required by the calculation of the next. After topological ordering, the calculation process does not load data from off-chip storage but computes directly from on-chip storage, which reduces the stalling of the neural network accelerator caused by data transfers and improves computing efficiency.
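A standard way to obtain such an order is Kahn's algorithm; the sketch below (with an assumed edge-dictionary encoding) is illustrative and not taken from the patent:

from collections import deque

# Orders the directed calculation graph so each op_i runs only after
# its input y_{i-1} exists.
edges = {"x": ["op1"], "op1": ["op2"], "op2": ["op3"], "op3": []}

def topological_order(edges):
    indegree = {node: 0 for node in edges}
    for successors in edges.values():
        for node in successors:
            indegree[node] += 1
    ready = deque(node for node, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for succ in edges[node]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    return order

print(topological_order(edges))  # ['x', 'op1', 'op2', 'op3']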
In an alternative embodiment, in step S500, the optimized calculation graph is converted into binary instructions output using a convolution kernel generator, based on a heuristic search strategy driven by a cost model; the convolution kernel generator can use hardware acceleration instructions, improving computing performance. The cost model estimates the total time of an operation executed on the neural network accelerator; the main factors it considers are the memory-access time of the operation, the computation time, and the overlap rate between the two, three factors that directly determine program performance. Specifically, the neural network accelerator has a computing unit and a memory-access unit; since computation and memory access can proceed independently, minimizing the execution time of a program requires maximizing the overlap rate of memory-access time and computation time. Therefore, when the optimized calculation graph is converted into binary instructions, the heuristic search strategy of the cost model generates various instruction combinations (the heuristic search strategy uses heuristic information on the overlap rate of memory-access time and computation time to guide the search, thereby narrowing the search range and reducing the complexity of the problem), computes the final running time for each instruction combination, and then selects the instruction combination with the shortest time as the final compilation result, so that the neural network accelerator can achieve higher computing efficiency and the area and power consumption of the neural network chip are further reduced.
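The following toy Python model illustrates the kind of choice the cost model makes; the time formula and all numbers are invented for the example and are not the patent's actual cost model:

def estimated_time_us(compute_us, memory_us, overlap):
    # overlap in [0, 1]: fraction of the shorter phase hidden behind
    # the longer one, since compute and access units run independently.
    return compute_us + memory_us - overlap * min(compute_us, memory_us)

candidates = [
    {"name": "tile_32",  "compute_us": 120, "memory_us": 200, "overlap": 0.90},
    {"name": "tile_64",  "compute_us": 150, "memory_us": 140, "overlap": 0.70},
    {"name": "tile_128", "compute_us": 240, "memory_us":  90, "overlap": 0.95},
]

best = min(candidates, key=lambda c: estimated_time_us(
    c["compute_us"], c["memory_us"], c["overlap"]))
print(best["name"])  # 'tile_64': the shortest-estimate instruction combination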
This embodiment is only a specific example and does not limit the invention to such an implementation.
Embodiment two: a compiler for running the compiling method in the first embodiment, comprising: the system comprises a data receiving module, a compiling module and an instruction acquisition module. The data receiving module is used for receiving model data of deep learning model files with different formats, such as deep learning model files with tflite, onnx and other formats, and analyzing the model data to obtain an original calculation map; the compiling module performs operator fusion on the original computational graph through a computational graph engine (the operator fusion operation can be realized by integrating the computational graph engine hardware acceleration support of the operator fusion in the neural network chip), namely at least two operators meeting the operator fusion condition (which are consistent with the operator type and the execution sequence of a preset fusion operator and can be matched, and meet the operator fusion condition) are fused into one operator according to the operation sequence, the original computational graph is converted into an optimized computational graph, and the optimized computational graph is compiled into an operator instruction. The instruction acquisition module is used for receiving an operator instruction, performing adjustment, encapsulation and format conversion operation on the operator instruction to obtain a conversion instruction, and enabling the fusion operator generated by the compiler to meet the requirements of the deep learning hardware equipment on instruction formats and the like through the conversion instruction, so that an algorithm corresponding to the fusion operator can be normally executed. The compiling module also converts the converted instruction into a binary instruction, thereby facilitating the execution operation of the algorithm. The binary instruction is used for guiding hardware devices to execute the hardware instruction of the deep learning algorithm, particularly guiding which hardware devices and the specific content of the hardware instruction are not limited in the invention, and can be flexibly selected according to actual conditions. According to the invention, the compiler performs operator fusion on operators meeting the conditions through the compiling module, so that the memory handling times among different operators in the deep learning model file are reduced (in the prior art, the calculation of each operator needs to be independently executed according to the sequence and the calculation result is written into the memory, the calculation of the next operator generally depends on the calculation result of the last operator, and the intermediate result of the last operator needs to be read from the memory, so that memory handling of multiple times of data can occur, the calculation of a neural network processor is blocked, the exertion of calculation force is influenced), the effect of improving the performance of the compiler is achieved, meanwhile, the compiling flexibility and the compiling efficiency are improved, the processing performance of the neural network is further improved, and the chip area, the cost and the power consumption of the compiler are effectively reduced.
As an alternative implementation, the calculation graph engine is operated by creating operations and calling APIs of the NNAPI programming library. NNAPI provides a base functional layer for higher-level machine learning frameworks (such as TensorFlow Lite and Caffe2) to build and train neural networks; it is called by machine learning libraries, frameworks and tools so that developers can train their models off-device, and it enables the calculation graph engine's support of the operator fusion function. NNAPI also supports applying data on Android devices to previously trained, developer-defined models to make inferences, which improves the applicability of the invention. As shown in fig. 7, the compiling process of the compiling module calls the operation creation and call APIs of NNAPI to operate the calculation graph engine in NNENGINE (NNENGINE is a platform that can run pre-trained neural network models and supports CPU and GPU inference on the desktop and host), generating the corresponding binary instructions. In this process the calculation graph engine compiles according to the compiling method of embodiment one, during which the fusion operators generate corresponding instructions using the convolution kernel generator, and the generated instructions are loaded into the runtime system to drive the neural network accelerator.
Embodiment III: a neural network accelerator comprising the compiler of embodiment two. As shown in fig. 8, the compiler is a functional module in the neural network accelerator, and further includes a vector processor, a preprocessing module, an in-memory computing matrix, a shared memory, and the like, where the compiler is connected with the vector processor and performs data transmission. The in-memory computing matrix can be a matrix formed by a plurality of CIMs (computing in memory, in-memory computing), and by adopting the vector processor in the first embodiment (the vector processor is a multi-operator fusion vector processor capable of realizing fusion of a plurality of operators), the area and the power consumption of the neural network accelerator are effectively reduced, and the neural network accelerator is convenient to use.
In addition, in-memory computing can alleviate the memory wall problem. A von Neumann computer system separates the memory and the processor, and the overhead of the processor's frequent memory accesses forms the memory wall; high-frequency data transfer is often the primary source of chip power consumption, especially for chips in the AI field, affecting the chip's computing power, efficiency and power consumption. A neural network accelerator using integrated sensing-storage-computing technology (integrating sensing, storage and operation) can achieve very high computing power, efficiency and energy-efficiency ratio, so it improves the area and power characteristics of the neural network accelerator without affecting its function.
Embodiment four: a chip comprising the compiler of embodiment two or the neural network accelerator of embodiment three. The modules in the chip provided by the invention can be realized in whole or in part by software, hardware and a combination thereof. The modules can be embedded in or independent of a processor in the computing device in a hardware form, and can be stored in a memory in the computing device in a software form, so that the processor can call and execute operations corresponding to the modules, data handling of the neural network accelerator in computation can be reduced, the computing efficiency is improved, and further the chip area, the cost and the power consumption are effectively reduced.
Fifth embodiment: an electronic device comprising the compiler in embodiment two or the neural network accelerator in embodiment three. The neural network accelerator provided by the invention can be applied to automatic driving, AR, VR and laser radar, and can also be widely applied to a series of electronic equipment with requirements for low power consumption and high energy efficiency ratio, such as smart phones, tablet computers, wearable electronic equipment, smart home electronic products, industry or medical treatment or battery power supply. The electronic device may include other information processing modules (for performing information interaction with the chip, and jointly completing operations specified by a user, such as a CPU, a GPU, and the like), a data receiving and storing module (for storing intermediate data or result data output by deep learning, such as a DDR memory, an HBM memory, and the like), an external device (for receiving original data, and then transmitting the data to other modules, such as a camera, a microphone, and the like, through a universal interconnection interface), and a universal interconnection interface (each module in the electronic device is connected by a signal through a communication protocol and is capable of performing data communication).
The foregoing is only illustrative of the preferred embodiments of the invention, and it will be appreciated by those skilled in the art that various changes in the features and embodiments may be made and equivalents may be substituted without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (15)

1. A compiling method, comprising the steps of:
S100: selecting at least two operators of different types from a plurality of operators, and performing operator fusion based on an operation sequence to obtain a fusion operator set comprising at least one fusion operator;
S200: reading a deep learning model file, and converting the deep learning model file into an original calculation graph;
S300: matching the operator types and order of appearance in the original calculation graph against the fusion operators in the fusion operator set, to obtain at least one fusion operator matched with the original calculation graph;
S400: converting the original calculation graph into an optimized calculation graph based on the at least one matched fusion operator;
S500: converting the optimized calculation graph into binary instructions.
2. The compiling method according to claim 1, wherein the step S300 specifically comprises:
S310: traversing backwards from the output nodes of the original calculation graph until the input nodes are found, and obtaining all initial execution paths from the input nodes to the output nodes;
S320: finding and marking the repeated parts in all the initial execution paths, and then merging all the initial execution paths into one result execution path;
S330: traversing all nodes on the result execution path, obtaining all operator types and their order of appearance from these nodes, and finding the nodes that satisfy the operator types and operation sequences of the fusion operator set, to obtain at least one group of fusion nodes;
S340: obtaining at least one fusion operator matched with the original calculation graph based on the operator types and operation sequence of the fusion nodes.
3. The compiling method of claim 2, wherein the operator comprises at least one of a convolution operation, a pooling operation, an up-down sampling operation, an activation function operation, an element operation, a block normalization operation, and a layer normalization operation.
4. A compiling method according to claim 3, wherein the activation function operation comprises at least one of a relu activation function operation, a leaky_relu activation function operation, a hard swish activation function operation, a sigmoid activation function operation, a silu activation function operation, a prelu activation function operation, a tanh activation function operation, an elu activation function operation, a softmax activation function operation, a swish activation function operation, a maxout activation function operation and a softplus activation function operation; the element operation comprises at least one of an addition operation add, a subtraction operation sub, a multiplication operation mul, a division operation div, a remainder operation mod, a negation operation neg, an increment operation inc, a decrement operation dec, a maximum operation max, a minimum operation min and an absolute value operation abs.
5. A compiling method according to claim 3, wherein the image processing fusion operator is obtained based on a one-way operation instruction sequence of the convolution operation, the activation function operation, the pooling operation, and the element operation.
6. The compiling method according to claim 5, wherein the image processing fusion operator further comprises an up-down sampling operation after the element operation.
7. The compiling method according to claim 6, wherein the image processing fusion operator further comprises a block normalization operation between the performing of the activation function operation and the pooling operation.
8. A compiling method according to claim 3, wherein the synthetic fusion operator is obtained based on a sequence of unidirectional operation instructions of the convolution operation, the first operator, the second operator, the third operator, and the fourth operator; the first operator, the second operator, the third operator and the fourth operator are any one of pooling operation, up-down sampling operation, activation function operation, element operation, block normalization operation and layer normalization operation and are different from each other.
9. The compiling method according to claim 1, wherein in the step S200, the original calculation graph is a directed graph, and the following topological order is obtained by performing topological conversion on the directed graph:

y_1 = op_1(x),
y_2 = op_2(y_1),
…,
y_n = op_n(y_{n-1}),

wherein y_1, y_2, …, y_{n-1} represent intermediate calculation results, y_n represents the output result, and op_1, op_2, …, op_n represent operators of the same or different types.
10. The compiling method according to claim 1, wherein in the step S500, the optimized calculation graph is converted into binary instructions output using a convolution kernel generator, based on a heuristic search strategy of a cost model.
11. A compiler for running the compiling method of any one of claims 1-10, comprising: a data receiving module, a compiling module and an instruction acquisition module; the data receiving module is used for receiving model data of deep learning model files in different formats and parsing the model data to obtain the original calculation graph; the compiling module performs operator fusion on the original calculation graph through a calculation graph engine, converts the original calculation graph into the optimized calculation graph and compiles the optimized calculation graph into operator instructions; the instruction acquisition module is used for receiving the operator instructions and performing adjustment, encapsulation and format conversion on the operator instructions to obtain conversion instructions; the compiling module also converts the conversion instructions into binary instructions.
12. The compiler of claim 11, wherein the computational graph engine operates by creating and calling APIs through operations of an NNAPI programming library.
13. A neural network accelerator comprising a compiler according to any one of claims 11-12.
14. A chip comprising a compiler as claimed in claim 12 or comprising a neural network accelerator as claimed in claim 13.
15. An electronic device comprising a compiler as claimed in claim 12 or comprising a neural network accelerator as claimed in claim 13.
CN202311811831.2A 2023-12-27 2023-12-27 Compiling method, compiler, neural network accelerator, chip and electronic equipment Pending CN117492766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311811831.2A CN117492766A (en) 2023-12-27 2023-12-27 Compiling method, compiler, neural network accelerator, chip and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311811831.2A CN117492766A (en) 2023-12-27 2023-12-27 Compiling method, compiler, neural network accelerator, chip and electronic equipment

Publications (1)

Publication Number Publication Date
CN117492766A (en) 2024-02-02

Family

ID=89678629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311811831.2A Pending CN117492766A (en) 2023-12-27 2023-12-27 Compiling method, compiler, neural network accelerator, chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN117492766A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345937A (en) * 2017-01-06 2018-07-31 谷歌有限责任公司 Cycle is merged with library
CN110321999A (en) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 Neural computing figure optimization method
CN110766147A (en) * 2018-07-25 2020-02-07 赛灵思公司 Neural network compiler architecture and compiling method
CN111160551A (en) * 2019-12-04 2020-05-15 上海寒武纪信息科技有限公司 Computation graph execution method, computer device, and storage medium
CN112711422A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Optimization method and system for neural network compiling
US20210182036A1 (en) * 2019-12-12 2021-06-17 Huawei Technologies Co., Ltd. Hardware platform specific operator fusion in machine learning
CN113703775A (en) * 2021-08-31 2021-11-26 上海阵量智能科技有限公司 Compiling method, device, equipment and storage medium
CN115481718A (en) * 2022-09-15 2022-12-16 北京航空航天大学 Deep learning graph-calculation integrated optimizer based on simplified computation subset
CN117195989A (en) * 2023-11-06 2023-12-08 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment


Similar Documents

Publication Publication Date Title
CN107895191B (en) Information processing method and related product
EP4036724A1 (en) Method for splitting neural network model by using multi-core processor, and related product
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
US20190347553A1 (en) Training neural networks using mixed precision computations
CN113449857A (en) Data processing method and data processing equipment
CN107944545B (en) Computing method and computing device applied to neural network
KR20190099931A (en) Method and apparatus for operating deep learning by using the systolic array
US11593628B2 (en) Dynamic variable bit width neural processor
CN112200297A (en) Neural network optimization method, device and processor
CN109685068A (en) A kind of image processing method and system based on generation confrontation neural network
CN110309911A (en) Neural network model verification method, device, computer equipment and storage medium
CN115034402A (en) Model reasoning performance optimization method and device and related products
KR20190089685A (en) Method and apparatus for processing data
KR20190140841A (en) Neural network hardware acceleration with stochastic adaptive resource allocation
CN112183735A (en) Method and device for generating operation data and related product
CN111831359A (en) Weight precision configuration method, device, equipment and storage medium
US20220044104A1 (en) Method and apparatus for forward computation of neural network, and computer-readable storage medium
CN116912629B (en) General image text description generation method and related device based on multi-task learning
CN113822144A (en) Target detection method and device, computer equipment and storage medium
CN113672232A (en) Program compiling method and device
CN108053033A (en) A kind of function calling sequence generation method and system
CN117315758A (en) Facial expression detection method and device, electronic equipment and storage medium
CN111667060B (en) Deep learning algorithm compiling method and device and related products
CN116401552A (en) Classification model training method and related device
CN117492766A (en) Compiling method, compiler, neural network accelerator, chip and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Country or region after: China
Address after: 518049, Building 3, Building 1501, Shenzhen New Generation Industrial Park, No. 136 Zhongkang Road, Meidu Community, Meilin Street, Futian District, Shenzhen City, Guangdong Province
Applicant after: SHENZHEN JIUTIAN RUIXIN TECHNOLOGY Co.,Ltd.
Address before: 310, Building 1, Shenzhen New Generation Industrial Park, 136 Zhongkang Road, Meidu Community, Meilin Street, Futian District, Shenzhen City, Guangdong Province, 518049
Applicant before: SHENZHEN JIUTIAN RUIXIN TECHNOLOGY Co.,Ltd.
Country or region before: China