CN117764122B - Computational graph processing method and device, electronic device and storage medium - Google Patents

Computational graph processing method and device, electronic device and storage medium

Info

Publication number
CN117764122B
CN117764122B (application CN202311861591.7A)
Authority
CN
China
Prior art keywords
tensor
graph
address
sparse
tensor data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311861591.7A
Other languages
Chinese (zh)
Other versions
CN117764122A (en)
Inventor
蒋力
刘方鑫
黄世远
熊大鹏
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Yizhu Intelligent Technology Co ltd
Shanghai Jiaotong University
Original Assignee
Suzhou Yizhu Intelligent Technology Co ltd
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Yizhu Intelligent Technology Co ltd and Shanghai Jiaotong University
Priority to CN202311861591.7A
Publication of CN117764122A
Application granted
Publication of CN117764122B
Legal status: Active

Landscapes

  • Complex Calculations (AREA)

Abstract

The application discloses a computational graph processing method and device, an electronic device, and a storage medium. The computational graph processing method comprises the following steps: parsing a machine learning model to obtain a first computational graph comprising a plurality of computing nodes; and performing a graph-level optimization operation on the first computational graph to obtain a second computational graph. The graph-level optimization operation includes an address transfer stream optimization operation, which comprises: acquiring the tensor data stream of the first computational graph according to the node information and data dependency relations of the plurality of computing nodes in the first computational graph; performing sparsification on tensor data in the tensor data stream to obtain corresponding sparse tensor data, wherein the sparse tensor data are stored in a tensor-aware sparse storage format; concatenating the address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream; and optimizing the first computational graph according to the address transfer stream. The application achieves automatic and efficient sparse computation and improves computational efficiency.

Description

Computational graph processing method and device, electronic device and storage medium
Technical Field
The present invention relates to the field of computer technology, and in particular to a computational graph processing method and device, an electronic device, and a storage medium.
Background
The key technology of artificial intelligence today is the neural network, which forms a complex network system by widely interconnecting a large number of simple processing units (called neurons), simulating the connections between human brain nerve cells. The computation between two adjacent layers of neurons can be abstracted as computation steps performed on the input data; such a "computation step" is referred to in a neural network as an operation (OP). In practice, to analyze the structure, computational characteristics, and data flow direction of a neural network, all OPs of the neural network are usually put together to form a computational graph. As the performance of deep neural networks has improved, the parameter counts and computation amounts of models have grown ever larger, severely constraining model computation speed. Neural network models with high resource requirements greatly increase the difficulty of deployment on terminal devices with strict real-time requirements. In the related art, the computation amount of a neural network model can be reduced by compressing the model through network sparsification, i.e., reducing the number of connection layers or neurons. Model sparsity includes structured sparsity and unstructured sparsity. Unstructured sparsification prunes arbitrary elements of a matrix to obtain an irregular sparse matrix, but it introduces random memory accesses and the branch-judgment problems caused by skipping zero weights. Structured sparsification prunes with a channel or a filter as the minimum unit to obtain a regular sparse matrix, which is convenient to deploy but can greatly affect network accuracy.
To compress a sparse matrix and reduce the space it occupies in storage, the sparse matrix is generally stored in a fixed coding format that records the position information of its non-zero values. Common sparse storage formats include COO (coordinate format), CSR (compressed sparse row), CSC (compressed sparse column), CSF (compressed sparse fiber), and CSB (compressed sparse block).
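As a concrete illustration of these fixed coding formats, the following sketch (not from the patent; names are illustrative) encodes a small matrix in COO and CSR form by recording the positions of its non-zero values:

```python
def to_coo(m):
    """Coordinate format (COO): parallel lists of row indices, column indices, values."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(m):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i); cols.append(j); vals.append(v)
    return rows, cols, vals

def to_csr(m):
    """Compressed sparse row (CSR): row pointers + column indices + values."""
    indptr, indices, data = [0], [], []
    for row in m:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j); data.append(v)
        indptr.append(len(indices))  # one pointer per row boundary
    return indptr, indices, data

m = [[1, 0, 0],
     [0, 0, 2],
     [0, 3, 0]]
print(to_coo(m))   # ([0, 1, 2], [0, 2, 1], [1, 2, 3])
print(to_csr(m))   # ([0, 1, 2, 3], [0, 2, 1], [1, 2, 3])
```

Note how CSR compresses the row coordinates of COO into boundary pointers; this is the kind of position encoding the passage above refers to.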
Although sparsification can reduce the scale of a neural network model, most deep learning frameworks and compilers currently do not support sparsity; those that do support only the traditional sparse storage formats, which cannot efficiently represent a structured sparse matrix and still contain severe redundancy.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a computational graph processing method and device, an electronic device, and a storage medium, which apply a tensor-aware sparse storage format to the computational graph so as to achieve automatic and efficient sparse computation and improve computational efficiency.
According to a first aspect of the present invention, there is provided a computational graph processing method comprising: parsing a machine learning model to obtain a first computational graph, the first computational graph comprising a plurality of computing nodes; and performing a graph-level optimization operation on the first computational graph to obtain a second computational graph; wherein the graph-level optimization operation comprises an address transfer stream optimization operation, and the address transfer stream optimization operation comprises the following steps: acquiring a tensor data stream of the first computational graph according to node information and data dependency relations of the plurality of computing nodes in the first computational graph; performing sparsification on tensor data in the tensor data stream to obtain corresponding sparse tensor data, wherein the sparse tensor data are stored in a tensor-aware sparse storage format; concatenating address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream; and optimizing the first computational graph according to the address transfer stream.
Preferably, the tensor-aware sparse storage format includes an address index and a non-zero value, the address index includes at least one tensor dimension index, the tensor dimension index is position information of a non-zero element in the sparse tensor data in a tensor dimension, and the non-zero value is a value of the non-zero element in the sparse tensor data.
Preferably, when the sparse tensor data is a two-dimensional tensor, the address index includes a row dimension index and/or a column dimension index; when the sparse tensor data is a four-dimensional tensor, the address index comprises a filter dimension index and/or a channel dimension index; when the sparse tensor data is a two-dimensional block tensor, the address index includes a block dimension index.
Preferably, the graph-level optimization operation further includes a shape inference optimization operation, the shape inference optimization operation including: deducing the shape of the address index of the sparse tensor data according to the shape and the sparsity of the sparse tensor data in the tensor dimension; the shapes of the address indexes of the sparse tensor data are connected in series to form a shape transfer flow according to the address transfer flow; and optimizing the first calculation graph after the address flow optimization according to the shape transfer flow.
Preferably, the computational graph processing method further includes: performing an operator-level optimization operation on the second computational graph to obtain a third computational graph, wherein the operator-level optimization operation comprises an operator-level address transfer stream optimization operation; wherein the operator-level address transfer stream optimization operation includes: splitting each computing node in the second computational graph into loop-nested operator nodes to form a third computational graph; acquiring a tensor data stream of the third computational graph according to node information of a plurality of operator nodes in the third computational graph and the dependency relations among the operator nodes; concatenating the address indexes of the sparse tensor data of the operator nodes into an operator-level address transfer stream according to the tensor data stream; and performing operator-level optimization on the third computational graph according to the operator-level address transfer stream.
According to a second aspect of the present invention, there is provided a computational graph processing apparatus comprising: a parsing module for parsing a machine learning model to obtain a hardware-executable first computational graph comprising a plurality of computing nodes; and a graph-level optimization module for performing a graph-level optimization operation on the first computational graph to obtain a second computational graph, wherein the graph-level optimization operation comprises an address transfer stream optimization operation; wherein the graph-level optimization module includes: an address optimization unit for acquiring a tensor data stream of the first computational graph according to node information and data dependency relations of the plurality of computing nodes in the first computational graph; performing sparsification on tensor data in the tensor data stream to obtain corresponding sparse tensor data, wherein the sparse tensor data are stored in a tensor-aware sparse storage format; concatenating address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream; and optimizing the first computational graph according to the address transfer stream.
Preferably, the tensor-aware sparse storage format includes an address index and a non-zero value, the address index includes at least one tensor dimension index, the tensor dimension index is position information of a non-zero element in the sparse tensor data in a tensor dimension, and the non-zero value is a value of the non-zero element in the sparse tensor data.
Preferably, when the sparse tensor data is a two-dimensional tensor, the address index includes a row tensor dimension index and/or a column tensor dimension index; when the sparse tensor data is a four-dimensional tensor, the address index comprises a filter tensor dimension index and/or a channel tensor dimension index; when the sparse tensor data is a two-dimensional block tensor, the address index includes a block tensor dimension index.
Preferably, the graph level optimization module further comprises: a shape inference unit configured to infer a shape of an address index of the sparse tensor data from a shape and sparsity of the sparse tensor data in a tensor dimension; generating a shape transfer stream according to the shape of the address index of the sparse tensor data and the address transfer stream; and optimizing the first calculation graph after the address flow optimization according to the shape transfer flow.
Preferably, the computational graph processing apparatus further includes: an operator-level optimization module for performing an operator-level optimization operation on the second computational graph to obtain a third computational graph, the operator-level optimization operation comprising an operator-level address transfer stream optimization operation; wherein the operator-level optimization module comprises: an operator-level address optimization unit, configured to split each computing node in the second computational graph into loop-nested operator nodes to form a third computational graph; acquire a tensor data stream of the third computational graph according to node information of a plurality of operator nodes in the third computational graph and the dependency relations among the operator nodes; and concatenate the address indexes of the sparse tensor data of the operator nodes into an operator-level address transfer stream according to the tensor data stream.
According to a third aspect of the present invention, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described computational graph processing method.
According to a fourth aspect of the present invention, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program is executed by the processor to implement the above-mentioned computation graph processing method.
According to the computational graph processing method and device, the electronic device, and the storage medium of the present application, a second computational graph is obtained by performing a graph-level optimization operation on the first computational graph; the graph-level optimization operation comprises an address transfer stream optimization operation, which comprises: acquiring a tensor data stream of the first computational graph according to node information and data dependency relations of the plurality of computing nodes in the first computational graph; performing sparsification on tensor data in the tensor data stream to obtain corresponding sparse tensor data, wherein the sparse tensor data are stored in a tensor-aware sparse storage format; concatenating address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream; and optimizing the first computational graph according to the address transfer stream. By applying the tensor-aware sparse storage format to the computational graph, the application achieves automatic and efficient sparse computation and improves computational efficiency.
Furthermore, the tensor-aware sparse storage format uses multiple tensor dimension indexes to represent the positions of non-zero elements along different tensor dimensions, so a sparse matrix can be represented efficiently, redundant storage overhead is reduced, and storage efficiency is improved. In addition, the positions of non-zero elements along different tensor dimensions can be located rapidly by scanning the non-duplicate indices in the tensor dimension indexes, which reduces storage cost, avoids invalid memory accesses, and improves data indexing efficiency.
Further, a sparse tensor in the tensor data stream can be encoded into a dense tensor through its address index; the dense tensor carries the address information of the non-zero elements of the original sparse tensor, so the computation amount is reduced when the computational graph is executed and computational efficiency is improved.
Further, the shape of the address index of the sparse tensor data is inferred from the shape and sparsity of the sparse tensor data in each tensor dimension; a shape transfer stream is generated from the shapes of the address indexes and the address transfer stream; and the first computational graph, already optimized by the address transfer stream, is further optimized according to the shape transfer stream. In this way the shapes of tensor data can be inferred automatically, and a static computational graph can be controlled dynamically by changing the sparsity, without recompilation or reconstruction, which facilitates online learning and incremental learning.
Further, in the operator-level computational graph, the address indexes of the sparse tensor data of the operator nodes are concatenated into an operator-level address transfer stream so as to optimize the operator-level computational graph and realize automatic sparse computation.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 shows a flowchart of a computational graph processing method provided by an embodiment of the present invention;
fig. 2 shows a flowchart of step S120 provided by an embodiment of the present invention;
FIG. 3 illustrates a schematic diagram of a tensor-aware sparse storage format provided by an embodiment of the present invention;
FIG. 4 shows a schematic diagram of a sparse storage format in the related art;
FIG. 5 is a diagram showing the indexing of the compressed sparse row (CSR) format in the related art during computation;
FIG. 6 is a schematic diagram of indexes of tensor-aware sparse storage format TIEs in a computing process according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an address transfer flow and tensor data flow provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of a computational graph provided by an embodiment of the present invention;
Fig. 9 shows another flowchart of step S120 provided by an embodiment of the present invention;
FIG. 10 is a flowchart of another computational graph processing method according to an embodiment of the present invention;
FIG. 11 is a schematic diagram illustrating a calculation of an operator-level address transfer flow provided by an embodiment of the present invention;
FIG. 12 is a schematic diagram of a computational graph processing apparatus according to an embodiment of the present invention;
fig. 13 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Various embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. The same reference numbers will be used throughout the drawings to refer to the same or like parts. For clarity, the various features of the drawings are not drawn to scale.
Embodiments of the present application will now be described with reference to the accompanying drawings; the embodiments described are evidently only some, not all, of the embodiments of the present application. As one of ordinary skill in the art will appreciate, with the development of technology and the emergence of new scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Some of the terms used in the embodiments of the present application are explained below:
Neural network (NN): a neural network forms a complex network system by widely interconnecting a large number of simple processing units (called neurons), simulating the connections between human brain nerve cells. A simple neural network contains three layers: an input layer, an output layer, and a hidden layer (also called the middle layer); each connection carries a weight (its value is called the weight, or parameter).
Operation (OP): a computation step in a neural network. OPs may be nested within an OP; for example, a convolution OP may include an OP that converts the input four-dimensional tensor into a matrix, an OP that converts the weight four-dimensional tensor into a matrix, an OP that performs matrix multiplication on the input matrix and the weight matrix, and an OP that converts the result of the matrix multiplication back into a four-dimensional tensor.
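The nested-OP decomposition of convolution described above can be sketched as follows (an illustrative im2col-style implementation for a single-channel 2-D case; function names are hypothetical, not the patent's):

```python
def im2col(x, kh, kw):
    """OP 1: unfold every kh x kw patch of a 2-D input into one row of a matrix."""
    h, w = len(x), len(x[0])
    patches = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patches.append([x[i + di][j + dj] for di in range(kh) for dj in range(kw)])
    return patches

def conv2d(x, k):
    """Convolution as nested OPs: im2col -> flatten weights -> matmul -> reshape."""
    kh, kw = len(k), len(k[0])
    kflat = [k[i][j] for i in range(kh) for j in range(kw)]       # OP 2: weights -> vector
    rows = [sum(a * b for a, b in zip(p, kflat))                  # OP 3: matrix multiply
            for p in im2col(x, kh, kw)]
    out_w = len(x[0]) - kw + 1
    return [rows[i:i + out_w] for i in range(0, len(rows), out_w)]  # OP 4: back to 2-D

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, 1]]
print(conv2d(x, k))  # [[6, 8], [12, 14]]
```

Each helper corresponds to one of the nested OPs named in the definition above, with the 4-D tensors of the real case simplified to 2-D.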
Computational graph: to facilitate analysis of a neural network's structure, computational characteristics, and data flow direction, all OPs of the neural network are typically put together to form a computational graph.
Tensor (Tensor): tensors are general multidimensional data expression forms, and are generalizations of vectors and matrices. The scalar is a 0-dimensional tensor, the vector is a 1-dimensional tensor, and the matrix is a 2-dimensional tensor.
Forward propagation: in each layer of the neural network, the result of computing the input data with the weights is passed to the next layer as its input, or serves as the output of the whole neural network.
Backpropagation (Backward propagation): in training the weights of a neural network, the output of the network is first obtained through forward propagation; this output is compared with the true output to obtain a loss value; finally, the gradient of the loss with respect to the weights is computed. The weights must be adjusted for the neural network to perform its intended function, and this adjustment process is commonly referred to as "training".
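The forward-propagation / backpropagation / training loop described by these two terms can be sketched minimally for a single scalar weight (illustrative only; the gradient here is derived by hand rather than by a framework's autodiff):

```python
def forward(x, w):
    return x * w                      # one "layer": input times weight

def loss(y_pred, y_true):
    return (y_pred - y_true) ** 2     # squared error against the true output

def grad_w(x, w, y_true):
    # backpropagation by the chain rule: d(loss)/dw = 2 * (x*w - y_true) * x
    return 2 * (forward(x, w) - y_true) * x

x, y_true, w, lr = 2.0, 10.0, 1.0, 0.05
for _ in range(50):
    w -= lr * grad_w(x, w, y_true)    # "training": adjust w to reduce the loss
print(round(w, 3))  # 5.0, since 2 * 5 = 10
```

Gradient descent drives the weight toward the value that makes the forward pass reproduce the true output, which is exactly the adjustment process the glossary entry calls training.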
Sparseness: the sparsity artificially introduces a large number of 0 elements into the weight matrix, and skips multiplication and addition related to the 0 elements during matrix multiplication calculation, so that the calculated amount of the neural network can be obviously reduced, and the power consumption of a chip is reduced.
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples.
Fig. 1 shows a flowchart of a computational graph processing method provided by an embodiment of the present invention. As shown in fig. 1, the computational graph processing method includes the following steps.
In step S110, a machine learning model is parsed to obtain a hardware-executable first computational graph, the first computational graph including a plurality of computing nodes.
In the present embodiment, the machine learning model may be, but is not limited to, a neural network model. The machine learning model is mapped by a compiler to a high-level intermediate representation (HIR), and the output is expressed in the form of a computational graph to obtain the first computational graph. The first computational graph comprises a plurality of computing nodes and edges and expresses the computational logic of the neural network. Each computing node in the computational graph represents a corresponding operation OP performed by the neural network (for example, a conv2d node represents a convolution operation, an add node represents an addition operation, and a matmul node represents a matrix multiplication operation); one computing node represents one computation task. Edges represent the data flow direction, i.e., the flow of tensor data between computing nodes: an edge connects a previous computing node (which may be called a predecessor node) to a subsequent computing node (which may be called a successor node), indicating that the output of the predecessor node is the input of the successor node.
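A minimal illustrative graph IR along these lines might look as follows (node and function names are hypothetical, not the patent's HIR): nodes carry an operation type, edges are recorded as predecessor lists, and a dependency-respecting execution order is recovered by visiting predecessors first.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    op: str                                       # e.g. "conv2d", "add", "matmul"
    inputs: list = field(default_factory=list)    # predecessor nodes (incoming edges)

def topo_order(outputs):
    """Depth-first walk that emits every node after all of its predecessors."""
    seen, order = set(), []
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.inputs:
            visit(p)                              # predecessors first
        order.append(n)
    for n in outputs:
        visit(n)
    return order

a = Node("x", "input")
b = Node("w", "input")
c = Node("mm", "matmul", [a, b])                  # output of x and w feeds mm
d = Node("out", "add", [c, b])
print([n.name for n in topo_order([d])])          # ['x', 'w', 'mm', 'out']
```

The edge direction here encodes exactly the predecessor/successor relation described above: a successor can only execute once every predecessor's output tensor is available.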
In step S120, a graph-level optimization operation is performed on the first computation graph to obtain a second computation graph.
In this embodiment, the graph-level optimization operation includes an address transfer stream optimization operation. Referring to fig. 2, the address transfer stream optimization operation includes steps S121 to S124.
Step S121: and acquiring tensor data flow of the first computational graph according to node information of a plurality of computational nodes in the first computational graph.
In this embodiment, the node information of a computing node includes input information, parameter information, output information, and attribute information. The input information of each computing node may include the node's input tensors and input connection relations; the input tensor of the first computing node in the first computational graph may be supplied by the user, and the input tensors of the other computing nodes may be determined from the outputs of their preceding computing nodes. An input connection relation describes how the node's inputs connect to the outputs of other computing nodes in the computational graph. The output information of each computing node includes the node's output tensors and output connection relations; the output tensor of a computing node can be determined from its input tensors, its parameter tensors, and the computation the node performs. An output connection relation describes how the node's outputs connect to the inputs of other computing nodes in the computational graph. The parameter information includes, but is not limited to, the pre-configured weight parameters required to implement the computing node's operation. The attribute information characterizes the node's attributes and may include, but is not limited to, the node's operation type.
Operations such as multiplication and convolution are performed on a computing node's input tensor and parameter tensor to generate its output tensor, and the output tensor is taken as the input of the next computing node; in this way the sparse tensor data flowing among the computing nodes of the computational graph form a tensor data stream.
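The formation of a tensor data stream from chained node outputs can be sketched as follows (a toy evaluator under assumed names, not the patent's mechanism): nodes are run in dependency order, each output is bound to an environment, and the stream is the ordered sequence of tensors produced along the edges.

```python
def run_stream(nodes, feeds):
    """nodes: list of (name, fn, input_names) in dependency order.
    Returns the tensor data stream: each node's name with the tensor it emits."""
    env = dict(feeds)                 # user-supplied inputs for the first node(s)
    stream = []
    for name, fn, ins in nodes:
        env[name] = fn(*(env[i] for i in ins))   # output of one node ...
        stream.append((name, env[name]))          # ... becomes input to the next
    return stream

nodes = [
    ("mul", lambda x, w: [xi * w for xi in x], ["x", "w"]),   # elementwise scale
    ("add", lambda t: [ti + 1 for ti in t],    ["mul"]),      # consumes mul's output
]
print(run_stream(nodes, {"x": [1, 2], "w": 3}))
# [('mul', [3, 6]), ('add', [4, 7])]
```

Binding per-tensor metadata (such as the address indexes described next) to the entries of this stream is what turns it into the address transfer stream of step S123.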
Step S122: and carrying out sparsification processing on tensor data in the tensor data flow to obtain corresponding sparse tensor data, wherein the sparse tensor data are stored in a tensor-aware sparse storage format.
In this embodiment, the input tensors and parameter tensors involved in the tensor data stream are sparsified, that is, some elements of the input tensors and parameter tensors are set to 0 to obtain sparse tensor data, and the sparse tensor data are then stored in the tensor-aware sparse storage format (Tensorization-aware Index Entity, TIE). The sparsification may be, for example, structured, semi-structured, or unstructured, but is not limited thereto.
The tensor-aware sparse storage format comprises an address index and non-zero values; the address index comprises at least one tensor dimension index TIE, and a tensor dimension index TIE records the position information of the non-zero elements of the sparse matrix along a tensor dimension. For example, an N-D structured sparse tensor comprises N tensor dimensions and has I_k elements in its k-th dimension; the TIE of the k-th tensor dimension is denoted the k-th TIE, and its length is L_k.
When the tensor data is a two-dimensional tensor, the address index includes a row dimension index (R TIE) and/or a column dimension index (C TIE); when the tensor data is a four-dimensional tensor, the address index includes a filter dimension index (F TIE) and a channel dimension index (C TIE); when the tensor data is a two-dimensional block tensor, the address index includes a block dimension index (B TIE). The position of a non-zero element in the k-th dimension is located by scanning the non-duplicate indices in the k-th TIE.
Referring to FIG. 3, for a 2-D sparse tensor with the data blocks in row R1 and column C2 set to 0, the address indexes may be denoted R TIE = [0 2] and C TIE = [0 1]. Similarly, for a 4-D sparse tensor with the data blocks in filter F1 and channel C1 set to 0, the address indexes may be denoted F TIE = [0 2] and C TIE = [0 2]. For a 2-D block sparse tensor with the data blocks in B1 and B2 set to 0, the address index may be denoted B TIE = [0 3].
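The TIE encoding illustrated above can be sketched for the 2-D case as follows (an interpretation of the described format, not the patent's exact layout): each tensor dimension keeps only the non-duplicate indices of rows/columns that still contain non-zero elements, alongside the surviving values.

```python
def encode_tie_2d(m):
    """Structured-sparse 2-D tensor -> (R TIE, C TIE, surviving non-zero block)."""
    r_tie = [i for i, row in enumerate(m) if any(v != 0 for v in row)]          # non-zero rows
    c_tie = [j for j in range(len(m[0])) if any(row[j] != 0 for row in m)]      # non-zero cols
    values = [[m[i][j] for j in c_tie] for i in r_tie]                          # dense remainder
    return r_tie, c_tie, values

# row 1 and column 2 pruned to zero (structured sparsity)
m = [[1, 2, 0],
     [0, 0, 0],
     [3, 4, 0]]
print(encode_tie_2d(m))  # ([0, 2], [0, 1], [[1, 2], [3, 4]])
```

For this matrix the indexes come out as R TIE = [0 2] and C TIE = [0 1], matching the 2-D example in the text; unlike CSR, no per-row pointer survives for the all-zero row.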
Sparse storage formats in the related art include the coordinate format (COO), the compressed sparse row format (CSR), the compressed sparse column format (CSC), the compressed sparse fiber format (CSF), and the compressed sparse block format (CSB). As shown in fig. 4, these formats store data by continuously encoding the position information of non-zero elements; they do not fully exploit the sparsity of structured sparse patterns, and severe redundancy remains in the representation.
The tensor-aware sparse storage format TIE provided by the present application eliminates the redundancy in pos and crd and achieves more efficient storage, with a lower theoretical upper bound on space complexity and higher memory-indexing efficiency. The position of a non-zero element in the k-th dimension can be located rapidly by scanning the non-duplicate indices in the k-th TIE.
Referring to fig. 5, during indexing, although the elements of row R1 of the weight matrix are all zero, index information for R1 still exists in the pos values of the CSR format, so redundant information remains. Referring to FIG. 6, during indexing, the non-duplicate indices in the TIEs of different dimensions are scanned to locate the positions of non-zero elements, which reduces redundancy and is very fast.
Step S123: and concatenating address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream.
In this embodiment, the output tensor of any computing node can be determined from the node's input tensors, its parameter tensors, and the computation the node performs; each piece of sparse tensor data has a corresponding address index recording the positions of its non-zero elements, so the address index of the output tensor is determined by the address indexes of the node's input tensor and parameter tensor together with the node's computation. When a computing node performs its computation, the sparse tensor data are zero-value-skip encoded into dense tensor data according to their address indexes, and the dense tensor data carry the address information of the non-zero elements of the original sparse tensors; the computation is then performed on these zero-value-skip-encoded dense tensors.
Fig. 7 takes a Conv2d operation as an example, where the current node takes data and a weight matrix W 0 as inputs, performs a convolution operation, and outputs a feature map FM 0. Only one element in the column dimension of data is non-zero, with address index CTIE 0 [0]; only four elements in the filter dimension (i.e., the F dimension) of the weight matrix W 0 are non-zero, with address index FTIE 0 [0 2]. When executing the first node, according to the address index CTIE 0 [0] of the input tensor, only 1 and 5 in the weight matrix W 0 are activated and 2 and 6 are skipped, i.e., only the non-zero elements associated with C 0 are considered; the address index of the feature map FM 0 is CTIE 1 [0 2]. The next node takes the feature map FM 0 and the weight matrix W 1 as inputs, performs a convolution operation, and outputs a feature map FM 1. The address index of FM 0 is CTIE 1 [0 2] and the address index of W 1 is FTIE 1 [0]; for the activated elements 10 and 50 of FM 0, only 1 and 5 of the weight matrix W 1 are activated and 3 is skipped, i.e., only the non-zero elements associated with C 0 and C 2 are considered, and the address index of the feature map FM 1 is CTIE 1 [0].
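The zero-value-skip encoding described above can be sketched as a gather driven by the address indexes. This is a minimal illustration under assumed data (a hypothetical 4×4 weight matrix with structured sparsity in filters 0, 2 and channels 0, 2); `zero_skip_encode` is an illustrative name, not the application's API:

```python
import numpy as np

def zero_skip_encode(weight, f_tie, c_tie):
    # Gather only the filters (rows) and channels (columns) listed in
    # the address indexes; the result is a scaled-down dense tensor
    # plus the address info of the original non-zero positions.
    dense = weight[np.ix_(f_tie, c_tie)]
    return dense, (f_tie, c_tie)

W = np.array([[1, 0, 5, 0],
              [0, 0, 0, 0],
              [2, 0, 6, 0],
              [0, 0, 0, 0]])
F_TIE = np.array([0, 2])
C_TIE = np.array([0, 2])
W_dense, addr = zero_skip_encode(W, F_TIE, C_TIE)
# W_dense is the 2x2 dense block that computation actually runs on
```

The subsequent computation operates only on `W_dense`, which is why the zero elements never enter the arithmetic.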
Step S124: and optimizing the first calculation graph according to the address transfer flow.
In this embodiment, the address transfer stream is bound to the tensor data stream in the first computation graph, so that the address index of the output tensor of a computing node can be automatically inferred. For example, in the declaration phase, the address index of the input sparse tensor and the address index of the parameter sparse tensor are declared in the input of the computing node, and the address index of the output tensor is declared in the output of the computing node. Fig. 8 shows part of the computing nodes of a computation graph that contains not only tensor data flows but also address transfer flows.
In some embodiments, the graph-level optimization operations further include shape inference optimization operations, see fig. 9, including steps S125-S127.
Step S125: and deducing the shape of the address index of the sparse tensor data according to the shape and the sparsity of the sparse tensor data in the tensor dimension.
In this embodiment, the node information of the computing node further carries the sparsity of the sparse tensor data in each tensor dimension, and the shape of the address index of the sparse tensor data can be deduced from the shape and sparsity of the sparse tensor data in that dimension. For example, in the declaration of the computing node, the computation formula of the output tensor is: [CTIE 1, FM 0] = conv2d(I0 = data, I1 = W 0, I2 = CTIE 0, I3 = FTIE 0, F = 3, sparsityF = 1/3, C = 2, sparsityC = 1/2), where FM 0 is the output tensor, CTIE 1 is the address index of the output tensor, data is the input tensor, W 0 is the weight tensor, CTIE 0 is the address index of the input tensor data, FTIE 0 is the address index of the weight tensor W 0, F is the length of the weight tensor W 0 in the filter dimension, sparsityF is the sparsity of the weight tensor in the filter dimension (i.e., the percentage of sparse filters), C is the length of the input tensor data in the channel dimension, and sparsityC is the sparsity of the input tensor data in the channel dimension (i.e., the percentage of sparse channels). The length of CTIE 0 can then be calculated as C × (1 − sparsityC) = 1, so the shape of CTIE 0 is (1); similarly, the shape of FTIE 0 can be inferred to be (2) from F × (1 − sparsityF) = 2.
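The shape rule above — TIE length = dimension length × (1 − sparsity) — is simple enough to check numerically. A minimal sketch (the function name `tie_shape` is an assumption for illustration):

```python
def tie_shape(dim_len, sparsity):
    # Length of a TIE = dim_len * (1 - sparsity), i.e. the number of
    # non-sparse index positions remaining in that dimension.
    # round() guards against floating-point residue such as 1/3.
    return (int(round(dim_len * (1 - sparsity))),)

shape_ctie0 = tie_shape(2, 1/2)  # C=2, sparsityC=1/2 -> (1,)
shape_ftie0 = tie_shape(3, 1/3)  # F=3, sparsityF=1/3 -> (2,)
```

These reproduce the (1) and (2) shapes deduced in the declaration example.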
Step S126: and serially connecting the shapes of the address indexes of the sparse tensor data according to the address transfer stream to form a shape transfer stream.
In this embodiment, the output tensor of any computing node is determined by the input tensor and the parameter tensor of that node together with the node's computation, and each piece of sparse tensor data has a corresponding address index recording the positions of its non-zero elements. Therefore, the shape of the address index of the output tensor is determined by the shape of the address index of the input tensor, the shape of the address index of the parameter tensor, and the computation of the node. The shape of a trailing TIE can be updated based on the shape of the leading TIE. For example, CTIE 1 has the same shape as FTIE 0, namely (2).
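The leading-TIE-to-trailing-TIE shape update can be sketched as a single walk over the shape transfer stream. This is a deliberately simplified model (node records and the rule "output-TIE shape inherits the parameter-TIE shape" are assumptions abstracted from the CTIE 1 / FTIE 0 example, not the application's full inference rule):

```python
def propagate_tie_shapes(nodes):
    # Walk the shape transfer stream in order; each node's output-TIE
    # shape is taken from its parameter-TIE shape (trailing TIE
    # updated from leading TIE).
    shapes = {}
    for node in nodes:
        shapes[node["out_tie"]] = node["param_tie_shape"]
    return shapes

# Hypothetical two-node stream: the first mirrors the conv2d example.
stream = [{"out_tie": "CTIE1", "param_tie_shape": (2,)},
          {"out_tie": "CTIE2", "param_tie_shape": (1,)}]
shapes = propagate_tie_shapes(stream)  # {'CTIE1': (2,), 'CTIE2': (1,)}
```

Changing a node's sparsity only changes its `param_tie_shape` entry, and one re-walk updates every downstream shape — the mechanism behind dynamic control without recompilation.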
Step S127: and optimizing the first calculation graph after the address flow optimization according to the shape transfer flow.
In this embodiment, the shape transfer stream is bound to the tensor data stream in the first computational graph, and the shape of the address index of the output tensor of the computational node can be automatically inferred. Thus, dynamic control of the static computational graph can be achieved by changing sparsity without recompilation and reconstruction, which is beneficial to online learning and incremental learning. Referring to fig. 8, the address transfer stream may be replaced with a shape transfer stream in the second computational graph to simplify the representation of the address transfer stream.
In some embodiments, the graph-level optimization operation further includes, but is not limited to, graph pruning, graph fusion, graph segmentation, and other optimization operations.
In a preferred embodiment, the calculation map processing method further includes step S130.
In step S130, an operator-level optimization operation is performed on the second computation graph to obtain a third computation graph, where the operator-level optimization operation includes an operator-level address transfer stream optimization operation.
In the present embodiment, referring to fig. 10, the steps of the operator-level address transfer stream optimizing operation include steps S131 to S134.
Step S131: splitting each computing node in the second computing graph into circularly nested operator nodes to form a third computing graph.
In this embodiment, each computing node is split into circularly nested operator nodes to form a third computation graph, for example, a convolution operator may be split into a plurality of matrix vector multiplication operators described by circularly nesting, i.e., a convolution operation may be converted into a plurality of matrix vector multiplication operations.
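The split described above — a convolution rewritten as loop-nested matrix-vector multiplications — can be sketched as follows. This is a minimal sketch under assumed shapes (channels-first layout, stride 1, no padding), not the application's operator splitter:

```python
import numpy as np

def conv2d_as_matvecs(x, w):
    # x: (C, H, W) input; w: (F, C, kH, kW) weights.
    # Each output pixel is one matrix-vector product: the (F, C*kH*kW)
    # weight matrix times the flattened input patch at that position.
    F, C, kH, kW = w.shape
    Wm = w.reshape(F, -1)
    H_out, W_out = x.shape[1] - kH + 1, x.shape[2] - kW + 1
    out = np.empty((F, H_out, W_out))
    for i in range(H_out):          # the loop nest produced by the split
        for j in range(W_out):
            patch = x[:, i:i + kH, j:j + kW].reshape(-1)
            out[:, i, j] = Wm @ patch   # one matrix-vector multiplication
    return out

x = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
w = np.ones((1, 2, 2, 2))
y = conv2d_as_matvecs(x, w)  # shape (1, 2, 2)
```

Each iteration of the inner loop is one of the matrix-vector operator nodes in the third computation graph.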
Step S132: and acquiring tensor data flow of the third computational graph according to the node information of the plurality of operator nodes and the dependency relationship among the operator nodes in the third computational graph.
In this embodiment, the node information of an operator node is the same as that of a computing node and includes input information, parameter information, output information, and attribute information. The input information of each computing node may include the input tensors and input connection relations of the node; the input tensor of the first computing node in the first computation graph may come from user input, and the input tensors of the other computing nodes may be determined from the outputs of their preceding nodes. The input connection relation describes the connections between the input of a computing node and the outputs of other computing nodes in the computation graph. The output information of each computing node includes the output tensors and output connection relations of the node; the output tensor of a computing node can be determined by the input tensor and the parameter tensor of the node together with the node's computation. The output connection relation describes the connections between the output of a computing node and the inputs of other computing nodes in the computation graph. The parameter information includes, but is not limited to, the pre-configured weight parameters required to implement the computational operations of the node. The attribute information characterizes the characteristic attributes of the computing node and may include, but is not limited to, the operation type of the node.
The input tensor and the parameter tensor of an operator node are multiplied to generate an output tensor; the output tensor of an operator node serves as the input of the next operator node, and the sparse tensor data flowing among the plurality of operator nodes in the computation graph forms the tensor data stream.
Step S133: and according to the tensor data flow, the address indexes of the sparse tensor data of the operator nodes are connected in series to form an operator-level address transfer flow.
In this embodiment, the output tensor of any operator node is determined by the input tensor and the parameter tensor of that node together with the node's computation, and each piece of sparse tensor data has a corresponding address index recording the positions of its non-zero elements. Therefore, the address index of the output tensor is determined by the address index of the input tensor, the address index of the parameter tensor, and the computation of the operator node. When the operator node performs computation, the sparse tensor data is zero-value-skip encoded into dense tensor data according to its address indexes; the dense tensor data carries the address information of the non-zero elements of the original sparse tensor, and the computation is then performed on these zero-value-skip encoded dense tensors.
Step S134: and carrying out operator-level optimization on the third calculation graph according to the operator-level address transfer flow.
In this embodiment, the address transfer stream is bound to the tensor data stream in the third computation graph, so that the address index of the output tensor of the operator node can be automatically inferred. For example, in the declaration phase, the address index of the input sparse tensor and the address index of the parameter sparse tensor are declared in the input of the operator node, and the address index of the output tensor is declared in the output of the operator node.
The neural network model in the embodiment of the application is divided into a forward computation part and a backward computation part. Forward propagation, or forward computation, is the model inference process: for a group of inputs, it gives the corresponding outputs. Back propagation, or backward computation, trains the model parameters: gradient descent is applied across all parameters to minimize the loss function of the neural network model on the training data.
In this embodiment, the backward operator conv2d_backward is taken as an example to describe how the address-index generation process is applied to each operator node within a computing node to form an operator-level address transfer stream. Specifically, the inputs of the backward operator conv2d_backward include W i, FTIE i, CTIE i-1, the feature map FM i-1 from the previous layer, and the feature map gradient FG i from the next layer; its outputs include the feature map gradient FG i-1 passed to the previous layer and the weight gradient WG 4×4, which is passed to the optimizer for weight update in each iteration. The whole process comprises three steps: data loading, data computation, and gradient update.
Referring to FIG. 11, during data loading the weight sparse tensor W i is converted into a scaled-down dense weight tensor W dense through the simplified address index; specifically, the non-zero locations in the filter dimension (F 0, F 2) and the channel dimension (C 0, C 2) are accessed directly in memory via the indexes FTIE i and CTIE i-1. The address index TIE adopted by the application avoids the multiple indirect memory accesses of CSR, and the non-repeated indexes in each TIE avoid useless memory accesses. Therefore, the dimension indexes transferred along the address transfer stream can greatly simplify the address generation process and reduce parsing overhead. During data computation, the matrix multiplication of dense tensors causes neither computation discontinuity nor thread divergence, which improves computation efficiency; moreover, the dense tensor requires less memory space. During gradient update, the values of the gradient WG 2×2 are rewritten into WG 4×4 according to the locations of the original non-zero elements. The values elsewhere in WG 4×4 are 0, because the corresponding values in the weight tensor are not involved in the forward-propagation computation; the weight gradient then updates the weights. Since the non-zero positions in the weight tensor W i may change across iterations, this gradient update process ensures computational correctness and can adapt to sparsity changes in application scenarios such as Dropout.
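The gradient-update step — rewriting WG 2×2 into WG 4×4 at the original non-zero positions — is a scatter, which can be sketched as follows (a minimal sketch assuming the non-zero filters and channels are 0 and 2, matching the F 0/F 2, C 0/C 2 example; `scatter_weight_grad` is an illustrative name):

```python
import numpy as np

def scatter_weight_grad(wg_small, f_tie, c_tie, full_shape):
    # Rewrite the dense gradient back into the full-size gradient at
    # the positions of the original non-zero elements; all other
    # entries stay 0, since those weights did not participate in
    # forward propagation.
    wg_full = np.zeros(full_shape)
    wg_full[np.ix_(f_tie, c_tie)] = wg_small
    return wg_full

WG_2x2 = np.array([[0.1, 0.2],
                   [0.3, 0.4]])
WG_4x4 = scatter_weight_grad(WG_2x2, f_tie=[0, 2], c_tie=[0, 2],
                             full_shape=(4, 4))
```

If the non-zero positions of W i change in a later iteration, only `f_tie`/`c_tie` change; the scatter itself stays the same, which is why the scheme tolerates sparsity changes such as Dropout.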
According to the computation graph processing method, a graph level optimization operation is performed on the first computation graph to obtain the second computation graph, wherein the graph level optimization operation comprises an address transfer stream optimization operation that includes: acquiring the tensor data stream of the first computation graph according to the node information and data dependency relations of the plurality of computing nodes in the first computation graph; performing sparsification processing on the tensor data in the tensor data stream to obtain corresponding sparse tensor data, the sparse tensor data being stored in the tensor-aware sparse storage format; concatenating the address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream; and optimizing the first computation graph according to the address transfer stream. By applying the tensor-aware sparse storage format to the computation graph, automatic and efficient sparse computation can be realized and computation efficiency improved.
Furthermore, the tensor-aware sparse storage format adopts a plurality of tensor dimension indexes to represent the position information of the non-zero elements on different tensor dimensions, so that a sparse matrix can be efficiently represented, redundant storage overhead is reduced, and storage efficiency is improved; in addition, the non-zero position can be rapidly positioned by scanning the non-repeated index in the tensor dimension index, invalid memory access is reduced, and the data index efficiency is improved.
Further, the sparse tensor in the tensor data stream can be encoded into a dense tensor through the address index, and the dense tensor carries the address information of the non-zero element of the original sparse tensor, so that the calculated amount can be reduced when the calculation map is executed, and the calculation efficiency is improved.
Further, deducing the shape of the address index of the sparse tensor data according to the shape and the sparsity of the sparse tensor data in the tensor dimension; generating a shape transfer stream according to the shape of the address index of the sparse tensor data and the address transfer stream; according to the method, the first calculation graph after the address flow optimization is optimized according to the shape transfer flow, the shape of tensor data can be automatically inferred, dynamic control on a static calculation graph can be achieved through changing sparsity, recompilation and reconstruction are not needed, and online learning and incremental learning are facilitated.
Fig. 12 is a schematic structural diagram of a calculation map optimizing apparatus according to an embodiment of the present invention. As shown in fig. 12, the computation graph optimization apparatus 200 includes a parsing module 210, a graph-level optimization module 220, and an operator-level optimization module 230.
The parsing module 210 is configured to parse the machine learning model to obtain a first computation graph executable by hardware, where the first computation graph includes a plurality of computation nodes.
In the present embodiment, the machine learning model may be, but is not limited to, a neural network model. The machine learning model is mapped to a high-level intermediate representation (HIR) by a compiler, and the output is expressed in the form of a computation graph to obtain the first computation graph. The first computation graph comprises a plurality of computing nodes and edges and expresses the computation logic of the neural network: each computing node represents a corresponding operation (OP) performed by the neural network (for example, a conv2d node represents a convolution operation, an add node an addition operation, and a mul node a matrix multiplication operation), i.e., one computing node represents one computation task. Edges represent the data flow direction of tensor data between computing nodes: an edge connects a preceding computing node (which may be called a predecessor node) to a subsequent computing node (which may be called a successor node), indicating that the output of the predecessor is the input of the successor.
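The node-and-edge structure just described can be sketched with a minimal in-memory representation (the node names and dictionary layout are illustrative assumptions, not the HIR of the application):

```python
# Minimal computation-graph skeleton: nodes carry an op type,
# edges carry tensor flow from predecessor to successor.
nodes = {
    "n0": {"op": "conv2d"},
    "n1": {"op": "add"},
    "n2": {"op": "mul"},
}
edges = [("n0", "n1"), ("n1", "n2")]  # output of n0 feeds n1, etc.

def successors(node, edges):
    # Successor nodes consume this node's output tensor as their input.
    return [dst for src, dst in edges if src == node]

succ_n0 = successors("n0", edges)  # ["n1"]
```

Tensor data streams, address transfer streams, and shape transfer streams are all attached along these edges in later steps.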
The graph level optimization module 220 is configured to perform a graph level optimization operation on the first computation graph to obtain a second computation graph, where the graph level optimization operation includes an address passing stream optimization operation.
The graph level optimization module 220 comprises an address optimization unit 221 and a shape inference unit 222.
The address optimizing unit 221 is configured to: acquire the tensor data stream of the first computation graph according to the node information and data dependency relations of the plurality of computing nodes in the first computation graph; perform sparsification processing on the tensor data in the tensor data stream to obtain corresponding sparse tensor data, the sparse tensor data being stored in a tensor-aware sparse storage format; and concatenate the address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream.
In this embodiment, the node information of a computing node includes input information, parameter information, output information, and attribute information. The input information of each computing node may include the input tensors and input connection relations of the node; the input tensor of the first computing node in the first computation graph may come from user input, and the input tensors of the other computing nodes may be determined from the outputs of their preceding nodes. The input connection relation describes the connections between the input of a computing node and the outputs of other computing nodes in the computation graph. The output information of each computing node includes the output tensors and output connection relations of the node; the output tensor of a computing node can be determined by the input tensor and the parameter tensor of the node together with the node's computation. The output connection relation describes the connections between the output of a computing node and the inputs of other computing nodes in the computation graph. The parameter information includes, but is not limited to, the pre-configured weight parameters required to implement the computational operations of the node. The attribute information characterizes the characteristic attributes of the computing node and may include, but is not limited to, the operation type of the node.
The input tensor of the computing node and the parameter tensor perform multiplication, convolution and other operations to generate an output tensor, and the output tensor of the computing node is taken as the input of the next computing node, so that sparse tensor data among a plurality of computing nodes in the computing graph form tensor data flow.
The input tensors and parameter tensors involved in the tensor data stream are sparsified, i.e., some elements in the input tensors and parameter tensors are set to 0 to obtain sparse tensor data, and the sparse tensor data is then stored in the tensor-aware sparse storage format (Tensorization-aware Index Entity, TIE). The sparsification manner is, for example, one of structured sparsification, semi-structured sparsification, and unstructured sparsification, but is not limited thereto.
The tensor-aware sparse storage format comprises an address index and non-zero values, wherein the address index comprises at least one tensor dimension index TIE, and a tensor dimension index TIE records the position information of the non-zero elements of the sparse matrix in a tensor dimension. For example, an N-D structured sparse tensor comprises N tensor dimensions and has I k elements in the k-th dimension; the TIE of the k-th tensor dimension is denoted KTIE, and its length is L k.
When the tensor data is a two-dimensional tensor, the address index includes a row dimension index (RTIE) and/or a column dimension index (CTIE); when the tensor data is a four-dimensional tensor, the address index includes a filter dimension index (FTIE) and a channel dimension index (CTIE); when the tensor data is a two-dimensional block tensor, the address index includes a block dimension index (BTIE). The position of a non-zero element in the k-th dimension is located by scanning the non-repeated indexes in KTIE.
Referring to FIG. 3, for a 2-D sparse tensor with the data blocks in R 1 and C 2 set to 0, the address indexes may be denoted as RTIE [0 2] and CTIE [0 1]. Similarly, for a 4-D sparse tensor with the data blocks in F 1 and C 1 set to 0, its address indexes may be denoted as FTIE [0 2] and CTIE [0 2]. For a 2-D block sparse tensor with the data blocks in B 1 and B 2 set to 0, its address index may be denoted as BTIE [0 3].
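Deriving such dimension indexes from a tensor is straightforward to sketch (a hedged illustration, with `dim_tie` an assumed helper name; the example reproduces the 2-D case with R 1 and C 2 zeroed):

```python
import numpy as np

def dim_tie(tensor, axis):
    # Non-repeated indexes, along the given axis, of slices that
    # contain at least one non-zero element.
    other = tuple(a for a in range(tensor.ndim) if a != axis)
    return np.nonzero(np.any(tensor != 0, axis=other))[0]

# 2-D tensor with row R1 and column C2 set to 0.
T = np.ones((3, 3))
T[1, :] = 0   # zero out R1
T[:, 2] = 0   # zero out C2
rtie = dim_tie(T, axis=0)  # [0 2] -> RTIE
ctie = dim_tie(T, axis=1)  # [0 1] -> CTIE
```

The same helper applied along the filter, channel, or block axis yields the FTIE, CTIE, and BTIE cases respectively.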
The tensor-aware sparse storage format TIE provided by the application eliminates the redundancy in pos and crd of conventional sparse storage formats and realizes more efficient storage. The theoretical upper bound of the TIE space complexity is lower, and the memory indexing efficiency is higher. The position of a non-zero element in the k-th dimension can be rapidly located by scanning the non-repeated indexes in KTIE.
The output tensor of any computing node can be determined by the input tensor and the parameter tensor of that node together with the node's computation, and each piece of sparse tensor data has a corresponding address index recording the positions of its non-zero elements; therefore, the address index of the output tensor is determined by the address index of the input tensor, the address index of the parameter tensor, and the computation of the node. When the computing node performs computation, the sparse tensor data is zero-value-skip encoded into dense tensor data according to its address indexes; the dense tensor data carries the address information of the non-zero elements of the original sparse tensor, and the computation is then performed on these zero-value-skip encoded dense tensors.
Fig. 7 takes a Conv2d operation as an example, where the current node takes data and a weight matrix W 0 as inputs, performs a convolution operation, and outputs a feature map FM 0. Only one element in the column dimension of data is non-zero, with address index CTIE 0 [0]; only four elements in the filter dimension (i.e., the F dimension) of the weight matrix W 0 are non-zero, with address index FTIE 0 [0 2]. When executing the first node, according to the address index CTIE 0 [0] of the input tensor, only 1 and 5 in the weight matrix W 0 are activated and 2 and 6 are skipped, i.e., only the non-zero elements associated with C 0 are considered; the address index of the feature map FM 0 is CTIE 1 [0 2]. The next node takes the feature map FM 0 and the weight matrix W 1 as inputs, performs a convolution operation, and outputs a feature map FM 1. The address index of FM 0 is CTIE 1 [0 2] and the address index of W 1 is FTIE 1 [0]; for the activated elements 10 and 50 of FM 0, only 1 and 5 of the weight matrix W 1 are activated and 3 is skipped, i.e., only the non-zero elements associated with C 0 and C 2 are considered, and the address index of the feature map FM 1 is CTIE 1 [0].
Binding the address transfer stream with the tensor data stream in the first computational graph may automatically infer an address index of the output tensor of the computational node. For example, in the declaration phase, the address index of the input sparse tensor and the address index of the parameter sparse tensor are declared in the input of the compute node, and the address index of the output tensor are declared in the output of the compute node.
The shape inference unit 222 is configured to: infer the shape of the address index of the sparse tensor data according to the shape and sparsity of the sparse tensor data in each tensor dimension; generate a shape transfer stream according to the shapes of the address indexes of the sparse tensor data and the address transfer stream; and optimize, according to the shape transfer stream, the first computation graph after address-stream optimization. In this embodiment, the node information of the computing nodes further carries the sparsity of the sparse tensor data in each tensor dimension, and the shape of the address index of the sparse tensor data can be inferred from the shape and sparsity of the sparse tensor data in that dimension; the shape of the address index of the output tensor of any computing node is determined by the shape of the address index of the input tensor, the shape of the address index of the parameter tensor, and the computation of the node. For example, in the declaration of the computing node, the computation formula of the output tensor is: [CTIE 1, FM 0] = conv2d(I0 = data, I1 = W 0, I2 = CTIE 0, I3 = FTIE 0, F = 3, sparsityF = 1/3, C = 2, sparsityC = 1/2), where FM 0 is the output tensor, CTIE 1 is the address index of the output tensor, data is the input tensor, W 0 is the weight tensor, CTIE 0 is the address index of the input tensor data, FTIE 0 is the address index of the weight tensor W 0, F is the length of the weight tensor W 0 in the filter dimension, sparsityF is the sparsity of the weight tensor in the filter dimension (i.e., the percentage of sparse filters), C is the length of the input tensor data in the channel dimension, and sparsityC is the sparsity of the input tensor data in the channel dimension (i.e., the percentage of sparse channels).
The length of CTIE 0 can then be calculated as C × (1 − sparsityC) = 1, so the shape of CTIE 0 is (1); similarly, the shape of FTIE 0 can be inferred to be (2) from F × (1 − sparsityF) = 2. The shape of a trailing TIE can be updated based on the shape of the leading TIE. For example, CTIE 1 has the same shape as FTIE 0, namely (2).
The application binds the shape transfer stream with the tensor data stream in the first computational graph, and can automatically infer the shape of the address index of the output tensor of the computational node. Thus, dynamic control of the static computational graph can be achieved by changing sparsity without recompilation and reconstruction, which is beneficial to online learning and incremental learning.
In some embodiments, the graph-level optimization operation further includes, but is not limited to, graph pruning, graph fusion, graph segmentation, and other optimization operations.
The computation graph processing apparatus further includes an operator-level optimization module 230, configured to perform an operator-level optimization operation on the second computation graph to obtain a third computation graph, where the operator-level optimization operation includes an operator-level address transfer stream optimization operation.
The operator-level optimization module includes an operator-level address optimization unit 231, configured to: split each computing node in the second computation graph into circularly nested operator nodes to form a third computation graph; acquire the tensor data stream of the third computation graph according to the node information of the plurality of operator nodes in the third computation graph and the dependency relations among the operator nodes; and concatenate the address indexes of the sparse tensor data of the operator nodes into an operator-level address transfer stream according to the tensor data stream.
In this embodiment, each computing node is split into circularly nested operator nodes to form a third computation graph, for example, a convolution operator may be split into a plurality of matrix vector multiplication operators described by circularly nesting, i.e., a convolution operation may be converted into a plurality of matrix vector multiplication operations. The input tensor of the operator nodes and the parameter tensor are multiplied to generate an output tensor, the output tensor of the operator nodes is used as the input of the next operator node, and sparse tensor data among a plurality of operator nodes in the calculation graph form tensor data flow. The address index of the output tensor of any operator node is determined by the address index of the input tensor of the operator node and the address index of the parameter tensor and the calculation of the operator node. Binding the address transfer stream with the tensor data stream in the third computational graph can automatically infer the address index of the output tensor of the operator node.
The application provides a computation graph processing device, which obtains a second computation graph by performing a graph level optimization operation on the first computation graph, wherein the graph level optimization operation comprises an address transfer stream optimization operation that includes: acquiring the tensor data stream of the first computation graph according to the node information and data dependency relations of the plurality of computing nodes in the first computation graph; performing sparsification processing on the tensor data in the tensor data stream to obtain corresponding sparse tensor data, the sparse tensor data being stored in the tensor-aware sparse storage format; and concatenating the address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream. By applying the tensor-aware sparse storage format to the computation graph, automatic and efficient sparse computation can be realized and computation efficiency improved.
Furthermore, the tensor-aware sparse storage format adopts a plurality of tensor dimension indexes to represent the position information of the non-zero elements on different tensor dimensions, so that a sparse matrix can be efficiently represented, redundant storage overhead is reduced, and storage efficiency is improved; in addition, the non-zero position can be rapidly positioned by scanning the non-repeated index in the tensor dimension index, invalid memory access is reduced, and the data index efficiency is improved.
Further, the sparse tensor in the tensor data stream can be encoded into a dense tensor through the address index, and the dense tensor carries the address information of the non-zero element of the original sparse tensor, so that the calculated amount can be reduced when the calculation map is executed, and the calculation efficiency is improved.
Further, the shape of the address index of the sparse tensor data is deduced from the shape and sparsity of the sparse tensor data in the tensor dimension, and a shape transfer stream is generated from the shapes of the address indexes and the address transfer stream. By optimizing the address-stream-optimized first computational graph according to the shape transfer stream, the shapes of tensor data can be inferred automatically, and the static computational graph can be controlled dynamically by changing the sparsity, without recompilation or reconstruction, which facilitates online learning and incremental learning.
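A sketch of such shape inference, under the assumption that sparsity is the fraction of zero elements and that each tensor dimension index holds one entry per non-zero element (the function and its return structure are illustrative, not the patented scheme):

```python
def infer_address_index_shape(tensor_shape, sparsity):
    """Infer the shape of a sparse tensor's address index purely from
    the tensor's shape and its sparsity, without touching the data.
    Changing the sparsity value changes the inferred shapes, so the
    static graph can be re-specialised without recompilation."""
    total = 1
    for dim in tensor_shape:
        total *= dim
    nnz = int(round(total * (1.0 - sparsity)))
    # one dimension index per tensor dimension, each holding one entry
    # per non-zero element, plus the non-zero value array
    return {"dimension_indexes": [(nnz,)] * len(tensor_shape),
            "values": (nnz,)}

print(infer_address_index_shape((128, 128), 0.9))
# {'dimension_indexes': [(1638,), (1638,)], 'values': (1638,)}
```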
Fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 13, the electronic device 100 of this embodiment includes: at least one processor 101 (only one is shown in Fig. 13), a memory 102, and a computer program 103 stored in the memory 102 and executable on the at least one processor 101; the processor 101 implements the steps of the computational graph processing method described above when executing the computer program 103.
The electronic device can be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing device. The electronic device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that Fig. 13 is merely an example of the electronic device 100 and does not limit the electronic device 100, which may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input/output devices, network access devices, and the like.
The processor 101 may be a Central Processing Unit (CPU); the processor 101 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 102 may in some embodiments be an internal storage unit of the electronic device 100, such as a hard disk or a memory of the electronic device 100. The memory 102 may also be an external storage device of the electronic device 100 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 100. Further, the memory 102 may include both an internal storage unit and an external storage device of the electronic device 100. The memory 102 is used to store an operating system, application programs, a boot loader, data, and other programs, such as the program code of the computer program. The memory 102 may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the various method embodiments described above.
Embodiments of the present application further provide a computer program product which, when run on an electronic device, causes the electronic device to perform the steps of the various method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to an apparatus/electronic device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium may not include electrical carrier signals and telecommunications signals.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not detailed or illustrated in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other manners. For example, the apparatus/electronic device embodiments described above are merely illustrative; the division into modules or units is merely a logical functional division, and other divisions are possible in actual implementation — multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The embodiments of the present invention described above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best utilize the invention with various modifications suited to the particular use contemplated. The invention is limited only by the claims and their full scope and equivalents.

Claims (10)

1. A computational graph processing method, comprising:
parsing a machine learning model to obtain a first computational graph, the first computational graph comprising a plurality of computational nodes;
performing graph level optimization operation on the first calculation graph to obtain a second calculation graph;
wherein the graph level optimization operation comprises an address transfer stream optimization operation, and the address transfer stream optimization operation comprises:
acquiring a tensor data stream of the first computational graph according to node information and data dependency relations of the plurality of computing nodes in the first computational graph;
performing sparsification processing on tensor data in the tensor data stream to obtain corresponding sparse tensor data, wherein the sparse tensor data is stored in a tensor-aware sparse storage format;
concatenating address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream; and
optimizing the first computational graph according to the address transfer stream;
wherein the tensor-aware sparse storage format comprises an address index and non-zero values, the address index comprises at least one tensor dimension index, the tensor dimension index is position information of non-zero elements of the sparse tensor data in a tensor dimension, and the non-zero values are the values of the non-zero elements in the sparse tensor data.
2. The computational graph processing method of claim 1, wherein when the sparse tensor data is a two-dimensional tensor, the address index includes a row dimension index and/or a column dimension index;
when the sparse tensor data is a four-dimensional tensor, the address index comprises a filter dimension index and/or a channel dimension index;
when the sparse tensor data is a two-dimensional block tensor, the address index includes a block dimension index.
3. The computational graph processing method of claim 1, wherein the graph-level optimization operations further comprise a shape inference optimization operation, the shape inference optimization operation comprising:
deducing the shape of the address index of the sparse tensor data according to the shape and sparsity of the sparse tensor data in a tensor dimension;
concatenating the shapes of the address indexes of the sparse tensor data into a shape transfer stream according to the address transfer stream; and
optimizing the first computational graph after the address transfer stream optimization according to the shape transfer stream.
4. The calculation map processing method according to claim 1, characterized by further comprising:
performing an operator-level optimization operation on the second computational graph to obtain a third computational graph, wherein the operator-level optimization operation comprises an operator-level address transfer stream optimization operation;
wherein the operator-level address transfer stream optimization operation comprises:
splitting each computing node in the second computational graph into circularly nested operator nodes to form a third computational graph;
acquiring a tensor data stream of the third computational graph according to node information of a plurality of operator nodes in the third computational graph and dependency relations among the operator nodes;
concatenating address indexes of sparse tensor data of the operator nodes into an operator-level address transfer stream according to the tensor data stream; and
performing operator-level optimization on the third computational graph according to the operator-level address transfer stream.
5. A calculation map processing apparatus, comprising:
the analysis module is used for analyzing the machine learning model to obtain a first computing graph which is executable by hardware and comprises a plurality of computing nodes;
a graph level optimization module, configured to perform a graph level optimization operation on the first computational graph to obtain a second computational graph, wherein the graph level optimization operation comprises an address transfer stream optimization operation;
wherein the graph level optimization module comprises:
an address optimization unit, configured to acquire a tensor data stream of the first computational graph according to node information and data dependency relations of the plurality of computing nodes in the first computational graph; perform sparsification processing on tensor data in the tensor data stream to obtain corresponding sparse tensor data, wherein the sparse tensor data is stored in a tensor-aware sparse storage format; concatenate address indexes of the sparse tensor data into an address transfer stream according to the tensor data stream; and optimize the first computational graph according to the address transfer stream;
The tensor-aware sparse storage format comprises an address index and a non-zero value, wherein the address index comprises at least one tensor dimension index, the tensor dimension index is position information of a non-zero element in sparse tensor data on a tensor dimension, and the non-zero value is a value of the non-zero element in the sparse tensor data.
6. The computational graph processing apparatus of claim 5, wherein when the sparse tensor data is a two-dimensional tensor, the address index comprises a row tensor dimension index and/or a column tensor dimension index;
when the sparse tensor data is a four-dimensional tensor, the address index comprises a filter tensor dimension index and/or a channel tensor dimension index;
when the sparse tensor data is a two-dimensional block tensor, the address index comprises a block tensor dimension index.
7. The computational graph processing apparatus of claim 5 wherein the graph level optimization module further comprises:
A shape inference unit configured to infer a shape of an address index of the sparse tensor data from a shape and sparsity of the sparse tensor data in a tensor dimension; generating a shape transfer stream according to the shape of the address index of the sparse tensor data and the address transfer stream; and optimizing the first calculation graph after the address flow optimization according to the shape transfer flow.
8. The computational graph processing apparatus of claim 5, further comprising:
The operator-level optimization module is used for performing operator-level optimization operation on the second computation graph to obtain a third computation graph, and the operator-level optimization operation comprises operator-level address transfer stream optimization operation;
wherein the operator level optimization module comprises:
An operator-level address optimization unit, configured to split each computing node in the second computational graph into circularly nested operator nodes to form a third computational graph; acquire a tensor data stream of the third computational graph according to node information of a plurality of operator nodes in the third computational graph and dependency relations among the operator nodes; and concatenate address indexes of sparse tensor data of the operator nodes into an operator-level address transfer stream according to the tensor data stream.
9. A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the computational graph processing method of any one of claims 1-4.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program when executed by the processor implements the computational graph processing method of any one of claims 1-4.
CN202311861591.7A 2023-12-29 2023-12-29 Calculation map processing method and device, electronic equipment and storage medium Active CN117764122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311861591.7A CN117764122B (en) 2023-12-29 2023-12-29 Calculation map processing method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN117764122A CN117764122A (en) 2024-03-26
CN117764122B true CN117764122B (en) 2024-06-25

Family

ID=90316382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311861591.7A Active CN117764122B (en) 2023-12-29 2023-12-29 Calculation map processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117764122B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132150A (en) * 2024-05-07 2024-06-04 中科寒武纪科技股份有限公司 Data access mode deducing method of calculation graph and related product

Citations (2)

Publication number Priority date Publication date Assignee Title
CN111914774A (en) * 2020-05-07 2020-11-10 清华大学 3D object detection method and device based on sparse convolutional neural network
CN113705798A (en) * 2020-05-21 2021-11-26 平头哥(上海)半导体技术有限公司 Processing unit, computing device and computation graph optimization method of deep learning model

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US20200042216A1 (en) * 2018-08-03 2020-02-06 Alibaba Group Holding Limited Storage-based graph for enabling computation graph optimization
WO2020182989A1 (en) * 2019-03-13 2020-09-17 Deepmind Technologies Limited Scheduling computation graphs using neural networks
CN115437637A (en) * 2021-06-02 2022-12-06 华为技术有限公司 Compiling method and related device
WO2023093623A1 (en) * 2021-11-29 2023-06-01 中科寒武纪科技股份有限公司 Computation graph optimization method, data processing method and related product




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant