WO2022217419A1 - Neural network model inference method and apparatus, computer device, and storage medium

Neural network model inference method and apparatus, computer device, and storage medium

Info

Publication number
WO2022217419A1
Authority
WO
WIPO (PCT)
Prior art keywords
data structure
connection layer
optimized
layer
neural network
Application number
PCT/CN2021/086552
Other languages
French (fr)
Chinese (zh)
Inventor
庄奇
Original Assignee
深圳元戎启行科技有限公司
Application filed by 深圳元戎启行科技有限公司
Priority to PCT/CN2021/086552
Priority to CN202180050194.4A
Publication of WO2022217419A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models

Definitions

  • the present application relates to a neural network model inference method, apparatus, computer equipment and storage medium.
  • the inference of a neural network model refers to deploying a pre-trained neural network model into actual business scenarios, such as image classification, object detection, and online translation; that is, the process of inputting data into the neural network model and obtaining output data from it.
  • however, the inference of the neural network model takes considerable time.
  • the traditional method is to fuse the connection layer, that is, the input tensors corresponding to the connection layer are written directly into the output tensor of the connection layer, and the input tensors and the connection layer are deleted, so as to reduce memory space occupation and memory copy time.
  • According to various embodiments disclosed in the present application, a neural network model inference method, apparatus, computer device and storage medium are provided.
  • a neural network model inference method, comprising:
  • acquiring a neural network model inference task, where the neural network model inference task includes a model identifier;
  • obtaining a neural network model corresponding to the model identifier, and parsing the neural network model to obtain a computation graph corresponding to the neural network model, where the computation graph includes a connection layer;
  • obtaining a pre-built data structure template, and generating a target sub-data structure corresponding to the computation graph according to the data structure template;
  • determining the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer;
  • performing optimization processing on the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
  • performing inference according to the optimized neural network model to obtain a model inference result.
  • a neural network model inference apparatus, comprising:
  • a task acquisition module, configured to acquire a neural network model inference task, where the neural network model inference task includes a model identifier;
  • a model parsing module, configured to obtain a neural network model corresponding to the model identifier, and parse the neural network model to obtain a computation graph corresponding to the neural network model, where the computation graph includes a connection layer;
  • a structure generation module, configured to obtain a pre-built data structure template, and generate a target sub-data structure corresponding to the computation graph according to the data structure template;
  • a data determination module, configured to determine the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer;
  • a model optimization module, configured to perform optimization processing on the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
  • a model inference module, configured to perform inference according to the optimized neural network model to obtain a model inference result.
  • a computer device, comprising a memory and one or more processors, the memory having computer-readable instructions stored therein, wherein the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to perform the following steps:
  • acquiring a neural network model inference task, where the neural network model inference task includes a model identifier;
  • obtaining a neural network model corresponding to the model identifier, and parsing the neural network model to obtain a computation graph corresponding to the neural network model, where the computation graph includes a connection layer;
  • obtaining a pre-built data structure template, and generating a target sub-data structure corresponding to the computation graph according to the data structure template;
  • determining the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer;
  • performing optimization processing on the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
  • performing inference according to the optimized neural network model to obtain a model inference result.
  • One or more computer storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • acquiring a neural network model inference task, where the neural network model inference task includes a model identifier;
  • obtaining a neural network model corresponding to the model identifier, and parsing the neural network model to obtain a computation graph corresponding to the neural network model, where the computation graph includes a connection layer;
  • obtaining a pre-built data structure template, and generating a target sub-data structure corresponding to the computation graph according to the data structure template;
  • determining the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer;
  • performing optimization processing on the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
  • performing inference according to the optimized neural network model to obtain a model inference result.
  • FIG. 1 is an application environment diagram of a neural network model inference method in one or more embodiments.
  • FIG. 2 is a schematic flowchart of a neural network model inference method in one or more embodiments.
  • FIG. 3 is a schematic diagram of a computational graph in one or more embodiments.
  • FIG. 4 is a schematic diagram of a computation graph obtained by performing connection layer fusion on the computation graph shown in FIG. 3 in one or more embodiments.
  • FIG. 5 is a schematic diagram of a computational graph including complex join operations in one or more embodiments.
  • FIG. 6 is a schematic diagram of a data structure template in one or more embodiments.
  • FIG. 7 is a schematic flowchart of a step of generating a target sub-data structure corresponding to a computation graph according to a data structure template in one or more embodiments.
  • FIG. 8 is a schematic flowchart of the step of generating an optimized data structure of the output tensor corresponding to a connection layer according to a data structure template and a computation graph in one or more embodiments.
  • FIG. 9 is a block diagram of a neural network model inference apparatus in one or more embodiments.
  • FIG. 10 is a block diagram of a computer device in one or more embodiments.
  • the neural network model inference method provided in this application can be applied to computer equipment, and the computer equipment can be a terminal or a server. It can be understood that the neural network model inference method provided by the present application can be applied to a terminal, a server, or a system including a terminal and a server, and is realized through interaction between the terminal and the server.
  • the neural network model inference method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the terminal 102 communicates with the server 104 through the network.
  • the terminal 102 can acquire the model inference task, and the model inference task carries the model identifier.
  • the terminal 102 obtains the neural network model corresponding to the model identifier, parses the neural network model, and obtains a computation graph corresponding to the neural network model, where the computation graph includes a connection layer; it then obtains a pre-built data structure template, generates the target sub-data structure corresponding to the computation graph according to the data structure template, determines the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer, and performs optimization processing on the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; inference is then carried out according to the optimized neural network model, and the model inference result is obtained.
  • the terminal 102 may specifically include, but is not limited to, various personal computers, notebook computers, smart phones, and tablet computers.
  • the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
  • the neural network model inference method provided by the present application implements inference on the neural network model, can be applied to various application environments, and the neural network model can include various types.
  • the neural network model may include a convolutional neural network model, a recurrent neural network model, a recursive neural network model, and the like.
  • Neural network models can be used to process many different kinds of data.
  • the neural network model may specifically include an image recognition model, a feature extraction model, a speech recognition model, a text recognition model, a scene classification model, and the like.
  • the inference method for the neural network model provided by the present application can be specifically applied to the field of automatic driving, where the neural network model may specifically include at least one of an image recognition model or a trajectory prediction model.
  • the inference method for the neural network model provided by the present application can also be applied in the text field, where the neural network model may specifically include at least one of an image recognition model, a behavior prediction model, or a risk assessment model.
  • a neural network model inference method is provided; the method is described by taking its application to a computer device as an example.
  • the computer device may be the terminal or the server in FIG. 1. The method includes the following steps:
  • Step 202 Acquire a model inference task, where the model inference task carries a model identifier.
  • Model inference refers to the process of operating on the data input into the neural network model, following the sequence of the model's network structure and the arithmetic operations of the multiple network layers included in that structure, so as to obtain the inference result output by the neural network model.
  • the model inference task is used to instruct the computer device to infer the corresponding neural network model.
  • the computer equipment can be a terminal or a server.
  • the model identifier refers to a unique identifier for marking a neural network model, and is used to distinguish between neural network models.
  • the computer device can acquire the model inference task, and analyze the model inference task, thereby obtaining the model identifier carried in the model inference task.
  • the computer device can determine the neural network model specified by the user according to the received user operation instruction, and generate a model inference task carrying the model identifier.
  • the computer equipment can also determine the neural network model that needs to be called according to the actual operation requirements, and generate model inference tasks.
  • for example, when an image needs to be recognized, the computer device can generate a model inference task, so that after the image is input, inference is performed on the image recognition model according to the model inference task, and the image recognition result output by the image recognition model is obtained.
  • the computer device may store an inference engine in advance, and the computer device may perform a model inference task through the inference engine, and perform inference on the neural network model corresponding to the model identifier.
  • An inference engine refers to a functional module in a computer device that is used to complete inference.
  • Step 204 Obtain a neural network model corresponding to the model identifier, analyze the neural network model, and obtain a computation graph corresponding to the neural network model, and the computation graph includes a connection layer.
  • a neural network model is pre-stored in the computer device, and the neural network model is obtained by training a large amount of sample data, so that a corresponding neural network model can be obtained according to the model identifier.
  • the neural network model corresponding to the model identifier may include at least one of various types of neural network models. For example, depending on the network structure of the neural network model, it may specifically include at least one of a convolutional neural network model (Convolutional Neural Networks, CNN for short), a recurrent neural network model (Recurrent Neural Network, RNN for short), or a recursive neural network model. According to the different functions of the neural network model, it may specifically include at least one of an image recognition model, a feature extraction model, a speech recognition model, a text recognition model, and a scene classification model.
  • the computer device analyzes the acquired neural network model, and obtains a calculation graph corresponding to the neural network model.
  • the computation graph can be an abstract graph of the neural network model in the model inference process, and can include multiple operation layers, tensors corresponding to each operation layer, and directed edges between the operation layers and corresponding tensors.
  • the operation layer can be used to represent the network layer in the network structure of the neural network model, and the operation layer can be used to determine the arithmetic operation performed by the corresponding network layer, such as convolution operation, full connection operation, connection operation, etc.
  • A tensor is a data structure that can be understood as a vector or an array matrix.
  • the shape of a tensor can be represented by dimensions, a one-dimensional tensor can be called a vector, and a tensor with more than two dimensions can be called an array matrix.
  • Tensors include input tensors and output tensors. The input tensors can be used to represent the input data corresponding to the operation layer, and the output tensors can be used to represent the output data of the operation layer.
  • Each operation layer may include multiple pointers to different input tensors and multiple pointers to different output tensors, and each tensor may include one pointer to the production layer and multiple pointers to different demand layers.
  • the production layer of a tensor refers to the operation layer from which the tensor is obtained;
  • the demand layer of a tensor refers to the operation layers in which the tensor is used as an input tensor.
  • the directed edge between the operation layer and the corresponding tensor in the computation graph can be generated through the pointer in the operation layer and the pointer in the tensor.
  • the computer device can refer to the operation layer corresponding to the connection operation as the connection layer.
  • the connection layer is used to splicing the obtained multiple input tensors.
  • the data stored in the output tensor of the connection layer and the data stored in the multiple input tensors are completely identical.
  • a schematic diagram of the computation graph may be shown in FIG. 3 , and the computation graph includes an operation layer layer1, an operation layer layer2, an operation layer layer3, and input tensors and output tensors corresponding to each operation layer.
  • the output tensor tensor1 of the operation layer layer1 and the output tensor tensor2 of the operation layer layer2 are used as the input of layer3, and layer3 connects tensor1 and tensor2 to obtain the output tensor tensor3.
  • Shape in tensor1, tensor2 and tensor3 represents the shape of the tensor, which can be represented by dimensions.
  • the shape in tensor1, 1 × 3 × 40 × 40, represents, from left to right, the 0th dimension, the 1st dimension, the 2nd dimension, and the 3rd dimension.
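  • to make the pointer structure concrete, the following is a minimal illustrative sketch of the FIG. 3 graph in Python (the patent contains no code; the Tensor/Layer classes and the connect helper are hypothetical, and the concrete shapes of tensor2 and tensor3 are assumed for illustration):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass(eq=False)  # eq=False keeps identity hashing, so tensors and
class Tensor:         # layers can later serve as dictionary keys
    name: str
    shape: Tuple[int, ...]                    # e.g. (1, 3, 40, 40)
    producer: Optional["Layer"] = None        # pointer to the production layer
    consumers: List["Layer"] = field(default_factory=list)  # demand layers

@dataclass(eq=False)
class Layer:
    name: str
    op: str                                   # arithmetic operation, e.g. "concat"
    inputs: List[Tensor] = field(default_factory=list)
    outputs: List[Tensor] = field(default_factory=list)

def connect(layer: Layer, in_tensors: List[Tensor], out_tensor: Tensor) -> None:
    """Create the directed edges between an operation layer and its tensors."""
    layer.inputs, layer.outputs = list(in_tensors), [out_tensor]
    out_tensor.producer = layer
    for t in in_tensors:
        t.consumers.append(layer)

# FIG. 3: layer3 connects tensor1 and tensor2 into tensor3.
tensor1 = Tensor("tensor1", (1, 3, 40, 40))
tensor2 = Tensor("tensor2", (1, 3, 40, 40))   # shape assumed for illustration
tensor3 = Tensor("tensor3", (1, 6, 40, 40))   # assumed concatenation along dim 1
layer3 = Layer("layer3", op="concat")
connect(layer3, [tensor1, tensor2], tensor3)
```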
  • Step 206 Obtain a pre-built data structure template, and generate a target sub-data structure corresponding to the calculation graph according to the data structure template.
  • the pre-built data structure template refers to the data structure template required to optimize the computational graph.
  • the data structure template is a new data structure template built on the basis of the data structure of the traditional computational graph.
  • the target sub-data structure refers to the data structure of the optimized tensors in the computation graph.
  • among the operation layers, the connection layer can be used to splice the multiple input tensors it obtains, and the data saved in the output tensor of the connection layer is exactly the same as the data held in the multiple input tensors.
  • the connection operation of the connection layer will cause duplicate data to be stored in the memory, and also lead to unnecessary memory copy time.
  • the connection operations of the connection layer will occupy more memory space, and the memory copy time will become longer and longer, resulting in a long inference time and a slow inference speed of the neural network model, which is not conducive to data processing with high real-time requirements.
  • in the traditional method, the connection layer is fused and the duplicated data is deleted to reduce memory space occupation and memory copy time.
  • however, the traditional method is only suitable for simple connection operations, for example, when the input tensors of the connection layer are connected only to the connection layer.
  • when a tensor is used as the input of multiple operation layers and some of those operation layers do not participate in the connection operation, fusing the connection layer in the traditional way produces incorrect inference results.
  • in the case of continuous connection layers, traditional connection layer fusion also has the problem that the position of an optimized tensor within the output tensor of the last connection layer cannot be determined, so model inference cannot be performed.
  • when the output tensor of one operation layer is connected with the output tensors of multiple operation layers at the same time, that is, when the output tensor of this operation layer needs to be input into multiple connection layers, performing connection layer fusion in the traditional way means the output tensor of the operation layer cannot be stored in multiple connection layers at the same time, which leads to errors in the model inference results.
  • therefore, the computer device builds a new data structure template in advance and stores it; when model inference is required, the pre-built data structure template is obtained, and the data structure of the computation graph is optimized according to the template, so that the connection layers can be accurately fused in all situations, ensuring correct inference results while reducing the memory space occupied by the inference process, reducing memory copy time, and accelerating inference.
  • a schematic diagram of a computation graph obtained by performing connection layer fusion on the computation graph shown in FIG. 3 in a traditional manner may be as shown in FIG. 4 .
  • the connection operation can be called a memory copy operation.
  • a schematic diagram of a computational graph including complex connection operations may be shown in FIG. 5 .
  • Layer1-layer7 are all operation layers
  • layer3, layer5 and layer7 are connection layers.
  • the consumer layers pointed to by the dotted arrows from tensor1, tensor3 and tensor5 refer to the demand layers, other than the connection layers, corresponding to each tensor.
  • the pre-built data structure template may be the output structure template of the output tensor corresponding to the connection layer in the calculation graph.
  • the schematic diagram of the data structure template may be as shown in FIG. 6, including multiple template items, sub_producer_layers, sub_consumer_layers, sub_shapes, sub_producer_index_table, sub_consumer_index_table, and sub_consumer_stride_table.
  • sub_producer_layers is used to store the production layers of the output tensor corresponding to the connection layer, including the production layers of the input tensors that need to be preserved; sub_consumer_layers is used to store the demand layers of the output tensor corresponding to the connection layer, including the demand layers of the input tensors that need to be preserved;
  • sub_shapes is used to store the shape of the output tensor corresponding to the connection layer, including the shapes of the input tensors to be preserved;
  • sub_producer_index_table is used to store the index corresponding to each production layer in sub_producer_layers;
  • sub_consumer_index_table is used to store the input tensor index corresponding to each demand layer in sub_consumer_layers; and
  • sub_consumer_stride_table is used to store the size of the input tensor corresponding to each demand layer in sub_consumer_layers.
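  • rendered as code, the template of FIG. 6 could look as follows; this is a sketch under the assumption that each template item is a list or lookup table (the field names come from the patent, the Python container types do not):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ConcatOutputTemplate:
    """Data structure template for the output tensor of a connection layer."""
    # Production layers of the input tensors that must be preserved.
    sub_producer_layers: List[str] = field(default_factory=list)
    # Demand layers of the preserved input tensors (one group per tensor).
    sub_consumer_layers: List[List[str]] = field(default_factory=list)
    # Shapes of the preserved input tensors.
    sub_shapes: List[Tuple[int, ...]] = field(default_factory=list)
    # Index of each production layer in sub_producer_layers.
    sub_producer_index_table: Dict[str, int] = field(default_factory=dict)
    # For each demand layer: index of its input tensor in the output tensor.
    sub_consumer_index_table: Dict[str, int] = field(default_factory=dict)
    # For each demand layer: size of its input tensor.
    sub_consumer_stride_table: Dict[str, int] = field(default_factory=dict)
```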
  • each operation layer includes multiple pointers to different input tensors and multiple pointers to different output tensors
  • when the connection operation in the network structure of the neural network model is complex, for example when a tensor is used as the input of multiple operation layers and some of those operation layers do not participate in the connection operation, then after the connection layer is optimized those other operation layers lose their original input tensors: the output tensor of the connection layer would be used as their input tensor instead, resulting in errors in the output results of those operation layers.
  • if connection layer fusion is performed in the traditional way, there is not only the problem of wrong output results, but also the inability to determine the positions of tensor1, tensor2, tensor4 and tensor6 within tensor7 after fusion, which leaves the deleted connection layers layer3, layer5 and layer7, and the consumer layers, unable to determine their corresponding input tensors, resulting in errors in the inference results.
  • the computer device may optimize the data structure of the computation graph according to the above-mentioned pre-built data structure template. Specifically, the computer device may obtain the original data structure of the output tensor corresponding to each connection layer in the calculation graph according to the data structure template.
  • the original data structure can include the output tensor itself, as well as a pointer to the production layer and multiple pointers to different demand layers.
  • the computer device can optimize the original data structure according to the data structure template, completing the data structure template according to the computation graph and the output tensors corresponding to each connection layer in the computation graph, so as to obtain the optimized data structure of the output tensor corresponding to each connection layer; the target sub-data structure corresponding to the computation graph is then determined from the generated optimized data structures.
  • the target sub-data structure can include the production layers, the demand layers, the shape of the output tensor corresponding to the connection layer, the index corresponding to each production layer, the input tensor index corresponding to each demand layer, and the size of the input tensor corresponding to each demand layer.
  • the production layer and the demand layer of the output tensor corresponding to the connection layer are used to determine the original input tensor of the demand layer that does not participate in the connection operation, so as to ensure the correct output result.
  • the shape of the output tensor is used to preserve the connection performed by the connection layer.
  • the input tensor index corresponding to each demand layer and the size of the input tensor corresponding to each demand layer are used to indicate the position of each input tensor in the output tensor; in the case of continuous connection layers, each demand layer can find its corresponding input tensor according to that position.
  • Step 208 Determine the connection layer data to be optimized in the calculation graph according to the target sub-data structure and the connection layer.
  • Step 210 Perform optimization processing on the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model.
  • the target sub-data structure refers to the optimized data structure that can correctly fuse the connection layers.
  • the computer device may determine the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer in the computation graph.
  • the connection layer data to be optimized may include connection layers in the computation graph and input tensors corresponding to the connection layers.
  • the computer device may perform optimization processing on the connection layer data to be optimized according to the target sub-data structure; the optimization processing may be to fuse the connection layer, that is, to delete the connection layer data to be optimized.
  • performing optimization processing on the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model includes: deleting the connection layer data to be optimized to obtain a deleted computation graph; and connecting the deleted computation graph according to the target data structure to obtain the optimized neural network model.
  • the computer device can sequentially connect the deleted computation graph according to the operation layer order in the target data structure, so as to obtain the optimized computation graph, and then obtain the optimized neural network model from the optimized computation graph.
  • by deleting the connection layer data to be optimized and reconnecting the deleted computation graph, the memory space occupied by the inference process can be reduced, the memory copy time can be reduced, and the inference speed can be accelerated (see the sketch below).
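  • a minimal sketch of this optimization step, reusing the Layer/Tensor objects sketched earlier; the argument names are hypothetical, with fuse_layers/fuse_tensors standing for the connection layer data to be optimized, fused_output for the surviving output tensor, and layer_order for the operation layer order recorded in the target data structure:

```python
def optimize_graph(layers, fuse_layers, fuse_tensors, fused_output, layer_order):
    """Delete the connection layer data to be optimized, then reconnect the
    remaining operation layers in the recorded layer order."""
    kept = [l for l in layers if l not in fuse_layers]
    for layer in kept:
        # A former consumer of a deleted input tensor now reads the fused
        # output tensor; its slice is located via the index/stride tables.
        layer.inputs = [fused_output if t in fuse_tensors else t
                        for t in layer.inputs]
    # Sequentially connect the deleted graph per the target data structure
    # (assumes every kept layer appears in layer_order).
    kept.sort(key=layer_order.index)
    return kept
```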
  • Step 212 inference is performed according to the optimized neural network model to obtain a model inference result.
  • the optimized neural network model reduces the connection operations in the inference process, reduces the memory occupied by repeated data during inference, and reduces the memory copy time, thereby improving the model inference speed.
  • the computer equipment can perform inference according to the optimized neural network model, and perform operations in sequence according to the arithmetic operations of the operation layers corresponding to the optimized neural network model to obtain the inferred data results.
  • for example, for an image recognition model, the computer device may sequentially perform operations on the input image according to the arithmetic operation sequence corresponding to the optimized neural network model, to obtain the image recognition result, as sketched below.
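  • as a toy sketch of this sequential execution (kernels, a mapping from operation names to callables implementing the arithmetic operations, is hypothetical and not part of the patent):

```python
def run_inference(ordered_layers, feed, kernels):
    """Execute the optimized model: visit each operation layer in sequence
    and apply its arithmetic operation to the values of its input tensors."""
    values = dict(feed)                  # tensor -> computed value
    for layer in ordered_layers:
        args = [values[t] for t in layer.inputs]
        values[layer.outputs[0]] = kernels[layer.op](*args)
    return values                        # includes the model inference result
```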
  • in the above method, a neural network model inference task is acquired, the neural network model corresponding to the model identifier in the task is obtained and parsed, and a computation graph corresponding to the neural network model is obtained, where the computation graph includes a connection layer; a pre-built data structure template is then obtained, the target sub-data structure corresponding to the computation graph is generated according to the template, the connection layer data to be optimized is determined in the computation graph according to the target sub-data structure and the connection layer, and the connection layer data to be optimized is optimized according to the target sub-data structure to obtain an optimized neural network model, after which inference is performed according to the optimized model to obtain the model inference result.
  • in this way, the data structure of the computation graph is optimized so that the target sub-data structure can correctly fuse the connection layers in all cases, yielding correct inference results while reducing the memory space occupied by the inference process, reducing memory copy time, and speeding up inference.
  • the step of generating the target sub-data structure corresponding to the computation graph according to the data structure template includes:
  • Step 702 traverse the operation layers in the calculation graph, and identify the connection layers in the operation layer.
  • Step 704 Generate an optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the calculation graph.
  • Step 706 Determine the target sub-data structure corresponding to the computation graph according to the optimized data structure of the output tensor corresponding to the connection layer.
  • the computation graph includes operation layers and tensors, the operation layers include connection layers, and the tensors can be input tensors or output tensors of the operation layers.
  • the computer device can traverse all the operation layers in the computation graph and identify the connection layers among them; when a connection layer is identified, the template item data of the output tensor corresponding to that connection layer is obtained from the computation graph according to the multiple template items in the data structure template, and the obtained template item data is added to the corresponding template items, so as to generate the optimized data structure of the output tensor corresponding to the connection layer, until the optimized data structure of the output tensor corresponding to the last connection layer has been generated.
  • the optimized data structure refers to the data structure obtained by optimizing the original data structure of the output tensor.
  • the computer device may determine the target sub-data structure corresponding to the computation graph according to the generated optimized data structure of the output tensor corresponding to the connection layer.
  • the optimized data structure of the output tensor corresponding to the connection layer directly determines the target sub-data structure corresponding to the computation graph.
  • the computer device can determine the target sub-data structure corresponding to the computation graph from the optimized data structure of the output tensor corresponding to the last connection layer.
  • the computer device can determine the optimized data structures of the output tensors corresponding to the multiple connection layers.
  • the target sub-data structure includes the optimized data structure of the output tensors corresponding to the multiple connection layers.
  • determining the target sub-data structure corresponding to the computation graph from the optimized data structures of the output tensors makes it possible to quickly obtain the data structure required to correctly fuse the connection layers in all complex situations.
  • the above method further includes: topologically sorting the operation layers in the computation graph to obtain a topological sequence; identifying in turn, according to the topological sequence, whether each operation layer is a connection layer; when it is not a connection layer, skipping the operation layer; and when it is a connection layer, generating the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph.
  • a computation graph can include multiple operation layers, and a computer device can topologically sort the multiple operation layers to obtain a topological sequence.
  • Topological sorting refers to arranging a sequence that satisfies the topological order according to the dependencies between the operation layers in the directed computational graph, and the sequence obtained by topological sorting is a one-dimensional linear sequence.
  • specifically, the computer device can first search the computation graph for an operation layer with an in-degree of 0, that is, one with no input edge, save that operation layer to the stack, and delete the operation layer and its associated directed edges from the computation graph, adjusting the in-degree of the operation layers at the ends of the deleted directed edges (for example, subtracting 1 from the in-degree); the steps of finding an operation layer whose in-degree is 0, deleting it, and adjusting in-degrees are repeated until all the operation layers in the computation graph have been saved to the stack, and all the operation layers in the stack are then output in turn according to their storage order, so as to obtain the topological sequence (see the sketch below).
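  • the procedure described above is essentially Kahn's algorithm; a compact sketch, where successors is an assumed mapping from each operation layer to the layers that consume its outputs:

```python
from collections import deque

def topological_sort(layers, successors):
    """Repeatedly take an operation layer with in-degree 0, save it, delete
    its outgoing directed edges, and adjust the in-degrees of the layers at
    the other ends of the deleted edges."""
    indegree = {layer: 0 for layer in layers}
    for layer in layers:
        for succ in successors.get(layer, ()):
            indegree[succ] += 1
    ready = deque(l for l in layers if indegree[l] == 0)  # no input edge
    sequence = []                        # the "stack" referred to in the text
    while ready:
        layer = ready.popleft()
        sequence.append(layer)
        for succ in successors.get(layer, ()):
            indegree[succ] -= 1          # subtract 1 from the in-degree
            if indegree[succ] == 0:
                ready.append(succ)
    return sequence                      # one-dimensional linear topological sequence
```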
  • the arrangement order among the operation layers in the topology sequence may be determined according to the storage sequence of the operation layers.
  • the computer device can access each operation layer according to the arrangement order of each operation layer in the topology sequence, and identify whether the arithmetic operation corresponding to each operation layer is a connection operation.
  • Arithmetic operations refer to the data processing operations performed by each operation layer.
  • when the operation layer is not a connection layer, the computer device can directly skip it without performing the data structure optimization step.
  • when the operation layer is a connection layer, the data structure optimization step needs to be performed.
  • the computer device generates an optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the calculation graph.
  • by proceeding in topological order, the computer device can ensure that when a certain operation layer is processed, the arithmetic operations corresponding to the preceding operation layers have already been handled, which improves the recognition accuracy of the connection layers and helps quickly obtain the optimized data structure of the output tensor corresponding to each connection layer.
  • the step of generating the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the calculation graph includes:
  • Step 802 Obtain the current connection layer, and identify whether there is an optimized input tensor in the input tensor corresponding to the current connection layer.
  • Step 804 When an optimized input tensor exists, obtain the optimized data structure of the optimized input tensor, generate the optimized data structure of the output tensor corresponding to the current connection layer according to the optimized data structure of the optimized input tensor, the computation graph, and the data structure template, update the next connection layer to be the current connection layer, and return to the step of identifying whether there is an optimized input tensor among the input tensors corresponding to the current connection layer, until the traversal is completed and the optimized data structures of the output tensors corresponding to the connection layers in the operation layers have been generated.
  • Step 806 when there is no optimized input tensor in the input tensor corresponding to the current connection layer, extract the template data corresponding to the output tensor of the current connection layer in the calculation graph according to the data structure template, and add the extracted template data to In the data structure template, the optimized data structure of the output tensor corresponding to the current connection layer is obtained.
  • the current connection layer refers to the currently accessed connection layer.
  • the optimized input tensor means that the production layer of the input tensor is a connection layer, and the data structure of the input tensor is an optimized data structure generated according to the data structure template and the computation graph.
  • the computer device obtains the current connection layer and identifies whether there is an optimized input tensor among the input tensors corresponding to it. When an optimized input tensor is identified, this indicates that there are continuous connection layers in the computation graph, and the optimized data structure of the output tensor corresponding to the current connection layer can be generated from the optimized data structure of the optimized input tensor. Specifically, since the data structure of the optimized input tensor is already an optimized data structure, the computer device can obtain it and combine it with the unoptimized input tensors corresponding to the current connection layer.
  • the computer device obtains, according to the data structure template, the template data corresponding to the output tensor of the current connection layer from the optimized data structure of the optimized input tensor and from the computation graph, and adds the obtained template data to the data structure template, thereby obtaining the optimized data structure of the output tensor corresponding to the current connection layer.
  • the optimized data structure may include the production layers, the demand layers, the shapes of the input tensors stored in the output tensor corresponding to the current connection layer, the index corresponding to each production layer, the index corresponding to each demand layer, and the size of the input tensor corresponding to each demand layer.
  • the computer device can continue to identify whether the next operation layer is a connection layer; when it is, that operation layer is taken as the next connection layer, the next connection layer is updated to be the current connection layer, and the step of identifying whether there is an optimized input tensor among the input tensors corresponding to the current connection layer is returned to.
  • the generation efficiency of the optimized data structure can be improved to speed up the fusion speed of the connection layer, thereby improving the model inference speed.
  • when there are continuous connection layers in the computation graph, that is, when optimized input tensors exist among the input tensors corresponding to the second and subsequent connection layers, the computer device can generate the optimized data structure of the output tensor corresponding to each connection layer in the above manner, and use the optimized data structure of the output tensor corresponding to the last connection layer as the target sub-data structure corresponding to the computation graph.
  • the computer device can directly extract template data corresponding to the output tensor of the current connection layer in the computation graph according to the data structure template.
  • the data structure template includes multiple template items, and the computer device can sequentially extract template item data corresponding to the output tensor of the current connection layer according to the multiple template items in the data structure template, and add the extracted template item data to the corresponding In the corresponding template item, the optimized data structure of the output tensor corresponding to the current connection layer is obtained.
  • An optimized data structure can be obtained that correctly fuses the connection layers.
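  • putting steps 802-806 together, the traversal might be sketched as follows (using the Layer/Tensor objects from the earlier sketch and plain dicts for the structures; the index and stride tables are omitted here and filled in separately, as sketched further below):

```python
def build_optimized_structures(topo_layers):
    """For each connection layer, in topological order, build the optimized
    data structure of its output tensor, reusing the structure of any input
    tensor produced by an already-optimized connection layer."""
    optimized = {}   # output tensor -> optimized data structure (a dict)
    for layer in topo_layers:
        if layer.op != "concat":
            continue                      # skip non-connection operation layers
        struct = {"sub_producer_layers": [], "sub_consumer_layers": [],
                  "sub_shapes": []}
        for t in layer.inputs:
            if t in optimized:            # continuous connection layers: merge
                prev = optimized[t]       # the optimized structure of t
                for key in struct:
                    struct[key] = struct[key] + prev[key]
            else:                         # unoptimized input tensor
                struct["sub_producer_layers"].append(t.producer)
                struct["sub_shapes"].append(t.shape)
            # demand layers of t other than this connection layer are kept so
            # they can still locate their original inputs after fusion
            struct["sub_consumer_layers"].append(
                [c for c in t.consumers if c is not layer])
        optimized[layer.outputs[0]] = struct
    return optimized  # the last connection layer's entry is the target
                      # sub-data structure of the computation graph
```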
  • extracting the template data corresponding to the output tensor of the current connection layer from the computation graph according to the data structure template, and adding the extracted template data to the data structure template to obtain the optimized data structure of the output tensor corresponding to the current connection layer, includes: sequentially extracting the production layer, demand layer, and dimension data corresponding to each input tensor from the computation graph according to the data structure template; adding the extracted production layers, demand layers, and dimension data to the data structure template; establishing the production layer index corresponding to each production layer in the production layer index table of the data structure template; establishing the demand layer index corresponding to each demand layer in the demand layer data table of the data structure template, and counting the size of the input tensor corresponding to each demand layer; and attaching the demand layers other than the connection layer among the demand layers corresponding to the input tensors to the demand layer data table, to obtain the optimized data structure of the output tensor corresponding to the current connection layer.
  • the template data corresponding to the output tensor of the current connection layer includes the production layers, the demand layers, the shape of the output tensor of the current connection layer, the index corresponding to each production layer, the input tensor index corresponding to each demand layer, and the size of the input tensor corresponding to each demand layer.
  • during connection layer fusion, the input tensors of the connection layer are deleted, yet their information still needs to be saved, so it can only be saved in the output tensor of the connection layer.
  • the computer device extracts, from the computation graph, the production layer corresponding to each input tensor of the connection layer as a production layer of the output tensor of the current connection layer.
  • the demand layers corresponding to each input tensor of the connection layer are extracted from the computation graph as demand layers of the output tensor of the current connection layer, and the shape corresponding to each input tensor of the connection layer is extracted from the computation graph as part of the shape of the output tensor of the current connection layer.
  • the computer device may add the extracted template data to the corresponding template item of the data structure template after extracting the template data.
  • the computer device can also establish a production layer index corresponding to each production layer in the production layer index table of the data structure template. The production layer index is used to distinguish multiple production layers and indicates the position of each production layer in the computation graph, which can ensure the accuracy of the position of each production layer in the optimized neural network model.
  • the computer device establishes a requirement layer index corresponding to the requirement layer in the requirement layer data table of the data structure template, and counts the size of the input tensor corresponding to each requirement layer.
  • the demand layer index corresponding to each demand layer is used by that demand layer to determine where its own input tensor lies in the output tensor, indicating the position of each input tensor within the output tensor; after the connection layer is fused, the input tensor corresponding to each demand layer can conveniently be found according to that position.
  • the size of the input tensor corresponding to each requirement layer can be used to represent the number of basic tensors required by each requirement layer.
  • the computer device attaches, in the demand layer data table, the demand layers other than the connection layer among the demand layers corresponding to each input tensor of the connection layer; the attached demand layers are the operation layers that do not participate in the connection operation. This avoids the problem of the output tensor being used directly as the input tensor of an operation layer that does not participate in the connection operation after the connection layer is optimized, improving the accuracy of connection layer fusion (see the sketch below).
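  • the index and stride bookkeeping might then be sketched as follows, continuing the dict-based structure above and assuming, as in the FIG. 6 example, that each preserved input tensor occupies one unit (for larger inputs the stride would be the tensor's actual size):

```python
def build_index_tables(struct):
    """Fill the production layer index table and the demand layer data table
    (index plus stride) of an optimized data structure whose list fields are
    already populated."""
    struct["sub_producer_index_table"] = {
        producer: i
        for i, producer in enumerate(struct["sub_producer_layers"])}
    index_table, stride_table = {}, {}
    for i, demand_layers in enumerate(struct["sub_consumer_layers"]):
        for demand_layer in demand_layers:
            # position of this demand layer's input tensor inside the fused
            # output tensor, and that input tensor's size in units
            index_table[demand_layer] = i
            stride_table[demand_layer] = 1
    struct["sub_consumer_index_table"] = index_table
    struct["sub_consumer_stride_table"] = stride_table
    return struct
```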
  • Layer1-layer7 are operation layers
  • layer3, layer5, and layer7 are connection layers.
  • the consumer layer pointed to by the dotted arrows of tensor1, tensor3, and tensor5 refers to the demand layer other than the connection layer corresponding to each tensor.
  • the computer device can traverse each operation layer in the order of layer1-layer7. When layer1 and layer2 are identified, if it is identified that they are not connected layers, layer1 and layer2 are skipped.
  • when layer3 is identified as a connection layer and the two input tensors tensor1 and tensor2 of layer3 are both unoptimized input tensors, the steps to generate the optimized data structure of tensor3 can be shown in the following A1-F1:
  • in step B1, tensor1's consumer layers represent the demand layers corresponding to tensor1, and tensor2's consumer layers represent the demand layers corresponding to tensor2.
  • the table in step E1 refers to the combined table of sub_consumer_index_table and sub_consumer_stride_table, which can be called the demand layer table.
  • tensor1's consumers represent the demand layer corresponding to tensor1.
  • the index of tensor1's consumers is 0, indicating that the demand layer corresponding to tensor1 can find the input tensor tensor1 at the 0th position of the output tensor.
  • the stride of tensor1's consumers is 1, indicating that the size of tensor1 is 1 unit.
  • the index of tensor2's consumers is 1, it means that the demand layer corresponding to tensor2 can find the input tensor tensor2 in the first position of the output tensor, and the stride of tensor2's consumers is 1, which means that the size of tensor2 is 1 unit.
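  • written out as data, the demand layer table of step E1 reads:

```python
# Demand layer table for the fusion of layer3: sub_consumer_index_table and
# sub_consumer_stride_table combined, as in step E1.
demand_layer_table = {
    "tensor1's consumers": {"index": 0, "stride": 1},  # tensor1 found at
                                                       # position 0, size 1 unit
    "tensor2's consumers": {"index": 1, "stride": 1},  # tensor2 found at
                                                       # position 1, size 1 unit
}
```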
  • the computer device continues to recognize layer4, and if it recognizes that it is not a connection layer, skips layer4.
  • when layer5 is identified as a connection layer and tensor3 among the input tensors of layer5 is an optimized input tensor, the steps to generate the optimized data structure of tensor5 can be shown in the following A2-F2:
  • C2.sub_consumer_layers [tensor1’s consumer layers, tensor2’s consumer layers, tensor3’s consumer layers, tensor4’s consumer layers]
  • D2.sub_shapes [tensor1's shape, tensor2's shape, tensor4's shape]
  • the computer device continues to recognize layer6, and if it recognizes that it is not a connection layer, skips layer6.
  • when layer7 is identified as a connection layer and tensor5 among the input tensors of layer7 is an optimized input tensor, the steps to generate the optimized data structure of tensor7 can be shown in the following A3-F3:
  • C3.sub_consumer_layers [tensor1’s consumer layers, tensor2’s consumer layers, tensor3’s consumer layers, tensor4’s consumer layers, tensor5’s consumer layers, tensor6’s consumer layers]
  • D3.sub_shapes [tensor1’s shape, tensor2’s shape, tensor4’s shape, tensor6’s shape]
  • the computer device uses the obtained optimized data structure of tensor7 as the target sub-data structure corresponding to the calculation graph.
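  • written out from steps C3 and D3, the resulting target sub-data structure of the FIG. 5 graph is:

```python
# Target sub-data structure of the computation graph of FIG. 5: the optimized
# data structure of tensor7 (only the items listed in C3/D3 are shown).
tensor7_struct = {
    "sub_consumer_layers": ["tensor1's consumer layers", "tensor2's consumer layers",
                            "tensor3's consumer layers", "tensor4's consumer layers",
                            "tensor5's consumer layers", "tensor6's consumer layers"],
    "sub_shapes": ["tensor1's shape", "tensor2's shape",
                   "tensor4's shape", "tensor6's shape"],
}
```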
  • determining the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer includes: determining the connection layer to be optimized in the connection layer according to the target sub-data structure; obtaining the connection layer to be optimized in the computation graph The input tensor corresponding to each connection layer in the connection layer; the connection layer to be optimized and the obtained input tensor are used as the connection layer data to be optimized.
  • the computer device determines the connection layer to be optimized in the connection layer according to the target sub-data structure, and the connection layer to be optimized is the connection layer to be deleted. Since the connection layer connects the input tensors of the connection layer, the data stored in the output tensor of the connection layer is exactly the same as the data stored in the input tensor, and the connection layer and the corresponding input tensors can be deleted. Therefore, the computer device obtains the input tensor corresponding to each connection layer in the connection layer to be optimized in the calculation graph, and uses the connection layer to be optimized and the obtained input tensor as the connection layer data to be optimized.
  • for example, the computer device can determine the connection layers to be optimized as layer3, layer5 and layer7 according to the optimized data structure of tensor7, obtain the input tensors corresponding to layer3, layer5 and layer7 in the computation graph, namely tensor1 to tensor6, take layer3, layer5, layer7 and tensor1 to tensor6 as the connection layer data to be optimized, and delete the connection layer data to be optimized.
  • since the obtained target sub-data structure is a data structure that can correctly fuse the connection layers, determining the connection layer data to be optimized according to the target sub-data structure improves the accuracy of that data, so that the connection layers are fused correctly, and the model inference speed is also improved after fusion.
  • a neural network model inference apparatus is provided, including: a task acquisition module 902, a model parsing module 904, a structure generation module 906, a data determination module 908, a model optimization module 910, and a model inference module 912, where:
  • the task acquisition module 902 is configured to acquire a neural network model inference task, where the neural network model inference task includes a model identifier.
  • the model parsing module 904 is configured to obtain a neural network model corresponding to the model identifier, analyze the neural network model, and obtain a computation graph corresponding to the neural network model, and the computation graph includes a connection layer.
  • the structure generation module 906 is configured to obtain a pre-built data structure template, and generate a target sub-data structure corresponding to the computation graph according to the data structure template.
  • the data determination module 908 is configured to determine the connection layer data to be optimized in the calculation graph according to the target sub-data structure and the connection layer.
  • the model optimization module 910 is configured to perform optimization processing on the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model.
  • the model inference module 912 is configured to perform inference according to the optimized neural network model to obtain a model inference result.
  • the computation graph includes operation layers and tensors, the operation layers include a connection layer, and each tensor is an input tensor or an output tensor of an operation layer.
  • the above-mentioned apparatus further includes: an identification module configured to topologically sort the operation layers in the computation graph to obtain a topological sequence; sequentially identify, according to the topological sequence, whether each operation layer is a connection layer; skip the operation layer when it is not a connection layer; and, when it is a connection layer, generate the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph.
  • the structure generation module 906 is further configured to obtain the current connection layer and identify whether there is an optimized input tensor among the input tensors corresponding to the current connection layer; if there is, obtain the optimized data structure of the optimized input tensor, generate the optimized data structure of the output tensor corresponding to the current connection layer according to the optimized data structure of the optimized input tensor, the computation graph, and the data structure template, update the next connection layer to be the current connection layer, and return to the step of identifying whether there is an optimized input tensor among the input tensors corresponding to the current connection layer, until the optimized data structures of the output tensors corresponding to all connection layers in the operation layers are generated.
  • the structure generation module 906 is further configured to, when there is no optimized input tensor among the input tensors corresponding to the current connection layer, extract the template data corresponding to the output tensor of the current connection layer from the computation graph according to the data structure template, and correspondingly add the extracted template data to the data structure template to obtain the optimized data structure of the output tensor corresponding to the current connection layer.
  • the structure generation module 906 is further configured to sequentially extract the production layer, the demand layer and the shape corresponding to each input tensor in the calculation graph according to the data structure template; add the extracted production layer, the demand layer and the shape to the data structure template; establish the production layer index corresponding to the production layer in the production layer index table of the data structure template; establish the demand layer index corresponding to the demand layer in the demand layer data table of the data structure template, and count each demand layer The size of the corresponding input tensor; connect the demand layers other than the connection layer in the demand layer corresponding to the input tensor to the demand layer data table to obtain the optimized data structure of the output tensor corresponding to the current connection layer.
  • the data determination module 908 is further configured to determine the connection layer to be optimized in the connection layer according to the target sub-data structure; obtain the input tensor corresponding to each connection layer in the connection layer to be optimized in the calculation graph ; Use the connection layer to be optimized and the obtained input tensor as the connection layer data to be optimized.
  • the model optimization module 910 is further configured to delete the connection layer data to be optimized to obtain a calculation graph after deletion; and connect the deleted calculation graphs according to the target data structure to obtain an optimized neural network model.
  • Each module in the above-mentioned neural network model inference apparatus may be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • in one of the embodiments, a computer device is provided; the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
  • the computer device includes a processor, memory, a communication interface, and a database connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store the data of the neural network model inference method.
  • the communication interface of the computer device is used to connect and communicate with an external terminal.
  • the computer-readable instructions, when executed by a processor, implement a neural network model inference method.
  • FIG. 10 is only a block diagram of a partial structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps in the above method embodiments.
  • One or more computer storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the steps in each of the foregoing method embodiments.
  • the computer storage medium is a readable storage medium, and the readable storage medium may be non-volatile or volatile.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Abstract

A neural network model inference method, comprising: obtaining a neural network model inference task, the neural network model inference task comprising a model identifier (202); obtaining a neural network model corresponding to the model identifier, and parsing the neural network model to obtain a computation graph corresponding to the neural network model, the computation graph comprising a connection layer (204); obtaining a pre-constructed data structure template, and generating, according to the data structure template, a target sub-data structure corresponding to the computation graph (206); determining, in the computation graph, connection layer data to be optimized according to the target sub-data structure and the connection layer (208); optimizing the connection layer data to be optimized according to the target sub-data structure, so as to obtain an optimized neural network model (210); and performing inference according to the optimized neural network model to obtain a model inference result (212).

Description

Neural network model inference method, apparatus, computer device, and storage medium

Technical Field
The present application relates to a neural network model inference method, apparatus, computer device, and storage medium.
Background Art
Inference of a neural network model refers to deploying a pre-trained neural network model into actual business scenarios, such as image classification, object detection, or online translation; that is, the process of inputting data into the neural network model and obtaining output data through the model. As the network structure of neural network models becomes more complex, inference takes more time. To improve inference speed, the traditional approach is to fuse the connection layer: the input tensors corresponding to the connection layer are written directly into the output tensor of the connection layer, and the input tensors and the connection layer are deleted, thereby reducing memory occupation and memory copy time.
However, the inventor realized that the traditional approach is only suitable for simple connection layer structures. As the network structure of a neural network model becomes more complex, the connection operations of the connection layer also become complicated, and performing connection layer fusion in the traditional way leads to errors in the inference results. Therefore, how to obtain correct inference results while improving inference speed has become a technical problem that needs to be solved.
SUMMARY OF THE INVENTION
According to various embodiments disclosed in the present application, a neural network model inference method, apparatus, computer device, and storage medium are provided.
A neural network model inference method, comprising:
obtaining a neural network model inference task, the neural network model inference task including a model identifier;
obtaining a neural network model corresponding to the model identifier, parsing the neural network model, and obtaining a computation graph corresponding to the neural network model, the computation graph including a connection layer;
obtaining a pre-constructed data structure template, and generating a target sub-data structure corresponding to the computation graph according to the data structure template;
determining connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer;
optimizing the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
performing inference according to the optimized neural network model to obtain a model inference result.
A neural network model inference apparatus, comprising:
a task acquisition module, configured to acquire a neural network model inference task, the neural network model inference task including a model identifier;
a model parsing module, configured to obtain a neural network model corresponding to the model identifier, parse the neural network model, and obtain a computation graph corresponding to the neural network model, the computation graph including a connection layer;
a structure generation module, configured to obtain a pre-constructed data structure template and generate a target sub-data structure corresponding to the computation graph according to the data structure template;
a data determination module, configured to determine connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer;
a model optimization module, configured to optimize the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
a model inference module, configured to perform inference according to the optimized neural network model to obtain a model inference result.
A computer device comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
obtaining a neural network model inference task, the neural network model inference task including a model identifier;
obtaining a neural network model corresponding to the model identifier, parsing the neural network model, and obtaining a computation graph corresponding to the neural network model, the computation graph including a connection layer;
obtaining a pre-constructed data structure template, and generating a target sub-data structure corresponding to the computation graph according to the data structure template;
determining connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer;
optimizing the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
performing inference according to the optimized neural network model to obtain a model inference result.
One or more computer storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a neural network model inference task, the neural network model inference task including a model identifier;
obtaining a neural network model corresponding to the model identifier, parsing the neural network model, and obtaining a computation graph corresponding to the neural network model, the computation graph including a connection layer;
obtaining a pre-constructed data structure template, and generating a target sub-data structure corresponding to the computation graph according to the data structure template;
determining connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer;
optimizing the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
performing inference according to the optimized neural network model to obtain a model inference result.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Description of the Drawings
FIG. 1 is an application environment diagram of a neural network model inference method in one or more embodiments.
FIG. 2 is a schematic flowchart of a neural network model inference method in one or more embodiments.
FIG. 3 is a schematic diagram of a computation graph in one or more embodiments.
FIG. 4 is a schematic diagram of a computation graph obtained by performing connection layer fusion on the computation graph shown in FIG. 3 in one or more embodiments.
FIG. 5 is a schematic diagram of a computation graph including complex connection operations in one or more embodiments.
FIG. 6 is a schematic diagram of a data structure template in one or more embodiments.
FIG. 7 is a schematic flowchart of the step of generating a target sub-data structure corresponding to a computation graph according to a data structure template in one or more embodiments.
FIG. 8 is a schematic flowchart of the step of generating an optimized data structure of the output tensor corresponding to a connection layer according to a data structure template and a computation graph in one or more embodiments.
FIG. 9 is a block diagram of a neural network model inference apparatus in one or more embodiments.
FIG. 10 is a block diagram of a computer device in one or more embodiments.
Detailed Description of the Embodiments
The neural network model inference method provided by the present application can be applied to a computer device, which may be a terminal or a server. It can be understood that the method can be applied to a terminal, to a server, or to a system including a terminal and a server and implemented through interaction between the terminal and the server.
The neural network model inference method provided by the present application can be applied to the application environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 through a network. The terminal 102 can acquire a model inference task carrying a model identifier, obtain the neural network model corresponding to the model identifier, and parse the neural network model to obtain the computation graph corresponding to it, the computation graph including a connection layer. The terminal then obtains a pre-constructed data structure template, generates the target sub-data structure corresponding to the computation graph according to the data structure template, determines the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer, optimizes the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model, and performs inference according to the optimized neural network model to obtain a model inference result. The terminal 102 may specifically include, but is not limited to, various personal computers, notebook computers, smart phones, and tablet computers. The server 104 may be implemented by an independent server or by a server cluster composed of multiple servers.
It can be understood that the neural network model inference method provided by the present application performs inference on a neural network model and can be applied in a variety of application environments, and the neural network model can be of various types. For example, it may include a convolutional neural network model, a recurrent neural network model, a recursive neural network model, and the like. Neural network models can be used to process many different kinds of data; for example, a neural network model may specifically include an image recognition model, a feature extraction model, a speech recognition model, a text recognition model, a scene classification model, and the like.
In one embodiment, the inference method of the neural network model provided by the present application can be specifically applied in the field of autonomous driving, where the neural network model may specifically include at least one of an image recognition model, a trajectory prediction model, and the like.
In one embodiment, the inference method of the neural network model provided by the present application can be specifically applied in the text field, where the neural network model may specifically include at least one of an image recognition model, a behavior prediction model, a risk assessment model, and the like.
In one embodiment, as shown in FIG. 2, a neural network model inference method is provided. The method is described by taking its application to a computer device as an example; the computer device may be the terminal or the server in FIG. 1. The method includes the following steps:
Step 202: acquire a model inference task, the model inference task carrying a model identifier.
Model inference refers to the process of operating on the data input into a neural network model according to the order of the model's network structure, sequentially applying the arithmetic operations corresponding to the multiple network layers included in that structure, so as to obtain the inference result output by the neural network model. The model inference task is used to instruct the computer device to perform inference on the corresponding neural network model. The computer device may be a terminal or a server. The model identifier is a unique identifier that marks a neural network model and is used to distinguish between neural network models.
When model inference needs to be performed, the computer device can acquire the model inference task and parse it to obtain the model identifier carried in the task. Specifically, when a user needs to perform model inference, the computer device can determine the neural network model specified by the user according to the received user operation instruction and generate a model inference task carrying the model identifier. The computer device can also determine the neural network model that needs to be called according to actual operating requirements and generate the model inference task. For example, in the process of image recognition, when an image recognition model needs to be called to recognize an image, the computer device can generate a model inference task, input the image, perform inference on the image recognition model according to the task, and obtain the image recognition result output by the model.
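As a concrete reading of this step, the sketch below shows one way a model inference task carrying a model identifier could be represented and resolved against a model registry. All names here (InferenceTask, MODEL_REGISTRY, resolve_model) are illustrative assumptions, not part of the disclosed embodiment.

```python
from dataclasses import dataclass

# Hypothetical registry mapping model identifiers to pre-stored models (assumption).
MODEL_REGISTRY: dict = {}

@dataclass
class InferenceTask:
    """A model inference task; it carries the model identifier."""
    model_id: str

def resolve_model(task: InferenceTask):
    """Look up the neural network model corresponding to the model identifier."""
    model = MODEL_REGISTRY.get(task.model_id)
    if model is None:
        raise KeyError(f"no model registered under id {task.model_id!r}")
    return model
```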
In one embodiment, the computer device may store an inference engine in advance; the computer device may execute the model inference task through the inference engine and perform inference on the neural network model corresponding to the model identifier. The inference engine refers to the functional module in the computer device that is used to complete inference.
步骤204,获取模型标识对应的神经网络模型,对神经网络模型进行解析,得到神经网络模型对应的计算图,计算图中包括连接层。Step 204: Obtain a neural network model corresponding to the model identifier, analyze the neural network model, and obtain a computation graph corresponding to the neural network model, and the computation graph includes a connection layer.
Neural network models are pre-stored in the computer device; each neural network model is obtained by training on a large amount of sample data, so that the corresponding neural network model can be obtained according to the model identifier. The neural network model corresponding to the model identifier may include at least one of various types of neural network models. For example, depending on the network structure, it may specifically include at least one of a convolutional neural network model (CNN), a recurrent neural network model (RNN), a recursive neural network model, and the like. Depending on its function, the neural network model may specifically include at least one of an image recognition model, a feature extraction model, a speech recognition model, a text recognition model, a scene classification model, and the like.
The computer device parses the acquired neural network model to obtain the computation graph corresponding to it. The computation graph can be an abstraction of the neural network model during inference and can include multiple operation layers, the tensors corresponding to each operation layer, and directed edges between the operation layers and their corresponding tensors. An operation layer represents a network layer in the network structure of the neural network model and determines the arithmetic operation performed by that layer, for example, a convolution operation, a fully connected operation, or a connection operation. A tensor is a data structure that can be understood as a vector or an array matrix: the shape of a tensor can be expressed in dimensions, a one-dimensional tensor can be called a vector, and a tensor of two or more dimensions can be called an array matrix. Tensors include input tensors and output tensors; an input tensor represents the input data of an operation layer, and an output tensor represents its output data. Each operation layer can include multiple pointers to different input tensors and multiple pointers to different output tensors, and each tensor can include one pointer to its production layer and multiple pointers to different demand layers. The production layer refers to the operation layer through which the tensor is computed, and the demand layers refer to the operation layers for which the tensor serves as an input tensor. The directed edges between the operation layers and the corresponding tensors in the computation graph can be generated through the pointers in the operation layers and the pointers in the tensors.
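To make the pointer structure just described concrete, the following is a minimal Python sketch of such a graph representation: layers hold pointers to their input and output tensors, and each tensor holds one pointer to its production layer and a list of pointers to its demand layers. The class and field names are assumptions chosen for illustration, not the names used by any particular inference engine.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tensor:
    name: str
    shape: List[int]                          # e.g. [1, 3, 40, 40]
    producer: Optional[Layer] = None          # the production layer that computes this tensor
    consumers: List[Layer] = field(default_factory=list)  # the demand layers that read it

@dataclass
class Layer:
    name: str
    op_type: str                              # e.g. "conv", "fc", "concat"
    inputs: List[Tensor] = field(default_factory=list)
    outputs: List[Tensor] = field(default_factory=list)

def connect(layer: Layer, inputs: List[Tensor], output: Tensor) -> None:
    """Create the directed edges between an operation layer and its tensors."""
    layer.inputs = list(inputs)
    layer.outputs = [output]
    output.producer = layer
    for t in inputs:
        t.consumers.append(layer)
```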
The computer device can refer to the operation layer corresponding to a connection operation as a connection layer. The connection layer is used to splice the multiple input tensors it receives; the data stored in the output tensor of the connection layer is exactly the same as the data stored in the multiple input tensors.
In one embodiment, a schematic diagram of a computation graph can be as shown in FIG. 3. The computation graph includes operation layers layer1, layer2, and layer3, and the input and output tensors corresponding to each operation layer. The output tensor tensor1 of layer1 and the output tensor tensor2 of layer2 serve as the inputs of layer3; layer3 connects tensor1 and tensor2 to obtain the output tensor tensor3. The Shape field in tensor1, tensor2, and tensor3 represents the shape of the tensor, expressed in dimensions. For example, Shape: 1×3×40×40 in tensor1 denotes, from left to right, the 0th, 1st, 2nd, and 3rd dimensions. The concat axis=1 annotation in layer3 indicates that tensor1 and tensor2 are connected along the 1st dimension.
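The connection operation of layer3 in FIG. 3 can be reproduced in a few lines of NumPy. The text does not state the shape of tensor2, so a shape of 1×2×40×40 is assumed purely for illustration; the only requirement is that all dimensions other than the concatenation axis match.

```python
import numpy as np

tensor1 = np.zeros((1, 3, 40, 40))   # Shape: 1x3x40x40, as given in FIG. 3
tensor2 = np.zeros((1, 2, 40, 40))   # shape assumed for illustration

# concat axis = 1: splice the two tensors along the 1st dimension.
tensor3 = np.concatenate([tensor1, tensor2], axis=1)
print(tensor3.shape)                 # (1, 5, 40, 40)
```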
Step 206: obtain a pre-constructed data structure template, and generate the target sub-data structure corresponding to the computation graph according to the data structure template.
The pre-constructed data structure template refers to the data structure template required to optimize the computation graph; it is a new data structure template built on the basis of the data structure of the traditional computation graph. The target sub-data structure refers to the data structure of the optimized tensors in the computation graph.
A neural network model includes a large number of operation layers, among which there may be a connection layer. The connection layer is used to splice the multiple input tensors it receives, and the data stored in its output tensor is exactly the same as the data stored in those input tensors. During inference, the connection operation therefore causes duplicate data to be stored in memory and incurs unnecessary memory copy time. As the amount of data grows, the connection operation occupies more memory space and the memory copy time grows longer, so the inference time of the neural network model increases and the inference speed decreases, which is unfavorable for data processing with high real-time requirements. For example, in the field of autonomous driving, data processing results must be obtained quickly through neural network model inference. To speed up inference, the traditional approach fuses the connection layer and deletes the duplicate data, reducing memory occupation and memory copy time. However, the traditional approach is only suitable for simple connection operations, for example when all input tensors of the connection layer are fed into the connection layer for connection. When the connection operations in the network structure of the neural network model are more complex, errors arise: if a tensor serves as the input of multiple operation layers and some of those layers do not participate in the connection operation, fusing the connection layer in the traditional way yields wrong inference results; if there are continuous connections, traditional fusion also cannot determine the positions, within the output tensor of the last connection layer, of the tensors that were optimized away, so model inference cannot proceed; and if the output tensor of one operation layer must be connected with the output tensors of multiple operation layers at the same time, that is, it needs to be input into multiple connection layers, traditional fusion cannot store that output tensor in multiple connection layers simultaneously, which leads to errors in the model inference results.
Therefore, the computer device constructs and stores a new data structure template in advance. When model inference is required, it obtains the pre-constructed data structure template and optimizes the data structure of the computation graph according to it, so that the connection layers can be fused accurately in all cases, correct inference results are guaranteed, the memory space occupied by the inference process is reduced, memory copy time is reduced, and inference is accelerated.
In one embodiment, a schematic diagram of the computation graph obtained by fusing the connection layer of the computation graph shown in FIG. 3 in the traditional way can be as shown in FIG. 4. Since the content stored in tensor3, the output of the connection layer, is exactly the same as the content of tensor1 and tensor2, the connection operation can be regarded as a memory copy operation. By writing the output results of layer1 and layer2 directly into tensor3 and deleting tensor1, tensor2, and layer3, two fewer blocks of memory (tensor1 and tensor2) need to be allocated and one memory copy operation in layer3 is saved.
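The effect of this traditional fusion can be illustrated with NumPy views: tensor3 is allocated once, and each production layer writes into its slice of tensor3 directly, so the copy performed by layer3 disappears. The sketch uses the shapes assumed above and dummy computations; it illustrates the traditional fusion being discussed, not the method of the present application.

```python
import numpy as np

# Allocate tensor3 once; tensor1 and tensor2 are never materialized.
tensor3 = np.empty((1, 5, 40, 40))

# layer1 writes directly into channels 0..2 of tensor3, layer2 into
# channels 3..4, replacing the memory copy previously done by layer3.
out1 = tensor3[:, 0:3]   # view standing in for layer1's output buffer
out2 = tensor3[:, 3:5]   # view standing in for layer2's output buffer
out1[...] = 1.0          # dummy stand-in for layer1's computation
out2[...] = 2.0          # dummy stand-in for layer2's computation

assert (tensor3[:, :3] == 1.0).all() and (tensor3[:, 3:] == 2.0).all()
```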
In one embodiment, a schematic diagram of a computation graph containing complex connection operations can be as shown in FIG. 5. layer1 through layer7 are all operation layers, and layer3, layer5, and layer7 are connection layers. The consumer layers pointed to by the dotted arrows from tensor1, tensor3, and tensor5 refer to the demand layers, other than the connection layers, corresponding to each tensor.
In one embodiment, the pre-constructed data structure template may be the output structure template of the output tensor corresponding to a connection layer in the computation graph. A schematic diagram of the data structure template can be as shown in FIG. 6, which includes multiple template items: sub_producer_layers, sub_consumer_layers, sub_shapes, sub_producer_index_table, sub_consumer_index_table, and sub_consumer_stride_table. sub_producer_layers stores the production layers of the output tensor corresponding to the connection layer, including the production layers of the input tensors that need to be stored; sub_consumer_layers stores the demand layers of that output tensor, including the demand layers of the input tensors that need to be stored; sub_shapes stores the shapes of that output tensor, including the shapes of the input tensors that need to be stored; sub_producer_index_table stores the index corresponding to each production layer in sub_producer_layers; sub_consumer_index_table stores the index of the input tensor corresponding to each demand layer in sub_consumer_layers; and sub_consumer_stride_table stores the size of the input tensor corresponding to each demand layer in sub_consumer_layers.
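The six template items enumerated above map naturally onto a record type. A minimal Python sketch keeping the field names of FIG. 6 might look as follows; the concrete container types (lists, and dictionaries keyed by layer name) are assumptions about details the text leaves open.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SubDataStructure:
    # production layers whose outputs are stored in the fused output tensor
    sub_producer_layers: List["Layer"] = field(default_factory=list)
    # demand layers that read some stored input tensor from the output tensor
    sub_consumer_layers: List["Layer"] = field(default_factory=list)
    # shape of each stored input tensor
    sub_shapes: List[List[int]] = field(default_factory=list)
    # index of each production layer, keyed by layer name
    sub_producer_index_table: Dict[str, int] = field(default_factory=dict)
    # index of the input tensor needed by each demand layer, keyed by layer name
    sub_consumer_index_table: Dict[str, int] = field(default_factory=dict)
    # size of the input tensor corresponding to each demand layer
    sub_consumer_stride_table: Dict[str, int] = field(default_factory=dict)
```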
In the data structure of the acquired computation graph, each operation layer includes multiple pointers to different input tensors and multiple pointers to different output tensors. When the connection operations in the network structure of the neural network model are complex, for example when a tensor serves as the input of multiple operation layers and the operation layers other than the connection layer do not participate in the connection operation, those other operation layers lose their original input tensor after the connection layer is optimized; the output tensor of the connection layer would be used as the output tensor of the operation layers other than the connection layer, resulting in errors in the output results of those layers. Further, when the computation graph is as shown in FIG. 5, there are multiple continuous connection layers, and fusing the connection layers in the traditional way not only produces wrong outputs but also makes it impossible to determine the positions of tensor1, tensor2, tensor4, and tensor6 within the fused tensor7, so the deleted connection layers layer3, layer5, and layer7, as well as the consumer layers, cannot determine their corresponding input tensors, resulting in erroneous inference results.
The computer device can optimize the data structure of the computation graph according to the pre-constructed data structure template described above. Specifically, the computer device can obtain, according to the data structure template, the original data structure of the output tensor corresponding to each connection layer in the computation graph. The original data structure can include the output tensor itself, one pointer to its production layer, and multiple pointers to different demand layers. The computer device can optimize the original data structure according to the data structure template, filling in the template according to the computation graph and the output tensor corresponding to each connection layer, thereby obtaining the optimized data structure of the output tensor corresponding to each connection layer in the computation graph (the optimized data structure being the data structure after optimization), and determining the target sub-data structure corresponding to the computation graph from the generated optimized data structures. The target sub-data structure can include the production layers, demand layers, and shapes of the output tensor corresponding to a connection layer, the index corresponding to each production layer, the input tensor index corresponding to each demand layer, and the size of the input tensor corresponding to each demand layer. There may be one connection layer or multiple connection layers. The production layers and demand layers of the output tensor are used to determine the original input tensors of the demand layers that do not participate in the connection operation, ensuring correct output results; the shape of the output tensor is used to ensure the correctness of the connection operation performed by the connection layer; and the index and size of the input tensor corresponding to each demand layer indicate the position of that input tensor within the output tensor, so that, when continuous connection layers occur, the input tensor corresponding to each demand layer can be located from that position.
Step 208: determine the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layer.
Step 210: optimize the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model.
The target sub-data structure refers to an optimized data structure capable of correctly fusing the connection layers.
The computer device can determine the connection layer data to be optimized in the computation graph according to the target sub-data structure and the connection layers in the computation graph. The connection layer data to be optimized can include the connection layers in the computation graph and the input tensors corresponding to those connection layers. The computer device can optimize the connection layer data to be optimized according to the target sub-data structure; the optimization can consist of fusing the connection layers, that is, deleting the connection layer data to be optimized.
In one embodiment, optimizing the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model includes: deleting the connection layer data to be optimized to obtain a computation graph after deletion, and connecting the computation graph after deletion according to the target data structure to obtain the optimized neural network model. After deleting the connection layer data to be optimized, the computer device can connect the remaining computation graph in sequence according to the operation layer order in the target data structure, thereby obtaining an optimized computation graph and, from it, the optimized neural network model. By deleting the connection layer data to be optimized and reconnecting the deleted computation graph, the memory space occupied by the inference process is reduced, memory copy time is reduced, and inference is accelerated.
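A rough sketch of this deletion-and-reconnection step, reusing the Layer/Tensor and SubDataStructure sketches above, is given below. How the fused buffer is allocated and how each demand layer locates its slice are engine-specific details; here the producers and remaining consumers are simply rewired to the fused output tensor.

```python
def fuse_connection_layers(layers, target, fused_output):
    """Delete the connection layer data to be optimized, then reconnect.

    `layers` is the operation-layer list, `target` a SubDataStructure, and
    `fused_output` the Tensor kept as the fused output buffer.
    """
    kept = []
    for layer in layers:
        if layer.op_type == "concat":
            continue  # the connection layer and its input tensors are dropped
        kept.append(layer)
    # Every recorded production layer now writes straight into the fused tensor.
    for producer in target.sub_producer_layers:
        producer.outputs = [fused_output]
    # Demand layers other than the connection layer read their slice of the
    # fused tensor; the offset comes from the index and stride tables.
    for consumer in target.sub_consumer_layers:
        consumer.inputs = [fused_output]
    return kept
```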
Step 212: perform inference according to the optimized neural network model to obtain a model inference result.
After the computer device obtains the optimized neural network model, the optimized model, compared with the traditional neural network model, reduces the connection operations during inference, reduces the memory occupied by duplicate data, and reduces memory copy time, thereby improving model inference speed. The computer device can perform inference according to the optimized neural network model, performing operations in sequence according to the arithmetic operations of the operation layers of the optimized model to obtain the inferred data result. For example, the computer device can sequentially operate on an input image according to the sequence of arithmetic operations corresponding to the optimized neural network model to obtain the recognized image result.
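As an illustration of running the optimized model, a toy executor might walk the operation layers in order and apply each layer's arithmetic operation. The callable `op` attribute is an assumption for illustration; the Layer/Tensor sketches above do not define it.

```python
def run(ordered_layers, feed):
    """Execute operation layers sequentially.

    `feed` maps graph-input tensor names to arrays; each layer is assumed to
    expose a callable `op` implementing its arithmetic operation.
    """
    values = dict(feed)                    # tensor name -> computed value
    for layer in ordered_layers:
        args = [values[t.name] for t in layer.inputs]
        values[layer.outputs[0].name] = layer.op(*args)
    return values
```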
In this embodiment, the neural network model inference task is acquired; the neural network model corresponding to the model identifier in the task is obtained; the model is parsed to obtain the computation graph corresponding to it, the computation graph including a connection layer; a pre-constructed data structure template is obtained; the target sub-data structure corresponding to the computation graph is generated according to the template; the connection layer data to be optimized is determined in the computation graph according to the target sub-data structure and the connection layer; the connection layer data to be optimized is optimized according to the target sub-data structure to obtain an optimized neural network model; and inference is performed according to the optimized model to obtain a model inference result. By constructing a data structure template in advance and generating the target sub-data structure corresponding to the computation graph from it, the data structure of the computation graph can be optimized, so that the optimized target sub-data structure can correctly fuse the connection layers in all cases, yielding correct inference results while reducing the memory space occupied by the inference process, reducing memory copy time, and accelerating inference.
In one embodiment, as shown in FIG. 7, the step of generating the target sub-data structure corresponding to the computation graph according to the data structure template includes:
Step 702: traverse the operation layers in the computation graph and identify the connection layers among them.
Step 704: generate the optimized data structure of the output tensor corresponding to each connection layer according to the data structure template and the computation graph.
Step 706: determine the target sub-data structure corresponding to the computation graph according to the optimized data structure of the output tensor corresponding to the connection layer.
The computation graph includes operation layers and tensors; the operation layers include connection layers, and a tensor can be an input tensor or an output tensor of an operation layer. The computer device can traverse all the operation layers in the computation graph and identify the connection layers among them. When a connection layer is identified, the device obtains, according to the multiple template items in the data structure template, the template item data of the output tensor corresponding to that connection layer from the computation graph, and adds each piece of obtained template item data to the corresponding template item, thereby generating the optimized data structure of the output tensor corresponding to the connection layer, until the optimized data structure of the output tensor corresponding to the last connection layer has been generated. The optimized data structure refers to the data structure obtained by optimizing the original data structure of the output tensor.
The computer device can determine the target sub-data structure corresponding to the computation graph according to the generated optimized data structures of the output tensors corresponding to the connection layers. When only one connection layer exists in the computation graph, the optimized data structure of the output tensor corresponding to that connection layer is directly taken as the target sub-data structure. When multiple continuous connection layers exist, as in the computation graph shown in FIG. 5, the computer device can take the optimized data structure of the output tensor corresponding to the last connection layer as the target sub-data structure. When the computation graph includes multiple parallel connection layers, that is, when the output tensor of one operation layer needs to be input into multiple connection layers at the same time, the computer device can take the optimized data structures of the output tensors corresponding to all of those connection layers as the target sub-data structure; in this case, the target sub-data structure includes the optimized data structures of the output tensors corresponding to multiple connection layers.
In this embodiment, the operation layers in the computation graph are traversed and the connection layers are identified; it is only necessary to generate the optimized data structure of the output tensor corresponding to each connection layer according to the data structure template and the computation graph, and to determine the target sub-data structure corresponding to the computation graph from those optimized data structures, so that the data structures required to correctly fuse the connection layers in all complex cases can be obtained quickly.
In one embodiment, the above method further includes: topologically sorting the operation layers in the computation graph to obtain a topological sequence; identifying, in the order of the topological sequence, whether each operation layer is a connection layer; when an operation layer is not a connection layer, skipping it; and when it is a connection layer, generating the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph.
A computation graph can include multiple operation layers, and the computer device can topologically sort them to obtain a topological sequence. Topological sorting means arranging the operation layers into a sequence that satisfies the topological order according to the dependencies between them in the directed computation graph; the resulting sequence is a one-dimensional linear sequence. Specifically, the computer device can first find, in the computation graph, an operation layer with in-degree 0, that is, one with no input edges, store that layer in a stack, and delete the layer and its associated directed edges from the computation graph, adjusting the in-degree of the operation layers whose incoming edges were deleted, for example decrementing the in-degree by 1. It then repeats the steps of finding an operation layer with in-degree 0 and deleting and adjusting, until all operation layers in the computation graph have been saved to the stack; all operation layers in the stack can then be output in the order in which they were stored, yielding the topological sequence. The arrangement of the operation layers in the topological sequence can be determined by that storage order. The computer device can visit the operation layers in the order of the topological sequence and identify whether the arithmetic operation corresponding to each layer is a connection operation; an arithmetic operation refers to the data processing operation performed by each operation layer. If it is not a connection operation, the layer is not a connection layer, and the computer device can skip it without performing the data structure optimization step. If it is a connection operation, the layer is a connection layer, the optimization step is required, and the computer device generates the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph.
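The traversal described here is essentially Kahn's topological sort over in-degrees. A self-contained sketch, using a plain adjacency mapping rather than the graph classes above, could be:

```python
from collections import deque

def topological_sort(successors):
    """successors maps each layer name to the layers it feeds (directed edges)."""
    indegree = {name: 0 for name in successors}
    for outs in successors.values():
        for name in outs:
            indegree[name] += 1
    # start from the layers with in-degree 0, i.e. no input edges
    ready = deque(name for name, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        # "delete" the outgoing edges and adjust the successors' in-degrees
        for nxt in successors[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(successors):
        raise ValueError("cycle detected: no topological order exists")
    return order

# FIG. 3: layer1 and layer2 both feed layer3.
print(topological_sort({"layer1": ["layer3"], "layer2": ["layer3"], "layer3": []}))
```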
In this embodiment, by topologically sorting the operation layers in the computation graph, the computer device can ensure that when a given operation layer is processed, the arithmetic operations of the preceding operation layers have already been completed, which improves the accuracy of connection layer identification and helps obtain the optimized data structure of the output tensor corresponding to each connection layer quickly.
In one embodiment, as shown in FIG. 8, the step of generating the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph includes:
Step 802: obtain the current connection layer, and identify whether an optimized input tensor exists among the input tensors corresponding to the current connection layer.
Step 804: when one exists, obtain the optimized data structure of the optimized input tensor, generate the optimized data structure of the output tensor corresponding to the current connection layer according to that optimized data structure, the computation graph, and the data structure template, update the next connection layer as the current connection layer, and return to the step of identifying whether an optimized input tensor exists among the input tensors corresponding to the current connection layer, until the traversal is complete and the optimized data structures of the output tensors corresponding to the connection layers in the operation layers have been generated.
Step 806: when no optimized input tensor exists among the input tensors corresponding to the current connection layer, extract, according to the data structure template, the template data corresponding to the output tensor of the current connection layer from the computation graph, and add the extracted template data to the data structure template correspondingly, obtaining the optimized data structure of the output tensor corresponding to the current connection layer.
The current connection layer refers to the connection layer currently being visited. An optimized input tensor is an input tensor whose production layer is a connection layer and whose data structure is an optimized data structure generated according to the data structure template and the computation graph.
The computer device obtains the current connection layer and identifies whether an optimized input tensor exists among its input tensors. When an optimized input tensor is identified, this indicates that continuous connection layers exist in the computation graph, and the optimized data structure of the output tensor corresponding to the current connection layer can be generated from the optimized data structure of that optimized input tensor. Specifically, since the data structure of the optimized input tensor is already an optimized data structure, the computer device can obtain it and combine it with the unoptimized input tensors corresponding to the current connection layer. According to the data structure template, the device obtains the template data corresponding to the output tensor of the current connection layer from the optimized data structure of the optimized input tensor and from the computation graph, and adds the obtained template data to the data structure template, thereby obtaining the optimized data structure corresponding to the current connection layer. That optimized data structure can include the production layers, demand layers, and shapes of the input tensors stored in the output tensor corresponding to the current connection layer, the index corresponding to each production layer, the index corresponding to each demand layer, and the size of the input tensor corresponding to each demand layer. The computer device can then identify whether the next operation layer is a connection layer; if so, it takes that layer as the next connection layer, updates it as the current connection layer, and returns to the step of identifying whether an optimized input tensor exists among the input tensors corresponding to the current connection layer, until the optimized data structures of the output tensors corresponding to all connection layers in the operation layers have been generated. By combining the optimized data structure of the optimized input tensor with the unoptimized input tensors corresponding to the current connection layer, the generation efficiency of the optimized data structure can be improved, accelerating connection layer fusion and thus model inference.
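For this continuous-connection case, one plausible reading is that the optimized structure of an already-optimized input tensor is spliced into the structure being built for the current connection layer, with its recorded indices shifted past the entries already present. The sketch below, reusing SubDataStructure from above, makes that reading concrete; the index arithmetic is an assumption about a detail the text leaves open.

```python
def merge_optimized_input(current, optimized):
    """Splice an already-optimized input tensor's SubDataStructure into `current`."""
    base = len(current.sub_shapes)   # entries already present in the current structure
    current.sub_producer_layers.extend(optimized.sub_producer_layers)
    current.sub_consumer_layers.extend(optimized.sub_consumer_layers)
    current.sub_shapes.extend(optimized.sub_shapes)
    # Shift recorded indices so they point at positions in the larger fused tensor.
    for name, idx in optimized.sub_producer_index_table.items():
        current.sub_producer_index_table[name] = base + idx
    for name, idx in optimized.sub_consumer_index_table.items():
        current.sub_consumer_index_table[name] = base + idx
    current.sub_consumer_stride_table.update(optimized.sub_consumer_stride_table)
```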
In one embodiment, when consecutive connection layers exist in the computation graph, that is, when an optimized input tensor exists among the input tensors of the second and every subsequent connection layer, the computer device can generate the optimized data structure of the output tensor of each connection layer in the manner described above, and take the optimized data structure of the output tensor of the last connection layer as the target sub-data structure corresponding to the computation graph.
When no optimized input tensor is identified among the input tensors of the current connection layer, the input tensors of the current connection layer are all basic tensors. The computer device can then extract the template data corresponding to the output tensor of the current connection layer directly from the computation graph according to the data structure template. Specifically, the data structure template includes multiple template items; the computer device can extract, item by item, the template-item data corresponding to the output tensor of the current connection layer and add the extracted data to the corresponding template items, obtaining the optimized data structure of the output tensor of the current connection layer. In this way, an optimized data structure that fuses the connection layer correctly is obtained.
In one embodiment, extracting the template data corresponding to the output tensor of the current connection layer from the computation graph according to the data structure template, and adding the extracted template data to the data structure template to obtain the optimized data structure of the output tensor of the current connection layer, includes: extracting, in turn, the production layer, demand layers, and shape corresponding to each input tensor from the computation graph according to the data structure template; adding the extracted production layers, demand layers, and shapes to the data structure template; establishing, in the production-layer index table of the data structure template, the index corresponding to each production layer; establishing, in the demand-layer data table of the data structure template, the index corresponding to each demand layer, and counting the size of the input tensor corresponding to each demand layer; and attaching the demand layers other than the connection layer, among the demand layers corresponding to the input tensors, to the demand-layer data table, obtaining the optimized data structure of the output tensor of the current connection layer.
The template data corresponding to the output tensor of the current connection layer includes the production layers, demand layers, and shapes of the output tensor of the current connection layer, the index of each production layer, the input-tensor index of each demand layer, and the size of the input tensor corresponding to each demand layer. During connection-layer fusion the input tensors of the connection layer are deleted, so the information they carry must be preserved, and the only place it can be preserved is the output tensor of the connection layer. Specifically, the computer device extracts from the computation graph the production layer of each input tensor of the connection layer as a production layer of the output tensor of the current connection layer, extracts the demand layers of each input tensor as demand layers of that output tensor, and extracts the shape of each input tensor as part of the shape of that output tensor. After extracting the template data, the computer device adds it to the corresponding template items of the data structure template. The computer device also establishes, in the production-layer index table of the data structure template, an index for each production layer; the production-layer index distinguishes the production layers and indicates each production layer's position in the computation graph, ensuring that the positions of the production layers in the optimized neural network model remain accurate. The computer device establishes, in the demand-layer data table of the data structure template, an index for each demand layer, and counts the size of the input tensor corresponding to each demand layer. The demand-layer index tells a demand layer at which position in the output tensor to find its own input tensor; that is, it records the position of each input tensor within the output tensor, so that after the connection layers are fused, the input tensor of each demand layer can be located by this position. The size of the input tensor corresponding to each demand layer indicates how many basic tensors that demand layer needs. The computer device attaches the demand layers other than the connection layer, among the demand layers of each input tensor of the connection layer, to the demand-layer data table; the attached demand layers are operation layers that do not participate in the concatenation, which avoids the problem of the output tensor being handed directly, after optimization, to operation layers that did not participate in the concatenation, and so improves the correctness of connection-layer fusion.
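A matching sketch of this from-scratch construction, under the same hypothetical assumptions as the earlier snippets:

```python
# Hypothetical: build the optimized data structure for a connection layer
# whose input tensors are all basic tensors (no optimized input exists).
def build_structure(layer):
    s = OptimizedStructure()
    for offset, t in enumerate(layer.inputs):
        s.sub_producer_layers.append(t.producer)            # production layer
        s.sub_consumer_layers.append(
            [c for c in t.consumers if c is not layer])     # demand layers other than this connection layer
        s.sub_shapes.append(t.shape)
        s.sub_producer_index_table[t.producer] = offset     # production-layer index
        s.sub_consumer_index_table[t] = offset              # where demand layers find t in the output tensor
        s.sub_consumer_stride_table[t] = 1                  # size: one basic-tensor unit
    return s
```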
The process of generating the target sub-data structure of a computation graph is illustrated with the computation graph in FIG. 5. layer1 through layer7 are operation layers, and layer3, layer5, and layer7 are connection layers; the consumer layers pointed to by the dashed arrows from tensor1, tensor3, and tensor5 are the demand layers, other than the connection layers, of each tensor. The computer device traverses the operation layers in the order layer1 through layer7. layer1 and layer2 are identified as not being connection layers and are skipped. layer3 is identified next: it is a connection layer, and its two input tensors tensor1 and tensor2 are both unoptimized input tensors, so the steps for generating the optimized data structure of tensor3 are as shown in A1 to F1 below:
A1. sub_producer_layers = [layer1, layer2]
B1. sub_consumer_layers = [tensor1's consumer layers, tensor2's consumer layers]
C1. sub_shapes = [tensor1's shape, tensor2's shape]
D1. sub_producer_index_table:

sub producer name | layer1 | layer2
index             | 0      | 1
E1. sub_consumer_index_table and sub_consumer_stride_table:

sub consumer name | tensor1's consumers | tensor2's consumers
index             | 0                   | 1
stride            | 1                   | 1
F1. Attach, to the demand-layer table in E1, the consumers of tensor1 and tensor2 other than the connection layer layer3.
In step B1, tensor1's consumer layers denotes the demand layers of tensor1, and tensor2's consumer layers denotes the demand layers of tensor2. The table in step E1 is the merged form of sub_consumer_index_table and sub_consumer_stride_table and may be called the demand-layer table. In the demand-layer table, tensor1's consumers denotes the demand layers of tensor1; an index of 0 for tensor1's consumers means that the demand layers of tensor1 can find the input tensor tensor1 at position 0 of the output tensor, and a stride of 1 for tensor1's consumers means that tensor1 has a size of one unit. Likewise, an index of 1 for tensor2's consumers means that the demand layers of tensor2 can find the input tensor tensor2 at position 1 of the output tensor, and a stride of 1 for tensor2's consumers means that tensor2 has a size of one unit.
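Read this way, the demand-layer table also suggests how, after fusion, a surviving demand layer would locate its former input inside the fused output tensor. The following is again only a hypothetical sketch:

```python
# Hypothetical: a demand layer slices its former input out of the fused
# output tensor using the index and stride recorded for it.
def locate_input(struct, key, fused_output):
    start = struct.sub_consumer_index_table[key]   # e.g. 0 for tensor1's consumers
    size = struct.sub_consumer_stride_table[key]   # e.g. 1 unit for tensor1
    return fused_output[start:start + size]        # a view, so no memory copy
```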
The computer device continues with layer4; it is not a connection layer and is skipped. layer5 is identified next: it is a connection layer, and among its input tensors tensor3 is an optimized input tensor, so the steps for generating the optimized data structure of tensor5 are as shown in A2 to G2 below:
A2. Obtain the optimized data structure of tensor3 and combine it with tensor4.
B2. sub_producer_layers = [layer1, layer2, layer4]
C2. sub_consumer_layers = [tensor1's consumer layers, tensor2's consumer layers, tensor3's consumer layers, tensor4's consumer layers]
D2. sub_shapes = [tensor1's shape, tensor2's shape, tensor4's shape]
E2. sub_producer_index_table:

sub producer name | layer1 | layer2 | layer4
index             | 0      | 1      | 2
F2. sub_consumer_index_table and sub_consumer_stride_table:

sub consumer name | tensor1's consumers | tensor2's consumers | tensor3's consumers | tensor4's consumers
index             | 0                   | 1                   | 0                   | 2
stride            | 1                   | 1                   | 2                   | 1
G2. Attach, to the demand-layer table in F2, the consumers of tensor1 through tensor4 other than the connection layers layer3 and layer5.
The computer device continues with layer6; it is not a connection layer and is skipped. layer7 is identified next: it is a connection layer, and among its input tensors tensor5 is an optimized input tensor, so the steps for generating the optimized data structure of tensor7 are as shown in A3 to H3 below:
A3. Obtain the optimized data structure of tensor5 and combine it with tensor6.
B3. sub_producer_layers = [layer1, layer2, layer4, layer6]
C3. sub_consumer_layers = [tensor1's consumer layers, tensor2's consumer layers, tensor3's consumer layers, tensor4's consumer layers, tensor5's consumer layers, tensor6's consumer layers]
D3. sub_shapes = [tensor1's shape, tensor2's shape, tensor4's shape, tensor6's shape]
E3. sub_producer_index_table:

sub producer name | layer1 | layer2 | layer4 | layer6
index             | 0      | 1      | 2      | 3
F3. sub_consumer_index_table:

sub consumer name | tensor1's consumers | tensor2's consumers | tensor3's consumers | tensor4's consumers | tensor5's consumers | tensor6's consumers
index             | 0                   | 1                   | 0                   | 2                   | 0                   | 3
G3. sub_consumer_stride_table:

sub consumer name | tensor1's consumers | tensor2's consumers | tensor3's consumers | tensor4's consumers | tensor5's consumers | tensor6's consumers
stride            | 1                   | 1                   | 2                   | 1                   | 3                   | 1
H3. Attach, to sub_consumer_index_table and sub_consumer_stride_table, the consumers of tensor1 through tensor6 other than the connection layers layer3, layer5, and layer7.
The computer device takes the resulting optimized data structure of tensor7 as the target sub-data structure corresponding to the computation graph.
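Written out with the hypothetical record sketched earlier, with string keys standing in for the actual layer and tensor objects, the resulting target sub-data structure would read roughly as follows:

```python
# Illustrative instance of the target sub-data structure from steps A3 to H3;
# every key below is a hypothetical stand-in, not a name from the embodiment.
tensor7_struct = OptimizedStructure(
    sub_producer_layers=["layer1", "layer2", "layer4", "layer6"],
    sub_consumer_layers=[["tensor1's consumers"], ["tensor2's consumers"],
                         ["tensor3's consumers"], ["tensor4's consumers"],
                         ["tensor5's consumers"], ["tensor6's consumers"]],
    sub_shapes=["tensor1's shape", "tensor2's shape",
                "tensor4's shape", "tensor6's shape"],
    sub_producer_index_table={"layer1": 0, "layer2": 1, "layer4": 2, "layer6": 3},
    sub_consumer_index_table={"tensor1's consumers": 0, "tensor2's consumers": 1,
                              "tensor3's consumers": 0, "tensor4's consumers": 2,
                              "tensor5's consumers": 0, "tensor6's consumers": 3},
    sub_consumer_stride_table={"tensor1's consumers": 1, "tensor2's consumers": 1,
                               "tensor3's consumers": 2, "tensor4's consumers": 1,
                               "tensor5's consumers": 3, "tensor6's consumers": 1},
)
```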
In one embodiment, determining, in the computation graph, the connection layer data to be optimized according to the target sub-data structure and the connection layers includes: determining the connection layers to be optimized among the connection layers according to the target sub-data structure; obtaining, in the computation graph, the input tensor corresponding to each connection layer to be optimized; and taking the connection layers to be optimized and the obtained input tensors as the connection layer data to be optimized.
The computer device determines the connection layers to be optimized among the connection layers according to the target sub-data structure; the connection layers to be optimized are the connection layers to be deleted. Because a connection layer merely concatenates its input tensors, the data stored in its output tensor is exactly the same as the data stored in its input tensors, so the connection layer and its corresponding input tensors can all be deleted. The computer device therefore obtains, in the computation graph, the input tensor corresponding to each connection layer to be optimized, and takes the connection layers to be optimized together with the obtained input tensors as the connection layer data to be optimized.
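A sketch of this selection step, using the same hypothetical stand-ins as before:

```python
# Hypothetical: the connection layers to be optimized, plus every input
# tensor they consume, form the connection layer data to be deleted.
def collect_fusion_targets(connection_layers):
    tensors = [t for layer in connection_layers for t in layer.inputs]
    return list(connection_layers), tensors
```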
Taking FIG. 7 as an example, the computer device can determine from the optimized data structure of tensor7 that the connection layers to be optimized are layer3, layer5, and layer7, and obtain their input tensors in the computation graph, namely tensor1 through tensor6. layer3, layer5, and layer7 together with tensor1 through tensor6 are therefore taken as the connection layer data to be optimized, and the connection layer data to be optimized is deleted.
In this embodiment, because the obtained target sub-data structure is a data structure that fuses the connection layers correctly, determining the connection layer data to be optimized according to the target sub-data structure improves the accuracy of that data, so the connection layers are fused correctly according to it; and once the connection layers are fused, model inference speed also improves.
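The optimization pass itself can then be sketched as deletion followed by re-connection through the fused output tensor, taking the layers and tensors collected above. This is a hypothetical outline only, with assumed graph-rewriting helpers:

```python
# Hypothetical: delete the connection layer data to be optimized, then rewire
# the pruned graph through the fused output tensor using the target structure.
def fuse_connection_layers(graph, target, doomed_layers, doomed_tensors,
                           fused_output):
    for obj in doomed_layers + doomed_tensors:
        graph.remove(obj)                                    # assumed API
    # Production layers now write straight into the fused output tensor,
    # each at the position recorded in the production-layer index table.
    for producer, pos in target.sub_producer_index_table.items():
        graph.redirect_output(producer, fused_output, pos)   # assumed API
    # Attached demand layers read their former inputs back out of the fused
    # output tensor by index and stride (see locate_input above).
    for key, pos in target.sub_consumer_index_table.items():
        size = target.sub_consumer_stride_table[key]
        graph.redirect_inputs(key, fused_output, pos, size)  # assumed API
    return graph
```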
In one embodiment, as shown in FIG. 9, a neural network model inference apparatus is provided, including a task acquisition module 902, a model parsing module 904, a structure generation module 906, a data determination module 908, a model optimization module 910, and a model inference module 912, wherein:

the task acquisition module 902 is configured to obtain a neural network model inference task, the neural network model inference task including a model identifier;

the model parsing module 904 is configured to obtain the neural network model corresponding to the model identifier and parse the neural network model to obtain the computation graph corresponding to the neural network model, the computation graph including a connection layer;

the structure generation module 906 is configured to obtain a pre-built data structure template and generate the target sub-data structure corresponding to the computation graph according to the data structure template;

the data determination module 908 is configured to determine, in the computation graph, the connection layer data to be optimized according to the target sub-data structure and the connection layer;

the model optimization module 910 is configured to optimize the connection layer data to be optimized according to the target sub-data structure to obtain the optimized neural network model; and

the model inference module 912 is configured to perform inference according to the optimized neural network model to obtain the model inference result.
In one embodiment, the computation graph includes operation layers and tensors, the operation layers include connection layers, and each tensor is an input tensor or an output tensor of an operation layer. The structure generation module 906 is further configured to traverse the operation layers in the computation graph to identify the connection layers; generate the optimized data structure of the output tensor corresponding to each connection layer according to the data structure template and the computation graph; and determine the target sub-data structure corresponding to the computation graph according to the optimized data structures of the output tensors corresponding to the connection layers.

In one embodiment, the apparatus further includes an identification module configured to topologically sort the operation layers in the computation graph to obtain a topological sequence; identify, in the order of the topological sequence, whether each operation layer is a connection layer; skip the operation layer when it is not a connection layer; and generate the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph when it is a connection layer.

In one embodiment, the structure generation module 906 is further configured to obtain the current connection layer and identify whether an optimized input tensor exists among the input tensors corresponding to the current connection layer; and, when one exists, obtain the optimized data structure of the optimized output tensor, generate the optimized data structure of the output tensor corresponding to the current connection layer according to the optimized data structure of the optimized output tensor, the computation graph, and the data structure template, update the next connection layer to be the current connection layer, and return to the step of identifying whether an optimized input tensor exists among the input tensors corresponding to the current connection layer, until the optimized data structures of the output tensors corresponding to all connection layers among the operation layers are generated.

In one embodiment, the structure generation module 906 is further configured to, when no optimized input tensor exists among the input tensors corresponding to the current connection layer, extract from the computation graph, according to the data structure template, the template data corresponding to the output tensor of the current connection layer, and add the extracted template data to the data structure template to obtain the optimized data structure of the output tensor corresponding to the current connection layer.

In one embodiment, the structure generation module 906 is further configured to extract, in turn, from the computation graph according to the data structure template, the production layer, demand layers, and shape corresponding to each input tensor; add the extracted production layers, demand layers, and shapes to the data structure template; establish, in the production-layer index table of the data structure template, the production-layer index corresponding to each production layer; establish, in the demand-layer data table of the data structure template, the demand-layer index corresponding to each demand layer, and count the size of the input tensor corresponding to each demand layer; and attach the demand layers other than the connection layer, among the demand layers corresponding to the input tensors, to the demand-layer data table to obtain the optimized data structure of the output tensor corresponding to the current connection layer.

In one embodiment, the data determination module 908 is further configured to determine the connection layers to be optimized among the connection layers according to the target sub-data structure; obtain, in the computation graph, the input tensor corresponding to each connection layer to be optimized; and take the connection layers to be optimized and the obtained input tensors as the connection layer data to be optimized.

In one embodiment, the model optimization module 910 is further configured to delete the connection layer data to be optimized to obtain a computation graph after deletion, and connect the computation graph after deletion according to the target sub-data structure to obtain the optimized neural network model.
For specific limitations of the neural network model inference apparatus, reference may be made to the limitations of the neural network model inference method above, which are not repeated here. Each module in the above neural network model inference apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in a computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a communication interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device stores data for a neural network model inference method. The communication interface of the computer device is used to connect and communicate with an external terminal. The computer-readable instructions, when executed by the processor, implement a neural network model inference method.
Those skilled in the art will understand that the structure shown in FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps in each of the method embodiments described above.
One or more computer storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the steps in each of the method embodiments described above.
The computer storage medium is a readable storage medium, which may be non-volatile or volatile.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

1. A neural network model inference method, comprising:
    obtaining a neural network model inference task, the neural network model inference task comprising a model identifier;
    obtaining a neural network model corresponding to the model identifier, and parsing the neural network model to obtain a computation graph corresponding to the neural network model, the computation graph comprising a connection layer;
    obtaining a pre-built data structure template, and generating a target sub-data structure corresponding to the computation graph according to the data structure template;
    determining, in the computation graph, connection layer data to be optimized according to the target sub-data structure and the connection layer;
    optimizing the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
    performing inference according to the optimized neural network model to obtain a model inference result.

2. The method according to claim 1, wherein the computation graph comprises operation layers and tensors, the operation layers comprise the connection layer, each tensor is an input tensor or an output tensor of an operation layer, and generating the target sub-data structure corresponding to the computation graph according to the data structure template comprises:
    traversing the operation layers in the computation graph to identify the connection layers among the operation layers;
    generating an optimized data structure of the output tensor corresponding to each connection layer according to the data structure template and the computation graph; and
    determining the target sub-data structure corresponding to the computation graph according to the optimized data structures of the output tensors corresponding to the connection layers.

3. The method according to claim 2, further comprising:
    topologically sorting the operation layers in the computation graph to obtain a topological sequence;
    identifying, in the order of the topological sequence, whether each operation layer is a connection layer;
    when an operation layer is not a connection layer, skipping the operation layer; and
    when an operation layer is a connection layer, generating the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph.

4. The method according to any one of claims 2 to 3, wherein generating the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph comprises:
    obtaining a current connection layer, and identifying whether an optimized input tensor exists among the input tensors corresponding to the current connection layer; and
    when an optimized input tensor exists, obtaining the optimized data structure of the optimized output tensor, generating the optimized data structure of the output tensor corresponding to the current connection layer according to the optimized data structure of the optimized output tensor, the computation graph, and the data structure template, updating the next connection layer to be the current connection layer, and returning to the step of identifying whether an optimized input tensor exists among the input tensors corresponding to the current connection layer, until the optimized data structures of the output tensors corresponding to all connection layers among the operation layers are generated.

5. The method according to claim 4, further comprising:
    when no optimized input tensor exists among the input tensors corresponding to the current connection layer, extracting, from the computation graph according to the data structure template, template data corresponding to the output tensor of the current connection layer, and adding the extracted template data to the data structure template to obtain the optimized data structure of the output tensor corresponding to the current connection layer.

6. The method according to claim 5, wherein extracting, from the computation graph according to the data structure template, the template data corresponding to the output tensor of the current connection layer, and adding the extracted template data to the data structure template to obtain the optimized data structure of the output tensor corresponding to the current connection layer comprises:
    extracting, in turn, from the computation graph according to the data structure template, the production layer, the demand layers, and the shape corresponding to each input tensor;
    adding the extracted production layers, demand layers, and shapes to the data structure template;
    establishing, in a production-layer index table of the data structure template, a production-layer index corresponding to each production layer;
    establishing, in a demand-layer data table of the data structure template, a demand-layer index corresponding to each demand layer, and counting the size of the input tensor corresponding to each demand layer; and
    attaching the demand layers other than the connection layer, among the demand layers corresponding to the input tensors, to the demand-layer data table to obtain the optimized data structure of the output tensor corresponding to the current connection layer.

7. The method according to claim 1, wherein determining, in the computation graph, the connection layer data to be optimized according to the target sub-data structure and the connection layer comprises:
    determining, among the connection layers, the connection layers to be optimized according to the target sub-data structure;
    obtaining, in the computation graph, the input tensor corresponding to each of the connection layers to be optimized; and
    taking the connection layers to be optimized and the obtained input tensors as the connection layer data to be optimized.

8. The method according to claim 1, wherein optimizing the connection layer data to be optimized according to the target sub-data structure to obtain the optimized neural network model comprises:
    deleting the connection layer data to be optimized to obtain a computation graph after deletion; and
    connecting the computation graph after deletion according to the target sub-data structure to obtain the optimized neural network model.
9. A neural network model inference apparatus, comprising:
    a task acquisition module, configured to obtain a neural network model inference task, the neural network model inference task comprising a model identifier;
    a model parsing module, configured to obtain a neural network model corresponding to the model identifier and parse the neural network model to obtain a computation graph corresponding to the neural network model, the computation graph comprising a connection layer;
    a structure generation module, configured to obtain a pre-built data structure template and generate a target sub-data structure corresponding to the computation graph according to the data structure template;
    a data determination module, configured to determine, in the computation graph, connection layer data to be optimized according to the target sub-data structure and the connection layer;
    a model optimization module, configured to optimize the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
    a model inference module, configured to perform inference according to the optimized neural network model to obtain a model inference result.

10. The apparatus according to claim 9, wherein the computation graph comprises operation layers and tensors, the operation layers comprise the connection layer, each tensor is an input tensor or an output tensor of an operation layer, and the structure generation module is further configured to: traverse the operation layers in the computation graph to identify the connection layers among the operation layers; generate an optimized data structure of the output tensor corresponding to each connection layer according to the data structure template and the computation graph; and determine the target sub-data structure corresponding to the computation graph according to the optimized data structures of the output tensors corresponding to the connection layers.
11. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    obtaining a neural network model inference task, the neural network model inference task comprising a model identifier;
    obtaining a neural network model corresponding to the model identifier, and parsing the neural network model to obtain a computation graph corresponding to the neural network model, the computation graph comprising a connection layer;
    obtaining a pre-built data structure template, and generating a target sub-data structure corresponding to the computation graph according to the data structure template;
    determining, in the computation graph, connection layer data to be optimized according to the target sub-data structure and the connection layer;
    optimizing the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
    performing inference according to the optimized neural network model to obtain a model inference result.

12. The computer device according to claim 11, wherein the computation graph comprises operation layers and tensors, the operation layers comprise the connection layer, each tensor is an input tensor or an output tensor of an operation layer, and the computer-readable instructions, when executed by the processor, further cause the processor to: traverse the operation layers in the computation graph to identify the connection layers among the operation layers; generate an optimized data structure of the output tensor corresponding to each connection layer according to the data structure template and the computation graph; and determine the target sub-data structure corresponding to the computation graph according to the optimized data structures of the output tensors corresponding to the connection layers.

13. The computer device according to claim 12, wherein the computer-readable instructions, when executed by the processor, further cause the processor to: topologically sort the operation layers in the computation graph to obtain a topological sequence; identify, in the order of the topological sequence, whether each operation layer is a connection layer; skip the operation layer when it is not a connection layer; and generate the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph when it is a connection layer.

14. The computer device according to any one of claims 12 to 13, wherein the computer-readable instructions, when executed by the processor, further cause the processor to: obtain a current connection layer, and identify whether an optimized input tensor exists among the input tensors corresponding to the current connection layer; and, when one exists, obtain the optimized data structure of the optimized output tensor, generate the optimized data structure of the output tensor corresponding to the current connection layer according to the optimized data structure of the optimized output tensor, the computation graph, and the data structure template, update the next connection layer to be the current connection layer, and return to the step of identifying whether an optimized input tensor exists among the input tensors corresponding to the current connection layer, until the optimized data structures of the output tensors corresponding to all connection layers among the operation layers are generated.

15. The computer device according to claim 14, wherein the computer-readable instructions, when executed by the processor, further cause the processor to: when no optimized input tensor exists among the input tensors corresponding to the current connection layer, extract, from the computation graph according to the data structure template, template data corresponding to the output tensor of the current connection layer, and add the extracted template data to the data structure template to obtain the optimized data structure of the output tensor corresponding to the current connection layer.
16. One or more computer storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a neural network model inference task, the neural network model inference task comprising a model identifier;
    obtaining a neural network model corresponding to the model identifier, and parsing the neural network model to obtain a computation graph corresponding to the neural network model, the computation graph comprising a connection layer;
    obtaining a pre-built data structure template, and generating a target sub-data structure corresponding to the computation graph according to the data structure template;
    determining, in the computation graph, connection layer data to be optimized according to the target sub-data structure and the connection layer;
    optimizing the connection layer data to be optimized according to the target sub-data structure to obtain an optimized neural network model; and
    performing inference according to the optimized neural network model to obtain a model inference result.

17. The storage medium according to claim 16, wherein the computation graph comprises operation layers and tensors, the operation layers comprise the connection layer, each tensor is an input tensor or an output tensor of an operation layer, and the computer-readable instructions, when executed by the processor, further perform the following steps: traversing the operation layers in the computation graph to identify the connection layers among the operation layers; generating an optimized data structure of the output tensor corresponding to each connection layer according to the data structure template and the computation graph; and determining the target sub-data structure corresponding to the computation graph according to the optimized data structures of the output tensors corresponding to the connection layers.

18. The storage medium according to any one of claims 16 to 17, wherein the computer-readable instructions, when executed by the processor, further perform the following steps: topologically sorting the operation layers in the computation graph to obtain a topological sequence; identifying, in the order of the topological sequence, whether each operation layer is a connection layer; skipping the operation layer when it is not a connection layer; and generating the optimized data structure of the output tensor corresponding to the connection layer according to the data structure template and the computation graph when it is a connection layer.

19. The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the processor, further perform the following steps: obtaining a current connection layer, and identifying whether an optimized input tensor exists among the input tensors corresponding to the current connection layer; and, when one exists, obtaining the optimized data structure of the optimized output tensor, generating the optimized data structure of the output tensor corresponding to the current connection layer according to the optimized data structure of the optimized output tensor, the computation graph, and the data structure template, updating the next connection layer to be the current connection layer, and returning to the step of identifying whether an optimized input tensor exists among the input tensors corresponding to the current connection layer, until the optimized data structures of the output tensors corresponding to all connection layers among the operation layers are generated.

20. The storage medium according to claim 19, wherein the computer-readable instructions, when executed by the processor, further perform the following step: when no optimized input tensor exists among the input tensors corresponding to the current connection layer, extracting, from the computation graph according to the data structure template, template data corresponding to the output tensor of the current connection layer, and adding the extracted template data to the data structure template to obtain the optimized data structure of the output tensor corresponding to the current connection layer.
PCT/CN2021/086552 2021-04-12 2021-04-12 Neural network model inference method and apparatus, computer device, and storage medium WO2022217419A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/086552 WO2022217419A1 (en) 2021-04-12 2021-04-12 Neural network model inference method and apparatus, computer device, and storage medium
CN202180050194.4A CN115867923A (en) 2021-04-12 2021-04-12 Neural network model inference method, neural network model inference device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/086552 WO2022217419A1 (en) 2021-04-12 2021-04-12 Neural network model inference method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022217419A1 true WO2022217419A1 (en) 2022-10-20

Family

ID=83639411

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086552 WO2022217419A1 (en) 2021-04-12 2021-04-12 Neural network model inference method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN115867923A (en)
WO (1) WO2022217419A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245741A (en) * 2018-03-09 2019-09-17 佳能株式会社 Optimization and methods for using them, device and the storage medium of multilayer neural network model
US20190378014A1 (en) * 2018-06-08 2019-12-12 Oki Electric Industry Co., Ltd. Neural network load reduction device, information processing unit, and neural network load reduction method and computer-readable storage medium
CN109919315A (en) * 2019-03-13 2019-06-21 科大讯飞股份有限公司 A kind of forward inference method, apparatus, equipment and the storage medium of neural network

Also Published As

Publication number Publication date
CN115867923A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
WO2021114625A1 (en) Network structure construction method and apparatus for use in multi-task scenario
WO2019136993A1 (en) Text similarity calculation method and device, computer apparatus, and storage medium
CN110458324B (en) Method and device for calculating risk probability and computer equipment
CN109815333A (en) Information acquisition method, device, computer equipment and storage medium
CN106649503A (en) Query method and system based on sql
CN111126668A (en) Spark operation time prediction method and device based on graph convolution network
CN113254630B (en) Domain knowledge map recommendation method for global comprehensive observation results
WO2024022354A1 (en) Object recommendation method and apparatus for implementing ia in view of rpa and ai, and storage medium
CN112560444A (en) Text processing method and device, computer equipment and storage medium
CN115062016A (en) Incidence relation extraction method and device and computer equipment
WO2022141489A1 (en) Deep learning model reasoning method and apparatus, computer device, and storage medium
CN109656947B (en) Data query method and device, computer equipment and storage medium
Garrido-Munoz et al. A holistic approach for image-to-graph: application to optical music recognition
WO2022217419A1 (en) Neural network model inference method and apparatus, computer device, and storage medium
WO2023093689A1 (en) Computational graph optimization method and apparatus, and device
WO2020132933A1 (en) Short text filtering method and apparatus, medium and computer device
CN114090722B (en) Method and device for automatically completing query content
CN113811897B (en) Inference method and apparatus of neural network model, computer device, and storage medium
CN114780700A (en) Intelligent question-answering method, device, equipment and medium based on machine reading understanding
US11386155B2 (en) Filter evaluation in a database system
CN117421565B (en) Markov blanket-based equipment assessment method and device and computer equipment
US11893012B1 (en) Content extraction using related entity group metadata from reference objects
CN111309572B (en) Test analysis method and device, computer equipment and storage medium
CN116383883B (en) Big data-based data management authority processing method and system
US20230013748A1 (en) Artificial Intelligence (AI) Framework to Identify Object-Relational Mapping Issues in Real-Time

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21936314

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21936314

Country of ref document: EP

Kind code of ref document: A1