CN112529169A - Data processing method, model optimization device and model execution device - Google Patents

Data processing method, model optimization device and model execution device

Info

Publication number
CN112529169A
Authority
CN
China
Prior art keywords
memory
neural network
size
network model
memory space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910883288.4A
Other languages
Chinese (zh)
Inventor
张臻
鲍翀
袁鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201910883288.4A priority Critical patent/CN112529169A/en
Priority to PCT/CN2020/116183 priority patent/WO2021052460A1/en
Publication of CN112529169A publication Critical patent/CN112529169A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The embodiments of the present application disclose a data processing method in the field of artificial intelligence, which is used to reduce the loading time of a neural network model. The method in the embodiments of the present application includes the following steps: acquiring a neural network model; determining the memory size of the memory space required for inference operation based on the neural network model; and updating the neural network model to obtain a target neural network model, where the target neural network model carries information indicating the size of the memory.

Description

Data processing method, model optimization device and model execution device
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a data processing method, a model optimization apparatus, and a model execution apparatus.
Background
Deep Learning (DL) is an important technology in the field of Artificial Intelligence (AI). DL can generally be divided into two processes: training and inference. In the training process, a complex deep neural network model is trained by inputting a large amount of sample data and adopting methods such as unsupervised learning and reinforcement learning. The inference process uses the trained model and new data to "infer" various conclusions; for example, a video monitoring device judges, through the trained deep neural network model, whether a captured face belongs to a blacklist. Because the amount of training data is huge and the deep neural network structure is complex, the training process involves a huge amount of computation and is usually performed by a device with powerful computing capability, such as a server or a device integrated with a dedicated processor. As shown in fig. 1, the inference process based on the neural network model converts input data into output through multiple operator layers; in this process, the input of one operator layer is the output of the previous operator layer, and the output of an operator layer is the input of the next operator layer. The inference process is less computationally intensive than the training process; therefore, applications based on neural network inference are deployed on more and more intelligent devices, such as mobile phones, speakers, headsets, and IoT devices. However, the layer-by-layer operation characteristic of neural network inference poses a significant challenge to the memory usage of the intelligent device. When an operator layer is running, the memory required for the input and output of the operator layer needs to be provided. For different operator layers, the memory sizes required for input and output may also differ, so the runtime system (runtime) of the device needs to frequently perform dynamic memory allocation (malloc) and release (free) operations during neural network inference, which makes the neural network inference process excessively time-consuming.
To solve this problem, in the prior art a memory allocation and multiplexing algorithm is deployed on the intelligent device. Before the neural network model is run, the neural network inference process is simulated by running the memory allocation and multiplexing algorithm to obtain the size of the memory block required by the output tensor of each operator layer, and the runtime is called to allocate all the memory blocks at one time before neural network inference is run. In this way, memory allocation can be completed at one time according to the pre-calculated result, and the time overhead of memory allocation and release is reduced.
It can be seen that, in the prior art, the memory required by the operation of the neural network model is pre-calculated based on the memory allocation and multiplexing algorithm deployed on the intelligent device, so as to reduce the number of memory allocation and release operations. However, when the resources of the intelligent device are limited, a large amount of pre-calculation time is required, which makes the model loading time too long.
Disclosure of Invention
A first aspect of an embodiment of the present application provides a data processing method, including: acquiring a neural network model; determining the memory size of the memory space required for reasoning operation based on the neural network model; and updating the neural network model to obtain a target neural network model, wherein the target neural network model carries information indicating the size of the memory.
In a possible implementation manner of the first aspect, the determining a memory size of a memory space required for the neural network model inference operation includes: determining the operation sequence of an operator layer and the size of a memory block of a tensor, wherein the operator layer is a basic calculation unit for performing inference operation based on the neural network model, and the tensor is input or output of the operator layer; and determining the size of the memory and the position of the memory block in the memory space through a preset memory multiplexing algorithm according to the size of the memory block and the operation sequence, wherein the position is used for indicating the memory block.
In a possible implementation manner of the first aspect, the target neural network model carries a running sequence of the operator layer and a memory offset of a tensor, where the memory offset is used to indicate a position of the memory block in the memory space.
In a possible implementation manner of the first aspect, if the neural network model includes a first neural network model and a second neural network model, the determining a memory size of a memory space required for the neural network model inference operation includes: determining a first memory size of a first memory space required by the first neural network model inference operation and a second memory size of a second memory space required by the second neural network model inference operation; and determining the memory size of the memory space according to the first memory size and the second memory size.
In one possible implementation of the first aspect, a running order of the first neural network model and the second neural network model is determined, the running order including serial or parallel; if the running sequence is serial, the memory size of the memory space is the first memory size, and the first memory size is larger than or equal to the second memory size; and if the running sequence is parallel, the memory size of the memory space is the sum of the first memory size and the second memory size.
A second aspect of the embodiments of the present application provides a data processing method, including: acquiring a neural network model, wherein the neural network model carries information indicating the memory size, and the information indicating the memory size is used for indicating the total memory space required by the neural network model inference operation; allocating a total memory space with a size larger than or equal to the memory size in the memory; and performing reasoning operation according to the neural network model based on the distributed total memory space.
In a possible implementation manner of the second aspect, the neural network model further includes: the memory offset of the running sequence and tensor of the operator layer is used for indicating the position of the memory block of the tensor in the total memory space, wherein the operator layer is a basic calculation unit of inference operation of the neural network model, and the tensor is input or output of the operator layer.
In a possible implementation manner of the second aspect, the allocating, in the memory, a total memory space with a size equal to the memory size includes: determining, by a runtime system, a base address of the total memory space.
In a possible implementation manner of the second aspect, the position of the memory block of the tensor in the total memory space is determined according to the base address of the total memory space and the memory offset of the tensor.
In a possible implementation manner of the second aspect, the memory size is determined by a memory multiplexing algorithm.
In a possible implementation manner of the second aspect, if the neural network model includes a first neural network model and a second neural network model, the neural network model further includes: a running sequence of the first and second neural network models, the running sequence comprising serial or parallel.
In a possible implementation manner of the second aspect, if the running sequence is serial, the memory size of the total memory space is a first memory size, where the first memory size is greater than or equal to a second memory size, the first memory size is the size of a first memory space required by the first neural network model inference operation, and the second memory size is the size of a second memory space required by the second neural network model inference operation; and if the running sequence is parallel, the memory size of the total memory space is the sum of the first memory size and the second memory size.
In one possible implementation manner of the second aspect, the method further includes: and releasing the total memory space once after the reasoning operation is finished.
A third aspect of the embodiments of the present application provides a model optimization apparatus, including: the acquisition module is used for acquiring a neural network model; the determining module is used for determining the memory size of the memory space required by the inference operation based on the neural network model; and the updating module is used for updating the neural network model to obtain a target neural network model, and the target neural network model carries information indicating the size of the memory.
In a possible implementation manner of the third aspect, the determining module is specifically configured to: determining the operation sequence of an operator layer and the size of a memory block of a tensor, wherein the operator layer is a basic calculation unit for performing inference operation based on the neural network model, and the tensor is input or output of the operator layer; and determining the size of the memory and the position of the memory block in the memory space through a preset memory multiplexing algorithm according to the size of the memory block and the operation sequence, wherein the position is used for indicating the memory block.
In a possible implementation manner of the third aspect, the target neural network model carries a running order of the operator layer and a memory offset of a tensor, where the memory offset is used to indicate a position of the memory block in the memory space.
In a possible implementation manner of the third aspect, if the neural network model includes a first neural network model and a second neural network model, the determining module is specifically configured to: determining a first memory size of a first memory space required by the first neural network model inference operation and a second memory size of a second memory space required by the second neural network model inference operation; and determining the memory size of the memory space according to the first memory size and the second memory size.
In a possible implementation manner of the third aspect, the determining module is further configured to: determining a running order of the first neural network model and the second neural network model, the running order comprising serial or parallel; if the running sequence is serial, the memory size of the memory space is the first memory size, and the first memory size is larger than or equal to the second memory size; and if the running sequence is parallel, the memory size of the memory space is the sum of the first memory size and the second memory size.
A fourth aspect of the embodiments of the present application provides a model execution apparatus, including: the acquisition module is used for acquiring a neural network model, wherein the neural network model carries information indicating the memory size, and the information indicating the memory size is used for indicating the total memory space required by the neural network model for inference operation; the allocation module is used for allocating the total memory space with the size larger than or equal to the memory size in the memory; and the operation module is used for performing reasoning operation according to the neural network model based on the distributed total memory space.
In a possible implementation manner of the fourth aspect, the neural network model further includes: the memory offset of the running sequence and tensor of the operator layer is used for indicating the position of the memory block of the tensor in the total memory space, wherein the operator layer is a basic calculation unit of inference operation of the neural network model, and the tensor is input or output of the operator layer.
In a possible implementation manner of the fourth aspect, the allocating module is specifically configured to: determining, by a runtime system, a base address of the total memory space.
In a possible implementation manner of the fourth aspect, the apparatus further includes: a determining module, configured to determine, according to the base address of the total memory space and the memory offset of the tensor, a position of the memory block of the tensor in the total memory space.
In a possible implementation manner of the fourth aspect, the memory size is determined by a memory multiplexing algorithm.
In a possible implementation manner of the fourth aspect, if the neural network model includes a first neural network model and a second neural network model, the neural network model further includes: a running sequence of the first and second neural network models, the running sequence comprising serial or parallel.
In a possible implementation manner of the fourth aspect, if the running sequence is serial, the memory size of the total memory space is a first memory size, where the first memory size is greater than or equal to a second memory size, the first memory size is the size of a first memory space required by the first neural network model inference operation, and the second memory size is the size of a second memory space required by the second neural network model inference operation; and if the running sequence is parallel, the memory size of the total memory space is the sum of the first memory size and the second memory size.
In a possible implementation manner of the fourth aspect, the apparatus further includes: and the releasing module is used for releasing the total memory space once after the reasoning operation is finished.
A fifth aspect of the embodiments of the present application provides a model optimization apparatus, including: a memory configured to store instructions; and a processor configured to execute the instructions in the memory, so that the model optimization apparatus performs the method according to any one of the first aspect and its implementation manners.
A sixth aspect of the embodiments of the present application provides a model execution apparatus, including: a memory configured to store instructions; and a processor configured to execute the instructions in the memory, so that the model execution apparatus performs the method according to any one of the second aspect and its implementation manners.
A seventh aspect of the embodiments of the present application provides a computer system, including the model optimization apparatus according to any one of the foregoing third aspect and its implementation manners, and the model execution apparatus according to any one of the foregoing fourth aspect and its implementation manners.
An eighth aspect of embodiments of the present application provides a computer program product, which is characterized by including instructions that, when executed on a computer, cause the computer to execute the method according to any one of the first aspect or the second aspect and the implementation manners.
A ninth aspect of embodiments of the present application provides a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform a method according to any one of the first or second aspects and implementations described above.
According to the technical scheme, the embodiment of the application has the following advantages:
According to the method provided in the embodiments of the present application, the model optimization apparatus can perform pre-calculation according to the neural network model to obtain the memory size of the memory space required by the neural network model for inference operation, and carry the memory size information in the updated target neural network model, so that memory multiplexing calculation is not needed after the model is loaded on the model execution apparatus, which shortens the network model loading time of the model execution apparatus. The model execution apparatus can allocate, according to the information about the memory size carried in the obtained network model file, the memory space required by the network model to execute inference operation, so that the model execution apparatus can directly implement one-time memory allocation according to the network model file, which reduces the time spent on memory multiplexing calculation when loading the model.
Drawings
FIG. 1 is a schematic diagram of memory requirements during operation of neural network inference;
FIG. 2 is a diagram of Runtime dynamically allocating and releasing memory during neural network inference operation;
FIG. 3 is a diagram illustrating the maximum memory required by the neural network model being allocated at one time before inference;
FIG. 4 is a diagram illustrating the relationship between the memory management scheme and the memory allocation time and memory allocation amount;
FIG. 5 is a diagram illustrating memory allocation on an electronic device;
FIG. 6 is a diagram illustrating the specification and number of memory blocks to be allocated determined by a memory multiplexing algorithm;
FIG. 7 is a system scenario diagram of a data processing method in an embodiment of the present application;
FIG. 8 is a schematic diagram of an embodiment of a data processing method in an embodiment of the present application;
FIG. 9 is a schematic diagram of another embodiment of a data processing method in the embodiment of the present application;
FIG. 10 is a schematic diagram of an embodiment of a data processing method in the embodiment of the present application;
FIG. 11 is a schematic diagram of another embodiment of a data processing method in the embodiment of the present application;
FIG. 12 is a schematic diagram of an embodiment of a model optimization apparatus in an embodiment of the present application;
FIG. 13 is a schematic diagram of an embodiment of a model execution apparatus in an embodiment of the present application;
FIG. 14 is a schematic diagram of an embodiment of a model optimization apparatus in an embodiment of the present application;
fig. 15 is a schematic diagram of an embodiment of a model execution apparatus in an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a data processing method, which is used to reduce memory waste during the running of a neural network model, reduce the time spent on memory allocation and release, and reduce the time required to load the neural network model.
A neural network model refers to the program and data that are obtained by training with a large amount of labeled data and are used to perform cognitive computation. The neural network model includes a neural network architecture component and a neural network parameter component. The neural network architecture component refers to the network related to the neural network algorithm in the neural network model and its hierarchical structure, that is, the program that executes cognitive computation in the neural network model. The neural network model can be used to perform inference operation, and the inference process converts input data into output through multiple operator layers.
A tensor is a generalization of a vector. If a vector is regarded as a one-dimensional "table" (i.e., the components are arranged in a row in order) and a matrix as a two-dimensional "table" (i.e., the components are arranged in rows and columns), then an n-order tensor is an n-dimensional "table", where n is an integer greater than or equal to 1. In the embodiments of the present application, for convenience of description, a tensor that is an independent variable of an operator layer is referred to as an input tensor, and a tensor that is a dependent variable is referred to as an output tensor.
Please refer to fig. 1, which is a schematic diagram of memory requirements during the operation of neural network inference.
The neural network model inference process converts input data into output through multiple operator layers, where the input tensor of one operator layer is the output tensor of the previous operator layer, and the output tensor of the operator layer is the input tensor of the next operator layer. When an operator layer is running, the memory required for its input tensor and output tensor needs to be provided. The memory sizes required for the input tensor and the output tensor may also differ between operator layers.
Please refer to fig. 2, which is a schematic diagram of the runtime dynamically allocating and releasing memory during neural network inference operation.
One simple implementation of providing memory for the operator layers is to dynamically allocate (malloc) and release (free) memory for the currently running operator layer through the runtime during neural network inference, so that the peak memory usage can be kept to a minimum. However, runtime allocation and release cause expensive memory allocation time consumption at run time, making the neural network inference process excessively time-consuming.
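For illustration only, the following minimal C++ sketch (with hypothetical layer and tensor types, not the runtime described in this application) shows this per-layer allocation pattern: malloc and free are called once per operator layer, so the runtime system is invoked on every layer.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical operator-layer descriptor: only the fields needed for the sketch.
struct OperatorLayer {
    size_t output_bytes;                       // memory required by the output tensor
    void (*run)(const void* in, void* out);    // kernel that computes the layer
};

// Naive inference loop: one malloc and one free per layer, so the runtime
// system is touched on every layer, which is the overhead discussed above.
void run_inference_naive(const std::vector<OperatorLayer>& layers, void* input) {
    void* current = input;
    for (const OperatorLayer& layer : layers) {
        void* output = std::malloc(layer.output_bytes);   // dynamic allocation at run time
        layer.run(current, output);
        if (current != input) std::free(current);          // release the previous layer's output
        current = output;
    }
    if (current != input) std::free(current);              // sketch only: the final output is also freed
}
```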
Please refer to fig. 3, which is a schematic diagram of maximum memory required for one-time allocation before inference by the neural network model.
In order to reduce the time of memory allocation and release consumed by interaction with Runtime, the maximum memory required can be allocated once before the neural network model inference, and the maximum memory is the sum of the memories required by the output tensors of all the operator layers, so that the neural network model inference process does not interact with Runtime any more.
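As a sketch of this one-shot scheme (a hypothetical helper, not the scheme ultimately adopted by this application), the maximum memory is simply the sum of the output-tensor sizes over all operator layers:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Maximum memory for the one-shot scheme: the sum of the memory required by the output
// tensors of all operator layers, so no runtime interaction is needed during inference.
size_t max_memory_required(const std::vector<size_t>& output_tensor_bytes) {
    return std::accumulate(output_tensor_bytes.begin(), output_tensor_bytes.end(), size_t{0});
}
```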
TABLE 1
Network        Ratio of the sum of tensor memory to the minimum memory requirement
MobileNet      4.17
MobileNetV2    3.47
DeeplabV3      6.75
As shown in Table 1, MobileNet and MobileNetV2 are two lightweight image classification networks, and DeeplabV3 is a semantic segmentation network. In the inference processes of these three typical neural network models, the ratio of the sum of tensor memory to the minimum memory requirement reflects the degree of memory waste. It can be seen that this memory allocation method causes a great waste of memory. Please refer to fig. 4, which is a schematic diagram of the relationship between the memory management scheme and the memory allocation time and memory allocation amount.
To solve the problems that dynamic allocation and release through the runtime during neural network model inference take too long, and that pre-allocating memory for all tensors at one time consumes too much memory in total, the embodiments of the present application aim at a scheme that reduces both the time spent on runtime allocation and release and the total amount of memory allocated.
Please refer to fig. 5, which is a diagram illustrating memory allocation on an electronic device.
By deploying a memory allocation and multiplexing pre-computation algorithm on the electronic device, before the neural network model is run, the neural network inference process is simulated by the algorithm, the memory blocks that can be shared during tensor memory allocation are analyzed, and the specification and number of memory blocks to be allocated are pre-calculated; finally, the runtime is called to allocate all the required memory blocks at one time before the neural network model is run, thereby constructing a memory block pool. In this way, memory allocation can be completed at one time according to the pre-calculated result, which reduces the time spent interacting with the runtime.
It should be noted that there are various pre-calculation algorithms for implementing memory allocation and multiplexing in the neural network model inference process, for example, a greedy algorithm or a minimum-cost flow (MCFP) algorithm. The embodiments of the present application do not limit the specific algorithm type; hereinafter, for convenience of description, such an algorithm is simply referred to as a memory multiplexing algorithm.
Please refer to fig. 6, which is a schematic diagram illustrating the specification and number of memory blocks to be allocated as determined by a memory multiplexing algorithm.
after the memory block with the memory requirement 32B is released by the first computation layer in the figure, the memory block may be allocated to the output tensor of the second computation layer again for use, so that part of the memory may be saved by multiplexing the memory block. Simulating the neural network reasoning process, the specification and the number of the memory blocks to be allocated can be pre-calculated, namely two memory blocks, the sizes of the memory blocks are respectively 32B and 64B, and two memory blocks are allocated by runtime at one time.
In the prior art, before the neural network model is run, the memory required for running the neural network model is pre-calculated on the electronic device through a memory multiplexing algorithm, which requires a large amount of pre-calculation time and makes the model loading time too long.
Please refer to fig. 7, which is a system scenario diagram of a data processing method in an embodiment of the present application.
The devices for implementing the data processing method include a model optimization apparatus and a model execution apparatus. A system scenario of the data processing method is described below with reference to fig. 7.
The trained network model is input to the model optimization apparatus. The model optimization apparatus can parse the model file to obtain a dataflow graph, obtain, according to the memory multiplexing algorithm, the total memory size required by the network model at run time, and write the total memory size into the model file to generate an optimized network model. As shown in the figure, the model optimization apparatus sends the optimized model file to the model execution apparatus, which is used to perform the inference computation of the network model. After the model is loaded, the model execution apparatus performs one memory allocation according to the total memory size carried in the model file to obtain a memory space of that total memory size, calculates the memory addresses of the tensors of the operators, inputs the data on which inference operation is to be performed based on the network model, calls the operators according to the memory addresses of the tensors, and finally outputs the inference result.
For example, a network model for image recognition is obtained after training on a large number of picture data sets. The model optimization apparatus can optimize the image recognition network model, calculate the total memory size required for inference operation of the image recognition network model, write the total memory size into the model to obtain the optimized image recognition network model, and send it to the model execution apparatus. The model execution apparatus performs the inference operation, that is, recognizes an image input by the user, for example, recognizes whether the content of the image is a cat or a dog. Memory allocation can be performed according to the total memory size in the model file before the model execution apparatus performs the inference operation, which reduces the loading time of the model execution apparatus and the time consumed by memory allocation.
In a possible implementation scenario, the model optimization device is a server, the model execution device is a terminal, the server includes a cloud server or a computing server, and the like, and the terminal may be an intelligent device in various forms.
In another possible implementation scenario, the model optimization device is a fat terminal and the model execution device is a thin terminal. The fat terminal may be, for example, a mobile phone, a desktop computer, a tablet computer, a portable computer, etc., and the specific device type of the fat terminal is not limited herein. The thin terminal may be, for example, an internet of things (IOT) terminal device, such as a network camera, an intelligent voice assistant, an intelligent headset, or an intelligent wearable device, where a specific device type of the thin terminal is not limited. It can be understood that the storage capacity and the computing capacity of the fat terminal are superior to those of the thin terminal, and the data processing method provided by the embodiment of the application realizes the process of network model optimization by the fat terminal so as to save the computing resources of the thin terminal and reduce the loading duration of the thin terminal model.
Referring to fig. 8 and fig. 10, the data processing method implemented by the model optimization apparatus and the model execution apparatus is specifically described:
please refer to fig. 8, which is a diagram illustrating an embodiment of a data processing method according to an embodiment of the present application.
801. Acquiring a neural network model;
the obtained neural network model may be one neural network model or a plurality of neural network models, and is not limited herein. The neural network model may be pre-deployed by a user, and is usually stored in a form of a neural network inference model file, where the storage type may be, for example, an open neural network exchange (ONNX) format or a convolutional structure for fast feature embedding (CAFFE) format.
802. Determining the memory block size of the tensor of the operator layer and the running sequence of the operator layer;
The memory block size of the tensor and the running sequence of the operator layer are determined according to the acquired neural network model.
The neural network model can be parsed by a model file parser and converted into a dataflow graph. Please refer to fig. 9, which is a schematic diagram of another embodiment of the data processing method in the embodiments of the present application. The first step in fig. 9 converts the model file into a dataflow graph representation: the nodes in the dataflow graph represent the operator layers of neural network model inference, and the connecting lines between the nodes are edges, representing the tensors of the operator layers. As can be seen from the figure, the output tensor of one operator layer is the input tensor of the next operator layer.
In order to determine the memory block size of the tensor, Data memory information (Data Byte) and tensor Shape information (Shape) of the tensor need to be acquired first, wherein the tensor Shape information comprises the number, height, width and channel number of the tensor.
The formula for calculating the memory block size of a tensor is as follows: mem_size = Batch × Height × Width × Channel × Data Byte
where mem_size represents the memory size of the memory block, Batch represents the number of tensors, Height represents the height of the tensor, Width represents the width of the tensor, Channel represents the number of channels of the tensor, and Data Byte represents the memory size required by a unit of data of the tensor.
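For instance, the formula can be evaluated directly from the tensor shape information (a sketch; the field names are illustrative, not from this application):

```cpp
#include <cstddef>

// Tensor shape information as described above: number, height, width, channels,
// plus the byte size of one data element (Data Byte).
struct TensorShape {
    size_t batch, height, width, channel, data_byte;
};

// mem_size = Batch * Height * Width * Channel * Data Byte
size_t tensor_mem_size(const TensorShape& s) {
    return s.batch * s.height * s.width * s.channel * s.data_byte;
}
```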
The running sequence of the operator layers can be determined by topological sorting; the sorting algorithm is not limited. In addition, the running order of the operators can be indicated by the index of the operator layer.
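A minimal topological-sort sketch over the dataflow graph is given below (Kahn's algorithm, assuming a simple adjacency-list representation; as stated above, the sorting algorithm is not restricted by this application). It yields one admissible running sequence, i.e., the index of each operator layer:

```cpp
#include <queue>
#include <vector>

// edges[i] lists the operator layers that consume an output tensor of layer i.
// Returns one admissible running order of the operator layers (their indices).
std::vector<int> topological_order(const std::vector<std::vector<int>>& edges) {
    const int n = static_cast<int>(edges.size());
    std::vector<int> in_degree(n, 0);
    for (const auto& outs : edges)
        for (int v : outs) ++in_degree[v];

    std::queue<int> ready;
    for (int i = 0; i < n; ++i)
        if (in_degree[i] == 0) ready.push(i);    // layers with no unmet inputs

    std::vector<int> order;
    while (!ready.empty()) {
        int u = ready.front(); ready.pop();
        order.push_back(u);                      // the position here is the layer's index
        for (int v : edges[u])
            if (--in_degree[v] == 0) ready.push(v);
    }
    return order;                                // size < n would indicate a cycle
}
```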
Optionally, if part of the operator layers are fused by the graph optimization technology, the running sequence of the operator layers can be determined after the operator layers are fused.
Optionally, if there are multiple neural network models, on one hand, each neural network model needs to be analyzed to determine the ordering of its operator layers. On the other hand, the running sequence among the multiple neural network models is determined according to a scheme provided by the user or preset in advance, where the running sequence includes executing the multiple neural network models in series or in parallel. For models executed in series, the running sequence of the operator layers of the multiple neural network models is determined; for models executed in parallel, there is no defined relation between the running sequences of their operator layers. The specific running sequence among the multiple neural network models is not limited here.
803. Determining the memory size of the memory space and the memory offset of the tensor according to a memory multiplexing algorithm;
After the memory block size of the tensor and the running sequence of the operator layer are determined, the memory size of the memory space and the memory offset of the tensor can be determined according to a preset memory multiplexing algorithm. The memory space is the memory required for running the neural network model, that is, the memory required by all the tensors.
It should be noted that the memory multiplexing algorithm may be one or more algorithms, and is not limited herein.
Specifically, the neural network model inference process is simulated, and the multiplexed memory blocks are spliced into a memory space according to the memory block size of the tensor, the running sequence of the operator layer, and the preset memory multiplexing algorithm, where the position of each memory block is represented by an offset. The memory block offset of each tensor is recorded in the information of the edges in the dataflow graph. For example, as shown in fig. 8, the memory size of the memory space is 192 bits (bit, b), and according to the running sequence of the operator layers, the offsets of the memory blocks required by the tensors are 0 b, 64 b, 96 b, 0 b, 128 b, and 64 b, respectively.
Optionally, if there are multiple neural network models, on one hand, the inference process of each neural network model needs to be simulated, and the memory size of the total memory space required for running the inference process of that neural network model and the memory offset of the output tensor of each operator layer are determined through a preset memory multiplexing algorithm.
On the other hand, the running sequence among the multiple neural network models is determined according to a scheme provided by the user or preset in advance, where the running sequence includes executing the multiple neural network models in series or in parallel. For multiple neural network models executed in series, the largest of the total memory spaces is taken as the memory size required for the inference processes of the multiple neural network models; for multiple neural network models executed in parallel, the sum of the memory sizes of the total memory spaces is taken as the memory size required for running the inference processes of the multiple neural network models.
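The rule for combining the per-model memory sizes can be stated compactly (a sketch; the running-order flag is an assumption about how the user-provided scheme might be encoded):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

enum class RunningOrder { Serial, Parallel };

// Serial execution: the models can share one memory space, so the largest per-model
// size suffices. Parallel execution: each model needs its own space, so the sizes add up.
size_t combined_memory_size(const std::vector<size_t>& per_model_sizes, RunningOrder order) {
    if (order == RunningOrder::Serial)
        return *std::max_element(per_model_sizes.begin(), per_model_sizes.end());
    return std::accumulate(per_model_sizes.begin(), per_model_sizes.end(), size_t{0});
}
```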
804. Writing the memory size of a memory space, the operation sequence of an operator layer and the memory offset of a tensor into the neural network model;
the memory size of the memory space determined in step 803 and the memory offset of each tensor are recorded in an offset file of the model, and meanwhile, the running sequence of the operator layer determined in step 802 is also recorded in the neural network model.
The optimized dataflow graph is rewritten into a neural network model file, where the neural network model file includes the running sequence of the operator layers, the memory offsets of the tensors, and the memory size of the memory space. Optionally, the file further includes the operator layers replaced after fusion.
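As an illustration of step 804 (purely hypothetical structures; the actual model file format and the offset file mentioned above are not specified here), the optimized model can carry the memory size, the operator-layer running sequence, and the per-tensor memory offsets as additional fields:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Metadata added to the target neural network model by the optimization apparatus.
struct ModelMemoryMeta {
    uint64_t total_memory_size;          // memory size of the memory space (e.g. 192 b in the example above)
    std::vector<uint32_t> layer_order;   // running sequence (indices) of the operator layers
    std::vector<uint64_t> tensor_offset; // memory offset of each tensor inside the memory space
};

// Hypothetical model container: the optimizer simply attaches the metadata so that
// the execution apparatus can read it back without any pre-computation of its own.
struct OptimizedModel {
    std::vector<uint8_t> original_model_bytes;  // the trained model file obtained in step 801
    ModelMemoryMeta memory_meta;                // written in step 804
};

OptimizedModel write_memory_meta(std::vector<uint8_t> model_bytes, ModelMemoryMeta meta) {
    return OptimizedModel{std::move(model_bytes), std::move(meta)};
}
```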
According to the method provided in the embodiments of the present application, the memory size of the memory space required by the network model to execute inference operation is added to the network model file, so that the device executing the network model can directly implement one-time memory allocation according to the network model file, which reduces the time spent on memory multiplexing calculation when loading the model.
Please refer to fig. 10, which is a schematic diagram of an embodiment of a data processing method in an embodiment of the present application (online side).
1001. Acquiring a neural network model;
the terminal may obtain the neural network model from a cloud or a computing device such as a computing server, where the neural network model may be one neural network model or multiple neural network models, and is not limited herein. The neural network model is usually stored in the form of a neural network inference model file, and the storage type can be an ONNX format or a CAFFE format and the like.
1002. Parsing the neural network model to obtain the memory size of the memory space, the running sequence of the operator layer, and the memory offset of the tensor;
Please refer to fig. 11, which is a schematic diagram of another embodiment of the data processing method in the embodiments of the present application. The neural network model is converted by a parser into an executable dataflow graph. The running sequence of the operator layers, the memory offset of each tensor, and the memory size of the memory space for running the neural network model are obtained. The memory offset of a tensor is used to indicate the offset position of the memory block of the tensor in the memory space.
1003. Acquiring the base address of the memory space through one memory allocation according to the memory size of the memory space;
According to the memory size of the memory space obtained in step 1002, the base address of the memory space is obtained through one memory allocation; the base address of the memory space may be referred to as the memory base address for short.
1004. Acquiring the memory block address of the tensor according to the memory space base address and the memory offset of the tensor;
the operator layer outputs the memory address of the tensor, which is the memory space base address plus the memory offset (@ base) of the tensor. According to the memory space base address and the memory offset of the output tensor of each computation layer, the memory address of the output tensor of each computation layer, namely @ base + offset, can be obtained, for example, the memory offset of the tensor of the input node 2 is 96b, and the base address of the memory space is @ base, so that the memory block address of the tensor can be determined to be @ base +96 b.
1005. Running a neural network model according to the running sequence of the operator layer;
1006. Releasing the memory at one time;
When the model finishes running, the memory space is released at one time.
Please refer to fig. 12, which is a schematic diagram of an embodiment of a model optimization apparatus according to an embodiment of the present application;
the model optimization device comprises:
an obtaining module 1201, configured to obtain a neural network model;
a determining module 1202, configured to determine a memory size of a memory space required for performing inference operation based on the neural network model;
an updating module 1203, configured to update the neural network model to obtain a target neural network model, where the target neural network model carries information indicating the size of the memory.
Optionally, the determining module 1202 is specifically configured to: determining the operation sequence of an operator layer and the size of a memory block of a tensor, wherein the operator layer is a basic calculation unit for performing inference operation based on the neural network model, and the tensor is input or output of the operator layer; and determining the size of the memory and the position of the memory block in the memory space through a preset memory multiplexing algorithm according to the size of the memory block and the operation sequence, wherein the position is used for indicating the memory block.
Optionally, the target neural network model carries a running sequence of the operator layer and a memory offset of the tensor, where the memory offset is used to indicate a position of the memory block in the memory space.
Optionally, if the neural network model includes a first neural network model and a second neural network model, the determining module 1202 is specifically configured to: determining a first memory size of a first memory space required by the first neural network model inference operation and a second memory size of a second memory space required by the second neural network model inference operation; and determining the memory size of the memory space according to the first memory size and the second memory size.
Optionally, the determining module 1202 is further configured to: determining a running order of the first neural network model and the second neural network model, the running order comprising serial or parallel; if the running sequence is serial, the memory size of the memory space is the first memory size, and the first memory size is larger than or equal to the second memory size; and if the running sequence is parallel, the memory size of the memory space is the sum of the first memory size and the second memory size.
Please refer to fig. 13, which is a schematic diagram of an embodiment of a model execution apparatus in an embodiment of the present application;
the model execution device comprises:
an obtaining module 1301, configured to obtain a neural network model, where the neural network model carries information indicating the memory size, and the information indicating the memory size is used to indicate a total memory space required by inference operation of the neural network model;
an allocating module 1302, configured to allocate, in a memory, a total memory space with a size greater than or equal to a size of the memory;
and the operation module 1303 is configured to perform inference operation according to the neural network model based on the allocated total memory space.
Optionally, the neural network model further includes: the memory offset of the running sequence and tensor of the operator layer is used for indicating the position of the memory block of the tensor in the total memory space, wherein the operator layer is a basic calculation unit of inference operation of the neural network model, and the tensor is input or output of the operator layer.
Optionally, the allocating module 1302 is specifically configured to: determining, by a runtime system, a base address of the total memory space.
Optionally, the apparatus further comprises: a determining module 1304, configured to determine, according to the base address of the total memory space and the memory offset of the tensor, a position of the memory block of the tensor in the total memory space.
Optionally, the memory size is determined by a memory multiplexing algorithm.
Optionally, if the neural network model includes a first neural network model and a second neural network model, the neural network model further includes: a running sequence of the first and second neural network models, the running sequence comprising serial or parallel.
Optionally, if the running sequence is serial, the memory size of the total memory space is a first memory size, where the first memory size is greater than or equal to a second memory size, the first memory size is the size of the first memory space required by the first neural network model inference operation, and the second memory size is the size of the second memory space required by the second neural network model inference operation; and if the running sequence is parallel, the memory size of the total memory space is the sum of the first memory size and the second memory size.
Optionally, the apparatus further comprises:
a releasing module 1305, configured to release the total memory space once after the inference operation is finished.
Please refer to fig. 14, which is a schematic diagram of an embodiment of a model optimization apparatus according to an embodiment of the present application.
The model optimization device provided in this embodiment may be a server or a terminal, and the specific device form of the model optimization device is not limited in this embodiment.
The model optimization device 1400 may vary significantly depending on configuration or performance, and may include one or more processors 1401 and a memory 1402, where the memory 1402 stores programs or data.
The memory 1402 may be a volatile memory or a non-volatile memory. Optionally, the processor 1401 is one or more central processing units (CPUs), or may be a dedicated processor such as a graphics processing unit (GPU) or a neural network processing unit (NPU), or may be a system on a chip (SoC) integrating one or more CPUs, one or more GPUs, and one or more NPUs, where a CPU may be a single-core CPU or a multi-core CPU. The processor 1401 may communicate with the memory 1402 and execute a series of instructions in the memory 1402 on the model optimization device 1400.
The model optimization device 1400 also includes one or more wired or wireless network interfaces 1403, such as an ethernet interface.
Optionally, although not shown in FIG. 14, the model optimization device 1400 may also include one or more power supplies and one or more input/output interfaces; the input/output interface may be used to connect a display, a mouse, a keyboard, a touch screen device, a sensing device, or the like. The input/output interface is an optional component and may or may not be present, which is not limited here.
The process executed by the processor 1401 in the model optimization apparatus 1400 in this embodiment may refer to the method process described in the foregoing method embodiment, which is not described herein again.
Please refer to fig. 15, which is a diagram illustrating an embodiment of a model executing apparatus according to an embodiment of the present application.
The model execution device provided in this embodiment may be a terminal device such as an IOT, and a specific device form of the model execution device is not limited in this embodiment.
The model execution apparatus 1500, which may have a large difference due to different configurations or performances, may include one or more processors 1501 and a memory 1502, and the memory 1502 stores programs or data therein.
The memory 1502 may be a volatile memory or a non-volatile memory. Optionally, the processor 1501 is one or more central processing units (CPUs), or may be a dedicated processor such as a graphics processing unit (GPU) or a neural network processing unit (NPU), or may be a system on a chip (SoC) integrating one or more CPUs, one or more GPUs, and one or more NPUs, where a CPU may be a single-core CPU or a multi-core CPU. The processor 1501 may communicate with the memory 1502 and execute a series of instructions in the memory 1502 on the model execution apparatus 1500.
The model execution apparatus 1500 also includes one or more wired or wireless network interfaces 1503, such as an ethernet interface.
Optionally, although not shown in FIG. 15, the model execution apparatus 1500 may also include one or more power supplies and one or more input/output interfaces; the input/output interface may be used to connect a display, a mouse, a keyboard, a touch screen device, a sensing device, or the like. The input/output interface is an optional component and may or may not be present, which is not limited here.
The process executed by the processor 1501 in the model execution apparatus 1500 in this embodiment may refer to the method process described in the foregoing method embodiments, which is not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (31)

1. A data processing method, comprising:
acquiring a neural network model;
determining the memory size of the memory space required for reasoning operation based on the neural network model;
and updating the neural network model to obtain a target neural network model, wherein the target neural network model carries information indicating the size of the memory.
2. The method of claim 1, wherein said determining a memory size of a memory space required for said neural network model inference operation comprises:
determining the operation sequence of an operator layer and the size of a memory block of a tensor, wherein the operator layer is a basic calculation unit for performing inference operation based on the neural network model, and the tensor is input or output of the operator layer;
and determining the size of the memory and the position of the memory block in the memory space through a preset memory multiplexing algorithm according to the size of the memory block and the operation sequence, wherein the position is used for indicating the memory block.
3. The method of claim 2,
the target neural network model carries the running sequence of the operator layer and the memory offset of the tensor, and the memory offset is used for indicating the position of the memory block in the memory space.
4. The method according to any one of claims 1 to 3, wherein if the neural network model comprises a first neural network model and a second neural network model, the determining the memory size of the memory space required for the neural network model inference operation comprises:
determining a first memory size of a first memory space required by inference operation of the first neural network model and a second memory size of a second memory space required by inference operation of the second neural network model;
and determining the memory size of the memory space according to the first memory size and the second memory size.
5. The method of claim 4, further comprising:
determining a running order of the first neural network model and the second neural network model, the running order comprising serial or parallel;
if the running sequence is serial, the memory size of the memory space is the first memory size, and the first memory size is larger than or equal to the second memory size;
and if the running sequence is parallel, the memory size of the memory space is the sum of the first memory size and the second memory size.
6. A data processing method, comprising:
acquiring a neural network model, wherein the neural network model carries information indicating a memory size, and the information indicating the memory size is used for indicating the size of the total memory space required for inference operation of the neural network model;
allocating, in a memory, a total memory space with a size larger than or equal to the memory size;
and performing inference operation according to the neural network model based on the allocated total memory space.
7. The method of claim 6, wherein the neural network model further comprises:
a running order of operator layers and a memory offset of a tensor, wherein the memory offset is used for indicating the position of the memory block of the tensor in the total memory space, the operator layer is a basic calculation unit of inference operation of the neural network model, and the tensor is input or output of the operator layer.
8. The method of claim 6 or 7, wherein the allocating, in the memory, a total memory space with a size larger than or equal to the memory size comprises:
determining, by a runtime system, a base address of the total memory space.
9. The method of claim 8, further comprising:
and determining the position of the memory block of the tensor in the total memory space according to the base address of the total memory space and the memory offset of the tensor.
10. The method according to any of claims 6 to 9, wherein the memory size is determined by a memory multiplexing algorithm.
11. The method according to any one of claims 6 to 10, wherein if the neural network model comprises a first neural network model and a second neural network model, the neural network model further comprises:
a running sequence of the first and second neural network models, the running sequence comprising serial or parallel.
12. The method of claim 11,
if the running sequence is serial, the memory size of the total memory space is a first memory size, the first memory size is larger than or equal to a second memory size, the first memory size is the size of a first memory space required by inference operation of the first neural network model, and the second memory size is the size of a second memory space required by inference operation of the second neural network model;
and if the running sequence is parallel, the memory size of the total memory space is the sum of the first memory size and the second memory size.
13. The method according to any one of claims 6 to 12, further comprising:
and releasing the total memory space once after the inference operation is finished.
14. A model optimization apparatus, comprising:
the acquisition module is used for acquiring a neural network model;
the determining module is used for determining the memory size of the memory space required by the inference operation based on the neural network model;
and the updating module is used for updating the neural network model to obtain a target neural network model, and the target neural network model carries information indicating the size of the memory.
15. The apparatus of claim 14, wherein the determining module is specifically configured to:
determining the operation sequence of an operator layer and the size of a memory block of a tensor, wherein the operator layer is a basic calculation unit for performing inference operation based on the neural network model, and the tensor is input or output of the operator layer;
and determining the size of the memory and the position of the memory block in the memory space through a preset memory multiplexing algorithm according to the size of the memory block and the operation sequence, wherein the position is used for indicating the memory block.
16. The apparatus according to claim 15, wherein the target neural network model carries therein a running order of the operator layers and a memory offset of a tensor, the memory offset indicating a location of the memory chunk in the memory space.
17. The apparatus according to any one of claims 14 to 16, wherein if the neural network model comprises a first neural network model and a second neural network model, the determining module is specifically configured to:
determining a first memory size of a first memory space required by inference operation of the first neural network model and a second memory size of a second memory space required by inference operation of the second neural network model;
and determining the memory size of the memory space according to the first memory size and the second memory size.
18. The apparatus of claim 17, wherein the determining module is further configured to:
determining a running order of the first neural network model and the second neural network model, the running order comprising serial or parallel;
if the running sequence is serial, the memory size of the memory space is the first memory size, and the first memory size is larger than or equal to the second memory size;
and if the running sequence is parallel, the memory size of the memory space is the sum of the first memory size and the second memory size.
19. A model execution apparatus, comprising:
the acquisition module is used for acquiring a neural network model, wherein the neural network model carries information indicating a memory size, and the information indicating the memory size is used for indicating the size of the total memory space required for inference operation of the neural network model;
the allocation module is used for allocating, in a memory, a total memory space with a size larger than or equal to the memory size;
and the operation module is used for performing inference operation according to the neural network model based on the allocated total memory space.
20. The apparatus of claim 19, wherein the neural network model further comprises: a running order of operator layers and a memory offset of a tensor, wherein the memory offset is used for indicating the position of the memory block of the tensor in the total memory space, the operator layer is a basic calculation unit of inference operation of the neural network model, and the tensor is input or output of the operator layer.
21. The apparatus according to claim 19 or 20, wherein the allocation module is specifically configured to:
determining, by a runtime system, a base address of the total memory space.
22. The apparatus of claim 21, further comprising:
a determining module, configured to determine, according to the base address of the total memory space and the memory offset of the tensor, a position of the memory block of the tensor in the total memory space.
23. The apparatus according to any of claims 19 to 22, wherein the memory size is determined by a memory multiplexing algorithm.
24. The apparatus of any one of claims 19 to 23, wherein if the neural network model comprises a first neural network model and a second neural network model, the neural network model further comprises: a running sequence of the first and second neural network models, the running sequence comprising serial or parallel.
25. The apparatus of claim 24, wherein if the running sequence is serial, the memory size of the total memory space is a first memory size, the first memory size is larger than or equal to a second memory size, the first memory size is the size of a first memory space required by inference operation of the first neural network model, and the second memory size is the size of a second memory space required by inference operation of the second neural network model;
and if the running sequence is parallel, the memory size of the total memory space is the sum of the first memory size and the second memory size.
26. The apparatus of any one of claims 19 to 25, further comprising:
and the releasing module is used for releasing the total memory space once after the inference operation is finished.
27. A model optimization apparatus, comprising:
a memory to store instructions;
a processor for executing instructions in the memory, causing the model optimization apparatus to perform the method of any one of claims 1 to 5.
28. A model execution apparatus, comprising:
a memory to store instructions;
a processor for executing instructions in the memory, causing the model execution apparatus to perform the method of any one of claims 6 to 13.
29. A computer system comprising the model optimization apparatus of any one of claims 14 to 18 and the model execution apparatus of any one of claims 19 to 26.
30. A computer program product, characterized in that it comprises instructions which, when run on a computer, cause the computer to carry out the method of any one of claims 1 to 13.
31. A computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 13.
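By way of illustration of claims 1 to 3, the sketch below derives, from the running order of the operator layers and the sizes of the tensors' memory blocks, a total memory size and a per-tensor memory offset. It assumes a greedy, lifetime-based first-fit strategy; the claims only require "a preset memory multiplexing algorithm" without fixing a particular one, and the TensorInfo structure and function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TensorInfo:
    name: str
    size: int        # size of the tensor's memory block, in bytes
    first_use: int   # index, in the running order, of the operator layer that produces it
    last_use: int    # index of the last operator layer that reads it

def plan_memory(tensors):
    """Return (total memory size, {tensor name: memory offset}).

    Tensors whose lifetimes [first_use, last_use] do not overlap may be
    placed at overlapping offsets, so the total is usually far smaller
    than the sum of all tensor sizes.
    """
    placed = []      # (offset, size, first_use, last_use) of blocks already assigned
    offsets = {}
    total = 0
    for t in sorted(tensors, key=lambda x: (-x.size, x.first_use)):
        offset = 0
        for b_off, b_size, b_first, b_last in sorted(placed):
            lifetimes_overlap = not (t.last_use < b_first or b_last < t.first_use)
            addresses_overlap = b_off < offset + t.size and offset < b_off + b_size
            if lifetimes_overlap and addresses_overlap:
                offset = b_off + b_size          # bump past the conflicting block
        offsets[t.name] = offset
        placed.append((offset, t.size, t.first_use, t.last_use))
        total = max(total, offset + t.size)
    return total, offsets
```

The returned total and offsets correspond to the information that claim 3 has the target neural network model carry: the memory size of the required memory space and, for each tensor, the position of its memory block expressed as a memory offset.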
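For the two-sub-model case of claims 4, 5, 11 and 12, the combined size reduces to a maximum for serial execution (the single block is reused by the second model) and to a sum for parallel execution (both models need their memory at the same time). A minimal sketch of that rule; the function name and string values are assumptions:

```python
def combined_memory_size(first_size: int, second_size: int, run_order: str) -> int:
    """Memory size of the total memory space for two sub-models."""
    if run_order == "serial":
        # The same block serves both models one after the other,
        # so the larger of the two requirements is enough.
        return max(first_size, second_size)
    if run_order == "parallel":
        # Both models are live at once, so their requirements add up.
        return first_size + second_size
    raise ValueError(f"unknown run order: {run_order!r}")
```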
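On the execution side (claims 6 to 13 and 19 to 26), the device allocates one block of at least the carried memory size, resolves each tensor's position as the base address of that block plus the tensor's carried memory offset, runs the operator layers in the carried order, and releases the block once after inference finishes. The sketch below assumes a hypothetical model dictionary and an execute_layer callback; it is not a real runtime API.

```python
def run_inference(model, inputs, execute_layer):
    """Allocate once, run all operator layers, release once."""
    # 'memory_size', 'tensors' and 'run_order' are assumed fields carried by the model.
    base = bytearray(model["memory_size"])           # total memory space; its start is the base address
    views = {
        name: memoryview(base)[off:off + size]       # position = base address + memory offset
        for name, (off, size) in model["tensors"].items()
    }
    for name, data in inputs.items():                # copy the network inputs into their blocks
        views[name][:len(data)] = data
    results = [execute_layer(layer, views) for layer in model["run_order"]]
    del views                                        # the total memory space is released once,
    del base                                         # after the inference operation is finished
    return results
```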
CN201910883288.4A 2019-09-18 2019-09-18 Data processing method, model optimization device and model execution device Pending CN112529169A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910883288.4A CN112529169A (en) 2019-09-18 2019-09-18 Data processing method, model optimization device and model execution device
PCT/CN2020/116183 WO2021052460A1 (en) 2019-09-18 2020-09-18 Data processing method, model optimization device, and model execution device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910883288.4A CN112529169A (en) 2019-09-18 2019-09-18 Data processing method, model optimization device and model execution device

Publications (1)

Publication Number Publication Date
CN112529169A true CN112529169A (en) 2021-03-19

Family

ID=74883918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910883288.4A Pending CN112529169A (en) 2019-09-18 2019-09-18 Data processing method, model optimization device and model execution device

Country Status (2)

Country Link
CN (1) CN112529169A (en)
WO (1) WO2021052460A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469328B (en) * 2021-06-24 2024-03-19 上海寒武纪信息科技有限公司 Device, board, method and readable storage medium for executing revolution passing
CN113778459A (en) * 2021-09-08 2021-12-10 北京航空航天大学杭州创新研究院 Operator library design method for deploying optimization on FPGA and DSP
CN115080240B (en) * 2022-06-29 2023-10-10 美的集团(上海)有限公司 Voice processing model deployment method, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577480B (en) * 2012-08-07 2017-05-31 中国银联股份有限公司 A kind of parameter division system and its method, a kind of transaction processing system and its method
CN109407997B (en) * 2018-11-09 2021-04-23 长沙理工大学 Data processing method, device and equipment and readable storage medium
CN109558248B (en) * 2018-12-11 2020-06-02 中国海洋大学 Method and system for determining resource allocation parameters for ocean mode calculation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109491784A (en) * 2018-10-18 2019-03-19 北京旷视科技有限公司 Reduce method, apparatus, the electronic equipment, readable storage medium storing program for executing of EMS memory occupation amount
CN109558942A (en) * 2018-11-20 2019-04-02 电子科技大学 A kind of neural network moving method based on either shallow study

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113168349A (en) * 2021-03-26 2021-07-23 珠海全志科技股份有限公司 Memory allocation method of AI processor, computer device and computer readable storage medium
WO2022198636A1 (en) * 2021-03-26 2022-09-29 珠海全志科技股份有限公司 Memory allocation method for ai processor, computer device, and computer-readable storage medium
CN113608881A (en) * 2021-10-09 2021-11-05 腾讯科技(深圳)有限公司 Memory allocation method, device, equipment, readable storage medium and program product
CN113608881B (en) * 2021-10-09 2022-02-25 腾讯科技(深圳)有限公司 Memory allocation method, device, equipment, readable storage medium and program product
WO2023134361A1 (en) * 2022-01-13 2023-07-20 哲库科技(上海)有限公司 Data processing method and apparatus, neural network accelerator, and storage medium
CN114118389A (en) * 2022-01-28 2022-03-01 深圳鲲云信息科技有限公司 Neural network data processing method, device and storage medium
CN114237918A (en) * 2022-02-28 2022-03-25 之江实验室 Graph execution method and device for neural network model calculation
CN114237918B (en) * 2022-02-28 2022-05-27 之江实验室 Graph execution method and device for neural network model calculation
US11941514B2 (en) 2022-02-28 2024-03-26 Zhejiang Lab Method for execution of computational graph in neural network model and apparatus thereof
CN116757284A (en) * 2022-09-26 2023-09-15 荣耀终端有限公司 Model reasoning method, device, storage medium and program product
CN116992966A (en) * 2023-09-28 2023-11-03 深圳鲲云信息科技有限公司 Method and computing device for artificial intelligence model reasoning platform
CN116992966B (en) * 2023-09-28 2024-01-16 深圳鲲云信息科技有限公司 Method and computing device for artificial intelligence model reasoning platform

Also Published As

Publication number Publication date
WO2021052460A1 (en) 2021-03-25

Similar Documents

Publication Publication Date Title
CN112529169A (en) Data processing method, model optimization device and model execution device
JP7087079B2 (en) Robust gradient weight compression scheme for deep learning applications
CN110689115B (en) Neural network model processing method and device, computer equipment and storage medium
CN110515739B (en) Deep learning neural network model load calculation method, device, equipment and medium
CN104008064B (en) The method and system compressed for multi-level store
TWI798618B (en) Memory allocation method, device, and electronic equipment
CN112084038B (en) Memory allocation method and device of neural network
CN111984400A (en) Memory allocation method and device of neural network
CN110852438A (en) Model generation method and device
CN112200297B (en) Neural network optimization method, device and processor
Pei et al. Iteration time prediction for cnn in multi-gpu platform: modeling and analysis
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN114862656A (en) Method for acquiring training cost of distributed deep learning model based on multiple GPUs
WO2021133422A1 (en) Flexible accelerator for sparse tensors (fast) in machine learning
Wang et al. An efficient image aesthetic analysis system using Hadoop
CN114118433A (en) Recommendation method and device for configuration parameters of equipment
KR20210124934A (en) Method and apparatus of training model, development system, electronic device, computer readable storage medium, and computer program
CN115586961A (en) AI platform computing resource task scheduling method, device and medium
US20220067495A1 (en) Intelligent processor, data processing method and storage medium
WO2020008392A2 (en) Predicting execution time of memory bandwidth intensive batch jobs
CN113408704A (en) Data processing method, device, equipment and computer readable storage medium
Risco-Martin et al. A parallel evolutionary algorithm to optimize dynamic memory managers in embedded systems
CN110209631A (en) Big data processing method and its processing system
CN113626035B (en) Neural network compiling method facing RISC-V equipment based on TVM
JP2023180315A (en) Conversion program and conversion processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination