WO2021052460A1 - Data processing method, model optimization device and model execution device - Google Patents

Data processing method, model optimization device and model execution device

Info

Publication number
WO2021052460A1
WO2021052460A1 (PCT/CN2020/116183)
Authority
WO
WIPO (PCT)
Prior art keywords
memory
neural network
network model
size
memory size
Prior art date
Application number
PCT/CN2020/116183
Other languages
English (en)
French (fr)
Inventor
张臻
鲍翀
袁鹏
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021052460A1 publication Critical patent/WO2021052460A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a data processing method, a model optimization device, and a model execution device.
  • Deep learning (DL) is an important technology in the field of artificial intelligence (AI).
  • DL can usually be divided into two processes: training and inference.
  • The training process takes a large amount of sample data as input and adopts unsupervised learning methods such as reinforcement learning to train a complex deep neural network model.
  • The inference process uses the trained model on new data to "reason" out various conclusions; for example, video surveillance equipment uses a trained deep neural network model to determine whether a captured face belongs to a blacklist.
  • Because the training process involves a huge amount of training data and a complex deep neural network structure, its computational load is enormous; it is usually completed by devices with powerful computing capabilities, such as servers or devices integrated with dedicated processors.
  • The inference process based on a neural network model converts input data into output through multiple operator layers; the input of an operator layer is the output of the previous operator layer, and its output is the input of the next operator layer.
  • The amount of computation in the inference process is lower than in the training process. Therefore, applications based on neural network inference are deployed on more and more smart devices, such as mobile phones, speakers, headphones, and IoT devices.
  • However, the layer-by-layer operation of neural network inference poses a major challenge to the memory usage of smart devices.
  • When an operator layer runs, the memory required by its input and output must be provided. Different operator layers may require different amounts of memory for their inputs and outputs, so the runtime system of the device (runtime for short) has to perform frequent dynamic memory allocation (malloc) and release (free) operations during inference, which makes the neural network inference process time-consuming.
  • To solve this problem, in the prior art a memory allocation and reuse algorithm is deployed on the smart device. Before running the neural network model, the inference process is simulated by running this algorithm to obtain the size of the memory block required by each operator layer's output tensor, and the runtime is then called once, before inference, to allocate all memory blocks. In this way, memory allocation can be completed in a single pass according to the pre-computed results, reducing the time overhead of memory allocation and release.
  • It can be seen that the prior art pre-computes the memory required for running the neural network model with a memory allocation and reuse algorithm deployed on the smart device, thereby reducing the number of memory allocation and release operations; however, when the smart device's resources are limited, this pre-computation takes a long time and makes the model loading time excessive.
  • The first aspect of the embodiments of the present application provides a data processing method, including: obtaining a neural network model; determining the memory size of the memory space required for performing inference operations based on the neural network model; and updating the neural network model to obtain a target neural network model, where the target neural network model carries information indicating the memory size.
  • Determining the memory size of the memory space required by the inference operation of the neural network model includes: determining the running order of the operator layers and the sizes of the memory blocks of the tensors, where an operator layer is a basic computation unit that performs inference operations based on the neural network model and a tensor is the input or output of an operator layer; and determining, according to the sizes of the memory blocks and the running order, the memory size and the position of each memory block in the memory space through a preset memory reuse algorithm, where the position is used to indicate the memory block.
  • The target neural network model carries the running order of the operator layers and the memory offsets of the tensors, where a memory offset is used to indicate the position of a memory block in the memory space.
  • If the neural network model includes a first neural network model and a second neural network model, determining the memory size of the memory space required by the inference operation includes: determining a first memory size of a first memory space required by the inference operation of the first neural network model and a second memory size of a second memory space required by the inference operation of the second neural network model; and determining the memory size of the memory space according to the first memory size and the second memory size.
  • The running order of the first neural network model and the second neural network model is determined, where the running order is serial or parallel. If the running order is serial, the memory size of the memory space is the first memory size, where the first memory size is greater than or equal to the second memory size; if the running order is parallel, the memory size of the memory space is the sum of the first memory size and the second memory size.
  • The second aspect of the embodiments of the present application provides a data processing method, including: obtaining a neural network model, where the neural network model carries information indicating a memory size and the memory size information indicates the total memory space required by the inference operation of the neural network model; allocating, in memory, a total memory space whose size is greater than or equal to the memory size; and performing the inference operation according to the neural network model based on the allocated total memory space.
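  • As a rough illustration of this execution-side flow, the following sketch assumes a loaded model object that exposes the carried memory size as `memory_size` and its operator layers as `operators`; these names and the operator interface are illustrative assumptions, not interfaces defined by this application.

```python
# Minimal sketch of the second-aspect flow: allocate once, run, release once.
def run_inference(model, input_data):
    # Allocate the total memory space once, with size >= the carried memory size.
    arena = bytearray(model.memory_size)
    outputs = input_data
    for op in model.operators:          # run operator layers in their stored order
        outputs = op.run(arena, outputs)
    del arena                           # release the whole space in one operation
    return outputs
```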
  • The neural network model further includes the running order of the operator layers and the memory offsets of the tensors, where an operator layer is a basic computation unit of the inference operation of the neural network model, a tensor is the input or output of an operator layer, and a memory offset indicates the position of the tensor's memory block in the total memory space.
  • Allocating, in memory, a total memory space with a size equal to the memory size includes: determining the base address of the total memory space through the runtime system.
  • The position of a tensor's memory block in the total memory space is determined according to the base address of the total memory space and the memory offset of the tensor.
  • The memory size is determined by a memory reuse algorithm.
  • If the neural network model includes a first neural network model and a second neural network model, the neural network model further includes the running order of the first neural network model and the second neural network model, where the running order is serial or parallel.
  • If the running order is serial, the memory size of the total memory space is a first memory size, where the first memory size is greater than or equal to a second memory size, the first memory size is the size of the first memory space required by the inference operation of the first neural network model, and the second memory size is the size of the second memory space required by the inference operation of the second neural network model; if the running order is parallel, the memory size of the total memory space is the sum of the first memory size and the second memory size.
  • The method further includes: releasing the total memory space in one operation after the inference operation is completed.
  • The third aspect of the embodiments of the present application provides a model optimization device, including: an acquisition module for acquiring a neural network model; a determining module for determining the memory size of the memory space required for performing inference operations based on the neural network model; and an update module for updating the neural network model to obtain a target neural network model, where the target neural network model carries information indicating the memory size.
  • The determining module is specifically configured to: determine the running order of the operator layers and the sizes of the memory blocks of the tensors, where an operator layer is a basic computation unit that performs inference operations based on the neural network model and a tensor is the input or output of an operator layer; and determine, according to the sizes of the memory blocks and the running order, the memory size and the position of each memory block in the memory space through a preset memory reuse algorithm, where the position is used to indicate the memory block.
  • The target neural network model carries the running order of the operator layers and the memory offsets of the tensors, where a memory offset is used to indicate the position of a memory block in the memory space.
  • If the neural network model includes a first neural network model and a second neural network model, the determining module is specifically configured to: determine a first memory size of a first memory space required by the inference operation of the first neural network model and a second memory size of a second memory space required by the inference operation of the second neural network model; and determine the memory size of the memory space according to the first memory size and the second memory size.
  • The determining module is further configured to: determine the running order of the first neural network model and the second neural network model, where the running order is serial or parallel; if the running order is serial, the memory size of the memory space is the first memory size, where the first memory size is greater than or equal to the second memory size; if the running order is parallel, the memory size of the memory space is the sum of the first memory size and the second memory size.
  • The fourth aspect of the embodiments of the present application provides a model execution device, including: an acquisition module for acquiring a neural network model, where the neural network model carries information indicating a memory size and the memory size information indicates the total memory space required by the inference operation of the neural network model; an allocation module for allocating, in memory, a total memory space whose size is greater than or equal to the memory size; and a computation module for performing inference operations according to the neural network model based on the allocated total memory space.
  • The neural network model further includes the running order of the operator layers and the memory offsets of the tensors, where an operator layer is a basic computation unit of the inference operation of the neural network model, a tensor is the input or output of an operator layer, and a memory offset indicates the position of the tensor's memory block in the total memory space.
  • The allocation module is specifically configured to determine the base address of the total memory space through the runtime system.
  • The device further includes a determining module configured to determine, according to the base address of the total memory space and the memory offset of a tensor, the position of the tensor's memory block in the total memory space.
  • The memory size is determined by a memory reuse algorithm.
  • If the neural network model includes a first neural network model and a second neural network model, the neural network model further includes the running order of the first neural network model and the second neural network model, where the running order is serial or parallel.
  • If the running order is serial, the memory size of the total memory space is a first memory size, where the first memory size is greater than or equal to a second memory size, the first memory size is the size of the first memory space required by the inference operation of the first neural network model, and the second memory size is the size of the second memory space required by the inference operation of the second neural network model; if the running order is parallel, the memory size of the total memory space is the sum of the first memory size and the second memory size.
  • The device further includes a release module configured to release the total memory space in one operation after the inference operation is completed.
  • The fifth aspect of the embodiments of the present application provides a model optimization device, including: a memory configured to store instructions; and a processor configured to execute the instructions in the memory, so that the controller performs the method of the foregoing first aspect or any one of its implementations.
  • The sixth aspect of the embodiments of the present application provides a model execution device, including: a memory configured to store instructions; and a processor configured to execute the instructions in the memory, so that the controller performs the method of the foregoing second aspect or any one of its implementations.
  • The seventh aspect of the embodiments of the present application provides a computer system, including the model optimization device according to any one of the foregoing third aspect and its implementations and the model execution device according to any one of the foregoing fourth aspect and its implementations.
  • The eighth aspect of the embodiments of the present application provides a computer program product, where the computer program product includes instructions that, when run on a computer, cause the computer to perform the method of any one of the foregoing first aspect or second aspect and their implementations.
  • The ninth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the method of any one of the foregoing first aspect or second aspect and their implementations.
  • With the method provided by the embodiments of this application, the model optimization device can perform pre-computation according to the neural network model, obtain the memory size of the memory space required for the neural network model to perform inference operations, and carry the memory size information in the updated target neural network model, so that the model execution device does not need to perform memory reuse computation after loading, which reduces the model execution device's network model loading time. The model execution device, using the memory size information carried in the obtained network model file, can allocate the memory space required for the network model's inference operation accordingly, so that memory can be allocated in a single operation directly according to the network model file, reducing the time spent on memory reuse computation during model loading.
  • Figure 1 is a schematic diagram of memory requirements during the operation of neural network inference
  • Figure 2 is a schematic diagram of Runtime dynamically allocating and releasing memory when neural network inference is running
  • Figure 3 is a schematic diagram of allocating the maximum memory required in one operation before neural network model inference
  • Figure 4 is a schematic diagram of the relationship between the memory management scheme and the memory allocation time and memory allocation amount
  • Figure 5 is a schematic diagram of allocating memory on an electronic device
  • Figure 6 is a schematic diagram of determining the specification and quantity of memory blocks that need to be allocated through a memory multiplexing algorithm
  • FIG. 7 is a schematic diagram of a system scenario of the data processing method in an embodiment of the application.
  • FIG. 8 is a schematic diagram of an embodiment of a data processing method in an embodiment of the application.
  • FIG. 9 is a schematic diagram of another embodiment of a data processing method in an embodiment of the application.
  • FIG. 10 is a schematic diagram of an embodiment of a data processing method in an embodiment of the application.
  • FIG. 11 is a schematic diagram of another embodiment of a data processing method in an embodiment of this application.
  • FIG. 12 is a schematic diagram of an embodiment of a model optimization device in an embodiment of the application.
  • FIG. 13 is a schematic diagram of an embodiment of a model execution device in an embodiment of the application.
  • FIG. 14 is a schematic diagram of an embodiment of a model optimization device in an embodiment of the application.
  • FIG. 15 is a schematic diagram of an embodiment of a model execution device in an embodiment of the application.
  • The embodiments of the present application provide a data processing method, which is used to reduce memory waste while the neural network model is running, reduce the time spent on memory allocation and release, and also reduce the time needed to load the neural network model.
  • A neural network model refers to the programs and data obtained by training on a large amount of labeled data and used to perform cognitive computing.
  • The neural network model includes a neural network architecture component and a neural network parameter component.
  • The neural network architecture component refers to the network and its hierarchical structure related to the neural network algorithm in the neural network model, that is, the program in the aforementioned neural network model used to perform cognitive computing.
  • The neural network model can be used for inference operations; the inference operation converts input data into output through multiple operator layers.
  • A tensor is a generalization of a vector. If a vector is a one-dimensional "table" (its components are arranged in a row in order) and a matrix is a two-dimensional "table" (its components are arranged by row and column position), then an n-th order tensor is an n-dimensional "table", where n is an integer greater than or equal to 1.
  • In the embodiments of this application, for ease of description, a tensor used as an independent variable in an operator layer is called an input tensor, and a tensor used as a dependent variable is called an output tensor.
  • Figure 1 is a schematic diagram of the memory requirements of the operation process of neural network inference.
  • The neural network model inference process converts input data into output through multiple operator layers; the input tensor of an operator layer is the output tensor of the previous operator layer, and the output tensor of the operator layer is the input tensor of the next operator layer.
  • When an operator layer runs, the memory required by the input tensors and output tensors of that operator layer must be provided; different operator layers may require different amounts of memory for their input and output tensors.
  • Refer to Figure 2, a schematic diagram of the Runtime dynamically allocating and releasing memory while neural network inference is running.
  • A simple way to provide memory for the operator layers is to have the Runtime dynamically allocate (malloc) and release (free) memory for the currently running operator layer during inference, which keeps peak memory usage at a minimum.
  • However, runtime allocation and release incur an expensive runtime memory-allocation time cost, making the neural network inference process too time-consuming.
  • Refer to Figure 3, a schematic diagram of allocating the maximum memory required in one operation before neural network model inference.
  • To reduce the memory allocation and release time consumed by interacting with the Runtime, the maximum memory required can be allocated in one operation before neural network model inference; the maximum memory is the sum of the memory required by the output tensors of all operator layers, so that no further interaction with the Runtime is needed during inference.
  • As shown in Table 1, MobileNet and MobileNetV2 (two lightweight image classification networks) and DeeplabV3 (a semantic segmentation network) are three typical neural network models; during their inference, the ratio of the total tensor memory to the minimum memory requirement is 4.17, 3.47, and 6.75 respectively. This ratio measures the memory waste, so it can be seen that this memory allocation method wastes a great deal of memory.
  • Figure 4 is a schematic diagram of the relationship between the memory management scheme and the memory allocation time and the amount of memory allocation;
  • To solve the problems that Runtime dynamic allocation and release during neural network model inference take too long and that a one-time pre-allocation of all tensor memory consumes too much total memory, the embodiments of this application aim at a target solution that reduces the time spent on Runtime allocation and release and reduces the total amount of memory allocated.
  • Figure 5 is a schematic diagram of allocating memory on an electronic device.
  • Figure 6 is a schematic diagram of determining the specification and quantity of memory blocks that need to be allocated through a memory multiplexing algorithm
  • In Figure 6, after the first operator layer releases the 32 B memory block it required, that block can be allocated again to the output tensor of the second operator layer, so part of the memory can be saved by reusing memory blocks. By simulating the neural network inference process, the specifications and number of memory blocks that need to be allocated can be pre-computed; here, two memory blocks with sizes of 32 B and 64 B are needed, and the two blocks are allocated by the runtime in one operation.
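  • The reuse idea in Figure 6 can be sketched as a small first-fit simulation. The tensor sizes, lifetimes, and the first-fit policy below are illustrative assumptions; the application leaves the specific reuse algorithm open (for example, a greedy algorithm or a minimum-cost flow algorithm).

```python
# Illustrative first-fit simulation of memory-block reuse: a block whose tensor
# has already been consumed can be handed to a later tensor that fits into it.
def precompute_blocks(tensor_specs):
    """tensor_specs: (size_in_bytes, alloc_step, last_use_step) per tensor."""
    blocks = []                                    # each block: {"size", "busy_until"}
    for size, alloc_step, last_use in sorted(tensor_specs, key=lambda t: t[1]):
        fit = None
        for b in blocks:
            if b["busy_until"] < alloc_step and b["size"] >= size:
                if fit is None or b["size"] < fit["size"]:
                    fit = b                        # smallest free block that fits
        if fit is None:
            fit = {"size": size, "busy_until": -1}
            blocks.append(fit)                     # a new block must be allocated
        fit["busy_until"] = last_use
    return sorted(b["size"] for b in blocks)

# Network input (32 B, last used by layer 1), layer-1 output (64 B, last used by
# layer 2) and layer-2 output (32 B): two blocks of 32 B and 64 B, as in Figure 6.
print(precompute_blocks([(32, 0, 1), (64, 1, 2), (32, 2, 2)]))   # -> [32, 64]
```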
  • However, in the prior art, pre-computing the memory required for running the neural network model through the memory reuse algorithm on the electronic device itself takes a large amount of pre-computation time, which makes the model loading time too long.
  • Refer to FIG. 7, a schematic diagram of a system scenario of the data processing method in an embodiment of this application.
  • the device for implementing the data processing method provided by the embodiment of the present application includes a model optimization device and a model execution device.
  • The following describes the system scenario of the data processing method with reference to FIG. 7.
  • The trained network model is input to the model optimization device; the model optimization device can parse the model file, obtain the dataflow graph, and obtain, according to the memory reuse algorithm, the total memory size required by the network model at run time.
  • The model optimization device writes the total memory size into the model file to generate the optimized network model.
  • As shown in the figure, the model optimization device sends the optimized model file to the model execution device.
  • The model execution device is used to perform the inference computation of the network model.
  • After loading the model, the model execution device performs a single runtime allocation according to the total memory size carried in the model file to obtain a memory space of that size, computes the memory addresses of the operators' tensors, feeds in the data on which inference based on the network model is to be performed, invokes the operators according to the tensors' memory addresses, and finally outputs the inference result.
  • As an example, a network model for image recognition is obtained after training on a large image data set. The model optimization device can optimize this image recognition network model, compute the total memory size required for its inference operations, and write it into the model to obtain the optimized image recognition network model, which is then sent to the model execution device.
  • The inference operation performed by the model execution device recognizes an image input by the user, for example identifying the image content as a cat or a dog. Before the model execution device performs the inference computation, memory can be allocated according to the total memory size in the model file. In this way, the loading time of the model execution device and the time consumed by memory allocation can be reduced.
  • In one possible implementation scenario, the model optimization device is a server and the model execution device is a terminal.
  • The server may be a cloud server or a computing server, and the terminal can be a smart device of various forms.
  • It is understandable that, because the computing resources of a server are generally better than those of terminal equipment, having the server perform the network model optimization can save the terminal's computing resources and reduce the terminal's model loading time.
  • In another possible implementation scenario, the model optimization device is a fat terminal and the model execution device is a thin terminal.
  • The fat terminal may be, for example, a mobile phone, a desktop computer, a tablet computer, or a portable computer; the specific device type of the fat terminal is not limited here.
  • The thin terminal may be, for example, an Internet of Things (IoT) terminal device, such as a web camera, an intelligent voice assistant, smart headphones, or a smart wearable device; the specific device type of the thin terminal is not limited here.
  • It is understandable that the storage and computing capabilities of a fat terminal are better than those of a thin terminal. In the data processing method provided by the embodiments of this application, the fat terminal performs the network model optimization, which saves the thin terminal's computing resources and reduces the thin terminal's model loading time.
  • FIG. 8 is a schematic diagram of an embodiment of the data processing method in the embodiment of the application.
  • the acquired neural network model may be one neural network model or multiple neural network models, which is not limited here.
  • The neural network model can be pre-deployed by users and is usually stored in the form of a neural network inference model file; the storage format can be, for example, the Open Neural Network Exchange (ONNX) format or the Convolutional Architecture for Fast Feature Embedding (CAFFE) format.
  • the neural network model can be parsed by the model file parser and converted into a data flow diagram.
  • Figure 9 is a schematic diagram of another embodiment of the data processing method in the embodiment of this application.
  • The first step in Figure 9 converts the model file into the representation of a dataflow graph (dataflow). The nodes in the dataflow represent the operator layers of neural network model inference, and the connections between nodes are the edges, which represent the tensors of the operator layers. As the graph shows, the output tensor of one operator layer is the input tensor of the next operator layer.
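  • As a minimal sketch of this parsing step, assuming the model is stored in ONNX format (the application's own parser is not specified), the public onnx package can expose the operator layers and their tensor edges; the file path below is illustrative.

```python
# Turn an ONNX model file into a dataflow view: nodes = operator layers,
# edges = the named input/output tensors connecting them.
import onnx

model = onnx.load("model.onnx")                  # illustrative path
for node in model.graph.node:                    # operator layers in file order
    print(node.op_type, "inputs:", list(node.input), "outputs:", list(node.output))
```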
  • To determine the memory block size of a tensor, the data byte width (Data Byte) and the shape information of the tensor must first be obtained; the tensor shape information includes the batch size, height, width, and number of channels of the tensor.
  • The memory block size of a tensor is calculated as follows:
  • mem_size = Batch × Height × Width × Channel × Data Byte
  • where mem_size is the memory block size of the tensor, Batch is the number of tensors (batch size), Height is the height of the tensor, Width is the width of the tensor, Channel is the number of channels of the tensor, and Data Byte is the memory size required by a single data element of the tensor.
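  • A direct sketch of this formula in code (the example shape and 4-byte float element width are illustrative):

```python
# Memory block size of a tensor from its shape and element width (NHWC layout).
def tensor_mem_size(batch, height, width, channel, data_byte):
    return batch * height * width * channel * data_byte

# e.g. a 1 x 224 x 224 x 3 float32 tensor needs 602,112 bytes
print(tensor_mem_size(1, 224, 224, 3, 4))
```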
  • The running order of the operator layers can be determined by topological sorting; the sorting algorithm is not limited. In addition, the running order of the operators can be indicated by the index of each operator layer.
  • Optionally, if some operator layers are fused through graph optimization techniques, the running order of the operator layers can be determined after the operator layer fusion.
  • Optionally, if there are multiple neural network models, on the one hand each neural network model needs to be parsed to determine the running order of that model's operator layers; on the other hand, the running order among the multiple neural network models is determined according to a user-provided or preset scheme, and this running order may be serial or parallel. For multiple neural network models executed serially, the running order of their operator layers is determined; for multiple neural network models executed in parallel, there is no ordering constraint between the running orders of their operator layers. The specific running order among multiple neural network models is not limited here.
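  • The sorting algorithm is left open above; one common choice is Kahn's topological sort, sketched below. The graph encoding (layer names plus producer/consumer pairs) is an assumption for illustration, not the application's data structure.

```python
# Illustrative topological sort of operator layers (Kahn's algorithm).
from collections import deque

def running_order(ops, edges):
    """ops: list of layer names; edges: (producer, consumer) pairs."""
    indeg = {op: 0 for op in ops}
    succ = {op: [] for op in ops}
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    queue = deque(op for op in ops if indeg[op] == 0)
    order = []
    while queue:
        op = queue.popleft()
        order.append(op)
        for nxt in succ[op]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    return order          # the index in this list can serve as the operator-layer index

print(running_order(["conv1", "relu1", "conv2"],
                    [("conv1", "relu1"), ("relu1", "conv2")]))
```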
  • After the memory block sizes of the tensors and the running order of the operator layers are determined, the memory size of the memory space and the memory offset of each tensor can be determined according to a preset memory reuse algorithm.
  • The memory space is the memory required for running the neural network model, that is, the memory required by all tensors; in this embodiment it is obtained by applying the memory reuse algorithm to the memory required by all tensors.
  • The memory offset of a tensor is used to indicate the offset of that tensor's memory block within the memory space.
  • It should be noted that the memory reuse algorithm can be one or more algorithms, which is not limited here.
  • Specifically, the inference process of the neural network model is simulated: according to the memory block sizes of the tensors and the running order of the operator layers, the preset memory reuse algorithm assembles the reused memory blocks into one memory space, and each memory block is represented by an offset. The memory block offset of each tensor is recorded in the edge information of the dataflow graph. For example, as shown in Figure 8, the memory size of the memory space is 192 bits (b); following the running order of the operator layers, the offsets of the memory blocks required by the tensors are 0 b, 64 b, 96 b, 0 b, 128 b, and 64 b respectively.
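  • The packing of reused blocks into one memory space can be sketched as follows; the block sizes and the tensor-to-block assignment are illustrative (the Figure 8 example counts offsets in bits), and how tensors are mapped to blocks is whatever the chosen reuse algorithm produced.

```python
# Lay the reused blocks out back to back in one memory space and record, for
# each tensor, the offset of the block it was assigned to.
def assign_offsets(block_sizes, block_of_tensor):
    offsets, cursor = [], 0
    for size in block_sizes:                 # blocks placed contiguously
        offsets.append(cursor)
        cursor += size
    total_size = cursor                      # memory size of the whole space
    tensor_offsets = {t: offsets[b] for t, b in block_of_tensor.items()}
    return total_size, tensor_offsets

total, offs = assign_offsets([64, 32, 96], {"t0": 0, "t1": 1, "t2": 2, "t3": 0})
print(total, offs)   # 192 {'t0': 0, 't1': 64, 't2': 96, 't3': 0}
```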
  • Optionally, if there are multiple neural network models, on the one hand the inference process of each neural network model is simulated, and the preset memory reuse algorithm is used to determine the memory size of the total memory space required for running that model's inference process and the memory offset of the output tensor of each operator layer.
  • On the other hand, the running order among the multiple neural network models is determined according to a user-provided or preset scheme, and this running order may be serial or parallel. For multiple neural network models executed serially, the largest of the memory sizes of the multiple total memory spaces is taken as the memory size required by the inference processes of these models; for multiple neural network models executed in parallel, the sum of the memory sizes of the multiple total memory spaces is taken as the memory size required by the inference processes of these models.
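  • The serial/parallel combination rule stated above reduces to a one-line decision; the example sizes are illustrative.

```python
# Combine per-model memory sizes: maximum for serial execution, sum for parallel.
def combined_memory_size(model_sizes, order):
    if order == "serial":
        return max(model_sizes)
    if order == "parallel":
        return sum(model_sizes)
    raise ValueError("order must be 'serial' or 'parallel'")

print(combined_memory_size([192, 128], "serial"))    # -> 192
print(combined_memory_size([192, 128], "parallel"))  # -> 320
```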
  • The memory size of the memory space and the memory offset of each tensor determined above are recorded in the offset file of the model; the neural network model also records the running order of the operator layers determined earlier.
  • the optimized data flow graph is rewritten into the neural network model file, which includes the running sequence of the operator layer, the memory offset of the tensor and the memory size of the memory space.
  • Optionally, the file also includes the operator layers that are substituted in after fusion.
  • With the method provided by this embodiment of the application, the size of the memory space required for the network model to execute inference is added to the network model file, so that the device executing the network model can allocate the memory in one operation directly according to the network model file, which reduces the time spent on memory reuse computation when loading the model.
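  • For illustration only, the pre-computed results could be carried in a small offset/sidecar file next to the model; the JSON layout and field names below are assumptions, not the file format defined by this application.

```python
# Write the pre-computed memory size, operator running order, and per-tensor
# offsets into an illustrative sidecar file for the model execution device.
import json

optimized_info = {
    "memory_size": 192,                                  # total memory space
    "run_order": ["conv1", "relu1", "conv2"],            # operator-layer order
    "tensor_offsets": {"t0": 0, "t1": 64, "t2": 96},     # per-tensor offsets
}
with open("model_offsets.json", "w") as f:
    json.dump(optimized_info, f, indent=2)
```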
  • FIG. 10 is a schematic diagram of an embodiment of the data processing method in the embodiment of the application.
  • the terminal may obtain the neural network model from computing devices such as the cloud or computing server, and the neural network model may be one neural network model or multiple neural network models, which is not limited here.
  • Neural network models are usually stored in the form of neural network inference model files, and the storage type can be ONNX format or CAFFE format.
  • the neural network model is converted into an executable data flow graph through the parser.
  • FIG. 11 is a schematic diagram of another embodiment of the data processing method in the embodiment of the application.
  • the memory offset of the tensor is used to indicate the offset position of the memory block of the tensor in the memory space.
  • The base address of the memory space is obtained through a single memory allocation according to the memory size obtained above; the base address of the memory space may be referred to as the memory base address (@base).
  • The memory address of an operator layer's output tensor is the base address of the memory space plus the memory offset of the tensor. According to the base address of the memory space and the memory offset of each operator layer's output tensor, the memory address of each output tensor can be obtained as @base + offset. For example, if the memory offset of the tensor of input node 2 is 96 b and the base address of the memory space is @base, the memory block address of that tensor is @base + 96 b.
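  • A minimal sketch of this execution step, assuming the sidecar fields shown earlier and a hypothetical operator interface that accepts a tensor-block accessor; both are illustrative assumptions rather than the application's defined interfaces.

```python
# Allocate the space once, resolve each tensor as base + offset, run the
# operator layers in the stored order, then release the whole space once.
def execute(model_info, operators, input_data):
    arena = memoryview(bytearray(model_info["memory_size"]))   # one allocation = base address
    def tensor_block(name, size):
        off = model_info["tensor_offsets"][name]               # base address + offset
        return arena[off:off + size]
    data = input_data
    for op_name in model_info["run_order"]:                    # invoke operators in order
        data = operators[op_name](data, tensor_block)
    del arena                                                   # release the space in one step
    return data
```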
  • FIG. 12 is a schematic diagram of an embodiment of the model optimization device in the embodiment of the application.
  • the model optimization device includes:
  • the obtaining module 1201 is used to obtain a neural network model
  • the determining module 1202 is configured to determine the memory size of the memory space required for inference operations based on the neural network model
  • the update module 1203 is configured to update the neural network model to obtain a target neural network model, and the target neural network model carries information indicating the memory size.
  • The determining module 1202 is specifically configured to: determine the running order of the operator layers and the sizes of the memory blocks of the tensors, where an operator layer is a basic computation unit that performs inference operations based on the neural network model and a tensor is the input or output of an operator layer; and determine, according to the sizes of the memory blocks and the running order, the memory size and the position of each memory block in the memory space through a preset memory reuse algorithm, where the position is used to indicate the memory block.
  • the target neural network model carries the running sequence of the operator layer and the memory offset of the tensor, and the memory offset is used to indicate the position of the memory block in the memory space.
  • If the neural network model includes a first neural network model and a second neural network model, the determining module 1202 is specifically configured to: determine a first memory size of a first memory space required by the inference operation of the first neural network model and a second memory size of a second memory space required by the inference operation of the second neural network model; and determine the memory size of the memory space according to the first memory size and the second memory size.
  • The determining module 1202 is further configured to determine the running order of the first neural network model and the second neural network model, where the running order is serial or parallel; if the running order is serial, the memory size of the memory space is the first memory size, where the first memory size is greater than or equal to the second memory size; if the running order is parallel, the memory size of the memory space is the sum of the first memory size and the second memory size.
  • FIG. 13 is a schematic diagram of an embodiment of the model execution device in the embodiment of the application.
  • the model execution device includes:
  • the obtaining module 1301 is configured to obtain a neural network model, the neural network model carries information indicating the memory size, and the memory size information is used to indicate the total memory space required by the neural network model inference operation;
  • An allocation module 1302 configured to allocate a total memory space with a size greater than or equal to the memory size in the memory
  • the calculation module 1303 is configured to perform inference calculations according to the neural network model based on the allocated total memory space.
  • The neural network model further includes the running order of the operator layers and the memory offsets of the tensors, where an operator layer is a basic computation unit of the inference operation of the neural network model, a tensor is the input or output of an operator layer, and a memory offset is used to indicate the position of the tensor's memory block in the total memory space.
  • the allocation module 1302 is specifically configured to determine the base address of the total memory space through a runtime system.
  • The device further includes a determining module 1304, configured to determine, according to the base address of the total memory space and the memory offset of a tensor, the position of the tensor's memory block in the total memory space.
  • The memory size is determined by a memory reuse algorithm.
  • If the neural network model includes a first neural network model and a second neural network model, the neural network model further includes the running order of the first neural network model and the second neural network model, where the running order is serial or parallel.
  • If the running order is serial, the memory size of the total memory space is a first memory size, where the first memory size is greater than or equal to a second memory size, the first memory size is the size of the first memory space required by the inference operation of the first neural network model, and the second memory size is the size of the second memory space required by the inference operation of the second neural network model; if the running order is parallel, the memory size of the total memory space is the sum of the first memory size and the second memory size.
  • the device further includes:
  • the release module 1305 is configured to release the total memory space at one time after the inference operation is completed.
  • FIG. 14 is a schematic diagram of an embodiment of the model optimization device in the embodiment of the application.
  • the model optimization apparatus provided in this embodiment may be a device such as a server or a terminal, and the specific device form is not limited in the embodiment of the present application.
  • the model optimization device 1400 may have relatively large differences due to different configurations or performances, and may include one or more processors 1401 and a memory 1402, and the memory 1402 stores programs or data.
  • the memory 1402 may be volatile storage or non-volatile storage.
  • The processor 1401 may be one or more central processing units (CPU), or a dedicated processor such as a graphics processing unit (GPU) or a neural-network processing unit (NPU); it may also be a system on a chip (SoC) that integrates one or more CPUs, one or more GPUs, and one or more NPUs. A CPU may be a single-core CPU or a multi-core CPU.
  • the processor 1401 may communicate with the memory 1402, and execute a series of instructions in the memory 1402 on the model optimization apparatus 1400.
  • the model optimization device 1400 also includes one or more wired or wireless network interfaces 1403, such as an Ethernet interface.
  • The model optimization apparatus 1400 may also include one or more power supplies, and one or more input/output interfaces that can be used to connect a display, a mouse, a keyboard, a touch screen device, or a sensing device.
  • The input/output interfaces are optional components, which may or may not be present, and are not limited here.
  • FIG. 15 is a schematic diagram of an embodiment of the model execution device in the embodiment of the application.
  • The model execution device provided in this embodiment may be a terminal device such as an IoT device, and its specific device form is not limited in the embodiments of this application.
  • the model execution device 1500 may have relatively large differences due to different configurations or performances, and may include one or more processors 1501 and a memory 1502, and the memory 1502 stores programs or data.
  • the memory 1502 may be volatile storage or non-volatile storage.
  • The processor 1501 may be one or more central processing units (CPU), or a dedicated processor such as a graphics processing unit (GPU) or a neural-network processing unit (NPU); it may also be a system on a chip (SoC) that integrates one or more CPUs, one or more GPUs, and one or more NPUs. A CPU may be a single-core CPU or a multi-core CPU.
  • the processor 1501 may communicate with the memory 1502, and execute a series of instructions in the memory 1502 on the model execution device 1500.
  • the model execution device 1500 also includes one or more wired or wireless network interfaces 1503, such as an Ethernet interface.
  • The model execution apparatus 1500 may also include one or more power supplies, and one or more input/output interfaces that can be used to connect a display, a mouse, a keyboard, a touch screen device, or a sensing device.
  • The input/output interfaces are optional components, which may or may not be present, and are not limited here.
  • the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

A data processing method, relating to the field of artificial intelligence and used to reduce the loading time of a neural network model. The method includes: obtaining a neural network model; determining the memory size of the memory space required for performing inference operations based on the neural network model; and updating the neural network model to obtain a target neural network model, where the target neural network model carries information indicating the memory size.

Description

数据处理方法、模型优化装置和模型执行装置
本申请要求于2019年9月18日提交中国专利局、申请号为201910883288.4、发明名称为“数据处理方法、模型优化装置和模型执行装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种数据处理方法、模型优化装置和模型执行装置。
背景技术
深度学习(Deep Learning,DL)是人工智能(Artificial Intelligence,AI)领域的重要技术。DL通常可以分为训练(Training)和推理(Inference)两个过程。Training过程通过大量的样本数据输入,采取增强学习等非监督学习方法,训练出一个复杂的深度神经网络模型。推理过程利用训练好的模型,使用新的数据去“推理”出各种结论,如视频监控设备通过训练的深度神经网络模型,判断一张抓拍到的人脸是否属于黑名单。训练过程由于涉及海量的训练数据和复杂的深度神经网络结构,运算量巨大,通常由计算能力强大的设备,比如服务器或集成有专有处理器的设备来完成。如图1所示,基于神经网络模型的推理过程是将输入数据通过多层算子层转换成输出,这个过程中一个算子层的输入是前一算子层的输出,该算子层的输出是下一算子层的输入。推理过程的运算量低于训练过程,因此,以神经网络推理为基础的应用被部署在越来越多的智能设备上,例如手机、音箱、耳机、IoT设备等。然而神经网络推理的逐层运算的特性,为智能设备的内存使用带来了一个重大挑战。当一个算子层运行时,需要提供该算子层的输入和输出所需的内存。对于不同的算子层,其输入和输出所需的内存大小也可能不同,因此只能设备的运行时系统(runtime system),简称runtime,在神经网络推理运算时需要频繁进行内存动态分配(malloc)和释放(free)操作,使得神经网络推理过程耗时过长。
为解决这一问题,现有技术中通过在智能设备上部署内存分配和复用算法,在运行神经网络模型前,先通过运行内存分配和复用算法模拟神经网络推理过程,获取每个算子层输出张量需要的内存块的尺寸,在运行神经网络推理前调用runtime来一次性分配所有内存块,由此,可以根据预计算的结果一次性实现内存分配,从而减少了内存分配和释放的时间开销。
可以看出,现有技术基于在智能设备上部署的内存分配和复用算法预计算神经网络模型运行需要的内存,进而达到减少内存分配和释放操作次数的目的,但是当智能设备资源受限时,需花费大量的预计算时间,造成模型载入时间过长。
发明内容
本申请实施例第一方面提供了一种数据处理方法,包括:获取神经网络模型;确定基于所述神经网络模型进行推理运算所需的内存空间的内存尺寸;更新所述神经网络模型,以得到目标神经网络模型,所述目标神经网络模型携带指示所述内存尺寸的信息。
在第一方面的一种可能的实现方式中,所述确定所述神经网络模型推理运算所需的内存空间的内存尺寸包括:确定算子层的运行顺序和张量的内存块的尺寸,所述算子层为基于所 述神经网络模型进行推理运算的基础计算单元,所述张量为所述算子层的输入或输出;根据所述内存块的尺寸和所述运行顺序,通过预设的内存复用算法确定所述内存尺寸和所述内存块在所述内存空间中的位置,所述位置用于指示所述内存块。
在第一方面的一种可能的实现方式中,所述目标神经网络模型中携带所述算子层的运行顺序,和张量的内存偏移,所述内存偏移用于指示所述内存块在所述内存空间中的位置。
在第一方面的一种可能的实现方式中,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述确定所述神经网络模型推理运算所需的内存空间的内存尺寸包括:确定所述第一神经模型推理运算所需的第一内存空间的第一内存尺寸,以及所述第二神经模型推理运算所需的第二内存空间的第二内存尺寸;根据所述第一内存尺寸和所述第二内存尺寸确定所述内存空间的内存尺寸。
在第一方面的一种可能的实现方式中,确定所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行;若所述运行顺序为串行,所述内存空间的内存尺寸为所述第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸;若所述运行顺序为并行,所述内存空间的内存尺寸为所述第一内存尺寸和所述第二内存尺寸之和。
本申请实施例第二方面提供了一种数据处理方法,包括:获取神经网络模型,所述神经网络模型中携带指示所述内存尺寸的信息,所述内存尺寸的信息用于指示所述神经网络模型推理运算所需的总内存空间;在内存中分配尺寸大于或等于所述内存尺寸的总内存空间;基于分配的所述总内存空间,根据所述神经网络模型进行推理运算。
在第二方面的一种可能的实现方式中,所述神经网络模型中还包括:算子层的运行顺序和张量的内存偏移,所述算子层为所述神经网络模型的推理运算的基础计算单元,所述张量为所述算子层的输入或输出,所述内存偏移用于指示所述张量的内存块在所述总内存空间中的位置。
在第二方面的一种可能的实现方式中,所述在内存中分配尺寸等于所述内存尺寸的总内存空间包括:通过运行时系统确定所述总内存空间的基地址。
在第二方面的一种可能的实现方式中,根据所述总内存空间的基地址和所述张量的内存偏移,确定所述张量的内存块在所述总内存空间中的位置。
在第二方面的一种可能的实现方式中,所述内存尺寸通过内存复用算法确定。
在第二方面的一种可能的实现方式中,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述神经网络模型中还包括:所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行。
在第二方面的一种可能的实现方式中,若所述运行顺序为串行,所述总内存空间的内存尺寸为第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸,所述第一内存尺寸为所述第一神经模型推理运算所需的第一内存空间的尺寸,所述第二内存尺寸为所述第二神经模型推理运算所需的第二内存空间的第二内存尺寸;若所述运行顺序为并行,所述总内存空间的内存尺寸为第一内存尺寸和第二内存尺寸之和。
在第二方面的一种可能的实现方式中,所述方法还包括:当所述推理运算结束后,一次释放所述总内存空间。
本申请实施例第三方面提供了一种模型优化装置,包括:获取模块,用于获取神经网络 模型;确定模块,用于确定基于所述神经网络模型进行推理运算所需的内存空间的内存尺寸;更新模块,用于更新所述神经网络模型,以得到目标神经网络模型,所述目标神经网络模型携带指示所述内存尺寸的信息。
在第三方面的一种可能的实现方式中,所述确定模块具体用于:确定算子层的运行顺序和张量的内存块的尺寸,所述算子层为基于所述神经网络模型进行推理运算的基础计算单元,所述张量为所述算子层的输入或输出;根据所述内存块的尺寸和所述运行顺序,通过预设的内存复用算法确定所述内存尺寸和所述内存块在所述内存空间中的位置,所述位置用于指示所述内存块。
在第三方面的一种可能的实现方式中,所述目标神经网络模型中携带所述算子层的运行顺序,和张量的内存偏移,所述内存偏移用于指示所述内存块在所述内存空间中的位置。
在第三方面的一种可能的实现方式中,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述确定模块具体用于:确定所述第一神经模型推理运算所需的第一内存空间的第一内存尺寸,以及所述第二神经模型推理运算所需的第二内存空间的第二内存尺寸;根据所述第一内存尺寸和所述第二内存尺寸确定所述内存空间的内存尺寸。
在第三方面的一种可能的实现方式中,所述确定模块还用于:确定所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行;若所述运行顺序为串行,所述内存空间的内存尺寸为所述第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸;若所述运行顺序为并行,所述内存空间的内存尺寸为所述第一内存尺寸和所述第二内存尺寸之和。
本申请实施例第四方面提供了一种模型执行装置,包括:获取模块,用于获取神经网络模型,所述神经网络模型中携带指示所述内存尺寸的信息,所述内存尺寸的信息用于指示所述神经网络模型推理运算所需的总内存空间;分配模块,用于在内存中分配尺寸大于或等于所述内存尺寸的总内存空间;运算模块,用于基于分配的所述总内存空间,根据所述神经网络模型进行推理运算。
在第四方面的一种可能的实现方式中,所述神经网络模型中还包括:算子层的运行顺序和张量的内存偏移,所述算子层为所述神经网络模型的推理运算的基础计算单元,所述张量为所述算子层的输入或输出,所述内存偏移用于指示所述张量的内存块在所述总内存空间中的位置。
在第四方面的一种可能的实现方式中,所述分配模块具体用于:通过运行时系统确定所述总内存空间的基地址。
在第四方面的一种可能的实现方式中,所述装置还包括:确定模块,用于根据所述总内存空间的基地址和所述张量的内存偏移,确定所述张量的内存块在所述总内存空间中的位置。
在第四方面的一种可能的实现方式中,所述内存尺寸通过内存复用算法确定。
在第四方面的一种可能的实现方式中,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述神经网络模型中还包括:所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行。
在第四方面的一种可能的实现方式中,若所述运行顺序为串行,所述总内存空间的内存尺寸为第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸,所述第一内存尺寸 为所述第一神经模型推理运算所需的第一内存空间的尺寸,所述第二内存尺寸为所述第二神经模型推理运算所需的第二内存空间的第二内存尺寸;若所述运行顺序为并行,所述总内存空间的内存尺寸为第一内存尺寸和第二内存尺寸之和。
在第四方面的一种可能的实现方式中,所述装置还包括:释放模块,用于当所述推理运算结束后,一次释放所述总内存空间。
本申请实施例第五方面提供了一种模型优化装置,其特征在于,包括:存储器,用于存储指令;处理器,用于执行所述存储器中的指令,使得所述控制器执行如前述第一方面及各实现方式中任一项的方法。
本申请实施例第六方面提供了一种模型执行装置,其特征在于,包括:存储器,用于存储指令;处理器,用于执行所述存储器中的指令,使得所述控制器执行如前述第二方面及各实现方式中任一项的方法。
本申请实施例第七方面提供了一种计算机系统,其特征在于,包括如前述第三方面及各实现方式中任一项所述的模型优化装置,如前述第四方面及各实现方式中任一项所述的模型执行装置。
本申请实施例第八方面提供了一种计算机程序产品,其特征在于,所述计算机程序产品包括指令,当所述指令在计算机上运行时,使得所述计算机执行如前述第一方面或第二方面及各实现方式中任一项的方法。
本申请实施例第九方面提供了一种种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储指令,当所述指令在计算机上运行时,使得所述计算机执行如前述第一方面或第二方面及各实现方式中任一项的方法。
从以上技术方案可以看出,本申请实施例具有以下优点:
本申请实施例提供的方法,模型优化装置可以根据神经网络模型进行预计算,获取神经网络模型进行推理运算所需的内存空间的内存尺寸,并将该内存尺寸信息携带在更新后的目标神经网络模型中,可以使得模型执行装置载入后不必进行内存复用计算,减少了模型执行装置的网络模型载入时长;模型执行装置通过获取的网络模型文件中携带的内存尺寸的信息,可以根据信息分配该网络模型执行推理运行所需的内存空间,以使得模型执行装置可以直接根据网络模型文件实现内存一次性分配,减少了载入模型中进行内存复用计算的时长。
附图说明
图1为神经网络推理的运行过程的内存需求示意图;
图2为Runtime在神经网络推理运行时来动态分配和释放内存的示意图;
图3为神经网络模型推理前一次性分配所需的最大内存的示意图;
图4为内存管理方案与内存分配时间和内存分配量的关系示意图;
图5为在电子设备上分配内存的示意图;
图6为通过内存复用算法确定需要分配的内存块的规格和数量的示意图;
图7为本申请实施例中数量处理方法的系统场景示意图;
图8为本申请实施例中数据处理方法的一个实施例示意图;
图9为本申请实施例中数据处理方法的另一个实施例示意图;
图10为本申请实施例中数据处理方法的一个实施例示意图;
图11为本申请实施例中数据处理方法的另一个实施例示意图;
图12为本申请实施例中模型优化装置的一个实施例示意图;
图13为本申请实施例中模型执行装置的一个实施例示意图;
图14为本申请实施例中模型优化装置的一个实施例示意图;
图15为本申请实施例中模型执行装置的一个实施例示意图。
具体实施方式
本申请实施例提供了一种数据处理方法,用于减少神经网络模型运行时的内存浪费,降低内存分配和释放的分配时长,还可以降低载入神经网络模型时长。
神经网络模型指的是经过大量有标签的数据训练,得到的用于执行认知计算的程序和数据。神经网络模型包括神经网络架构组件与神经网络参数组件。其中,神经网络架构组件指的是神经网络模型中与神经网络算法相关的网络及其层次结构,也就是上述的神经网络模型中用于执行认知计算的程序。神经网络模型可以用于进行推理运算,推理运算的过程是将输入数据通过多层算子层转换成输出。
张量(tensor),是向量的推广。假设向量是一维的“表格”(即分量按照顺序排成一排),矩阵是二维的“表格”(即分量按照纵横位置排列),那么n阶张量就是n维的“表格”;其中,n是大于或等于1的张量。在本申请实施例中,为了便于描述,将一个算子层中作为自变量的张量称为输入张量,作为因变量的张量称为输出张量。
请参阅图1,为神经网络推理的运行过程的内存需求示意图。
神经网络模型推理过程是将输入数据通过多层算子层转换成输出,这个过程中一个算子层的输入张量是前一算子层的输出张量,该算子层的输出张量是下一算子层的输入张量。当一个算子层运行时,需要提供该算子层的输入张量和输出张量所需的内存。对于不同的算子层,其输入张量和输出张量所需的内存大小也可能不同。
请参阅图2,Runtime在神经网络推理运行时来动态分配和释放内存的示意图。
为算子层提供内存分配的一个简单实现方式是通过Runtime在神经网络推理运行时为当前运行的算子层动态分配(Malloc)和释放(Free)内存,这样内存的使用峰值可以保持最低。但runtime分配和释放会造成昂贵的运行时内存分配时间消耗,使得神经网络推理过程耗时过长。
请参阅图3,为神经网络模型推理前一次性分配所需的最大内存的示意图。
为降低与Runtime的交互消耗的内存分配和释放的时间,可以在神经网络模型推理前一次性分配所需的最大内存,最大内存即为所有算子层的输出张量所需的内存的总和,由此,在神经网络模型推理过程中不再与Runtime交互。
表1
网络 张量的内存总和/内存使用最小需求的比值
MobileNet 4.17
MobileNetV2 3.47
DeeplabV3 6.75
如表1所示,表中MobileNet和MobileNetV2为两种轻量化的图像分类网络,DeeplabV3表示语义分割网络。三种典型神经网络模型推理过程中,张量总和与内存使用最小需求的比值可以确定衡量内存浪费情况。由此可见,这种内存分配方法将造成内存极大的浪费。
请参阅图4,为内存管理方案与内存分配时间和内存分配量的关系示意图;
为解决神经网络模型推理过程中Runtime动态分配和释放耗时过长,以及一次性预分配全部张量内存消耗的内存总量过大的问题,本申请实施例希望一种目标方案,可以降低内存分配中Runtime分配和释放的分配时长,并减少内存分配量。
请参阅图5,为在电子设备上分配内存的示意图。
通过在电子设备上部署内存分配和复用预计算算法,在运行神经网络模型前,先通过该算法模拟神经网络推理过程,分析张量内存分配过程中可以共用的内存块,预计算需要分配的内存块的规格和数量,最后在运行神经网络模型推理前调用Runtime来一次性分配所需要的所有内存块,构建一个内存块池。由此,可以根据预计算的结果一次性实现内存分配,降低了与runtime交互的时长。
需要说明的是,实现神经网络模型推理过程内存分配和复用的预计算算法有多种,例如,贪心算法(Greedy)或最小代价流算法(the minimum-cost flow algorithm,MCFP)等,本申请实施例对于具体算法类型不做限定,后文中为描述方便,简称为内存复用算法。
请参阅图6,为通过内存复用算法确定需要分配的内存块的规格和数量的示意图;
图中第一个算子层释放内存需求32B内存块后,可以将该内存块再次分配给第二个算子层的输出张量使用,由此,通过复用内存块可以节省部分内存。模拟神经网络推理过程,可以预计算需要分配的内存块的规格和数量,即两个内存块,内存块的尺寸分别为32B和64B,在由runtime一次性分配两个内存块。
由于现有技术中,运行神经网络模型前,在电子设备上通过内存复用算法预计算神经网络模型运行需要的内存,需花费大量的预计算时间,造成模型载入时间过长。
请参阅图7,为本申请实施例中数量处理方法的系统场景示意图。
本申请实施例提供的实现数据处理方法的装置包括模型优化装置和模型执行装置。下面结合图7对数量处理方法的系统场景进行介绍。
将经过训练的网络模型输入模型优化装置,模型优化装置可以解析该模型文件,获取数据流图,根据内存复用算法获取该网络模型运行时所需的总内存尺寸,模型优化装置将该总内存尺寸写入模型文件,生成优化后的网络模型,如图所示,模型优化装置将优化后的模型文件发送给模型执行装置,模型执行装置用于实现网络模型的推理计算,加载模型后,根据模型文件中携带的总内存尺寸进行一次运行时系统分配,获取总内存尺寸的内存空间,计算获取算子的张量的内存地址,输入需要基于网络模型进行推理运算的数据,根据张量的内存 地址进行算子调用,最终输出推理结果。
示例性的,经过大量图片数据集的训练后获取一个用于进行图像识别的网络模型,模型优化装置可以对该图像识别网络模型进行优化,计算其进行推理运算所需的总内存尺寸,并写入该模型,得到优化后的图像识别网络模型,将优化后的图像识别网络模型发送给模型执行装置,模型执行装置进行推理运算的过程即对用户输入的图像进行识别,例如识别图像内容为猫或狗等,在模型执行装置进行推理运算的过程之前,可以根据模型文件中的总内存尺寸进行内存分配,这样,可以减少模型执行装置载入时长以及内存分配的时间消耗。
在一种可能的实现场景中,模型优化装置为服务器,模型执行装置为终端,服务器包括云端服务器或计算服务器等,终端可以为各种形式的智能设备,可以理解的是,由于服务器的计算资源通常优于终端设备,由服务器实现网络模型优化的过程可以节省终端的计算资源,减少终端模型载入时长。
在另一种可能的实现场景中,模型优化装置为胖终端,模型执行装置为瘦终端。胖终端例如可以为手机、台式机、平板电脑、便携式计算机等,此处对于胖终端的具体设备类型不做限定。瘦终端例如可以为物联网(internet of things,IOT)终端设备,例如网络摄像头、智能语音助手、智能耳机,或智能穿戴设备等,此处对于瘦终端的具体设备类型不做限定。可以理解的是,胖终端的存储能力和计算能力优于瘦终端,本申请实施例提供的数据处理方法,由胖终端实现网络模型优化的过程,以节省瘦终端的计算资源,可以减少瘦终端模型载入时长。
下面请参阅图8和图10,分别对模型优化装置和模型执行装置实现的数据处理方法进行了具体介绍:
请参阅图8,为本申请实施例中数据处理方法的一个实施例示意图。
801、获取神经网络模型;
获取的神经网络模型,可以为一个神经网络模型或者多个神经网络模型,此处不做限定。神经网络模型可以由用户预先部署,通常以神经网络推理模型文件的形式存储,存储类型例如可以是开放神经网络交换(open neural network exchange,ONNX)格式或快速特征嵌入的卷积结构(convolutional architecture for fast feature embedding,CAFFE)格式等。
802、确定算子层的张量的内存块尺寸,以及算子层的运行顺序;
根据获取的神经网络模型确定张量的内存块尺寸,以及算子层的运行顺序。
可以通过模型文件解析器对神经网络模型进行解析,转变成数据流图,请参阅图9,为本申请实施例中数据处理方法的另一个实施例示意图,图9中的第一个步骤,即将模型文件转换成数据流图(dataflow)的表达形式,dataflow中的节点代表神经网络模型推理的算子层,节点之间的连线即边,代表算子层的张量,由图可知,一个算子层的输出张量即为下一个算子层的输入张量。
为确定张量的内存块尺寸,需要先获取张量的数据内存信息(Data Byte)和张量形状(Shape)信息,张量Shape信息包括张量的数量、高度、宽度和通道数。
张量的内存块尺寸的计算公式如下:
mem size=Batch×Height×Width×Channel×Data Byte
其中,mem size为张量的内存块尺寸,Batch代表张量的数量,Height代表张量的高度,Width代表张量的宽度,Channel代表张量的通道数,Data Byte代表张量的单元数据的内存大小需求。
算子层的运行顺序可以通过拓扑排序确定,排序算法不做限定,此外,可以通过算子层的索引(index)指示算子的运行顺序。
可选的,若通过图优化技术融合部分算子层,可以在算子层融合后确定算子层的运行顺序。
可选的,若存在多个神经网络模型,一方面,需要对每个神经网络模型进行解析,确定该模型的算子层排序。另一方面,根据用户提供的或预设的方案,确定多个神经网络模型之间的运行顺序,运行顺序包括串行或者并行,串行执行的多个神经网络模型,其算子层的运行顺序是确定的,并行执行的多个神经网络模型,其算子层的运行顺序之间没有限定关系。多个神经网络模型之间的具体运行顺序此处不做限定。
803、根据内存复用算法,确定内存空间的内存尺寸和张量的内存偏移;
确定张量的内存块尺寸和算子层的运行顺序之后,可以根据预设的内存复用算法,确定内存空间的内存尺寸和张量的内存偏移。该内存空间为神经网络模型运行所需要的内存,即所有张量所需的内存,本实施例中由所有张量所需的内存根据内存复用算法复用后得到;张量的内存偏移用于指示张量所属的内存块在该内存空间中的偏移。
需要说明的是,内存复用算法可以是一种或多种算法,此处不做限定。
具体地,模拟神经网络模型推理的过程,根据张量的内存块尺寸,以及算子层的运行顺序,按照预设的内存复用算法,将复用的内存块拼成一个内存空间,每个内存块用偏移(offset)表示。将每个张量的内存块偏移记录在数据流图中边的信息里。例如图8所示,内存空间的内存尺寸为192比特(bit,b),依据算子层的运行顺序,各张量所需内存块的偏移分别为0b、64b、96b、0b、128b、64b。
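下面给出一个将复用的内存块拼成同一内存空间并计算各张量偏移的简化示意代码(采用首次适应策略),其中的张量尺寸、生命周期与函数命名均为说明用途的假设,实际可替换为任意预设的内存复用算法:

```python
def assign_offsets(tensor_sizes, last_use):
    """模拟推理过程,为每个输出张量在同一内存空间中确定偏移(offset),
    并返回该内存空间所需的内存尺寸。"""
    live = []                 # 当前存活的内存块:(偏移, 尺寸, 释放时刻)
    offsets = []              # 各输出张量的内存偏移
    total = 0                 # 内存空间的内存尺寸
    for op, size in enumerate(tensor_sizes):
        # 回收生命周期已结束的张量
        live = [(o, s, t) for (o, s, t) in live if t > op]
        # 在内存空间中从低地址起寻找第一个不与存活张量重叠的空隙
        offset = 0
        for o, s, _ in sorted(live):
            if offset + size <= o:
                break
            offset = max(offset, o + s)
        offsets.append(offset)
        live.append((offset, size, last_use[op] + 1))
        total = max(total, offset + size)
    return total, offsets

# 假设4个输出张量依次需要64b、32b、64b、96b,且各输出张量只被下一层使用
print(assign_offsets([64, 32, 64, 96], last_use=[1, 2, 3, 4]))
# 例如得到(160, [0, 64, 0, 64]):内存空间尺寸为160b,第三个张量复用了偏移0处的内存块
```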
可选的,若存在多个神经网络模型,一方面,需要模拟每个神经网络模型推理过程,通过预设的内存复用算法,确定运行该神经网络模型推理过程所需的总内存空间的内存尺寸和每个算子层的输出张量的内存偏移。
另一方面,根据用户提供的或预设的方案,确定多个神经网络模型之间的运行顺序,运行顺序包括串行或者并行。串行执行的多个神经网络模型,取多个总内存空间中尺寸最大的一个,作为运行该多个神经网络模型的推理过程所需的内存尺寸;并行执行的多个神经网络模型,取多个总内存空间的内存尺寸之和,作为运行该多个神经网络模型的推理过程所需的内存尺寸。
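上述规则可以用如下示意代码表示(其中各模型的内存尺寸为假设值):

```python
def combine_model_mem(model_mem_sizes, parallel):
    """多个神经网络模型共用内存空间时的总内存尺寸:
    并行执行取各模型内存尺寸之和,串行执行取其中的最大值。"""
    return sum(model_mem_sizes) if parallel else max(model_mem_sizes)

# 假设两个模型各自需要192b和256b的内存空间
print(combine_model_mem([192, 256], parallel=False))   # 串行:256
print(combine_model_mem([192, 256], parallel=True))    # 并行:448
```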
804、在神经网络模型中写入内存空间的内存尺寸、算子层的运行顺序和张量的内存偏移;
将步骤803确定的内存空间的内存尺寸和每个张量的内存偏移记录在模型文件中,同时,神经网络模型中还记录步骤802确定的算子层的运行顺序。
将优化过的数据流图重新写入神经网络模型文件,该神经网络模型文件包括算子层的运行顺序,张量的内存偏移和内存空间的内存尺寸。可选的,文件还包括融合后替换的算子层。
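下面以一个简化的数据结构示意优化后模型文件中新增的信息,其中的字段命名、取值与序列化方式均为说明用途的假设,并非对模型文件格式的限定:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class OptimizedModelMeta:
    """写入优化后模型文件的附加信息(字段命名仅为示意)。"""
    total_mem_size: int                                   # 内存空间的内存尺寸
    op_order: list = field(default_factory=list)          # 算子层的运行顺序(index)
    tensor_offsets: dict = field(default_factory=dict)    # 张量名 -> 内存偏移

meta = OptimizedModelMeta(
    total_mem_size=192,
    op_order=[0, 1, 2, 3],
    tensor_offsets={"t0": 0, "t1": 64, "t2": 96, "t3": 0, "t4": 128, "t5": 64},
)
# 模型优化装置可将上述信息与优化后的数据流图一并序列化写入模型文件
print(json.dumps(asdict(meta), ensure_ascii=False))
```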
本申请实施例提供的方法,通过在网络模型文件中加入网络模型执行推理运算所需的内存空间的尺寸,以使得执行网络模型的设备可以直接根据网络模型文件实现内存一次性分配,减少了载入模型时进行内存复用计算的时长。
请参阅图10,为本申请实施例中数据处理方法的一个实施例示意图。
1001、获取神经网络模型;
终端可以从云端或计算服务器等计算设备获取神经网络模型,该神经网络模型可以为一个神经网络模型或者多个神经网络模型,此处不做限定。神经网络模型通常以神经网络推理模型文件的形式存储,存储类型可以是ONNX格式或CAFFE格式等。
1002、解析该神经网络模型,获取内存空间的内存尺寸、算子层的运行顺序和张量的内存偏移;
通过解析器将神经网络模型转换为可以执行的数据流图,请参阅图11,为本申请实施例中数据处理方法的另一个实施例示意图。获取算子层的运行顺序和每个张量的内存偏移,以及运行该神经网络模型所需的内存空间的内存尺寸。其中,张量的内存偏移用于指示张量的内存块在内存空间中的偏移位置。
1003、根据内存空间的内存尺寸,通过一次内存分配获取内存空间基地址;
根据步骤1002中获取的内存空间的内存尺寸,通过一次内存分配获取内存空间基地址,内存空间基地址可以简称内存基址。
1004、根据所述内存空间基地址和张量的内存偏移,获取张量的内存块地址;
算子层输出张量的内存地址,为内存空间基地址(@base)加上张量的内存偏移(offset)。根据内存空间基地址和每个算子层的输出张量的内存偏移,可以获取每个算子层的输出张量的内存地址,即@base+offset。例如,输入节点2的张量的内存偏移为96b,内存空间的基地址为@base,则可以确定该张量的内存块地址为@base+96b。
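下面给出根据内存空间基地址和张量的内存偏移计算内存块地址的简单示意代码,其中的基地址取值与张量名称均为假设:

```python
def tensor_addresses(base_addr, tensor_offsets):
    """根据内存空间基地址和各张量的内存偏移,计算每个张量的内存块地址(@base+offset)。"""
    return {name: base_addr + offset for name, offset in tensor_offsets.items()}

# 假设一次内存分配得到的基地址@base为0x1000,张量偏移沿用前文的示例值
print(tensor_addresses(0x1000, {"t0": 0, "t1": 64, "t2": 96}))
# {'t0': 4096, 't1': 4160, 't2': 4192}
```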
1005、根据所述算子层的运行顺序运行神经网络模型;
1006、一次释放内存;
当模型运行结束,一次释放该内存空间。
请参阅图12,为本申请实施例中模型优化装置的一个实施例示意图;
该模型优化装置,包括:
获取模块1201,用于获取神经网络模型;
确定模块1202,用于确定基于所述神经网络模型进行推理运算所需的内存空间的内存尺寸;
更新模块1203,用于更新所述神经网络模型,以得到目标神经网络模型,所述目标神经网络模型携带指示所述内存尺寸的信息。
可选的,所述确定模块1202具体用于:确定算子层的运行顺序和张量的内存块的尺寸,所述算子层为基于所述神经网络模型进行推理运算的基础计算单元,所述张量为所述算子层的输入或输出;根据所述内存块的尺寸和所述运行顺序,通过预设的内存复用算法确定所述内存尺寸和所述内存块在所述内存空间中的位置,所述位置用于指示所述内存块。
可选的,所述目标神经网络模型中携带所述算子层的运行顺序,和张量的内存偏移,所述内存偏移用于指示所述内存块在所述内存空间中的位置。
可选的,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述确定模块1202具体用于:确定所述第一神经网络模型推理运算所需的第一内存空间的第一内存尺寸,以及所述第二神经网络模型推理运算所需的第二内存空间的第二内存尺寸;根据所述第一内存尺寸和所述第二内存尺寸确定所述内存空间的内存尺寸。
可选的,所述确定模块1202还用于:确定所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行;若所述运行顺序为串行,所述内存空间的内存尺寸为所述第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸;若所述运行顺序为并行,所述内存空间的内存尺寸为所述第一内存尺寸和所述第二内存尺寸之和。
请参阅图13,为本申请实施例中模型执行装置的一个实施例示意图;
该模型执行装置,包括:
获取模块1301,用于获取神经网络模型,所述神经网络模型中携带指示内存尺寸的信息,所述内存尺寸的信息用于指示所述神经网络模型推理运算所需的总内存空间;
分配模块1302,用于在内存中分配尺寸大于或等于所述内存尺寸的总内存空间;
运算模块1303,用于基于分配的所述总内存空间,根据所述神经网络模型进行推理运算。
可选的,所述神经网络模型中还包括:算子层的运行顺序和张量的内存偏移,所述算子层为所述神经网络模型的推理运算的基础计算单元,所述张量为所述算子层的输入或输出,所述内存偏移用于指示所述张量的内存块在所述总内存空间中的位置。
可选的,所述分配模块1302具体用于:通过运行时系统确定所述总内存空间的基地址。
可选的,所述装置还包括:确定模块1304,用于根据所述总内存空间的基地址和所述张量的内存偏移,确定所述张量的内存块在所述总内存空间中的位置。
可选的,所述内存尺寸通过内存复用算法确定。
可选的,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述神经网络模型中还包括:所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行。
可选的,若所述运行顺序为串行,所述总内存空间的内存尺寸为第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸,所述第一内存尺寸为所述第一神经网络模型推理运算所需的第一内存空间的尺寸,所述第二内存尺寸为所述第二神经网络模型推理运算所需的第二内存空间的尺寸;若所述运行顺序为并行,所述总内存空间的内存尺寸为第一内存尺寸和第二内存尺寸之和。
可选的,所述装置还包括:
释放模块1305,用于当所述推理运算结束后,一次释放所述总内存空间。
请参阅图14,为本申请实施例中模型优化装置的一个实施例示意图。
本实施例提供的模型优化装置,可以为服务器或者终端等设备,本申请实施例中对其具体设备形态不做限定。
该模型优化装置1400可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器1401和存储器1402,该存储器1402中存储有程序或数据。
其中,存储器1402可以是易失性存储或非易失性存储。可选地,处理器1401是一个或多个中央处理器(CPU,Central Processing Unit),也可以是专用处理器,例如图形处理器(graphics processing unit,GPU)或神经网络处理器(neural processing unit,NPU), 还可以是集成一个或多个CPU、一个或多个GPU、一个或多个NPU的芯片系统(system on a chip,SoC),该CPU可以是单核CPU,也可以是多核CPU。处理器1401可以与存储器1402通信,在模型优化装置1400上执行存储器1402中的一系列指令。
该模型优化装置1400还包括一个或一个以上有线或无线网络接口1403,例如以太网接口。
可选地,尽管图14中未示出,模型优化装置1400还可以包括一个或一个以上电源;一个或一个以上输入输出接口,输入输出接口可以用于连接显示器、鼠标、键盘、触摸屏设备或传感设备等,输入输出接口为可选部件,可以存在也可以不存在,此处不做限定。
本实施例中模型优化装置1400中的处理器1401所执行的流程可以参考前述方法实施例中描述的方法流程,此处不加赘述。
请参阅图15,为本申请实施例中模型执行装置的一个实施例示意图。
本实施例提供的模型执行装置,可以为IOT等终端设备,本申请实施例中对其具体设备形态不做限定。
该模型执行装置1500可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器1501和存储器1502,该存储器1502中存储有程序或数据。
其中,存储器1502可以是易失性存储或非易失性存储。可选地,处理器1501是一个或多个中央处理器(CPU,Central Processing Unit),也可以是专用处理器,例如图形处理器(graphics processing unit,GPU)或神经网络处理器(neural processing unit,NPU),还可以是集成一个或多个CPU、一个或多个GPU、一个或多个NPU的芯片系统(system on a chip,SoC),该CPU可以是单核CPU,也可以是多核CPU。处理器1501可以与存储器1502通信,在模型执行装置1500上执行存储器1502中的一系列指令。
该模型执行装置1500还包括一个或一个以上有线或无线网络接口1503,例如以太网接口。
可选地,尽管图15中未示出,模型执行装置1500还可以包括一个或一个以上电源;一个或一个以上输入输出接口,输入输出接口可以用于连接显示器、鼠标、键盘、触摸屏设备或传感设备等,输入输出接口为可选部件,可以存在也可以不存在,此处不做限定。
本实施例中模型执行装置1500中的处理器1501所执行的流程可以参考前述方法实施例中描述的方法流程,此处不加赘述。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (31)

  1. 一种数据处理方法,其特征在于,包括:
    获取神经网络模型;
    确定基于所述神经网络模型进行推理运算所需的内存空间的内存尺寸;
    更新所述神经网络模型,以得到目标神经网络模型,所述目标神经网络模型携带指示所述内存尺寸的信息。
  2. 根据权利要求1所述的方法,其特征在于,所述确定所述神经网络模型推理运算所需的内存空间的内存尺寸包括:
    确定算子层的运行顺序和张量的内存块的尺寸,所述算子层为基于所述神经网络模型进行推理运算的基础计算单元,所述张量为所述算子层的输入或输出;
    根据所述内存块的尺寸和所述运行顺序,通过预设的内存复用算法确定所述内存尺寸和所述内存块在所述内存空间中的位置,所述位置用于指示所述内存块。
  3. 根据权利要求2所述的方法,其特征在于,
    所述目标神经网络模型中携带所述算子层的运行顺序,和张量的内存偏移,所述内存偏移用于指示所述内存块在所述内存空间中的位置。
  4. 根据权利要求1至3中任一项所述的方法,其特征在于,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述确定所述神经网络模型推理运算所需的内存空间的内存尺寸包括:
    确定所述第一神经网络模型推理运算所需的第一内存空间的第一内存尺寸,以及所述第二神经网络模型推理运算所需的第二内存空间的第二内存尺寸;
    根据所述第一内存尺寸和所述第二内存尺寸确定所述内存空间的内存尺寸。
  5. 根据权利要求4所述的方法,其特征在于,所述方法还包括:
    确定所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行;
    若所述运行顺序为串行,所述内存空间的内存尺寸为所述第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸;
    若所述运行顺序为并行,所述内存空间的内存尺寸为所述第一内存尺寸和所述第二内存尺寸之和。
  6. 一种数据处理方法,其特征在于,包括:
    获取神经网络模型,所述神经网络模型中携带指示内存尺寸的信息,所述内存尺寸的信息用于指示所述神经网络模型推理运算所需的总内存空间;
    在内存中分配尺寸大于或等于所述内存尺寸的总内存空间;
    基于分配的所述总内存空间,根据所述神经网络模型进行推理运算。
  7. 根据权利要求6所述的方法,其特征在于,所述神经网络模型中还包括:
    算子层的运行顺序和张量的内存偏移,所述算子层为所述神经网络模型的推理运算的基础计算单元,所述张量为所述算子层的输入或输出,所述内存偏移用于指示所述张量的内存块在所述总内存空间中的位置。
  8. 根据权利要求6或7所述的方法,其特征在于,所述在内存中分配尺寸大于或等于所述内存尺寸的总内存空间包括:
    通过运行时系统确定所述总内存空间的基地址。
  9. 根据权利要求8所述的方法,其特征在于,所述方法还包括:
    根据所述总内存空间的基地址和所述张量的内存偏移,确定所述张量的内存块在所述总内存空间中的位置。
  10. 根据权利要求6至9中任一项所述的方法,其特征在于,所述内存尺寸通过内存复用算法确定。
  11. 根据权利要求6至10中任一项所述的方法,其特征在于,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述神经网络模型中还包括:
    所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行。
  12. 根据权利要求11所述的方法,其特征在于,
    若所述运行顺序为串行,所述总内存空间的内存尺寸为第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸,所述第一内存尺寸为所述第一神经网络模型推理运算所需的第一内存空间的尺寸,所述第二内存尺寸为所述第二神经网络模型推理运算所需的第二内存空间的尺寸;
    若所述运行顺序为并行,所述总内存空间的内存尺寸为第一内存尺寸和第二内存尺寸之和。
  13. 根据权利要求6至12中任一项所述的方法,其特征在于,所述方法还包括:
    当所述推理运算结束后,一次释放所述总内存空间。
  14. 一种模型优化装置,其特征在于,包括:
    获取模块,用于获取神经网络模型;
    确定模块,用于确定基于所述神经网络模型进行推理运算所需的内存空间的内存尺寸;
    更新模块,用于更新所述神经网络模型,以得到目标神经网络模型,所述目标神经网络模型携带指示所述内存尺寸的信息。
  15. 根据权利要求14所述的装置,其特征在于,所述确定模块具体用于:
    确定算子层的运行顺序和张量的内存块的尺寸,所述算子层为基于所述神经网络模型进行推理运算的基础计算单元,所述张量为所述算子层的输入或输出;
    根据所述内存块的尺寸和所述运行顺序,通过预设的内存复用算法确定所述内存尺寸和所述内存块在所述内存空间中的位置,所述位置用于指示所述内存块。
  16. 根据权利要求15所述的装置,其特征在于,所述目标神经网络模型中携带所述算子层的运行顺序,和张量的内存偏移,所述内存偏移用于指示所述内存块在所述内存空间中的位置。
  17. 根据权利要求14至16中任一项所述的装置,其特征在于,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述确定模块具体用于:
    确定所述第一神经网络模型推理运算所需的第一内存空间的第一内存尺寸,以及所述第二神经网络模型推理运算所需的第二内存空间的第二内存尺寸;
    根据所述第一内存尺寸和所述第二内存尺寸确定所述内存空间的内存尺寸。
  18. 根据权利要求17所述的装置,其特征在于,所述确定模块还用于:
    确定所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行;
    若所述运行顺序为串行,所述内存空间的内存尺寸为所述第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸;
    若所述运行顺序为并行,所述内存空间的内存尺寸为所述第一内存尺寸和所述第二内存尺寸之和。
  19. 一种模型执行装置,其特征在于,包括:
    获取模块,用于获取神经网络模型,所述神经网络模型中携带指示内存尺寸的信息,所述内存尺寸的信息用于指示所述神经网络模型推理运算所需的总内存空间;
    分配模块,用于在内存中分配尺寸大于或等于所述内存尺寸的总内存空间;
    运算模块,用于基于分配的所述总内存空间,根据所述神经网络模型进行推理运算。
  20. 根据权利要求19所述的装置,其特征在于,所述神经网络模型中还包括:算子层的运行顺序和张量的内存偏移,所述算子层为所述神经网络模型的推理运算的基础计算单元,所述张量为所述算子层的输入或输出,所述内存偏移用于指示所述张量的内存块在所述总内存空间中的位置。
  21. 根据权利要求19或20所述的装置,其特征在于,所述分配模块具体用于:
    通过运行时系统确定所述总内存空间的基地址。
  22. 根据权利要求21所述的装置,其特征在于,所述装置还包括:
    确定模块,用于根据所述总内存空间的基地址和所述张量的内存偏移,确定所述张量的内存块在所述总内存空间中的位置。
  23. 根据权利要求19至22中任一项所述的装置,其特征在于,所述内存尺寸通过内存复用算法确定。
  24. 根据权利要求19至23中任一项所述的装置,其特征在于,若所述神经网络模型包括第一神经网络模型和第二神经网络模型,所述神经网络模型中还包括:所述第一神经网络模型和所述第二神经网络模型的运行顺序,所述运行顺序包括串行或并行。
  25. 根据权利要求24所述的装置,其特征在于,若所述运行顺序为串行,所述总内存空间的内存尺寸为第一内存尺寸,所述第一内存尺寸大于或等于所述第二内存尺寸,所述第一内存尺寸为所述第一神经网络模型推理运算所需的第一内存空间的尺寸,所述第二内存尺寸为所述第二神经网络模型推理运算所需的第二内存空间的尺寸;
    若所述运行顺序为并行,所述总内存空间的内存尺寸为第一内存尺寸和第二内存尺寸之和。
  26. 根据权利要求19至25中任一项所述的装置,其特征在于,所述装置还包括:
    释放模块,用于当所述推理运算结束后,一次释放所述总内存空间。
  27. 一种模型优化装置,其特征在于,包括:
    存储器,用于存储指令;
    处理器,用于执行所述存储器中的指令,使得所述模型优化装置执行权利要求1至5中任一项所述的方法。
  28. 一种模型执行装置,其特征在于,包括:
    存储器,用于存储指令;
    处理器,用于执行所述存储器中的指令,使得所述模型执行装置执行权利要求6至13中任一项所述的方法。
  29. 一种计算机系统,其特征在于,包括权利要求14至18中任一项所述的模型优化装置,和权利要求19至26中任一项所述的模型执行装置。
  30. 一种计算机程序产品,其特征在于,所述计算机程序产品包括指令,当所述指令在计算机上运行时,使得所述计算机执行权利要求1至13中任一项所述的方法。
  31. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储指令,当所述指令在计算机上运行时,使得所述计算机执行权利要求1至13中任一项所述的方法。
PCT/CN2020/116183 2019-09-18 2020-09-18 数据处理方法、模型优化装置和模型执行装置 WO2021052460A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910883288.4 2019-09-18
CN201910883288.4A CN112529169B (zh) 2019-09-18 2019-09-18 数据处理方法、模型优化装置和模型执行装置

Publications (1)

Publication Number Publication Date
WO2021052460A1 true WO2021052460A1 (zh) 2021-03-25

Family

ID=74883918

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116183 WO2021052460A1 (zh) 2019-09-18 2020-09-18 数据处理方法、模型优化装置和模型执行装置

Country Status (2)

Country Link
CN (1) CN112529169B (zh)
WO (1) WO2021052460A1 (zh)


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022198636A1 (zh) * 2021-03-26 2022-09-29 珠海全志科技股份有限公司 Ai处理器的内存分配方法、计算机装置及计算机可读存储介质
CN113608881B (zh) * 2021-10-09 2022-02-25 腾讯科技(深圳)有限公司 内存分配方法、装置、设备、可读存储介质及程序产品
CN113886090A (zh) * 2021-10-22 2022-01-04 哲库科技(北京)有限公司 内存分配方法及装置、设备、存储介质
CN114492775A (zh) * 2022-01-13 2022-05-13 哲库科技(上海)有限公司 一种数据处理方法、装置、神经网络加速器及存储介质
CN114118389B (zh) * 2022-01-28 2022-05-10 深圳鲲云信息科技有限公司 神经网络数据处理方法、设备及存储介质
CN114237918B (zh) 2022-02-28 2022-05-27 之江实验室 一种面向神经网络模型计算的图执行方法和装置
CN116757284A (zh) * 2022-09-26 2023-09-15 荣耀终端有限公司 模型推理方法、设备、存储介质和程序产品
WO2024108907A1 (zh) * 2022-11-25 2024-05-30 成都登临科技有限公司 一种数据处理方法、装置、ai芯片、电子设备及存储介质
CN116992966B (zh) * 2023-09-28 2024-01-16 深圳鲲云信息科技有限公司 用于人工智能模型推理平台的方法及计算设备
CN117667424A (zh) * 2023-12-21 2024-03-08 摩尔线程智能科技(北京)有限责任公司 内存管理方法、装置和存储介质
CN117992078B (zh) * 2024-04-03 2024-07-12 山东浪潮科学研究院有限公司 一种基于TensorRT-LLM模型推理加速服务的自动化部署方法


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10956500B2 (en) * 2017-01-19 2021-03-23 Google Llc Dynamic-length stateful tensor array
CN109558942B (zh) * 2018-11-20 2021-11-26 电子科技大学 一种基于浅度学习的神经网络迁移方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577480A (zh) * 2012-08-07 2014-02-12 中国银联股份有限公司 一种参数划分系统及其方法、一种业务处理系统及其方法
CN109491784A (zh) * 2018-10-18 2019-03-19 北京旷视科技有限公司 降低内存占用量的方法、装置、电子设备、可读存储介质
CN109407997A (zh) * 2018-11-09 2019-03-01 长沙理工大学 一种数据处理方法、装置、设备及可读存储介质
CN109558248A (zh) * 2018-12-11 2019-04-02 中国海洋大学 一种用于确定面向海洋模式计算的资源分配参数的方法及系统

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469328A (zh) * 2021-06-24 2021-10-01 上海寒武纪信息科技有限公司 执行转数穿过的装置、板卡、方法及可读存储介质
CN113469328B (zh) * 2021-06-24 2024-03-19 上海寒武纪信息科技有限公司 执行转数穿过的装置、板卡、方法及可读存储介质
CN113778459A (zh) * 2021-09-08 2021-12-10 北京航空航天大学杭州创新研究院 一种在fpga和dsp上部署优化的算子库设计方法
CN114201298A (zh) * 2021-12-15 2022-03-18 建信金融科技有限责任公司 内存管理方法、装置、电子设备及存储介质
CN114298272A (zh) * 2021-12-23 2022-04-08 安谋科技(中国)有限公司 神经网络模型的构建方法、图像处理方法、设备及介质
CN115080240A (zh) * 2022-06-29 2022-09-20 美的集团(上海)有限公司 语音处理模型的部署方法、电子设备及存储介质
CN115080240B (zh) * 2022-06-29 2023-10-10 美的集团(上海)有限公司 语音处理模型的部署方法、电子设备及存储介质

Also Published As

Publication number Publication date
CN112529169B (zh) 2024-08-13
CN112529169A (zh) 2021-03-19

Similar Documents

Publication Publication Date Title
WO2021052460A1 (zh) 数据处理方法、模型优化装置和模型执行装置
WO2021098509A1 (zh) 神经网络联合编译的方法、装置和电子设备
WO2022262167A1 (zh) 集群资源调度方法及装置、电子设备和存储介质
US10423445B2 (en) Composing and executing workflows made up of functional pluggable building blocks
CN104008064B (zh) 用于多级存储器压缩的方法和系统
US11570272B2 (en) Provisioning using pre-fetched data in serverless computing environments
TWI620075B (zh) 用於雲端巨量資料運算架構之伺服器及其雲端運算資源最佳化方法
US20190324810A1 (en) Method, device and computer readable medium for scheduling dedicated processing resource
TWI798618B (zh) 記憶體分配方法、裝置、及電子設備
CN110413776B (zh) 一种基于cpu-gpu协同并行的文本主题模型lda高性能计算方法
US20230316450A1 (en) Model processing method and apparatus, device, and computer-readable storage medium
WO2023082644A1 (zh) 网络模型处理方法、装置、设备、存储介质及计算机程序产品
Wang et al. An efficient image aesthetic analysis system using Hadoop
CN114118433A (zh) 一种设备的配置参数的推荐方法及装置
US20230316089A1 (en) Ngraph-based gpu backend distributed training method and system
US11080640B1 (en) Systems and methods for managing organizational structures
Risco-Martin et al. A parallel evolutionary algorithm to optimize dynamic memory managers in embedded systems
WO2024040844A1 (zh) 模型调试方法、装置、电子设备及存储介质
CN114741332A (zh) 内存访问方法和装置、计算设备及存储介质
TWI545453B (zh) 分散式系統及其資料庫管理方法及管理系統
KR101558807B1 (ko) 호스트 프로세서와 협업 프로세서 간에 협업 처리를 위한 프로세서 스케줄링 방법 및 그 방법을 수행하는 호스트 프로세서
Marques et al. A cloud computing based framework for general 2D and 3D cellular automata simulation
CN112232027A (zh) 一种符号翻译方法、装置、设备和计算机可读存储介质
KR102671573B1 (ko) 질의 의도 분류를 통한 딥러닝 모델 리소스 할당 시스템
WO2024031968A1 (zh) 一种因子计算方法、装置及计算设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20864531

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20864531

Country of ref document: EP

Kind code of ref document: A1