CN112862074A - Model operation method and device, electronic equipment and storage medium - Google Patents

Model operation method and device, electronic equipment and storage medium

Info

Publication number
CN112862074A
Authority
CN
China
Prior art keywords
module
result
network model
point
neural network
Prior art date
Legal status
Pending
Application number
CN202110168983.XA
Other languages
Chinese (zh)
Inventor
谭志鹏
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202110168983.XA
Publication of CN112862074A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods

Abstract

The application discloses a model operation method and device, electronic equipment and a storage medium. The model operation method comprises the following steps: when a neural network model is operated, obtaining a first result processed by a deep convolution module in a target operator module of the neural network model, wherein the target operator module is obtained by fusing the deep convolution module and a point-by-point convolution module which are sequentially connected; caching the first result into an allocated cache space; processing the cached first result according to the point-by-point convolution module in the target operator module to obtain a second result; and determining an output result of the neural network model according to the second result. The method can improve the data access efficiency in the running process of the neural network model and improve the performance of the neural network model.

Description

Model operation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model operation method, an apparatus, an electronic device, and a storage medium.
Background
Deep Neural Networks (DNNs) are increasingly used in a variety of applications such as speech recognition, target detection, semantic segmentation, and the like. However, as neural network technology continues to develop, the numbers of neurons and synapses in neural network models increase exponentially, and the operation duration increases rapidly. It is therefore necessary to optimize the neural network model.
Disclosure of Invention
In view of the foregoing problems, the present application provides a model operating method, an apparatus, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present application provides a model running method, where the method includes: when a neural network model is operated, obtaining a first result processed by a deep convolution module in a target operator module of the neural network model, wherein the target operator module is obtained by fusing the deep convolution module and a point-by-point convolution module which are sequentially connected; caching the first result into an allocated cache space; processing the cached first result according to the point-by-point convolution module in the target operator module to obtain a second result; and determining an output result of the neural network model according to the second result.
In a second aspect, an embodiment of the present application provides a model running apparatus, where the apparatus includes: a first obtaining module, configured to obtain, when a neural network model is run, a first result processed by a deep convolution module in a target operator module of the neural network model, the target operator module being obtained by fusing the deep convolution module and a point-by-point convolution module which are sequentially connected; a result caching module, configured to cache the first result in an allocated cache space; a second obtaining module, configured to process the cached first result according to the point-by-point convolution module in the target operator module to obtain a second result; and a result determining module, configured to determine an output result of the neural network model according to the second result.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the model running method provided by the first aspect described above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code may be called by a processor to execute the model running method provided in the first aspect.
According to the scheme provided by the application, when a neural network model is operated, a first result processed by a deep convolution module in a target operator module of the neural network model is obtained, and the target operator module is obtained by fusing the deep convolution module and a point-by-point convolution module which are sequentially connected; the first result is cached in an allocated cache space; the cached first result is processed according to the point-by-point convolution module in the target operator module to obtain a second result; and an output result of the neural network model is determined according to the second result. According to the method and the device, on the basis that the target operator module obtained by fusing the deep convolution and the point-by-point convolution is deployed in the neural network model, the intermediate calculation result of the target operator module is cached, so that the intermediate calculation result can be directly read from the cache space for subsequent calculation, memory reads and writes that incur high performance overhead and high power consumption are reduced, the data access efficiency is improved, and the operation performance of the neural network model is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 shows a flow diagram of a model operation method according to one embodiment of the present application.
Fig. 2 is a schematic diagram illustrating a calculation process of a conventional convolution operation provided in the present application.
Fig. 3 is a schematic diagram illustrating a calculation process of another conventional convolution operation provided in the present application.
Fig. 4 is an overall schematic diagram illustrating a model operation method provided in the present application.
FIG. 5 shows a flow diagram of a model operation method according to another embodiment of the present application.
Fig. 6 shows a schematic diagram of data representation in a neural network model provided by the present application.
Fig. 7 shows a schematic diagram of a sliding window provided by the present application.
Fig. 8 is a schematic diagram illustrating a data memory layout in a neural network model provided in the present application.
FIG. 9 is a schematic diagram illustrating a data memory layout in another neural network model provided herein.
Fig. 10 shows a schematic diagram of memory access under NCHW by deep convolution according to the present application.
Fig. 11 shows a schematic diagram of memory access under NCHW by point-by-point convolution according to the present application.
FIG. 12 shows a flow chart of a method of model operation according to yet another embodiment of the present application.
Fig. 13 shows a flowchart of step S310 in a model running method according to yet another embodiment of the present application.
Fig. 14 shows a schematic diagram of a fused network structure provided by the present application.
FIG. 15 shows a block diagram of a model execution apparatus according to one embodiment of the present application.
Fig. 16 is a block diagram of an electronic device according to an embodiment of the present application for executing a model running method according to an embodiment of the present application.
Fig. 17 is a storage unit according to an embodiment of the present application, configured to store or carry program code for implementing a model running method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
At present, the types of electronic devices on which deep learning neural networks can be deployed are increasing. For example, the mobile terminal side (such as a smartphone) is limited by factors such as computing power and memory, and can deploy lightweight deep learning neural networks such as MobileNetV1, MobileNetV2, DeepLabV3 and YOLOv3-Tiny. However, as the network structure of the neural network becomes more complex, when the deep learning neural network is deployed on the side of the related electronic device (such as a mobile terminal), how to increase the inference speed of the deep learning neural network becomes more and more important.
The inventor has found through long-term research that, in lightweight deep learning neural networks, the general convolution (Convolution) is usually replaced by a depthwise separable convolution (Depthwise Separable Convolution) to reduce the calculation amount of the model. The depthwise separable convolution may include a depthwise convolution (Depthwise Convolution) and a pointwise convolution (Pointwise Convolution). Therefore, when model inference is performed in a layer-by-layer manner in a deep learning inference framework, the Pointwise Convolution is actually calculated only after all the results of the Depthwise Convolution have been calculated. However, for high-performance AI (Artificial Intelligence) accelerators, this generally leads to inefficient performance.
Specifically, although the computation speed of a high-performance AI accelerator is fast, after the Depthwise Convolution computation is completed, the computation result needs to be completely flushed back to the DDR (Double Data Rate SDRAM), and then all the data needs to be loaded from the DDR when the Pointwise Convolution is computed; writing and then reading the data through the DDR once makes memory access slow, thereby affecting the inference performance of the model. In addition, the read-write power consumption of the DDR memory is high, so the power consumption overhead of the model is also high.
Therefore, the inventor proposes the model operation method, the apparatus, the electronic device and the storage medium provided in the embodiments of the present application, which can fuse the Depthwise Convolution and the Pointwise Convolution into one operator, cache the computation result of the Depthwise Convolution in a temporary memory, and then read it directly from the cache when the Pointwise Convolution is computed, thereby reducing memory reads and writes of the DDR and improving the computation performance of the model. The specific model operation method is described in detail in the following embodiments.
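For illustration only, the following minimal sketch (in Python with NumPy; all names and shapes are assumptions, not the patent's implementation) shows the idea of the fused operator: the Depthwise Convolution result for each output position is kept in a small local buffer and consumed immediately by the Pointwise Convolution, instead of being written back to the DDR between the two layers.

```python
import numpy as np

def fused_depthwise_pointwise(x, dw_kernel, pw_kernel):
    """x: (C, H, W) input; dw_kernel: (C, 3, 3); pw_kernel: (O, C), i.e. O kernels of 1x1xC."""
    C, H, W = x.shape
    out = np.zeros((pw_kernel.shape[0], H - 2, W - 2), dtype=x.dtype)
    for i in range(H - 2):
        for j in range(W - 2):
            # Depthwise Convolution: one 3x3 kernel per channel; the result stays in
            # a small C-element buffer instead of being flushed back to DDR
            buf = np.array([(x[c, i:i+3, j:j+3] * dw_kernel[c]).sum() for c in range(C)])
            # Pointwise Convolution: the buffered result is reused by all O 1x1 kernels
            out[:, i, j] = pw_kernel @ buf
    return out
```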
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a model operation method according to an embodiment of the present application. The model operation method can be applied to electronic equipment, wherein the electronic equipment can be a terminal device capable of operating a neural network model, such as a PC (personal computer), a mobile terminal, a server and the like. The following will describe a specific flow of the present embodiment by taking an electronic device as an example. The model operation method shown in fig. 1 may specifically include the following steps:
step S110: when a neural network model is operated, a first result processed by a deep convolution module in a target operator module of the neural network model is obtained, and the target operator module is obtained by fusing the deep convolution module and a point-by-point convolution module which are sequentially connected.
In this embodiment, the neural network model may include a target operator module, and the target operator module is the processing module corresponding to a new operator generated after fusing a depth convolution (Depthwise Convolution) module and a point-by-point convolution (Pointwise Convolution) module which are sequentially connected. An operator can be regarded as a part of an algorithm process in a neural network model; an operator can map a function to a function, or map a function to a number, and the processing module corresponding to an operator implements the algorithm process corresponding to that operator. In some embodiments, the sequentially connected depth convolution module and point-by-point convolution module may be the depth convolution module and the point-by-point convolution module in a depthwise separable convolution network structure.
It is understood that the depthwise convolution (Depthwise Convolution) can be used to extract features; its structure is similar to that of the conventional convolution operation, but its parameter amount and operation cost are lower than those of the conventional convolution operation.
Referring to fig. 2, fig. 2 is a schematic diagram of a conventional convolution operation. Specifically, for a 5 × 5 pixel, 3-channel color input picture (shape 5 × 5 × 3), a conventional convolution layer with 3 × 3 convolution kernels (Filters) performs the conventional convolution operation. If the number of output channels is 4, then, since the number of channels of each convolution kernel in the conventional convolution operation is the same as the number of channels of the input Feature Map, the shape of the convolution kernels is 3 × 3 × 3 × 4 (4 kernels of 3 × 3 × 3), and finally 4 Feature Maps are output. If 'same' padding is used, the size of each output Feature Map is the same as that of the input layer (5 × 5); otherwise, the size of each output Feature Map is 3 × 3.
That is, each convolution kernel of the conventional convolution operation operates on every channel of the input picture simultaneously. Unlike the conventional convolution operation, in the depthwise convolution (Depthwise Convolution) one convolution kernel is responsible for one channel, and one channel is convolved by only one convolution kernel.
Referring to fig. 3, the left part of fig. 3 shows a schematic diagram of the Depthwise Convolution computation process. Specifically, for the same 5 × 5 pixel, 3-channel color input picture, the Depthwise Convolution is performed first. Unlike the conventional convolution operation, the Depthwise Convolution is performed entirely in a two-dimensional plane, and the number of convolution kernels is the same as the number of channels of the previous layer (channels and convolution kernels are in one-to-one correspondence). Therefore, 3 Feature Maps are generated after operating on the 3-channel image. It can be seen that, for the same number of generated Feature Maps, the amount of computation required by the Depthwise Convolution is much smaller than that of the conventional convolution operation.
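As a concrete illustration of the Depthwise Convolution just described, the following hedged Python/NumPy sketch uses random data and no padding; these are assumptions made only for the example:

```python
import numpy as np

x = np.random.rand(3, 5, 5)     # 3-channel, 5x5 input picture
k = np.random.rand(3, 3, 3)     # one 3x3 convolution kernel per channel
fmap = np.zeros((3, 3, 3))      # 3 Feature Maps of size 3x3 (no padding)
for c in range(3):              # each channel is convolved only by its own kernel
    for i in range(3):
        for j in range(3):
            fmap[c, i, j] = (x[c, i:i+3, j:j+3] * k[c]).sum()
```

For the same number of generated Feature Maps (3 here), this uses 3 × 3 × 3 = 27 kernel weights, while a conventional convolution producing 3 maps from the same 3-channel input would need 3 × 3 × 3 × 3 = 81.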
Since the Depthwise Convolution performs the convolution operation on each channel of the input layer independently, the feature information of different channels at the same spatial position is not effectively utilized. Therefore, after the Depthwise Convolution, a layer of point-by-point convolution (Pointwise Convolution) generally needs to be connected to perform a weighted combination, in the depth direction, of the Feature Maps generated by the Depthwise Convolution, so as to generate new Feature Maps. In some embodiments, this structure in which the Depthwise Convolution and the Pointwise Convolution are sequentially connected may also be referred to as a depthwise separable convolution network structure.
The operation of the Pointwise Convolution is very similar to the conventional convolution operation; the size of each convolution kernel is 1 × 1 × M, where M is the number of channels of the previous layer. That is, there are as many output Feature Maps as there are convolution kernels. Illustratively, referring again to fig. 3, fig. 3 shows a schematic diagram of a depthwise separable convolution network structure, where the right part of fig. 3 shows a schematic diagram of the Pointwise Convolution computation process. The Pointwise Convolution takes the 3 Feature Maps generated by the Depthwise Convolution and, after a weighted combination with 4 convolution kernels of size 1 × 1, generates 4 new Feature Maps.
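The corresponding Pointwise Convolution step can be sketched as follows (illustrative Python/NumPy only; the 4 kernels are random placeholders):

```python
import numpy as np

fmap = np.random.rand(3, 3, 3)                 # the 3 Feature Maps from the depthwise step
w = np.random.rand(4, 3)                       # 4 pointwise kernels, each of size 1x1x3
new_fmap = np.einsum('oc,chw->ohw', w, fmap)   # weighted combination along the channel axis
print(new_fmap.shape)                          # (4, 3, 3): 4 new Feature Maps
```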
At present, after all the results of the Depthwise Convolution are calculated, the calculation results need to be completely flushed back to the DDR; when the Pointwise Convolution is calculated, all the data then needs to be loaded from the DDR, and writing and reading the data through the DDR makes memory access slow. Therefore, in the embodiment of the application, the result calculated by the Depthwise Convolution can be cached in the cache space and then read directly from the cache space when the Pointwise Convolution is calculated, so that DDR reads and writes are reduced and memory access efficiency is improved. It can be understood that, for the same amount of data read and written, the power consumption of reading and writing memory through the DDR is far larger than that of reading and writing memory in the cache space, so the performance overhead and the power consumption overhead brought by reading and writing the DDR can be reduced at the same time.
Because the Depthwise Convolution and the Pointwise Convolution are currently independent processing modules, the caching and reading of the calculation result through the cache space cannot be directly and effectively controlled (generally, extra program code would need to be added to implement the caching and the reading respectively, which makes the model more complicated).
Specifically, in this embodiment of the present application, a neural network model having the target operator module may be obtained first, and then, when the neural network model is run, a first result processed by the deep convolution module in the target operator module, that is, a first result after the depthwise convolution (Depthwise Convolution) operation is performed, may be obtained, so that the first result is cached in a cache space without being written into the DDR. The first result can be understood as a feature map generated after the deep convolution operation, and the feature map includes at least one piece of feature data.
It can be understood that after the input data of the target operator module is obtained, the input data may be input to the target operator module, and the depth convolution module in the target operator module may perform the depth convolution operation on the input data, so as to obtain a first result after the depth convolution operation processing by the depth convolution module.
Step S120: and caching the first result into an allocated cache space.
In this embodiment of the present application, after obtaining the first result processed by the deep convolution module in the target operator module, the first result may be stored in the allocated cache space. The cache space is a storage space in the memory used for temporarily storing data.
The Cache space may be the storage space of a Cache memory. The Cache may be divided into a first-level Cache (L1 Cache), a second-level Cache (L2 Cache), a third-level Cache (L3 Cache), and the like, and the cache space may be the storage space of any one of these cache levels; the cache space is not specifically limited herein, as long as the power consumption of memory reads and writes in the cache space is less than the power consumption of memory reads and writes in the DDR. Illustratively, the allocated Cache space may be the storage space of the third-level Cache of an ARMv8 Cortex-A family CPU chip.
Step S130: and processing the cached first result according to the point-by-point convolution module in the target operator module to obtain a second result.
In this embodiment of the application, when the point-by-point convolution module in the target operator module calculates the point-by-point convolution, a first result obtained after processing by the deep convolution module may be read from the buffer space, so that the first result of the buffer is processed by the point-by-point convolution module to obtain a second result. The second result can be understood as a feature map generated by performing point-by-point convolution operation on the first result by a point-by-point convolution module in the target operator module, and the feature map includes at least one feature data.
It can be understood that, since the point-by-point convolution performs a weighted combination, in the depth direction, of the first result generated by the depth convolution, when there are a plurality of convolution kernels corresponding to the point-by-point convolution, the first result needs to be weighted and combined a plurality of times; that is, the first result is multiplexed a plurality of times and needs to be read from the buffer space a plurality of times for the convolution operation. Therefore, compared with the low memory access efficiency and high power consumption of slowly reading the multiplexed data from the DDR multiple times, reading from the buffer space in the embodiment of the application greatly improves memory access efficiency and greatly reduces the power consumption overhead of memory reads.
Illustratively, as shown in fig. 4, fig. 4 shows an overall schematic diagram of a model operation method provided by the present application. The convolution kernel size of the deep convolution operation is 3 × 3, and the convolution kernel size of the point-by-point convolution operation is 1 × 1. And after the target operator module obtains the first result after the deep convolution operation, caching the first result into a cache space, and then immediately using the first result in the cache space as input to perform point-by-point convolution operation. It can be seen that in fig. 4, there are a plurality of convolution kernels corresponding to the point-by-point convolution operation, and each convolution kernel calls the first result stored in the buffer space to perform weighted combination.
Step S140: and determining an output result of the neural network model according to the second result.
In this embodiment of the application, after the second result after the point-by-point convolution operation is obtained, the subsequent inference of the neural network model can be continued according to the second result, so as to infer the output result of the neural network model.
According to the model operation method provided by the embodiment of the application, when a neural network model is operated, a first result processed by a deep convolution module in a target operator module of the neural network model is obtained, and the target operator module is obtained by fusing the deep convolution module and a point-by-point convolution module which are sequentially connected; the first result is cached in an allocated cache space; the point-by-point convolution module in the target operator module processes the cached first result to obtain a second result; and an output result of the neural network model is determined according to the second result. According to the method and the device, on the basis that the target operator module obtained by fusing the deep convolution and the point-by-point convolution is deployed in the neural network model, the intermediate calculation result of the target operator module is cached, so that the intermediate calculation result can be directly read from the cache space for subsequent calculation, memory reads and writes that incur high performance overhead and high power consumption are reduced, the data access efficiency is improved, and the operation performance of the neural network model is improved.
Referring to fig. 5, fig. 5 shows a flow chart of a model operation method according to another embodiment of the present application, where the model operation method specifically includes the following steps:
Step S210: when the neural network model is run, target data within a sliding window is determined based on input data of a target operator module in the neural network model.
Since the cache space usually has a small capacity, when the calculation result after the deep convolution operation is large, the calculation result may not fit entirely in the cache space. Therefore, in the embodiment of the present application, only a part of the calculation result after the deep convolution operation may be cached in the cache space. Specifically, a sliding window may be provided so that only the input data within the sliding window is subjected to the deep convolution operation, so that a partial intermediate calculation result is obtained. Specifically, when the neural network model is run, the target data within the sliding window can be determined based on the input data of the target operator module in the neural network model, so that subsequently only the target data is subjected to the deep convolution operation.
The sliding window is a slidable window, and the size of the sliding window can be reasonably set as required. For example, the sliding window may have a dimension of 3 × 3 pixels. In the embodiment of the present application, the scale of the sliding window is smaller than that of the complete input data, so that only partial intermediate calculation results can be obtained. In some embodiments, the dimensions of the sliding window may be the height and width of the sliding window, and the dimensions of the complete input data may be the height H and width W of the complete input data.
Referring to fig. 6 and 7, fig. 6 shows a schematic diagram of data representation in a neural network model, and fig. 7 shows a schematic diagram of a sliding window. Here, C represents the number of channels; for example, the number of channels C of a black-and-white image is 1, and the number of channels C of an RGB color image is 3; H represents the height of the picture; W represents the width of the picture; and 000-319 can be understood as the feature values corresponding to the pixel positions in the picture. When the input data of the target operator module is as shown in fig. 6, the dashed box in fig. 7 is the sliding window, and the feature values 000, 001, 004 and 005 in the dashed box are the target data corresponding to the sliding window.
In some embodiments, the size of the sliding window may also be determined according to the buffer capacity of the allocated buffer space, so as to ensure that the obtained first result can be completely buffered in the buffer space. Specifically, before step S210, the model running method of the present application may further include: determining the buffer capacity of the allocated buffer space; and determining the scale of the sliding window according to the cache capacity.
The current buffer capacity of the allocated buffer space may be set manually, or may be determined by a real-time query of the electronic device, which is not limited herein. After the electronic device obtains the buffer capacity of the allocated buffer space, the size of the data amount of the first result that can be buffered can be determined, so that the amount of input data that can be processed in one pass can be determined according to the data amount of the first result and the convolution parameters of the deep convolution (such as the shape of the convolution kernel), and the scale of the sliding window can be determined according to that amount of data.
In some embodiments, the size of the sliding window may also be set manually, and the electronic device may then determine the data amount of the first result according to the size of the sliding window and the convolution parameters of the depth convolution, so that the electronic device can allocate a buffer space with sufficient buffer capacity according to the data amount of the first result. For example, if the data amount of the obtained first result is h × w × C, the buffer capacity of the buffer space is at least h × w × C. Here, h × w can be understood as the height h and width w of the feature map generated after the depth convolution operation, and C is the number of channels C of the input data of the target operator module.
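A hedged sketch of one possible way to derive the sliding-window scale from the cache capacity is given below; the square-window heuristic and the 4-byte element size are assumptions for illustration, not the exact rule of this application:

```python
def window_scale(cache_bytes, channels, element_size=4):
    # the depthwise output of one window (h * w * C elements) must fit in the cache
    max_elems = cache_bytes // element_size
    per_channel = max_elems // channels          # h * w budget per channel
    side = max(int(per_channel ** 0.5), 1)       # largest square window, h = w
    return side, side

h, w = window_scale(cache_bytes=256 * 1024, channels=32)   # e.g. a 256 KB cache budget
```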
Step S220: and processing the target data according to the depth convolution module in the target operator module to obtain a first result.
In this embodiment of the present application, after the target data in the sliding window is determined, the depth convolution module in the target operator module may process the target data to obtain a first result. Specifically, after the input data of the target operator module is obtained, the input data may be input to the target operator module; the target operator module may first obtain the target data in the sliding window based on the input data, and then perform the above-mentioned deep convolution operation on the target data through the deep convolution module, so as to obtain a first result of the target data after the deep convolution operation.
It can be understood that, because the Depthwise Convolution module and the Pointwise Convolution module have been fused into one target operator module, the target operator module can directly control the algorithm to perform the deep convolution operation on only part of the input data and then cache the obtained partial intermediate result. By contrast, because of their independent processing and the layer-by-layer inference mode of the model, the currently independent Depthwise Convolution module and Pointwise Convolution module can only calculate the Pointwise Convolution after all the results of the input data have been calculated by the Depthwise Convolution; they can neither perform partial calculation of the input data nor cache partial calculation results.
Step S230: and caching the first result into an allocated cache space.
Generally, there are two layout approaches for the memory layout of convolution and pooling: NCHW and NHWC. NCHW actually corresponds to the order [W H C N], with C arranged on the outer layer, so that the feature values within each channel are close together; NHWC corresponds to the order [C W H N], with C arranged on the innermost layer, so that the feature values of the corresponding spatial positions of the multiple channels are close together. N represents the batch number (N pictures); generally, in the inference stage, N is 1.
Referring to fig. 8, fig. 8 is a schematic diagram of the data memory layout in a neural network model. It can be seen that when the data in fig. 6 is arranged in the NCHW layout, the first element is 000 and the following elements continue along the W direction, i.e. 001, 002, 003; when the W direction ends, traversal continues along the H direction, i.e. 004, 005, 006, 007, and so on.
Similarly, when the data in fig. 6 is arranged in the NHWC layout, the first element is 000 and the following elements continue along the C direction, i.e. 020, 040, 060, up to 300, and then continue along the W direction: 001, 021, 041, 061, and so on.
Referring to fig. 9, fig. 9 is a schematic diagram of the data memory layout in another neural network model, taking the color-to-gray calculation of an RGB three-channel image as an example. It can be seen that, in the NCHW layout, C is arranged on the outer layer and the pixels within each channel are arranged closely together, i.e. in the form 'rrrrrrggggggbbbbbb'. In the NHWC layout, C is arranged on the innermost layer, and the pixels of the multiple channels corresponding to the same spatial position are close together, i.e. in the form 'rgbrgbrgbrgb'.
In the process of computing the Depthwise Convolution, the H and W data on each input channel C of the input data can be multiplied and accumulated directly with the convolution kernel, without any accumulation across the C direction. Therefore, in the NCHW memory layout format, execution can proceed directly and continuously, which easily achieves continuous reads of memory rows. The data in the neural network model can therefore be stored in the NCHW memory layout format, so that the input data can subsequently be read directly and continuously for the deep convolution operation. Illustratively, referring to fig. 10, fig. 10 shows a schematic diagram of the memory accesses of the deep convolution under NCHW. In some embodiments, the first result may be cached in the allocated cache space in the layout of batch number, channel number, height and width (NCHW).
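The difference between the two layouts can be made concrete with the linear-offset formulas below (an illustrative Python sketch; the fig. 6 tensor appears to be 16 channels of 5 × 4, i.e. the values 000-319, which is an inference from the figure description):

```python
def offset_nchw(n, c, h, w, C, H, W):
    return ((n * C + c) * H + h) * W + w    # W varies fastest: element order [W H C N]

def offset_nhwc(n, c, h, w, C, H, W):
    return ((n * H + h) * W + w) * C + c    # C varies fastest: element order [C W H N]

# element (n=0, c=1, h=0, w=0), labeled 020 in fig. 6:
print(offset_nchw(0, 1, 0, 0, 16, 5, 4))    # 20: a whole channel away from 000 under NCHW
print(offset_nhwc(0, 1, 0, 0, 16, 5, 4))    # 1:  immediately after 000 under NHWC
```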
Step S240: and processing the cached first result according to the point-by-point convolution module in the target operator module to obtain a second result.
In the embodiment of the present application, step S240 may refer to the foregoing embodiment, which is not described herein again.
Step S250: traversing the input data according to the sliding window, and repeatedly performing the steps from determining the target data within the sliding window based on the input data of the target operator module in the neural network model through processing the cached first result according to the point-by-point convolution module in the target operator module, so as to obtain a plurality of second results.
In this embodiment of the application, when the second result corresponding to the target data in the sliding window is calculated once, the sliding window may be moved to cycle through the above process on the input data until all the input data are traversed to obtain a plurality of second results, so that the second result corresponding to each position in the input data may be obtained, that is, the output data of the target operator module is obtained.
Referring to fig. 11, fig. 11 shows a schematic diagram of the memory accesses of the point-by-point convolution under NCHW. Without the memory optimization of the present application (operator fusion and intermediate result caching), the calculation result of the deep convolution (all values of the 3 Feature Maps on the right side in fig. 11) needs to be written into the DDR before the calculation of the point-by-point convolution is performed, and each point on the Feature Maps obtained by the point-by-point convolution needs to sum up the dot-product results (input multiplied by the 1 × 1 convolution kernel) over each input channel C.
For example, referring to fig. 4 again, if a temporary memory Cache is used to cache the intermediate calculation result of the depth convolution, and the Cache capacity of this memory is h × w × C, then after the depth convolution is calculated once to obtain the first result of one h × w × C window, the Cache may be used to store the first result, and the Cache may then immediately be used as input to calculate the point-by-point convolution. If the convolution kernel size corresponding to the point-by-point convolution is H × W × I × O (in the point-by-point convolution, H × W is 1 × 1, I represents the number of input channels, which equals the C of the deep convolution, and O represents the number of output channels), then when the point-by-point convolution is calculated, the data of the h × w × C window can be multiplexed when it is multiplied by the O 1 × 1 convolution kernels, so that the performance overhead and the power consumption overhead caused by writing all the calculation results of the deep convolution into the DDR are reduced.
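A hedged end-to-end sketch of the windowed flow of fig. 4 and the paragraph above is given below (Python/NumPy; the function name run_spconv, the window side win and all shapes are illustrative assumptions). Each depthwise window result of size h × w × C is held in a temporary Cache buffer and multiplied by all O 1 × 1 kernels before the next window is computed, so it never round-trips through the DDR:

```python
import numpy as np

def run_spconv(x, dw_k, pw_k, win=4):
    """x: (C, H, W); dw_k: (C, 3, 3); pw_k: (O, C); win: sliding-window side (h = w = win)."""
    C, H, W = x.shape
    out = np.zeros((pw_k.shape[0], H - 2, W - 2), dtype=x.dtype)
    for i0 in range(0, H - 2, win):                    # traverse the input with the sliding window
        for j0 in range(0, W - 2, win):
            h = min(win, H - 2 - i0)
            w = min(win, W - 2 - j0)
            cache = np.zeros((C, h, w), dtype=x.dtype) # the h*w*C intermediate buffer (the "Cache")
            for c in range(C):                         # Depthwise Convolution over this window only
                for i in range(h):
                    for j in range(w):
                        cache[c, i, j] = (x[c, i0+i:i0+i+3, j0+j:j0+j+3] * dw_k[c]).sum()
            # Pointwise Convolution: the cached window is multiplexed across all O 1x1 kernels
            out[:, i0:i0+h, j0:j0+w] = np.einsum('oc,chw->ohw', pw_k, cache)
    return out
```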
Step S260: and taking the plurality of second results as output data of a target operator module, and determining the output result of the neural network model based on the output data.
In this embodiment of the application, after traversing the complete input data to obtain a plurality of second results, the plurality of second results may be spliced and combined, the spliced and combined second results are used as output data of the target operator module, and based on the output data, subsequent inference of the neural network model is continued to infer the output result of the neural network model.
According to the model operation method provided by the embodiment of the application, when a neural network model is operated, a first result obtained after a target operator module in the neural network model performs the deep convolution operation is obtained, and the target operator module is obtained by fusing a deep convolution module and a point-by-point convolution module in a depth separable convolution network structure; the first result is cached in an allocated cache space; the point-by-point convolution operation in the target operator module is performed according to the first result in the cache space to obtain a second result; and an output result of the neural network model is determined according to the second result. According to the method and the device, on the basis of deploying, in the neural network model, the target operator module obtained by fusing the deep convolution and the point-by-point convolution, the intermediate calculation result of the target operator module is cached, so that the intermediate calculation result can be directly read from the cache space for subsequent calculation, memory reads and writes that incur high performance overhead and high power consumption are reduced, the data access efficiency is improved, and the operation performance of the neural network model is improved.
Referring to fig. 12, fig. 12 is a schematic flowchart illustrating a model operating method according to another embodiment of the present application, where the model operating method specifically includes the following steps:
step S310: determining a network model to be optimized, wherein the network model to be optimized at least comprises a depth convolution module and a point-by-point convolution module which are sequentially connected.
In some embodiments, the network model to be optimized may be a neural network model obtained directly from the network side. In addition, the network model to be optimized may also be a neural network model obtained by optimizing the neural network model obtained from the network side. The optimization of the neural network model can be understood as operations of operator fusion (not operator fusion of the present application), network pruning, model quantization, network cutting and the like on the neural network model.
In some embodiments, after the network model is obtained, it may be determined whether the network model meets the memory optimization condition of the present application. Specifically, referring to fig. 13, step S310 may include:
step S311: and traversing the network structure in the network model, and judging whether a depth convolution module and a point-by-point convolution module which are connected in sequence exist.
After the electronic device obtains the network model, the network model can be traversed in a depth-first manner to judge whether there are a depth convolution module and a point-by-point convolution module which are connected in sequence. Because, in some embodiments, the depthwise separable convolution network structure includes a depth convolution module and a point-by-point convolution module that are connected in sequence, the network model can also be traversed in a depth-first manner to determine whether a depthwise separable convolution network structure exists. That is, if a Depthwise Convolution is found and a Pointwise Convolution is then found after it, it can be considered that sequentially connected depth convolution and point-by-point convolution modules exist; otherwise, the traversal continues until the end of the model.
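As an illustration of this traversal, the sketch below scans a network description for a Depthwise Convolution immediately followed by a 1 × 1 Pointwise Convolution; it assumes the network is available as a list of layer records in topological order, and the field names ('type', 'name', 'input', 'kernel_size') are placeholders rather than any real framework's API:

```python
def find_fusable_pairs(layers):
    pairs = []
    for i, layer in enumerate(layers):
        if layer["type"] != "DepthwiseConv2D":
            continue
        for succ in layers[i + 1:]:
            # a pointwise (1x1) convolution that consumes this layer's output
            if succ.get("input") == layer["name"] and succ["type"] == "Conv2D" \
                    and succ.get("kernel_size") == (1, 1):
                pairs.append((layer["name"], succ["name"]))
    return pairs
```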
Step S312: and when the depth convolution module and the point-by-point convolution module which are sequentially connected exist, taking the network model as the network model to be optimized.
It can be understood that if the network model has a deep convolution module and a point-by-point convolution module which are sequentially connected, the network model can be considered to be capable of performing the memory optimization of the application, that is, the network model is used as the network model to be optimized, and then the subsequent optimization process is performed.
Step S320: and fusing the depth convolution module and the point-by-point convolution module to obtain a fused target operator module.
Specifically, a new operator is generated by fusing the deep convolution module and the point-by-point convolution module, and the fused target operator module is obtained and denoted SPConv (Separable Convolution). In this case, in addition to the original input data, the SPConv requires two convolution kernel inputs, namely the convolution kernel of the original Depthwise Convolution and the convolution kernel of the original Pointwise Convolution.
Referring to fig. 14, fig. 14 is a schematic diagram illustrating a fused network structure provided by the present application. The Separable Convolution (SPConv) module is the target operator module obtained after fusing the depth convolution module and the point-by-point convolution module which are sequentially connected. The input node of the SPConv is the input node of the original Depthwise Convolution module; at the same time, two convolution kernel inputs of sizes 3 × 3 and 1 × 1 are provided, which are respectively the 3 × 3 filter of the original Depthwise Convolution module and the 1 × 1 filter of the original Pointwise Convolution module; and the output node of the SPConv is the output node of the original Pointwise Convolution module.
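The interface implied by fig. 14 can be sketched as follows (a hypothetical Python class, not the actual SPConv implementation): besides the original input data, the fused operator takes the 3 × 3 filter of the original Depthwise Convolution and the 1 × 1 filter of the original Pointwise Convolution as its two kernel inputs.

```python
class SPConv:
    """Fused Separable Convolution operator: depthwise 3x3 + pointwise 1x1."""
    def __init__(self, dw_filter_3x3, pw_filter_1x1):
        self.dw_filter = dw_filter_3x3   # (C, 3, 3), from the original Depthwise Convolution
        self.pw_filter = pw_filter_1x1   # (O, C),   from the original Pointwise Convolution

    def __call__(self, x):
        # internally runs the windowed depthwise + cached pointwise flow,
        # e.g. the run_spconv sketch shown earlier
        return run_spconv(x, self.dw_filter, self.pw_filter)
```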
In some embodiments, the fusion of the deep convolution module and the point-by-point convolution module may be implemented based on the AI compilation system TVM. TVM is a compiler stack for deep learning systems that provides end-to-end compilation for different back ends, and its compilation process includes neural network graph optimization and single-operator optimization. By adding an optimization pass to the TVM front end, the depth convolution and the point-by-point convolution can be automatically combined during compilation to generate the new operator.
In other embodiments, the fusion of the depth convolution module and the point-by-point convolution module may also be implemented in a TensorFlow Lite converter. TensorFlow Lite is a neural network inference framework for mobile terminals, and the fusion of operators can be implemented during its model conversion process.
It can be understood that fusing the depth convolution module and the point-by-point convolution module not only enables the improvement in memory access efficiency described in this application, but also reduces the number of network layers of the network model and therefore the number of memory accesses.
Step S330: and optimizing the network model to be optimized according to the target operator module to obtain the neural network model.
In some embodiments, after the fused target operator module is obtained, the network model may be trained and optimized again to obtain the aforesaid operable neural network model.
In some embodiments, the network model may also be optimized while the fusion of the deep convolution module and the point-by-point convolution module is implemented in the TensorFlow Lite converter, so that a runnable neural network model can be obtained directly. This is not limited herein.
Step S340: and when the neural network model is operated, acquiring a first result processed by the deep convolution module in the target operator module.
Step S350: and caching the first result into an allocated cache space.
Step S360: and processing the cached first result according to the point-by-point convolution module in the target operator module to obtain a second result.
Step S370: and determining an output result of the neural network model according to the second result.
In the embodiment of the present application, reference may be made to the contents of the foregoing embodiment in steps S340 to S370, which are not described herein again.
According to the model operation method provided by the embodiment of the application, when the network model to be optimized is determined, the depth convolution module and the point-by-point convolution module which are sequentially connected in the network model can be fused to obtain the fused target operator module, and the network model is optimized according to the target operator module to obtain the neural network model. Then, when the neural network model is operated, the first result processed by the deep convolution module in the target operator module of the neural network model is obtained and cached in an allocated cache space; the cached first result is then processed according to the point-by-point convolution module in the target operator module to obtain a second result, and the output result of the neural network model is determined according to the second result. According to the method and the device, the deep convolution and the point-by-point convolution in the network model are fused, which reduces the number of network layers of the network model and the number of memory accesses; and the intermediate calculation result of the target operator module generated by the fusion is cached, so that the intermediate calculation result can be directly read from the cache space for subsequent calculation, memory reads and writes that incur high performance overhead and high power consumption are reduced, the data access efficiency is improved, and the operation performance of the neural network model is improved.
Referring to fig. 15, a block diagram of a model operating apparatus 700 according to an embodiment of the present application is shown, where the model operating apparatus 700 includes: a first obtaining module 710, a result caching module 720, a second obtaining module 730, and a result determining module 740. The first obtaining module 710 is configured to obtain, when a neural network model is run, a first result processed by a deep convolution module in a target operator module of the neural network model, where the target operator module is obtained by fusing the deep convolution module and a point-by-point convolution module which are sequentially connected; the result caching module 720 is configured to cache the first result in the allocated cache space; the second obtaining module 730 is configured to process the cached first result according to the point-by-point convolution module in the target operator module to obtain a second result; and the result determining module 740 is configured to determine an output result of the neural network model according to the second result.
In some embodiments, the first obtaining module 710 may include: a target obtaining unit and a target operation unit. The target obtaining unit is configured to determine target data within a sliding window based on input data of a target operator module in the neural network model; and the target operation unit is configured to process the target data according to the depth convolution module in the target operator module to obtain a first result.
In some embodiments, the model operating apparatus 700 may further include: a cache acquisition module and a scale determination module. The cache obtaining module is used for determining the cache capacity of the allocated cache space; and the scale determining module is used for determining the scale of the sliding window according to the cache capacity.
In some embodiments, the result determination module 740 may be specifically configured to: traverse the input data according to the sliding window, and repeatedly perform the steps from determining the target data within the sliding window based on the input data of the target operator module in the neural network model through processing the cached first result according to the point-by-point convolution module in the target operator module, so as to obtain a plurality of second results; and take the plurality of second results as output data of the target operator module, and determine the output result of the neural network model based on the output data.
In some embodiments, the model operating apparatus 700 may further include: the model optimization system comprises a model determination module, a module fusion module and a model optimization module. The model determining module is used for determining a network model to be optimized, and the network model to be optimized at least comprises a depth convolution module and a point-by-point convolution module which are sequentially connected; the module fusion module is used for fusing the depth convolution module and the point-by-point convolution module to obtain a fused target operator module; and the model optimization module is used for optimizing the network model to be optimized according to the target operator module to obtain the neural network model.
In some embodiments, the model determining module may be specifically configured to: traversing a network structure in the network model, and judging whether a depth convolution module and a point-by-point convolution module which are connected in sequence exist or not; and when the depth convolution module and the point-by-point convolution module which are sequentially connected exist, the network model is taken as the network model to be optimized.
In some embodiments, the result caching module 720 may be specifically configured to: and caching the first result into an allocated cache space in a layout mode of batch number, channel number, height and width (NCHW).
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or other type of coupling.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
In summary, the model operation apparatus provided in the embodiment of the present application is used to implement the corresponding model operation method in the foregoing method embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Referring to fig. 16, a block diagram of an electronic device according to an embodiment of the present disclosure is shown. The electronic device 100 may be a PC computer, a mobile terminal, a server, or other terminal device capable of running an application. The electronic device 100 in the present application may include one or more of the following components: a processor 110, a memory 120, and one or more applications, wherein the one or more applications may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more applications configured to perform the methods as described in the aforementioned method embodiments.
Processor 110 may include one or more processing cores. The processor 110 connects various parts within the overall electronic device 100 using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and calling data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 110 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs and the like; the GPU is used for rendering and drawing display content; and the modem is used to handle wireless communication. It is understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The Memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The data storage area may also store data created by the electronic device 100 during use (such as a phone book, audio and video data, and chat log data), and the like.
It will be appreciated that the configuration shown in FIG. 16 is merely exemplary, and that electronic device 100 may include more or fewer components than shown in FIG. 16, or have a completely different configuration than shown in FIG. 16. The embodiments of the present application are not limited thereto.
Referring to FIG. 17, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 800 has program code stored therein, and the program code can be called by a processor to perform the methods described in the above method embodiments.
The computer-readable storage medium 800 may be an electronic memory such as a flash memory, an Electrically Erasable Programmable Read-Only Memory (EEPROM), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 800 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 800 has storage space for program code 810 for performing any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, and are not intended to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or equivalently replace some of the technical features therein; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
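To make the optimization described in the foregoing embodiments (and recited in claims 5 and 6 below) easier to picture, a minimal, hypothetical Python sketch follows; a second sketch of the tile-by-tile run-time path appears after the claims. The sketch traverses a flat list describing a network and replaces every depth convolution module that is immediately followed by a point-by-point convolution module with a single fused target operator module. The names Layer, FusedDepthwisePointwise, and fuse_depthwise_pointwise are illustrative assumptions, not identifiers taken from the patent or from any particular framework.

from dataclasses import dataclass
from typing import List, Union

@dataclass
class Layer:
    kind: str      # e.g. "depthwise_conv", "pointwise_conv", "relu", ...
    params: dict

@dataclass
class FusedDepthwisePointwise:
    depthwise: Layer           # the original depth convolution module
    pointwise: Layer           # the original point-by-point convolution module
    kind: str = "fused_dw_pw"

def fuse_depthwise_pointwise(layers: List[Layer]) -> List[Union[Layer, FusedDepthwisePointwise]]:
    """Replace every depthwise convolution that is directly followed by a
    pointwise (1x1) convolution with a single fused operator module."""
    fused: List[Union[Layer, FusedDepthwisePointwise]] = []
    i = 0
    while i < len(layers):
        if (i + 1 < len(layers)
                and layers[i].kind == "depthwise_conv"
                and layers[i + 1].kind == "pointwise_conv"):
            fused.append(FusedDepthwisePointwise(layers[i], layers[i + 1]))
            i += 2                      # both original modules are consumed
        else:
            fused.append(layers[i])
            i += 1
    return fused

A model that contains no such adjacent pair is returned unchanged, which corresponds to the check recited in claim 6 for deciding whether a network model is a network model to be optimized.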

Claims (10)

1. A method of model operation, the method comprising:
when a neural network model is operated, obtaining a first result processed by a deep convolution module in a target operator module of the neural network model, wherein the target operator module is obtained by fusing the deep convolution module and a point-by-point convolution module which are sequentially connected;
caching the first result into an allocated cache space;
processing the cached first result according to the point-by-point convolution module in the target operator module to obtain a second result;
and determining an output result of the neural network model according to the second result.
2. The method of claim 1, wherein obtaining the first result processed by the deep convolution module in the target operator module of the neural network model comprises:
determining target data in a sliding window based on input data of a target operator module in the neural network model;
and processing the target data according to the depth convolution module in the target operator module to obtain a first result.
3. The method of claim 2, wherein prior to said determining target data within a sliding window based on input data of a target operator module in said neural network model, the method further comprises:
determining the buffer capacity of the allocated buffer space;
and determining the scale of the sliding window according to the cache capacity.
4. The method of claim 2, wherein determining the output result of the neural network model based on the second result comprises:
traversing the input data according to the sliding window, and repeatedly performing the steps from the determining of the target data in the sliding window based on the input data of the target operator module in the neural network model to the processing of the cached first result according to the point-by-point convolution module in the target operator module, so as to obtain a plurality of second results;
and taking the plurality of second results as output data of the target operator module, and determining the output result of the neural network model based on the output data.
5. The method according to any one of claims 1 to 4, wherein before the obtaining the first result processed by the deep convolution module in the target operator module of the neural network model when the neural network model is executed, the method further comprises:
determining a network model to be optimized, wherein the network model to be optimized at least comprises a depth convolution module and a point-by-point convolution module which are sequentially connected;
fusing the depth convolution module and the point-by-point convolution module to obtain a fused target operator module;
and optimizing the network model to be optimized according to the target operator module to obtain the neural network model.
6. The method of claim 5, wherein determining the network model to be optimized comprises:
traversing a network structure in the network model, and determining whether a depth convolution module and a point-by-point convolution module that are sequentially connected exist;
and when the depth convolution module and the point-by-point convolution module which are sequentially connected exist, taking the network model as the network model to be optimized.
7. The method according to any of claims 1-4, wherein said buffering the first result into an allocated buffer space comprises:
and caching the first result into an allocated cache space in a layout mode of batch number, channel number, height and width (NCHW).
8. A model running apparatus, characterized in that the apparatus comprises:
a first obtaining module, configured to obtain, when a neural network model is run, a first result processed by a deep convolution module in a target operator module of the neural network model, the target operator module being obtained by fusing the deep convolution module and a point-by-point convolution module that are sequentially connected;
a result caching module, configured to cache the first result in an allocated cache space;
a second obtaining module, configured to process the cached first result according to the point-by-point convolution module in the target operator module to obtain a second result;
and a result determining module, configured to determine an output result of the neural network model according to the second result.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-7.
10. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 7.
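For concreteness, the following is a minimal, hypothetical NumPy sketch of the run-time path recited in claims 1 to 4 and 7, not the patentee's implementation: the depth convolution output for one sliding-window tile (the "first result") is written into a small pre-allocated cache laid out as NCHW and is immediately consumed by the point-by-point convolution to produce the "second result", so the full intermediate feature map never needs to be materialised. The function and parameter names (run_fused_dw_pw, cache_rows) are assumptions made for illustration; cache_rows stands in for the sliding-window scale that claim 3 derives from the cache capacity, and stride 1 with no padding is assumed.

import numpy as np

def run_fused_dw_pw(x, dw_kernel, pw_kernel, cache_rows=4):
    """Run a fused depthwise + pointwise operator tile by tile.

    x:          input feature map, shape (N, C, H, W)
    dw_kernel:  depthwise weights, shape (C, kH, kW)
    pw_kernel:  pointwise (1x1) weights, shape (C_out, C)
    cache_rows: number of output rows the pre-allocated cache can hold
    """
    N, C, H, W = x.shape
    C_out = pw_kernel.shape[0]
    kH, kW = dw_kernel.shape[1:]
    out_h, out_w = H - kH + 1, W - kW + 1

    # Pre-allocated cache for the depthwise "first result", NCHW layout.
    cache = np.empty((N, C, cache_rows, out_w), dtype=x.dtype)
    out = np.empty((N, C_out, out_h, out_w), dtype=x.dtype)

    for top in range(0, out_h, cache_rows):      # slide the window down the rows
        rows = min(cache_rows, out_h - top)
        # Depthwise convolution for this tile -> first result, kept in the cache.
        for r in range(rows):
            for c in range(out_w):
                patch = x[:, :, top + r: top + r + kH, c: c + kW]   # (N, C, kH, kW)
                cache[:, :, r, c] = np.sum(patch * dw_kernel, axis=(2, 3))
        # Pointwise convolution on the cached tile -> second result.
        tile = cache[:, :, :rows, :]                                # (N, C, rows, out_w)
        out[:, :, top: top + rows, :] = np.einsum("nchw,oc->nohw", tile, pw_kernel)
    return out

For example, with an input of shape (1, 8, 16, 16), depthwise weights of shape (8, 3, 3), and pointwise weights of shape (16, 8), run_fused_dw_pw(x, dw, pw, cache_rows=4) returns an output of shape (1, 16, 14, 14), and only four rows of the intermediate depthwise result are ever resident in the cache at once.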
CN202110168983.XA 2021-02-07 2021-02-07 Model operation method and device, electronic equipment and storage medium Pending CN112862074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110168983.XA CN112862074A (en) 2021-02-07 2021-02-07 Model operation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110168983.XA CN112862074A (en) 2021-02-07 2021-02-07 Model operation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112862074A true CN112862074A (en) 2021-05-28

Family

ID=75989051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110168983.XA Pending CN112862074A (en) 2021-02-07 2021-02-07 Model operation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112862074A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284817A (en) * 2018-08-31 2019-01-29 中国科学院上海高等研究院 Depth separates convolutional neural networks processing framework/method/system and medium
KR102107077B1 (en) * 2018-11-20 2020-05-06 주식회사 아나패스 Line-based memory management method for performing convolution operation in convolutional neural network inference and its inference device
CN111767986A (en) * 2020-06-24 2020-10-13 深兰人工智能芯片研究院(江苏)有限公司 Operation method and device based on neural network
CN111898733A (en) * 2020-07-02 2020-11-06 西安交通大学 Deep separable convolutional neural network accelerator architecture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Shuangfeng: "TensorFlow Lite: On-Device Machine Learning Framework", Journal of Computer Research and Development (计算机研究与发展), no. 09 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673701A (en) * 2021-08-24 2021-11-19 安谋科技(中国)有限公司 Method for operating neural network model, readable medium and electronic device

Similar Documents

Publication Publication Date Title
KR102326918B1 (en) Operation apparatus, operation execution device and operation execution method
US20220076123A1 (en) Neural network optimization method, electronic device and processor
WO2020073211A1 (en) Operation accelerator, processing method, and related device
CN112163601B (en) Image classification method, system, computer device and storage medium
US20220147795A1 (en) Neural network tiling method, prediction method, and related apparatus
CN111242844B (en) Image processing method, device, server and storage medium
KR102263017B1 (en) Method and apparatus for high-speed image recognition using 3d convolutional neural network
CN111709516A (en) Compression method and compression device of neural network model, storage medium and equipment
CN115956247A (en) Neural network model optimization method and device
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN116401552A (en) Classification model training method and related device
CN110009644B (en) Method and device for segmenting line pixels of feature map
CN112862074A (en) Model operation method and device, electronic equipment and storage medium
CN115220833A (en) Method for optimizing neural network model and method for providing graphic user interface
CN108376283B (en) Pooling device and pooling method for neural network
KR20220024076A (en) Optimizing machine learning model performance
WO2022127603A1 (en) Model processing method and related device
JP2021527859A (en) Irregular shape segmentation in an image using deep region expansion
CN116341630A (en) Neural network processing
WO2021258964A1 (en) Neural network architecture search method, apparatus and system
KR20220144281A (en) Method of optimizing neural network model and neural network model processing system performing the same
WO2021120036A1 (en) Data processing apparatus and data processing method
CN114118389A (en) Neural network data processing method, device and storage medium
CN113570036A (en) Hardware accelerator architecture supporting dynamic neural network sparse model
CN113469326A (en) Integrated circuit device and board card for executing pruning optimization in neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination