CN115481732A - Method and apparatus for processing feature maps via an artificial intelligence accelerator - Google Patents

Info

Publication number: CN115481732A
Application number: CN202211152487.6A
Authority: CN (China)
Legal status: Pending
Prior art keywords: convolution, layer, depth, merged, convolutional layer
Other languages: Chinese (zh)
Inventors: 李建军, 姚猛, 王振江, 凌坤
Current Assignee: Beijing Horizon Information Technology Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

A method and apparatus for processing a feature map via an artificial intelligence accelerator are disclosed. The method for processing a feature map via an artificial intelligence accelerator includes the following steps: determining a merged convolutional layer in a target neural network model, where layer parameters of the merged convolutional layer include a depth convolution parameter and a point convolution parameter; determining a cache region corresponding to the merged convolutional layer; performing a depth convolution operation on the input feature map of the merged convolutional layer based on the depth convolution parameter; sequentially caching at least two depth convolution results obtained by the depth convolution operation into the cache region, where any two of the at least two depth convolution results correspond to different local regions of the input feature map; performing a point convolution operation on the depth convolution results cached in the cache region based on the point convolution parameter; and determining an output feature map of the merged convolutional layer based on the point convolution results obtained by the point convolution operation. Embodiments of the present disclosure can effectively improve the operation speed of depthwise separable convolution.

Description

Method and apparatus for processing feature maps via an artificial intelligence accelerator
Technical Field
The present disclosure relates to the field of integrated circuit technology, and in particular, to a method and apparatus for processing a feature map via an artificial intelligence accelerator.
Background
Depthwise Separable Convolution, which consists of two parts, a depth convolution (Depthwise Conv) and a point convolution (Pointwise Conv), is currently in wide use.
In a typical implementation, a depth convolution operation is first performed on the feature map to undergo depthwise separable convolution. Only after all data in the feature map have completed the depth convolution operation and all depth convolution results have been stored in a Double Data Rate (DDR) synchronous dynamic random access memory is a point convolution operation performed on the data stored in the DDR, thereby completing the depthwise separable convolution of the feature map.
Disclosure of Invention
To solve the technical problem that the depthwise separable convolution operation is slow and therefore has difficulty meeting practical requirements, embodiments of the present disclosure provide a method and apparatus for processing a feature map via an artificial intelligence accelerator.
According to an aspect of the embodiments of the present disclosure, there is provided a method for processing a feature map by an artificial intelligence accelerator, including:
determining a merged convolutional layer in a target neural network model, wherein layer parameters of the merged convolutional layer comprise a depth convolution parameter and a point convolution parameter;
determining a cache region corresponding to the merged convolution layer;
performing depth convolution operation on the input feature map of the merged convolutional layer based on the depth convolution parameters;
sequentially caching at least two depth convolution results obtained by the depth convolution operation to the cache region, wherein any two depth convolution results in the at least two depth convolution results correspond to different local regions in the input feature map;
performing point convolution operation on the depth convolution result cached in the cache region based on the point convolution parameter;
and determining the output feature map of the merged convolutional layer based on the point convolution results obtained by the point convolution operation.
According to another aspect of the embodiments of the present disclosure, there is provided a method for compiling a neural network model, including:
determining a neural network model to be compiled;
determining a first convolutional layer and a second convolutional layer which are paired from the neural network model to be compiled;
merging the first convolution layer and the second convolution layer to obtain a merged convolution layer, wherein layer parameters of the merged convolution layer comprise depth convolution parameters of the first convolution layer and point convolution parameters of the second convolution layer;
allocating a buffer area for the merged convolution layer;
compiling to generate a target neural network model based on the merged convolutional layer, the cache region, and the network layers in the neural network model to be compiled other than the first convolutional layer and the second convolutional layer, where the target neural network model includes instructions for executing the above method for processing a feature map via an artificial intelligence accelerator.
According to still another aspect of an embodiment of the present disclosure, there is provided an apparatus for processing a feature map by an artificial intelligence accelerator, including:
a first determining module, configured to determine a merged convolutional layer in a target neural network model, where layer parameters of the merged convolutional layer include a depth convolution parameter and a point convolution parameter;
a second determining module, configured to determine a cache region corresponding to the merged convolution layer determined by the first determining module;
a first operation module, configured to perform a depth convolution operation on an input feature map of the merged convolutional layer based on the depth convolution parameter of the merged convolutional layer determined by the first determination module;
the buffer module is configured to sequentially buffer at least two depth convolution results obtained through the depth convolution operation of the first operation module to the buffer area determined by the second determination module, where any two depth convolution results of the at least two depth convolution results correspond to different local areas in the input feature map;
a second operation module, configured to perform a point convolution operation on the depth convolution result cached in the cache region determined by the second determination module based on the point convolution parameter of the merged convolution layer determined by the first determination module;
a third determining module, configured to determine, based on a point convolution result obtained by the point convolution operation of the second operation module, an output feature map of the merged convolution layer determined by the first determining module.
According to another aspect of the embodiments of the present disclosure, there is provided a neural network model compiling apparatus including:
a fourth determining module, configured to determine a neural network model to be compiled;
a fifth determining module, configured to determine a first convolutional layer and a second convolutional layer that are paired from the neural network model to be compiled determined by the fourth determining module;
a merging module, configured to merge the first convolution layer and the second convolution layer determined by the fifth determining module to obtain a merged convolution layer, where layer parameters of the merged convolution layer include a depth convolution parameter of the first convolution layer and a point convolution parameter of the second convolution layer;
an allocating module, configured to allocate a cache region for the merged convolutional layer obtained by the merging module;
a generating module, configured to compile and generate a target neural network model based on the merged convolutional layer obtained by the merging module, the cache region allocated by the allocating module, and the network layers in the neural network model to be compiled determined by the fourth determining module other than the first convolutional layer and the second convolutional layer determined by the fifth determining module, where the target neural network model includes instructions for executing the above method for processing a feature map via an artificial intelligence accelerator.
According to still another aspect of an embodiment of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above method of processing a feature map by an artificial intelligence accelerator or the compiling method of a neural network model.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor being configured to read the executable instructions from the memory and execute them to implement the above method for processing a feature map via an artificial intelligence accelerator or the above method for compiling a neural network model.
Based on the above method and apparatus for processing a feature map via an artificial intelligence accelerator, method and apparatus for compiling a neural network model, computer-readable storage medium, and electronic device, introducing a cache region and using the depth convolution parameter and point convolution parameter of the merged convolutional layer effectively fuses the depth convolution and point convolution operations, so that depthwise separable convolution is implemented efficiently and quickly. In addition, because embodiments of the present disclosure do not use the DDR external to the Artificial Intelligence (AI) accelerator and implement the depthwise separable convolution using only the cache inside the AI accelerator, the operation speed of the depthwise separable convolution can be effectively increased.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally indicate like parts or steps.
Fig. 1-1 is a schematic diagram of a convolution kernel shape of a general convolution in the related art.
Fig. 1-2 are schematic diagrams of convolution kernel shapes for deep convolution in the related art.
Fig. 1-3 are schematic diagrams of convolution kernel shapes for point convolution in the related art.
Fig. 2-1 is a schematic diagram of a calculation manner of a general convolution in the related art.
Fig. 2-2 is a schematic diagram of a calculation manner of a depth separable convolution of depth convolution + point convolution in the related art.
FIG. 3 is an implementation schematic diagram of depth separable convolution in an exemplary embodiment of the present disclosure.
FIG. 4 is a flowchart illustrating a method for processing a feature map via an artificial intelligence accelerator according to an exemplary embodiment of the disclosure.
FIG. 5 is a flowchart illustrating a method for processing a feature map by an artificial intelligence accelerator according to another exemplary embodiment of the disclosure.
FIG. 6 is a flowchart illustrating a method for processing a feature map via an artificial intelligence accelerator according to yet another exemplary embodiment of the present disclosure.
Fig. 7 is a flowchart illustrating a compiling method of a neural network model according to an exemplary embodiment of the disclosure.
Fig. 8 is a flowchart illustrating a compiling method of a neural network model according to another exemplary embodiment of the present disclosure.
FIG. 9 is an implementation schematic diagram of depth separable convolution in an exemplary embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of an apparatus for processing a feature map by an artificial intelligence accelerator according to an exemplary embodiment of the present disclosure.
Fig. 11 is a schematic structural diagram of an apparatus for processing a feature map by an artificial intelligence accelerator according to another exemplary embodiment of the present disclosure.
Fig. 12 is a schematic structural diagram of an apparatus for processing a feature map by an artificial intelligence accelerator according to still another exemplary embodiment of the present disclosure.
Fig. 13 is a schematic structural diagram of a compiling apparatus for a neural network model according to an exemplary embodiment of the disclosure.
Fig. 14 is a schematic structural diagram of a compiling apparatus of a neural network model according to another exemplary embodiment of the present disclosure.
Fig. 15 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. The described embodiments are only a few embodiments of the present disclosure, not all embodiments, and the present disclosure is not limited by the example embodiments.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those within the art that the terms "first", "second", etc. in the embodiments of the present disclosure are used only for distinguishing between different steps, devices or modules, etc., and do not denote a particular technical meaning or a necessary logical order. "plurality" may mean two or more, and "at least one" may mean one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the present disclosure may be generally understood as one or more, unless explicitly defined otherwise or indicated to the contrary hereinafter.
The term "and/or" in this disclosure is only one kind of association relationship describing the associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the embodiments in the present disclosure emphasizes the differences between the embodiments, and the same or similar parts may be referred to each other, and are not repeated for brevity.
It should be understood that the dimensions of the various parts shown in the drawings are not drawn to scale.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
Depth convolution and point convolution are two particular types of convolution; depth convolution may also be referred to as DW convolution, and point convolution may also be referred to as PW convolution. As shown in fig. 1-1, the convolution kernel shape of an ordinary convolution is N × Dk × Dk × M; as shown in fig. 1-2, the convolution kernel shape of a depth convolution is M × Dk × Dk × 1; as shown in fig. 1-3, the convolution kernel shape of a point convolution is N × 1 × 1 × M.
The depthwise separable convolution formed by a depth convolution followed by a point convolution is currently in wide use; the calculation manner of an ordinary convolution is shown in fig. 2-1, and the calculation manner of the depthwise separable convolution (depth convolution plus point convolution) is shown in fig. 2-2.
It should be noted that when depthwise separable convolution is implemented, a depth convolution operation is generally performed first on the feature map to undergo the depthwise separable convolution; only after all data in the feature map have completed the depth convolution operation and all depth convolution results have been stored in a DDR outside the artificial intelligence accelerator is the point convolution operation performed on the data stored in the DDR. As a result, completing the depthwise separable convolution of the feature map takes a very long time, the operation speed is slow, and practical requirements are difficult to meet.
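For contrast with the fused scheme introduced below, this conventional two-phase flow can be sketched in Python with numpy as follows; the channels-first array layout, the function name, and the array standing in for the off-chip DDR are assumptions of this sketch rather than an actual implementation.

    import numpy as np

    def naive_depthwise_separable(x, dw_weight, pw_weight, stride=1):
        # Baseline flow: finish the whole depth convolution, spill every
        # result to (simulated) off-chip DDR, then run the point convolution.
        C, H, W = x.shape                     # input feature map, channels first
        K = dw_weight.shape[-1]               # dw_weight: (C, K, K)
        Ho = (H - K) // stride + 1
        Wo = (W - K) // stride + 1

        ddr = np.empty((C, Ho, Wo), dtype=x.dtype)   # stand-in for off-chip DDR
        for i in range(Ho):
            for j in range(Wo):
                patch = x[:, i * stride:i * stride + K, j * stride:j * stride + K]
                ddr[:, i, j] = (patch * dw_weight).sum(axis=(1, 2))

        # The point convolution starts only after ALL depth results are in DDR.
        return np.einsum('nc,chw->nhw', pw_weight, ddr)   # pw_weight: (N, C)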
Exemplary System
To solve the technical problem that the depthwise separable convolution operation is slow and has difficulty meeting practical requirements, as shown in fig. 3, the present disclosure makes improvements in both the compiling stage and the execution stage of the neural network model. In the compiling stage, the depth convolutional layer and the point convolutional layer are merged to obtain a merged convolutional layer, a cache region is allocated for the merged convolutional layer, and the merged convolutional layer and the cache region are used for instruction generation. In the execution stage, depthwise separable convolution is implemented efficiently and quickly by executing the instructions generated in the compiling stage.
Exemplary method
FIG. 4 is a flowchart illustrating a method for processing a feature map by an artificial intelligence accelerator according to an exemplary embodiment of the disclosure. The method shown in fig. 4 includes step 410, step 420, step 430, step 440, step 450, and step 460, which are described below.
Step 410, determining a merged convolutional layer in the target neural network model, wherein layer parameters of the merged convolutional layer comprise a depth convolution parameter and a point convolution parameter.
It should be noted that the target neural network model may be any neural network model compiled by the compiling method described below, and may include a merged convolutional layer and other network layers besides the merged convolutional layer.
Layer parameters of the merged convolutional layer may include a depth convolution parameter and a point convolution parameter; the depth convolution parameter refers to the parameters required to realize the depth convolution operation, and the point convolution parameter refers to the parameters required to realize the point convolution operation. Optionally, the depth convolution parameter may include a depth convolution kernel, a step value (dw stride), a padding value (dw padding), and the like, where the depth convolution kernel carries information such as the convolution kernel shape (dw kernel shape) and weight values (dw weight); the point convolution parameter may include a point convolution kernel, which carries information such as the number of convolution kernels (pw kernel num) and weight values (pw weight).
Other network layers besides the merged convolutional layer include, but are not limited to, depth convolutional layers, point convolutional layers, pooling layers, rectified linear unit (ReLU) layers, and the like.
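Purely as an illustration of how these layer parameters could be grouped on the software side, the following Python sketch defines a container reused by the later sketches in this description; the field names and array shapes (channels-first, C input channels, K × K depth kernels, N point kernels) are assumptions, not the disclosure's data layout.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MergedConvLayerParams:
        # Layer parameters of a merged convolutional layer (step 410).
        dw_weight: np.ndarray   # depth convolution kernel, shape (C, K, K)
        dw_stride: int          # step value (dw stride)
        dw_padding: int         # padding value (dw padding)
        pw_weight: np.ndarray   # point convolution kernels, shape (N, C)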
Step 420, determining the cache region corresponding to the merged convolutional layer.
It should be noted that the cache region corresponding to the merged convolutional layer may be the cache region allocated to the merged convolutional layer by the compiler in the compiling stage. Optionally, the cache region may be a certain region in the on-chip Static Random-Access Memory (SRAM); the cache region may also be referred to as a temporary buffer.
Step 430, performing a depth convolution operation on the input feature map of the merged convolutional layer based on the depth convolution parameter.
In step 430, a depth convolution operation may be performed on the input feature map of the merged convolutional layer based on the depth convolution kernel, the step value, and the padding value in the depth convolution parameter; that is, a convolution operation is performed separately on each channel of the input feature map.
If the padding value is not 0, the input feature map of the merged convolutional layer may first be edge-padded based on the padding value to obtain an edge-padded input feature map; for example, if the padding value is 1, one pixel may be padded on each of the top, bottom, left, and right sides of the input feature map.
Next, the depth convolution kernel may be slid over the edge-padded input feature map based on the step value; at each sliding position, an element-wise multiply-and-accumulate is performed between each channel of the region covered by the depth convolution kernel and the corresponding channel of the depth convolution kernel.
Step 440, sequentially caching at least two depth convolution results obtained by the depth convolution operation into the cache region, where any two of the at least two depth convolution results correspond to different local regions of the input feature map.
It should be noted that any depth convolution result obtained by the depth convolution operation corresponds to a local region of the input feature map. The relationship between a depth convolution result and a local region can be understood as follows: when the region covered by the depth convolution kernel includes only that local region, or only that local region plus part of the edge-padding region of the input feature map, the element-wise multiply-and-accumulate between each covered channel and the corresponding channel of the depth convolution kernel yields one operation result value per channel, and these values together form the depth convolution result. That is, the width and height of each depth convolution result are both 1, and its number of channels is the same as that of the input feature map.
As the depth convolution kernel slides, the region it covers keeps changing, so multiple depth convolution results corresponding to different local regions of the input feature map are obtained in sequence, and each depth convolution result can be cached into the cache region as soon as it is obtained.
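Steps 430 and 440 can be pictured as a generator that emits one depth convolution result, a vector of width and height 1 with C channels, per local region, in the order the kernel slides; this is a minimal numpy sketch under the layout assumed above, not the accelerator's actual dataflow.

    import numpy as np

    def depthwise_results(x, dw_weight, stride=1, padding=0):
        # Yield ((i, j), vector) pairs: one depth convolution result per
        # output position, each of width/height 1 with as many channels
        # as the input feature map.
        if padding > 0:   # edge-pad the input feature map (step 430)
            x = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
        C, H, W = x.shape
        K = dw_weight.shape[-1]
        for i in range((H - K) // stride + 1):
            for j in range((W - K) // stride + 1):
                patch = x[:, i * stride:i * stride + K, j * stride:j * stride + K]
                yield (i, j), (patch * dw_weight).sum(axis=(1, 2))  # length-C vector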
Step 450, performing a point convolution operation on the depth convolution results cached in the cache region based on the point convolution parameter.
In step 450, a point convolution operation may be performed on the depth convolution results cached in the cache region based on the point convolution kernel in the point convolution parameter; that is, a weighted summation along the channel direction may be performed on the cached depth convolution results.
Step 460, determining the output feature map of the merged convolutional layer based on the point convolution results obtained by the point convolution operation.
Since all the depth convolution results obtained by the depth convolution operation are sequentially cached into the cache region, the point convolution results corresponding to all the depth convolution results can be obtained by executing step 450; thus, in step 460, the output feature map of the merged convolutional layer can be obtained by assembling the point convolution results corresponding to all the depth convolution results.
In the embodiments of the present disclosure, introducing a cache region and using the depth convolution parameter and point convolution parameter of the merged convolutional layer effectively fuses the depth convolution and point convolution operations, so that depthwise separable convolution is implemented efficiently and quickly. In addition, because the DDR located outside the artificial intelligence accelerator is not used and the depthwise separable convolution is implemented using only the cache inside the artificial intelligence accelerator, the operation speed of the depthwise separable convolution can be effectively increased.
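Putting steps 410 through 460 together, a minimal fused sketch (reusing MergedConvLayerParams and depthwise_results from the sketches above) might look like this; the Python list standing in for the SRAM cache region and its fixed capacity are assumptions.

    def fused_depthwise_pointwise(x, params, cache_capacity=64):
        # Fused execution: each depth convolution result is cached, and the
        # point convolution runs on the cached results whenever the cache fills.
        C, H, W = x.shape
        K = params.dw_weight.shape[-1]
        Ho = (H + 2 * params.dw_padding - K) // params.dw_stride + 1
        Wo = (W + 2 * params.dw_padding - K) // params.dw_stride + 1
        N = params.pw_weight.shape[0]
        out = np.empty((N, Ho, Wo), dtype=x.dtype)   # output feature map (step 460)

        cache = []   # stand-in for the on-chip cache region (step 420)
        for pos, vec in depthwise_results(x, params.dw_weight,
                                          params.dw_stride, params.dw_padding):
            cache.append((pos, vec))                 # step 440
            if len(cache) == cache_capacity:         # cache full: run point conv
                for (i, j), v in cache:              # step 450
                    out[:, i, j] = params.pw_weight @ v
                cache.clear()
        for (i, j), v in cache:                      # drain the remainder
            out[:, i, j] = params.pw_weight @ v
        return out

Under the same inputs and zero padding, this produces the same output feature map as the two-phase baseline sketched earlier; the difference is that no full-size intermediate result is ever materialized off-chip.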
Based on the embodiment shown in fig. 4, as shown in fig. 5, step 450 includes step 4505, step 4507, step 4509, and step 4511.
Step 4505, determining a first size relationship between the space occupancy of the cache region and a preset space occupancy.
Optionally, the preset space occupancy may be 85%, 90%, 95%, etc., which are not listed here one by one.
In step 4505, the current space occupancy of the cache region may be determined and compared with the preset space occupancy to obtain the first size relationship.
Step 4507, determining a second size relationship between the number of depth convolution results cached in the cache region and a first preset number.
Optionally, the first preset number may be 2, 3, 4, 5, etc., which are not listed here one by one.
In step 4507, the number of depth convolution results currently cached in the cache region may be determined and compared with the first preset number to obtain the second size relationship.
Step 4509, determining a time relationship between the current time and a preset time.
Here, the preset time may be a preset moment at which the point convolution operation is performed. Optionally, a certain moment may be taken as the start time, and the moments at every interval of a preset duration may be taken as preset times; for example, if T1 is the start time and the preset duration is T, then T1+T, T1+2T, …, T1+NT may all serve as preset times.
In step 4509, the current time may be determined by calling a system clock function, and the time relationship between the current time and the preset time may then be determined; this time relationship represents the temporal precedence between the current time and the preset time.
Step 4511, in response to at least one of the first size relationship, the second size relationship, and the time relationship being used to trigger the point convolution operation, performing the point convolution operation on the depth convolution results cached in the cache region based on the point convolution parameter.
Take as the first condition that the first size relationship indicates that the current space occupancy of the cache region is greater than or equal to the preset space occupancy: when the first condition is satisfied, it may be determined that the first size relationship is used to trigger the point convolution operation; when it is not satisfied, the first size relationship is not used to trigger the point convolution operation.
Take as the second condition that the second size relationship indicates that the number of depth convolution results currently cached in the cache region is greater than or equal to the first preset number: when the second condition is satisfied, it may be determined that the second size relationship is used to trigger the point convolution operation; when it is not satisfied, the second size relationship is not used to trigger the point convolution operation.
Take as the third condition that the time relationship indicates that the current time is equal to the preset time: when the third condition is satisfied, it may be determined that the time relationship is used to trigger the point convolution operation; when it is not satisfied, the time relationship is not used to trigger the point convolution operation.
It should be noted that when at least one of the first condition, the second condition, and the third condition is satisfied, the cache region can be assumed to hold enough depth convolution results; if the point convolution operation is not performed on them in time, they may be overwritten by depth convolution results subsequently cached into the cache region, so that part of the point convolution results would be missing. In view of this, in the embodiments of the present disclosure, the moment for triggering the point convolution operation can be determined efficiently and quickly by consulting the first size relationship, the second size relationship, and the time relationship; whenever enough depth convolution results are cached in the cache region, the point convolution operation is performed on them in time, which effectively ensures the integrity and accuracy of the final output feature map of the merged convolutional layer.
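The three trigger relations of steps 4505 through 4511 can be collapsed into a single predicate, as in the following sketch; the threshold values are placeholders taken from the examples above.

    import time

    def should_trigger_point_conv(cache_bytes_used, cache_bytes_total,
                                  num_cached_results, next_deadline=None,
                                  preset_occupancy=0.90, first_preset_number=4):
        # Return True if any of the three relations triggers the point convolution.
        first_size = cache_bytes_used / cache_bytes_total >= preset_occupancy
        second_size = num_cached_results >= first_preset_number
        time_rel = next_deadline is not None and time.time() >= next_deadline
        return first_size or second_size or time_rel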
Based on the embodiment shown in fig. 4, as shown in fig. 6, step 440 includes step 4401.
Step 4401, sequentially caching at least two depth convolution results obtained by the depth convolution operation into a first cache sub-region of the cache region.
Here, the cache region may include a first cache sub-region and a second cache sub-region. Optionally, the space sizes of the first cache sub-region and the second cache sub-region may be the same or different.
Step 450 includes step 4513 and step 4515.
Step 4513, dumping the depth convolution results cached in the first cache sub-region to the second cache sub-region.
Optionally, the depth convolution results cached in the first cache sub-region may be dumped to the second cache sub-region when the space occupancy of the first cache sub-region is greater than or equal to the preset space occupancy; alternatively, they may be dumped when the number of depth convolution results cached in the first cache sub-region is greater than or equal to the first preset number.
Optionally, after the depth convolution results cached in the first cache sub-region are dumped to the second cache sub-region, the first cache sub-region may be emptied.
Step 4515, performing a point convolution operation on the depth convolution results cached in the second cache sub-region based on the point convolution parameter.
Optionally, the point convolution operation may be performed on the depth convolution results cached in the second cache sub-region based on the point convolution parameter when at least one of the following three conditions is satisfied: the space occupancy of the second cache sub-region is greater than or equal to the preset space occupancy; the number of depth convolution results cached in the second cache sub-region is greater than or equal to the first preset number; the current time is equal to the preset time.
In the embodiments of the present disclosure, by dividing the cache region into a first cache sub-region and a second cache sub-region, the first cache sub-region can directly receive the depth convolution results produced by the depth convolution operation, while the depth convolution results held in the second cache sub-region feed the point convolution operation. The first cache sub-region can thus be regarded as dedicated to the depth convolution operation and the second cache sub-region as dedicated to the point convolution operation, which enables concurrent execution of the two operations and further increases the operation speed of the depthwise separable convolution.
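The two sub-region scheme can be sketched as a ping-pong buffer, reusing the helpers above: the first sub-region receives depth convolution results, and when it fills, its contents are dumped to the second sub-region, which feeds the point convolution. On real hardware the two phases would run concurrently; this sequential Python sketch only illustrates the data movement.

    def fused_ping_pong(x, params, sub_capacity=64):
        # Double-buffered variant of steps 4401, 4513 and 4515.
        C, H, W = x.shape
        K = params.dw_weight.shape[-1]
        Ho = (H + 2 * params.dw_padding - K) // params.dw_stride + 1
        Wo = (W + 2 * params.dw_padding - K) // params.dw_stride + 1
        out = np.empty((params.pw_weight.shape[0], Ho, Wo), dtype=x.dtype)

        first_sub, second_sub = [], []   # the two cache sub-regions
        def flush_second():
            for (i, j), v in second_sub:         # step 4515: point convolution
                out[:, i, j] = params.pw_weight @ v
            second_sub.clear()

        for pos, vec in depthwise_results(x, params.dw_weight,
                                          params.dw_stride, params.dw_padding):
            first_sub.append((pos, vec))         # step 4401
            if len(first_sub) == sub_capacity:
                flush_second()                   # make room in the second sub-region
                second_sub.extend(first_sub)     # step 4513: dump and...
                first_sub.clear()                # ...empty the first sub-region
        second_sub.extend(first_sub)             # drain the tail
        first_sub.clear()
        flush_second()
        return out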
Fig. 7 is a flowchart illustrating a compiling method of a neural network model according to an exemplary embodiment of the disclosure. The method shown in fig. 7 includes steps 710, 720, 730, 740, and 750, which are described below.
Step 710, determining a neural network model to be compiled.
It should be noted that the neural network model to be compiled refers to a neural network model that needs to be compiled; it may include a plurality of network layers, including, but not limited to, depth convolutional layers, point convolutional layers, pooling layers, rectified linear unit (ReLU) layers, and the like.
Step 720, determining a first convolutional layer and a second convolutional layer which are paired from the neural network model to be compiled.
In step 720, all convolutional layers may be determined from the plurality of network layers included in the neural network model to be compiled, and the paired first convolutional layer and second convolutional layer may then be selected from them. Here, the pairing of the first convolutional layer and the second convolutional layer means that their combination can realize a depthwise separable convolution; the first convolutional layer may be a depth convolutional layer whose layer parameters include the depth convolution parameter, and the second convolutional layer may be a point convolutional layer whose layer parameters include the point convolution parameter.
Step 730, merging the first convolution layer and the second convolution layer to obtain a merged convolution layer, wherein layer parameters of the merged convolution layer include depth convolution parameters of the first convolution layer and point convolution parameters of the second convolution layer.
In step 730, a depth convolution parameter may be extracted from the layer parameters of the first convolutional layer, a point convolution parameter may be extracted from the layer parameters of the second convolutional layer, and then a merged convolutional layer whose layer parameters include the depth convolution parameter and the point convolution parameter may be obtained through merging between the first convolutional layer and the second convolutional layer.
Step 740, allocate a buffer for the merged convolutional layer.
Optionally, a region may be selected from the SRAM as a cache region to be allocated to the merged convolutional layer according to a specific rule; alternatively, one region may be randomly selected from the SRAM as a buffer region to be allocated to the merged convolution layer.
Step 750, compiling to generate a target neural network model based on the merged convolutional layer, the cache region, and the network layer except the first convolutional layer and the second convolutional layer in the neural network model to be compiled, wherein the target neural network model includes instructions for executing a method for processing a feature map through an artificial intelligence accelerator (e.g., the method for processing the feature map through the artificial intelligence accelerator in the embodiment shown in fig. 4, fig. 5, or fig. 6).
In step 750, the compiler may perform compilation processing on the merged convolutional layer, the cache region, and the network layers except the first convolutional layer and the second convolutional layer in the neural network model to be compiled to generate a binary target neural network model, and the specific compilation processing manner may be any implementable manner according to actual requirements, which is not described in detail in this disclosure.
In the embodiments of the present disclosure, in the compiling stage, the paired first convolutional layer and second convolutional layer in the neural network model to be compiled can be merged into a merged convolutional layer, and a cache region can be allocated for the merged convolutional layer, so that the merged convolutional layer, the cache region, and the network layers other than the first convolutional layer and the second convolutional layer are used in model compilation to generate the target neural network model. In the execution stage, introducing the cache region and using the depth convolution parameter and point convolution parameter of the merged convolutional layer effectively fuses the depth convolution and point convolution operations, achieving an efficient and fast depthwise separable convolution. In addition, because the DDR located outside the artificial intelligence accelerator is not used and the depthwise separable convolution is implemented using only the cache inside the artificial intelligence accelerator, the operation speed of the depthwise separable convolution can be effectively increased.
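A compile-time pass along the lines of steps 710 through 750 might look like the following sketch; the list-of-dicts intermediate representation, its field names, and the cache-region size are hypothetical, and actual instruction generation is reduced to returning the rewritten layer list.

    def merge_depthwise_pointwise(model):
        # Compile-time pass (steps 720-740): fuse each depth convolution layer
        # with the single point convolution layer that consumes its output, and
        # attach a cache-region allocation to the fused layer. Each layer is a
        # dict with 'name', 'type', 'params', 'inputs'; layers are assumed to
        # be listed in topological order.
        merged_model, fused_away = [], set()
        for layer in model:
            if layer['name'] in fused_away:
                continue                  # already absorbed into a merged layer
            if layer['type'] == 'depthwise_conv':
                # Step 720: the output must be used by exactly one network
                # layer, and that layer must be a point convolution layer.
                users = [l for l in model if layer['name'] in l['inputs']]
                if len(users) == 1 and users[0]['type'] == 'pointwise_conv':
                    pw = users[0]
                    merged_model.append({
                        # Keep the point layer's name so downstream inputs resolve.
                        'name': pw['name'],
                        'type': 'dwpw_conv',             # merged convolutional layer
                        'params': {'dw': layer['params'], 'pw': pw['params']},
                        'inputs': layer['inputs'],
                        'cache_region_bytes': 64 * 1024, # step 740, size assumed
                    })
                    fused_away.add(pw['name'])
                    continue
            merged_model.append(layer)   # other network layers pass through (step 750)
        return merged_model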
Based on the embodiment shown in fig. 7, as shown in fig. 8, step 720 includes steps 7201, 7203 and 7205.
Step 7201, determining output result usage information of the first convolution layer in the neural network model to be compiled.
Optionally, the output result usage information of the first convolutional layer may be extracted from the neural network model to be compiled; this information characterizes which network layers in the neural network model to be compiled the output result of the first convolutional layer is provided to. The output result usage information may also be referred to as a use-def relationship.
Step 7203, determining the number and type of network layers using the output result of the first convolution layer in the neural network model to be compiled based on the output result usage information.
In one example, the first convolutional layer is denoted conv1, and the output result usage information obtained in step 7201 has the form conv2_output = conv(conv1_output); it can then be determined that the output result of conv1 is provided only to the network layer conv2, and that the type of conv2 is a convolutional layer.
In another example, the first convolutional layer is denoted conv1, and the output result usage information obtained in step 7201 has the form conv2_output = conv(conv1_output) together with ReLU1_output = relu(conv1_output); it can then be determined that the output result of conv1 is provided to both the conv2 and ReLU1 network layers, and that the type of conv2 is a convolutional layer while the type of ReLU1 is a rectified linear unit layer.
Step 7205, a second convolutional layer paired with the first convolutional layer is determined based on the number and type of the network layers using the output result.
In one embodiment, step 7205 includes:
in response to the number of network layers using the output result being a second preset number and the type of network layers using the output result being a point convolution layer, determining the network layer using the output result as a second convolution layer paired with the first convolution layer.
Alternatively, the second preset number may be 1.
In step 7205, it may be determined whether the number of network layers using the output result of the first convolution layer is 1 and whether the type of network layer using the output result is a point convolution layer.
If the number of network layers using the output result of the first convolutional layer is 1 and the type of that network layer is a point convolutional layer, the network layer using the output result may be determined as the second convolutional layer paired with the first convolutional layer, and the merging of the first convolutional layer and the second convolutional layer may then be performed.
If the number of network layers using the output result of the first convolutional layer is not 1, and/or the type of the network layer using the output result is not a point convolutional layer, it may be determined that no second convolutional layer paired with the first convolutional layer exists in the neural network model to be compiled, and no merging needs to be performed subsequently.
In this way, by comparing the number of network layers using the output result of the first convolutional layer with a specific number and comparing their type with a specific type, the second convolutional layer paired with the first convolutional layer can be determined efficiently and quickly.
In the embodiment of the disclosure, the number and the type of network layers using the output result of the first convolutional layer in the neural network model to be compiled can be efficiently and quickly determined by referring to the output result use information of the first convolutional layer in the neural network model to be compiled, so that the second convolutional layer paired with the first convolutional layer can be efficiently and quickly determined based on the determination result.
In an optional example, in the compiling stage, suppose the neural network model to be compiled contains a layer conv1 whose type is a depth convolutional layer (equivalent to the first convolutional layer above). If, based on the output result usage relation of conv1, it is determined that the output result of conv1 is used by only one network layer in the neural network model to be compiled, and that network layer's type is a point convolutional layer, then that network layer may be taken as the convolutional layer paired with conv1 (equivalent to the second convolutional layer above).
Assuming that the convolutional layer paired with conv1 is conv2, conv1 and conv2 may be merged to obtain a merged network layer (which may be referred to as DwPw Conv), and the layer parameters of the merged network layer may include the depth convolution parameter of conv1 (which may be referred to as Depthwise Conv Weight) and the point convolution parameter of conv2 (which may be referred to as Pointwise Conv Weight). In addition, a cache region (which may be referred to as a temporary buffer) may be allocated to the merged network layer, and instructions may be generated by compiling based on the merged network layer, the cache region, and the network layers other than conv1 and conv2 in the neural network model to be compiled.
In the execution stage, the instructions generated in the compiling stage may be executed. Specifically, as shown in fig. 9, a depth convolution operation may be performed on the Input Feature map of the merged convolutional layer using the depth convolution parameter among the layer parameters of the merged network layer; the resulting depth convolution results are sequentially cached into the cache region, and a point convolution operation is performed on the depth convolution results in the cache region using the point convolution parameter among the layer parameters of the merged network layer, so that the point convolution results are used to generate the output feature map of the merged convolutional layer.
Assuming the input and output of conv1 are expressed as conv1_output = conv(input), and the input and output of conv2 as conv2_output = conv(conv1_output), the input and output of the merged convolutional layer can be expressed as conv2_output = DWPW_conv(input).
In summary, the embodiments of the present disclosure take full advantage of the facts that the depth convolution operation is fast, that the point convolution operation is computationally simple, and that the point convolution depends only on part of the depth convolution results. By introducing the cache region, the point convolution operation no longer has to wait until all data in the input feature map have completed the depth convolution operation, thereby effectively improving the operation speed of the depthwise separable convolution.
Any method for processing a feature map via an artificial intelligence accelerator provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capability, including but not limited to terminal devices, servers, and the like. Alternatively, any such method may be executed by a processor; for example, the processor may execute any method for processing a feature map via an artificial intelligence accelerator mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. This will not be repeated below.
Any of the neural network model compiling methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like. Alternatively, the compiling method of any one of the neural network models provided by the embodiments of the present disclosure may be executed by a processor, for example, the processor may execute the compiling method of any one of the neural network models mentioned by the embodiments of the present disclosure by calling a corresponding instruction stored in a memory.
Exemplary devices
Fig. 10 is a schematic structural diagram of an apparatus for processing a feature map by an artificial intelligence accelerator according to an exemplary embodiment of the present disclosure. The apparatus shown in fig. 10 includes a first determination module 1010, a second determination module 1020, a first operation module 1030, a cache module 1040, a second operation module 1050, and a third determination module 1060.
A first determining module 1010, configured to determine a merged convolutional layer in the target neural network model, where layer parameters of the merged convolutional layer include a depth convolution parameter and a point convolution parameter;
a second determining module 1020, configured to determine a cache region corresponding to the merged convolutional layer determined by the first determining module 1010;
a first operation module 1030, configured to perform a depth convolution operation on the input feature map of the merged convolution layer based on the depth convolution parameter of the merged convolution layer determined by the first determination module 1010;
the buffer module 1040 is configured to buffer at least two depth convolution results obtained by the depth convolution operation of the first operation module 1030 to the buffer area determined by the second determination module 1020 in sequence, where any two depth convolution results of the at least two depth convolution results correspond to different local areas in the input feature map;
a second operation module 1050, configured to perform a point convolution operation on the depth convolution result cached in the cache region determined by the second determination module 1020, based on the point convolution parameter of the merged convolution layer determined by the first determination module 1010;
the third determining module 1060 is configured to determine the output feature map of the merged convolutional layer determined by the first determining module 1010 based on the dot convolution result obtained by the dot convolution operation of the second calculating module 1050.
In an alternative example, as shown in fig. 11, the caching module 1040 includes:
the first determining submodule 10401 is configured to determine a first cache sub-area in the cache area;
the cache submodule 10403 is configured to cache at least two depth convolution results obtained by the depth convolution operation of the first operation module 1030 to a first cache sub-area in the cache area determined by the first determination submodule 10401 in sequence;
the second operation module 1050 includes:
a second determining submodule 10501, configured to determine a second cache sub-area in the cache area;
the dump submodule 10503 is configured to dump the depth convolution result cached in the first cache sub-area determined by the first determining submodule 10401 to the second cache sub-area determined by the second determining submodule 10501;
the first operation sub-module 10505 is configured to perform a point convolution operation on the depth convolution result cached in the second cache sub-region determined by the second determination sub-module 10501, based on the point convolution parameter of the merged convolution layer determined by the first determination module 1010.
In an alternative example, as shown in fig. 12, the second operation module 1050 includes:
a third determining submodule 10507, configured to determine a first size relationship between the space occupancy of the cache area determined by the second determining module 1020 and the preset space occupancy;
a fourth determining submodule 10509, configured to determine a second size relationship between the number of the depth convolution results cached in the cache area determined by the second determining module 1020 and the first preset number;
a fifth determining submodule 10511, configured to determine a time relationship between the current time and a preset time;
the second operation sub-module 10513 is configured to, in response to at least one of the first size relationship determined by the third determination sub-module 10507, the second size relationship determined by the fourth determination sub-module 10509, and the time relationship determined by the fifth determination sub-module 10511 being used for triggering a point convolution operation, perform the point convolution operation on the depth convolution result buffered in the buffer determined by the second determination module 1020 based on the point convolution parameter of the merged convolution layer determined by the first determination module 1010.
Fig. 13 is a schematic structural diagram of a compiling apparatus of a neural network model according to an exemplary embodiment of the present disclosure. The apparatus shown in fig. 13 includes a fourth determining module 1310, a fifth determining module 1320, a merging module 1330, an allocating module 1340, and a generating module 1350.
A fourth determining module 1310, configured to determine a neural network model to be compiled;
a fifth determining module 1320, configured to determine a first convolutional layer and a second convolutional layer that are paired from the neural network model to be compiled determined by the fourth determining module 1310;
a merging module 1330, configured to merge the first convolutional layer and the second convolutional layer determined by the fifth determining module 1320 to obtain a merged convolutional layer, where layer parameters of the merged convolutional layer include a depth convolution parameter of the first convolutional layer and a point convolution parameter of the second convolutional layer;
an allocating module 1340, configured to allocate a buffer area for the merged convolutional layer obtained by the merging module 1330;
a generating module 1350, configured to compile and generate a target neural network model based on the merged convolutional layer obtained by the merging module 1330, the buffer allocated by the allocating module 1340, and the network layers determined by the fourth determining module 1310 in the neural network model to be compiled, except for the first convolutional layer and the second convolutional layer determined by the fifth determining module 1320, where the target neural network model includes instructions for performing the above method for processing the feature map by the artificial intelligence accelerator.
In an alternative example, as shown in fig. 14, the fifth determining module 1320 includes:
a sixth determining submodule 13201, configured to determine output result usage information of the first convolutional layer in the neural network model to be compiled, which is determined by the fourth determining module 1310;
a seventh determining submodule 13203, configured to determine, based on the output result usage information determined by the sixth determining submodule 13201, the number and the type of network layers that use the output result of the first convolutional layer in the neural network model to be compiled;
an eighth determining submodule 13205, configured to determine the second convolutional layer paired with the first convolutional layer based on the number and the type of the network layers using the output result determined by the seventh determining submodule 13203.
In an alternative example, as shown in fig. 14, the eighth determination submodule 13205 includes:
a judging unit 132051, configured to judge whether the number of the network layers using the output result determined by the seventh determining submodule 13203 is a second preset number, and whether the type of those network layers is a point convolution layer;
a determining unit 132053, configured to, in response to the judging unit 132051 determining that the number of the network layers using the output result is the second preset number and that their type is a point convolution layer, determine the network layer using the output result as the second convolutional layer paired with the first convolutional layer.
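A pairing check along the lines described above, reusing the hypothetical Layer type from the previous sketch (the preset consumer count of 1 is likewise an assumption):

```python
SECOND_PRESET_NUMBER = 1                        # assumed: exactly one consuming layer

def find_paired_pointwise(first_conv: Layer):
    # Output-result usage information: which layers consume first_conv's output.
    consumers = first_conv.consumers
    if (len(consumers) == SECOND_PRESET_NUMBER
            and all(c.kind == "pointwise" for c in consumers)):
        return consumers[0]                     # the second convolutional layer to merge
    return None                                 # no safe pairing; leave the layer as-is
```

Returning None covers the case where the depthwise layer's output feeds multiple layers or a non-pointwise layer, in which case merging would change the network's semantics.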
Exemplary electronic device
Next, an electronic device according to an embodiment of the present disclosure is described with reference to fig. 15. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from both; such a stand-alone device may communicate with the first device and the second device to receive acquired input signals from them.
FIG. 15 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 15, the electronic device 1500 includes one or more processors 1510 and memory 1520.
The processor 1510 may be a Central Processing Unit (CPU) or another form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1500 to perform desired functions.
The memory 1520 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 1510 to implement the method for processing a feature map by an artificial intelligence accelerator or the method for compiling a neural network model of the various embodiments of the present disclosure described above. Various contents such as an input signal, a signal component, and a noise component may also be stored in the computer-readable storage medium.
In one example, the electronic device 1500 may further include an input device 1530 and an output device 1540, interconnected by a bus system and/or another form of connection mechanism (not shown).
For example, when the electronic device is the first device or the second device, the input device 1530 may be a microphone or a microphone array. When the electronic device is a stand-alone device, the input device 1530 may be a communication network connector for receiving the collected input signals from the first device and the second device.
The input device 1530 may also include, for example, a keyboard, a mouse, etc.
The output device 1540 may output various information, including determined distance information, direction information, and the like, to the outside. The output device 1540 may include, for example, a display, speakers, and a printer, as well as a communication network and remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 1500 relevant to the present disclosure are shown in fig. 15, and components such as buses, input/output interfaces, and the like are omitted. In addition, electronic device 1500 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in a method of processing a feature map or a method of compiling a neural network model with an artificial intelligence accelerator according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
Program code of the computer program product for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of compiling a neural network model or the method of processing a feature map by an artificial intelligence accelerator according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts in each embodiment are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and may be used interchangeably therewith. As used herein, the word "or" refers to, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. Such decomposition and/or recombination should be considered as equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method of processing a feature map by an artificial intelligence accelerator, comprising:
determining a merged convolutional layer in a target neural network model, wherein layer parameters of the merged convolutional layer comprise a depth convolution parameter and a point convolution parameter;
determining a cache region corresponding to the merged convolution layer;
performing a depth convolution operation on the input feature map of the merged convolutional layer based on the depth convolution parameter;
sequentially caching at least two depth convolution results obtained by the depth convolution operation to the cache region, wherein any two depth convolution results in the at least two depth convolution results correspond to different local regions in the input feature map;
performing a point convolution operation on the depth convolution result cached in the cache region based on the point convolution parameter; and
determining an output feature map of the merged convolutional layer based on a point convolution result obtained by the point convolution operation.
2. The method of claim 1, wherein,
the sequentially caching at least two depth convolution results obtained by the depth convolution operation into the cache region comprises:
sequentially caching the at least two depth convolution results obtained by the depth convolution operation to a first cache sub-region in the cache region;
the performing a point convolution operation on the depth convolution result cached in the cache region based on the point convolution parameter comprises:
transferring the depth convolution result cached in the first cache sub-region to a second cache sub-region in the cache region; and
performing the point convolution operation on the depth convolution result cached in the second cache sub-region based on the point convolution parameter.
3. The method of claim 1, wherein the performing a point convolution operation on the depth convolution result cached in the cache region based on the point convolution parameter comprises:
determining a first size relation between the space occupancy rate of the cache region and a preset space occupancy rate;
determining a second size relation between the number of the depth convolution results cached in the cache region and a first preset number;
determining a time relation between the current time and a preset time;
performing the point convolution operation on the depth convolution result cached in the cache region based on the point convolution parameter, in response to at least one of the first size relation, the second size relation, and the time relation indicating that the point convolution operation should be triggered.
4. A method of compiling a neural network model, comprising:
determining a neural network model to be compiled;
determining a first convolutional layer and a second convolutional layer which are paired from the neural network model to be compiled;
merging the first convolutional layer and the second convolutional layer to obtain a merged convolutional layer, wherein layer parameters of the merged convolutional layer comprise a depth convolution parameter of the first convolutional layer and a point convolution parameter of the second convolutional layer;
allocating a cache region for the merged convolutional layer;
compiling to generate a target neural network model based on the merged convolutional layer, the cache region, and network layers of the neural network model to be compiled other than the first convolutional layer and the second convolutional layer, wherein the target neural network model comprises instructions for performing the method for processing a feature map by an artificial intelligence accelerator according to any one of claims 1 to 3.
5. The method of claim 4, wherein the determining paired first and second convolutional layers from the neural network model to be compiled comprises:
determining output result use information of a first convolution layer in the neural network model to be compiled;
determining the number and the type of network layers using the output result of the first convolution layer in the neural network model to be compiled based on the output result use information;
determining a second convolutional layer paired with the first convolutional layer based on the number and type of network layers using the output result.
6. The method of claim 5, wherein the determining a second convolutional layer paired with the first convolutional layer based on the number and type of network layers using the output result comprises:
in response to the number of network layers using the output result being a second preset number and the type of network layers using the output result being a point convolution layer, determining the network layer using the output result as a second convolution layer paired with the first convolution layer.
7. An apparatus for processing a feature map with an artificial intelligence accelerator, comprising:
a first determining module, configured to determine a merged convolutional layer in a target neural network model, where layer parameters of the merged convolutional layer include a depth convolution parameter and a point convolution parameter;
a second determining module, configured to determine a cache region corresponding to the merged convolutional layer determined by the first determining module;
a first operation module, configured to perform a depth convolution operation on an input feature map of the merged convolutional layer based on the depth convolution parameter of the merged convolutional layer determined by the first determination module;
a cache module, configured to sequentially cache at least two depth convolution results obtained through the depth convolution operation of the first operation module into the cache region determined by the second determining module, wherein any two depth convolution results of the at least two depth convolution results correspond to different local regions in the input feature map;
a second operation module, configured to perform a point convolution operation on the depth convolution result cached in the cache region determined by the second determination module based on the point convolution parameter of the merged convolution layer determined by the first determination module;
a third determining module, configured to determine, based on a point convolution result obtained by the point convolution operation of the second operation module, an output feature map of the merged convolution layer determined by the first determining module.
8. An apparatus for compiling a neural network model, comprising:
a fourth determining module, configured to determine a neural network model to be compiled;
a fifth determining module, configured to determine a first convolutional layer and a second convolutional layer that are paired from the neural network model to be compiled determined by the fourth determining module;
a merging module, configured to merge the first convolutional layer and the second convolutional layer determined by the fifth determining module to obtain a merged convolutional layer, wherein layer parameters of the merged convolutional layer comprise a depth convolution parameter of the first convolutional layer and a point convolution parameter of the second convolutional layer;
an allocating module, configured to allocate a cache region for the merged convolutional layer obtained by the merging module;
a generating module, configured to compile and generate a target neural network model based on the merged convolutional layer obtained by the merging module, the cache region allocated by the allocating module, and the network layers, determined by the fourth determining module, of the neural network model to be compiled other than the first convolutional layer and the second convolutional layer determined by the fifth determining module, wherein the target neural network model comprises instructions for performing the method for processing a feature map by an artificial intelligence accelerator according to any one of claims 1 to 3.
9. A computer-readable storage medium storing a computer program for executing the method for processing a feature map by an artificial intelligence accelerator according to any one of claims 1 to 3, or executing the method for compiling a neural network model according to any one of claims 4 to 6.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method for processing a feature map by an artificial intelligence accelerator according to any one of claims 1 to 3, or to implement the method for compiling a neural network model according to any one of claims 4 to 6.
CN202211152487.6A 2022-09-21 2022-09-21 Method and apparatus for processing feature maps via an artificial intelligence accelerator Pending CN115481732A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211152487.6A CN115481732A (en) 2022-09-21 2022-09-21 Method and apparatus for processing feature maps via an artificial intelligence accelerator

Publications (1)

Publication Number Publication Date
CN115481732A (en) 2022-12-16

Family

ID=84393048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211152487.6A Pending CN115481732A (en) 2022-09-21 2022-09-21 Method and apparatus for processing feature maps via an artificial intelligence accelerator

Country Status (1)

Country Link
CN (1) CN115481732A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952388A (en) * 2023-03-13 2023-04-11 南京砺算科技有限公司 Convolution operation method, device, processor and medium based on image data processing
CN115952388B (en) * 2023-03-13 2023-05-30 南京砺算科技有限公司 Convolution operation method, device, processor and medium based on image data processing
CN116596043A (en) * 2023-07-13 2023-08-15 杭州菲数科技有限公司 Convolutional neural network calculation method, system, electronic equipment and storage medium
CN116596043B (en) * 2023-07-13 2023-10-13 杭州菲数科技有限公司 Convolutional neural network calculation method, system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115481732A (en) Method and apparatus for processing feature maps via an artificial intelligence accelerator
CN110390387B (en) Assessment of resources used by deep learning applications
US20190324772A1 (en) Method and device for processing smart contracts
CN110033091B (en) Model-based prediction method and device
KR102305023B1 (en) Key frame scheduling method and apparatus, electronic device, program and medium
CN107315646B (en) Method and device for controlling data flow between page components
CN109684008A (en) Card rendering method, device, terminal and computer readable storage medium
CN110020383B (en) Page data request processing method and device
CN113204412A (en) Method, electronic device, and computer storage medium for task scheduling
CN111913743B (en) Data processing method and device
JP2022551249A (en) IMAGE PROCESSING METHOD, IMAGE PROCESSING COMMAND GENERATION METHOD AND APPARATUS
CN109064382B (en) Image information processing method and server
CN111324258A (en) Method, device, equipment and medium for generating contents of configuration items of multilevel pull-down menu
CN114201727A (en) Data processing method, processor, artificial intelligence chip and electronic equipment
CN111915017A (en) Calibration method, calibration device, terminal equipment and storage medium
CN110636105B (en) Tree graph obtaining method and device, storage medium and electronic equipment
CN110333870B (en) Simulink model variable distribution processing method, device and equipment
CN111832714B (en) Operation method and device
US20150294045A1 (en) Apparatus and method for modeling of hybrid system
CN112132274B (en) Feature map full-connection convolution method and device, readable storage medium and electronic equipment
CN109614234B (en) Resource optimization method, device, medium and electronic equipment
CN109800057B (en) Object calling method, device and storage medium
CN112766474B (en) Method, device, medium and electronic equipment for realizing convolution operation
CN117215581A (en) React-based interface element layout method and device, medium and electronic equipment
CN114237624A (en) Go language-based error processing method, device, equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination