CN111931927A - Method and device for reducing occupation of computing resources in NPU - Google Patents

Method and device for reducing occupation of computing resources in NPU

Info

Publication number
CN111931927A
Authority
CN
China
Prior art keywords
kernel
convolution
size
convolution kernel
npu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011114887.9A
Other languages
Chinese (zh)
Other versions
CN111931927B (en)
Inventor
戴舒诣
范名超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aojie Intelligent Technology Shanghai Co ltd
Original Assignee
Aojie Intelligent Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aojie Intelligent Technology Shanghai Co ltd filed Critical Aojie Intelligent Technology Shanghai Co ltd
Priority to CN202011114887.9A priority Critical patent/CN111931927B/en
Publication of CN111931927A publication Critical patent/CN111931927A/en
Application granted granted Critical
Publication of CN111931927B publication Critical patent/CN111931927B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides a method and a device for reducing computing resource occupation in an NPU (neural network processing unit), where the NPU comprises a systolic computing array (PE array) module for convolution operation. First, the shared division coefficient kernel_size of average pooling is obtained; the kernel_size is then expanded into a first convolution kernel, and the first convolution kernel is input to the PE array module, through which the average pooling is realized. No independent module is set up for average pooling; instead, the operation is mapped onto the PE array module originally used mainly for convolution, i.e., converted into a convolution operation of the same type, so that the original PE array module can directly compute an arbitrary average pooling layer. This avoids the occupation of hardware computing resources by an independent average pooling module, relieves the shortage of hardware computing resources, and reduces the power consumption of the NPU.

Description

Method and device for reducing occupation of computing resources in NPU
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for reducing computing resource occupation in an NPU.
Background
Artificial Intelligence (AI) is a branch of computer science that studies the design principles and implementation methods of intelligent machines, giving machines the capabilities of perception, reasoning and decision making. With the rapid development of artificial intelligence technology, neural networks (e.g., deep neural networks, convolutional neural networks, etc.) have in recent years achieved great success in processing and analyzing various kinds of media information such as images, videos and voice.
With the development of neural networks, technicians have begun to design hardware devices, such as dedicated processors, for neural networks in order to obtain better performance. An NPU accelerator is a dedicated processor capable of supporting multiple neural networks; it is optimized for AI algorithms, has acceleration units dedicated to them, and implements the fast calculation of certain specific formulas of AI algorithms at the hardware level. In the NPU, the PE array (Processing Element array) module is the key module implementing the core convolution operations of a convolutional neural network; specifically, it implements the dot-multiplication and addition of an image and a kernel. The PE array occupies a large amount of computing resources in the hardware FPGA/chip, and performs parallel operations while working so as to achieve efficient computation.
However, in the process of implementing the scheme of the present application, the inventor found that, besides the PE array module for core convolution calculation, which occupies most of the hardware computing resources, the module implementing the average pooling operation of a pooling layer also needs considerable hardware computing resources, which easily causes a shortage of FPGA hardware computing resources and an increase in power consumption.
In the prior art, when facing a shortage of hardware resources, optimization is often performed inside the independent average pooling module, saving computing resources by adjusting the segmented calculation flow. For example, in one prior art, in view of the hardware resource problem, the computing unit that processes one path of data in the average pooling module may be reduced to an adder and a comparator to lower the hardware overhead; however, since at least N paths of data need to be processed in parallel, the hardware actually needs N computing units, and the computing resources occupied by the average pooling module remain large. In another prior art, although the overall flow of the accelerator is adjusted, an independent average pooling module is still used; the remaining hardware computing resources of the convolution module are still consumed or contended for, and the overall power consumption increases while the accelerator operates.
Disclosure of Invention
The application provides a method and a device for reducing the occupation of computing resources in an NPU (neural network processing unit), so as to relieve the shortage of hardware computing resources in the NPU.
According to a first aspect of embodiments of the present application, there is provided a method for reducing computing resource occupation in an NPU, the method being used for a neural network processing unit (NPU), wherein the NPU includes a systolic computing array (PE array) module for convolution operation;
the method comprises the following steps:
obtaining the shared division coefficient kernel_size of average pooling, wherein the kernel_size is the size of the pooling kernel used for the average pooling;
expanding the kernel_size into a first convolution kernel, wherein the first convolution kernel is a convolution kernel for depthwise convolution calculation;
inputting the first convolution kernel to the PE array module to realize the average pooling through the PE array module.
Optionally, expanding the kernel_size into a first convolution kernel includes:
acquiring the size of the pooling kernel and the reciprocal of the kernel_size;
generating the first convolution kernel, wherein the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
Optionally:
the PE array module consists of a plurality of data computing blocks tile, and each tile consists of a plurality of data computing slices slice; when the PE array module works, the input feature graph is divided into a plurality of copies with the same quantity as the tiles, and the tiles are respectively input to carry out convolution operation independently; slice pulses in each column of the PE array module share the same weight, slice pulses in each tile share the same image, and the convolution result is output in a pulsating mode.
Optionally:
each slice comprises an addition tree, an accumulator and a plurality of parallel multipliers; when convolution operation is carried out, each multiplier is used for carrying out point multiplication on the input divided two-dimensional images corresponding to different channels and the weights respectively, obtaining multi-channel sum through the addition tree, and sending the sum to the accumulator for accumulation; when the average pooling is carried out, only one multiplier in the plurality of parallel multipliers is used for carrying out dot multiplication on the multiplication factor w in the first convolution kernel and the image pixel point corresponding to the multiplication factor w, and the dot multiplication result is directly input into the accumulator so as to accumulate all dot multiplication results on the first convolution kernel plane.
Optionally, when the average pooling is a global average pooling, the method further includes:
dividing the input feature map into as many parts as there are tiles, one part per tile;
dividing the input first convolution kernel correspondingly into parts of the same size as the divided input feature map; each tile performs average pooling on its divided image, and the divided pooling results output by the slices of each column of the PE array are summed.
Optionally, before obtaining the shared division coefficient kernel_size of average pooling, the method further includes:
when the running layer of the NPU is a convolution layer, directly transmitting the convolution kernel output by the weight cache to the PE array module to realize convolution operation;
when the running layer of the NPU is an average pooling layer, performing the step of obtaining the shared division coefficient kernel_size of average pooling and the steps thereafter.
According to a second aspect of the embodiments of the present application, there is provided an apparatus for reducing computing resource occupation in an NPU, the apparatus being used for a neural network processing unit (NPU), wherein the NPU includes a systolic computing array (PE array) module for convolution operation;
the device comprises:
a convolution kernel obtaining unit, configured to obtain the shared division coefficient kernel_size of average pooling, wherein the kernel_size is the size of the pooling kernel used for the average pooling, and to expand the kernel_size into a first convolution kernel, wherein the first convolution kernel is a convolution kernel for depthwise convolution calculation;
a convolution kernel input unit, configured to input the first convolution kernel to the PE array module to realize the average pooling through the PE array module.
Optionally, when expanding the kernel_size into the first convolution kernel, the convolution kernel obtaining unit is specifically configured to:
acquire the size of the pooling kernel and the reciprocal of the kernel_size;
generate the first convolution kernel, wherein the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
Optionally:
the PE array module consists of a plurality of data computing blocks tile, and each tile consists of a plurality of data computing slices slice; when the PE array module works, the input feature graph is divided into a plurality of copies with the same quantity as the tiles, and the tiles are respectively input to carry out convolution operation independently; slice pulses in each column of the PE array module share the same weight, slice pulses in each tile share the same image, and the convolution result is output in a pulsating mode.
Optionally:
each slice comprises an addition tree, an accumulator and a plurality of parallel multipliers; when convolution operation is carried out, each multiplier is used for carrying out point multiplication on the input divided two-dimensional images corresponding to different channels and the weights respectively, obtaining multi-channel sum through the addition tree, and sending the sum to the accumulator for accumulation; when the average pooling is carried out, only one multiplier in the plurality of parallel multipliers is used for carrying out dot multiplication on the multiplication factor w in the first convolution kernel and the image pixel point corresponding to the multiplication factor w, and the dot multiplication result is directly input into the accumulator so as to accumulate all dot multiplication results on the first convolution kernel plane.
Optionally, the apparatus further comprises:
a global average pooling acceleration unit, configured to, when the average pooling is global average pooling, divide the input feature map into as many parts as there are tiles, one part per tile, divide the input first convolution kernel correspondingly into parts of the same size as the divided input feature map, have each tile perform average pooling on its divided image, and sum the divided pooling results output by the slices of each column of the PE array.
Optionally, the apparatus further comprises:
a weight selection unit, configured to, before the convolution kernel obtaining unit is triggered, directly transmit the convolution kernel output by the weight cache to the PE array module to realize convolution operation when the running layer of the NPU is a convolution layer, and to trigger the convolution kernel obtaining unit and the convolution kernel input unit when the running layer of the NPU is an average pooling layer.
According to a third aspect of the embodiments of the present application, there is provided a neural network processing unit (NPU), where the NPU includes any one of the above-mentioned devices for reducing the occupation of computing resources in the NPU and the systolic computing array (PE array) module for convolution operation.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the application provides a method for reducing the occupation of computing resources in an NPU, wherein the NPU comprises a ripple computing array PE array module for convolution operation. Firstly, obtaining an average pooled shared division coefficient kernel _ size, then expanding the kernel _ size into a first convolution kernel, namely a convolution kernel for deep convolution calculation, and inputting the first convolution kernel into the PE array module so as to realize the average pooling through the PE array module. Compared with the prior art in which a neural network accelerator can use extra computing resources for the average power, and set independent modules, an independent module is not set for the average power in the scheme of the application, but an original PE array module for convolution operation in an NPU is utilized, and on the basis that the module can process convolution operation, the average power is mapped to the PE array module which is originally mainly used for realizing convolution operation under the condition that the output efficiency of the average power in the NPU is not reduced, that is, the average power is converted into one type of convolution operation, so that the original PE array module can directly realize arbitrary average power layer calculation, occupation of hardware computing resources by an independent average pooling module is avoided, the condition of shortage of hardware computing resources is relieved, and the power consumption of the NPU is also reduced.
In addition, aiming at the special scene of global average potential in the average potential, the method and the device make some adjustments on the basis of the original PE array, sum the parallel pooling results of the segmented images, namely, after the input image is segmented, multiply-add operation is performed on each part of the images in parallel, and the data pulsation output characteristic of the PE array is utilized, and only one accumulator is used for each column, so that the higher-level parallel operation is realized, and the operational efficiency of the global average potential is improved under the condition of reducing the resource power consumption to the maximum extent.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; obviously, those skilled in the art can obtain other drawings from them without inventive effort. Furthermore, these descriptions should not be construed as limiting the embodiments; elements with the same reference number designation are identified as similar elements throughout the figures, and the drawings are not to scale unless otherwise specified.
FIG. 1 is a schematic diagram of an average pooling calculation process;
FIG. 2 is a schematic diagram of a depth convolution calculation process;
fig. 3 is a schematic flow chart of a method for reducing occupation of computing resources in an NPU according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a PE array in an embodiment of the present application;
FIG. 5 is a schematic diagram of a slice used in convolution operation in the embodiment of the present application;
FIG. 6 is a schematic diagram of a convolution kernel extended according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a weight selector in an embodiment of the present application;
FIG. 8 is a schematic diagram of a slice used in average pooling in an embodiment of the present application;
FIG. 9 is a schematic diagram of a global average pooling calculation process;
FIG. 10 is a diagram illustrating PE array applied to global average pooling in an embodiment of the present application;
fig. 11 is a schematic diagram of an apparatus for reducing occupation of computing resources in an NPU according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described in detail below with reference to the drawings in the embodiments of the present application. When referring to the drawings, the same numbers in different drawings represent the same or similar elements unless otherwise specified. It should be apparent that the described examples are only a part of the examples of the present application and not all of them; the embodiments described in the following exemplary examples do not represent all embodiments consistent with the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and the like in the description, claims, and drawings of the embodiments of the present application are used for distinguishing between different objects and not for limiting a particular order. In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," should not be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
The embodiment of the application can be applied to many fields in artificial intelligence, such as intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, safe cities and other fields.
In particular, the embodiment of the application can be applied to the fields of image classification, image retrieval, image semantic segmentation, image super-resolution, natural language processing and the like which need to use a (deep) neural network.
For the convenience of understanding, the following briefly introduces terms and concepts related to deep neural networks, convolution kernels, pooling layers, and the like:
A Deep Neural Network (DNN), also called a multi-layer neural network, may be understood as a neural network with multiple hidden layers. According to the positions of the layers, the layers of a DNN can be divided into three categories: input layer, hidden layers and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the middle layers are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is connected with any neuron of the (i+1)-th layer. In a deep neural network, more hidden layers make the network better able to depict complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices; its final purpose is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network may consist of many neural network layers: layers of several different types, usually convolutional, activation, pooling and fully-connected layers, alternate, the depth of each filter in the network increases from left to right, and the output part usually consists of one or more fully-connected layers. The convolutional neural network contains a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer that performs convolution processing on the input signal in a convolutional neural network. In a convolutional layer, one neuron may be connected to only a portion of the neurons of the neighboring layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of a number of neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights may be understood as the way image information is extracted being independent of location. The convolution kernel can be initialized in the form of a matrix of random size, and reasonable weights can be learned during the training of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing the connections between layers of the convolutional neural network, while also reducing the risk of overfitting.
Convolutional layers:
it may comprise a number of convolution operators, also called kernels, which in image processing act as a filter to extract specific information from the input image matrix, which may be essentially a weight matrix, usually predefined, whose size should be related to the size of the image. Different weight matrices may be used to extract different features in the image, e.g., one weight matrix to extract image edge information, another weight matrix to extract a particular color of the image, yet another weight matrix to blur unwanted noise in the image, etc. The plurality of weight matrices have the same size (row × column), the sizes of the convolution feature maps extracted by the plurality of weight matrices having the same size are also the same, and the extracted plurality of convolution feature maps having the same size are combined to form the output of the convolution operation. The weight values in the weight matrixes need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used for extracting information from an input image, so that the convolutional neural network can carry out correct prediction.
A pooling layer:
since it is often desirable to reduce the number of training parameters, it is often desirable to periodically introduce pooling layers after a convolutional layer, either one convolutional layer followed by one pooling layer or multiple convolutional layers followed by one or more pooling layers. The role of pooling is embodied in down-sampling: the significant features are reserved, the feature dimension is reduced, and the receptive field of the kernel is increased. In the image processing process, the purpose of the pooling layer is to reduce the space size of the image so as to simplify the network computation complexity and avoid the occurrence of overfitting to a certain extent, and to perform feature compression and extract main features. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to smaller sized images. An average pooling (averaging) operator may calculate pixel values in an image over a certain range to generate an average as a result of the average pooling. The max pooling operator may take the pixel with the largest value in a particular range as the result of the max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.
A neural network layer:
after the convolutional layer/pooling layer processing, the convolutional neural network is not enough to output the required output information. Since, as mentioned above, the convolutional/pooling layers only extract features and reduce the parameters introduced by the input image. However, in order to generate the final output information (required class information or other relevant information), the convolutional neural network needs to generate one or a set of the required number of classes of outputs using the neural network layer. Therefore, the neural network layer may include a plurality of hidden layers and an output layer, and parameters included in the plurality of hidden layers may be obtained by pre-training according to training data related to a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
Referring to fig. 1 and 2, average pooling moves a kernel equal to or smaller than the feature map size over the 2D image (two-dimensional image) of each channel of the input feature map, sums the points of the feature map covered by the kernel frame (e.g., a, b, c, d in fig. 1), and divides the sum by the number of points covered by the kernel to obtain the average value of the pixels within the kernel frame.
Depthwise convolution first performs the corresponding dot multiplication between the pixel points of the 2D image of each channel of the input feature map (e.g., a, b, c, d in fig. 2) and the multiplication coefficients of the kernel (e.g., w0, w1, w2, w3 in fig. 2), and then performs the accumulation.
The inventor found, in the process of implementing the scheme of the application, that average pooling and depthwise convolution actually require essentially the same calculation operations algorithmically; only the order of the steps differs. From the transformation of the formulas shown in fig. 1 and fig. 2, it can be seen that the division coefficient kernel_size used when averaging in average pooling (i.e., the shared division coefficient) can be converted into the multiplication coefficients (w0, w1, w2, w3) of the kernel: the dot multiplication of the multiplication factors in the kernel with the corresponding pixel points of the feature map is performed first, and the summation of the dot-multiplication results afterwards. Therefore, the average pooling operation, which originally needed to be supported by an independent hardware module, can be converted into a depthwise convolution calculation already supported by the PE, relying only on the original PE array (systolic computing array) module, with no independent average pooling module needed, thereby avoiding the consumption of extra hardware computing resources.
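As an illustrative aside (not part of the patent disclosure; Python/NumPy, the 4x4 image, the 2x2 pooling kernel and stride 2 are all assumptions made here for demonstration), the following sketch checks this equivalence numerically: average pooling matches a depthwise convolution whose kernel is filled with the constant factor 1/kernel_size.

```python
import numpy as np

def average_pooling_2x2(image):
    """Reference average pooling: sum the points inside each 2x2 window,
    then divide by kernel_size (here 2*2 = 4)."""
    h, w = image.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = image[i:i+2, j:j+2].sum() / 4.0
    return out

def depthwise_conv_2x2(image, kernel):
    """Depthwise convolution with stride 2: dot-multiply each 2x2 window
    with the kernel first, then accumulate the products."""
    h, w = image.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = (image[i:i+2, j:j+2] * kernel).sum()
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.full((2, 2), 1.0 / 4.0)  # every factor w is 1/kernel_size
assert np.allclose(average_pooling_2x2(image), depthwise_conv_2x2(image, kernel))
```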
Fig. 3 is a schematic flowchart of a method for reducing occupation of computing resources in an NPU according to an embodiment of the present application.
The method can be used for a neural network processing unit (NPU), where the NPU comprises a systolic computing array (PE array) module used for convolution operation.
The following first describes the PE array module in this embodiment. In this embodiment or some other embodiments of the present application, the PE array module may be composed of a plurality of data computing blocks, tiles (the number of tiles may be denoted tile_num), where each tile is composed of a plurality of data computing slices; when the PE array module works, the input feature map is divided into as many parts as there are tiles (i.e., tile_num parts), and each part is input to a tile to perform convolution operation independently; the slices in each column of the PE array module systolically share the same weight, the slices in each tile systolically share the same image, and the convolution result is output in a pulsating manner.
For example, see fig. 4; fig. 4 is a schematic diagram of the PE array in the embodiment of the present application. The PE array is a systolic-matrix hardware computation module designed for efficient pipelined computation of general 3D convolution; fig. 4 shows its general construction and data flow. The PE array is composed of multiple tiles (i.e., PE tiles); 4 PE tiles are shown in fig. 4. When the NPU performs convolution operation, the input feature map is divided into tile_num parts (Data0, Data1 and so on in fig. 4), and each part is input to a tile to perform convolution operation independently. Each tile is composed of a plurality of slices (i.e., PE slices); the slices in each column systolically share the same weight (each weight is a kernel, i.e., an independent 3D convolution kernel; when average pooling is performed, the slices in each tile systolically share the same image), and the convolution result is output in a pulsating manner, realizing efficient parallel computation on pulsating data. For example, in fig. 4, the PE slice0 of each of PE tiles 0-3 constitutes a column and shares the same weight 0.
In FIG. 4, weight0 to weight15 are hardware buffers used to store different 3D kernels; their number is independent of the number of input channels and is determined by the hardware resources expected to be consumed. In practice it represents the number of output image channels processed at once in the hardware. The first convolution kernel is stored in this buffer; one convolution layer has N 3D kernels, N corresponding to the number of output image channels, and the weight buffers are used to store these different 3D kernels.
It should be noted that the specific implementation of the PE array module, such as the number of tiles and slices, is not limited by this embodiment; those skilled in the art may select and design it according to different requirements/scenarios, and these choices and designs may be used herein without departing from the spirit and scope of the present application.
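Purely for illustration, a behavioral Python/NumPy sketch of this data partitioning is given below; the 4-tile geometry, the helper names (split_feature_map, conv3d_valid, pe_array) and the height-wise split are assumptions of this sketch, which models the logical division and the column-shared kernels rather than the cycle-level systolic timing.

```python
import numpy as np

TILE_NUM = 4  # PE tiles, as drawn in fig. 4 (illustrative assumption)

def split_feature_map(feature_map, tile_num=TILE_NUM):
    """Divide the input feature map (channels, height, width) into tile_num
    parts along the height; each part (Data0, Data1, ... in fig. 4) feeds
    one PE tile, which convolves it independently."""
    return np.array_split(feature_map, tile_num, axis=1)

def conv3d_valid(image, kernel):
    """Dot-multiply each 3D window of an image part with one 3D kernel and
    accumulate (stride 1, no padding): the work of one slice over time."""
    c, h, w = image.shape
    kc, kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[:, i:i + kh, j:j + kw] * kernel).sum()
    return out

def pe_array(feature_map, weights):
    """Tile t processes part t; column j of every tile shares weights[j],
    one independent 3D convolution kernel per output channel."""
    parts = split_feature_map(feature_map)
    return [[conv3d_valid(part, w) for w in weights] for part in parts]

x = np.random.rand(3, 16, 8)                            # 3-channel feature map
kernels = [np.random.rand(3, 3, 3) for _ in range(2)]   # two weight buffers
results = pe_array(x, kernels)                          # results[tile][column]
```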
The following describes the data computing slice in the PE array module of this embodiment. As an example, see fig. 5; fig. 5 is a schematic diagram of a slice used in convolution operation in this embodiment of the present application. In this embodiment or some other embodiments of the present application, each slice includes an addition tree, an accumulator, and a plurality of parallel multipliers. When convolution operation is performed, each multiplier dot-multiplies the divided two-dimensional image (2D image) of a different input channel with the corresponding weight, the multi-channel sum is obtained through the addition tree, and the sum is sent to the accumulator for accumulation. When average pooling is performed, only one of the plurality of parallel multipliers is used to dot-multiply the multiplication factor w in the first convolution kernel with the image pixel point corresponding to it, and the dot-multiplication result is input directly into the accumulator so as to accumulate all dot-multiplication results over the first convolution kernel plane.
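As a rough functional model only (the function names, the 2x2 window and the sample values are assumptions; the hardware pipelining is not modeled), the two modes of a slice can be sketched as follows:

```python
import numpy as np

def slice_convolution_step(pixels, weights):
    """Convolution mode: the parallel multipliers dot-multiply the pixels of
    the different input channels with their weights, and the addition tree
    sums the products into one multi-channel partial sum."""
    return float(np.dot(pixels, weights))

def slice_pooling_step(pixel, w):
    """Average-pooling mode: only one multiplier is active; its product
    bypasses the addition tree and goes straight to the accumulator."""
    return pixel * w

# The accumulator, common to both modes, adds partial sums until the whole
# kernel plane is covered (here: a 2x2 pooling window, w = 1/kernel_size = 1/4).
acc = 0.0
for pixel in (1.0, 2.0, 3.0, 4.0):   # a, b, c, d under the pooling kernel
    acc += slice_pooling_step(pixel, 0.25)
print(acc)  # 2.5 == (1 + 2 + 3 + 4) / 4, the average of the window
```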
The method in this embodiment may include the steps of:
s301, obtaining an average pooled shared division coefficient kernel _ size, where the kernel _ size is a size of a pooled kernel used for the average pooling.
The pooling core is a kernel used for average pooling, which is essentially a weight matrix, so the size of the pooling core is the number of rows x the number of columns of the matrix.
S302, expanding the kernel_size into a first convolution kernel, where the first convolution kernel is a convolution kernel used for depthwise convolution calculation.
By observing fig. 1 and fig. 2, the inventors found that average pooling and depthwise convolution actually need to perform substantially the same computational operations algorithmically, only in a different order, so the conversion can be made.
In this embodiment or some other embodiments of the present application, expanding the kernel_size into the first convolution kernel may specifically include:
acquiring the size of the pooling kernel and the reciprocal of the kernel_size;
generating the first convolution kernel, where the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
For example, see fig. 6, where w in fig. 6 is the reciprocal of kernel_size; this scalar can be expanded into a matrix, resulting in a convolution kernel.
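A one-function sketch of this expansion (the helper name expand_to_first_kernel is invented here for illustration; the sketch assumes, as stated above, that kernel_size is the row count times the column count of the pooling kernel):

```python
import numpy as np

def expand_to_first_kernel(pool_rows, pool_cols):
    """Expand the shared division coefficient into the first convolution
    kernel: a matrix of the same size as the pooling kernel in which every
    multiplication factor w equals the reciprocal of kernel_size (fig. 6)."""
    kernel_size = pool_rows * pool_cols   # rows x columns of the pooling kernel
    w = 1.0 / kernel_size
    return np.full((pool_rows, pool_cols), w)

print(expand_to_first_kernel(3, 3))   # a 3x3 kernel, every entry 1/9
```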
S303, inputting the first convolution kernel into the PE array module to realize the average pooling through the PE array module.
Because the PE array module is a module for convolution operation, after converting pooling into convolution, the PE array module can be used to implement pooling operation.
Compared with the prior art, in which a neural network accelerator uses extra computing resources for average pooling and sets up an independent module, the scheme of the application does not set up an independent module for average pooling. Instead, the original PE array module for convolution operations in the NPU is utilized: on the basis that this module can already process convolution operations, and without reducing the output efficiency of average pooling in the NPU, average pooling is mapped onto the PE array module originally used mainly for convolution, that is, converted into one type of convolution operation, so that the original PE array module can directly compute an arbitrary average pooling layer. This avoids the occupation of hardware computing resources by an independent average pooling module, relieves the shortage of hardware computing resources, and also reduces the power consumption of the NPU.
In this embodiment or some other embodiments of the present application, in order not to affect the original function of the PE array, a pre-judgment step may be added: when the average pooling operation needs to be performed, the method is executed; when an ordinary convolution operation needs to be performed, the original flow is followed.
In other words, before obtaining the shared division coefficient kernel_size of average pooling, the method may further include:
when the running layer of the NPU is a convolution layer, directly transmitting the convolution kernel output by the weight cache to the PE array module to realize convolution operation;
when the running layer of the NPU is an average pooling layer, performing the step of obtaining the shared division coefficient kernel_size of average pooling and the steps thereafter.
The present application is further illustrated below with reference to examples:
for PE array, in order to make the averaging discharging mapping operable on PE array to output correct result and reduce the storage time of NPU data and memory power consumption of this layer, in a specific implementation, a weight mux (weight selector) module may be added to a portion of PE array module where weight is input, as shown in fig. 7 for example.
When the NPU runs a normal convolution layer, the weight mux module directly outputs the kernel output by the weight buffer to the PE array. When the NPU runs an average pooling layer, the system inputs the multiplication factor w of the kernel, obtained from kernel_size, as a parameter (weight_avg_scalar) to the weight mux module; the weight mux module expands this parameter into a kernel structure in depthwise convolution form and then inputs it into the PE array for convolution operation, so that the average pooling layer is realized without adjusting the PE array structure. In other words, no weights are cached in bulk in the weight buffer for the pooling layer; only a register of a few bits is needed to store weight_avg_scalar, which is then expanded in real time by the weight mux and fed into the PE.
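A behavioral sketch of this selection logic (the function name weight_mux mirrors the module name in fig. 7; the string layer tags and the Python/NumPy modeling are assumptions of this sketch, not the hardware interface):

```python
import numpy as np

def weight_mux(layer_type, weight_buffer_kernel, weight_avg_scalar, pool_shape):
    """Pass the weight-buffer kernel straight through for a normal
    convolution layer; for an average pooling layer, expand the few-bit
    scalar weight_avg_scalar in real time into a depthwise-style kernel,
    so no weights need to be cached in the weight buffer."""
    if layer_type == "conv":
        return weight_buffer_kernel
    if layer_type == "avg_pool":
        return np.full(pool_shape, weight_avg_scalar)
    raise ValueError(f"unknown layer type: {layer_type}")

# Average pooling layer with a 2x2 pooling kernel: only w = 1/kernel_size
# is stored, and the full kernel is generated on the fly.
kernel = weight_mux("avg_pool", None, 1.0 / 4.0, (2, 2))
```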
For the slice, fig. 5 has already shown a schematic diagram for convolution operation. When the calculation of the average pooling layer is implemented, as shown for example in fig. 8, since the calculation is converted into depthwise convolution and the convolution of the image with the kernel is performed relatively independently for each channel, only one of the plurality of parallel multipliers in the PE slice is used to perform the corresponding dot multiplication of the input image and kernel, which saves power; this realizes the dot multiplication of the kernel multiplication factor w with the image pixel point corresponding to it (for example, a*w0, b*w1 and c*w2 in fig. 2). In fig. 8, a selector (mux) is added to the original slice. When convolution operation is performed, the mux sends the sum obtained by the addition tree to the accumulator for accumulation; when average pooling is performed, the mux inputs the dot product of the single multiplier into the accumulator, and all dot products on the kernel plane are accumulated to obtain the average pooling result a*w0 + b*w1 + c*w2 + d*w3 shown in fig. 2.
In addition, average pooling has a special case: the global average pooling layer, whose kernel is the same size as the image, as shown in fig. 9; that is, the pixel values of the whole image in the 2D plane are averaged. The inventor found, in the process of implementing the scheme of the present application, that when the picture size is large, a long serial accumulation calculation is required whether the pooling layer is implemented with the PE array or in another independent pooling module.
For this particular case, in this embodiment or some other embodiments of the present application, when the average pooling is global average pooling, the method may further include:
dividing the input feature map into as many parts as there are tiles, one part per tile;
dividing the input first convolution kernel correspondingly into parts of the same size as the divided input feature map; each tile performs average pooling on its divided image, and the divided pooling results output by the slices of each column of the PE array are summed.
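A minimal sketch of this split for one 2D plane (the height-wise division into 4 parts, the helper name and the NumPy modeling are illustrative assumptions, not the patent's hardware):

```python
import numpy as np

def global_avg_pool_split(plane, tile_num=4):
    """Global average pooling of one 2D plane, split across tiles: the image
    is divided into tile_num parts, every kernel part carries the same factor
    1/kernel_size, each tile multiply-adds its own part in parallel, and the
    per-tile partial results are summed at the end."""
    h, w = plane.shape
    factor = 1.0 / (h * w)                          # kernel_size = H * W
    parts = np.array_split(plane, tile_num, axis=0)
    partials = [(p * factor).sum() for p in parts]  # parallel per-tile pooling
    return sum(partials)

x = np.arange(64, dtype=float).reshape(8, 8)
assert np.isclose(global_avg_pool_split(x), x.mean())
```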
Considering that the kernel realized by mapping to depthwise convolution has the same size as the feature map and identical factors, and that the accumulator in the PE array must accumulate many more times during global pooling, in order to improve efficiency the image and the kernel can be correspondingly divided into as many parts as there are tiles, and parallel multiply-add calculation performed on the different parts of the input feature map.
As an example, in a specific implementation, a slice_column_acc module may be added to the PE array to sum the accumulation results of the tiles, so as to improve the calculation efficiency of the PE array module for global pooling.
For example, as shown in fig. 10, a newly added slice_column_acc module is connected to the output of the PE slices of each column of the PE array (one column is shown by a dashed box in the figure). When the current layer is global average pooling, the input feature map is divided into as many parts as there are tiles, the input kernel is correspondingly divided to the same size as the image parts, each PE tile performs average pooling on its divided picture, and the slice_column_acc connected to each column of the PE array sums the divided pooling results. Considering the systolic flow of the data stream inside the PE array, the slices in each column output their pooling results in a pulsating manner, so each slice_column_acc needs only one data selector and one accumulator to accumulate the pooling result data of the slices in its column in a time-shared way; the whole PE array adds only column_num adders, shortening the latency of the whole pooling process by a factor of tile_num. This pipelined operation flow further improves the operational efficiency of global pooling while saving hardware computing resources.
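A purely behavioral sketch of the time-shared accumulation in one column (the list of four partial results stands in for the pulsating outputs of the column's slices):

```python
def slice_column_acc(column_outputs):
    """One data selector plus one accumulator per PE array column: as the
    column's slices output their partial pooling results pulse by pulse,
    the selector picks the current slice and the single accumulator adds
    it, so each column needs only one adder."""
    acc = 0.0
    for partial in column_outputs:   # one pulsating slice output per cycle
        acc += partial
    return acc

# Partial pooling results of four tiles for one column (one output channel):
print(slice_column_acc([0.5, 1.0, 0.25, 0.75]))  # 2.5
```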
Aiming at the special scenario of global average pooling, the application makes some adjustments on the basis of the original PE array and sums the parallel pooling results of the divided images: after the input image is divided, multiply-add operations are performed on each partial image in parallel, and by exploiting the pulsating data output characteristic of the PE array, only one accumulator is used per column. This realizes a higher level of parallelism and improves the operational efficiency of global average pooling while reducing resource power consumption to the greatest extent.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 11 is a schematic diagram of an apparatus for reducing occupation of computing resources in an NPU according to an embodiment of the present application.
The device is used for a neural network processing unit (NPU), where the NPU comprises a systolic computing array (PE array) module for convolution operation;
the device comprises:
a convolution kernel obtaining unit 1101, configured to obtain the shared division coefficient kernel_size of average pooling, where the kernel_size is the size of the pooling kernel used for the average pooling, and to expand the kernel_size into a first convolution kernel, where the first convolution kernel is a convolution kernel for depthwise convolution calculation;
a convolution kernel input unit 1102, configured to input the first convolution kernel to the PE array module to realize the average pooling through the PE array module.
In this embodiment or some other embodiments of the present application, when expanding the kernel_size into the first convolution kernel, the convolution kernel obtaining unit may be specifically configured to:
acquire the size of the pooling kernel and the reciprocal of the kernel_size;
generate the first convolution kernel, where the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
In this embodiment or some other embodiments of the present application, specifically, the PE array module may be composed of a plurality of data computing blocks (tiles), each tile composed of a plurality of data computing slices; when the PE array module works, the input feature map is divided into as many parts as there are tiles, and each part is input to a tile to perform convolution operation independently; the slices in each column of the PE array module systolically share the same weight, the slices in each tile systolically share the same image, and the convolution result is output in a pulsating manner.
In this embodiment or some other embodiments of the present application, specifically, each slice may include an addition tree, an accumulator, and a plurality of parallel multipliers; when convolution operation is performed, each multiplier dot-multiplies the input divided two-dimensional image of a different channel with the corresponding weight, the multi-channel sum is obtained through the addition tree, and the sum is sent to the accumulator for accumulation; when the average pooling is performed, only one of the plurality of parallel multipliers is used to dot-multiply the multiplication factor w in the first convolution kernel with the image pixel point corresponding to it, and the dot-multiplication result is input directly into the accumulator so as to accumulate all dot-multiplication results over the first convolution kernel plane.
In this embodiment or some other embodiments of the present application, the apparatus may further include:
a global average pooling acceleration unit, configured to, when the average pooling is global average pooling, divide the input feature map into as many parts as there are tiles, one part per tile, divide the input first convolution kernel correspondingly into parts of the same size as the divided input feature map, have each tile perform average pooling on its divided image, and sum the divided pooling results output by the slices of each column of the PE array.
In this embodiment or some other embodiments of the present application, the apparatus may further include:
a weight selection unit, configured to, before the convolution kernel obtaining unit is triggered, directly transmit the convolution kernel output by the weight cache to the PE array module to realize convolution operation when the running layer of the NPU is a convolution layer, and to trigger the convolution kernel obtaining unit and the convolution kernel input unit when the running layer of the NPU is an average pooling layer.
The embodiment provides a method for reducing the occupation of computing resources in an NPU, where the NPU includes a systolic computing array (PE array) module for convolution operation. First, the shared division coefficient kernel_size of average pooling is obtained; then the kernel_size is expanded into a first convolution kernel, i.e., a convolution kernel for depthwise convolution calculation; and the first convolution kernel is input to the PE array module so as to realize the average pooling through the PE array module. Compared with the prior art, in which a neural network accelerator uses extra computing resources for average pooling and sets up an independent module, the scheme of the application does not set up an independent module for average pooling. Instead, the original PE array module for convolution operations in the NPU is utilized: on the basis that this module can already process convolution operations, and without reducing the output efficiency of average pooling in the NPU, average pooling is mapped onto the PE array module originally used mainly for convolution, that is, converted into one type of convolution operation, so that the original PE array module can directly compute an arbitrary average pooling layer. This avoids the occupation of hardware computing resources by an independent average pooling module, relieves the shortage of hardware computing resources, and also reduces the power consumption of the NPU.
In addition, for the special scenario of global average pooling within average pooling, the application makes some adjustments on the basis of the original PE array and sums the parallel pooling results of the divided images: after the input image is divided, multiply-add operations are performed on each partial image in parallel, and by exploiting the pulsating data output characteristic of the PE array, only one accumulator is used per column. This realizes a higher level of parallelism and improves the operational efficiency of global average pooling while reducing resource power consumption to the greatest extent.
Regarding the apparatus in the foregoing embodiments, the specific manner in which each unit/module performs its operations has been described in detail in the embodiments of the related method and is not repeated here. In the present application, the names of the above units/modules do not limit the units/modules themselves; in practical implementations, these units/modules may appear under other names, and as long as their functions are similar to those in the present application, they fall within the scope of the claims of the present application and their equivalents.
In addition, the application also provides a neural network processing unit (NPU), where the NPU comprises any one of the above devices for reducing the occupation of computing resources in the NPU and the systolic computing array (PE array) module for convolution operation.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (13)

1. A method for reducing computational resource usage in an NPU, comprising:
the method is used for a neural network processing unit (NPU), wherein the NPU comprises a systolic computing array (PE array) module used for convolution operation;
the method comprises the following steps:
obtaining the shared division coefficient kernel_size of average pooling, wherein the kernel_size is the size of a pooling kernel used for the average pooling;
expanding the kernel_size into a first convolution kernel, wherein the first convolution kernel is a convolution kernel for depthwise convolution calculation;
inputting the first convolution kernel to the PE array module to realize the average pooling through the PE array module.
2. The method of claim 1, wherein expanding the kernel_size into a first convolution kernel comprises:
acquiring the size of the pooling kernel and the reciprocal of the kernel_size;
generating the first convolution kernel, wherein the first convolution kernel is the same size as the pooling kernel and each multiplication factor w within the first convolution kernel is the reciprocal of the kernel_size.
3. The method of claim 2, wherein:
the PE array module consists of a plurality of data computing blocks (tiles), and each tile consists of a plurality of data computing slices; when the PE array module works, the input feature map is divided into as many parts as there are tiles, and each part is input to a tile to perform convolution operation independently; the slices in each column of the PE array module systolically share the same weight, the slices in each tile systolically share the same image, and the convolution result is output in a pulsating manner.
4. The method of claim 3, wherein:
each slice comprises an addition tree, an accumulator and a plurality of parallel multipliers; when convolution operation is performed, each multiplier dot-multiplies the input divided two-dimensional image of a different channel with the corresponding weight, the multi-channel sum is obtained through the addition tree, and the sum is sent to the accumulator for accumulation; when the average pooling is performed, only one of the plurality of parallel multipliers is used to dot-multiply the multiplication factor w in the first convolution kernel with the image pixel point corresponding to it, and the dot-multiplication result is input directly into the accumulator so as to accumulate all dot-multiplication results over the first convolution kernel plane.
5. The method of claim 3, wherein when the average pooling is a global average pooling, the method further comprises:
dividing the input feature map into as many parts as there are tiles, one part per tile;
dividing the input first convolution kernel correspondingly into parts of the same size as the divided input feature map; each tile performs average pooling on its divided image, and the divided pooling results output by the slices of each column of the PE array are summed.
6. The method of claim 1, wherein, before obtaining the shared division coefficient kernel_size of the average pooling, the method further comprises:
when the currently running layer of the NPU is a convolution layer, passing the convolution kernel output by the weight cache directly to the PE array module to perform the convolution operation; and
when the currently running layer of the NPU is an average pooling layer, performing the step of obtaining the shared division coefficient kernel_size of the average pooling and the subsequent steps.
7. An apparatus for reducing the occupation of computing resources in an NPU, wherein:
the apparatus is applied to a neural network processing unit (NPU), and the NPU comprises a systolic computation array (PE array) module for convolution operations;
the apparatus comprises:
a convolution kernel obtaining unit, configured to obtain a shared division coefficient kernel_size of average pooling, wherein kernel_size is the size of the pooling kernel used for the average pooling, and to expand kernel_size into a first convolution kernel, wherein the first convolution kernel is a convolution kernel for depthwise convolution computation; and
a convolution kernel input unit, configured to input the first convolution kernel into the PE array module, so that the average pooling is implemented by the PE array module.
8. The apparatus of claim 7, wherein, when expanding kernel_size into the first convolution kernel, the convolution kernel obtaining unit is specifically configured to:
obtain the size of the pooling kernel and the reciprocal of kernel_size; and
generate the first convolution kernel, wherein the first convolution kernel has the same size as the pooling kernel, and each multiplication factor w in the first convolution kernel equals the reciprocal of kernel_size.
9. The apparatus of claim 8, wherein:
the PE array module consists of a plurality of data computation blocks (tiles), and each tile consists of a plurality of data computation slices (slices); when the PE array module operates, the input feature map is divided into as many parts as there are tiles, and the parts are input to the tiles to perform convolution operations independently; the slices in each column of the PE array module systolically share the same weights, the slices in each tile systolically share the same image, and the convolution results are output in a systolic manner.
10. The apparatus of claim 9, wherein:
each slice comprises an addition tree, an accumulator and a plurality of parallel multipliers; during a convolution operation, each multiplier dot-multiplies the weights with the divided two-dimensional input image of a different channel, the addition tree produces a multi-channel sum, and the sum is sent to the accumulator for accumulation; during the average pooling, only one of the plurality of parallel multipliers is used to dot-multiply the multiplication factor w in the first convolution kernel with the image pixel corresponding to w, and the dot-product result is input directly into the accumulator, so that all dot-product results over the first convolution kernel plane are accumulated.
11. The apparatus of claim 9, further comprising:
a global average pooling acceleration unit, configured to, when the average pooling is global average pooling, divide the input feature map into as many parts as there are tiles, one part per tile, divide the input first convolution kernel to the same size as the divided input feature map, perform average pooling on the divided images in each tile, and sum the partial pooled results output by the slices of each column of the PE array.
12. The apparatus of claim 7, further comprising:
a weight selection unit, configured to, before the convolution kernel obtaining unit is triggered, pass the convolution kernel output by the weight cache directly to the PE array module to perform the convolution operation when the currently running layer of the NPU is a convolution layer, and to trigger the convolution kernel obtaining unit and the convolution kernel input unit when the currently running layer of the NPU is an average pooling layer.
13. A neural network processing unit (NPU), comprising the apparatus for reducing the occupation of computing resources in an NPU according to any one of claims 7 to 12, and the systolic computation array (PE array) module for convolution operations.
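
For illustration only, and not as part of the claims: the expansion of claims 1 and 2 can be sketched numerically. The sketch below assumes kernel_size denotes the number of elements in the pooling window, so that a depthwise convolution whose every multiplication factor w equals 1/kernel_size reproduces average pooling exactly; the function names are illustrative, not from the specification.

```python
# Minimal sketch of claims 1-2: average pooling expressed as a depthwise
# convolution with an all-(1/kernel_size) kernel. Assumes kernel_size is the
# element count of the pooling window; names are illustrative.
import numpy as np

def expand_to_conv_kernel(pool_h, pool_w):
    kernel_size = pool_h * pool_w          # shared division coefficient
    w = 1.0 / kernel_size                  # reciprocal of kernel_size
    return np.full((pool_h, pool_w), w)    # the "first convolution kernel"

def depthwise_avg_pool(feature_map, pool_h, pool_w):
    # Depthwise: one kernel per channel, stride equal to the pooling window.
    c, h, width = feature_map.shape
    kernel = expand_to_conv_kernel(pool_h, pool_w)
    out = np.zeros((c, h // pool_h, width // pool_w))
    for ch in range(c):
        for i in range(0, h - pool_h + 1, pool_h):
            for j in range(0, width - pool_w + 1, pool_w):
                window = feature_map[ch, i:i + pool_h, j:j + pool_w]
                out[ch, i // pool_h, j // pool_w] = np.sum(window * kernel)
    return out

fm = np.random.rand(3, 8, 8)
expected = fm.reshape(3, 4, 2, 4, 2).mean(axis=(2, 4))  # reference 2x2 pooling
assert np.allclose(depthwise_avg_pool(fm, 2, 2), expected)
```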
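Similarly hedged, a behavioural sketch of the tile/slice data partitioning described in claims 3 and 9: the input feature map is divided into as many parts as there are tiles, each tile convolves its part independently, and every column of slices applies the same kernel. Systolic timing is not modelled, and the partitioning axis is an assumption; only the data movement is shown.

```python
# Dataflow sketch of claims 3/9 (data partitioning only, no systolic timing).
import numpy as np

def conv2d_valid(image, kernel):
    # Plain "valid" 2-D convolution: dot products of kernel-sized windows.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def pe_array_conv(feature_map, column_kernels, num_tiles):
    # Divide the input feature map into as many parts as there are tiles
    # (here: evenly along the height, an assumption of this sketch).
    parts = np.split(feature_map, num_tiles, axis=0)
    # Each tile works on its own image share independently, while every
    # column of slices shares the same kernel (weights).
    return [[conv2d_valid(part, k) for k in column_kernels] for part in parts]
```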
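A minimal behavioural model of one slice from claims 4 and 10, assuming a slice with N parallel multipliers feeding an addition tree and an accumulator; in average-pooling mode a single multiplier writes straight to the accumulator. This models the dataflow only, not the hardware design.

```python
# Behavioural sketch of a "slice" per claims 4/10; class and method names
# are illustrative assumptions, not from the specification.
class Slice:
    def __init__(self, num_multipliers=8):
        self.num_multipliers = num_multipliers
        self.accumulator = 0.0

    def conv_step(self, pixels, weights):
        """Convolution mode: each multiplier dot-multiplies one channel's
        pixel with its weight; the addition tree sums across channels and
        the multi-channel sum is sent to the accumulator."""
        products = [p * w for p, w in zip(pixels, weights)]  # parallel multipliers
        self.accumulator += sum(products)                    # addition tree + accumulate

    def avg_pool_step(self, pixel, w):
        """Average-pooling mode: only one multiplier is active; its product
        bypasses the addition tree and goes straight to the accumulator,
        accumulating all dot products over the first-convolution-kernel plane."""
        self.accumulator += pixel * w
```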
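A sketch of the global-average-pooling split of claims 5 and 11, under the assumption that the feature map (and the first convolution kernel, which in the global case matches the feature-map size) is divided evenly among the tiles along the height, with the per-tile partial pooled results summed afterwards, as the PE-array columns would do.

```python
# Global average pooling split across tiles, per claims 5/11 (assumed even
# height-wise split; names are illustrative).
import numpy as np

def global_avg_pool_tiled(feature_map, num_tiles):
    c, h, w = feature_map.shape
    kernel_w = 1.0 / (h * w)                          # reciprocal of the full kernel_size
    parts = np.split(feature_map, num_tiles, axis=1)  # one feature-map share per tile
    # Each tile pools its share with its share of the first convolution kernel...
    partial = [np.sum(p * kernel_w, axis=(1, 2)) for p in parts]
    # ...and the partial pooled results of the columns are summed.
    return np.sum(partial, axis=0)                    # shape (c,)

fm = np.random.rand(4, 8, 8)
assert np.allclose(global_avg_pool_tiled(fm, 4), fm.mean(axis=(1, 2)))
```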
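Finally, the weight selection of claims 6 and 12 amounts to a dispatch on the type of the currently running layer, sketched below; the layer descriptor fields and the weight_cache interface are assumptions, and expand_to_conv_kernel is the illustrative helper from the first sketch above.

```python
# Weight-selection sketch for claims 6/12; all interfaces are hypothetical.
def select_weights(layer, weight_cache):
    """Route weights to the PE array module depending on the running layer."""
    if layer.type == "conv":
        # Convolution layer: pass the cached convolution kernel straight through.
        return weight_cache.fetch(layer)
    if layer.type == "avg_pool":
        # Average-pooling layer: expand the shared division coefficient
        # kernel_size into the first convolution kernel instead.
        return expand_to_conv_kernel(layer.pool_h, layer.pool_w)
    raise ValueError(f"unsupported layer type: {layer.type}")
```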
CN202011114887.9A 2020-10-19 2020-10-19 Method and device for reducing occupation of computing resources in NPU Active CN111931927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011114887.9A CN111931927B (en) 2020-10-19 2020-10-19 Method and device for reducing occupation of computing resources in NPU

Publications (2)

Publication Number Publication Date
CN111931927A true CN111931927A (en) 2020-11-13
CN111931927B CN111931927B (en) 2021-02-19

Family

ID=73333737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011114887.9A Active CN111931927B (en) 2020-10-19 2020-10-19 Method and device for reducing occupation of computing resources in NPU

Country Status (1)

Country Link
CN (1) CN111931927B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388917A (en) * 2018-02-26 2018-08-10 Northeastern University Hyperspectral image classification method based on an improved deep learning model
CN108537331A (en) * 2018-04-04 2018-09-14 Tsinghua University Reconfigurable convolutional neural network acceleration circuit based on asynchronous logic
CN109002885A (en) * 2018-07-24 2018-12-14 Jinan Inspur Hi-Tech Investment and Development Co., Ltd. Convolutional neural network pooling unit and pooling calculation method
CN109447237A (en) * 2018-09-05 2019-03-08 Zhejiang Changxing Descartes Technology Co., Ltd. Pooling calculation method, electronic device and storage medium based on statistical outliers
CN110097174A (en) * 2019-04-22 2019-08-06 Xi'an Jiaotong University Convolutional neural network implementation method, system and device based on FPGA and row-output priority
CN110738308A (en) * 2019-09-23 2020-01-31 Chen Xiaobai Neural network accelerator

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QIU, Yue: "Research and Implementation of FPGA-based Convolutional Neural Network Acceleration Methods", China Master's Theses Full-text Database *
LIU, Wanjun: "Research on the Learning Performance of Convolutional Neural Networks with Different Pooling Models", Journal of Image and Graphics *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361699A (en) * 2021-07-16 2021-09-07 Arm Technology (China) Co., Ltd. Multiplication circuit, system on chip and electronic device
WO2024119952A1 (en) * 2022-12-06 2024-06-13 Beijing Horizon Information Technology Co., Ltd. Neural network model compiling method and apparatus, storage medium and electronic device
CN116029332A (en) * 2023-02-22 2023-04-28 Nanjing University On-chip fine-tuning method and device based on LSTM network
CN116029332B (en) * 2023-02-22 2023-08-22 Nanjing University On-chip fine-tuning method and device based on LSTM network

Also Published As

Publication number Publication date
CN111931927B (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN111931927B (en) Method and device for reducing occupation of computing resources in NPU
CN108765247B (en) Image processing method, device, storage medium and equipment
EP3746945B1 (en) Improving performance of neural network arrays
CN111144329B (en) Multi-label-based lightweight rapid crowd counting method
US10394929B2 (en) Adaptive execution engine for convolution computing systems
CN108376387B (en) Image deblurring method based on aggregation expansion convolution network
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN112163601B (en) Image classification method, system, computer device and storage medium
CN111915660B (en) Binocular disparity matching method and system based on shared features and attention up-sampling
CN112070664B (en) Image processing method and device
WO2023146523A1 (en) Event-based extraction of features in a convolutional spiking neural network
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN113326930B (en) Data processing method, neural network training method, related device and equipment
CN113034391A (en) Multi-mode fusion underwater image enhancement method, system and application
CN115660955A (en) Super-resolution reconstruction model, method, equipment and storage medium for efficient multi-attention feature fusion
CN115272670A (en) SAR image ship instance segmentation method based on mask attention interaction
CN109740619B (en) Neural network terminal operation method and device for target recognition
CN113239949A (en) Data reconstruction method based on 1D packet convolutional neural network
CN114846382A (en) Microscope and method with convolutional neural network implementation
CN115861062A (en) Multi-scale learning wavelet attention mechanism network and image super-resolution reconstruction method
CN112529064B (en) Efficient real-time semantic segmentation method
CN113657587A (en) FPGA-based deformable convolution acceleration method and device
CN114154621A (en) Convolutional neural network image processing method and device based on FPGA
CN116888605A (en) Operation method, training method and device of neural network model
CN111914996A (en) Method for extracting data features and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant