CN111914999B - Method and equipment for reducing calculation bandwidth of neural network accelerator


Info

Publication number: CN111914999B
Application number: CN202010753645.8A
Authority: CN (China)
Prior art keywords: output, characteristic data, output point, neural network, network accelerator
Legal status: Active (granted)
Other versions: CN111914999A
Other languages: Chinese (zh)
Inventor: 尹昆 (Yin Kun)
Current Assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority/Filing date: 2020-07-30 (CN202010753645.8A)
Publication of CN111914999A: 2020-11-10
Publication of CN111914999B (grant): 2024-04-19


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a method and a device for reducing the computation bandwidth of a neural network accelerator. For the output points in each column, the feature data required to compute the first output point are flattened to one dimension and transferred from the external memory to the fast memory of the neural network accelerator, where the output data corresponding to the first output point are computed. For each subsequent output point, the multiplexed (reusable) feature data are moved to the head of the fast memory, only the remaining feature data are flattened to one dimension and transferred from the external memory to the rear part of the fast memory, and the output data corresponding to the current output point are computed, until the output data of the last output point in the column are obtained.

Description

Method and equipment for reducing calculation bandwidth of neural network accelerator
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and apparatus for reducing the computational bandwidth of a neural network accelerator.
Background
In most existing neural network systems, the amount of feature data and convolution kernel data is huge. These data are generally stored in external memory, whose access speed is low for cost reasons. During convolution computation, a main control core (such as an ARM core) must transfer the feature data and the convolution kernels required for the computation from the slow external memory to the fast internal memory of the neural network accelerator through DMA (Direct Memory Access); the accelerator then reads the data and performs the convolution computation.
In a typical computation, the feature data required by the first output point are transferred, in order, from the external memory to the fast memory of the neural network accelerator, and the first output data are computed; when the second output point is needed, the data it requires are transferred from the external memory to the fast memory of the neural network accelerator and the computation is performed.
That is, when computing the second output data point, the required feature data must again be carried from the external memory to the fast memory of the accelerator before the computation can proceed. Because the feature data are huge, this makes the data bandwidth between the neural network accelerator and the external memory large.
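To make the bandwidth problem concrete, the following minimal sketch (our illustration of the conventional scheme described above, not code from the patent; all sizes and names are assumptions) models the per-output-point transfers for one column of outputs with kernel size 3 and stride 2:

    import numpy as np

    H, W, C = 9, 3, 4                  # Height x Width x Channel (illustrative)
    K, S = 3, 2                        # kernel size and stride
    feat = np.zeros(H * W * C)         # flattened feature map in slow memory

    external_reads = 0
    for h0 in range(0, H - K + 1, S):  # naive scheme: refetch the full window
        window = np.empty(K * K * C)
        for n, (i, j) in enumerate([(i, j) for i in range(K) for j in range(K)]):
            off = (h0 + i) * W * C + j * C
            window[n * C:(n + 1) * C] = feat[off:off + C]  # external access
            external_reads += C
    print(external_reads)              # 4 windows x 9 blocks x 4 channels = 144

Of these 144 external reads, 36 are redundant: rows 2, 4 and 6 are each fetched twice, so every output point after the first re-reads 3 of its 9 blocks. This redundancy is exactly what the invention removes.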
Disclosure of Invention
To address these deficiencies in the prior art, the invention provides a method and a device for reducing the computation bandwidth of a neural network accelerator. The data reusability introduced by the convolution stride is fully exploited, so that the amount of data the neural network accelerator reads from the slow external memory is reduced and the efficiency of the neural network accelerator is improved.
Specifically, the present invention proposes the following specific embodiments:
An embodiment of the invention provides a method for reducing the computation bandwidth of a neural network accelerator. The method is applied to a feature map comprising a plurality of output points and a plurality of feature data, wherein the output points are distributed over different columns of the feature map and the output points in each column are arranged from top to bottom along the Height direction; each output point corresponds to feature data whose position distribution matches that of the output point. The method comprises the following steps:
for the output points in each column, flattening the feature data required to compute the first output point to one dimension, transferring them from the external memory to the fast memory of the neural network accelerator, and computing the output data corresponding to the first output point;
moving the multiplexed feature data to the head of the fast memory, flattening the remaining feature data to one dimension, transferring them from the external memory to the rear part of the fast memory, and computing the output data corresponding to the current output point, until the output data of the last output point in the column are obtained;
wherein the multiplexed feature data are the part shared by the feature data of the previously computed output point and of the current output point, the previous output point and the current output point being adjacent; and the multiplexed feature data and the remaining feature data together form the feature data required to compute the current output point.
In a specific embodiment, before the feature data required to compute the first output point are flattened to one dimension and transferred from the external memory to the fast memory of the neural network accelerator, the method further includes:
flattening the convolution kernel of the neural network system to one dimension and transferring it from the external memory to the fast memory of the neural network accelerator.
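One-dimensional flattening here simply means laying a tensor stored in [Height, Width, Channel] order out as one contiguous vector before the DMA transfer. A trivial illustration (not the patent's code; shapes are assumptions):

    import numpy as np

    kernel = np.random.rand(3, 3, 4).astype(np.float32)  # K x K x Channel
    kernel_1d = kernel.reshape(-1)   # flattened to one dimension for transfer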
In a specific embodiment, the size of the feature data corresponding to each output point is consistent with the size of the convolution kernel.
In a specific embodiment, the output data is calculated based on the characteristic data and the convolution kernel.
In a specific embodiment,
the positions of the feature data corresponding to adjacent output points in each column differ by a preset stride in the feature map;
and the feature data at the positions newly covered by the preset stride are the remaining feature data.
An embodiment of the invention further provides a device for reducing the computation bandwidth of a neural network accelerator, applied to a feature map comprising a plurality of output points and a plurality of feature data, wherein the output points are distributed over different columns of the feature map and the output points in each column are arranged from top to bottom along the Height direction; each output point corresponds to feature data whose position distribution matches that of the output point. The device includes:
a first processing module, configured to, for the output points in each column, flatten the feature data required to compute the first output point to one dimension, transfer them from the external memory to the fast memory of the neural network accelerator, and compute the output data corresponding to the first output point;
a second processing module, configured to move the multiplexed feature data to the head of the fast memory, flatten the remaining feature data to one dimension, transfer them from the external memory to the rear part of the fast memory, and compute the output data corresponding to the current output point, until the output data of the last output point in the column are obtained;
wherein the multiplexed feature data are the part shared by the feature data of the previously computed output point and of the current output point, the previous output point and the current output point being adjacent; and the multiplexed feature data and the remaining feature data together form the feature data required to compute the current output point.
In a specific embodiment, the device further comprises:
a convolution kernel module, configured to flatten the convolution kernel of the neural network system to one dimension and transfer it from the external memory to the fast memory of the neural network accelerator.
In a specific embodiment, the size of the feature data corresponding to each output point is consistent with the size of the convolution kernel.
In a specific embodiment, the output data is calculated based on the characteristic data and the convolution kernel.
In a specific embodiment,
the positions of the feature data corresponding to adjacent output points in each column differ by a preset stride in the feature map;
and the feature data at the positions newly covered by the preset stride are the remaining feature data.
Compared with the prior art, the invention optimizes the computation order of the convolutional neural network so as to fully exploit the data reusability introduced by the stride, reducing the amount of data the neural network accelerator reads from the slow external memory and thereby improving the efficiency of the neural network accelerator.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; other related drawings may be derived from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method for reducing the computational bandwidth of a neural network accelerator according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a method for reducing the computational bandwidth of a neural network accelerator according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating characteristic data handling in a method for reducing computing bandwidth of a neural network accelerator according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of convolutional kernel handling in a method for reducing computational bandwidth of a neural network accelerator according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the output result in a method for reducing the computational bandwidth of a neural network accelerator according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an apparatus for reducing the computational bandwidth of a neural network accelerator according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an apparatus for reducing the calculation bandwidth of a neural network accelerator according to an embodiment of the present invention.
Detailed Description
Hereinafter, various embodiments of the present disclosure will be described more fully. The present disclosure is capable of various embodiments and of modifications and variations therein. However, it should be understood that there is no intention to limit the various embodiments of the disclosure to the specific embodiments disclosed herein; rather, the disclosure is to be interpreted as covering all modifications, equivalents, and/or alternatives falling within the spirit and scope of the various embodiments of the disclosure.
The terminology used in the various embodiments of the disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of those embodiments. As used herein, the singular is intended to include the plural as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of this disclosure belong. Terms such as those defined in commonly used dictionaries will be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and will not be interpreted in an idealized or overly formal sense unless expressly so defined in the various embodiments of the disclosure.
Example 1
Embodiment 1 of the invention discloses a method for reducing the computation bandwidth of a neural network accelerator. The method is applied to a feature map comprising a plurality of output points and a plurality of feature data, wherein the output points are distributed over different columns of the feature map and the output points in each column are arranged from top to bottom along the Height direction; each output point corresponds to feature data whose position distribution matches that of the output point. As shown in figs. 1-5, the method comprises the following steps:
Step 101: for the output points in each column, flatten the feature data required to compute the first output point to one dimension, transfer them from the external memory to the fast memory of the neural network accelerator, and compute the output data corresponding to the first output point.
Specifically, before step 101, as shown in fig. 4, the method further includes: flattening the convolution kernel of the neural network system to one dimension and transferring it from the external memory to the fast memory of the neural network accelerator. In a specific embodiment, the size of the feature data corresponding to each output point is consistent with the size of the convolution kernel, and the output data are computed based on the feature data and the convolution kernel.
Specifically, the feature data and the convolution kernel are generally stored in [Height, Width, Channel (depth)] format (HWC for short). The feature map is shown in figs. 2 and 3. Taking convolution kernel size = 3 and stride = 2 as an example, the feature data corresponding to each output point cover a 3×3 window in the HW plane. With P denoting the offset of one row and C the offset of one pixel along the Width direction in the flattened buffer, the first output point in column 1 thus corresponds to the channel data at positions 0, C, C*2, P+0, P+C, P+C*2, P*2+0, P*2+C and P*2+C*2 in fig. 3. These data are flattened to one dimension, transferred from the external memory to the fast internal memory of the neural network accelerator, and used for the computation; the resulting output, shown in fig. 5, is the data at position 0.
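How these flattened offsets arise in an HWC buffer can be sketched as follows (our illustration, not the patent's code; the window position and sizes are assumptions):

    def window_offsets(h0, w0, k, width, channels):
        # Start offsets of the channel blocks a k x k window reads from a
        # buffer flattened in [Height, Width, Channel] order.
        P = width * channels  # offset of one full row
        C = channels          # offset of one step along Width
        return [(h0 + i) * P + (w0 + j) * C for i in range(k) for j in range(k)]

    # First output point of column 1 with kernel size 3 (cf. fig. 3):
    # [0, C, C*2, P, P+C, P+C*2, P*2, P*2+C, P*2+C*2]
    print(window_offsets(0, 0, 3, width=8, channels=4))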
Step 102: move the multiplexed feature data to the head of the fast memory, flatten the remaining feature data to one dimension, transfer them from the external memory to the rear part of the fast memory, and compute the output data corresponding to the current output point, until the output data of the last output point in the column are obtained;
wherein the multiplexed feature data are the part shared by the feature data of the previously computed output point and of the current output point, the previous output point and the current output point being adjacent; and the multiplexed feature data and the remaining feature data together form the feature data required to compute the current output point.
In a specific embodiment, the positions of the feature data corresponding to adjacent output points in each column differ by a preset stride in the feature map, and the feature data at the positions newly covered by the preset stride are the remaining feature data.
After the output data of the first output point have been computed, the second output point is computed. The second output point is adjacent to the first along the Height direction, so part of the data can be multiplexed. Still taking fig. 3 as an example, the second output point in column 1 corresponds to the channel data at positions P*2+0, P*2+C, P*2+C*2, P*3+0, P*3+C, P*3+C*2, P*4+0, P*4+C and P*4+C*2. The part shared with the first output point is the channel data at P*2+0, P*2+C and P*2+C*2. Therefore, when computing the second output point, the data at P*2+0, P*2+C and P*2+C*2 are moved to the head of the fast memory, and only the remaining feature data (the channel data at P*3+0, P*3+C, P*3+C*2, P*4+0, P*4+C and P*4+C*2) are transferred from the external memory to the rear part of the fast memory. The convolution kernel, the multiplexed feature data and the remaining feature data are then combined to compute the second output point; the resulting output, shown in fig. 5, is the data at position P+0.
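The buffer update itself can be pictured with the following sketch (our illustration; the patent does not prescribe an implementation, and all names are assumptions). The reusable tail of the previous window is moved to the head of the fast buffer with a cheap on-chip copy, and only the new blocks are fetched from external memory:

    import numpy as np

    def advance_window(fast, feature_1d, new_block_offsets, reused_blocks, C):
        # Move the multiplexed data (the last `reused_blocks` channel blocks
        # of the previous window) to the head of the fast buffer, then fetch
        # only the remaining blocks from the slow flattened feature buffer.
        n = reused_blocks * C
        fast[:n] = fast[-n:].copy()          # on-chip move, no external traffic
        for i, off in enumerate(new_block_offsets):
            dst = n + i * C
            fast[dst:dst + C] = feature_1d[off:off + C]  # external reads
        return fast

For the fig. 3 example (kernel 3, stride 2), reused_blocks is 3 (the three blocks of row P*2) and new_block_offsets lists the six blocks of rows P*3 and P*4.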
The third and fourth output points of column 1 are computed in the same way until all output points in the column have been computed. The output points of column 2 are then computed; the specific computation for column 2 is the same as for column 1, i.e. steps 101-102 are performed, and finally all output points of all columns are computed.
Example 2
Embodiment 2 of the invention further discloses a method for reducing the computation bandwidth of a neural network accelerator; as shown in figs. 2-5, the method includes the following steps:
Step 1: flatten the convolution kernel to one dimension and transfer it from the external memory to the fast memory of the neural network accelerator;
Step 2: flatten the feature data required to compute the first output point in fig. 3 (the channel data at positions 0, C, C*2, P+0, P+C, P+C*2, P*2+0, P*2+C, P*2+C*2) to one dimension and transfer them from the external memory to the fast memory of the neural network accelerator;
Step 3: compute the matrix convolution as a vector inner product (or by another calculation method) to obtain the first output data;
Step 4A: the part of the feature data reusable between the first and the second output point (the channel data at P*2+0, P*2+C and P*2+C*2 in fig. 3) is moved by the neural network accelerator within the fast memory itself to the head of the internal fast memory, which is far more efficient than accessing the external memory;
Step 4B: the remaining feature data required to compute the second output point (the channel data at P*3+0, P*3+C, P*3+C*2, P*4+0, P*4+C and P*4+C*2 in fig. 3), of size 2×3×Channel, are flattened to one dimension and transferred from the external memory to the second half of the fast memory of the neural network accelerator;
Step 5: compute the matrix convolution as a vector inner product (or by another calculation method) to obtain the second output;
The above operations are repeated for the other output points until the whole feature map has been computed; a compact sketch of this loop follows.
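Putting the steps together, here is an end-to-end model of this embodiment for one column (our Python sketch under the stated assumptions: kernel 3, stride 2, HWC layout; the DMA transfers are modelled as array slicing and the inner product as a dot product, so sizes and names are illustrative only):

    import numpy as np

    H, W, Cc = 9, 3, 4                 # feature map: Height x Width x Channel
    K, S = 3, 2                        # kernel size and stride
    P, C = W * Cc, Cc                  # row offset and per-pixel offset
    feat = np.arange(H * W * Cc, dtype=np.float32)   # flattened HWC features
    kernel = np.ones(K * K * Cc, dtype=np.float32)   # flattened kernel (step 1)

    fast = np.empty(K * K * Cc, dtype=np.float32)    # fast (on-chip) memory
    outputs = []
    for h0 in range(0, H - K + 1, S):                # one column of outputs
        if h0 == 0:                                  # step 2: full window load
            offs = [i * P + j * C for i in range(K) for j in range(K)]
            base = 0
        else:                                        # steps 4A/4B: reuse overlap
            keep = (K - S) * K * Cc                  # multiplexed tail size
            fast[:keep] = fast[-keep:].copy()        # move to head (on-chip)
            offs = [(h0 + i) * P + j * C for i in range(K - S, K) for j in range(K)]
            base = keep
        for i, off in enumerate(offs):               # external reads
            fast[base + i * Cc : base + (i + 1) * Cc] = feat[off:off + Cc]
        outputs.append(float(fast @ kernel))         # steps 3/5: inner product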
With this method, the interaction between the neural network accelerator and the external memory for each subsequent output point is reduced by 1 - (2×3×Channel)/(3×3×Channel) = 1/3. The example above computes the output data points one by one, but the practical application of the method is not limited to this case; it is equally applicable when output data points are computed in parallel or serially in groups of several points.
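More generally (our extrapolation of the example, not a formula stated in the patent), for kernel size k and stride s with s < k, each subsequent output point in a column needs only s new rows of its k-row window, so the external traffic per point shrinks by (k - s)/k:

    def bandwidth_reduction(k: int, s: int) -> float:
        # Fraction of external reads saved per subsequent output point in a
        # column: only s*k of the k*k channel blocks must be fetched anew.
        return 1 - (s * k) / (k * k)   # equals (k - s) / k

    print(bandwidth_reduction(3, 2))   # 0.333..., the 1/3 of the example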
Example 3
Embodiment 3 of the invention further discloses a device for reducing the computation bandwidth of a neural network accelerator, applied to a feature map comprising a plurality of output points and a plurality of feature data, wherein the output points are distributed over different columns of the feature map and the output points in each column are arranged from top to bottom along the Height direction; each output point corresponds to feature data whose position distribution matches that of the output point. As shown in fig. 6, the device includes:
a first processing module 201, configured to, for the output points in each column, flatten the feature data required to compute the first output point to one dimension, transfer them from the external memory to the fast memory of the neural network accelerator, and compute the output data corresponding to the first output point;
a second processing module 202, configured to move the multiplexed feature data to the head of the fast memory, flatten the remaining feature data to one dimension, transfer them from the external memory to the rear part of the fast memory, and compute the output data corresponding to the current output point, until the output data of the last output point in the column are obtained;
wherein the multiplexed feature data are the part shared by the feature data of the previously computed output point and of the current output point, the previous output point and the current output point being adjacent; and the multiplexed feature data and the remaining feature data together form the feature data required to compute the current output point.
In a specific embodiment, as shown in fig. 7, the device further comprises:
a convolution kernel module 203, configured to flatten the convolution kernel of the neural network system to one dimension and transfer it from the external memory to the fast memory of the neural network accelerator.
In a specific embodiment, the size of the feature data corresponding to each output point is consistent with the size of the convolution kernel.
In a specific embodiment, the output data is calculated based on the characteristic data and the convolution kernel.
In a specific embodiment,
the positions of the feature data corresponding to adjacent output points in each column differ by a preset stride in the feature map;
and the feature data at the positions newly covered by the preset stride are the remaining feature data.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of a preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the invention.
Those skilled in the art will appreciate that the modules of an apparatus in an implementation scenario may be distributed among the apparatus of that scenario as described, or may be relocated, with corresponding changes, in one or more apparatuses different from the present scenario. The modules of the implementation scenario may be combined into one module, or further split into a plurality of sub-modules.
The above sequence numbers of the invention are for description only and do not represent the merits of the implementation scenarios.
The foregoing disclosure is merely illustrative of some embodiments of the invention, and the invention is not limited thereto, as modifications may be made by those skilled in the art without departing from the scope of the invention.

Claims (10)

1. A method for reducing the computation bandwidth of a neural network accelerator, characterized in that the method is applied to a feature map comprising a plurality of output points and a plurality of feature data, wherein the output points are distributed over different columns of the feature map and the output points in each column are arranged from top to bottom along the Height direction; each output point corresponds to feature data whose position distribution matches that of the output point; the method comprises the following steps:
for the output points in each column, flattening the feature data required to compute the first output point to one dimension, transferring them from the external memory to the fast memory of the neural network accelerator, and computing the output data corresponding to the first output point;
after the output data of the first output point have been computed, computing a second output point, the second output point being adjacent to the first output point along the Height direction and having a multiplexed portion; moving the multiplexed feature data to the head of the fast memory, flattening the remaining feature data to one dimension, transferring them from the external memory to the rear part of the fast memory, and computing the output data corresponding to the current output point, until the output data of the last output point in the column are obtained;
wherein the multiplexed feature data are the part shared by the feature data of the previously computed output point and of the current output point, the previous output point and the current output point being adjacent; and the multiplexed feature data and the remaining feature data together form the feature data required to compute the current output point.
2. The method for reducing the computation bandwidth of a neural network accelerator of claim 1, characterized in that before the feature data required to compute the first output point are flattened to one dimension and transferred from the external memory to the fast memory of the neural network accelerator, the method further comprises:
flattening the convolution kernel of the neural network system to one dimension and transferring it from the external memory to the fast memory of the neural network accelerator.
3. The method for reducing the computation bandwidth of a neural network accelerator of claim 2, characterized in that the size of the feature data corresponding to each output point is consistent with the size of the convolution kernel.
4. The method for reducing the computation bandwidth of a neural network accelerator of claim 2, characterized in that the output data are computed based on the feature data and the convolution kernel.
5. The method for reducing the computation bandwidth of a neural network accelerator of claim 1, characterized in that
the positions of the feature data corresponding to adjacent output points in each column differ by a preset stride in the feature map;
and the feature data at the positions newly covered by the preset stride are the remaining feature data.
6. A device for reducing the computation bandwidth of a neural network accelerator, characterized in that the device is applied to a feature map comprising a plurality of output points and a plurality of feature data, wherein the output points are distributed over different columns of the feature map and the output points in each column are arranged from top to bottom along the Height direction; each output point corresponds to feature data whose position distribution matches that of the output point; the device comprises:
a first processing module, configured to, for the output points in each column, flatten the feature data required to compute the first output point to one dimension, transfer them from the external memory to the fast memory of the neural network accelerator, and compute the output data corresponding to the first output point;
a second processing module, configured to compute a second output point after the output data of the first output point have been computed, the second output point being adjacent to the first output point along the Height direction and having a multiplexed portion; and to move the multiplexed feature data to the head of the fast memory, flatten the remaining feature data to one dimension, transfer them from the external memory to the rear part of the fast memory, and compute the output data corresponding to the current output point, until the output data of the last output point in the column are obtained;
wherein the multiplexed feature data are the part shared by the feature data of the previously computed output point and of the current output point, the previous output point and the current output point being adjacent; and the multiplexed feature data and the remaining feature data together form the feature data required to compute the current output point.
7. The device for reducing the computation bandwidth of a neural network accelerator of claim 6, further comprising:
a convolution kernel module, configured to flatten the convolution kernel of the neural network system to one dimension and transfer it from the external memory to the fast memory of the neural network accelerator.
8. The device for reducing the computation bandwidth of a neural network accelerator of claim 7, characterized in that the size of the feature data corresponding to each output point is consistent with the size of the convolution kernel.
9. The device for reducing the computation bandwidth of a neural network accelerator of claim 7, characterized in that the output data are computed based on the feature data and the convolution kernel.
10. The device for reducing the computation bandwidth of a neural network accelerator of claim 6, characterized in that
the positions of the feature data corresponding to adjacent output points in each column differ by a preset stride in the feature map;
and the feature data at the positions newly covered by the preset stride are the remaining feature data.
CN202010753645.8A (filed 2020-07-30, priority date 2020-07-30) - Method and equipment for reducing calculation bandwidth of neural network accelerator - Active - granted as CN111914999B

Priority Applications (1)

Application Number: CN202010753645.8A
Priority Date: 2020-07-30
Filing Date: 2020-07-30
Title: Method and equipment for reducing calculation bandwidth of neural network accelerator

Applications Claiming Priority (1)

Application Number: CN202010753645.8A
Priority Date: 2020-07-30
Filing Date: 2020-07-30
Title: Method and equipment for reducing calculation bandwidth of neural network accelerator

Publications (2)

Publication Number - Publication Date
CN111914999A - 2020-11-10
CN111914999B - 2024-04-19

Family

ID=73286459

Family Applications (1)

Application Number - Title - Priority Date - Filing Date
CN202010753645.8A (Active, granted as CN111914999B) - Method and equipment for reducing calculation bandwidth of neural network accelerator - 2020-07-30 - 2020-07-30

Country Status (1)

Country - Publication
CN - CN111914999B


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11227214B2 (en) * 2017-11-14 2022-01-18 Advanced Micro Devices, Inc. Memory bandwidth reduction techniques for low power convolutional neural network inference applications

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321064A (en) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 Computing platform realization method and system for neural network
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110705702A (en) * 2019-09-29 2020-01-17 东南大学 Dynamic extensible convolutional neural network accelerator

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of an FPGA-based Convolutional Neural Network Accelerator; Qiu Yue; Ma Wentao; Chai Zhilei; Microelectronics & Computer; 2018-08-05 (08); full text *
Design of a Deep Neural Network Accelerator Supporting Sparse Convolution; Zhou Guofei; Electronic Technology & Software Engineering; 2020-02-15 (04); full text *

Also Published As

Publication number Publication date
CN111914999A (en) 2020-11-10

Similar Documents

Publication - Title
KR102523263B1 (en) Systems and methods for hardware-based pooling
US10977001B2 (en) Asymmetric quantization of multiple-and-accumulate operations in deep learning processing
EP3816824A1 (en) High throughput matrix processor with support for concurrently processing multiple matrices
EP3407203A2 (en) Statically schedulable feed and drain structure for systolic array architecture
KR20180060149A (en) Convolution processing apparatus and method
CN112232426B (en) Training method, device and equipment of target detection model and readable storage medium
US11537865B2 (en) Mapping convolution to a channel convolution engine
CN112183295A (en) Pedestrian re-identification method and device, computer equipment and storage medium
US20230068450A1 (en) Method and apparatus for processing sparse data
EP3835949A1 (en) Hardware accelerated matrix manipulation operations using processor instructions
CN112215345B (en) Convolutional neural network operation method and device based on Tenscorore
CN111914213B (en) Sparse matrix vector multiplication operation time prediction method and system
CN111860276A (en) Human body key point detection method, device, network equipment and storage medium
CN111914999B (en) Method and equipment for reducing calculation bandwidth of neural network accelerator
CN110796229B (en) Device and method for realizing convolution operation
CN109324984B (en) Method and apparatus for using circular addressing in convolution operations
KR102470027B1 (en) Method and apparatus for extracting image data in parallel from multiple convolution windows, device, and computer-readable storage medium
KR20230081697A (en) Method and apparatus for accelerating dilatational convolution calculation
CN111047025B (en) Convolution calculation method and device
CN111325281B (en) Training method and device for deep learning network, computer equipment and storage medium
CN114037054A (en) Data processing method, device, chip, equipment and medium
US20130018773A1 (en) Order matching
CN110930290B (en) Data processing method and device
CN112825151A (en) Data processing method, device and equipment
CN111475304A (en) Feature extraction acceleration method and system

Legal Events

Code - Title
PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant