CN114219080A - Neural network acceleration processing method and related device - Google Patents


Info

Publication number
CN114219080A
Authority
CN
China
Prior art keywords: point, processing, channel, data, convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111682484.9A
Other languages
Chinese (zh)
Inventor
刘海威
董刚
杨宏斌
蒋东东
王斌强
胡克坤
赵雅倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd filed Critical Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN202111682484.9A
Publication of CN114219080A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The application discloses a neural network acceleration processing method, which comprises the following steps: the acceleration equipment acquires feature data from an on-chip feature cache module; acquires weight data from an on-chip weight cache module; performs parallel convolution calculation on the feature data and the weight data in a pipeline manner by using a plurality of processing combinations in a point-by-point processing array to obtain a point-by-point convolution result; performs accumulation calculation processing on the point-by-point convolution result by using an accumulation module to obtain a point-by-point accumulation processing result; and performs parallel convolution calculation on the point-by-point accumulation processing result and the corresponding weight data in the on-chip weight cache module in a pipeline manner by using a channel-by-channel processing array to obtain a channel-by-channel convolution result, thereby improving the acceleration of convolution calculation while remaining compatible with both depthwise separable convolution and standard convolution. The application also discloses a neural network acceleration processing apparatus, an acceleration system, and a computer-readable storage medium, which have the same beneficial effects.

Description

Neural network acceleration processing method and related device
Technical Field
The present application relates to the field of neural network computing technologies, and in particular, to a neural network acceleration processing method, a neural network acceleration processing apparatus, an acceleration system, and a computer-readable storage medium.
Background
With the continuous development of information technology, CNN (Convolutional Neural Networks) are applied ever more widely and face increasingly strict accuracy requirements, so CNN algorithm models keep growing and their computational complexity keeps rising, which inevitably and greatly increases the computation load of a CNN network. On the other hand, as the real-time requirements in fields such as automatic driving and the industrial internet become more demanding, more and more CNN inference needs to be completed at the edge, and the inference therefore needs to be accelerated.
In the related art, most neural network accelerators focus on convolution acceleration, but because of the limitation of memory access bandwidth, the parallelism of convolution calculation cannot be very high and the acceleration effect is limited. In addition, some neural network accelerators use depthwise separable convolution to replace part of the standard convolution layers to accelerate the network, and their architectures are specially designed for depthwise separable convolution, so their compatibility with the standard convolution calculation process is not good enough, which affects the acceleration performance.
Therefore, how to improve the acceleration of convolution calculation is a key concern for those skilled in the art.
Disclosure of Invention
An object of the present application is to provide a neural network acceleration processing method, a neural network acceleration processing apparatus, an acceleration system, and a computer-readable storage medium, which improve the effect of accelerating convolution calculation.
In order to solve the above technical problem, the present application provides a neural network acceleration processing method, including:
the acceleration equipment acquires feature data from an on-chip feature cache module;
acquiring weight data from an on-chip weight cache module;
performing parallel convolution calculation on the characteristic data and the weight data in a pipeline mode by adopting a plurality of processing combinations in the point-by-point processing array to obtain a point-by-point convolution result; wherein each of the processing combinations comprises a plurality of processing cores; wherein each of the processing cores comprises a plurality of multiplication units and a set of addition trees;
performing accumulation calculation processing on the point-by-point convolution result by adopting an accumulation module to obtain a point-by-point accumulation processing result;
and performing parallel convolution calculation on the point-by-point accumulation processing result and the corresponding weight data in the on-chip weight cache module in a pipeline mode by adopting a channel-by-channel processing array to obtain a channel-by-channel convolution result.
Optionally, performing parallel convolution calculation on the feature data and the weight data in a pipeline manner by using a plurality of processing combinations in the point-by-point processing array to obtain a point-by-point convolution result includes:
receiving feature data and weight data corresponding to each path in parallel from N paths of input channels of the point-by-point processing array;
and performing parallel convolution calculation on the feature data and the weight data of the corresponding channel through each processing core in each processing combination to obtain a point-by-point convolution result corresponding to each output channel.
Optionally, using an accumulation module to perform accumulation calculation processing on the point-by-point convolution result to obtain a point-by-point accumulation processing result includes:
accumulating the point-by-point convolution results of each output channel by adopting the accumulation module to obtain accumulated data;
performing data processing on the accumulated data to obtain the point-by-point accumulation processing result; wherein the data processing comprises one or more of quantization processing, fused activation processing and residual processing.
Optionally, performing parallel convolution calculation on the point-by-point accumulation processing result and the corresponding weight data in the on-chip weight cache module in a pipeline manner by using a channel-by-channel processing array to obtain a channel-by-channel convolution result includes:
when depthwise separable convolution calculation is carried out, writing the point-by-point accumulation processing result into a point-by-point feature cache;
acquiring channel-by-channel weight data from an on-chip weight cache module;
performing parallel convolution calculation on the point-by-point accumulation processing result and the channel-by-channel weight data in a pipeline mode by adopting a plurality of channel-by-channel processing cores of a channel-by-channel processing array to obtain a channel-by-channel convolution result; wherein each of the channel-by-channel processing cores includes a plurality of multiplication units and a set of addition trees.
Optionally, the method further includes:
when depthwise separable convolution calculation is carried out, carrying out fused activation processing on the channel-by-channel convolution result, and carrying out quantization processing on the processing result to obtain a channel-by-channel processing result;
when 3x3 convolution calculation is carried out, performing accumulation calculation on the channel-by-channel convolution result by adopting a 3x3 accumulation module, and performing data processing on the result of the accumulation calculation to obtain a 3x3 convolution calculation result; wherein the data processing comprises one or more of quantization processing, fused activation processing and residual processing.
Optionally, the method further includes:
storing feature data based on a preset format by using the on-chip feature cache module; the on-chip characteristic caching module is constructed by three groups of caching units.
Optionally, the method further includes:
performing pooling calculation on the feature data in the on-chip feature caching module through a pooling module to obtain a pooling result;
and writing the pooling result into the on-chip characteristic caching module.
The present application also provides a neural network accelerated processing apparatus, including:
the characteristic data acquisition module is used for acquiring characteristic data from the on-chip characteristic cache module;
the weight data acquisition module is used for acquiring weight data from the on-chip weight cache module;
the point-by-point calculation module is used for carrying out parallel convolution calculation on the characteristic data and the weight data in a pipeline mode by adopting a plurality of processing combinations in the point-by-point processing array to obtain a point-by-point convolution result; wherein each of the processing combinations comprises a plurality of processing cores; wherein each of the processing cores comprises a plurality of multiplication units and a set of addition trees;
a 1x1 accumulation calculation module, configured to perform accumulation calculation processing on the point-by-point convolution result to obtain a point-by-point accumulation processing result;
and the channel-by-channel calculation module is used for performing parallel convolution calculation on the point-by-point accumulation processing result and the corresponding weight data in the on-chip weight cache module in a pipeline mode to obtain a channel-by-channel convolution result.
The present application further provides an acceleration system, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the neural network acceleration processing method as described above when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the neural network acceleration processing method as set forth above.
The application provides a neural network acceleration processing method, which comprises the following steps: the acceleration equipment acquires feature data from an on-chip feature cache module; acquiring weight data from an on-chip weight cache module; performing parallel convolution calculation on the characteristic data and the weight data in a pipeline mode by adopting a plurality of processing combinations in the point-by-point processing array to obtain a point-by-point convolution result; wherein each of the processing combinations comprises a plurality of processing cores; wherein each of the processing cores comprises a plurality of multiplication units and a set of addition trees; performing accumulation calculation processing on the point-by-point convolution result by adopting an accumulation module to obtain a point-by-point accumulation processing result; and performing parallel convolution calculation on the point-by-point accumulation processing result and the corresponding weight data in the on-chip weight cache module in a pipeline mode by adopting a channel-by-channel processing array to obtain a channel-by-channel convolution result.
By calculating the feature data and the weight data in the cache through the point-by-point processing array, performing parallel calculation through the plurality of processing combinations in the point-by-point processing array and the processing cores therein, and performing data processing in a pipeline manner through the channel-by-channel processing array, a pipelined parallel convolution calculation process is realized, the parallelism of the calculation process is improved, and the calculation efficiency of the neural network is improved. The method is also compatible with both depthwise separable convolution and standard convolution.
The present application further provides a neural network acceleration processing apparatus, an acceleration system, and a computer-readable storage medium, which have the above beneficial effects and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a neural network acceleration processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a neural network accelerator according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a PE core structure of a neural network accelerated processing method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a point-by-point PE combination of a neural network acceleration processing method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a point-by-point PE array of a neural network acceleration processing method according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a 1x1 accumulation module of a neural network acceleration processing method according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a channel-by-channel PE core of a neural network acceleration processing method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a channel-by-channel PE array of a neural network acceleration processing method according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a 3x3 accumulation module of a neural network acceleration processing method according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an on-chip feature cache module of a neural network accelerated processing method according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a neural network accelerated processing apparatus according to an embodiment of the present disclosure.
Detailed Description
The core of the application is to provide a neural network acceleration processing method, a neural network acceleration processing device, an acceleration system and a computer readable storage medium, so that the effect of accelerating convolution calculation is improved.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, most neural network accelerators focus on convolution acceleration, but because of the limitation of memory access bandwidth, the parallelism of convolution calculation cannot be very high and the acceleration effect is limited. In addition, some neural network accelerators use depthwise separable convolution to replace part of the standard convolution layers to accelerate the network, and their architectures are specially designed for depthwise separable convolution, so their compatibility with the standard convolution calculation process is not good enough, which affects the acceleration performance.
Therefore, the present application provides a neural network acceleration processing method, which calculates the feature data and weight data in the cache through a point-by-point processing array, performs parallel calculation through a plurality of processing combinations in the point-by-point processing array and the processing cores therein, and performs data processing in a pipeline manner through a channel-by-channel processing array, thereby implementing a pipelined parallel convolution calculation process, improving the parallelism of the calculation process, and improving the calculation efficiency of the neural network. The method is also compatible with both depthwise separable convolution and standard convolution.
The following describes a neural network acceleration processing method provided by the present application, by way of an example.
Referring to fig. 1, fig. 1 is a flowchart of a neural network acceleration processing method according to an embodiment of the present disclosure.
In this embodiment, the method may include:
s101, the acceleration equipment acquires feature data from an on-chip feature cache module;
it can be seen that this alternative is directed to accelerating the device's acquisition of feature data from the on-chip feature cache module.
The on-chip feature cache module temporarily stores feature data in the CNN network calculation process and can also be used for data caching in the input and output processes. The temporarily stored feature data includes the initial first-layer input feature data, convolutional layer calculation results, pooling layer calculation results and the like.
S102, acquiring weight data from an on-chip weight cache module;
it can be seen that this alternative is directed to accelerating the device's acquisition of weight data from the on-chip weight cache module.
The on-chip weight cache is used for storing weight data and providing the weight data to the point-by-point processing array and the channel-by-channel processing array in the required format and order. When the number of weight parameters of the neural network is not very large and the storage resources on the FPGA (Field-Programmable Gate Array)/ASIC (Application Specific Integrated Circuit) chip used are sufficient, all weight data can be stored on the chip; if the neural network has a large number of weight parameters and on-chip storage resources are tight, the weight data can be stored in an off-chip memory, the required data is read from the off-chip memory during calculation, and then the data is output to the point-by-point processing array or the channel-by-channel processing array.
S103, performing parallel convolution calculation on the feature data and the weight data in a pipeline mode by adopting a plurality of processing combinations in the point-by-point processing array to obtain a point-by-point convolution result; wherein each processing combination comprises a plurality of processing cores; wherein each processing core comprises a plurality of multiplication units and a set of addition trees;
on the basis of S101 and S102, this step aims to perform parallel convolution calculation on the feature data and the weight data using a plurality of processing combinations in the point-by-point processing array, resulting in a point-by-point convolution result.
Further, the step may include:
step 1, parallelly receiving feature data and weight data corresponding to each path from N paths of input channels of a point-by-point processing array;
and 2, performing parallel convolution calculation on the feature data and the weight data of the corresponding channel through each processing core in each processing combination to obtain a point-by-point convolution result corresponding to each output channel.
It can be seen that this alternative is primarily illustrative of how parallel convolution computations can be performed. In this alternative, feature data and weight data corresponding to each path are received in parallel from N paths of input channels of the point-by-point processing array, and feature data and weight data of a corresponding channel are subjected to parallel convolution calculation by each processing core in each processing combination to obtain a point-by-point convolution result corresponding to each output channel. Wherein N may be 8.
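As a non-limiting illustration (the function name and the numpy-based data layout below are assumptions of this sketch, not part of the disclosed hardware), the per-position point-by-point convolution performed by one processing core can be expressed as a multiply-accumulate over the N input channels:

```python
import numpy as np

def pointwise_partial_sum(features, weights):
    """One processing core: multiply the N input-channel features of a single
    feature-map position by the N corresponding 1x1 weights and sum them."""
    return int(np.dot(features.astype(np.int32), weights.astype(np.int32)))

# Example with N = 8 input channels, as in this embodiment
features = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.int8)
weights = np.array([1, 1, 1, 1, 1, 1, 1, 1], dtype=np.int8)
print(pointwise_partial_sum(features, weights))  # 36
```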
S104, performing accumulation calculation processing on the point-by-point convolution result by adopting an accumulation module to obtain a point-by-point accumulation processing result;
On the basis of S103, this step aims to use an accumulation module to perform accumulation calculation processing on the point-by-point convolution result to obtain a point-by-point accumulation processing result.
The point-by-point processing array is used for 1x1 convolution calculation, and the accumulation module, also referred to as the 1x1 accumulation module, is mainly used for completing the accumulation of the partial sums of all input channels and the quantization of the accumulation result; in addition, the activation and residual functions are integrated into it. A selector is arranged behind the 1x1 accumulation module: when depthwise separable convolution is calculated, the calculation result of the 1x1 accumulation module is output to the point-by-point feature cache for the channel-by-channel convolution calculation; when only a 1x1 convolution is calculated, the calculation result of the accumulation module is output to the on-chip feature cache module.
Further, the step may include:
step 1, performing accumulation processing on a point-by-point convolution result of each output channel by adopting an accumulation module to obtain accumulated data;
step 2, performing data processing on the accumulated data to obtain a point-by-point accumulation processing result; wherein the data processing comprises one or more of quantization processing, fused activation processing and residual processing.
It can be seen that this alternative mainly illustrates how the accumulation calculation is performed. In this alternative, an accumulation module is used to perform accumulation processing on the point-by-point convolution result of each output channel to obtain accumulated data, and data processing is performed on the accumulated data to obtain the point-by-point accumulation processing result; the data processing comprises one or more of quantization processing, fused activation processing and residual processing.
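A minimal software sketch of this accumulation step follows; the processing order, the ReLU activation and the scale parameter are assumptions made only for illustration, since the embodiment leaves the exact quantization and activation scheme open:

```python
import numpy as np

def accumulate_and_process(partial_sums, scale=1.0, residual=None, relu=True):
    """Accumulate the point-by-point partial sums of one output channel, then
    optionally apply residual addition, fused activation (ReLU assumed) and
    quantization back to the INT8 range."""
    acc = int(np.sum(np.asarray(partial_sums, dtype=np.int64)))  # wide accumulator
    if residual is not None:
        acc += int(residual)                                     # residual processing
    if relu:
        acc = max(acc, 0)                                        # fused activation
    return int(np.clip(round(acc * scale), -128, 127))           # quantization processing
```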
And S105, performing parallel convolution calculation on the point-by-point accumulated processing result and the corresponding weight data in the on-chip weight cache module in a pipeline mode by adopting the channel-by-channel processing array to obtain a channel-by-channel convolution result.
On the basis of S104, the step aims to adopt a channel-by-channel processing array to carry out parallel convolution calculation on the point-by-point accumulated processing result and the corresponding weight data in the on-chip weight cache module in a pipeline mode to obtain a channel-by-channel convolution result.
The channel-by-channel processing array can be used to calculate the channel-by-channel convolution and can also be used to calculate the standard convolution. Similar to the 1x1 accumulation module, the 3x3 accumulation module may integrate quantization, activation and residual functions in addition to completing the accumulation of the partial sums of the input channels. When the channel-by-channel convolution is calculated, the channel-by-channel processing array receives data from the point-by-point feature cache, and the calculation result is the final convolution layer output, which is directly output to the on-chip feature cache module; when the standard convolution (3x3 convolution) is calculated, the channel-by-channel processing array reads feature data from the on-chip feature cache module for calculation, the calculation result passes through the 3x3 accumulation module to complete the addition of the partial sums of the input channels, and the final result is obtained after operations such as quantization, activation and residual calculation and is output to the on-chip feature cache module.
Further, the step may include:
step 1, when depthwise separable convolution calculation is carried out, writing point-by-point accumulation processing results into a point-by-point feature cache;
step 2, acquiring channel-by-channel weight data from an on-chip weight cache module;
step 3, performing parallel convolution calculation on the point-by-point accumulated processing result and the channel-by-channel weight data in a pipeline mode by adopting a plurality of channel-by-channel processing cores of the channel-by-channel processing array to obtain a channel-by-channel convolution result; wherein each channel-by-channel processing core includes a plurality of multiplication units and a set of addition trees.
It can be seen that, after the processing of the array point by point is also described in this alternative, the processing calculation can be performed through the channel-by-channel array, so as to realize the accelerated calculation of the separable convolutional network. In the alternative, when depth separable convolution calculation is carried out, processing results are written into a point-by-point feature cache, channel-by-channel weight data are obtained from an on-chip weight cache module, and parallel convolution calculation is carried out on point-by-point accumulated processing results and channel-by-channel weight data in a pipeline mode by adopting a plurality of channel-by-channel processing cores of a channel-by-channel processing array to obtain channel-by-channel convolution results; wherein each channel-by-channel processing core includes a plurality of multiplication units and a set of addition trees.
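The channel-by-channel stage can be pictured as the depthwise 3x3 convolution below; the array shapes, the absence of padding and the unit stride are illustrative assumptions rather than parameters fixed by the embodiment:

```python
import numpy as np

def depthwise_conv3x3(feature_map, kernels):
    """Channel-by-channel 3x3 convolution: every channel of the point-by-point
    result is convolved with its own 3x3 kernel; no accumulation across channels."""
    c, h, w = feature_map.shape
    out = np.zeros((c, h - 2, w - 2), dtype=np.int64)
    for ch in range(c):                       # one channel-by-channel processing core per channel
        for i in range(h - 2):
            for j in range(w - 2):
                window = feature_map[ch, i:i + 3, j:j + 3].astype(np.int64)
                out[ch, i, j] = np.sum(window * kernels[ch])
    return out

result = depthwise_conv3x3(np.ones((8, 5, 5), dtype=np.int8),
                           np.ones((8, 3, 3), dtype=np.int8))
print(result.shape)  # (8, 3, 3), every value equals 9
```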
In addition, the present embodiment may further include:
step 1, when depthwise separable convolution calculation is carried out, carrying out fused activation processing on the channel-by-channel convolution result, and carrying out quantization processing on the processing result to obtain a channel-by-channel processing result;
step 2, when 3x3 convolution calculation is carried out, performing accumulation calculation on the channel-by-channel convolution result by adopting a 3x3 accumulation module, and performing data processing on the result of the accumulation calculation to obtain a 3x3 convolution calculation result; wherein the data processing comprises one or more of quantization processing, fused activation processing and residual processing.
On the basis of the previous alternative, compatibility with both depthwise separable convolution and 3x3 convolution is achieved, which broadens the application range of the acceleration device.
In addition, the present embodiment may further include:
storing feature data based on a preset format by adopting an on-chip feature cache module; the on-chip characteristic cache module is constructed by three groups of cache units.
In addition, the present embodiment may further include:
step 1, performing pooling calculation on feature data in an on-chip feature cache module through a pooling module to obtain a pooling result;
and 2, writing the pooling result into the on-chip characteristic caching module.
That is, the pooling module reads data from the on-chip feature caching module and writes back the data after the calculation is completed; in addition, the data operation module is used for completing other operations, such as channel data splicing and the like.
In summary, in this embodiment, the feature data and the weight data in the cache are calculated by the point-by-point processing array, parallel calculation is performed by the processing combinations and the processing cores in the point-by-point processing array, and data processing is performed in a pipeline manner by the channel-by-channel processing array, so that a pipelined parallel convolution calculation process is realized, the parallelism of the calculation process is improved, and the calculation efficiency of the neural network is improved. The method is also compatible with both depthwise separable convolution and standard convolution.
The following further describes a neural network accelerator of a neural network acceleration processing method provided in the present application, by using a specific embodiment.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a neural network accelerator according to an embodiment of the present disclosure.
In this embodiment, a neural network accelerator mainly accelerates a CNN network, and the accelerator mainly includes an input (on-chip) filter (weight) cache module, an on-chip feature cache module, a point-by-point convolution Processing unit array (hereinafter referred to as a point-by-point PE (Processing unit) array), a 1x1 accumulation module, a channel-by-channel convolution Processing unit array (hereinafter referred to as a channel-by-channel PE array), a 3x3 accumulation module, and a pooling module, and fig. 2 is an overall structural block diagram of the accelerator.
The neural network accelerator is generally deployed on an FPGA or an ASIC, and the CPU directly issues the preprocessed feature data to the on-chip feature cache. When the weight data volume is not large, the CPU can choose to send the weight data to the input (on-chip) filter cache for storage; when the CNN network has many parameters, the CPU can choose to send the weight data to the off-chip memory, in which case the input (on-chip) cache module serves as a first-level data cache, reading data from the off-chip memory and then passing it to the point-by-point PE array and the channel-by-channel PE array.
The point-by-point PE array is used for 1x1 convolution calculation, and the 1x1 accumulation module is mainly used for completing the accumulation of the partial sums of all input channels and the quantization of the accumulated result, and integrates the activation and residual functions. A selector is arranged behind the 1x1 accumulation module: when depthwise separable convolution is calculated, the calculation result of the 1x1 accumulation module is output to the point-by-point feature cache for the channel-by-channel convolution calculation; when only a 1x1 convolution is calculated, the calculation result of the accumulation module is output to the on-chip feature cache module.
The channel-by-channel PE array may be used to compute both the channel-by-channel convolution and the standard convolution. Similar to the 1x1 accumulation module, the 3x3 accumulation module integrates quantization, activation and residual functions in addition to completing the accumulation of the partial sums of the input channels. When the channel-by-channel convolution is calculated, the channel-by-channel PE array receives data from the point-by-point feature cache, and the calculation result is the final convolution layer output, which is directly output to the on-chip feature cache module; when the standard convolution (3x3 convolution) is calculated, the channel-by-channel PE array reads feature data from the on-chip feature cache module for calculation, the calculation result passes through the 3x3 accumulation module to complete the addition of the partial sums of the input channels, and after operations such as quantization (and optionally activation and residual calculation) the final result is obtained and output to the on-chip feature cache module.
The pooling module reads data from the on-chip characteristic caching module and writes the data back after the calculation is finished; the data operation module is used for completing other operations, such as channel data splicing and the like.
The neural network accelerator of the present embodiment is mainly used for CNN network acceleration, and the main modules thereof are described below.
Referring to fig. 3, fig. 3 is a schematic diagram of a PE core structure of a neural network accelerated processing method according to an embodiment of the present disclosure.
The PE core structure in the point-by-point PE array is shown in fig. 3; each PE core includes 8 multiplication units and a set of addition trees. d1-d8 are the feature data of the 8 input channels, w1-w8 are the weight data of the corresponding 8 channels, and the 8 multiplication units respectively complete the 1x1 convolution multiplication operations of the 8 channels. After the addition tree completes the addition of the partial sums of the 8 input channels, the result is output. It follows that a single PE core is able to perform parallel computation over the features of 8 input channels.
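The multiply units plus addition tree of one PE core can be modelled as follows; the pairwise tree reduction written here is an assumption of the sketch that mirrors a log2(8)-stage adder tree:

```python
def pe_core_1x1(d, w):
    """Point-by-point PE core: 8 multiplication units followed by an addition tree
    (d and w each hold 8 values, one per input channel)."""
    products = [di * wi for di, wi in zip(d, w)]       # 8 multiply units in parallel
    level = products
    while len(level) > 1:                              # addition tree: 8 -> 4 -> 2 -> 1
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

print(pe_core_1x1(range(1, 9), [1] * 8))  # 36
```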
Referring to fig. 4, fig. 4 is a schematic diagram of a point-by-point PE combination structure of a neural network acceleration processing method according to an embodiment of the present application.
The point-by-point PE combination is composed of 8 point-by-point PE cores, as shown in fig. 4, denoted PE1-PE8, and each PE core completes the multiply-accumulate calculation of 8 features and 8 filter values. Here D1 represents the feature data (8 values in total) of the 8 input channels corresponding to position 1 on the feature map, D2 represents the feature data of the 8 input channels corresponding to position 2 on the feature map, and so on. The 8-input-channel filter is the weight data (8 values in total) corresponding to the 8 input channels mentioned above; one point-by-point PE combination shares one group of weight data (8 values) and can complete the calculation of 8 paths of feature data over the 8 input channels in parallel.
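The weight reuse inside one point-by-point PE combination can be sketched as below (the list-of-lists layout is an assumption made for illustration): the 8 PE cores share one group of 8 weights while each core handles a different feature-map position D1-D8.

```python
def pe_combination(feature_positions, shared_weights):
    """Point-by-point PE combination: 8 PE cores, one per feature-map position,
    all reusing the same group of 8 input-channel weights."""
    return [sum(d * w for d, w in zip(pos, shared_weights))   # one PE core per position
            for pos in feature_positions]                      # D1 .. D8 in parallel

positions = [[p + c for c in range(8)] for p in range(8)]      # 8 positions x 8 channels
print(pe_combination(positions, [1] * 8))                      # 8 partial sums
```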
Referring to fig. 5, fig. 5 is a schematic diagram illustrating a point-by-point PE array structure of a neural network acceleration processing method according to an embodiment of the present disclosure.
As shown in fig. 5, the point-by-point PE array consists of 32 point-by-point PE combinations, each processing combination calculating one output channel. Feature input 1 represents the 8 sets of 8-input-channel features for output channel 1, feature input 2 represents the 8 sets of 8-input-channel features for output channel 2, and so on through feature input 3 to feature input 32. Weight input F1 represents the 8-input-channel weight data for output channel 1, weight input F2 represents the 8-input-channel weight data for output channel 2, and so on through weight input F3 to weight input F32.
By now it can be seen that the point-by-point PE array can perform parallel computation along 3 dimensions: feature map data, input channels and output channels. This embodiment is illustrated with 8 paths of feature data, 8 input channels and 32 output channels, where both the feature data and the filter data are quantized to the INT8 type.
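Functionally, one pass of the point-by-point PE array is equivalent to the small matrix product below; the random INT8 data and the numpy formulation are assumptions of this sketch and do not describe the hardware data path:

```python
import numpy as np

P, C_IN, C_OUT = 8, 8, 32                                   # positions x input ch x output ch
rng = np.random.default_rng(0)
features = rng.integers(-128, 128, size=(P, C_IN), dtype=np.int8)
filters = rng.integers(-128, 128, size=(C_OUT, C_IN), dtype=np.int8)

# The hardware produces all P x C_OUT multiply-accumulate results in parallel;
# the same arithmetic expressed as one INT32 matrix product:
partial_sums = features.astype(np.int32) @ filters.astype(np.int32).T
print(partial_sums.shape)  # (8, 32)
```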
Referring to fig. 6, fig. 6 is a schematic structural diagram of a 1x1 accumulation module of a neural network acceleration processing method according to an embodiment of the present disclosure.
The 1x1 accumulation module receives the partial sum results output by the point-by-point PE array and completes the accumulation calculation, quantization, activation and other operations; its structure is shown in fig. 6. The 1x1 accumulation module performs parallel computation on the 32 output channels, matching the point-by-point PE array. In the figure, partial sum 1 to partial sum 32 are the partial sums of output channels 1-32, which need to be quantized after being accumulated. The quantization changes the accumulated result (INT32) back into the INT8 format; even so, the bit width of the data after 32-channel parallel computation is 8 features x 32 channels x 8 bit = 2048 bit, which is too wide and would burden subsequent data processing, so a FIFO (First Input First Output) converts it into 8 features x 8 output channels = 512 bit.
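The width conversion performed by the FIFO can be checked with the short sketch below; the numpy slicing only illustrates the regrouping arithmetic, not the FIFO implementation itself:

```python
import numpy as np

block = np.zeros((8, 32), dtype=np.int8)                 # 8 features x 32 channels = 2048 bits
words = [block[:, c:c + 8] for c in range(0, 32, 8)]     # regrouped into 8 x 8 chunks
assert all(w.size * 8 == 512 for w in words)             # four 512-bit output words
print(len(words))  # 4
```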
Depending on the network structure, it is possible to choose whether to perform the activation operation and the residual calculation in the 1x1 accumulation module. If residual calculation follows the point-by-point convolution layer, the corresponding data can be read from the on-chip cache unit while the 1x1 accumulation calculation is performed, and the data is written back to the on-chip feature cache module after the residual calculation is completed. The order of the feature data written back to the on-chip feature cache module is the same every time, so the data can be read from the on-chip cache module in sequence, and the accumulation calculation and the residual operation can form a pipeline, which avoids a large number of memory accesses and saves processing time. The final calculation result of the 1x1 accumulation module passes through a selector and is output to the channel-by-channel convolution array or the on-chip feature cache module.
Further, when the neural network accelerator calculates a depthwise separable convolution, the processing result of the point-by-point PE array first enters the point-by-point feature cache. The point-by-point feature cache mainly reorders the cached data and adapts its port format as required, and transmits the data to the channel-by-channel PE array.
Referring to fig. 7, fig. 7 is a schematic diagram of a channel-by-channel PE core structure of a neural network acceleration processing method according to an embodiment of the present application.
The channel-by-channel PE array consists of 32 channel-by-channel PE cores, and the 3x3 convolution calculation of 32 output channels can be completed in parallel. The structure of the channel-by-channel PE core is shown in FIG. 7: the PE core includes 9 multiplication units and a set of addition trees; each multiplication unit completes the multiplication of feature data (d0-d8 in FIG. 7) and weight data (w0-w8 in FIG. 7), and the addition tree completes the summation of the 9 multiplication results. The addition tree comprises 4 stages and is suitable for pipeline calculation. A quantization and activation component is connected behind the addition tree: if a depthwise separable convolution is calculated, the quantization operation and the activation operation are required; if a standard convolution is calculated, the accumulation result of the addition tree is output directly.
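A behavioural sketch of one channel-by-channel PE core follows; the ReLU activation and the quantization scale are assumptions, since the embodiment only states that quantization and activation are applied in the separable case:

```python
def depthwise_pe_core(d, w, separable=True, scale=1.0):
    """Channel-by-channel PE core: 9 multiply units (d0-d8 x w0-w8) reduced by an
    addition tree; quantization and activation are applied only for depthwise
    separable convolution, otherwise the raw accumulation result is output."""
    acc = sum(di * wi for di, wi in zip(d, w))             # 9 products + addition tree
    if not separable:
        return acc                                         # standard 3x3: pass the partial sum on
    acc = max(acc, 0)                                      # fused activation (ReLU assumed)
    return max(-128, min(127, round(acc * scale)))         # quantize to the INT8 range

print(depthwise_pe_core(range(9), [1] * 9))                # 36
```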
Referring to fig. 8, fig. 8 is a schematic diagram illustrating a channel-by-channel PE array structure of a neural network acceleration processing method according to an embodiment of the present disclosure.
The structure of the channel-by-channel PE array is shown in fig. 8. The array contains 32 channel-by-channel processing cores (PE1-PE32), and each PE core completes the 3x3 convolution calculation of one output channel. Feature channel 1 to feature channel 32 are the feature data corresponding to output channels 1-32 respectively, each channel containing 9 feature values; weight channel 1 to weight channel 32 are the weight data corresponding to output channels 1-32 respectively, each channel containing 9 weight values. To ensure that the data format and storage order of the channel-by-channel convolution results are consistent with the point-by-point convolution results, a FIFO is arranged behind each PE core, and the calculation results are output together once 8 of them have accumulated. In addition, the 8-feature x 32-channel data output in parallel by the 32 PE cores also needs to pass through a FIFO to be converted into 8 features x 8 channels (512 bits) before being output. Finally, the data passes through the selector and is output to the 3x3 accumulation module or the on-chip feature cache module.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a 3x3 accumulation module of a neural network acceleration processing method according to an embodiment of the present disclosure.
When the standard convolution is calculated, the calculation results of the channel-by-channel array need to be accumulated to obtain the final result. The neural network accelerator of this embodiment therefore includes a 3x3 accumulation module whose structure is substantially the same as that of the 1x1 accumulation module, as shown in fig. 9: the 3x3 accumulation module receives the partial sum results output by the channel-by-channel PE array and completes the accumulation calculation, quantization, activation and other operations. The 3x3 accumulation module performs parallel computation on the 32 output channels to match the channel-by-channel PE array. In the figure, partial sum 1 to partial sum 32 are the partial sums of output channels 1-32, which need to be quantized after being accumulated. The quantization changes the accumulated result (INT32) back into the INT8 format; even so, the bit width of the data after 32-channel parallel computation is 8 features x 32 channels x 8 bit = 2048 bit, which is too wide and would burden subsequent data processing, so a FIFO converts it into 8 features x 8 output channels = 512 bit.
Similar to the 1x1 accumulation module, depending on the network architecture, it may be selected whether to perform the activation operation and the residual calculation in the 3x3 accumulation module. If the convolution layer is followed by residual calculation, the corresponding data can be read from the on-chip cache unit while performing 3x3 accumulation calculation, and the data is written back to the on-chip feature cache module after the residual calculation is completed. Likewise, the order of the feature data written back to the on-chip feature cache module each time is the same, so data can be read from the on-chip feature cache module in sequence. Unlike the 1x1 accumulation module, the results of the 3x3 accumulation module are written directly back to the on-chip feature cache module.
The on-chip filter cache module is used for storing the weight data and providing the filter data to the point-by-point PE array and the channel-by-channel PE array in the required format and order. When the number of weight parameters of the neural network is not large and the storage resources of the FPGA/ASIC chip used are sufficient, all weight data can be stored on the chip; if the neural network has a large number of weight parameters and on-chip storage resources are tight, the weight data can be stored in an off-chip memory, the required data is read from the off-chip memory during calculation, and then the data is output to the point-by-point PE array or the channel-by-channel PE array.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an on-chip feature cache module of a neural network accelerated processing method according to an embodiment of the present disclosure.
The on-chip feature cache module temporarily stores the feature data in the CNN network calculation process and can also be used for data buffering in the input and output processes. The temporarily stored feature data includes the initial first-layer input feature data, convolutional layer calculation results, pooling layer calculation results and the like. As shown in fig. 10, the on-chip feature cache module mainly includes 3 storage blocks, namely storage block A, storage block B and storage block C. The structures of the 3 storage blocks are identical, the port widths are all 512 bits, and each storage block is composed of a simple dual-port RAM that can be read and written; choosing simple dual-port RAM reduces the consumption of RAM resources. Each storage block has independent read-write control logic, the module as a whole comprises 3 groups of interfaces (A, B and C), and the storage-block selection control logic determines which block (or blocks) need to be read or written. With 3 storage blocks, any 2 of them can form a ping-pong structure, meeting the requirements of pipeline processing such as convolution calculation and residual calculation, and supporting the calculation of the following scenarios:
when a convolution is calculated, assuming the feature data currently to be calculated is temporarily stored in storage block A, the feature data is read from storage block A, and the result is written back to storage block B while the convolution array performs pipelined calculation;
when a convolution layer plus residual is calculated, assuming the calculation result of an earlier convolution layer m is temporarily stored in storage block A and the feature data of the convolution layer to be calculated is temporarily stored in storage block B, the feature data is read from storage block B for the convolution calculation, the accumulation module simultaneously reads the data required for the residual from storage block A, and the final result is written into storage block C after the residual calculation;
when a pooling calculation is performed, if the feature data currently to be calculated is temporarily stored in storage block A, the data is read from storage block A and written back to storage block B after the calculation is completed.
In the above scenarios, the intermediate results in the CNN network (calculation results of convolution layers, residuals and the like) are written to the on-chip feature cache module for storage, and the data is read from the on-chip feature cache module when the next layer is calculated, which avoids access to the off-chip memory, breaks the limitation of insufficient bandwidth for accessing the off-chip memory, and shortens the time for reading and writing data. In addition, with 3 storage blocks, residual calculation can be performed while the convolution is calculated, and the whole residual operation is integrated into the pipeline, so the calculation of the whole network can be accelerated.
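The three-block scheduling can be mimicked in software as below; the dictionary of arrays and the single helper function are illustrative assumptions that only show the read/write pattern, not the accelerator's control logic:

```python
import numpy as np

blocks = {name: np.zeros((8, 8), dtype=np.int8) for name in "ABC"}  # storage blocks A, B, C

def layer(src, dst, residual=None):
    """Read features from src, optionally add residual data read from a second
    block in the same pipeline, and write the result to dst."""
    data = blocks[src].astype(np.int32)
    if residual is not None:
        data += blocks[residual]                    # residual read alongside accumulation
    blocks[dst] = np.clip(data, -128, 127).astype(np.int8)

layer("A", "B")                    # plain convolution: read A, write B
layer("B", "C", residual="A")      # convolution + residual of an earlier layer m: (B, A) -> C
layer("A", "B")                    # pooling follows the same read/write pattern
```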
The pooling operation is one of the most common calculations in the CNN network, and the pooling module in this embodiment is used to complete the pooling calculation, and the pooling module reads data from the on-chip feature cache module and writes the data back to the on-chip feature cache module after the calculation is completed. Other data operation modules can complete operations such as data splicing and the like, and can be flexibly designed according to network structures and the conditions of available resources of the FPGA/ASIC.
As can be seen, this embodiment supports a point-by-point PE array design with parallel computation along 3 dimensions: the structure of the point-by-point PE array allows parallel computation over feature map data, input channels and output channels, which can greatly increase the computation parallelism. All intermediate calculation results of the CNN network are temporarily stored on the chip, which avoids frequent access to the off-chip memory and eliminates the influence of insufficient access bandwidth on the acceleration performance. Through the design of 3 groups of storage blocks, the residual calculation is embedded into the accumulation module to realize pipeline processing, which reduces the number of memory accesses and accelerates the network calculation process. The 3 groups of independently readable and writable storage blocks are designed so that any two of them can form a ping-pong structure suitable for pipeline processing; in addition, the 3 groups of storage blocks can guarantee pipelined processing of the residual calculation.
In the following, the neural network acceleration processing apparatus provided in the embodiment of the present application is introduced, and the neural network acceleration processing apparatus described below and the neural network acceleration processing method described above may be referred to correspondingly.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a neural network accelerated processing apparatus according to an embodiment of the present disclosure.
In this embodiment, the apparatus may include:
a feature data obtaining module 100, configured to obtain feature data from an on-chip feature cache module;
a weight data obtaining module 200, configured to obtain weight data from the on-chip weight cache module;
a point-by-point calculation module 300, configured to perform parallel convolution calculation on the feature data and the weight data in a pipeline manner by using multiple processing combinations in a point-by-point processing array, so as to obtain a point-by-point convolution result; wherein each of the processing combinations comprises a plurality of processing cores; wherein each of the processing cores comprises a plurality of multiplication units and a set of addition trees;
a 1x1 accumulation calculation module 400, configured to perform accumulation calculation processing on the point-by-point convolution result to obtain a point-by-point accumulation processing result;
and a channel-by-channel calculation module 500, configured to perform parallel convolution calculation on the point-by-point accumulation processing result and the corresponding weight data in the on-chip weight cache module in a pipeline manner, so as to obtain a channel-by-channel convolution result.
Optionally, the point-by-point calculating module 300 is specifically configured to receive, in parallel, the feature data and the weight data corresponding to each path from the N paths of input channels of the point-by-point processing array; and performing parallel convolution calculation on the feature data and the weight data of the corresponding channel through each processing core in each processing combination to obtain a point-by-point convolution result corresponding to each output channel.
Optionally, the 1x1 accumulation calculation module 400 is specifically configured to perform accumulation processing on the point-by-point convolution result of each output channel by using the accumulation module to obtain accumulated data; performing data processing on the accumulated data to obtain the point-by-point accumulation processing result; wherein the data processing comprises one or more of quantization processing, fused activation processing and residual processing.
Optionally, the channel-by-channel calculation module 500 is specifically configured to, when performing depthwise separable convolution calculation, write the point-by-point accumulation processing result into a point-by-point feature cache; acquire channel-by-channel weight data from the on-chip weight cache module; and perform parallel convolution calculation on the point-by-point accumulation processing result and the channel-by-channel weight data in a pipeline manner by using a plurality of channel-by-channel processing cores of the channel-by-channel processing array to obtain a channel-by-channel convolution result; wherein each of the channel-by-channel processing cores includes a plurality of multiplication units and a set of addition trees.
Optionally, the apparatus may further include:
a 3x3 accumulation calculation module, configured to perform, when depthwise separable convolution calculation is carried out, fused activation processing on the channel-by-channel convolution result, and perform quantization processing on the processing result to obtain a channel-by-channel processing result; and, when 3x3 convolution calculation is carried out, perform accumulation calculation on the channel-by-channel convolution result by using a 3x3 accumulation module, and perform data processing on the result of the accumulation calculation to obtain a 3x3 convolution calculation result; wherein the data processing comprises one or more of quantization processing, fused activation processing and residual processing.
Optionally, the apparatus may further include:
the characteristic data storage module is used for storing characteristic data based on a preset format by adopting the on-chip characteristic cache module; the on-chip characteristic caching module is constructed by three groups of caching units.
Optionally, the apparatus may further include:
the pooling module is used for performing pooling calculation on the feature data in the on-chip feature caching module to obtain a pooling result; and writing the pooling result into an on-chip characteristic caching module.
The embodiment of the present application further provides an acceleration system, which includes:
a memory for storing a computer program;
a processor for implementing the steps of the neural network accelerated processing method of any one of claims 1 to 7 when executing the computer program.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the neural network accelerated processing method according to any one of claims 1 to 7.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those skilled in the art will further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The neural network acceleration processing method, the neural network acceleration processing apparatus, the acceleration system, and the computer-readable storage medium provided by the present application have been described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and its core idea. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from the principles of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A neural network accelerated processing method is characterized by comprising the following steps:
the acceleration equipment acquires feature data from an on-chip feature cache module;
acquiring weight data from an on-chip weight cache module;
performing parallel convolution calculation on the feature data and the weight data in a pipeline mode by adopting a plurality of processing combinations in a point-by-point processing array to obtain a point-by-point convolution result; wherein each of the processing combinations comprises a plurality of processing cores; wherein each of the processing cores comprises a plurality of multiplication units and a set of addition trees;
performing accumulation calculation processing on the point-by-point convolution result by adopting an accumulation module to obtain a point-by-point accumulation processing result;
and performing parallel convolution calculation on the point-by-point accumulation processing result and the corresponding weight data in the on-chip weight cache module in a pipeline mode by adopting a channel-by-channel processing array to obtain a channel-by-channel convolution result.
2. The neural network accelerated processing method of claim 1, wherein performing parallel convolution calculation on the feature data and the weight data in a pipeline manner by using a plurality of processing combinations in a point-by-point processing array to obtain a point-by-point convolution result, comprises:
receiving feature data and weight data corresponding to each path in parallel from N paths of input channels of the point-by-point processing array;
and performing parallel convolution calculation on the feature data and the weight data of the corresponding channel through each processing core in each processing combination to obtain a point-by-point convolution result corresponding to each output channel.
3. The neural network acceleration processing method according to claim 2, wherein performing accumulation calculation processing on the point-by-point convolution result by using an accumulation module to obtain a point-by-point accumulation processing result includes:
accumulating the point-by-point convolution results of each output channel by adopting the accumulation module to obtain accumulated data;
performing data processing on the accumulated data to obtain the point-by-point accumulation processing result; wherein the data processing comprises one or more of quantization processing, fused activation processing and residual processing.
4. The neural network accelerated processing method according to claim 1, wherein performing parallel convolution calculation on the point-by-point accumulation processing result and the corresponding weight data in the on-chip weight cache module in a pipeline manner by using a channel-by-channel processing array to obtain a channel-by-channel convolution result comprises:
when performing depthwise separable convolution calculation, writing the point-by-point accumulation processing result into a point-by-point feature cache;
acquiring channel-by-channel weight data from the on-chip weight cache module;
performing parallel convolution calculation on the point-by-point accumulation processing result and the channel-by-channel weight data in a pipeline mode by adopting a plurality of channel-by-channel processing cores of a channel-by-channel processing array to obtain a channel-by-channel convolution result; wherein each of the channel-by-channel processing cores includes a plurality of multiplication units and a set of addition trees.
5. The neural network accelerated processing method according to claim 4, further comprising:
when performing depthwise separable convolution calculation, performing fused activation processing on the channel-by-channel convolution result, and performing quantization processing on the processing result to obtain a channel-by-channel processing result;
when performing 3x3 convolution calculation, performing accumulation calculation on the channel-by-channel convolution result by adopting a 3x3 accumulation module, and performing data processing on the result of the accumulation calculation to obtain a 3x3 convolution calculation result; wherein the data processing comprises one or more of quantization processing, fused activation processing and residual processing.
6. The neural network accelerated processing method according to any one of claims 1 to 5, further comprising:
storing feature data in a preset format by using the on-chip feature cache module; wherein the on-chip feature cache module is constructed from three groups of cache units.
7. The neural network accelerated processing method according to claim 6, further comprising:
performing pooling calculation on the feature data in the on-chip feature cache module through a pooling module to obtain a pooling result;
and writing the pooling result into the on-chip feature cache module.
8. An apparatus for neural network accelerated processing, comprising:
the feature data acquisition module is used for acquiring feature data from the on-chip feature cache module;
the weight data acquisition module is used for acquiring weight data from the on-chip weight cache module;
the point-by-point calculation module is used for performing parallel convolution calculation on the feature data and the weight data in a pipeline mode by adopting a plurality of processing combinations in a point-by-point processing array to obtain a point-by-point convolution result; wherein each of the processing combinations comprises a plurality of processing cores; wherein each of the processing cores comprises a plurality of multiplication units and a set of addition trees;
a 1x1 accumulation calculation module, configured to perform accumulation calculation processing on the point-by-point convolution result to obtain a point-by-point accumulation processing result;
and the channel-by-channel calculation module is used for performing parallel convolution calculation on the point-by-point accumulation processing result and the corresponding weight data in the on-chip weight cache module in a pipeline mode by adopting a channel-by-channel processing array to obtain a channel-by-channel convolution result.
9. An acceleration system, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the neural network accelerated processing method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the neural network accelerated processing method according to any one of claims 1 to 7.
CN202111682484.9A 2021-12-31 2021-12-31 Neural network acceleration processing method and related device Pending CN114219080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111682484.9A CN114219080A (en) 2021-12-31 2021-12-31 Neural network acceleration processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111682484.9A CN114219080A (en) 2021-12-31 2021-12-31 Neural network acceleration processing method and related device

Publications (1)

Publication Number Publication Date
CN114219080A true CN114219080A (en) 2022-03-22

Family

ID=80707681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111682484.9A Pending CN114219080A (en) 2021-12-31 2021-12-31 Neural network acceleration processing method and related device

Country Status (1)

Country Link
CN (1) CN114219080A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638352A (en) * 2022-05-18 2022-06-17 成都登临科技有限公司 Processor architecture, processor and electronic equipment

Similar Documents

Publication Publication Date Title
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
CN107169560B (en) Self-adaptive reconfigurable deep convolutional neural network computing method and device
CN109840589B (en) Method and device for operating convolutional neural network on FPGA
CN106875011B (en) Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
US20200394495A1 (en) System and architecture of neural network accelerator
JP6540841B1 (en) Arithmetic processing device, information processing device, information processing method, and program
CN111199273A (en) Convolution calculation method, device, equipment and storage medium
CN112668708B (en) Convolution operation device for improving data utilization rate
CN109543830A (en) A kind of fractionation accumulator for convolutional neural networks accelerator
CN111738427B (en) Operation circuit of neural network
CN113313243A (en) Method, device and equipment for determining neural network accelerator and storage medium
US20200364538A1 (en) Method of performing, by electronic device, convolution operation at certain layer in neural network, and electronic device therefor
CN104317738B (en) A kind of incremental calculation method based on MapReduce
CN111768458A (en) Sparse image processing method based on convolutional neural network
CN111240746B (en) Floating point data inverse quantization and quantization method and equipment
CN114387512B (en) Remote sensing image building extraction method based on multi-scale feature fusion and enhancement
US20210350230A1 (en) Data dividing method and processor for convolution operation
CN114219080A (en) Neural network acceleration processing method and related device
CN112836813A (en) Reconfigurable pulsation array system for mixed precision neural network calculation
CN109598335B (en) Two-dimensional convolution pulse array structure and implementation method
CN109902821B (en) Data processing method and device and related components
CN114461978A (en) Data processing method and device, electronic equipment and readable storage medium
CN112766397B (en) Classification network and implementation method and device thereof
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination