CN112257844B - Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof - Google Patents

Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof

Info

Publication number
CN112257844B
CN112257844B CN202011050462.6A
Authority
CN
China
Prior art keywords
precision
weight
low
data
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011050462.6A
Other languages
Chinese (zh)
Other versions
CN112257844A (en)
Inventor
卓成
周鲜
张力
郭楚亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202011050462.6A priority Critical patent/CN112257844B/en
Publication of CN112257844A publication Critical patent/CN112257844A/en
Application granted granted Critical
Publication of CN112257844B publication Critical patent/CN112257844B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491 Computations with decimal numbers radix 12 or 20
    • G06F7/498 Computations with decimal numbers radix 12 or 20 using counter-type accumulators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491 Computations with decimal numbers radix 12 or 20
    • G06F7/498 Computations with decimal numbers radix 12 or 20 using counter-type accumulators
    • G06F7/4981 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491 Computations with decimal numbers radix 12 or 20
    • G06F7/498 Computations with decimal numbers radix 12 or 20 using counter-type accumulators
    • G06F7/4983 Multiplying; Dividing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention discloses a convolutional neural network accelerator based on mixed-precision configuration and an implementation method thereof. The accelerator comprises a weight separation module, a low-precision processing module and a high-precision processing module. After the weights pass through the weight separation module of the accelerator, the high-precision weights and the low-precision weights are separated; the low-precision weights are calculated by the low-precision processing module, the high-precision weights are calculated by the high-precision processing module, and the results of the two parts are added by an addition module, finally realizing the convolution calculation. The invention provides a mixed-precision configuration framework that can be used in a general convolutional neural network accelerator: the weights are processed separately according to their precision, the calculation bit width of the computing-unit array is reduced, a large amount of storage area is saved, and the operating power consumption of the accelerator is reduced.

Description

Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
Technical Field
The invention relates to the engineering fields of convolutional neural networks, low-power design, edge computing and hardware accelerators, and in particular to a convolutional neural network accelerator based on mixed-precision configuration and an implementation method thereof.
Background
Convolutional neural networks are widely used in a variety of deep learning fields, particularly in edge computing applications. Mobile devices often have limited computing and memory resources, and therefore more efficient hardware is needed to achieve smaller memory footprints, faster inference, and lower power consumption. Compared with GPUs, whose power consumption is extremely high, FPGAs and ASICs are more suitable platforms for low-power convolutional neural network accelerators. However, as convolutional neural network (CNN) algorithms grow more complex, their computation and storage requirements increase significantly, which poses a great challenge for the design of deep neural network (DNN) hardware accelerators. There has been substantial progress in designing lightweight CNNs to achieve better speed and accuracy. In these studies, quantization and sparsification have become effective training methods: for a trained network, many weight values can be zeroed or quantized to fewer bits with limited loss of accuracy. Designing a basic processing element (PE) that processes only low-bit data to reduce overhead is therefore a very effective way to save power.
Many researchers have studied general accelerator architectures that exploit the distribution characteristics of the weights. However, the precision control in previous work is at most at layer granularity, i.e., within the same layer, all operations are stored and calculated with the same precision. Yet even within the same layer, the effective bit width of the weights varies greatly. The effective bit width is the minimum number of bits that can represent a weight without losing precision. For example, even if a 16-bit integer is used to store the value 3, the effective bit width of this value is only 2 bits, and the remaining 14 bits carry no information. This phenomenon arises from the robustness of neural networks, which tend to converge to sparser weights during quantization-driven training. To guarantee accuracy, the architectures in previous work must use the highest effective precision, i.e., the longest bit width, for each layer. However, many bits, especially the high-order bits, can be skipped in storage and computation to save storage or computing resources without any loss of network accuracy.
In summary, a convolutional neural network accelerator based on mixed-precision configuration, in which weights of different precisions are calculated separately within the same network layer, is the key to improving hardware utilization, saving storage space and reducing power consumption.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention aims to provide a convolutional neural network accelerator based on mixed-precision configuration and an implementation method thereof.
The purpose of the invention is achieved by the following technical scheme: a convolutional neural network accelerator based on mixed-precision configuration, in which each convolutional layer of the convolutional neural network is calculated by the accelerator. The accelerator comprises a low-precision processing module, a high-precision processing module, an on-chip global buffer and a weight separation module. The input of the first layer of the convolutional neural network is the image to be processed, and the output of the last layer is the image after the convolution calculation is finished;
the on-chip global buffer is composed of a global low-precision weight memory, a global high-precision weight memory, a feature map global buffer and a partial-sum global buffer used during the calculation. The feature map global buffer stores the feature maps read from the external memory;
The weight separation module reads weight data from the external memory, judges each weight to be high-precision or low-precision data according to a weight bit-width threshold, then stores the low-precision weight data into the global low-precision weight memory, and stores the high-precision weight data, together with the positions of the high-precision weights among all weights, into the global high-precision weight memory.
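This separation step can be illustrated with a short software sketch. The fragment below is a minimal model, not the hardware design; the helper names (`is_low_precision`, `separate_weights`), the 16-bit word size, the 8-bit threshold and the zero placeholder kept in the low-precision stream are all assumptions made for illustration.

```python
def is_low_precision(w, total_bits=16, threshold=8):
    """True if all bits above the threshold position merely repeat the sign bit,
    i.e. the effective bit width of w is within the threshold."""
    u = w & ((1 << total_bits) - 1)                  # two's-complement bit pattern
    sign = (u >> (total_bits - 1)) & 1
    return all(((u >> b) & 1) == sign for b in range(threshold, total_bits - 1))

def separate_weights(weights, threshold=8):
    """Split one group of weights into a low-precision stream and a list of
    (position, value) pairs for the high-precision weights.

    The position p comes from a simple counter over the incoming weights,
    mirroring the counter-based position recording described in the text."""
    low_stream, high_list = [], []
    for p, w in enumerate(weights):                  # p acts as the counter
        if is_low_precision(w, threshold=threshold):
            low_stream.append(w)                     # -> global low-precision weight memory
        else:
            low_stream.append(0)                     # assumed zero placeholder so the low path stays dense
            high_list.append((p, w))                 # value + position -> global high-precision weight memory (FPB)
    return low_stream, high_list
```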
The low-precision processing module is composed of a rectangular array of low-precision computing units. Each low-precision computing unit reads low-precision weight data from the global low-precision weight memory and reads feature map data from the feature map global buffer; each column of low-precision computing units in the rectangular array performs multiply-add calculations, accumulates the calculation results, and outputs them to the high-precision processing module.
The high-precision processing module is composed of high-precision computing units and addition modules, whose numbers are both equal to the number of columns of low-precision computing units. Each high-precision computing unit reads high-precision weight data and the positions of the high-precision weights among all weights from the global high-precision weight memory. Each high-precision computing unit contains a weight decoding module and a feature map extraction module; the position of each weight is decoded to obtain the row and column numbers of the feature map data within all feature maps, the feature map data multiplexing information and the output channel number, which are used in the subsequent calculation of the high-precision computing unit. The specific process is as follows:
and according to the position of each high-precision weight in all the weights, the coordinates k, u, i and j of the high-precision weight in all the weight data can be obtained by decoding. Each high precision weight data and its coordinates may be represented as W [ k ] [ u ] [ i ] [ j ], where k and u represent the number of input channels and the number of output channels for the weight, and i and j represent the number of rows and columns for the weight. The specific decoding process is as follows:
p is i + j × S + u × S × R + k × S × R × C. Wherein p is the position of the high-precision weight in all weights, S is the length of all weights, R is the height of all weights, and C is the number of input channels of all weights, i is more than or equal to 0 and less than S, j is more than or equal to 0 and less than R, and k is more than or equal to 0 and less than C. Through the relationship, the weight decoding module can obtain the coordinates k, u, i and j of high-precision weights in all weight data through the values of p, S, R and C.
The feature map position corresponding to each weight is related only to the i and j of that weight. Specifically, because the low-precision computing units and the high-precision computing unit are added together by columns, the parameter j determines the column number of the feature map data within all feature maps, and the weight decoding module passes this column number to the feature map extraction module. Because the low-precision and high-precision computing units are not added together by rows, i must be mapped to the row number of the feature map data within all feature maps according to the arrangement of the low-precision and high-precision computing units; a lookup table is built from this mapping and deployed in the weight decoding module, so that passing the parameter i through the lookup table yields the row number, which is then sent to the feature map extraction module. The feature map extraction module extracts feature map data from the feature map global buffer according to the row and column numbers of the feature map data within all feature maps.
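The contents of that lookup table depend entirely on how the LPPE and FPPE arrays are arranged, so the fragment below is only an illustrative assumption: a stride-1 arrangement in which the decoded kernel row i is added to the base feature-map row handled by the column.

```python
# Illustrative only: one possible i -> feature-map-row lookup table, assuming a
# stride-1 convolution in which each column works on one output row `base_row`.
# The real table contents are fixed by the arrangement of the PE arrays.
def build_row_lut(kernel_height, base_row, stride=1):
    return {i: base_row * stride + i for i in range(kernel_height)}

row_lut = build_row_lut(kernel_height=3, base_row=0)    # {0: 0, 1: 1, 2: 2}
# The weight decoding module would index such a table with the decoded
# parameter i and pass the resulting row number to the feature map
# extraction module.
```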
Because the same feature map data may correspond to multiple high-precision weights, the feature map data multiplexing information is needed to realize reuse of the feature map data. According to the characteristics of the convolution calculation, each feature map datum must be paired with weights of the same input channel, so the input-channel parameter k of a high-precision weight can serve as the feature map multiplexing information. Each high-precision computing unit contains a feature map reading module; the weight decoding module passes the parameter k to the feature map reading module as the feature map data multiplexing information, and the feature map reading module controls the reading of the feature maps extracted by the feature map extraction module according to this information, thereby realizing feature map data reuse.
Each high-precision computing unit performs sequential calculation according to the high-precision weight data and the extracted feature map data. During the calculation, the partial sums of several output channels need to be accumulated, and the number of output channels accumulated is controlled by the parameter u. After the calculation of the high-precision computing unit is finished, the parameter u from the weight decoding module is read to determine the output channel of the calculation result, and the result is output to the correct output channel of the addition module.
Each addition module reads the calculation result of the corresponding column of the rectangular array of low-precision computing units and the calculation result of the corresponding high-precision computing unit, adds them to obtain partial-sum data, and outputs the partial-sum data to the on-chip global buffer;
the above process is repeated until the partial-sum global buffer stores all the partial-sum data required for the calculation; the partial-sum data are accumulated to obtain the complete output feature map, which is output to the external memory.
Each convolutional layer in the convolutional neural network is calculated by the accelerator in this way, yielding the image after the convolution calculation is finished.
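The net effect of the mixed-precision split, namely that the low-precision and high-precision partial results add up to the original convolution, can be checked with a small numerical model. The NumPy sketch below models only the arithmetic, not the PE arrays or buffers, and assumes the low-precision stream carries zeros at the positions of the separated high-precision weights; the 8-bit threshold and all names are illustrative.

```python
import numpy as np

def conv2d(fmap, weights):
    """Plain direct convolution: fmap is (C, H, W), weights is (M, C, R, S)."""
    M, C, R, S = weights.shape
    _, H, W = fmap.shape
    out = np.zeros((M, H - R + 1, W - S + 1), dtype=np.int64)
    for m in range(M):
        for c in range(C):
            for r in range(R):
                for s in range(S):
                    out[m] += weights[m, c, r, s] * fmap[c, r:r + out.shape[1], s:s + out.shape[2]]
    return out

rng = np.random.default_rng(0)
fmap = rng.integers(-128, 128, size=(4, 8, 8))
weights = rng.integers(-512, 512, size=(2, 4, 3, 3))       # some weights exceed 8 bits

low_mask = (weights >= -128) & (weights < 128)              # assumed 8-bit threshold
low_w = np.where(low_mask, weights, 0)                      # low-precision stream (zeros at promoted positions)
high_w = np.where(low_mask, 0, weights)                     # separated high-precision weights

full = conv2d(fmap, weights)
split = conv2d(fmap, low_w) + conv2d(fmap, high_w)          # LPPE-path result + FPPE-path result
assert np.array_equal(full, split)                          # the addition module restores the full result
```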
Further, a weight whose effective bit width is greater than the weight bit-width threshold is recorded as a high-precision weight, and a weight whose effective bit width is less than or equal to the threshold is recorded as a low-precision weight.
Furthermore, each low-precision computing unit has an independent position coordinate in the rectangular array and corresponds to the bus address one by one, the low-precision computing units read low-precision weight data from the global low-precision weight memory according to a certain sequence, and the sequence is determined according to the structure and the requirement of the convolutional neural network.
Further, the external memory stores weight data of all the convolution layers, each convolution layer comprises a plurality of sets of weight data required by calculation, and the weight separation module reads one set of weight data from the external memory each time.
Further, the specific process by which each column of low-precision computing units in the rectangular array performs multiply-add calculation and accumulates the results is as follows: each low-precision computing unit contains a low-precision feature map memory, a low-precision weight memory, a low-precision partial-sum memory, a low-precision multiplier and a low-precision adder. The low-precision feature map memory stores feature map data read from the feature map global buffer, and the low-precision weight memory stores weight data read from the global low-precision weight memory; both are connected to the low-precision multiplier. In each step the low-precision multiplier multiplies one weight from the low-precision weight memory with one feature map pixel from the low-precision feature map memory, and the product is fed, together with the value in the low-precision partial-sum memory, into the low-precision adder for accumulation; the value in the low-precision partial-sum memory is the result of the previous weight-pixel calculation. In the last accumulation step, the result is added to the input partial sum to obtain the output partial sum of the low-precision computing unit; the input partial sum is the output partial sum of the previous low-precision computing unit. The above process is repeated until the one-dimensional linear convolution is completed.
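The one-dimensional multiply-accumulate performed by one low-precision computing unit, including the addition of the partial sum handed over by its neighbour, can be sketched as follows; the function name and calling convention are illustrative assumptions, not the hardware interface.

```python
def lppe_1d_conv(weights_row, fmap_row, psum_in):
    """One LPPE pass: 1-D convolution of a weight row with a feature-map row,
    followed by addition of the partial sum received from the previous LPPE.

    weights_row: low-precision weights held in the LPPE weight memory
    fmap_row:    pixels held in the LPPE feature-map memory
    psum_in:     partial sums from the neighbouring LPPE (one per output pixel)
    """
    S = len(weights_row)
    n_out = len(fmap_row) - S + 1
    psum_out = []
    for x in range(n_out):
        acc = 0                                    # low-precision partial-sum register
        for s in range(S):                         # one multiply-accumulate per step
            acc += weights_row[s] * fmap_row[x + s]
        psum_out.append(acc + psum_in[x])          # add the incoming partial sum last
    return psum_out

# Example: 3-tap weight row, 5-pixel feature-map row, zero incoming partial sums.
print(lppe_1d_conv([1, 2, 1], [3, 0, -1, 2, 4], [0, 0, 0]))   # -> [2, 0, 7]
```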
Furthermore, the area of a low-precision computing unit is less than half that of a high-precision computing unit, and the number of low-precision computing units is dozens of times the number of high-precision computing units.
Further, each high-precision computing unit also contains a high-precision feature map memory, a high-precision weight memory, a high-precision partial-sum memory and a high-precision multiplier-adder. The high-precision feature map memory stores the feature map data extracted from the feature map global buffer, the high-precision weight memory stores the weight data read from the global high-precision weight memory, and the high-precision partial-sum memory stores the partial sums of the output channels while the high-precision computing unit performs its sequential calculation. The high-precision computing unit feeds the read high-precision weight data, the feature map data and the partial sum of the output channel into the high-precision multiplier-adder for calculation, accumulates the result, and outputs it to the addition module.
Further, controlling the number of output channels during accumulation according to the parameter u specifically means: the number of output channels accumulated during the calculation is controlled by the read address of the high-precision partial-sum memory corresponding to the parameter u output by the weight decoding module.
Further, the weight separation module comprises a counter, and the position of the high-precision weight in all weights is counted by the counter in the weight separation module.
The invention also provides a method for realizing the convolutional neural network accelerator based on mixed precision configuration, which comprises the following steps:
(1) the weight separation module is used for reading weight data from an external memory, judging the weight data to be high-precision weight data or low-precision weight data according to a weight bit width threshold, storing the low-precision weight data into a global low-precision weight memory, and storing the high-precision weight data and the positions of the high-precision weights in all weights into the global high-precision weight memory;
(2) the low-precision computing units read low-precision weight data from the global low-precision weight memory of step (1) and read feature map data from the feature map global buffer; each column of low-precision computing units in the rectangular array performs multiply-add calculation, accumulates the calculation results, and outputs them to the high-precision processing module;
(3) the high-precision computing unit reads high-precision weight data and the positions of the high-precision weights in all the weights from the global high-precision weight memory in the step (1); decoding the position of the weight by a weight decoding module to obtain the number of rows and columns of the feature map data in all feature maps, the feature map data multiplexing information and the number of output channels;
(4) the feature map extraction module in the high-precision computing unit extracts feature map data from the feature map global buffer according to the row and column numbers obtained in step (3); the high-precision computing unit performs sequential calculation according to the high-precision weight data and the extracted feature map data, accumulating the partial sums of several output channels during the calculation, and outputs the calculation result to the correct output channel of the addition module;
(5) the addition module reads the calculation results of the corresponding columns of the rectangular array of low-precision computing units from step (2) and the calculation results of the high-precision computing units from step (4), adds them to obtain partial-sum data, and outputs the partial-sum data to the on-chip global buffer;
(6) steps (1)-(5) are repeated until the partial-sum global buffer stores all the partial-sum data required for the calculation; the partial-sum data are accumulated to obtain the complete output feature map, which is output to the external memory, completing the calculation of one convolutional layer.
The invention has the following beneficial effects: the invention provides a convolutional neural network accelerator based on mixed-precision configuration and an implementation method thereof, in which the weights are processed separately according to their precision. The high-precision computing unit controls the subsequent multiply-add calculation by decoding the weight positions, guaranteeing the correctness of the mixed-precision computing architecture. The architecture optimizes the parallelism of the neural network hardware accelerator, improves hardware utilization, saves weight storage space and reduces computing power consumption.
Drawings
FIG. 1 is a generalized DNN hardware accelerator architecture;
FIG. 2 is a proposed hybrid-precision-configuration-based DNN hardware accelerator architecture;
FIG. 3 is a block diagram of the low-precision computing unit (LPPE) structure;
FIG. 4 is a block diagram of the high-precision computing unit (FPPE) structure;
FIG. 5 is a graph comparing the amount of storage saved at different layers when computing the (a) AlexNet and (b) VGG16 networks.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, a common general-purpose DNN hardware accelerator architecture mainly consists of a processing element (PE) array, an on-chip global buffer (GLB) and additional control logic. Each PE mainly contains an internal data memory, a multiplier-adder and control logic. Data from the external memory (DRAM) are written into the PE array through the on-chip global buffer, and after each PE in the array completes its calculation, the results are returned to the DRAM.
For the general architecture of fig. 1, the invention proposes a convolutional neural network accelerator architecture based on mixed-precision configuration (shown in fig. 2), which mainly comprises a low-precision processing module, a high-precision processing module, an on-chip global buffer and a weight separation module. When data pass from the external DRAM through the global buffer, the high-precision weights are separated out by the weight separation module: the high-precision weight data and their position information among all weights are stored in the global high-precision weight memory (FPB), the low-precision weights are stored in the global low-precision weight memory, and the feature map data are stored in the feature map global buffer inside the on-chip global buffer. The global low-precision weight memory and the global high-precision weight memory then send the weight data to the low-precision processing module and the high-precision processing module respectively; the LPPEs read feature map data from the feature map global buffer, and the feature map extraction module in each FPPE extracts feature map data from the feature map global buffer. The computing units of the two precisions perform multiply-add calculation and accumulation according to the weights and the feature map data; after the calculation is completed, the PE units of both precisions send their results to the addition modules in the high-precision processing module, and the results of the addition modules are written back to the partial-sum global buffer in the on-chip global buffer. The above process is repeated until the partial-sum global buffer stores all the partial-sum data required for the calculation; the partial-sum data are accumulated to obtain the complete output feature map, which is sent to the external memory.
Each convolutional layer in the convolutional neural network is calculated by the accelerator in this way, yielding the image after the convolution calculation is finished.
The detailed description is as follows:
the on-chip global buffer is composed of a global low-precision weight memory, a global high-precision weight memory, a feature map global buffer and a partial-sum global buffer used during the calculation. The global low-precision weight memory, the feature map global buffer and the partial-sum global buffer are ordinary global buffers (GLB), while the global high-precision weight memory (FPB) is a structure specific to this architecture.
The weight separation module compares the effective precision of each weight with the weight bit-width threshold and stores the weight into the memory of the corresponding precision. If a weight is a high-precision weight, its position is recorded by a counter and stored in the FPB together with the high-precision weight value. Specifically, let a weight W be a 16-bit fixed-point number {W15, W14, …, W1, W0}; for a given weight bit-width threshold of 8, if the seven bits {W14 … W8} are all equal to the sign bit W15, the weight is regarded as a low-precision weight; otherwise it is regarded as a high-precision weight.
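Applied to concrete 16-bit values with the 8-bit threshold, a check of this kind (the `is_low_precision` sketch given earlier, with its default parameters) behaves as follows:

```python
# Worked examples of the sign-extension check (illustrative values only).
print(is_low_precision(3))      # True:  bits W14..W8 are all 0, equal to the sign bit W15 = 0
print(is_low_precision(-7))     # True:  bits W14..W8 are all 1, equal to the sign bit W15 = 1
print(is_low_precision(300))    # False: bit W8 is 1 while the sign bit W15 is 0
```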
The low-precision processing module is composed of a plurality of low-precision computing units (LPPE) which are arranged in a matrix shape. The weight storage bit width in each LPPE is low, and the calculation precision of the multiplier-adder is low.
The LPPE is described in detail as follows:
as shown in fig. 3, LPPE is composed of a low-precision feature map memory, a low-precision weight memory, a low-precision part sum memory, a low-precision multiplier, and a low-precision adder. The low-precision feature map memory stores feature map data read from the feature map global buffer, and the low-precision weight memory stores weight data read from the global low-precision weight memory. The low-precision multiplier calculates one weight of the low-precision weight memory and pixel points on a feature map in the low-precision feature map memory each time, inputs a result, a low-precision part and a result in the memory into the low-precision adder together for accumulation calculation, and the results in the low-precision part and the memory are the results of the calculation of the last weight and the pixel points on the feature map; during the last accumulation calculation, the calculation result is added with the input part sum to obtain the output part sum of the low-precision calculation unit; the input part sum is the output part sum of the last low-precision computing unit; the above process is repeated until the one-dimensional linear convolution operation is completed.
High-precision processing module: it mainly comprises high-precision computing units (FPPEs) and addition modules, whose numbers both equal the number of columns of low-precision computing units. The weight storage bit width in each FPPE is high, and the calculation precision of its multiplier-adder is high.
The above FPPE is described in detail as follows:
As shown in fig. 4, an FPPE comprises a weight decoding module, a feature map extraction module, a feature map reading module, a high-precision feature map memory, a high-precision weight memory, a high-precision partial-sum memory and a high-precision multiplier-adder.
According to the position of each high-precision weight among all weights, the weight decoding module decodes the coordinates k, u, i and j of the weight within all weight data. Each high-precision weight can be represented as W[k][u][i][j] and is stored in the high-precision weight memory, where k and u are the input-channel and output-channel indices of the weight, and i and j are its row and column indices. The specific decoding relation is p = i + j × S + u × S × R + k × S × R × C, where p is the position of the high-precision weight among all weights, S is the length of the weight kernel, R is the height of the weight kernel, and C is the number of input channels, with 0 ≤ i < S, 0 ≤ j < R and 0 ≤ k < C. Through this relation, the coordinates k, u, i and j of a high-precision weight can be decoded from the values of p, S, R and C.
The feature map position corresponding to each weight is related only to the i and j of that weight. Specifically, because the low-precision computing units and the high-precision computing unit are added together by columns, j determines the column number of the feature map data within all feature maps, and the weight decoding module passes this column number to the feature map extraction module. Because the low-precision and high-precision computing units are not added together by rows, i must be mapped to the row number of the feature map data within all feature maps according to the arrangement of the computing units; this mapping is configured in the weight decoding module in the form of a lookup table, so that passing the parameter i through the lookup table yields the row number, which is then sent to the feature map extraction module. The feature map extraction module extracts feature map data from the feature map global buffer according to the row and column numbers and stores them in the high-precision feature map memory.
Because the same feature map data may correspond to multiple high-precision weights, the feature map data multiplexing information is needed to realize reuse of the feature map data. According to the characteristics of the convolution calculation, each feature map datum must be paired with weights of the same input channel, so the input-channel information k of a high-precision weight can serve as the feature map multiplexing information; the feature map reading module decides, according to this information, whether the high-precision feature map memory needs to be read.
During the calculation, the partial sums of several output channels need to be accumulated; these partial sums are stored in the high-precision partial-sum memory, and the parameter u is translated through a lookup table into the read address of that memory, thereby controlling the number of output channels accumulated during the calculation. At output time, the parameter u is also needed to determine the output channel of the result.
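The per-output-channel accumulation inside the FPPE can be pictured as a small partial-sum store addressed by u. The class below is a toy model under that assumption; in the described design the memory address is obtained from u through a lookup table rather than used directly, and all names are illustrative.

```python
class FppePartialSumStore:
    """Toy model of the high-precision partial-sum memory, one slot per output channel."""
    def __init__(self, num_output_channels):
        self.psum = [0] * num_output_channels

    def accumulate(self, u, product):
        self.psum[u] += product           # u selects the read/write address

    def read_and_clear(self, u):
        value, self.psum[u] = self.psum[u], 0
        return value                      # forwarded to output channel u of the addition module

store = FppePartialSumStore(num_output_channels=4)
store.accumulate(2, 15)
store.accumulate(2, -3)
print(store.read_and_clear(2))            # 12, routed to output channel 2
```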
Each high-precision computing unit performs its sequential calculation in the high-precision multiplier-adder according to the high-precision weight data, the extracted feature map data and the partial sum of the output channel, and outputs the result to the addition module. Each addition module reads the calculation result of the corresponding column of the rectangular array of low-precision computing units and the calculation result of the corresponding high-precision computing unit, adds them to obtain partial-sum data, and outputs the partial-sum data to the on-chip global buffer.
The FPPE is the most important component of the architecture; to guarantee that its calculation results are correct, the following three problems must be solved:
(a) How to find the feature map data corresponding to the high-precision weight data.
(b) The same feature map data may correspond to multiple high-precision weights; how to reuse the feature map data among them.
(c) The calculation result has multiple output channels; how to accumulate each result into the correct output channel.
The invention solves these problems by decoding and mapping the weight positions: the weight decoding module obtains the position indices i, j, k and u of each weight among all weights, and the three required pieces of control information are obtained by mapping these indices according to the network shape, the shape of the computing-unit array and other characteristics. The specific mapping is as follows:
For problem (a), the position of the feature map data corresponding to each high-precision weight within all feature maps is related only to the weight's position among all weights. The decoded parameter i of the high-precision weight is passed through the lookup table holding the mapping information and then, together with the parameter j, sent to the feature map extraction module, which extracts the corresponding feature map data from the feature map global buffer in the on-chip global buffer and stores them, in order, in the memory inside the FPPE.
For problem (b), in the convolution calculation the same feature map must be multiplied with several weights, and a feature map corresponds to weights of the same input channel; it is therefore only necessary to judge whether the input-channel values of the high-precision weights are identical. If they are identical, the weights must be multiplied by the same feature map.
For problem (c), because the data flow chosen by the accelerator adds the results by columns according to the output channels of the partial sums, only the output channels of the weights need to be distinguished inside the FPPE. Weights with the same output channel can be accumulated inside the FPPE, and the final result is output and then added to the calculation result of the LPPEs.
While solving these three problems, three levels of data reuse are realized during the high-precision weight calculation, specifically:
(1) Weight reuse. Since the weights stored in the high-precision processing module remain unchanged for several calculation cycles, continuous calculation is achieved by continuously fetching new feature map data through the feature map extraction module of the FPPE. In this way, weight data are reused inside the FPPE.
(2) Feature map data reuse. Although the feature maps keep flowing between the FPPEs, the same feature map data are repeatedly calculated with multiple high-precision weights inside an FPPE, so the feature map data are reused.
(3) Partial-sum reuse. The products of high-precision weights with the same output channel and the feature map data must be accumulated; this accumulation is completed inside the FPPE, realizing partial-sum data reuse.
Examples
The method can be applied to edge intelligent computing devices to accelerate convolution calculations and improve the efficiency of convolutional neural network computations such as image recognition and detection. Compared with the traditional architecture, the invention saves deployment area, reduces computing energy consumption and improves the energy-efficiency ratio.
The mixed-precision DNN hardware accelerator architecture is implemented on a Xilinx ZYNQ platform (ZCU102), a 16-bit signed fixed-point neural network is deployed, and 8 bits is defined as the weight bit-width threshold. With an 8-bit-wide low-precision processing module and a 16-bit-wide high-precision processing module as the computing cores, mixed-precision computation of the AlexNet and VGG16 networks is realized and the computing performance of the system is evaluated on the programmable SoC platform. Table 1 compares the resource consumption of the mixed-precision architecture and the baseline architecture. From the data in the table it can be calculated that the mixed-precision architecture saves nearly 50% of the weight storage space compared with the baseline architecture, while the dynamic power consumption is reduced by 12.1%. Fig. 5 illustrates the relative memory savings of the mixed-precision architecture over the baseline architecture at different layers. Because the FPPE adds extra control logic and buffering, the space actually saved by the whole system is smaller than the saving in weight storage, but it is still 17.8% when computing the AlexNet network and 16.8% when computing VGG16.
TABLE 1 Performance comparison of the basic system and the mixed-precision architecture

                          General-purpose system / PE    Mixed precision / LPPE    Mixed precision / FPPE
Number of units           168                            168                       14
Storage space (bytes)     608 (1x)                       384 (0.63x)               1003 (1.63x)
LUT count (MAC)           280 (1x)                       128 (0.46x)               280 (1x)
LUT count (PE)            1313 (1x)                      777 (0.59x)               1606 (1.22x)
This patent is not limited to the preferred embodiment described above. Any other form of convolutional neural network hardware accelerator architecture with multiple mixed precisions that can be derived from the teaching of this patent, and all equivalent changes and modifications made according to the claims of the present invention, shall fall within the scope of this patent.

Claims (10)

1. A convolutional neural network accelerator configured based on mixed precision is characterized in that each convolutional layer in the convolutional neural network is calculated by the convolutional neural network accelerator, and the convolutional neural network accelerator comprises a low-precision processing module, a high-precision processing module, an on-chip global buffer and a weight separation module; the first layer of input of the convolutional neural network is an image to be processed, and the last layer of output is an image after the convolution calculation is finished;
the on-chip global buffer is composed of a global low-precision weight memory, a global high-precision weight memory, a feature map global buffer and a partial-sum global buffer used during the calculation; the feature map global buffer stores the feature maps read from the external memory;
the weight separation module is used for reading weight data from an external memory, judging the weight data to be high-precision weight data or low-precision weight data according to a weight bit width threshold, storing the low-precision weight data into a global low-precision weight memory, and storing the high-precision weight data and the positions of the high-precision weights in all weights into the global high-precision weight memory;
the low-precision processing module is composed of a rectangular array of low-precision computing units, each low-precision computing unit reads low-precision weight data from the global low-precision weight memory and reads feature map data from the feature map global buffer, and each column of low-precision computing units in the rectangular array performs multiply-add calculation, accumulates the calculation results and outputs them to the high-precision processing module;
the high-precision processing module is composed of high-precision computing units and addition modules, whose numbers are both equal to the number of columns of low-precision computing units; each high-precision computing unit reads high-precision weight data and the positions of the high-precision weights among all weights from the global high-precision weight memory; each high-precision computing unit comprises a weight decoding module and a feature map extraction module, the position of each weight is decoded to obtain the row and column numbers of the feature map data within all feature maps, the feature map data multiplexing information and the output channel number, which are used in the subsequent calculation of the high-precision computing unit, and the specific process is as follows:
decoding according to the position of each high-precision weight among all weights to obtain the coordinates k, u, i and j of the high-precision weight within all weight data; each high-precision weight and its coordinates can be represented as W[k][u][i][j], wherein k and u represent the input-channel and output-channel indices of the weight, and i and j represent its row and column indices; the specific decoding relation is: p = i + j × S + u × S × R + k × S × R × C, wherein p is the position of the high-precision weight among all weights, S is the length of the weight kernel, R is the height of the weight kernel, and C is the number of input channels, with 0 ≤ i < S, 0 ≤ j < R and 0 ≤ k < C; through this relation, the weight decoding module can obtain the coordinates k, u, i and j of the high-precision weight from the values of p, S, R and C;
the feature map position corresponding to each weight is related only to the i and j of that weight; specifically, because the low-precision computing units and the high-precision computing unit are added together by columns, the parameter j determines the column number of the feature map data within all feature maps, and the weight decoding module passes this column number to the feature map extraction module; because the low-precision and high-precision computing units are not added together by rows, i must be mapped to the row number of the feature map data within all feature maps according to the arrangement of the low-precision and high-precision computing units, a lookup table is built from this mapping and deployed in the weight decoding module, the row number of the feature map data within all feature maps is obtained by passing the parameter i through the lookup table, and the row number is passed to the feature map extraction module; the feature map extraction module extracts feature map data from the feature map global buffer according to the row and column numbers of the feature map data within all feature maps;
because the same feature map data may correspond to multiple high-precision weights, the feature map data multiplexing information is needed to realize reuse of the feature map data; according to the characteristics of the convolution calculation, each feature map datum must be paired with weights of the same input channel, so the input-channel parameter k of a high-precision weight can serve as the feature map multiplexing information; each high-precision computing unit comprises a feature map reading module, the weight decoding module passes the parameter k to the feature map reading module as the feature map data multiplexing information, and the feature map reading module controls the reading of the feature maps extracted by the feature map extraction module according to this information, thereby realizing feature map data reuse;
each high-precision computing unit performs sequential calculation according to the high-precision weight data and the extracted feature map data, the partial sums of several output channels are accumulated during the calculation, and the number of output channels accumulated is controlled according to the parameter u; after the calculation of the high-precision computing unit is finished, the parameter u from the weight decoding module is read to determine the output channel of the calculation result, and the result is output to the correct output channel of the addition module;
each addition module reads the calculation result of the corresponding column of the rectangular array of low-precision computing units and the calculation result of the corresponding high-precision computing unit, adds them to obtain partial-sum data, and outputs the partial-sum data to the on-chip global buffer;
the above process is repeated until the partial-sum global buffer stores all the partial-sum data required for the calculation; the partial-sum data are accumulated to obtain the complete output feature map, which is output to the external memory;
each convolutional layer in the convolutional neural network is calculated by the accelerator in this way to obtain the image after the convolution calculation is finished.
2. The convolutional neural network accelerator based on mixed-precision configuration as claimed in claim 1, wherein a weight whose effective bit width is greater than the weight bit-width threshold is denoted as a high-precision weight, and a weight whose effective bit width is less than or equal to the threshold is denoted as a low-precision weight.
3. The convolutional neural network accelerator based on hybrid-precision configuration as claimed in claim 1, wherein each low-precision computing unit has an independent position coordinate in a rectangular array and corresponds to a bus address one to one, and the low-precision computing units read the low-precision weight data from the global low-precision weight memory according to a certain sequence, wherein the sequence is determined according to the structure and requirements of the convolutional neural network.
4. The convolutional neural network accelerator as claimed in claim 1, wherein the external memory stores weight data of all convolutional layers, each convolutional layer contains several sets of weight data required for calculation, and the weight separation module reads one set of weight data from the external memory at a time.
5. The convolutional neural network accelerator based on mixed-precision configuration as claimed in claim 1, wherein the specific process by which each column of low-precision computing units in the rectangular array performs multiply-add calculation and accumulates the results is as follows: each low-precision computing unit contains a low-precision feature map memory, a low-precision weight memory, a low-precision partial-sum memory, a low-precision multiplier and a low-precision adder; the low-precision feature map memory stores feature map data read from the feature map global buffer, the low-precision weight memory stores weight data read from the global low-precision weight memory, and both are connected to the low-precision multiplier; in each step the low-precision multiplier multiplies one weight from the low-precision weight memory with one feature map pixel from the low-precision feature map memory, and the product is fed, together with the value in the low-precision partial-sum memory, into the low-precision adder for accumulation, the value in the low-precision partial-sum memory being the result of the previous weight-pixel calculation; in the last accumulation step, the result is added to the input partial sum to obtain the output partial sum of the low-precision computing unit; the input partial sum is the output partial sum of the previous low-precision computing unit; the above process is repeated until the one-dimensional linear convolution is completed.
6. The convolutional neural network accelerator based on mixed-precision configuration as claimed in claim 1, wherein the area of a low-precision computing unit is less than half that of a high-precision computing unit, and the number of low-precision computing units is dozens of times the number of high-precision computing units.
7. The convolutional neural network accelerator based on mixed-precision configuration as claimed in claim 1, wherein each high-precision computing unit further comprises a high-precision feature map memory, a high-precision weight memory, a high-precision partial-sum memory and a high-precision multiplier-adder; the high-precision feature map memory stores the feature map data extracted from the feature map global buffer, the high-precision weight memory stores the weight data read from the global high-precision weight memory, and the high-precision partial-sum memory stores the partial sums of the output channels while the high-precision computing unit performs its sequential calculation; the high-precision computing unit feeds the read high-precision weight data, the feature map data and the partial sum of the output channel into the high-precision multiplier-adder for calculation, accumulates the result, and outputs it to the addition module.
8. The convolutional neural network accelerator based on mixed-precision configuration as claimed in claim 7, wherein controlling the number of output channels during accumulation according to the parameter u specifically means: the number of output channels accumulated during the calculation is controlled by the read address of the high-precision partial-sum memory corresponding to the parameter u output by the weight decoding module.
9. The convolutional neural network accelerator based on mixed precision configuration as claimed in claim 1, wherein the weight separation module comprises a counter, and the position of the high-precision weight in all weights is counted by the counter in the weight separation module.
10. The implementation method of the convolutional neural network accelerator based on the hybrid precision configuration as claimed in claim 1, wherein the method specifically comprises the following steps:
(1) the weight separation module is used for reading weight data from an external memory, judging the weight data to be high-precision weight data or low-precision weight data according to a weight bit width threshold, storing the low-precision weight data into a global low-precision weight memory, and storing the high-precision weight data and the positions of the high-precision weights in all weights into the global high-precision weight memory;
(2) the low-precision computing units read low-precision weight data from the global low-precision weight memory of step (1) and read feature map data from the feature map global buffer; each column of low-precision computing units in the rectangular array performs multiply-add calculation, accumulates the calculation results, and outputs them to the high-precision processing module;
(3) the high-precision computing unit reads high-precision weight data and the positions of the high-precision weights in all the weights from the global high-precision weight memory in the step (1); decoding the position of the weight by a weight decoding module to obtain the number of rows and columns of the feature map data in all feature maps, the feature map data multiplexing information and the number of output channels;
(4) the feature map extraction module in the high-precision computing unit extracts feature map data from the feature map global buffer according to the row and column numbers obtained in step (3); the high-precision computing unit performs sequential calculation according to the high-precision weight data and the extracted feature map data, accumulating the partial sums of several output channels during the calculation, and outputs the calculation result to the correct output channel of the addition module;
(5) the addition module reads the calculation results of the corresponding columns of the rectangular array of low-precision computing units from step (2) and the calculation results of the high-precision computing units from step (4), adds them to obtain partial-sum data, and outputs the partial-sum data to the on-chip global buffer;
(6) steps (1)-(5) are repeated until the partial-sum global buffer stores all the partial-sum data required for the calculation; the partial-sum data are accumulated to obtain the complete output feature map, which is output to the external memory, completing the calculation of one convolutional layer.
CN202011050462.6A 2020-09-29 2020-09-29 Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof Active CN112257844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011050462.6A CN112257844B (en) 2020-09-29 2020-09-29 Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011050462.6A CN112257844B (en) 2020-09-29 2020-09-29 Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof

Publications (2)

Publication Number Publication Date
CN112257844A CN112257844A (en) 2021-01-22
CN112257844B true CN112257844B (en) 2022-04-26

Family

ID=74233978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011050462.6A Active CN112257844B (en) 2020-09-29 2020-09-29 Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof

Country Status (1)

Country Link
CN (1) CN112257844B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298245B (en) * 2021-06-07 2022-11-29 中国科学院计算技术研究所 Multi-precision neural network computing device and method based on data flow architecture
CN113469349B (en) * 2021-07-02 2022-11-08 上海酷芯微电子有限公司 Multi-precision neural network model implementation method and system
CN114707647B (en) * 2022-03-08 2023-10-24 南方科技大学 Precision lossless calculation integrated device and method suitable for multi-precision neural network
CN114861911B (en) * 2022-05-19 2023-04-07 北京百度网讯科技有限公司 Deep learning model training method, device, system, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222821A (en) * 2019-05-30 2019-09-10 浙江大学 Convolutional neural networks low-bit width quantization method based on weight distribution
CN110780845A (en) * 2019-10-17 2020-02-11 浙江大学 Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof
CN110991608A (en) * 2019-11-25 2020-04-10 合肥恒烁半导体有限公司 Convolutional neural network quantitative calculation method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11429862B2 (en) * 2018-03-20 2022-08-30 Sri International Dynamic adaptation of deep neural networks
US20200210759A1 (en) * 2018-12-31 2020-07-02 Nanjing Iluvatar CoreX Technology Co., Ltd. (DBA "Iluvatar CoreX Inc. Nanjing") Methods and apparatus for similar data reuse in dataflow processing systems
US11321606B2 (en) * 2019-01-15 2022-05-03 BigStream Solutions, Inc. Systems, apparatus, methods, and architectures for a neural network workflow to generate a hardware accelerator

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222821A (en) * 2019-05-30 2019-09-10 浙江大学 Convolutional neural networks low-bit width quantization method based on weight distribution
CN110780845A (en) * 2019-10-17 2020-02-11 浙江大学 Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof
CN110991608A (en) * 2019-11-25 2020-04-10 合肥恒烁半导体有限公司 Convolutional neural network quantitative calculation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CNN Accelerator Based on Multiple Parallel Computation and Storage; Li Zongling et al.; Computer Technology and Development; 2019-03-21; Vol. 29, No. 7; pp. 11-16 *

Also Published As

Publication number Publication date
CN112257844A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN110070178B (en) Convolutional neural network computing device and method
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN110458279B (en) FPGA-based binary neural network acceleration method and system
CN110705703B (en) Sparse neural network processor based on systolic array
CN112286864B (en) Sparse data processing method and system for accelerating operation of reconfigurable processor
CN109840585B (en) Sparse two-dimensional convolution-oriented operation method and system
CN113741858B (en) Memory multiply-add computing method, memory multiply-add computing device, chip and computing equipment
CN110780923A (en) Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN114781629B (en) Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
CN112668708A (en) Convolution operation device for improving data utilization rate
CN104809161A (en) Method and system for conducting compression and query on sparse matrix
CN115018062A (en) Convolutional neural network accelerator based on FPGA
CN112862091B (en) Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN111275167A (en) High-energy-efficiency pulse array framework for binary convolutional neural network
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN110766136B (en) Compression method of sparse matrix and vector
CN111652359B (en) Multiplier array for matrix operations and multiplier array for convolution operations
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
CN113392963B (en) FPGA-based CNN hardware acceleration system design method
CN112906886B (en) Result-multiplexing reconfigurable BNN hardware accelerator and image processing method
CN113988279A (en) Output current reading method and system of storage array supporting negative value excitation
Huang et al. A low-bit quantized and hls-based neural network fpga accelerator for object detection
CN113705784A (en) Neural network weight coding method based on matrix sharing and hardware system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant